fix: suppress context errors and skip reorgs during validator shutdown#4532
fix: suppress context errors and skip reorgs during validator shutdown#4532joshuacolvin0 wants to merge 19 commits intomasterfrom
Conversation
- possiblyFatal now suppresses context.Canceled and DeadlineExceeded
- Extract handleValidationResult to deduplicate iterativeValidation{Progress,SentProgress}
- Skip reorg attempts when context is already cancelled (shutdown)
- Fix Reorg guard: count==1 (reorg to genesis) is valid, only reject count==0
- Add comprehensive unit tests for all new behavior
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #4532 +/- ##
==========================================
- Coverage 34.98% 34.16% -0.83%
==========================================
Files 494 494
Lines 58915 58954 +39
==========================================
- Hits 20611 20139 -472
- Misses 34708 35276 +568
+ Partials 3596 3539 -57 |
❌ 14 Tests Failed:
View the top 3 failed tests by shortest run time
📣 Thoughts on this report? Let Codecov know! | Powered by Codecov |
…silience Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…og.Debug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tine leak Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pmikolajczyk41
left a comment
There was a problem hiding this comment.
just one comment, but generally LGTM; resolve the comment at your will
| moreWork, err := v.sendNextRecordRequests(ctx) | ||
| if err != nil { | ||
| log.Error("error trying to record for validation node", "err", err) | ||
| } |
There was a problem hiding this comment.
here probably we could have similar error handling as in iterativeValidationEntryCreator
…trics - Extract isShutdownCancellation helper and classifyValidationError enum to deduplicate 3-way error classification across all validation paths - Tighten shutdown detection: check ctx.Err() is specifically context.Canceled - Don't increment failure counter on expected shutdown cancellations - Escalate non-shutdown errors via possiblyFatal in iterative loops and handleValidationResult so persistent errors surface as fatal - Update lastValidGS only after successful DB write - Remove duplicate possiblyFatal call after Reorg in handleValidationResult - Consolidate tests into table-driven form, add 8-case truth table for isShutdownCancellation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…org correctness - Add classifyValidationError to distinguish shutdown, transient, and fatal errors — fixes spawner-shutdown race producing spurious fatal errors - Consolidate error handling into handleValidationResult across all four validation pipeline stages - Fix writeLastValidated to set lastValidGS only after successful DB write - Fix Reorg guard: count==0 is rejected (was count<=1, blocking valid reorg-to-genesis); escalate via possiblyFatal - Escalate writeLastValidated DB failures in Reorg instead of log-and-continue - Move ErrResultNotFound to execution package to break import cycle - Add comprehensive unit tests for error classification, shutdown race, DB failure escalation, Reorg guards, and writeLastValidated ordering Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
staker/block_validator.go
Outdated
| markSuccess = false | ||
| validationStatus.DoneEntry.Err = err | ||
| log.Error("error while validating", "err", err, "start", validationStatus.DoneEntry.Start, "end", validationStatus.DoneEntry.End) | ||
| if classifyValidationError(ctx, err, "sendValidations") == validationFatal { |
There was a problem hiding this comment.
Race window: double classification of the same error
classifyValidationError is called here in the sendValidations untracked goroutine to decide whether to increment the failed-validations counter, and then called again at line 945 in advanceValidations when it reads DoneEntry.Err to decide how to handle the error (shutdown/transient/fatal).
These two calls happen at different points in time with the same ctx. Between them, the block validator's context can transition from live to canceled. This means the same error could be classified as validationFatal here (incrementing the counter) but as validationShutdown in advanceValidations (suppressing it), or vice versa. This would corrupt the arb/validator/validations/failed metric. It also means every error gets logged twice (once here, once in advanceValidations).
Suggestion: Classify once when DoneEntry.Err is set and store the severity alongside the error:
type validationDoneEntry struct {
Success bool
Err error
ErrSeverity validationErrorSeverity // classify once, use everywhere
// ...
}This also eliminates the duplicate logging as a side effect.
- Add validationErrorSeverity enum (shutdown/transient/fatal) with String() method - Add classifyValidationError to classify and log errors at the appropriate level - Store ErrSeverity on validationDoneEntry via setError() helper - Add handleValidationResult to centralize error/reorg handling across all CallIterativelyWith callbacks - advanceValidations switches on pre-classified severity instead of re-checking context state, avoiding TOCTOU issues - Harden all writeLastValidated callers (advanceValidations, Reorg, UpdateLatestStaked, InitAssumeValid) to escalate DB write failures - Restore diagnostic context (created/processed counts) in iterativeValidationEntryCreator Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add validationErrorSeverity enum (shutdown/transient/fatal) with String() method - Add classifyValidationError to classify and log errors at the appropriate level - Store ErrSeverity on validationDoneEntry via setError() helper - Add handleValidationResult to centralize error/reorg handling across all CallIterativelyWith callbacks - advanceValidations switches on pre-classified severity instead of re-checking context state, avoiding TOCTOU issues - Harden all writeLastValidated callers (advanceValidations, Reorg, UpdateLatestStaked, InitAssumeValid) to escalate DB write failures - Restore diagnostic context (created/processed counts) in iterativeValidationEntryCreator - Downgrade "Fatal error channel full" to Warn (first fatal already triggers shutdown; second is supplementary context) - Downgrade entry-creation pre-classification log to Debug to avoid double-logging at Warn - Simplify Reorg possiblyFatal invariant comment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ents and tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-shutdown-resilience
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com