Skip to content

fix: suppress context errors and skip reorgs during validator shutdown#4532

Open
joshuacolvin0 wants to merge 19 commits intomasterfrom
fix/block-validator-shutdown-resilience
Open

fix: suppress context errors and skip reorgs during validator shutdown#4532
joshuacolvin0 wants to merge 19 commits intomasterfrom
fix/block-validator-shutdown-resilience

Conversation

@joshuacolvin0
Copy link
Copy Markdown
Member

  • possiblyFatal now suppresses context.Canceled and DeadlineExceeded
  • Extract handleValidationResult to deduplicate iterativeValidation{Progress,SentProgress}
  • Skip reorg attempts when context is already cancelled (shutdown)
  • Fix Reorg guard: count==1 (reorg to genesis) is valid, only reject count==0
  • Add comprehensive unit tests for all new behavior

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

- possiblyFatal now suppresses context.Canceled and DeadlineExceeded
- Extract handleValidationResult to deduplicate iterativeValidation{Progress,SentProgress}
- Skip reorg attempts when context is already cancelled (shutdown)
- Fix Reorg guard: count==1 (reorg to genesis) is valid, only reject count==0
- Add comprehensive unit tests for all new behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 6.97674% with 80 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.16%. Comparing base (eb994bc) to head (2fb11e7).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4532      +/-   ##
==========================================
- Coverage   34.98%   34.16%   -0.83%     
==========================================
  Files         494      494              
  Lines       58915    58954      +39     
==========================================
- Hits        20611    20139     -472     
- Misses      34708    35276     +568     
+ Partials     3596     3539      -57     

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 19, 2026

❌ 14 Tests Failed:

Tests completed Failed Passed Skipped
4828 14 4814 0
View the top 3 failed tests by shortest run time
TestAliasingFlaky
Stack Traces | -0.000s run time
=== RUN   TestAliasingFlaky
=== PAUSE TestAliasingFlaky
=== CONT  TestAliasingFlaky
    common_test.go:777: BuildL1 deployConfig: DeployBold=true, DeployReferenceDAContracts=false
TestBatchPosterL1SurplusMatchesBatchGasFlaky
Stack Traces | 0.550s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
panic: runtime error: invalid memory address or nil pointer dereference [recovered, repanicked]
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x207ec92]

goroutine 74 [running]:
testing.tRunner.func1.2({0x37e7280, 0x62039b0})
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1872 +0x237
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1875 +0x35b
panic({0x37e7280?, 0x62039b0?})
	/opt/hostedtoolcache/go/1.25.8/x64/src/runtime/panic.go:783 +0x132
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).GetBatchCount(0x121cf900?)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:210 +0x12
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).FindInboxBatchContainingMessage(0x0, 0x6)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:225 +0x2f
github.com/offchainlabs/nitro/system_tests.TestBatchPosterL1SurplusMatchesBatchGasFlaky(0xc0003e8e00)
	/home/runner/work/nitro/nitro/system_tests/batch_poster_test.go:839 +0x725
testing.tRunner(0xc0003e8e00, 0x41ba0c8)
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1997 +0x465
TestParentChainEthConfigForkTransition
Stack Traces | 3.460s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
INFO [04-03|23:34:56.193] All fork specifications can be found at https://ethereum.github.io/execution-specs/src/ethereum/forks/
INFO [04-03|23:34:56.193] 
INFO [04-03|23:34:56.193] ---------------------------------------------------------------------------------------------------------------------------------------------------------
INFO [04-03|23:34:56.193] 
INFO [04-03|23:34:56.194] Loaded most recent local block           number=0    hash=079574..4b9dfc age=57y3w5d
WARN [04-03|23:34:56.194] Failed to load snapshot                  err="missing or corrupted snapshot"
INFO [04-03|23:34:56.194] Rebuilding state snapshot
INFO [04-03|23:34:56.194] Starting work on payload                 id=0x038046a67f914ee0
INFO [04-03|23:34:56.194] Initialized transaction indexer          range="last 126230400 blocks"
INFO [04-03|23:34:56.194] Imported new potential chain segment     number=271  hash=e5011e..41093e blocks=1  txs=1   mgas=0.021  elapsed=1.997ms      mgasps=10.511   triediffs=649.57KiB triedirty=112.44KiB
INFO [04-03|23:34:56.195] Chain head was updated                   number=271  hash=e5011e..41093e root=9eb67d..005c59 elapsed="57.538µs"
INFO [04-03|23:34:56.195] Updated payload                          id=0x038046a67f914ee0 number=11   hash=78401d..ec7de8 txs=1   withdrawals=0 gas=2,307,962  fees=2.307962e-06   root=d77082..8e1038 elapsed="655.263µs"
WARN [04-03|23:34:56.195] Sequencer ReadFromTxQueueTimeout is higher than MaxBlockSpeed ReadFromTxQueueTimeout=1s MaxBlockSpeed=10ms
INFO [04-03|23:34:56.195] Imported new potential chain segment     number=1037 hash=d361fe..365b7b blocks=1  txs=1   mgas=0.021  elapsed=2.331ms      mgasps=9.009    triediffs=699.33KiB triedirty=469.97KiB
INFO [04-03|23:34:56.195] Submitted contract creation              hash=0x9842452a1736756e133e5c0bbd9b701e982287542bc6cf0430d6dda1a4293908 from=0x57Ff0F473737a1c161bfF9efDF016F7991585088 nonce=10  contract=0xd8Ef51E29965b7c1388c0BFD630D4cFc46D661b5 value=0
INFO [04-03|23:34:56.195] Chain head was updated                   number=1037 hash=d361fe..365b7b root=aeb24a..1c056c elapsed="64.37µs"
INFO [04-03|23:34:56.196] Submitted transaction                    hash=0x213386de43415fc55e21b537a466bf17253bb29f3f7388456c3b1a29c76b2c10 from=0xaF24Ca6c2831f4d4F629418b50C227DF0885613A nonce=45  recipient=0xaF24Ca6c2831f4d4F629418b50C227DF0885613A value=1,000,000,000,000
INFO [04-03|23:34:56.193] Stopping work on payload                 id=0x03ba95736beb24c5 reason=delivery
INFO [04-03|23:34:56.196] Stopping work on payload                 id=0x038046a67f914ee0 reason=delivery
--- FAIL: TestParentChainEthConfigForkTransition (3.46s)

📣 Thoughts on this report? Let Codecov know! | Powered by Codecov

joshuacolvin0 and others added 5 commits March 25, 2026 08:42
pmikolajczyk41
pmikolajczyk41 previously approved these changes Mar 26, 2026
Copy link
Copy Markdown
Member

@pmikolajczyk41 pmikolajczyk41 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just one comment, but generally LGTM; resolve the comment at your will

Comment on lines 842 to 845
moreWork, err := v.sendNextRecordRequests(ctx)
if err != nil {
log.Error("error trying to record for validation node", "err", err)
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here probably we could have similar error handling as in iterativeValidationEntryCreator

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call

…trics

- Extract isShutdownCancellation helper and classifyValidationError enum
  to deduplicate 3-way error classification across all validation paths
- Tighten shutdown detection: check ctx.Err() is specifically context.Canceled
- Don't increment failure counter on expected shutdown cancellations
- Escalate non-shutdown errors via possiblyFatal in iterative loops and
  handleValidationResult so persistent errors surface as fatal
- Update lastValidGS only after successful DB write
- Remove duplicate possiblyFatal call after Reorg in handleValidationResult
- Consolidate tests into table-driven form, add 8-case truth table for
  isShutdownCancellation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…org correctness

- Add classifyValidationError to distinguish shutdown, transient, and fatal
  errors — fixes spawner-shutdown race producing spurious fatal errors
- Consolidate error handling into handleValidationResult across all four
  validation pipeline stages
- Fix writeLastValidated to set lastValidGS only after successful DB write
- Fix Reorg guard: count==0 is rejected (was count<=1, blocking valid
  reorg-to-genesis); escalate via possiblyFatal
- Escalate writeLastValidated DB failures in Reorg instead of log-and-continue
- Move ErrResultNotFound to execution package to break import cycle
- Add comprehensive unit tests for error classification, shutdown race,
  DB failure escalation, Reorg guards, and writeLastValidated ordering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pmikolajczyk41
pmikolajczyk41 previously approved these changes Mar 31, 2026
@joshuacolvin0 joshuacolvin0 removed their assignment Mar 31, 2026
markSuccess = false
validationStatus.DoneEntry.Err = err
log.Error("error while validating", "err", err, "start", validationStatus.DoneEntry.Start, "end", validationStatus.DoneEntry.End)
if classifyValidationError(ctx, err, "sendValidations") == validationFatal {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race window: double classification of the same error

classifyValidationError is called here in the sendValidations untracked goroutine to decide whether to increment the failed-validations counter, and then called again at line 945 in advanceValidations when it reads DoneEntry.Err to decide how to handle the error (shutdown/transient/fatal).

These two calls happen at different points in time with the same ctx. Between them, the block validator's context can transition from live to canceled. This means the same error could be classified as validationFatal here (incrementing the counter) but as validationShutdown in advanceValidations (suppressing it), or vice versa. This would corrupt the arb/validator/validations/failed metric. It also means every error gets logged twice (once here, once in advanceValidations).

Suggestion: Classify once when DoneEntry.Err is set and store the severity alongside the error:

type validationDoneEntry struct {
    Success         bool
    Err             error
    ErrSeverity     validationErrorSeverity // classify once, use everywhere
    // ...
}

This also eliminates the duplicate logging as a side effect.

@eljobe eljobe assigned joshuacolvin0 and unassigned eljobe Apr 2, 2026
- Add validationErrorSeverity enum (shutdown/transient/fatal) with String() method
- Add classifyValidationError to classify and log errors at the appropriate level
- Store ErrSeverity on validationDoneEntry via setError() helper
- Add handleValidationResult to centralize error/reorg handling across all
  CallIterativelyWith callbacks
- advanceValidations switches on pre-classified severity instead of re-checking
  context state, avoiding TOCTOU issues
- Harden all writeLastValidated callers (advanceValidations, Reorg,
  UpdateLatestStaked, InitAssumeValid) to escalate DB write failures
- Restore diagnostic context (created/processed counts) in
  iterativeValidationEntryCreator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joshuacolvin0 and others added 2 commits April 2, 2026 22:16
- Add validationErrorSeverity enum (shutdown/transient/fatal) with String() method
- Add classifyValidationError to classify and log errors at the appropriate level
- Store ErrSeverity on validationDoneEntry via setError() helper
- Add handleValidationResult to centralize error/reorg handling across all
  CallIterativelyWith callbacks
- advanceValidations switches on pre-classified severity instead of re-checking
  context state, avoiding TOCTOU issues
- Harden all writeLastValidated callers (advanceValidations, Reorg,
  UpdateLatestStaked, InitAssumeValid) to escalate DB write failures
- Restore diagnostic context (created/processed counts) in
  iterativeValidationEntryCreator
- Downgrade "Fatal error channel full" to Warn (first fatal already
  triggers shutdown; second is supplementary context)
- Downgrade entry-creation pre-classification log to Debug to avoid
  double-logging at Warn
- Simplify Reorg possiblyFatal invariant comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ganeshvanahalli
ganeshvanahalli previously approved these changes Apr 3, 2026
Copy link
Copy Markdown
Contributor

@ganeshvanahalli ganeshvanahalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

joshuacolvin0 and others added 2 commits April 3, 2026 15:35
…ents and tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joshuacolvin0 and others added 2 commits April 3, 2026 15:52
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants