fix: suppress context errors and skip reorgs during validator shutdown by joshuacolvin0 · Pull Request #4532 · OffchainLabs/nitro

joshuacolvin0 · 2026-03-19T16:27:17Z

possiblyFatal now suppresses context.Canceled and DeadlineExceeded
Extract handleValidationResult to deduplicate iterativeValidation{Progress,SentProgress}
Skip reorg attempts when context is already cancelled (shutdown)
Fix Reorg guard: count==1 (reorg to genesis) is valid, only reject count==0
Add comprehensive unit tests for all new behavior

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

- possiblyFatal now suppresses context.Canceled and DeadlineExceeded - Extract handleValidationResult to deduplicate iterativeValidation{Progress,SentProgress} - Skip reorg attempts when context is already cancelled (shutdown) - Fix Reorg guard: count==1 (reorg to genesis) is valid, only reject count==0 - Add comprehensive unit tests for all new behavior Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov · 2026-03-19T16:52:38Z

Codecov Report

❌ Patch coverage is 6.97674% with 80 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.16%. Comparing base (eb994bc) to head (2fb11e7).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4532      +/-   ##
==========================================
- Coverage   34.98%   34.16%   -0.83%     
==========================================
  Files         494      494              
  Lines       58915    58954      +39     
==========================================
- Hits        20611    20139     -472     
- Misses      34708    35276     +568     
+ Partials     3596     3539      -57

github-actions · 2026-03-19T17:28:27Z

❌ 14 Tests Failed:

Tests completed	Failed	Passed	Skipped
4828	14	4814	0

View the top 3 failed tests by shortest run time

TestAliasingFlaky

Stack Traces | -0.000s run time

=== RUN   TestAliasingFlaky
=== PAUSE TestAliasingFlaky
=== CONT  TestAliasingFlaky
    common_test.go:777: BuildL1 deployConfig: DeployBold=true, DeployReferenceDAContracts=false

TestBatchPosterL1SurplusMatchesBatchGasFlaky

Stack Traces | 0.550s run time

... [CONTENT TRUNCATED: Keeping last 20 lines]
panic: runtime error: invalid memory address or nil pointer dereference [recovered, repanicked]
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x207ec92]

goroutine 74 [running]:
testing.tRunner.func1.2({0x37e7280, 0x62039b0})
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1872 +0x237
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1875 +0x35b
panic({0x37e7280?, 0x62039b0?})
	/opt/hostedtoolcache/go/1.25.8/x64/src/runtime/panic.go:783 +0x132
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).GetBatchCount(0x121cf900?)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:210 +0x12
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).FindInboxBatchContainingMessage(0x0, 0x6)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:225 +0x2f
github.com/offchainlabs/nitro/system_tests.TestBatchPosterL1SurplusMatchesBatchGasFlaky(0xc0003e8e00)
	/home/runner/work/nitro/nitro/system_tests/batch_poster_test.go:839 +0x725
testing.tRunner(0xc0003e8e00, 0x41ba0c8)
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1997 +0x465

TestParentChainEthConfigForkTransition

Stack Traces | 3.460s run time

... [CONTENT TRUNCATED: Keeping last 20 lines]
INFO [04-03|23:34:56.193] All fork specifications can be found at https://ethereum.github.io/execution-specs/src/ethereum/forks/
INFO [04-03|23:34:56.193] 
INFO [04-03|23:34:56.193] ---------------------------------------------------------------------------------------------------------------------------------------------------------
INFO [04-03|23:34:56.193] 
INFO [04-03|23:34:56.194] Loaded most recent local block           number=0    hash=079574..4b9dfc age=57y3w5d
WARN [04-03|23:34:56.194] Failed to load snapshot                  err="missing or corrupted snapshot"
INFO [04-03|23:34:56.194] Rebuilding state snapshot
INFO [04-03|23:34:56.194] Starting work on payload                 id=0x038046a67f914ee0
INFO [04-03|23:34:56.194] Initialized transaction indexer          range="last 126230400 blocks"
INFO [04-03|23:34:56.194] Imported new potential chain segment     number=271  hash=e5011e..41093e blocks=1  txs=1   mgas=0.021  elapsed=1.997ms      mgasps=10.511   triediffs=649.57KiB triedirty=112.44KiB
INFO [04-03|23:34:56.195] Chain head was updated                   number=271  hash=e5011e..41093e root=9eb67d..005c59 elapsed="57.538µs"
INFO [04-03|23:34:56.195] Updated payload                          id=0x038046a67f914ee0 number=11   hash=78401d..ec7de8 txs=1   withdrawals=0 gas=2,307,962  fees=2.307962e-06   root=d77082..8e1038 elapsed="655.263µs"
WARN [04-03|23:34:56.195] Sequencer ReadFromTxQueueTimeout is higher than MaxBlockSpeed ReadFromTxQueueTimeout=1s MaxBlockSpeed=10ms
INFO [04-03|23:34:56.195] Imported new potential chain segment     number=1037 hash=d361fe..365b7b blocks=1  txs=1   mgas=0.021  elapsed=2.331ms      mgasps=9.009    triediffs=699.33KiB triedirty=469.97KiB
INFO [04-03|23:34:56.195] Submitted contract creation              hash=0x9842452a1736756e133e5c0bbd9b701e982287542bc6cf0430d6dda1a4293908 from=0x57Ff0F473737a1c161bfF9efDF016F7991585088 nonce=10  contract=0xd8Ef51E29965b7c1388c0BFD630D4cFc46D661b5 value=0
INFO [04-03|23:34:56.195] Chain head was updated                   number=1037 hash=d361fe..365b7b root=aeb24a..1c056c elapsed="64.37µs"
INFO [04-03|23:34:56.196] Submitted transaction                    hash=0x213386de43415fc55e21b537a466bf17253bb29f3f7388456c3b1a29c76b2c10 from=0xaF24Ca6c2831f4d4F629418b50C227DF0885613A nonce=45  recipient=0xaF24Ca6c2831f4d4F629418b50C227DF0885613A value=1,000,000,000,000
INFO [04-03|23:34:56.193] Stopping work on payload                 id=0x03ba95736beb24c5 reason=delivery
INFO [04-03|23:34:56.196] Stopping work on payload                 id=0x038046a67f914ee0 reason=delivery
--- FAIL: TestParentChainEthConfigForkTransition (3.46s)

📣 Thoughts on this report? Let Codecov know! | Powered by Codecov

staker/block_validator_test.go

staker/block_validator.go

…silience Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…og.Debug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tine leak Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

pmikolajczyk41

just one comment, but generally LGTM; resolve the comment at your will

pmikolajczyk41 · 2026-03-26T09:33:37Z

staker/block_validator.go

 	moreWork, err := v.sendNextRecordRequests(ctx)
 	if err != nil {
 		log.Error("error trying to record for validation node", "err", err)
 	}


here probably we could have similar error handling as in iterativeValidationEntryCreator

…trics - Extract isShutdownCancellation helper and classifyValidationError enum to deduplicate 3-way error classification across all validation paths - Tighten shutdown detection: check ctx.Err() is specifically context.Canceled - Don't increment failure counter on expected shutdown cancellations - Escalate non-shutdown errors via possiblyFatal in iterative loops and handleValidationResult so persistent errors surface as fatal - Update lastValidGS only after successful DB write - Remove duplicate possiblyFatal call after Reorg in handleValidationResult - Consolidate tests into table-driven form, add 8-case truth table for isShutdownCancellation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…org correctness - Add classifyValidationError to distinguish shutdown, transient, and fatal errors — fixes spawner-shutdown race producing spurious fatal errors - Consolidate error handling into handleValidationResult across all four validation pipeline stages - Fix writeLastValidated to set lastValidGS only after successful DB write - Fix Reorg guard: count==0 is rejected (was count<=1, blocking valid reorg-to-genesis); escalate via possiblyFatal - Escalate writeLastValidated DB failures in Reorg instead of log-and-continue - Move ErrResultNotFound to execution package to break import cycle - Add comprehensive unit tests for error classification, shutdown race, DB failure escalation, Reorg guards, and writeLastValidated ordering Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eljobe · 2026-04-02T08:45:38Z

staker/block_validator.go

 					markSuccess = false
 					validationStatus.DoneEntry.Err = err
-					log.Error("error while validating", "err", err, "start", validationStatus.DoneEntry.Start, "end", validationStatus.DoneEntry.End)
+					if classifyValidationError(ctx, err, "sendValidations") == validationFatal {


Race window: double classification of the same error

classifyValidationError is called here in the sendValidations untracked goroutine to decide whether to increment the failed-validations counter, and then called again at line 945 in advanceValidations when it reads DoneEntry.Err to decide how to handle the error (shutdown/transient/fatal).

These two calls happen at different points in time with the same ctx. Between them, the block validator's context can transition from live to canceled. This means the same error could be classified as validationFatal here (incrementing the counter) but as validationShutdown in advanceValidations (suppressing it), or vice versa. This would corrupt the arb/validator/validations/failed metric. It also means every error gets logged twice (once here, once in advanceValidations).

Suggestion: Classify once when DoneEntry.Err is set and store the severity alongside the error:

type validationDoneEntry struct { Success bool Err error ErrSeverity validationErrorSeverity // classify once, use everywhere // ... }

This also eliminates the duplicate logging as a side effect.

- Add validationErrorSeverity enum (shutdown/transient/fatal) with String() method - Add classifyValidationError to classify and log errors at the appropriate level - Store ErrSeverity on validationDoneEntry via setError() helper - Add handleValidationResult to centralize error/reorg handling across all CallIterativelyWith callbacks - advanceValidations switches on pre-classified severity instead of re-checking context state, avoiding TOCTOU issues - Harden all writeLastValidated callers (advanceValidations, Reorg, UpdateLatestStaked, InitAssumeValid) to escalate DB write failures - Restore diagnostic context (created/processed counts) in iterativeValidationEntryCreator Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add validationErrorSeverity enum (shutdown/transient/fatal) with String() method - Add classifyValidationError to classify and log errors at the appropriate level - Store ErrSeverity on validationDoneEntry via setError() helper - Add handleValidationResult to centralize error/reorg handling across all CallIterativelyWith callbacks - advanceValidations switches on pre-classified severity instead of re-checking context state, avoiding TOCTOU issues - Harden all writeLastValidated callers (advanceValidations, Reorg, UpdateLatestStaked, InitAssumeValid) to escalate DB write failures - Restore diagnostic context (created/processed counts) in iterativeValidationEntryCreator - Downgrade "Fatal error channel full" to Warn (first fatal already triggers shutdown; second is supplementary context) - Downgrade entry-creation pre-classification log to Debug to avoid double-logging at Warn - Simplify Reorg possiblyFatal invariant comment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ganeshvanahalli

LGTM

…ents and tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…-shutdown-resilience

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

joshuacolvin0 assigned pmikolajczyk41 Mar 19, 2026

joshuacolvin0 added 2 commits March 19, 2026 20:55

Merge branch 'master' into fix/block-validator-shutdown-resilience

33c2e63

Merge branch 'master' into fix/block-validator-shutdown-resilience

d5bd1ed

pmikolajczyk41 reviewed Mar 23, 2026

View reviewed changes

staker/block_validator_test.go Outdated Show resolved Hide resolved

staker/block_validator.go Outdated Show resolved Hide resolved

staker/block_validator.go Show resolved Hide resolved

pmikolajczyk41 assigned joshuacolvin0 and unassigned pmikolajczyk41 Mar 23, 2026

joshuacolvin0 and others added 5 commits March 25, 2026 08:42

Merge branch 'master' into fix/block-validator-shutdown-resilience

c0faa03

Harden StopWaiter, block validator, and blocks reexecutor shutdown re…

1300139

…silience Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Address review feedback: fix stale comments, restore changelog, use l…

8b74b1b

…og.Debug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'master' into fix/block-validator-shutdown-resilience

ae9f7a5

Add test helpers, improve promise panic recovery, and verify no gorou…

c6ccf0e

…tine leak Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

joshuacolvin0 assigned pmikolajczyk41 and unassigned joshuacolvin0 Mar 26, 2026

pmikolajczyk41 previously approved these changes Mar 26, 2026

View reviewed changes

Merge branch 'master' into fix/block-validator-shutdown-resilience

04b9073

joshuacolvin0 assigned joshuacolvin0 and unassigned pmikolajczyk41 and joshuacolvin0 Mar 30, 2026

joshuacolvin0 dismissed pmikolajczyk41’s stale review via 354fed0 March 30, 2026 22:03

Merge branch 'master' into fix/block-validator-shutdown-resilience

41cb78f

joshuacolvin0 assigned pmikolajczyk41 Mar 30, 2026

pmikolajczyk41 previously approved these changes Mar 31, 2026

View reviewed changes

pmikolajczyk41 assigned joshuacolvin0 and unassigned pmikolajczyk41 Mar 31, 2026

joshuacolvin0 assigned eljobe Mar 31, 2026

joshuacolvin0 removed their assignment Mar 31, 2026

eljobe reviewed Apr 2, 2026

View reviewed changes

eljobe assigned joshuacolvin0 and unassigned eljobe Apr 2, 2026

joshuacolvin0 dismissed pmikolajczyk41’s stale review via 818aaf0 April 2, 2026 22:31

joshuacolvin0 and others added 2 commits April 2, 2026 22:16

Fix gci lint: align struct field comments in block_validator_test.go

0a05606

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

joshuacolvin0 assigned ganeshvanahalli and unassigned joshuacolvin0 Apr 3, 2026

ganeshvanahalli previously approved these changes Apr 3, 2026

View reviewed changes

ganeshvanahalli assigned eljobe and unassigned ganeshvanahalli Apr 3, 2026

joshuacolvin0 and others added 2 commits April 3, 2026 15:35

Address PR review: fix silent failures, escalate errors, improve comm…

cefabc1

…ents and tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/master' into fix/block-validator…

dee5c2a

…-shutdown-resilience

joshuacolvin0 dismissed ganeshvanahalli’s stale review via dee5c2a April 3, 2026 22:43

joshuacolvin0 and others added 2 commits April 3, 2026 15:52

Fix gci lint: reorder imports and align var block in executionengine.go

67a382a

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix gci lint: move uuid import to default group in executionengine.go

2fb11e7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: suppress context errors and skip reorgs during validator shutdown#4532

fix: suppress context errors and skip reorgs during validator shutdown#4532
joshuacolvin0 wants to merge 19 commits intomasterfrom
fix/block-validator-shutdown-resilience

joshuacolvin0 commented Mar 19, 2026

Uh oh!

codecov bot commented Mar 19, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pmikolajczyk41 left a comment

Uh oh!

pmikolajczyk41 Mar 26, 2026

Uh oh!

joshuacolvin0 Mar 31, 2026

Uh oh!

eljobe Apr 2, 2026

Uh oh!

ganeshvanahalli left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

joshuacolvin0 commented Mar 19, 2026

Uh oh!

codecov bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

❌ 14 Tests Failed:

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pmikolajczyk41 left a comment

Choose a reason for hiding this comment

Uh oh!

pmikolajczyk41 Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

joshuacolvin0 Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

eljobe Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

ganeshvanahalli left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented Mar 19, 2026 •

edited

Loading

github-actions bot commented Mar 19, 2026 •

edited

Loading