
fix: raft leader re-election delays after SIGTERM#3230

Draft
auricom wants to merge 4 commits into main from fix/3229-raft-re-election

Conversation

@auricom
Contributor

@auricom auricom commented Apr 7, 2026

Summary

Fixes two bugs found in SIGTERM HA testing (cycle 7 — poc-ha-2 never rejoining within the 300s kill interval).

Previous fixes (already in branch)

  • Wait for the raft FSM to catch up to the commit index instead of the last log index when leadership changes
  • Clear the retriever apply callback on stop so a follower→leader transition does not keep writing into a dead channel

Bug A — invalid block height crash on follower restart (Critical)

Root cause: After SIGTERM, the EVM state is persisted to disk at height N (e.g. 6768). The Raft FSM snapshot is at an older height M (e.g. 6691). On restart, hashicorp/raft restores the snapshot asynchronously; verifyState may run before the FSM finishes log replay and GetState() still returns height M. IsSyncedWithRaft computes diff = N − M = +77 and calls RecoverFromRaft({Height: M}), which crashed with "invalid block height: 6691 (expected 6769)" — treating a stale-snapshot artifact as a fork.

Fix (RecoverFromRaft in block/internal/syncing/syncer.go): When currentState.LastBlockHeight > raftState.Height, look up the local block at raftState.Height and compare its hash against the raft snapshot hash. A match means the EVM history is correct and only the FSM is stale, so recovery is skipped; a mismatch indicates genuine divergence and returns an error.
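The decision logic can be sketched as follows. This is a simplified stand-in: recoverDecision and its parameters are hypothetical names, and the real code fetches the header from the store rather than taking hashes as arguments.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// recoverDecision is a minimal sketch of the new branch in RecoverFromRaft.
// The function name and parameters are illustrative, not the actual ev-node API.
func recoverDecision(localHeight, raftHeight uint64, localHash, raftHash []byte) (skipRecovery bool, err error) {
	if localHeight > raftHeight {
		if bytes.Equal(localHash, raftHash) {
			// Hashes match at the snapshot height: the EVM history is
			// correct and only the FSM is stale, so skip recovery.
			return true, nil
		}
		// Same height, different hash: genuine divergence.
		return false, errors.New("local block hash diverges from raft snapshot")
	}
	// Local chain is not ahead; fall through to the normal recovery path.
	return false, nil
}

func main() {
	skip, err := recoverDecision(6768, 6691, []byte{0xaa}, []byte{0xaa})
	fmt.Println(skip, err) // stale snapshot: skip recovery, no error
	_, err = recoverDecision(6768, 6691, []byte{0xaa}, []byte{0xbb})
	fmt.Println(err != nil) // diverged: surface an error
}
```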

Bug B — spurious FSM-lag abdication → 67–69s elections (High)

Primary cause: Bug A forced poc-ha-2 into a Docker restart backoff loop (~2–3 min). When it finally rejoined and won an election it had a large FSM backlog, causing waitForMsgsLanded(200ms) to time out → abdication → second election round.

Secondary cause (independent): waitForMsgsLanded used two repeating tickers with the same effective deadline (SendTimeout/2 poll, SendTimeout timeout). At the deadline both cases are ready in select and Go picks randomly — the timeout could win even when AppliedIndex >= CommitIndex.
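The race is easy to reproduce in isolation: when multiple select cases are simultaneously ready, the Go runtime picks one pseudo-randomly, which is why the timeout case could beat the applied-index check at the shared deadline. A standalone demonstration (not the PR's code):

```go
package main

import "fmt"

// pickCounts runs a select with two simultaneously-ready channels and counts
// which case wins. With both cases ready, Go chooses between them at random,
// so over many iterations each branch wins some of the time.
func pickCounts(iterations int) (first, second int) {
	for i := 0; i < iterations; i++ {
		a := make(chan struct{}, 1)
		b := make(chan struct{}, 1)
		a <- struct{}{} // both channels are ready before the select runs
		b <- struct{}{}
		select {
		case <-a:
			first++
		case <-b:
			second++
		}
	}
	return first, second
}

func main() {
	first, second := pickCounts(1000)
	fmt.Printf("first=%d second=%d\n", first, second)
	// Both counters are (statistically) non-zero: either case can win.
}
```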

Fix (waitForMsgsLanded in pkg/raft/node.go): Replaced the repeating deadline ticker with a one-shot time.NewTimer, added a final check in the deadline branch, and reduced the poll interval to min(50ms, timeout/4).

Testing

  • TestSyncer_RecoverFromRaft_LocalAheadOfStaleSnapshot — new test covering the matching-hash (skip) and diverged-hash (error) paths
  • go test ./block/internal/syncing/... ./pkg/raft/... passes

Closes #3229

@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

📝 Walkthrough

Walkthrough

The changes fix raft retriever shutdown behavior and message landing detection to prevent apply channel congestion and timeout issues during graceful shutdown. A new callback clearing step was added to the retriever's Stop method, and the shutdown's message landing condition was adjusted from comparing to LastIndex to CommitIndex.

Changes

Cohort / File(s) — Summary

Raft Retriever Shutdown Cleanup (block/internal/syncing/raft_retriever.go, block/internal/syncing/raft_retriever_test.go):
Added an explicit call to SetApplyCallback(nil) when stopping the retriever to clear the previously registered apply callback. A new unit test verifies this cleanup occurs correctly.

Raft Node Message Landing Detection (pkg/raft/node.go):
Modified the waitForMsgsLanded loop condition to check AppliedIndex >= CommitIndex instead of AppliedIndex >= LastIndex, changing the point at which shutdown considers raft messages fully applied.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 The apply channel clears with a whispered goodbye,
As callbacks find rest where they once stood high,
From LastIndex to CommitIndex we trace,
A rabbit's refactor brings shutdown to grace!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
Linked Issues check (✅ Passed): The code changes directly address issue #3229's requirements: waiting for the commit index instead of the last index improves FSM synchronization, clearing the apply callback prevents dead-channel writes, and regression tests guard against recurrence.
Out of Scope Changes check (✅ Passed): All changes directly support the linked objective to fix raft leader re-election delays. Modifications to waitForMsgsLanded, raftRetriever.Stop(), and test coverage are scoped to the issue requirements without extraneous changes.
Description check (✅ Passed): The PR description provides comprehensive context, clear root-cause analysis for two distinct bugs, detailed fixes with file locations, and references a test. It follows the overview template requirement with a detailed summary explaining the problem and solution.
Title check (✅ Passed): The title directly addresses the main problem being fixed: raft leader re-election delays after SIGTERM, which aligns with the core objective and the three file changes that collectively resolve the issue.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@claude
Contributor

claude bot commented Apr 7, 2026

Claude finished @auricom's task in 3m 42s


Code Review

  • Gather context and read CLAUDE.md/AGENTS.md
  • Read changed files
  • Review pkg/raft/node.go changes (Bug B fix)
  • Review block/internal/syncing/syncer.go changes (Bug A fix)
  • Review block/internal/syncing/raft_retriever.go changes
  • Review test files
  • Post comprehensive review

This PR addresses two real bugs exposed by HA failover testing. The root cause analyses are accurate and the fixes are generally correct. Several things are worth flagging before merge.


Bug B — waitForMsgsLanded (pkg/raft/node.go)

CommitIndex vs LastIndex — The original comparison against LastIndex() was wrong; uncommitted log entries from a pre-transition leader could inflate it. CommitIndex() is the right target. Good change.

✅ One-shot timer — Replacing the repeating timeoutTicker with time.NewTimer correctly eliminates the select race where both cases were simultaneously ready. The final re-check in the deadline.C branch before returning the error is a nice defensive addition.

⚠️ Zero-duration panic risk — pollInterval := min(50*time.Millisecond, timeout/4) produces 0 when timeout is under 4ns, and time.NewTicker(0) panics. In practice SendTimeout is likely large, but a floor like max(1*time.Millisecond, min(50*time.Millisecond, timeout/4)) makes the invariant explicit.

⚠️ No unit test for the fixed logic — Codecov reports 0% coverage on the node.go changes. waitForMsgsLanded is the precise source of the 67–69s elections described in the PR, yet it has no test. A simple test using a fake/stubbed raft.Raft, or at least a synctest harness, would prevent regressions.


Bug B (part 2) — applyMu data race (pkg/raft/node.go)

✅ Correct synchronization pattern — Snapshotting ch := f.applyCh under RLock and sending on ch outside the lock avoids holding the lock during the channel operation while still preventing a nil/stale read. This is the textbook pattern for this scenario.

SetApplyCallback write-locks — The write lock in SetApplyCallback is consistent with the read lock in Apply. The data race identified by CodeRabbit is now properly addressed.


Raft retriever shutdown (block/internal/syncing/raft_retriever.go)

✅ Correct placement of SetApplyCallback(nil) — Clearing the callback after wg.Wait() ensures raftApplyLoop has fully exited before the callback is nulled. Combined with the applyMu fix, this is now race-free.

⚠️ Start/Stop ordering window — Stop() sets r.cancel = nil and releases r.mtx before wg.Wait() completes. A concurrent Start() call will see r.cancel == nil and proceed while the old loop is still winding down, creating two concurrent raftApplyLoop goroutines sharing the same raftNode. This is pre-existing, but the new SetApplyCallback(nil) line makes it slightly more dangerous (the second Start() registers its callback, which Stop() then immediately clears). Worth addressing, even if out of scope here.


Bug A — Stale snapshot recovery (block/internal/syncing/syncer.go)

✅ Hash-verification approach — Fetching the local block at raftState.Height and comparing hashes before deciding to skip recovery is correct. It catches the two cases cleanly: matching hash → stale FSM, skip; mismatched hash → real fork, error.

⚠️ raftState.Height == 0 edge case — If the raft snapshot height is 0 and the local LastBlockHeight > 0, the new branch is entered and s.store.GetHeader(ctx, 0) is called. Height-0 headers typically don't exist in the store (genesis pre-dates persistence), so this returns an error wrapped as "cannot verify hash", which may surface confusingly on first-boot scenarios. A guard or comment clarifying this edge case would help.
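A guard along these lines would sidestep the lookup entirely; needsHashCheck is a hypothetical helper for illustration, not the actual syncer code:

```go
package main

import "fmt"

// needsHashCheck sketches the suggested guard: only enter the hash
// verification branch when the raft snapshot height is non-zero, since a
// height-0 header typically does not exist in the store on first boot.
func needsHashCheck(localHeight, raftHeight uint64) bool {
	if raftHeight == 0 {
		// First boot / empty FSM: nothing meaningful to verify against.
		return false
	}
	return localHeight > raftHeight
}

func main() {
	fmt.Println(needsHashCheck(10, 0))      // false: first boot, skip the lookup
	fmt.Println(needsHashCheck(6768, 6691)) // true: stale-snapshot check applies
}
```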

ℹ️ Pre-existing: gap > 1 silently ignored — When raftState.Height > currentState.LastBlockHeight + 1 (raft is more than 1 block ahead), RecoverFromRaft falls through to the final return nil without doing anything. This is pre-existing behavior, but it means a node with a large lag silently claims "recovered". Not introduced by this PR, but worth tracking.


Tests

TestSyncer_RecoverFromRaft_LocalAheadOfStaleSnapshot — Covers both paths of the new branch (matching hash → no error; mismatched hash → error). Well structured and realistic. The subtests are clear.

⚠️ TestRaftRetrieverStopClearsApplyCallback — The test is correct, but as CodeRabbit noted, converting it to a table-driven form would make it easier to add idempotency and restart-cycle coverage without duplication (per the project's own guidelines in AGENTS.md).


Summary

  • Bug A fix (stale snapshot crash): ✅ Correct, minor edge-case gap
  • Bug B fix (spurious abdication): ✅ Correct
  • Data race fix (applyMu): ✅ Correct
  • Test coverage (syncer.go): ✅ Good
  • Test coverage (node.go): ❌ Missing
  • Stop/Start race (pre-existing): ⚠️ Not introduced, but exposed

The two critical issues to address before merge are the missing test for waitForMsgsLanded (the exact code path that caused the 67–69s elections) and the raftState.Height == 0 edge case in RecoverFromRaft. The zero-duration ticker panic is low-risk in practice but trivial to fix.

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Apr 8, 2026, 3:14 PM

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
block/internal/syncing/raft_retriever_test.go (1)

42-61: Consider converting this to a table-driven test.

The current case is good, but a table shape will make it easier to add stop idempotency and start/stop-cycle variants without duplicating setup.

As per coding guidelines "Use table-driven tests in Go unit tests".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/raft_retriever_test.go` around lines 42 - 61, The test
TestRaftRetrieverStopClearsApplyCallback should be converted into a table-driven
test to cover multiple scenarios (current stop behavior, stop idempotency,
start/stop cycles) without duplicating setup: create a slice of test cases each
with a name and a sequence of actions (e.g., start, stop, stop again, start/stop
cycle), and in the t.Run loop instantiate a fresh stubRaftNode and retriever via
newRaftRetriever, call retriever.Start and retriever.Stop according to the case,
then assert expected recordedCallbacks via stubRaftNode.recordedCallbacks; keep
using require.NoError for Start and require assertions on callback length and
nil/non-nil entries as in the original test. Ensure each case isolates state by
creating new retriever and stubRaftNode within the loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9bc4987-af22-4eb8-a04c-1c9ef989e55a

📥 Commits

Reviewing files that changed from the base of the PR and between 04c9cad and 2d28b20.

📒 Files selected for processing (3)
  • block/internal/syncing/raft_retriever.go
  • block/internal/syncing/raft_retriever_test.go
  • pkg/raft/node.go

r.mtx.Unlock()

r.wg.Wait()
r.raftNode.SetApplyCallback(nil)
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify unsynchronized callback read/write sites in raft node implementation.
rg -n -C3 'type FSM struct|applyCh|func \(n \*Node\) SetApplyCallback|func \(f \*FSM\) Apply' pkg/raft/node.go

Repository: evstack/ev-node

Length of output: 1336


🏁 Script executed:

# Check the actual raft_retriever.Stop() implementation to understand what wg.Wait() is waiting for
cat -n block/internal/syncing/raft_retriever.go | grep -A 15 "func (r \*raftRetriever) Stop"

Repository: evstack/ev-node

Length of output: 590


🏁 Script executed:

# Check if there's synchronization in the raft node's Stop or shutdown methods
rg -n -B3 -A10 "func \(n \*Node\) Stop" pkg/raft/node.go

Repository: evstack/ev-node

Length of output: 553


🏁 Script executed:

# Check if the raft library itself (from ethraft package) provides synchronization around FSM.Apply
rg -n "applyCh" pkg/raft/node.go | head -20

Repository: evstack/ev-node

Length of output: 212


🏁 Script executed:

# Check where raftNode is initialized and where Stop is called in context
rg -n -B5 -A15 "func.*NewRaftRetriever|raftNode.*Stop\(\)" block/internal/syncing/raft_retriever.go

Repository: evstack/ev-node

Length of output: 41


🏁 Script executed:

# Check if there's a parent Syncer that coordinates shutdown of both raftNode and raftRetriever
rg -n -B3 -A8 "type Syncer struct|raftRetriever\)" block/internal/syncing/syncer.go | head -40

Repository: evstack/ev-node

Length of output: 328


🏁 Script executed:

# Check the full Stop() implementation and see if there are any shutdown calls to raftNode before SetApplyCallback
cat -n block/internal/syncing/raft_retriever.go | sed -n '60,90p'

Repository: evstack/ev-node

Length of output: 951


🏁 Script executed:

# Find where raftRetriever.Stop() is called
rg -n "raftRetriever.*Stop\(\)|\.Stop\(\)" block/internal/syncing/syncer.go | head -20

Repository: evstack/ev-node

Length of output: 186


🏁 Script executed:

# Get the Syncer.Stop() method to see the full shutdown sequence
rg -n -B3 -A20 "func \(s \*Syncer\) Stop" block/internal/syncing/syncer.go

Repository: evstack/ev-node

Length of output: 646


🏁 Script executed:

# Check where r.raftNode is stopped/shut down
rg -n "raftNode\.*Stop\|raftRetriever.*Stop" block/internal/syncing/raft_retriever.go

Repository: evstack/ev-node

Length of output: 41


Synchronize callback access to prevent data race with FSM.Apply.

Line 77's SetApplyCallback(nil) races with FSM.Apply in pkg/raft/node.go, which reads and sends on applyCh without synchronization. The wg.Wait() only ensures the consumer (raftApplyLoop) has exited, but raftNode continues running and can still invoke Apply(). Guard callback access with a mutex in the raft node implementation.

Suggested fix (in pkg/raft/node.go)
 import (
 	"context"
 	"errors"
 	"fmt"
 	"io"
 	"net"
 	"os"
 	"path/filepath"
 	"strings"
+	"sync"
 	"sync/atomic"
 	"time"
@@
 type FSM struct {
 	logger  zerolog.Logger
 	state   *atomic.Pointer[RaftBlockState]
+	applyMu sync.RWMutex
 	applyCh chan<- RaftApplyMsg
 }
@@
 func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
+	n.fsm.applyMu.Lock()
+	defer n.fsm.applyMu.Unlock()
 	n.fsm.applyCh = ch
 }
@@
-	if f.applyCh != nil {
+	f.applyMu.RLock()
+	ch := f.applyCh
+	f.applyMu.RUnlock()
+	if ch != nil {
 		select {
-		case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
+		case ch <- RaftApplyMsg{Index: log.Index, State: &state}:
 		default:
 			// on a slow consumer, the raft cluster should not be blocked. Followers can sync from DA or other peers, too.
 			f.logger.Warn().Msg("apply channel full, dropping message")
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/syncing/raft_retriever.go` at line 77, The call to
r.raftNode.SetApplyCallback(nil) races with FSM.Apply because Apply reads/sends
on applyCh while the raft node may concurrently invoke the callback; fix by
adding a mutex to the raft node to guard access to the apply callback: protect
the callback field and its setter Get/SetApplyCallback (or SetApplyCallback and
any internal invocation sites) with the new mutex so that FSM.Apply (which
reads/sends on applyCh via the callback) cannot see a nil or changing callback
mid-invocation; update the raft node's invocation path that calls the callback
(where Apply is invoked) to acquire the same mutex (or use a read lock) when
reading the callback and release it immediately after obtaining the pointer,
then call the callback outside the lock if needed to avoid long-held locks.

@auricom auricom marked this pull request as draft April 7, 2026 15:24
@codecov

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 37.93103% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.91%. Comparing base (d2a29e8) to head (52d7cda).
⚠️ Report is 2 commits behind head on main.

Files with missing lines:
  • pkg/raft/node.go — Patch 0.00%, 15 lines missing ⚠️
  • block/internal/syncing/syncer.go — Patch 76.92%, 2 missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3230      +/-   ##
==========================================
+ Coverage   61.67%   61.91%   +0.23%     
==========================================
  Files         120      120              
  Lines       12635    12656      +21     
==========================================
+ Hits         7793     7836      +43     
+ Misses       3968     3944      -24     
- Partials      874      876       +2     
Flag Coverage Δ
combined 61.91% <37.93%> (+0.23%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


auricom and others added 2 commits April 8, 2026 17:03
Bug A: RecoverFromRaft crashed with "invalid block height" when the node
restarted after SIGTERM and the EVM state (persisted before kill) was ahead
of the raft FSM snapshot (which hadn't finished log replay yet). The function
now verifies the hash of the local block at raftState.Height — if it matches
the snapshot hash the EVM history is correct and recovery is safely skipped;
a mismatch returns an error indicating a genuine fork.

Bug B: waitForMsgsLanded used two repeating tickers with the same effective
period (SendTimeout/2 poll, SendTimeout timeout), so both could fire
simultaneously in select and the timeout would win even when AppliedIndex >=
CommitIndex. Replaced the deadline ticker with a one-shot time.NewTimer,
added a final check in the deadline branch, and reduced poll interval to
min(50ms, timeout/4) for more responsive detection.

Fixes the crash-restart Docker backoff loop observed in SIGTERM HA test
cycle 7 (poc-ha-2 never rejoining within the 300s kill interval).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SetApplyCallback(nil) called from raftRetriever.Stop() raced with
FSM.Apply reading applyCh: wg.Wait() only ensures the consumer goroutine
exited, but the raft library can still invoke Apply concurrently.

Add applyMu sync.RWMutex to FSM; take write lock in SetApplyCallback and
read lock in Apply before reading the channel pointer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom changed the title Fix raft leader re-election delays after SIGTERM fix: raft leader re-election delays after SIGTERM Apr 8, 2026


Development

Successfully merging this pull request may close these issues.

Raft leader re-election takes up to 90s after SIGTERM on a 3-node cluster

1 participant