fix: raft leader re-election delays after SIGTERM #3230
Conversation
📝 Walkthrough

The changes fix raft retriever shutdown behavior and message-landing detection to prevent apply-channel congestion and timeout issues during graceful shutdown. A new callback-clearing step was added to the retriever's Stop method, and the shutdown's message-landing condition was changed from comparing against LastIndex to comparing against CommitIndex.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 warning
Claude finished @auricom's task in 3m 42s — Code Review
This PR addresses two real bugs exposed by HA failover testing. The root-cause analyses are accurate and the fixes are generally correct. Several things are worth flagging before merge.
| Area | Verdict |
|---|---|
| Bug A fix (stale snapshot crash) | ✅ Correct, minor edge-case gap |
| Bug B fix (spurious abdication) | ✅ Correct |
| Data race fix (applyMu) | ✅ Correct |
| Test coverage (syncer.go) | ✅ Good |
| Test coverage (node.go) | ❌ Missing |
| Stop/Start race (pre-existing) | |
The two critical issues to address before merge are the missing test for waitForMsgsLanded (the exact code path that caused the 67–69s elections) and the raftState.Height == 0 edge case in RecoverFromRaft. The zero-duration ticker panic is low-risk in practice but trivial to fix.
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Actionable comments posted: 1
🧹 Nitpick comments (1)
block/internal/syncing/raft_retriever_test.go (1)
42-61: Consider converting this to a table-driven test. The current case is good, but a table shape will make it easier to add stop-idempotency and start/stop-cycle variants without duplicating setup.
As per coding guidelines "Use table-driven tests in Go unit tests".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@block/internal/syncing/raft_retriever_test.go` around lines 42 - 61, The test TestRaftRetrieverStopClearsApplyCallback should be converted into a table-driven test to cover multiple scenarios (current stop behavior, stop idempotency, start/stop cycles) without duplicating setup: create a slice of test cases each with a name and a sequence of actions (e.g., start, stop, stop again, start/stop cycle), and in the t.Run loop instantiate a fresh stubRaftNode and retriever via newRaftRetriever, call retriever.Start and retriever.Stop according to the case, then assert expected recordedCallbacks via stubRaftNode.recordedCallbacks; keep using require.NoError for Start and require assertions on callback length and nil/non-nil entries as in the original test. Ensure each case isolates state by creating new retriever and stubRaftNode within the loop.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d9bc4987-af22-4eb8-a04c-1c9ef989e55a
📒 Files selected for processing (3)
- block/internal/syncing/raft_retriever.go
- block/internal/syncing/raft_retriever_test.go
- pkg/raft/node.go
```go
	r.mtx.Unlock()

	r.wg.Wait()
	r.raftNode.SetApplyCallback(nil)
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify unsynchronized callback read/write sites in raft node implementation.
rg -n -C3 'type FSM struct|applyCh|func \(n \*Node\) SetApplyCallback|func \(f \*FSM\) Apply' pkg/raft/node.go
```

Repository: evstack/ev-node
Length of output: 1336

🏁 Script executed:

```bash
# Check the actual raft_retriever.Stop() implementation to understand what wg.Wait() is waiting for
cat -n block/internal/syncing/raft_retriever.go | grep -A 15 "func (r \*raftRetriever) Stop"
```

Length of output: 590

🏁 Script executed:

```bash
# Check if there's synchronization in the raft node's Stop or shutdown methods
rg -n -B3 -A10 "func \(n \*Node\) Stop" pkg/raft/node.go
```

Length of output: 553

🏁 Script executed:

```bash
# Check if the raft library itself (from ethraft package) provides synchronization around FSM.Apply
rg -n "applyCh" pkg/raft/node.go | head -20
```

Length of output: 212

🏁 Script executed:

```bash
# Check where raftNode is initialized and where Stop is called in context
rg -n -B5 -A15 "func.*NewRaftRetriever|raftNode.*Stop\(\)" block/internal/syncing/raft_retriever.go
```

Length of output: 41

🏁 Script executed:

```bash
# Check if there's a parent Syncer that coordinates shutdown of both raftNode and raftRetriever
rg -n -B3 -A8 "type Syncer struct|raftRetriever\)" block/internal/syncing/syncer.go | head -40
```

Length of output: 328

🏁 Script executed:

```bash
# Check the full Stop() implementation and see if there are any shutdown calls to raftNode before SetApplyCallback
cat -n block/internal/syncing/raft_retriever.go | sed -n '60,90p'
```

Length of output: 951

🏁 Script executed:

```bash
# Find where raftRetriever.Stop() is called
rg -n "raftRetriever.*Stop\(\)|\.Stop\(\)" block/internal/syncing/syncer.go | head -20
```

Length of output: 186

🏁 Script executed:

```bash
# Get the Syncer.Stop() method to see the full shutdown sequence
rg -n -B3 -A20 "func \(s \*Syncer\) Stop" block/internal/syncing/syncer.go
```

Length of output: 646

🏁 Script executed:

```bash
# Check where r.raftNode is stopped/shut down
rg -n "raftNode\.*Stop\|raftRetriever.*Stop" block/internal/syncing/raft_retriever.go
```

Length of output: 41
Synchronize callback access to prevent data race with FSM.Apply.
Line 77's SetApplyCallback(nil) races with FSM.Apply in pkg/raft/node.go, which reads and sends on applyCh without synchronization. The wg.Wait() only ensures the consumer (raftApplyLoop) has exited, but raftNode continues running and can still invoke Apply(). Guard callback access with a mutex in the raft node implementation.
Suggested fix (in pkg/raft/node.go):

```diff
 import (
 	"context"
 	"errors"
 	"fmt"
 	"io"
 	"net"
 	"os"
 	"path/filepath"
 	"strings"
+	"sync"
 	"sync/atomic"
 	"time"
@@
 type FSM struct {
 	logger  zerolog.Logger
 	state   *atomic.Pointer[RaftBlockState]
+	applyMu sync.RWMutex
 	applyCh chan<- RaftApplyMsg
 }
@@
 func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
+	n.fsm.applyMu.Lock()
+	defer n.fsm.applyMu.Unlock()
 	n.fsm.applyCh = ch
 }
@@
-	if f.applyCh != nil {
+	f.applyMu.RLock()
+	ch := f.applyCh
+	f.applyMu.RUnlock()
+	if ch != nil {
 		select {
-		case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
+		case ch <- RaftApplyMsg{Index: log.Index, State: &state}:
 		default:
 			// on a slow consumer, the raft cluster should not be blocked. Followers can sync from DA or other peers, too.
 			f.logger.Warn().Msg("apply channel full, dropping message")
 		}
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@block/internal/syncing/raft_retriever.go` at line 77, The call to
r.raftNode.SetApplyCallback(nil) races with FSM.Apply because Apply reads/sends
on applyCh while the raft node may concurrently invoke the callback; fix by
adding a mutex to the raft node to guard access to the apply callback: protect
the callback field and its setter Get/SetApplyCallback (or SetApplyCallback and
any internal invocation sites) with the new mutex so that FSM.Apply (which
reads/sends on applyCh via the callback) cannot see a nil or changing callback
mid-invocation; update the raft node's invocation path that calls the callback
(where Apply is invoked) to acquire the same mutex (or use a read lock) when
reading the callback and release it immediately after obtaining the pointer,
then call the callback outside the lock if needed to avoid long-held locks.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #3230      +/-   ##
==========================================
+ Coverage   61.67%   61.91%   +0.23%
==========================================
  Files         120      120
  Lines       12635    12656      +21
==========================================
+ Hits         7793     7836      +43
+ Misses       3968     3944      -24
- Partials      874      876       +2
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Bug A: RecoverFromRaft crashed with "invalid block height" when the node restarted after SIGTERM and the EVM state (persisted before the kill) was ahead of the raft FSM snapshot (which hadn't finished log replay yet). The function now verifies the hash of the local block at raftState.Height: if it matches the snapshot hash, the EVM history is correct and recovery is safely skipped; a mismatch returns an error indicating a genuine fork.

Bug B: waitForMsgsLanded used two repeating tickers with the same effective deadline (SendTimeout/2 poll, SendTimeout timeout), so both could fire simultaneously in select and the timeout could win even when AppliedIndex >= CommitIndex. Replaced the repeating deadline ticker with a one-shot time.NewTimer, added a final check in the deadline branch, and reduced the poll interval to min(50ms, timeout/4) for more responsive detection.

Fixes the crash-restart Docker backoff loop observed in SIGTERM HA test cycle 7 (poc-ha-2 never rejoining within the 300s kill interval).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SetApplyCallback(nil) called from raftRetriever.Stop() raced with FSM.Apply reading applyCh: wg.Wait() only ensures the consumer goroutine exited, but the raft library can still invoke Apply concurrently. Add applyMu sync.RWMutex to FSM; take write lock in SetApplyCallback and read lock in Apply before reading the channel pointer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Fixes two bugs found in SIGTERM HA testing (cycle 7 — poc-ha-2 never rejoining within the 300s kill interval).
Previous fixes (already in branch)
Bug A — `invalid block height` crash on follower restart (Critical)

Root cause: After SIGTERM, the EVM state is persisted to disk at height N (e.g. 6768). The Raft FSM snapshot is at an older height M (e.g. 6691). On restart, hashicorp/raft restores the snapshot asynchronously; `verifyState` may run before the FSM finishes log replay, so `GetState()` still returns height M. `IsSyncedWithRaft` computes `diff = N − M = +77` and calls `RecoverFromRaft({Height: M})`, which crashed with `"invalid block height: 6691 (expected 6769)"` — treating a stale-snapshot artifact as a fork.

Fix (`block/internal/syncing/syncer.go` — `RecoverFromRaft`): When `currentState.LastBlockHeight > raftState.Height`, look up the local block at `raftState.Height` and compare its hash against the raft snapshot hash. Match → the EVM history is correct and the FSM is stale → skip recovery. Mismatch → genuine divergence → return an error.

Bug B — spurious FSM-lag abdication → 67–69s elections (High)

Primary cause: Bug A forced poc-ha-2 into a Docker restart backoff loop (~2–3 min). When it finally rejoined and won an election it had a large FSM backlog, causing `waitForMsgsLanded(200ms)` to time out → abdication → a second election round.

Secondary cause (independent): `waitForMsgsLanded` used two repeating tickers with the same effective deadline (`SendTimeout/2` poll, `SendTimeout` timeout). At the deadline both cases are ready in `select` and Go picks randomly — the timeout could win even when `AppliedIndex >= CommitIndex`.

Fix (`pkg/raft/node.go` — `waitForMsgsLanded`): Replaced the repeating deadline ticker with a one-shot `time.NewTimer`, added a final check in the deadline branch, and reduced the poll interval to `min(50ms, timeout/4)`.

Testing

- `TestSyncer_RecoverFromRaft_LocalAheadOfStaleSnapshot` — new test covering the matching-hash (skip) and diverged-hash (error) paths
- `go test ./block/internal/syncing/... ./pkg/raft/...` passes

Closes #3229