feat(consensus): report catching_up when fallen behind peers by pratikspatil024 · Pull Request #40 · 0xPolygon/cometbft

pratikspatil024 · 2026-06-02T12:52:14Z

Summary

/status's catching_up is derived solely from the consensus reactor's WaitSync() latch. WaitSync is set once at startup and cleared the first time the node switches from block-sync to the consensus reactor, after which it stays false for the rest of the process. So the field only ever reflects the initial block-sync and never reflects a node that later stops making progress relative to its peers — that node keeps answering catching_up=false even while it has fallen well behind, and anything that reads the field as a readiness/health signal (load balancers, k8s probes, dashboards, and in-process consumers) treats it as fully synced.

This PR adds Reactor.IsBehind(), evaluated from peer-reported heights, and folds it into the field:

CatchingUp = WaitSync() || IsBehind()

IsBehind() is true when a connected peer is more than catchup_lag_threshold blocks ahead of our consensus height, or when the node has no peers and is not the sole validator (it cannot make progress on its own, so it cannot establish that it is current).

It uses peer heights, not local block-time staleness, on purpose. A bare staleness check cannot distinguish:

a node that is genuinely behind its peers (some peer reports a greater height) — which is catching up, from
a network in which every node has legitimately stopped at the same height (no peer is ahead) — which is not catching up.

Reporting the second case as catching up would be a false positive, so height comparison is the discriminator rather than "my last block is old".

The zero-peer case is self-determining (no configuration): a node with no peers reports catching_up=true unless it is the only validator in the set, in which case it finalizes blocks on its own and needs no peers. Sole-validator status uses the same local-validator membership test as the signing path and is read live from consensus state, so it follows validator-set changes the node has committed into local round state and behaves identically on mainnet, testnet, and local devnets — no network-specific config, and the scenario is testable on any kurtosis network id.

A short debounce (catchup_debounce_duration) requires the lag condition to hold continuously before catching_up flips to true, damping flapping at the threshold boundary; the transition back to false is immediate.

New `[consensus]` config (defaults)

Key	Default	Notes
`catchup_lag_threshold`	`5`	Blocks a peer may be ahead before we report behind. `0` disables peer-height lag detection only — the zero-peer rule still applies. Must be `>=2` when enabled, to absorb the normal one-height round skew between synced peers.
`catchup_debounce_duration`	`10s`	How long the lag condition must hold before flipping to `true`.

Both are additive with their defaults; behaviour only changes once a peer is >threshold ahead, or the node has no peers while other validators exist. The zero-peer rule needs no config.

Implementation notes / design decisions

Fixed at the reactor, OR'd into status.go. IsBehind() is a new read-only method on *Reactor; the RPC consensusReactor interface gains it and Status ORs it into CatchingUp.
WaitSync is left untouched. It also gates consensus message handling (reactor.go channel handlers), so re-arming it to express "behind" would disable consensus rather than just relabel health. IsBehind() is therefore a separate signal.
Zero-peer handling is self-determining, not configured. A node with no peers reports behind unless it is the sole validator (State.isLocalSoleValidator, the same membership test as the signing path). This needs no per-network configuration and is testable on any devnet regardless of chain id; a non-validator node has no private validator and so always needs peers.
Decision logic is split into pure helpers (evaluateBehind, applyDebounceLocked) so the branching is unit-testable without p2p plumbing; isBehindRaw only gathers heights and the sole-validator flag.
Offline inspect server stubs IsBehind() => false.

Executed tests

New table-driven unit tests (consensus/reactor_catchup_test.go):
- evaluateBehind: peer ahead beyond / one over the threshold, peer ahead within threshold, equal-height (network halt), one-height round skew, no peer height learned, threshold disabled, sole-validator zero-peer (healthy), non-sole zero-peer (behind), and zero-peer behind even when the lag check is disabled.
- applyDebounceLocked: within-debounce stays false, past-debounce flips true, transient blip resets the timer, zero-debounce immediate, not-behind resets state.
go build ./consensus/... ./config/... ./rpc/... ./node/... ./inspect/... — clean.
go test ./consensus/ ./config/ ./node/ — pass.
gofmt -l — clean. golangci-lint / gofumpt / diffguard are enforced by CI (not run locally).

Rollout notes

Not consensus-affecting. This relabels a reported /status field; it does not change block validity, state transition, app hash, or wire format.
Backward-compatible. Defaults leave catching_up unchanged for synced nodes; single-validator / single-node deployments are unaffected (the sole validator advances without peers).
No configuration required. The zero-peer behavior ships in the binary and is identical across networks; operators only need config work if they want to tune catchup_lag_threshold / catchup_debounce_duration away from the defaults.

Known limitations

Peer heights are only as fresh as gossip. If peers stay in the peer set but stop sending round-state updates (e.g. a half-open TCP connection before p2p liveness evicts them), maxPeerHeight is stale until eviction, so IsBehind() can lag until the peer set reflects reality. p2p keepalive tuning mitigates this; it's a property of any peer-height-based signal.
IsBehind() is read-triggered. The debounce state advances on /status reads, so a sparse poller can under-report sustained lag until a second poll occurring at least catchup_debounce_duration after the first lagging observation. Reactor-side sampling/caching is a possible follow-up if /status semantics need to be wall-clock accurate independent of callers.
maxPeerHeight == 0 (no peer height learned yet) is treated as not-behind.

Open design decisions (feedback wanted)

catchup_lag_threshold default (5). Right value? And should the threshold be height-based (current) or time-derived (e.g. "behind by more than N block-times")? Height-based is deterministic and simpler; time-based is more portable across block-time configs.
~~min_expected_peers default.~~ Resolved — replaced with the self-determining sole-validator rule above (no config knob), chosen so the zero-peer scenario is testable on any kurtosis network id rather than hardcoding network names.
catchup_debounce_duration default (10s). Worth tuning against typical block time; also whether the debounce should be symmetric (currently fast-to-recover, slow-to-assert).
External authoritative height. For forks that have a stronger external height signal (e.g. a checkpoint-derived height), should IsBehind() optionally accept that as an additional input, or is that better layered downstream of this method?
Surface area. Should "behind" also be exposed as its own /status field and/or a metric, in addition to folding into catching_up, so consumers can distinguish "initial block-sync" from "fell behind after sync"?

Part of a broader reliability effort; downstream wiring/tuning lands separately.

The catching_up field in /status is derived solely from the consensus reactor's WaitSync() latch. WaitSync is set once at startup and cleared the first time the node switches from block-sync to the consensus reactor, after which it stays false for the remainder of the process. As a result the field only ever reflects the initial block-sync and never reflects a node that later stops making progress relative to its peers, so such a node keeps reporting catching_up=false while operators and downstream consumers treat it as fully synced. Add Reactor.IsBehind(), evaluated from peer-reported heights, and OR it into catching_up: CatchingUp = WaitSync() || IsBehind() IsBehind reports true when a connected peer is more than catchup_lag_threshold blocks ahead of our consensus height, or when the connected-peer count is below min_expected_peers. The decision uses peer heights rather than local block-time staleness on purpose: staleness cannot distinguish a node that is behind its peers (some peer is ahead) from a network in which every node has legitimately stopped at the same height (no peer is ahead). Only the former is catching up; reporting the latter as catching up would be a false positive. The lag condition must hold continuously for catchup_debounce_duration before catching_up flips to true, which damps flapping at the threshold boundary; the transition back to false is immediate. New [consensus] config (with defaults): - catchup_lag_threshold = 5 (0 disables; must be >=2 to absorb the one-height round skew between synced peers) - min_expected_peers = 0 (off by default so single-node deployments stay healthy) - catchup_debounce_duration = 10s WaitSync is left untouched: it gates consensus message handling, so IsBehind is an independent read-only signal rather than a re-arming of that latch. The offline inspect server's reactor stub returns IsBehind()=false. Tests cover the lag decision (peer ahead beyond/at/just over the threshold, equal-height, one-height skew, no peer height learned, threshold disabled, single-node zero-peer, and the min-peers guard on/off) and the debounce state machine. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cffls · 2026-06-02T21:00:44Z

 		DoubleSignCheckHeight:       int64(0),
 		BlockTimeTolerance:          60 * time.Second,
+		CatchupLagThreshold:         5,
+		MinExpectedPeers:            0,


Should this be 1? If a node doesn't have peer, we should assume it is already behind.

Good Catch!

Kept it at 0 because this default also applies to single-node and test setups - with 1, a node with 0 peers would report catching_up=true forever (breaks single-node devnets and the !catching_up test waits).

We can keep the default 0 and set min_expected_peers=1 in our production config templates, where peers are always expected. Gives prod the "0 peers -> behind" behavior without breaking single-node/test. WDYT?

Sounds good, thanks for the explanation!

min_expected_peers=1 should be default config for production (Amoy/mainnet) config, so users won't need to manually modify this value.

Agreed.
So I'll set it as a network-aware in-code default in initCometBFTConfig (1 for mainnet/Amoy, 0 for local, keyed on the chain flag) in heimdall once this PR is merged and the new version is updated. Doing the same in comet is a bit tricky and risky given it cannot differentiate between networks. Updated the plan.

Does it sounds good?

Is it possible to check whether the node is the only validator in the network itself, and make an exception for it? That is, only allow 0 if it is the only validator in the network. I am not a big fan of hardcoding Amoy/Mainnet in the config, because we want to test this exact scenario in kurtosis network, which has a different network id from Amoy and mainnet.

Done here. Thanks!

kamuikatsurgi · 2026-06-03T11:07:09Z

Should this kind of behaviour reported upstream? They would have faced this issue somewhere no? Worth confirming why they have deliberately chosen this behaviour.

vbhattaccmu · 2026-06-03T13:44:09Z

+func (conR *Reactor) IsBehind() bool {
+	raw := conR.isBehindRaw()
+
+	conR.mtx.Lock()


IsBehind() is read-triggered, debounce state advances only on /status calls, so sustained lag can be under-reported until a second poll. May be we can try reactor-side sampling/caching.

Good observation.

The current debounce is poll-driven, not an independent background sampler. For Bor this is effectively continuous because SpanStore refreshes Heimdall /status every 200ms, and the wait gate also re-polls at that cadence while it is blocked. Sparse operator/LB probes can still under-report until the next poll that occurs at least catchup_debounce_duration after the first lagging observation. I'd document that in this PR and treat reactor-side cached sampling as a follow-up unless we want /status semantics to be wall-clock accurate independent of callers

A node that has finished initial block-sync but later ends up with no peers cannot observe whether the network has advanced past its height. The earlier approach gated this on a `min_expected_peers` config value (default 0), which had to be raised per network to have any effect. That encoded network-specific assumptions in configuration, did not reach already-initialized nodes on a binary upgrade (the key is absent from their config.toml), and could not be exercised on a local devnet whose chain id differs from the production networks. Replace the config knob with a self-determining rule: a node with zero peers reports catching_up=true unless it is the sole validator, in which case it can finalize blocks on its own and needs no peers to make progress. Sole-validator status uses the same local-validator membership test as the signing path (Validators.HasAddress on the private validator's address) and is read live from consensus state on each call, so it follows validator-set changes the node has committed into local round state without any cached value or operator action. A non-validator node has no private validator and so always needs peers, and is reported accordingly. This requires no configuration and is network-agnostic: the behavior is identical on mainnet, testnet, and local devnets, and ships entirely in the binary with no per-network value to deploy. Details: - consensus/reactor.go: evaluateBehind takes an isSoleValidator bool instead of a peer-count threshold; isBehindRaw sources it from State.isLocalSoleValidator. - consensus/state.go: add isLocalSoleValidator, read under the state lock. - config: remove the MinExpectedPeers field, its default, its ValidateBasic check, and the config.toml template entry. ReactorCatchupConfig and its node wiring drop the corresponding argument. - config: clarify that catchup_lag_threshold = 0 disables peer-height lag detection only; the zero-peer rule applies independently of it. - consensus/reactor_catchup_test.go: table updated for the sole-validator input. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-06-04T04:17:09Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

cffls reviewed Jun 2, 2026

View reviewed changes

pratikspatil024 marked this pull request as ready for review June 3, 2026 09:54

pratikspatil024 requested a review from a team June 3, 2026 09:54

vbhattaccmu reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(consensus): report catching_up when fallen behind peers#40

feat(consensus): report catching_up when fallen behind peers#40
pratikspatil024 wants to merge 2 commits into
developfrom
ppatil/honest-catching-up

pratikspatil024 commented Jun 2, 2026 •

edited

Loading

Uh oh!

cffls Jun 2, 2026

Uh oh!

pratikspatil024 Jun 3, 2026

Uh oh!

cffls Jun 3, 2026

Uh oh!

cffls Jun 3, 2026

Uh oh!

pratikspatil024 Jun 4, 2026

Uh oh!

cffls Jun 4, 2026

Uh oh!

pratikspatil024 Jun 4, 2026

Uh oh!

kamuikatsurgi commented Jun 3, 2026

Uh oh!

vbhattaccmu Jun 3, 2026 •

edited

Loading

Uh oh!

pratikspatil024 Jun 4, 2026

Uh oh!

sonarqubecloud Bot commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

pratikspatil024 commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

New [consensus] config (defaults)

Implementation notes / design decisions

Executed tests

Rollout notes

Known limitations

Open design decisions (feedback wanted)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kamuikatsurgi commented Jun 3, 2026

Uh oh!

vbhattaccmu Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented Jun 4, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pratikspatil024 commented Jun 2, 2026 •

edited

Loading

New `[consensus]` config (defaults)

vbhattaccmu Jun 3, 2026 •

edited

Loading