fix(harness): unique chain id per run; stop reusing stale genesis #446
Root cause of the nightly hang: the cronjobs pass a static SEI_CHAIN_ID
("bench"/"rel"), so every run reuses the prior run's persisted genesis. The
nodes boot with fresh validator/P2P keys, but the genesis — keyed by chain id —
still names the prior run's validator set, so the live nodes are not the
validators genesis expects. Consensus can never reach 2/3 of the genesis set and
halts at height 1 (validators stuck in RoundStepPropose, voting_power=0, dialing
phantom peer NodeIDs that exist nowhere in the live config). With no blocks, the
EVM RPC never serves and the harness's EVM-readiness gate blocks → NightlyRunFailed.
benchmark/release used SEI_CHAIN_ID raw; chaos suffixed only by scenario, so it
collided across runs too. The harness image is distroless (no shell), so the id
can't be made unique in the CronJob — derive it in-process: runChainID appends a
per-run token to the base, matching the chaos suite's existing base semantics.
Also reverts the storage.state_store.enable=false override from #445 — that was
based on an incorrect state-store-deadlock diagnosis (seid was never wedged; it
ran fine, the chain was halted). The state store returns to its image default.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…#446)
xreview hardening:
- runChainID uses UnixNano (was Unix): 1-second resolution could alias a same-second manual re-trigger onto a prior run's not-yet-reaped chain and reproduce the height-1 halt. Nanosecond resolution closes that window.
- Correct the runLabelKey comment: the nightly-gc label sweep (sei.io/harness-run, >5h) already ships in platform and reaps abnormal-exit orphans; the prior "pending platform deliverable" note was stale and implied an unbounded leak.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
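To make the aliasing concrete, here is a minimal sketch of a timestamp-derived chain id at both resolutions. The real runChainID is not shown in this PR text, so the "<base>-<token>" layout, the chainIDAt helper, and the sample timestamps are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// chainIDAt derives a per-run chain id from a base and a timestamp.
// The "<base>-<token>" layout is an assumption, not the harness's format.
func chainIDAt(base string, t time.Time, nanos bool) string {
	if nanos {
		return fmt.Sprintf("%s-%d", base, t.UnixNano())
	}
	return fmt.Sprintf("%s-%d", base, t.Unix())
}

// sameSecondPair returns two hypothetical trigger times 200ms apart that
// fall inside the same wall-clock second.
func sameSecondPair() (time.Time, time.Time) {
	first := time.Date(2025, 1, 1, 3, 0, 0, 100_000_000, time.UTC)
	return first, first.Add(200 * time.Millisecond)
}

func main() {
	first, second := sameSecondPair()

	collideSec := chainIDAt("bench", first, false) == chainIDAt("bench", second, false)
	collideNano := chainIDAt("bench", first, true) == chainIDAt("bench", second, true)

	fmt.Println("second resolution collides:", collideSec)
	fmt.Println("nanosecond resolution collides:", collideNano)
}
```

A timestamp token keeps the derivation in-process with no extra dependencies, which suits a distroless image; a random suffix would be another way to get the same property.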
#446 wired runChainID into the load/release/chaos suites but missed TestChainUpgrade, which provisions a 4-validator chain on a static SEI_CHAIN_ID ("upg") and is therefore exposed to the same stale-genesis halt on any same-id collision (back-to-back run, or SIGKILL leaving genesis behind). Its own doc already calls SEI_CHAIN_ID "(base)"; this makes the code match. Completes the per-run unique chain id across all four suites.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
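The same-id collision can be modelled as a lookup in a store keyed by chain id. This is a toy model, not the platform's persistence layer: genesisStore, provision, and the validator names are hypothetical, and the "upg-<n>" ids only stand in for whatever per-run token the harness appends.

```go
package main

import "fmt"

// genesisStore stands in for whatever persists genesis between runs.
// It is a hypothetical in-memory model keyed by chain id.
type genesisStore map[string][]string

// provision returns the validator set the chain will boot with. If a
// genesis already exists for chainID it is reused as-is (the behaviour
// behind the halt); otherwise a new one is recorded.
func (s genesisStore) provision(chainID string, validators []string) (set []string, reused bool) {
	if old, ok := s[chainID]; ok {
		return old, true
	}
	s[chainID] = validators
	return validators, false
}

func main() {
	store := genesisStore{}

	// Static id: the second run inherits the first run's validator set.
	store.provision("upg", []string{"run1-v0", "run1-v1", "run1-v2", "run1-v3"})
	set, reused := store.provision("upg", []string{"run2-v0", "run2-v1", "run2-v2", "run2-v3"})
	fmt.Println("static id reused:", reused, "validators:", set)

	// Per-run ids: each run misses the store and gets its own genesis.
	store.provision("upg-1", []string{"run1-v0", "run1-v1", "run1-v2", "run1-v3"})
	set, reused = store.provision("upg-2", []string{"run2-v0", "run2-v1", "run2-v2", "run2-v3"})
	fmt.Println("per-run id reused:", reused, "validators:", set)
}
```

In this model, leftover state from a SIGKILLed run is just an entry that was never deleted, so a unique key avoids it without depending on cleanup having succeeded.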
Root cause
The nightly chains halt at height 1 and never produce blocks, so nothing becomes EVM-serving and the harness blocks → NightlyRunFailed.
Why: the cronjobs pass a static SEI_CHAIN_ID (bench, rel). Every run reuses the prior run's persisted genesis (keyed by chain id). The nodes boot with fresh validator/P2P keys, but genesis still names the prior run's validator set, so the live validators aren't the ones genesis expects, consensus can't reach 2/3 of the genesis set, and it halts at height 1 (RoundStepPropose, voting_power=0, validators dialing phantom peer NodeIDs such as 5ad3366f… that exist nowhere in the live config; the real persistent_peers are correct).
Evidence (live): all bench nodes frozen at tendermint_consensus_height=1 for 5m+, 4 peers each, validator logs spamming r.runConn err="peer NodeID = <actual>, want <stale-genesis-id>".
Fix
runChainID(base) appends a per-run token to the base chain id, so each run provisions a fresh genesis + keys and never collides with prior state. benchmark/release used SEI_CHAIN_ID raw; chaos suffixed only by scenario (also collided across runs). The harness image is distroless/static (no shell), so this can't be done in the CronJob; it's derived in-process, matching the chaos suite's existing "base" semantics.
Also reverts the storage.state_store.enable=false override from #445. That came from an incorrect "state-store deadlock" diagnosis (seid was never wedged; a goroutine dump showed it fully running, and the chain was halted). State store returns to the image default.
Validation
go vet -tags integration ./test/integration/ clean; gofmt clean.
… :8545, seiload runs and emits seiload_run_duration_seconds → NightlyRunFailed clears.
Supersedes #445 (wrong diagnosis). PLT-758 to be re-scoped/closed accordingly.
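For triage, the stale ids can be pulled out of validator logs mechanically. The sketch below assumes the log line contains the fragment quoted in the evidence (peer NodeID = <actual>, want <expected>) and that NodeIDs are hex strings; the sample ids and the helper names are invented for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// mismatchRE matches the dial error fragment quoted in the evidence.
// Treating NodeIDs as hex strings is an assumption of this sketch.
var mismatchRE = regexp.MustCompile(`peer NodeID = ([0-9a-fA-F]+), want ([0-9a-fA-F]+)`)

// parseMismatch extracts the id the peer presented and the id the dialer
// expected from one log line.
func parseMismatch(line string) (actual, want string, ok bool) {
	m := mismatchRE.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

// phantomPeers returns, in order of first appearance, every expected id
// that does not belong to any live node.
func phantomPeers(lines []string, live map[string]bool) []string {
	seen := map[string]bool{}
	var out []string
	for _, l := range lines {
		_, want, ok := parseMismatch(l)
		if !ok || live[want] || seen[want] {
			continue
		}
		seen[want] = true
		out = append(out, want)
	}
	return out
}

func main() {
	// Hypothetical log lines and node ids.
	lines := []string{
		`r.runConn err="peer NodeID = aaaa1111, want dead0000"`,
		`unrelated line`,
		`r.runConn err="peer NodeID = bbbb2222, want dead0000"`,
	}
	live := map[string]bool{"aaaa1111": true, "bbbb2222": true}
	fmt.Println("phantom peers:", phantomPeers(lines, live))
}
```

A non-empty result while persistent_peers is correct suggests the expected ids come from persisted state, not from the live configuration.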
🤖 Generated with Claude Code