fix(harness): unique chain id per run; stop reusing stale genesis #446

Merged: bdchatham merged 2 commits into main from fix/nightly-unique-chain-id on Jun 24, 2026

@bdchatham

Collaborator

Root cause

The nightly chains halt at height 1 and never produce blocks, so nothing becomes EVM-serving and the harness blocks → NightlyRunFailed.

Why: the cronjobs pass a static SEI_CHAIN_ID (bench, rel). Every run reuses the prior run's persisted genesis (keyed by chain id). The nodes boot with fresh validator/P2P keys, but genesis still names the prior run's validator set — so the live validators aren't the ones genesis expects, consensus can't reach 2/3 of the genesis set, and it halts at height 1 (RoundStepPropose, voting_power=0, validators dialing phantom peer NodeIDs — 5ad3366f… etc. — that exist nowhere in the live config; the real persistent_peers are correct).
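
To illustrate the quorum arithmetic behind the halt, here is a toy model in Go. It is not Tendermint's consensus code: the validator key names, the equal voting power of 10, and both helper functions are made up for illustration. It only shows that when none of the live keys appear in genesis, the online share of genesis voting power is 0, which can never exceed 2/3.

```go
package main

import "fmt"

// hasQuorum reports whether the online voting power strictly exceeds 2/3 of
// the total voting power named in genesis. Integer math avoids rounding.
func hasQuorum(online, total int64) bool {
	return total > 0 && 3*online > 2*total
}

// onlinePower sums the genesis voting power of the validators whose key is
// actually held by a live node. Keys absent from genesis contribute 0.
func onlinePower(genesis map[string]int64, live []string) int64 {
	var sum int64
	for _, key := range live {
		sum += genesis[key]
	}
	return sum
}

func main() {
	// Hypothetical keys: the genesis persisted by the prior run.
	genesis := map[string]int64{"old-a": 10, "old-b": 10, "old-c": 10, "old-d": 10}
	// This run booted with fresh keys, none of which appear in genesis.
	live := []string{"new-a", "new-b", "new-c", "new-d"}

	power := onlinePower(genesis, live)
	fmt.Println(power, hasQuorum(power, 40)) // prints "0 false"
}
```

With three of the four genesis keys live, the same helpers would report 30 of 40 and a quorum, which is why the problem only appears when the whole key set is regenerated against a reused genesis.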

Evidence (live): all bench nodes frozen at tendermint_consensus_height=1 for 5m+, 4 peers each, validator logs spamming r.runConn err="peer NodeID = <actual>, want <stale-genesis-id>".

Fix

runChainID(base) appends a per-run token to the base chain id, so each run provisions a fresh genesis + keys and never collides with prior state. benchmark/release used SEI_CHAIN_ID raw; chaos suffixed only by scenario (also collided across runs). The harness image is distroless/static (no shell), so this can't be done in the CronJob — it's derived in-process, matching the chaos suite's existing "base" semantics.

Also reverts the storage.state_store.enable=false override from #445 — that came from an incorrect "state-store deadlock" diagnosis (seid was never wedged; a goroutine dump showed it fully running — the chain was halted). State store returns to the image default.

Validation

  • go vet -tags integration ./test/integration/ clean; gofmt clean.
  • Post-merge: rebuild harness image → bump the platform cronjob → re-trigger load → the chain should advance past height 1, seid binds :8545, seiload runs and emits seiload_run_duration_seconds → NightlyRunFailed clears.

Supersedes #445 (wrong diagnosis). PLT-758 to be re-scoped/closed accordingly.

🤖 Generated with Claude Code

Root cause of the nightly hang: the cronjobs pass a static SEI_CHAIN_ID
("bench"/"rel"), so every run reuses the prior run's persisted genesis. The
nodes boot with fresh validator/P2P keys, but the genesis — keyed by chain id —
still names the prior run's validator set, so the live nodes are not the
validators genesis expects. Consensus can never reach 2/3 of the genesis set and
halts at height 1 (validators stuck in RoundStepPropose, voting_power=0, dialing
phantom peer NodeIDs that exist nowhere in the live config). With no blocks, the
EVM RPC never serves and the harness's EVM-readiness gate blocks → NightlyRunFailed.

benchmark/release used SEI_CHAIN_ID raw; chaos suffixed only by scenario, so it
collided across runs too. The harness image is distroless (no shell), so the id
can't be made unique in the CronJob — derive it in-process: runChainID appends a
per-run token to the base, matching the chaos suite's existing base semantics.

Also reverts the storage.state_store.enable=false override from #445 — that was
based on an incorrect state-store-deadlock diagnosis (seid was never wedged; it
ran fine, the chain was halted). The state store returns to its image default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 24, 2026

PR Summary

Medium Risk
Changes how nightly chains are named and provisioned in-cluster (genesis collision fix) and reverts test storage overrides; scope is integration tests only, but failures would block release/load/chaos signal.

Overview
Nightly integration suites were reusing a static SEI_CHAIN_ID, so new validator keys did not match persisted genesis from earlier runs and chains could stall at height 1. The harness now derives the effective id with runChainID, appending a nanosecond token in base36 to the env base so benchmark, release, and chaos each get a fresh genesis/key set; docs treat SEI_CHAIN_ID as a base id.

memiavlStorageConfig drops the explicit storage.state_store.enable=false override and only pins memiavl_only state commitment, leaving the state store at the image default. Comments were updated for nightly-gc label reaping and the corrected storage rationale.

Reviewed by Cursor Bugbot for commit c484971.

…#446)

xreview hardening:
- runChainID uses UnixNano (was Unix): 1-second resolution could alias a
  same-second manual re-trigger onto a prior run's not-yet-reaped chain and
  reproduce the height-1 halt. Nanosecond resolution closes that window.
- Correct the runLabelKey comment: the nightly-gc label sweep (sei.io/harness-run,
  >5h) already ships in platform and reaps abnormal-exit orphans — the prior
  "pending platform deliverable" note was stale and implied an unbounded leak.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham bdchatham merged commit 1b966e5 into main Jun 24, 2026
5 checks passed
@bdchatham bdchatham deleted the fix/nightly-unique-chain-id branch June 24, 2026 17:55
bdchatham added a commit that referenced this pull request Jun 24, 2026
#446 wired runChainID into the load/release/chaos suites but missed
TestChainUpgrade, which provisions a 4-validator chain on a static SEI_CHAIN_ID
("upg") and is therefore exposed to the same stale-genesis halt on any same-id
collision (back-to-back run, or SIGKILL leaving genesis behind). Its own doc
already calls SEI_CHAIN_ID "(base)"; this makes the code match. Completes the
per-run unique chain id across all four suites.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>