fix(harness): unique chain id per run; stop reusing stale genesis by bdchatham · Pull Request #446 · sei-protocol/sei-k8s-controller

bdchatham · 2026-06-24T17:38:59Z

Root cause

The nightly chains halt at height 1 and never produce blocks, so no node ever serves EVM RPC and the harness's EVM-readiness gate blocks → NightlyRunFailed.

Why: the cronjobs pass a static SEI_CHAIN_ID (bench, rel). Persisted genesis is keyed by chain id, so every run reuses the prior run's genesis. The nodes boot with fresh validator/P2P keys, but genesis still names the prior run's validator set, so the live validators aren't the ones genesis expects. Consensus can't reach 2/3 of the genesis set and halts at height 1: validators sit in RoundStepPropose with voting_power=0 and dial phantom peer NodeIDs (5ad3366f… etc.) that exist nowhere in the live config. The real persistent_peers are correct.

Evidence (live): all bench nodes frozen at tendermint_consensus_height=1 for 5m+, 4 peers each, validator logs spamming r.runConn err="peer NodeID = <actual>, want <stale-genesis-id>".
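To make the quorum argument concrete, here is a minimal sketch of the arithmetic, not Tendermint's or seid's actual code. The function name canReachQuorum, the validator ids, and the equal voting powers are all hypothetical; the only rule taken from the description above is that consensus needs more than 2/3 of the genesis set's voting power.

```go
package main

import "fmt"

// canReachQuorum is a hypothetical model of the failure: it reports whether
// the live validators hold more than 2/3 of the voting power named in genesis.
// Validators that genesis does not name contribute nothing.
func canReachQuorum(genesis map[string]int64, live map[string]bool) bool {
	var total, present int64
	for id, power := range genesis {
		total += power
		if live[id] {
			present += power
		}
	}
	return total > 0 && present*3 > total*2
}

func main() {
	// Genesis persisted by a prior run (illustrative ids and powers).
	staleGenesis := map[string]int64{"old-0": 10, "old-1": 10, "old-2": 10, "old-3": 10}
	// This run booted with fresh keys, so none of the live ids appear in genesis.
	liveNodes := map[string]bool{"new-0": true, "new-1": true, "new-2": true, "new-3": true}

	fmt.Println(canReachQuorum(staleGenesis, liveNodes)) // false: voting power present is 0

	// With a genesis generated for this run's keys, all power is present.
	freshGenesis := map[string]int64{"new-0": 10, "new-1": 10, "new-2": 10, "new-3": 10}
	fmt.Println(canReachQuorum(freshGenesis, liveNodes)) // true
}
```

In this model the stale-genesis case has zero voting power present, which matches the voting_power=0 symptom reported above.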

Fix

runChainID(base) appends a per-run token to the base chain id, so each run provisions a fresh genesis + keys and never collides with prior state. Before this change, benchmark/release used SEI_CHAIN_ID raw, and chaos suffixed it only by scenario, so it also collided across runs. The harness image is distroless/static (no shell), so the id can't be made unique in the CronJob; it's derived in-process, matching the chaos suite's existing "base" semantics.

Also reverts the storage.state_store.enable=false override from #445 — that came from an incorrect "state-store deadlock" diagnosis (seid was never wedged; a goroutine dump showed it fully running — the chain was halted). State store returns to the image default.

Validation

go vet -tags integration ./test/integration/ clean; gofmt clean.
Post-merge: rebuild harness image → bump the platform cronjob → re-trigger load → the chain should advance past height 1, seid binds :8545, seiload runs and emits seiload_run_duration_seconds → NightlyRunFailed clears.

Supersedes #445 (wrong diagnosis). PLT-758 to be re-scoped/closed accordingly.

🤖 Generated with Claude Code

Root cause of the nightly hang: the cronjobs pass a static SEI_CHAIN_ID ("bench"/"rel"), so every run reuses the prior run's persisted genesis. The nodes boot with fresh validator/P2P keys, but the genesis (keyed by chain id) still names the prior run's validator set, so the live nodes are not the validators genesis expects. Consensus can never reach 2/3 of the genesis set and halts at height 1 (validators stuck in RoundStepPropose, voting_power=0, dialing phantom peer NodeIDs that exist nowhere in the live config). With no blocks, the EVM RPC never serves and the harness's EVM-readiness gate blocks → NightlyRunFailed.

benchmark/release used SEI_CHAIN_ID raw; chaos suffixed only by scenario, so it collided across runs too. The harness image is distroless (no shell), so the id can't be made unique in the CronJob. Derive it in-process: runChainID appends a per-run token to the base, matching the chaos suite's existing base semantics.

Also reverts the storage.state_store.enable=false override from #445. That was based on an incorrect state-store-deadlock diagnosis (seid was never wedged; it ran fine, the chain was halted). The state store returns to its image default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor · 2026-06-24T17:39:06Z

PR Summary

Medium Risk
Changes how nightly chains are named and provisioned in-cluster (genesis collision fix) and reverts test storage overrides; scope is integration tests only, but failures would block release/load/chaos signal.

Overview
Nightly integration suites were reusing a static SEI_CHAIN_ID, so new validator keys did not match persisted genesis from earlier runs and chains could stall at height 1. The harness now derives the effective id with runChainID, appending a nanosecond token in base36 to the env base so benchmark, release, and chaos each get a fresh genesis/key set; docs treat SEI_CHAIN_ID as a base id.

memiavlStorageConfig drops the explicit storage.state_store.enable=false override and only pins memiavl_only state commitment, leaving the state store at the image default. Comments were updated for nightly-gc label reaping and the corrected storage rationale.

Reviewed by Cursor Bugbot for commit c484971.

…#446)

xreview hardening:
- runChainID uses UnixNano (was Unix): 1-second resolution could alias a same-second manual re-trigger onto a prior run's not-yet-reaped chain and reproduce the height-1 halt. Nanosecond resolution closes that window.
- Correct the runLabelKey comment: the nightly-gc label sweep (sei.io/harness-run, >5h) already ships in platform and reaps abnormal-exit orphans. The prior "pending platform deliverable" note was stale and implied an unbounded leak.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

#446 wired runChainID into the load/release/chaos suites but missed TestChainUpgrade, which provisions a 4-validator chain on a static SEI_CHAIN_ID ("upg") and is therefore exposed to the same stale-genesis halt on any same-id collision (back-to-back run, or SIGKILL leaving genesis behind). Its own doc already calls SEI_CHAIN_ID "(base)"; this makes the code match. Completes the per-run unique chain id across all four suites.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

bdchatham merged commit 1b966e5 into main Jun 24, 2026
5 checks passed

bdchatham deleted the fix/nightly-unique-chain-id branch June 24, 2026 17:55

bdchatham mentioned this pull request Jun 24, 2026

fix(harness): unique chain id for the major-upgrade suite too #447

Merged

bdchatham merged 2 commits into main from fix/nightly-unique-chain-id
