fix(harness): disable SeiDB state store on nightly nodes, drop dead ss write-mode #445
…s write-mode
The load/release/chaos suites pinned storage.state_store.write_mode=memiavl_only. On current sei-chain that key no longer exists: StateStoreConfig has no write-mode field (EVM routing is the evm-split bool), so the override was a silently ignored no-op.
The real trigger is ss-enable: the latest image defaults the SeiDB state store ON for full nodes, and enabling it on a fresh-genesis RPC follower deadlocks seid at store-open (all threads futex_wait, no listeners bound), so the benchmark's EVM-readiness gate never passes and NightlyRunFailed fires. Validators are unaffected (controller role-default ss-enable=false).
Aligns the override with the current config schema: drop the dead write-mode key, set storage.state_store.enable=false to match the validators. State commitment stays on memiavl. FlatKV-migration coverage is unaffected; that lives in the SC layer (sc-write-mode), not the historical state store.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR Summary
Low Risk
Overview
Major-upgrade suite still omits this map.
Reviewed by Cursor Bugbot for commit 05ccb5f.
* fix(harness): unique chain id per run; stop reusing stale genesis
Root cause of the nightly hang: the cronjobs pass a static SEI_CHAIN_ID
("bench"/"rel"), so every run reuses the prior run's persisted genesis. The
nodes boot with fresh validator/P2P keys, but the genesis — keyed by chain id —
still names the prior run's validator set, so the live nodes are not the
validators genesis expects. Consensus can never reach 2/3 of the genesis set and
halts at height 1 (validators stuck in RoundStepPropose, voting_power=0, dialing
phantom peer NodeIDs that exist nowhere in the live config). With no blocks, the
EVM RPC never serves and the harness's EVM-readiness gate blocks → NightlyRunFailed.
benchmark/release used SEI_CHAIN_ID raw; chaos suffixed only by scenario, so it
collided across runs too. The harness image is distroless (no shell), so the id
can't be made unique in the CronJob — derive it in-process: runChainID appends a
per-run token to the base, matching the chaos suite's existing base semantics.
Also reverts the storage.state_store.enable=false override from #445 — that was
based on an incorrect state-store-deadlock diagnosis (seid was never wedged; it
ran fine, the chain was halted). The state store returns to its image default.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(harness): nanosecond run token + correct stale GC comment (xreview #446)
xreview hardening:
- runChainID uses UnixNano (was Unix): 1-second resolution could alias a
same-second manual re-trigger onto a prior run's not-yet-reaped chain and
reproduce the height-1 halt. Nanosecond resolution closes that window.
- Correct the runLabelKey comment: the nightly-gc label sweep (sei.io/harness-run,
>5h) already ships in platform and reaps abnormal-exit orphans — the prior
"pending platform deliverable" note was stale and implied an unbounded leak.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Problem
Nightly load suite hangs: the RPC follower's seid parks all threads on futex_wait_queue right after "Found 0 WALs" and opens no listeners (/proc/net/tcp has no :8545/:26657/:26656), so the benchmark's EVM-readiness gate never passes and NightlyRunFailed fires.
Root cause (grounded in canonical sei-chain main @ 51eb1fd, the nightly image's commit)
- storage.state_store.write_mode no longer exists: sei-db/config/ss_config.go's StateStoreConfig has no write-mode field (EVM routing is the evm-split bool), and seid's app.toml template renders no ss-write-mode. The harness's ss-write-mode=memiavl_only was a silently ignored no-op.
- ss-enable: DefaultStateStoreConfig{Enable: true}, so the latest image defaults the SeiDB state store on for full nodes. Enabling it on a fresh-genesis RPC follower deadlocks seid at store-open. Validators are fine because the controller role-defaults them to ss-enable=false (confirmed live: follower ss-enable=true, validator ss-enable=false; everything else identical).
Fix: align the override with the current config schema
In memiavlStorageConfig (applied to validators + followers):
- Drop storage.state_store.write_mode (dead key).
- Set storage.state_store.enable: "false" (string→bool coerces via sei-config's typed registry, same as the existing evm.worker_pool_size: "32" int override).
- Keep storage.state_commit.write_mode: memiavl_only.
This matches the healthy validators and costs no FlatKV-migration coverage; that lives in the SC layer (sc-write-mode), not the historical state store. The major-upgrade suite still omits the map.
Validation
- go vet -tags integration ./test/integration/ passes; gofmt clean.
- … :8545, the follower's startup log shows the SS store disabled (not wedged), and seiload emits seiload_run_duration_seconds.
A separate sei-chain robustness bug (a fresh-genesis ss-enable=true full node should boot or fail-fast, not silently deadlock) is filed independently.
🤖 Generated with Claude Code
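The override map described in the Fix section might look roughly like the sketch below. The function name memiavlStorageConfig and the three keys come from the PR text; the map[string]string shape is an assumption, and coerceBool is a hypothetical stand-in for sei-config's typed registry, whose real API is not shown here.

```go
package main

import (
	"fmt"
	"strconv"
)

// memiavlStorageConfig returns the storage overrides listed in the PR.
// Assumed shape: string keys and string values.
func memiavlStorageConfig() map[string]string {
	return map[string]string{
		// Set to match the validators' role default.
		"storage.state_store.enable": "false",
		// State commitment stays on memiavl.
		"storage.state_commit.write_mode": "memiavl_only",
		// storage.state_store.write_mode is intentionally absent (dead key).
	}
}

// coerceBool is a hypothetical illustration of string-to-bool coercion;
// it is not sei-config's actual registry code.
func coerceBool(overrides map[string]string, key string) (bool, error) {
	raw, ok := overrides[key]
	if !ok {
		return false, fmt.Errorf("override %q not set", key)
	}
	return strconv.ParseBool(raw)
}

func main() {
	cfg := memiavlStorageConfig()
	enabled, err := coerceBool(cfg, "storage.state_store.enable")
	fmt.Println(enabled, err)
	_, hasDeadKey := cfg["storage.state_store.write_mode"]
	fmt.Println(hasDeadKey)
}
```

Keeping values as strings and coercing them at one typed boundary means a stale key such as the removed write-mode field can be rejected there, rather than being silently ignored.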