fix(harness): disable SeiDB state store on nightly nodes, drop dead ss write-mode by bdchatham · Pull Request #445 · sei-protocol/sei-k8s-controller

bdchatham · 2026-06-24T15:56:23Z

Problem

The nightly load suite hangs: the RPC follower's seid parks all threads on futex_wait_queue right after logging Found 0 WALs and opens no listeners (/proc/net/tcp has no :8545/:26657/:26656), so the benchmark's EVM-readiness gate never passes and NightlyRunFailed fires.
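The listener check above can be reproduced mechanically. The following is a minimal sketch, not the harness's code: listeningPorts is a hypothetical helper that parses the text of /proc/net/tcp (read from inside the pod's network namespace) and reports which local ports are in the LISTEN state (hex state 0A). IPv6 listeners appear in /proc/net/tcp6 instead, which this sketch does not read.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// listeningPorts parses the text of /proc/net/tcp and returns the set of
// local ports whose socket state is LISTEN (hex state "0A").
// Hypothetical helper for illustration; not part of the harness.
func listeningPorts(procNetTCP string) map[int]bool {
	ports := make(map[int]bool)
	for _, line := range strings.Split(procNetTCP, "\n") {
		fields := strings.Fields(line)
		// Skip blank or short lines and the header row (which starts with "sl").
		if len(fields) < 4 || fields[0] == "sl" {
			continue
		}
		// fields[1] is local_address as HEXIP:HEXPORT; fields[3] is the state.
		if fields[3] != "0A" {
			continue
		}
		idx := strings.LastIndex(fields[1], ":")
		if idx < 0 {
			continue
		}
		port, err := strconv.ParseUint(fields[1][idx+1:], 16, 16)
		if err != nil {
			continue
		}
		ports[int(port)] = true
	}
	return ports
}

func main() {
	// Illustrative sample: one listener on :26657 (0x6821), nothing on :8545.
	sample := "  sl  local_address rem_address   st tx_queue rx_queue\n" +
		"   0: 00000000:6821 00000000:0000 0A 00000000:00000000\n"
	ports := listeningPorts(sample)
	for _, p := range []int{8545, 26657, 26656} {
		fmt.Printf(":%d listening=%v\n", p, ports[p])
	}
}
```

An empty result for all three ports shows only that nothing is bound. It does not by itself distinguish a wedged process from one that is running but has not yet reached the point of opening its listeners.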

Root cause (grounded in canonical sei-chain main @ `51eb1fd`, the nightly image's commit)

storage.state_store.write_mode no longer exists — sei-db/config/ss_config.go's StateStoreConfig has no write-mode field (EVM routing is the evm-split bool); seid's app.toml template renders no ss-write-mode. The harness's ss-write-mode=memiavl_only was a silently-ignored no-op.
The real trigger is ss-enable: DefaultStateStoreConfig{Enable: true} — the latest image defaults the SeiDB state store on for full nodes. Enabling it on a fresh-genesis RPC follower deadlocks seid at store-open. Validators are fine because the controller role-defaults them to ss-enable=false (confirmed live: follower ss-enable=true, validator ss-enable=false; everything else identical).

Fix — align the override with the current config schema

In memiavlStorageConfig (applied to validators + followers):

drop storage.state_store.write_mode (dead key)
add storage.state_store.enable: "false" (string→bool coerces via sei-config's typed registry, same as the existing evm.worker_pool_size: "32" int override)
keep storage.state_commit.write_mode: memiavl_only

This matches the healthy validators and costs no FlatKV-migration coverage — that lives in the SC layer (sc-write-mode), not the historical state store. The major-upgrade suite still omits the map.
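As a rough illustration of the override after this change, the sketch below models the map as plain string key/value pairs. Only the two keys and their values come from this description; storageOverrides and coerceBool are hypothetical names, and the strconv.ParseBool call is a standard-library analogy for the string→bool coercion, not sei-config's typed registry.

```go
package main

import (
	"fmt"
	"strconv"
)

// storageOverrides is a hypothetical stand-in for the harness's
// memiavlStorageConfig map. Only the two keys below come from the PR text.
func storageOverrides() map[string]string {
	return map[string]string{
		"storage.state_store.enable":      "false",        // replaces the dead write_mode key
		"storage.state_commit.write_mode": "memiavl_only", // unchanged
	}
}

// coerceBool illustrates string-to-bool coercion using the standard library.
// The real coercion happens in sei-config's typed registry, which is not
// shown in this PR; this function is only an analogy.
func coerceBool(overrides map[string]string, key string) (bool, error) {
	raw, ok := overrides[key]
	if !ok {
		return false, fmt.Errorf("override %q not set", key)
	}
	return strconv.ParseBool(raw)
}

func main() {
	o := storageOverrides()
	enabled, err := coerceBool(o, "storage.state_store.enable")
	fmt.Println("state store enabled:", enabled, "err:", err)

	_, hasDeadKey := o["storage.state_store.write_mode"]
	fmt.Println("dead write_mode key present:", hasDeadKey)
}
```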

Validation

go vet -tags integration ./test/integration/ passes; gofmt clean.
Post-merge: new harness image → bump cronjob → re-trigger load → confirm seid opens :8545, the follower's startup log shows the SS store disabled (not wedged), and seiload emits seiload_run_duration_seconds.

A separate sei-chain robustness bug (a fresh-genesis ss-enable=true full node should boot or fail-fast, not silently deadlock) is filed independently.

🤖 Generated with Claude Code

…s write-mode

The load/release/chaos suites pinned storage.state_store.write_mode=memiavl_only. On current sei-chain that key no longer exists: StateStoreConfig has no write-mode field (EVM routing is the evm-split bool), so the override was a silently-ignored no-op.

The real trigger is ss-enable: the latest image defaults the SeiDB state store ON for full nodes, and enabling it on a fresh-genesis RPC follower deadlocks seid at store-open (all threads futex_wait, no listeners bound), so the benchmark's EVM-readiness gate never passes and NightlyRunFailed fires. Validators are unaffected (controller role-default ss-enable=false).

Aligns the override with the current config schema: drop the dead write-mode key, set storage.state_store.enable=false to match the validators. State commitment stays on memiavl. FlatKV-migration coverage is unaffected; that lives in the SC layer (sc-write-mode), not the historical state store.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor · 2026-06-24T15:56:31Z

PR Summary

Low Risk
Integration-test provisioning config only; aligns followers with validator ss-enable=false and removes a no-op key.

Overview
Fixes nightly integration hangs where RPC followers never open EVM/Tendermint listeners because SeiDB state store defaults on for full nodes on the current image.

memiavlStorageConfig (load, release, chaos suites) now sets storage.state_store.enable: "false" instead of the obsolete storage.state_store.write_mode key, which sei-chain no longer reads. storage.state_commit.write_mode: memiavl_only is unchanged. Comments document the deadlock on fresh-genesis followers and that FlatKV migration coverage stays on the SC layer.

Major-upgrade suite still omits this map.

Reviewed by Cursor Bugbot for commit 05ccb5f. Bugbot is set up for automated code reviews on this repo.

* fix(harness): unique chain id per run; stop reusing stale genesis

Root cause of the nightly hang: the cronjobs pass a static SEI_CHAIN_ID ("bench"/"rel"), so every run reuses the prior run's persisted genesis. The nodes boot with fresh validator/P2P keys, but the genesis, keyed by chain id, still names the prior run's validator set, so the live nodes are not the validators genesis expects. Consensus can never reach 2/3 of the genesis set and halts at height 1 (validators stuck in RoundStepPropose, voting_power=0, dialing phantom peer NodeIDs that exist nowhere in the live config). With no blocks, the EVM RPC never serves and the harness's EVM-readiness gate blocks → NightlyRunFailed.

benchmark/release used SEI_CHAIN_ID raw; chaos suffixed only by scenario, so it collided across runs too. The harness image is distroless (no shell), so the id can't be made unique in the CronJob; derive it in-process: runChainID appends a per-run token to the base, matching the chaos suite's existing base semantics.

Also reverts the storage.state_store.enable=false override from #445, which was based on an incorrect state-store-deadlock diagnosis (seid was never wedged; it ran fine, the chain was halted). The state store returns to its image default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(harness): nanosecond run token + correct stale GC comment (xreview #446)

xreview hardening:
- runChainID uses UnixNano (was Unix): 1-second resolution could alias a same-second manual re-trigger onto a prior run's not-yet-reaped chain and reproduce the height-1 halt. Nanosecond resolution closes that window.
- Correct the runLabelKey comment: the nightly-gc label sweep (sei.io/harness-run, >5h) already ships in platform and reaps abnormal-exit orphans; the prior "pending platform deliverable" note was stale and implied an unbounded leak.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

bdchatham mentioned this pull request Jun 24, 2026

fix(harness): route nightly state store to flatkv, not memiavl_only #444

Closed

bdchatham merged commit 94ccfe6 into main Jun 24, 2026
5 checks passed

bdchatham deleted the fix/nightly-disable-ss-store-on-followers branch June 24, 2026 16:00

bdchatham mentioned this pull request Jun 24, 2026

fix(harness): unique chain id per run; stop reusing stale genesis #446

Merged

bdchatham merged 1 commit into main from fix/nightly-disable-ss-store-on-followers