fix(harness): disable SeiDB state store on nightly nodes, drop dead ss write-mode #445

Merged: bdchatham merged 1 commit into main from fix/nightly-disable-ss-store-on-followers on Jun 24, 2026
@bdchatham

Problem

Nightly load suite hangs: the RPC follower's seid parks all threads on futex_wait_queue right after Found 0 WALs, opens no listeners (/proc/net/tcp has no :8545/:26657/:26656), so the benchmark's EVM-readiness gate never passes and NightlyRunFailed fires.
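
The listener check described above can be reproduced with a small parser. This is a minimal sketch, not the harness's own code: the function names `listeningPorts` and `missingPorts` are hypothetical, and it assumes the standard Linux `/proc/net/tcp` layout, where the second column is the hex `address:port` pair and state `0A` means LISTEN.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// listeningPorts parses the text of /proc/net/tcp and returns the set of
// local ports whose socket state is 0A (TCP_LISTEN).
// Hypothetical helper for illustration; not part of the harness.
func listeningPorts(procNetTCP string) map[int]bool {
	ports := make(map[int]bool)
	sc := bufio.NewScanner(strings.NewReader(procNetTCP))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Data rows look like "0: 00000000:2161 00000000:0000 0A ...".
		// The header row is skipped because its fourth field is "st".
		if len(fields) < 4 || fields[3] != "0A" {
			continue
		}
		idx := strings.LastIndex(fields[1], ":")
		if idx < 0 {
			continue
		}
		port, err := strconv.ParseUint(fields[1][idx+1:], 16, 16)
		if err != nil {
			continue
		}
		ports[int(port)] = true
	}
	return ports
}

// missingPorts returns the wanted ports that are not in the listening set,
// preserving the order of want.
func missingPorts(listening map[int]bool, want []int) []int {
	var missing []int
	for _, p := range want {
		if !listening[p] {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	data, err := os.ReadFile("/proc/net/tcp")
	if err != nil {
		fmt.Println("cannot read /proc/net/tcp:", err)
		return
	}
	// 8545 = EVM RPC, 26657 = Tendermint RPC, 26656 = P2P.
	want := []int{8545, 26657, 26656}
	fmt.Println("missing listeners:", missingPorts(listeningPorts(string(data)), want))
}
```

The sketch reads only the IPv4 table; sockets bound on IPv6 appear in `/proc/net/tcp6` and would need the same parse applied there.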

Root cause (grounded in canonical sei-chain main @ 51eb1fd, the nightly image's commit)

  • storage.state_store.write_mode no longer exists: sei-db/config/ss_config.go's StateStoreConfig has no write-mode field (EVM routing is the evm-split bool), and seid's app.toml template renders no ss-write-mode. The harness's ss-write-mode=memiavl_only was a silently ignored no-op.
  • The real trigger is ss-enable: DefaultStateStoreConfig{Enable: true} — the latest image defaults the SeiDB state store on for full nodes. Enabling it on a fresh-genesis RPC follower deadlocks seid at store-open. Validators are fine because the controller role-defaults them to ss-enable=false (confirmed live: follower ss-enable=true, validator ss-enable=false; everything else identical).

Fix — align the override with the current config schema

In memiavlStorageConfig (applied to validators + followers):

  • drop storage.state_store.write_mode (dead key)
  • add storage.state_store.enable: "false" (string→bool coerces via sei-config's typed registry, same as the existing evm.worker_pool_size: "32" int override)
  • keep storage.state_commit.write_mode: memiavl_only

This matches the healthy validators and costs no FlatKV-migration coverage — that lives in the SC layer (sc-write-mode), not the historical state store. The major-upgrade suite still omits the map.

Validation

  • go vet -tags integration ./test/integration/ passes; gofmt clean.
  • Post-merge: new harness image → bump cronjob → re-trigger load → confirm seid opens :8545, the follower's startup log shows the SS store disabled (not wedged), and seiload emits seiload_run_duration_seconds.

A separate sei-chain robustness bug (a fresh-genesis ss-enable=true full node should boot or fail-fast, not silently deadlock) is filed independently.

🤖 Generated with Claude Code

…s write-mode

The load/release/chaos suites pinned storage.state_store.write_mode=memiavl_only.
On current sei-chain that key no longer exists — StateStoreConfig has no
write-mode field (EVM routing is the evm-split bool), so the override was a
silently-ignored no-op. The real trigger is ss-enable: the latest image defaults
the SeiDB state store ON for full nodes, and enabling it on a fresh-genesis RPC
follower deadlocks seid at store-open (all threads futex_wait, no listeners
bound), so the benchmark's EVM-readiness gate never passes and NightlyRunFailed
fires. Validators are unaffected (controller role-default ss-enable=false).

Aligns the override with the current config schema: drop the dead write-mode key,
set storage.state_store.enable=false to match the validators. State commitment
stays on memiavl. FlatKV-migration coverage is unaffected — that lives in the SC
layer (sc-write-mode), not the historical state store.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 24, 2026

PR Summary

Low Risk
Integration-test provisioning config only; aligns followers with validator ss-enable=false and removes a no-op key.

Overview
Fixes nightly integration hangs where RPC followers never open EVM/Tendermint listeners because SeiDB state store defaults on for full nodes on the current image.

memiavlStorageConfig (load, release, chaos suites) now sets storage.state_store.enable: "false" instead of the obsolete storage.state_store.write_mode key, which sei-chain no longer reads. storage.state_commit.write_mode: memiavl_only is unchanged. Comments document the deadlock on fresh-genesis followers and that FlatKV migration coverage stays on the SC layer.

Major-upgrade suite still omits this map.

Reviewed by Cursor Bugbot for commit 05ccb5f.

bdchatham merged commit 94ccfe6 into main on Jun 24, 2026 (5 checks passed)
bdchatham deleted the fix/nightly-disable-ss-store-on-followers branch on June 24, 2026 16:00
bdchatham added a commit that referenced this pull request Jun 24, 2026
* fix(harness): unique chain id per run; stop reusing stale genesis

Root cause of the nightly hang: the cronjobs pass a static SEI_CHAIN_ID
("bench"/"rel"), so every run reuses the prior run's persisted genesis. The
nodes boot with fresh validator/P2P keys, but the genesis — keyed by chain id —
still names the prior run's validator set, so the live nodes are not the
validators genesis expects. Consensus can never reach 2/3 of the genesis set and
halts at height 1 (validators stuck in RoundStepPropose, voting_power=0, dialing
phantom peer NodeIDs that exist nowhere in the live config). With no blocks, the
EVM RPC never serves and the harness's EVM-readiness gate blocks → NightlyRunFailed.

benchmark/release used SEI_CHAIN_ID raw; chaos suffixed only by scenario, so it
collided across runs too. The harness image is distroless (no shell), so the id
can't be made unique in the CronJob — derive it in-process: runChainID appends a
per-run token to the base, matching the chaos suite's existing base semantics.

Also reverts the storage.state_store.enable=false override from #445 — that was
based on an incorrect state-store-deadlock diagnosis (seid was never wedged; it
ran fine, the chain was halted). The state store returns to its image default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(harness): nanosecond run token + correct stale GC comment (xreview #446)

xreview hardening:
- runChainID uses UnixNano (was Unix): 1-second resolution could alias a
  same-second manual re-trigger onto a prior run's not-yet-reaped chain and
  reproduce the height-1 halt. Nanosecond resolution closes that window.
- Correct the runLabelKey comment: the nightly-gc label sweep (sei.io/harness-run,
  >5h) already ships in platform and reaps abnormal-exit orphans — the prior
  "pending platform deliverable" note was stale and implied an unbounded leak.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>