feat(inprocess): in-process N-validator harness#3642

Draft
bdchatham wants to merge 10 commits into main from feat/inprocess-harness
@bdchatham bdchatham commented Jun 24, 2026

Contributor

in-process N-validator harness

Stands up N sei-chain validators in a single Go process, reaching real CometBFT consensus and each serving its own RPC stack (Tendermint RPC + EVM JSON-RPC HTTP/WS), with deterministic teardown. The in-process provisioning foundation for the SDK "local" provider.

Gated behind the inprocess build tag — the heavy sei-tendermint/sei-cosmos bring-up never enters a normal seid build (verified: go build ./cmd/seid is unaffected). The harness-only app.App accessors live in app/app_inprocess.go behind the same tag, so production app.App's public surface does not widen.

What's here

  • inprocess/ package: Start(ctx, Options) (*Network, error), per-node Node handles, WaitReady, idempotent Close.
  • app/app.go: EVM listener Stop() handles + a redirectable serve-error channel; app/app_inprocess.go (build-tagged): the SetEVMServeErr/EVMHTTPServer/EVMWebSocketServer accessors. Production seid behavior is unchanged when no channel is set (still panics on a listener-start failure).
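
A minimal sketch of what an idempotent Close can look like, assuming a sync.Once guard. The network type and its fields are hypothetical stand-ins for illustration, not the harness's actual layout.

```go
package main

import (
	"fmt"
	"sync"
)

// network is a hypothetical stand-in for the harness's *Network.
type network struct {
	closeOnce sync.Once
	closeErr  error
	stops     int // counts real teardown runs, for illustration only
}

// Close tears the network down at most once. Later calls return the
// first call's result without repeating the teardown.
func (n *network) Close() error {
	n.closeOnce.Do(func() {
		n.stops++
		n.closeErr = nil // a real teardown would record its error here
	})
	return n.closeErr
}

func main() {
	n := &network{}
	_ = n.Close()
	_ = n.Close()
	fmt.Println("teardown runs:", n.stops)
}
```

Caching the first error means a deferred Close and an explicit Close in the same test see the same result.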

Served surface

TM RPC + EVM JSON-RPC HTTP/WS. No gRPC: the harness never calls servergrpc.StartGRPCServer, so the cosmos gRPC server stays off (enabling it would advertise a port nothing binds). REST is a parity stub that honestly returns "": the method is part of the SDK handle shape, but the harness does not start a REST server.

The load-bearing recipe (vs testutil/network)

  1. genDoc.Validators = nil — derive the valset from InitChain. testutil/network pins []{self}, which fails consensus replay for N>1.
  2. Full P2P mesh: nodeID@127.0.0.1:p2pPort persistent-peers across all N (wired via the gentx memos in collectGentxs) — without the mesh nodes never gossip and consensus never forms for N>1.
  3. EVM enabled on per-node ports — without it TestAppOpts hard-disables the listeners and no node serves EVM.
  4. Instrumentation.Prometheus = false — metrics off avoids the dup-registry panic from the process-wide registries. Invariant: metrics must stay off until the evmrpc/EVM-keeper metrics are de-globalized — re-enabling Prometheus without that reintroduces the panic.
  5. TM RPC / P2P scoped to loopback. Caveat (accepted): the EVM HTTP/WS listeners bind all interfaces (0.0.0.0) for the harness lifetime; only TM RPC/P2P are loopback-scoped. They run on free ephemeral ports, dialed via 127.0.0.1. Tightening requires a bind-host option in evmrpc (not yet present).
  6. MaxIncomingConnectionAttempts raised for the loopback conn-tracker burst — without the raise the burst trips the per-IP cap and peers are rejected.
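
The address format from item 2 can be sketched as below. This only illustrates the nodeID@127.0.0.1:p2pPort shape and assumes each node lists every node except itself. The peer type and persistentPeers helper are hypothetical; in the harness the wiring happens through the gentx memos in collectGentxs.

```go
package main

import (
	"fmt"
	"strings"
)

// peer holds the two values a mesh address needs. Both fields are
// illustrative; the harness derives them from each node's key and config.
type peer struct {
	NodeID  string
	P2PPort int
}

// persistentPeers returns, for the node at index self, the comma-separated
// nodeID@127.0.0.1:port list of every other node.
func persistentPeers(peers []peer, self int) string {
	addrs := make([]string, 0, len(peers))
	for i, p := range peers {
		if i == self {
			continue
		}
		addrs = append(addrs, fmt.Sprintf("%s@127.0.0.1:%d", p.NodeID, p.P2PPort))
	}
	return strings.Join(addrs, ",")
}

func main() {
	peers := []peer{{"aaa", 26656}, {"bbb", 26666}, {"ccc", 26676}}
	for i := range peers {
		fmt.Printf("node%d: %s\n", i, persistentPeers(peers, i))
	}
}
```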

Productionization beyond the spike

  • Fresh per-run chain-id (no cross-run persisted-genesis collision).
  • Partial-startup cleanup: a failed node K tears down nodes 0..K-1 and the owned temp dir.
  • Per-node EVM serve-error channel + goroutine recover, so one node's listener-start failure reports instead of killing all N. Construct-time bind (NewEVM*Server) is still synchronous fail-fast — it panics and kills all N.
  • Handle methods mirror the SDK sei.NodeHandle/NetworkHandle signatures by name (Name, EVMRPC, TendermintRPC, REST, WaitReady(ctx), Object) so a future thin adapter satisfies the interface structurally — without importing the SDK (its module graph + toolchain skew would break the seid build).

Test

TestInProcessNetwork stands up N=4, asserts each node serves TM RPC + EVM, and round-trips a tx (broadcast on node0, observed on node1's independent RPC). Plus TestStartRejectsZeroValidators and TestFreshChainIDPerRun.

go test -tags inprocess -run TestInProcessNetwork -v -timeout 300s ./inprocess/

All three pass; full suite ~17s.

For reviewers

  • EVM worker pool is a process singleton, not Network-owned. Close deliberately does NOT close it: the sync.Once never re-fires, so a second Start in the same process would inherit a closed pool. De-globalizing it in evmrpc is the proper fix if repeated Start/Close in one process is needed. Today's tests run one network per process.
  • Parked serve goroutines on early Close. If a node is Closed before its EVM start-signal fires, its 2 serve goroutines park (blocked on the start-signal receive) until process exit. Bounded under go test; revisit this if the harness is embedded in a long-lived process.
  • The readiness probes (readiness.go) duplicate the SDK's WaitHeightAdvances/WaitEVMServing stdlib-only, marked for a mechanical swap once the SDK toolchain skew is resolved.
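
To illustrate the first bullet: a sync.Once-guarded singleton is built exactly once per process, so closing it leaves every later caller with the same closed instance. The workerPool type and getPool function below are hypothetical stand-ins, not the evmrpc API.

```go
package main

import (
	"fmt"
	"sync"
)

// workerPool is a hypothetical stand-in for the process-wide EVM pool.
type workerPool struct{ closed bool }

func (p *workerPool) Close() { p.closed = true }

var (
	poolOnce sync.Once
	pool     *workerPool
)

// getPool lazily builds the singleton. sync.Once runs the constructor
// only on the first call, so a closed pool is never rebuilt.
func getPool() *workerPool {
	poolOnce.Do(func() { pool = &workerPool{} })
	return pool
}

func main() {
	// A first network's teardown closes the pool.
	getPool().Close()
	// A second Start in the same process gets the same, closed pool.
	fmt.Println("closed:", getPool().closed)
}
```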

Draft, no reviewers — full Coral review-gate (idiomatic + sei-network + systems) + Bugbot before a human reviewer is added.

🤖 Generated with Claude Code

bdchatham and others added 2 commits June 24, 2026 14:05
RegisterLocalServices constructs the EVM HTTP/WS listeners in detached
goroutines that panic on a bind failure. An in-process host running N apps
in one process needs (a) the listener handles to Stop() at teardown and
(b) a single node's bind failure to be a reportable error, not a
process-wide panic that kills all N.

Add evmHTTPServer/evmWSServer handles (EVMHTTPServer/EVMWebSocketServer
getters), and SetEVMServeErr to redirect Start() failures to a buffered
channel. With no channel set (production seid) behavior is unchanged: the
listener still panics on bind failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stands up N sei-chain validators in one Go process reaching real CometBFT
consensus, each serving its own RPC stack (Tendermint RPC + EVM JSON-RPC
HTTP/WS + gRPC), with deterministic teardown. Gated behind the inprocess
build tag so the heavy bring-up never enters a normal seid build.

The load-bearing recipe (vs testutil/network): empty genesis valset
(derive from InitChain), full P2P mesh, EVM enabled on per-node loopback
ports, metrics off, raised conn-tracker ceiling for the loopback burst.

Productionization: fresh per-run chain-id (no cross-run genesis collision),
partial-startup cleanup, per-node EVM serve-error channel, idempotent
Close. Handle methods mirror the SDK sei.NodeHandle signatures by name so a
future adapter satisfies the interface structurally — without importing the
SDK (its module graph + grpc replace conflict would break the seid build).

Test: TestInProcessNetwork stands up N=4, asserts each node serves its RPC
stack, and round-trips a tx (broadcast on node0, observed on node1).

  go test -tags inprocess -run TestInProcessNetwork -v -timeout 300s ./inprocess/

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Contributor Author

bugbot run

@github-actions

github-actions Bot commented Jun 24, 2026


The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Jun 25, 2026, 1:19 AM |

@cursor

cursor Bot commented Jun 24, 2026


PR Summary

Medium Risk
Large new consensus/RPC bring-up path with fragile genesis/P2P invariants, though isolated behind the inprocess tag; production seid only gains unused EVM server handle fields unless the tag is enabled.

Overview
Adds a new inprocess package (inprocess build tag) that boots 1 or ≥3 validators in one process with real CometBFT consensus, per-node Tendermint RPC, and EVM JSON-RPC HTTP/WS, plus WaitReady / idempotent Close. It replaces testutil/network patterns with explicit genesis (empty valset, gentx-derived P2P mesh), loopback TM/P2P binding, Prometheus off, and injected AppOptions so EVM listeners use distinct ports.

app.App now retains EVM HTTP/WS server handles in RegisterLocalServices; app/app_inprocess.go exposes EVMHTTPServer() / EVMWebSocketServer() only under the same tag so the harness can Stop() listeners at teardown without widening the production API.

The integration YAML runner gains an execer seam: Docker stays the default; WithInProcessNetwork (tagged) runs suite commands on the host via a seid shim (--home, --node, EVM env) against harness node homes. Tests cover N=4 consensus + cross-node tx, validation guards, and an in-process bank_module/send_funds_test.yaml run.

Reviewed by Cursor Bugbot for commit 4e143fc. Bugbot is set up for automated code reviews on this repo. Configure here.

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit a869e15. Configure here.

@codecov

codecov Bot commented Jun 24, 2026


Codecov Report

❌ Patch coverage is 0% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.15%. Comparing base (c528303) to head (6320c19).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| integration_test/runner/runner.go | 0.00% | 9 Missing ⚠️ |
| app/app.go | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3642      +/-   ##
==========================================
- Coverage   59.12%   58.15%   -0.97%     
==========================================
  Files        2259     2176      -83     
  Lines      186489   176898    -9591     
==========================================
- Hits       110255   102871    -7384     
+ Misses      66353    64935    -1418     
+ Partials     9881     9092     -789     

| Flag | Coverage Δ |
| --- | --- |
| sei-chain-pr | 47.23% <0.00%> (?) |
| sei-db | 70.41% <ø> (ø) |
| sei-db-state-db | ? |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| app/app.go | 71.19% <0.00%> (-0.09%) ⬇️ |
| integration_test/runner/runner.go | 0.00% <0.00%> (ø) |

... and 83 files with indirect coverage changes

The harness never starts a cosmos gRPC listener (servergrpc.StartGRPCServer
is only on the seid start path), so enabling GRPC in app.toml and exposing
Node.GRPC() advertised a port nothing binds. Remove the gRPC surface
entirely: harness serves TM RPC + EVM (HTTP/WS) only. REST stays an honest
"" parity stub.

Also: move the harness-only app.App accessors (SetEVMServeErr,
EVMHTTPServer, EVMWebSocketServer) behind //go:build inprocess in
app/app_inprocess.go so production App's public surface stays unchanged;
remove the dead wireMesh path (collectGentxs is authoritative for
persistent-peers); correct serve-error wording to listener-start
(construct-time bind is still fail-fast); state the metrics-off and
0.0.0.0-bind invariants as standing conditions; stripScheme via
strings.CutPrefix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Contributor Author

bugbot run

Comment thread inprocess/handle.go
Explicitly set GRPC.Enable/GRPCWeb.Enable=false so app.toml matches the
"gRPC stays off" comment and can't collide on the fixed default port if the
standard start path is ever wired. Scope doc.go recipe #5's bare "listeners"
to consensus/RPC, and note on EVMRPC/EVMWS that the URL dials loopback while
the listener binds 0.0.0.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Contributor Author

bugbot run

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

1 issue from previous review remains unresolved.

Reviewed by Cursor Bugbot for commit a1ae06c. Configure here.

@bdchatham bdchatham changed the title from "feat(inprocess): in-process N-validator harness (C1)" to "feat(inprocess): in-process N-validator harness" Jun 24, 2026
…unner execer (C2)

Wire the integration_test/runner to drive a real bank query/tx suite against
the C1 inprocess.Network — no docker.

Runner seam: extract execCmd into an `execer` interface. The docker-exec arm
stays the zero-value default (existing yaml_integration runs unaffected). A new
build-tagged in-process arm (runner_inprocess.go, tag `inprocess`) runs each
command on the host against a `seid` it builds once, redirected to a node via a
PATH shim that prepends `--home "$SEID_HOME"` — so opaque sourced helpers that
call bare `seid` land on the right node without rewriting the commands.

Harness bridge: keyring moves into the node home (so host `seid --home` resolves
it), each home gets a client.toml pinning test keyring + chain-id + that node's
loopback RPC, and Options.ExtraKeys genesis-funds non-validator signing keys
(admin on node 0) mirroring the docker localnode topology the suites sign as.

bank_module/send_funds_test.yaml is GREEN in-memory (N=3, the min topology that
leaves block-sync and forms consensus): a real admin->bank-test send plus
historical balance queries at distinct heights, all four verifiers passing.

  go test -tags inprocess -run TestInProcessBankModule ./integration_test/runner/

Out of scope (process/binary boundary): upgrade + statesync suites.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Contributor Author

bugbot run

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

1 issue from previous review remains unresolved.

Reviewed by Cursor Bugbot for commit 3a6be99. Configure here.

…geting + cleanup

The N-floor was documented as a >2/3 voting-power quorum (N<3 stay in
block-sync). That is wrong. The real constraint is CometBFT's block-sync
handoff, verified against sei-tendermint and empirically:

- N=1 produces blocks as solo proposer IF onlyValidatorIsUs fires — which
  needs state.Validators.Size()==1 at the blockSync decision. Recipe #1's
  empty genesis valset leaves size 0 there (decision precedes InitChain), so
  the solo node fell into block-sync and hung at height 1. Fixed by pinning
  the single validator into genesis for N=1.
- N=2 deadlocks: each node has exactly 1 peer and BlockPool.IsCaughtUp
  requires >1. Start now rejects N=2 loudly instead of hanging.
- N>=3 works (>=2 peers each). Bank suite stays at N=3.

Corrected the false call-site comment + Options doc; added a doc.go recipe
entry. Guard test now asserts N=2 rejected.
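
The validator-count rule above can be sketched as a small guard. The function names and error text are hypothetical; only the N=0 and N=2 rejections and the N=1 genesis pinning come from this change.

```go
package main

import (
	"errors"
	"fmt"
)

// errTwoValidators explains the N=2 rejection: each node has exactly one
// peer, and the block-sync handoff needs more than one.
var errTwoValidators = errors.New("N=2 is unsupported: one peer per node cannot leave block-sync")

// validateValidatorCount accepts N=1 and N>=3 and rejects everything else.
func validateValidatorCount(n int) error {
	switch {
	case n < 1:
		return fmt.Errorf("need at least 1 validator, got %d", n)
	case n == 2:
		return errTwoValidators
	default:
		return nil
	}
}

// pinGenesisValidator reports whether the single validator must be written
// into genesis so the solo node does not fall into block-sync.
func pinGenesisValidator(n int) bool { return n == 1 }

func main() {
	for _, n := range []int{0, 1, 2, 3, 4} {
		fmt.Println(n, validateValidatorCount(n), pinGenesisValidator(n))
	}
}
```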

Hardening:
- F2: shim injects --node (client subcommands only; --node is not
  root-persistent, so keys/* would break) so RPC targeting is explicit, not
  client.toml-only. writeClientConfig returns its error (keyring-backend=test
  resolves only from it). Fixed the stale "injects same values defensively"
  comment.
- F5: t.Cleanup removes the temp build dir holding seid.real + shim.
- repoRoot surfaces the Getwd error instead of degrading to ".".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Contributor Author

bugbot run

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

1 issue from previous review remains unresolved.

Reviewed by Cursor Bugbot for commit 8403b1e. Configure here.

waitEVMServing now selects on the node's serveErr channel alongside the
poll tick, so a reported EVM listener-start failure short-circuits with
the real error instead of polling eth_blockNumber until the ctx deadline
and masking it as a generic timeout.

Consumption is non-destructive: the received error is re-sent (non-blocking,
slot just freed) so Node.ServeErr() still observes it after WaitReady
returns. Production seid (nil channel -> panic in app.go) is untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Contributor Author

bugbot run

…leanup

The harness's P2P mesh is derived implicitly: collectGentxs mutates each
node's tmCfg.P2P.PersistentPeers in place. Correct the doc to describe that
mechanism (it never set PersistentPeers itself) and add a post-collectGentxs
guard that fails loudly for N>=2 if the wiring didn't land, turning a fragile
silent dependency into a fast failure.
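
One way such a guard could look, assuming it only checks that every node of an N>=2 network ended up with a non-empty persistent-peers string. The checkMesh name and its slice input are hypothetical; the real guard reads each node's tmCfg.P2P.PersistentPeers.

```go
package main

import (
	"fmt"
	"strings"
)

// checkMesh takes one persistent-peers string per node and fails fast if
// any node of a multi-node network has none. A single node needs no peers.
func checkMesh(persistentPeers []string) error {
	if len(persistentPeers) < 2 {
		return nil
	}
	for i, pp := range persistentPeers {
		if strings.TrimSpace(pp) == "" {
			return fmt.Errorf("node %d: persistent-peers is empty after gentx collection", i)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkMesh([]string{"b@127.0.0.1:2", "a@127.0.0.1:1"}))
	fmt.Println(checkMesh([]string{"b@127.0.0.1:2", ""}))
}
```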

Replace the rot-prone "recipe #N" taxonomy with self-describing named
invariants referenced at point-of-use (empty-valset, gentx-derived peer mesh,
EVM-enable injection, metrics-off constraint, loopback bind scope / 0.0.0.0 EVM
caveat, loopback conn-tracker ceiling, validator-count rule).

Also: nolint:gosec on the seid build exec (consistency with siblings); drop the
F5 step-tag comment; probeInterval var -> const; document that ServeErr() must
be read after WaitReady, not concurrently; add a test asserting metrics stay off.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Contributor Author

bugbot run

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 30459ac. Configure here.

The diversion (route a listener Start() failure to a channel instead of
panicking) only softened a rare EVM port-bind collision. For a test
harness a loud panic on a rare event is fine, and the diversion's
production footprint is not worth it. Production app.go reverts to the
original bare panic(err) serve goroutines; the sole retained production
change is keeping the constructed EVM HTTP/WS handles so the harness can
Stop() them at teardown.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Contributor Author

bugbot run

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 4e143fc. Configure here.

Consolidate the package doc (lead with the validator-count rule, centralize
the N=1 mechanism, distill the invariant prose) and strip work-item
provenance from code comments — "productionizes the spike", "C2 end-to-end
proof", "proven live by", "the point of this demo". Collapse the N-count
re-derivation in the runner test to a cross-reference; the canonical
statement lives in the package doc. No constraint dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>