
test(e2e): split node-killing HA test into its own composed suite file #24503

Open: spalladino wants to merge 2 commits into merge-train/spartan-v5 from spl/e2e-r2-composed



@spalladino spalladino commented Jul 3, 2026


Round-2 e2e consolidation, PR 9/9 — owns composed/**.

Clock-skew suite stays on the composed HA Postgres cluster

The four Clock Skew and Timezone Safety it() cases in composed/ha/e2e_ha_full.parallel.test.ts exercise PostgresSlashingProtectionDatabase directly against the real dockerized PostgreSQL slashing-protection DB of the running 5-node HA cluster:

  • TZ-independent duty timestamps (absolute-time storage)
  • cleanupOldDuties keeps recent duties when the node clock is 2h ahead
  • cleanupOldDuties deletes old duties by DB time even when the node clock is 1h behind
  • cleanupOwnStuckDuties keeps recent stuck duties when the node clock is 3h ahead

The property under test is that these DB-clock semantics (CURRENT_TIMESTAMP, timestamptz) are immune to node clock/timezone skew — the node and its database must be able to genuinely diverge in clock and timezone environment for the assertion to mean anything. That only holds against a real, separate Postgres process. An earlier revision moved these into an in-process PGlite file; since PGlite runs inside the Node process, the DB and node can never actually diverge, so that variant is dropped. The tests keep riding the composed cluster's docker Postgres via the shared HaFullTestContext (t.mainPool / t.dateProvider), exactly as before; dateProvider simulates the skewed node clock while the DB uses its own clock. The @electric-sql/pglite dependency added for the PGlite variant is removed from end-to-end (other workspaces that use pglite for unit tests are unaffected).
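
To illustrate why the cleanup cutoff has to come from the database clock, here is a minimal in-memory sketch. It is not the real PostgresSlashingProtectionDatabase API: FakeSlashingDb, its methods, and the one-hour retention window are hypothetical names and values chosen for illustration, and a closure stands in for Postgres CURRENT_TIMESTAMP.

```typescript
// Illustrative model only: every name here is hypothetical, not the real
// PostgresSlashingProtectionDatabase API.
const HOUR_MS = 60 * 60 * 1000;

interface Duty {
  id: string;
  startedAtMs: number; // stamped by the DB clock, like DEFAULT CURRENT_TIMESTAMP
}

class FakeSlashingDb {
  private duties: Duty[] = [];
  private readonly dbNow: () => number;

  constructor(dbNow: () => number) {
    this.dbNow = dbNow;
  }

  insertDuty(id: string): void {
    this.duties.push({ id, startedAtMs: this.dbNow() });
  }

  // Skew-safe: the cutoff comes from the DB clock, analogous to
  // DELETE ... WHERE started_at < CURRENT_TIMESTAMP - interval.
  cleanupOldDuties(maxAgeMs: number): number {
    return this.deleteOlderThan(this.dbNow() - maxAgeMs);
  }

  // Skew-prone counter-example: the cutoff comes from the caller's clock.
  cleanupOldDutiesByNodeClock(nodeNowMs: number, maxAgeMs: number): number {
    return this.deleteOlderThan(nodeNowMs - maxAgeMs);
  }

  count(): number {
    return this.duties.length;
  }

  private deleteOlderThan(cutoffMs: number): number {
    const before = this.duties.length;
    this.duties = this.duties.filter((d) => d.startedAtMs >= cutoffMs);
    return before - this.duties.length;
  }
}

function runSkewDemo(): { deletedByDbClock: number; deletedByNodeClock: number } {
  const dbTimeMs = 1_700_000_000_000; // arbitrary fixed DB time
  const nodeTimeMs = dbTimeMs + 2 * HOUR_MS; // node clock is 2h ahead

  const db = new FakeSlashingDb(() => dbTimeMs);
  db.insertDuty("recent-duty");

  // DB-clock cutoff: the duty is 0h old by DB time, so it survives.
  const deletedByDbClock = db.cleanupOldDuties(HOUR_MS);

  // Node-clock cutoff: the duty looks 2h old and is wrongly deleted.
  const deletedByNodeClock = db.cleanupOldDutiesByNodeClock(nodeTimeMs, HOUR_MS);

  return { deletedByDbClock, deletedByNodeClock };
}
```

In this model the two clocks are independent values, which is the condition the composed suite gets from a separate Postgres process. If both closures read the same clock, the two cleanup variants become indistinguishable and the assertion proves nothing.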

Order-dependent HA test → own file

The should distribute work across multiple HA nodes test was annotated must run last because it permanently kills every node. Extracted verbatim into composed/ha/e2e_ha_distribute_work.test.ts, which shares the cluster setup with the remaining suite via a new non-test ha_full_setup.ts module (the ~380-line beforeAll/afterAll/helpers moved verbatim into an HaFullTestContext class). Result:

  • The killer test gets its own cluster; the ordering contract is gone (no hidden reliance on definition order).
  • It has a single it(), so it is a plain .test.ts (not .parallel) and runs as one whole-file CI container.
  • The remaining e2e_ha_full.parallel.test.ts keeps its 3 non-destructive it() cases (block production, governance voting, keystore reload) plus the clock-skew describe, all order-independent.
  • Zero CI runtime change: setup was moved, not rewritten, so behavior is unchanged. No assertions dropped.
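
The shape of the split can be sketched as follows. This is a hypothetical stand-in, not the contents of ha_full_setup.ts: ClusterContextSketch and its methods are invented for illustration, and strings stand in for real nodes. The point is only that each test file constructs its own context, so a test that kills every node cannot affect another file's cluster.

```typescript
// Hypothetical sketch: not the real HaFullTestContext from ha_full_setup.ts.
class ClusterContextSketch {
  private liveNodes: string[] = [];

  // Would be called from a file's beforeAll.
  setup(nodeCount: number): void {
    this.liveNodes = Array.from({ length: nodeCount }, (_, i) => `node-${i}`);
  }

  // Destructive step, analogous to the node-killing test.
  killAllNodes(): void {
    this.liveNodes = [];
  }

  liveNodeCount(): number {
    return this.liveNodes.length;
  }

  // Would be called from a file's afterAll.
  teardown(): void {
    this.liveNodes = [];
  }
}

function runSplitDemo(): { fullSuiteNodes: number; killerSuiteNodes: number } {
  // Each file owns a separate context, and therefore a separate cluster.
  const fullSuite = new ClusterContextSketch();
  const killerSuite = new ClusterContextSketch();
  fullSuite.setup(5);
  killerSuite.setup(5);

  // Killing nodes in one file's cluster leaves the other untouched.
  killerSuite.killAllNodes();

  const result = {
    fullSuiteNodes: fullSuite.liveNodeCount(),
    killerSuiteNodes: killerSuite.liveNodeCount(),
  };
  fullSuite.teardown();
  killerSuite.teardown();
  return result;
}
```

With a single shared context, the same kill step would have to run after every other test, which is the ordering contract the split removes.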

Added a .test_patterns.yml flaky entry for the new file mirroring the existing e2e_ha_full one (owner unchanged).

Stranded composed tests — left in place, still excluded

composed/e2e_persistence.test.ts and composed/integration_proof_verification.test.ts are excluded from every CI list and have run nowhere since April 2025. I moved each into a category dir, ran it, and reverted the move because both are broken on the current branch — root causes outside this diff's editable scope:

| File | Disposition | Why |
| --- | --- | --- |
| integration_proof_verification.test.ts | left in composed/, excluded, documented | Committed fixtures/dumps/epoch_proof_result.json is stale (last regenerated Feb 2026; circuits/VK changed since). bb and the on-chain HonkVerifier both reject the proof (Failed to verify RootRollupArtifact proof!). Needs the fixture regenerated; better relocated to the bb-prover circuit tests. |
| e2e_persistence.test.ts | left in composed/, excluded, documented | beforeAll no longer completes: the single-node sequencer stalls in checkpoint proposal (waitForAttestationsAndEnqueueSubmissionAsync) and the 600s hook times out. The root cause is in shared setup/sequencer, not this file. |

Each file's header comment and the bootstrap.sh exclusion now document the real reason (previously "excluded for unknown reasons"). Decision items for follow-up: regenerate the epoch-proof fixture and relocate integration_proof_verification to bb-prover; fix the single-node checkpoint stall and refile e2e_persistence under single-node/.

Verified locally

  • yarn build (full), yarn format end-to-end, yarn lint end-to-end — all clean.
  • Confirmed both stranded tests fail on this branch (evidence above), which is why they stay excluded.
  • The composed/ha suites need docker (Postgres/Web3Signer) and were not run locally; the e2e_ha_full / e2e_ha_distribute_work split and the clock-skew describe rely on CI. Setup is moved verbatim and each .parallel it() is already isolated per container, so runtime behavior is unchanged.

- Extract the four Clock Skew / Timezone Safety assertions out of the
  5-node docker HA suite (composed/ha/e2e_ha_full.parallel.test.ts) into a
  new in-process multi-node/high-availability/clock_skew.test.ts. They test
  PostgreSQL DB-clock semantics (timestamp storage + CURRENT_TIMESTAMP-based
  cleanup, immune to node clock skew), so they run against PGlite (real
  Postgres in WASM) rather than the mock-gossip stack, whose shared DB is
  driven by the node clock. Requires adding @electric-sql/pglite to end-to-end.
- Split the node-killing "distribute work" HA test into its own
  e2e_ha_distribute_work.parallel.test.ts, sharing the cluster setup via a new
  ha_full_setup.ts module. This removes the "must run last" ordering contract:
  it gets its own cluster, and the remaining e2e_ha_full tests are
  order-independent. No assertions changed in either file.
- Leave the two stranded composed tests in place, still excluded from CI:
  integration_proof_verification's committed epoch-proof fixture is stale (the
  proof no longer verifies) and e2e_persistence's beforeAll no longer completes
  (single-node sequencer stalls in checkpoint proposal). Both root causes are
  outside this diff; documented in the files and bootstrap.sh.
@spalladino spalladino added wip Work in progress and removed wip Work in progress labels Jul 3, 2026
…bute-work .parallel

Address review feedback on the HA suite split:

- Restore the "Clock Skew and Timezone Safety" describe into the composed
  e2e_ha_full.parallel.test.ts suite, where it runs against the cluster's real
  dockerized PostgreSQL slashing-protection DB. Timezone/clock divergence between
  the node and its database can only be reproduced against a genuine, separate
  Postgres process, so the in-process PGlite variant is removed and the
  @electric-sql/pglite dependency dropped from end-to-end.

- Rename e2e_ha_distribute_work.parallel.test.ts to e2e_ha_distribute_work.test.ts:
  it has a single it(), so it needs no per-test .parallel container splitting.
@spalladino spalladino changed the title from "test(e2e): extract HA clock-skew suite and split node-killing HA test" to "test(e2e): split node-killing HA test into its own composed suite file" Jul 3, 2026