test(e2e): split node-killing HA test into its own composed suite file #24503
Open
spalladino wants to merge 2 commits into
- Extract the four Clock Skew / Timezone Safety assertions out of the 5-node docker HA suite (composed/ha/e2e_ha_full.parallel.test.ts) into a new in-process multi-node/high-availability/clock_skew.test.ts. They test PostgreSQL DB-clock semantics (timestamp storage + CURRENT_TIMESTAMP-based cleanup, immune to node clock skew), so they run against PGlite (real Postgres in WASM) rather than the mock-gossip stack, whose shared DB is driven by the node clock. Requires adding @electric-sql/pglite to end-to-end.
- Split the node-killing "distribute work" HA test into its own e2e_ha_distribute_work.parallel.test.ts, sharing the cluster setup via a new ha_full_setup.ts module. This removes the "must run last" ordering contract: it gets its own cluster, and the remaining e2e_ha_full tests are order-independent. No assertions changed in either file.
- Leave the two stranded composed tests in place, still excluded from CI: integration_proof_verification's committed epoch-proof fixture is stale (the proof no longer verifies) and e2e_persistence's beforeAll no longer completes (single-node sequencer stalls in checkpoint proposal). Both root causes are outside this diff; documented in the files and bootstrap.sh.
…bute-work .parallel

Address review feedback on the HA suite split:

- Restore the "Clock Skew and Timezone Safety" describe into the composed e2e_ha_full.parallel.test.ts suite, where it runs against the cluster's real dockerized PostgreSQL slashing-protection DB. Timezone/clock divergence between the node and its database can only be reproduced against a genuine, separate Postgres process, so the in-process PGlite variant is removed and the @electric-sql/pglite dependency dropped from end-to-end.
- Rename e2e_ha_distribute_work.parallel.test.ts to e2e_ha_distribute_work.test.ts: it has a single it(), so it needs no per-test .parallel container splitting.
Round-2 e2e consolidation, PR 9/9: owns `composed/**`.

Clock-skew suite stays on the composed HA Postgres cluster

The four `Clock Skew and Timezone Safety` `it`s in `composed/ha/e2e_ha_full.parallel.test.ts` exercise `PostgresSlashingProtectionDatabase` directly against the running 5-node HA cluster's real dockerized PostgreSQL slashing-protection DB:

- `cleanupOldDuties` keeps recent duties when the node clock is 2h ahead
- `cleanupOldDuties` deletes old duties by DB time even when the node clock is 1h behind
- `cleanupOwnStuckDuties` keeps recent stuck duties when the node clock is 3h ahead

The property under test is that these DB-clock semantics (`CURRENT_TIMESTAMP`, `timestamptz`) are immune to node clock/timezone skew: the node and its database must be able to genuinely diverge in clock and timezone environment for the assertion to mean anything. That only holds against a real, separate Postgres process. An earlier revision moved these into an in-process PGlite file; since PGlite runs inside the Node process, the DB and node can never actually diverge, so that variant is dropped. The tests keep riding the composed cluster's docker Postgres via the shared `HaFullTestContext` (`t.mainPool` / `t.dateProvider`), exactly as before; `dateProvider` simulates the skewed node clock while the DB uses its own clock. The `@electric-sql/pglite` dependency added for the PGlite variant is removed from end-to-end (other workspaces that use pglite for unit tests are unaffected).

Order-dependent HA test → own file
The `should distribute work across multiple HA nodes` test was annotated `must run last` because it permanently kills every node. Extracted verbatim into `composed/ha/e2e_ha_distribute_work.test.ts`, which shares the cluster setup with the remaining suite via a new non-test `ha_full_setup.ts` module (the ~380-line `beforeAll`/`afterAll`/helpers moved verbatim into an `HaFullTestContext` class). Result:

- `e2e_ha_distribute_work.test.ts` has a single `it`, so it is a plain `.test.ts` (not `.parallel`) and runs as one whole-file CI container.
- `e2e_ha_full.parallel.test.ts` keeps its 3 non-destructive `it`s (block production, governance voting, keystore reload) plus the clock-skew describe, all order-independent.

Added a `.test_patterns.yml` flaky entry for the new file mirroring the existing `e2e_ha_full` one (owner unchanged).

Stranded composed tests: left in place, still excluded
`composed/e2e_persistence.test.ts` and `composed/integration_proof_verification.test.ts` are excluded from every CI list and have run nowhere since April 2025. I moved each into a category dir, ran it, and reverted the move because both are broken on the current branch, with root causes outside this diff's editable scope:

- `integration_proof_verification.test.ts` (`composed/`, excluded, documented): `fixtures/dumps/epoch_proof_result.json` is stale (last regenerated Feb 2026; circuits/VK changed since). `bb` and the on-chain `HonkVerifier` both reject the proof (`Failed to verify RootRollupArtifact proof!`). Needs the fixture regenerated; better relocated to the bb-prover circuit tests.
- `e2e_persistence.test.ts` (`composed/`, excluded, documented): `beforeAll` no longer completes: the single-node sequencer stalls in checkpoint proposal (`waitForAttestationsAndEnqueueSubmissionAsync`) and the 600s hook times out. The root cause is in shared setup/sequencer, not this file.

Each file's header comment and the `bootstrap.sh` exclusion now document the real reason (previously "excluded for unknown reasons"). Decision items for follow-up: regenerate the epoch-proof fixture and relocate `integration_proof_verification` to bb-prover; fix the single-node checkpoint stall and refile `e2e_persistence` under `single-node/`.

Verified locally
- `yarn build` (full), `yarn format end-to-end`, `yarn lint end-to-end`: all clean.
- The `e2e_ha_full` / `e2e_ha_distribute_work` split and the clock-skew describe rely on CI. Setup is moved verbatim and each `.parallel` `it` is already isolated per container, so runtime behavior is unchanged.
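
Illustrative note on the clock-skew property described earlier: the sketch below is a minimal in-memory model of why cleanup has to be decided by the database clock and not the node clock. It is not the suite's code and does not touch Postgres. `InMemoryDutyStore`, `naiveNodeClockWouldDelete`, the one-hour retention window, and the fixed timestamps are all hypothetical stand-ins; only the idea (cutoff computed from the DB's own `CURRENT_TIMESTAMP`-style clock) comes from the description above.

```typescript
// Hypothetical in-memory model; the real tests use dockerized PostgreSQL.
const HOUR_MS = 3_600_000;

interface Duty {
  id: number;
  createdAtDbMs: number; // stamped by the DB clock, like DEFAULT CURRENT_TIMESTAMP
}

class InMemoryDutyStore {
  private duties: Duty[] = [];
  private nextId = 1;
  private readonly dbNow: () => number;

  // dbNow stands in for CURRENT_TIMESTAMP: the store never sees the node clock.
  constructor(dbNow: () => number) {
    this.dbNow = dbNow;
  }

  // Inserts a duty whose DB timestamp is ageMs in the past.
  insertDuty(ageMs = 0): Duty {
    const duty = { id: this.nextId++, createdAtDbMs: this.dbNow() - ageMs };
    this.duties.push(duty);
    return duty;
  }

  // Deletes duties older than maxAgeMs, measured against the DB clock only.
  cleanupOldDuties(maxAgeMs: number): number {
    const cutoff = this.dbNow() - maxAgeMs;
    const before = this.duties.length;
    this.duties = this.duties.filter(d => d.createdAtDbMs >= cutoff);
    return before - this.duties.length;
  }

  count(): number {
    return this.duties.length;
  }
}

// What a cutoff computed from the (skewed) node clock would have decided.
function naiveNodeClockWouldDelete(duty: Duty, nodeNowMs: number, maxAgeMs: number): boolean {
  return duty.createdAtDbMs < nodeNowMs - maxAgeMs;
}

const dbNowMs = Date.UTC(2025, 0, 1, 12, 0, 0); // fixed DB time for determinism
const store = new InMemoryDutyStore(() => dbNowMs);

// Node clock 2h ahead: a node-clock cutoff would wrongly delete a fresh duty.
const recentDuty = store.insertDuty();
const aheadNaive = naiveNodeClockWouldDelete(recentDuty, dbNowMs + 2 * HOUR_MS, HOUR_MS);
const aheadDeleted = store.cleanupOldDuties(HOUR_MS);

// Node clock 1h behind: a node-clock cutoff would wrongly keep a 90-minute-old duty.
const oldDuty = store.insertDuty(1.5 * HOUR_MS);
const behindNaive = naiveNodeClockWouldDelete(oldDuty, dbNowMs - HOUR_MS, HOUR_MS);
const behindDeleted = store.cleanupOldDuties(HOUR_MS);

console.log({ aheadNaive, aheadDeleted, behindNaive, behindDeleted });
```

In this model the node-clock cutoff gets both cases wrong while the DB-clock cutoff gets both right. An in-process database shares the node's clock source, so the two cutoffs could never disagree there, which is the argument for keeping the real tests on a separate Postgres process.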