test: replication receive-side stress regressions (catch-up memory + blob save) by kriszyp · Pull Request #150 · HarperFast/harper-pro

kriszyp · 2026-05-15T00:03:27Z

Summary

Two cluster integration tests targeting the receive-side failure modes that knocked a prod node off the cluster earlier this month. Both are guards against regression of fixes already on main / pending review.

| Test | Guards | Currently in main? |
| --- | --- | --- |
| `receiveBacklogMemory.test.mjs` | PR #147 (`RECEIVE_EVENT_HIGH_WATER_MARK` backpressure) | yes; should pass |
| `blobSaveRejectionContainment.test.mjs` | PR #149 (`outstandingBlobsToFinish` stores catch-handled promise) | no; passes only after #149 merges |

Why

In early May a Harper Pro 5.0.16 node OOM'd repeatedly during peer catch-up. Two distinct receive-side defects compounded:

`onWSMessage` decoded every audit record in a WS message synchronously inside a tight `do { ... } while (...)` loop. A single message with thousands of records ballooned the heap past the 2 GB old-gen ceiling, producing `ERR_WORKER_OUT_OF_MEMORY` roughly every 25 s. Fixed by PR #147 ("fix: bound replication receive memory to stop worker OOM crash loops"), which yields when the consumer queue exceeds `RECEIVE_EVENT_HIGH_WATER_MARK = 100`.
When `saveBlob` rejected (the same node was logging "Blob save failed for X from peer Y" at ~35/s, with ENOENT during catch-up), the raw rejecting promise sat in `outstandingBlobsToFinish`, and the `await Promise.all(...)` inside `end_txn`'s `onCommit` propagated the rejection out as an `uncaughtException`. Fixed by PR #149 ("fix(replication): swallow blob save rejection in outstandingBlobsToFinish"), which stores the catch-handled promise instead.
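The difference between storing the raw promise and the catch-handled one can be shown in a few lines. This is a minimal sketch, not the code from #149: `trackBlobSave`, `onCommit`, and the `log` callback are hypothetical names, and only the pattern (attach `.catch` first, store the result) is taken from the description above.

```javascript
// Hypothetical helper names; only the pattern mirrors the PR description.
function trackBlobSave(outstanding, savePromise, log) {
  // Attach the handler first, then store the *handled* promise.
  // Storing `savePromise` itself would let a rejection escape Promise.all().
  const handled = savePromise.catch((error) => {
    log(`Blob save failed: ${error.message}`);
  });
  outstanding.push(handled);
  return handled;
}

async function onCommit(outstanding) {
  // Every entry has already swallowed its own rejection, so this await
  // settles normally even when individual blob saves failed.
  await Promise.all(outstanding);
  return 'committed';
}
```

With the raw promise in the array, the same `await Promise.all(...)` would reject on the first failed save and take the commit path down with it.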

These tests reproduce both conditions directly so future changes can't quietly reintroduce either.

What each test does

`receiveBacklogMemory.test.mjs`

Brings up a 2-node cluster.
Calls killHarper(B), then bursts 40 transactions × 500 records each on A. Each transaction is a single WS message with 500 audit entries (well past the HWM of 100). Total backlog: 20,000 records.
Restarts B; samples system_information.memory.rss every 500 ms while B catches up.
Polls describe_table.record_count for an unambiguous catch-up signal (no dependency on cluster_status.lastReceivedVersion shape).
Asserts: catch-up completes, no ERR_WORKER_OUT_OF_MEMORY in log, peak RSS < 1.5 GB.
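The sample-while-polling loop described above can be sketched as follows. This is an illustration under stated assumptions, not the helpers in `clusterShared.mjs`: `watchCatchUp`, `getRss`, and `getRecordCount` are hypothetical names standing in for calls to `system_information` and `describe_table`.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: getRss and getRecordCount are injected stand-ins for
// the system_information and describe_table operations.
async function watchCatchUp({ getRss, getRecordCount, expectedCount, intervalMs = 500, timeoutMs = 120_000 }) {
  const deadline = Date.now() + timeoutMs;
  let peakRss = 0;
  while (Date.now() < deadline) {
    // Sample memory on every tick so the peak covers the whole catch-up window.
    peakRss = Math.max(peakRss, await getRss());
    if ((await getRecordCount()) >= expectedCount) {
      return { caughtUp: true, peakRss };
    }
    await sleep(intervalMs);
  }
  return { caughtUp: false, peakRss };
}
```

A test would then assert `caughtUp` and compare `peakRss` against the 1.5 GB bound, with `expectedCount` set to the 40 × 500 = 20,000 records written while B was down.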

`blobSaveRejectionContainment.test.mjs`

Brings up a 2-node cluster with the existing Location blob-bearing fixture deployed to both.
Pre-installs a new fault-injection component on B only (via setupHarperWithFixture) that monkey-patches fs.createWriteStream to emit ENOENT for every 7th call targeting /blobs/. The injector arms only when HARPER_TEST_BLOB_FAIL_INTERVAL is set, so it's inert if the fixture is ever picked up elsewhere.
Drives 400 /Location/{n} requests on A → each creates a blob via sourcedFrom → replicates to B → every 7th blob save on B trips the injector.
Asserts:
1. The injector actually fired (test is non-vacuous).
2. [error] [replication]: Blob save failed for <id> from <peer> appears (the .catch ran).
3. No uncaughtException lines matching `Blob` or `ENOENT.*blobs`; their presence is the regression this test guards against.
4. B still reports A connected in cluster_status.
5. A fresh write on A still propagates to B after the failures (liveness).
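For readers who have not opened the fixture, the injection technique can be pictured roughly like this. It is a sketch, not the contents of `fixture-blob-fail-injector/resources.js`: `installBlobFailInjector`, the `marker` option, and `injectedCount` are hypothetical, and the sketch is written as CommonJS for brevity (an ESM file would obtain the mutable `fs` object through `createRequire`, as noted under "Where to look").

```javascript
const { Writable } = require('node:stream');

// Hypothetical sketch of an every-Nth-call fault injector for blob writes.
function installBlobFailInjector(fs, { interval, marker = '/blobs/' }) {
  const original = fs.createWriteStream;
  let blobCalls = 0;
  let injected = 0;

  fs.createWriteStream = function patchedCreateWriteStream(path, options) {
    const isBlobWrite = String(path).includes(marker);
    if (isBlobWrite && ++blobCalls % interval === 0) {
      injected++;
      const error = Object.assign(new Error(`ENOENT: no such file or directory, open '${path}'`), {
        code: 'ENOENT',
        syscall: 'open',
        path: String(path),
      });
      // Hand back a stream that never touches disk and fails on the next tick,
      // so the caller can attach its 'error' listener first.
      const stream = new Writable({
        write(chunk, encoding, callback) {
          callback();
        },
      });
      process.nextTick(() => stream.destroy(error));
      return stream;
    }
    return original.call(this, path, options);
  };

  return {
    injectedCount: () => injected,
    restore: () => {
      fs.createWriteStream = original;
    },
  };
}

// Arm only when the env var is set, so the component is inert elsewhere.
const interval = Number(process.env.HARPER_TEST_BLOB_FAIL_INTERVAL);
if (interval > 0) {
  installBlobFailInjector(require('node:fs'), { interval });
  console.log('[blob-fail-injector] installed');
}
```

Counting only calls whose path contains the marker keeps unrelated writes (logs, database files) out of the every-Nth cadence, which is what makes assertion 1 a meaningful non-vacuity check.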

Where to look

fixture-blob-fail-injector/resources.js — the monkey-patch uses createRequire to get the CJS fs module (ESM namespace objects are frozen) and replaces createWriteStream. Harper's dist code uses require('node:fs').createWriteStream(...) at call time, so the patch is picked up live. Confirmed via cross-model review.
The BATCH_SIZE = 500 and FAIL_INTERVAL = 7 are tuned to comfortably exceed the relevant thresholds without blowing test time. Could be made more aggressive if CI runners turn out to be faster than expected; the bounds in the assertions stay valid either way.
The 1.5 GB peak-RSS bound is intentionally generous. The bug burst past 2 GB; anything well under that means the HWM-driven pause is taking effect. Tuning down later is easy and orthogonal.
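As background for why a bounded RSS indicates the pause is working, here is a rough sketch of a high-water-mark yield in a receive loop. It is not the code from #147: `receiveMessage`, `decode`, and `waitForDrain` are hypothetical stand-ins, and only the constant's name and value come from this PR.

```javascript
const RECEIVE_EVENT_HIGH_WATER_MARK = 100; // value quoted in the PR description

// Hypothetical receive loop: decode and waitForDrain are injected stand-ins.
async function receiveMessage(records, queue, { decode, waitForDrain }) {
  let peakDepth = 0;
  for (const record of records) {
    queue.push(decode(record));
    peakDepth = Math.max(peakDepth, queue.length);
    if (queue.length > RECEIVE_EVENT_HIGH_WATER_MARK) {
      // Yield to the consumer instead of decoding the rest of the message eagerly.
      await waitForDrain(queue);
    }
  }
  return peakDepth;
}
```

With a consumer that actually drains, the queue depth for a 500-record message stays near the mark rather than reaching 500, so the decoded-but-unconsumed working set stops scaling with message size.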

Risk / known-flaky areas

Local validation hit a pre-existing harness/env issue ("Maximum call stack size exceeded" inside replicationTopology.test.mjs setup) that also affects main. The new tests are structurally identical to replicationLoad.test.mjs and should run cleanly in CI even though local was noisy.
blobSaveRejectionContainment will fail on main today because PR #149 ("fix(replication): swallow blob save rejection in outstandingBlobsToFinish") isn't merged yet. Either land that first, or treat this PR as the trailing half of the same change and land it after.

Testing

npx oxlint --quiet on all new files → 0 errors, only the pre-existing new Array(concurrency) warning in clusterShared.mjs that lint:required already tolerates on main
npx prettier --check → clean
node --check on all new mjs/js → clean
Cross-model review (Gemini): positive; no required changes

Test plan

CI passes for receiveBacklogMemory (it should, since PR #147, "fix: bound replication receive memory to stop worker OOM crash loops", is in main)
Once PR #149 ("fix(replication): swallow blob save rejection in outstandingBlobsToFinish") merges, blobSaveRejectionContainment passes
Spot-check on a CI run that the injector banner appears in B's log (grep '[blob-fail-injector] installed' ...) — assertion 1 should fail loudly if not

🤖 Generated with Claude Code

claude · 2026-05-15T00:06:52Z

Reviewed; no blockers found.

…lob save)

Two cluster integration tests covering the receive-side failure modes that took a prod node off the cluster:

1. receiveBacklogMemory.test.mjs: guards PR #147's RECEIVE_EVENT_HIGH_WATER_MARK fix. Kills receiver B, bursts 40 transactions of 500 records each on A (each transaction = one WS message → 500 audit entries decoded), restarts B, samples memory while it catches up, asserts peak RSS < 1.5 GB and no ERR_WORKER_OUT_OF_MEMORY in the log.
2. blobSaveRejectionContainment.test.mjs: guards PR #149's contract that a rejected saveBlob promise is logged exactly once and never escapes onCommit as uncaughtException. Installs a fault-injection component on B only that monkey-patches fs.createWriteStream to fail every 7th /blobs/ write with ENOENT, drives Location-component blob traffic from A, asserts the "Blob save failed for ..." line appears but uncaughtException lines do not, and that liveness (a fresh write) still propagates after failures.

Adds shared helpers to clusterShared.mjs: readLog, waitForCatchUp, getMemoryInfo, peakMemory. The fault-injection fixture lives at integrationTests/cluster/fixture-blob-fail-injector/ and is opt-in via HARPER_TEST_BLOB_FAIL_INTERVAL env var.

These exercise the same failure surface that affected wtk-ap-west-1 in May: unbounded synchronous decode on receive, and blob save rejections escaping the commit confirmation path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI sets HARPER_INTEGRATION_TEST_LOG_DIR, which causes the integration-testing harness to redirect Harper's `logging.root` to a per-suite directory exposed on `ctx.harper.logDir` rather than `{dataRootDir}/log/hdb.log`. readLog() was only checking the dataRootDir path, so in CI it returned an empty string — making the blob-fail-injector banner assertion fail even when the component had loaded correctly. Check both locations now. The receiveBacklogMemory test was also affected (its no-OOM assertion was reading the wrong file) but happened to pass vacuously. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
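A two-location lookup along the lines this commit describes might look like the sketch below. It is a reconstruction, not the actual `readLog` in `clusterShared.mjs`: the `dataRootDir` property name, the `hdb.log` filename inside `logDir`, and the first-match-wins behaviour are all assumptions.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical reconstruction; property names and the filename inside
// logDir are assumptions, not taken from the harness.
function readLog(harper) {
  const candidates = [];
  // CI case: logging.root redirected to a per-suite directory.
  if (harper.logDir) candidates.push(path.join(harper.logDir, 'hdb.log'));
  // Local case: default location under the data root.
  candidates.push(path.join(harper.dataRootDir, 'log', 'hdb.log'));

  for (const file of candidates) {
    try {
      return fs.readFileSync(file, 'utf8');
    } catch (error) {
      // A missing file just means "try the next location".
      if (error.code !== 'ENOENT') throw error;
    }
  }
  return '';
}
```

Because an empty string makes "no OOM line in the log" trivially true, a test relying on a helper like this may also want to assert that the returned log is non-empty.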

The shared `fixture/` Location source produces 7,500-byte blobs, which fall under Harper's FILE_STORAGE_THRESHOLD (8192 bytes) and are stored inline in the record. With no file write, the receiver's createWriteStream is never called and the fault injector has nothing to intercept — assertion #2 ("Blob save failed line appeared") failed even though the injector loaded correctly (banner present in B's log twice). Add a dedicated `fixture-large-blob-source/` with a `LargeLocation` table whose `sourcedFrom` produces 50 KB streamed blobs — comfortably above the threshold, guaranteed to take the file-backed write path on the receiver. Switch the test to deploy/hit this fixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
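The size arithmetic behind this change is small enough to state as a guard. A minimal sketch, assuming blobs strictly larger than the threshold take the file-backed path (the exact boundary rule is not stated in this thread) and taking 50 KB as 51,200 bytes; `takesFileWritePath` is a hypothetical name.

```javascript
const FILE_STORAGE_THRESHOLD = 8192; // bytes, as quoted in the commit message

// Assumption: blobs at or below the threshold are stored inline in the record.
function takesFileWritePath(byteLength) {
  return byteLength > FILE_STORAGE_THRESHOLD;
}

console.log(takesFileWritePath(7500)); // shared fixture blob: false
console.log(takesFileWritePath(50 * 1024)); // large-blob fixture: true
```

A check like this in the test setup would fail fast if the fixture's blob size ever dropped back under the threshold, rather than surfacing later as a missing "Blob save failed" line.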

GET routing on a `sourcedFrom` table can be subtle: hitting /LargeLocation/{id} on the receiver of a partially-populated record may re-invoke the cache-miss handler instead of returning the locally stored record, and the request can fail in ways that don't show up as Harper log lines. The previous run confirmed assertions 1-4 are airtight (35 Blob save failed log entries, 0 uncaughtException, still connected per cluster_status) — the test was failing only on the liveness GET timing out. Switch to comparing describe_table.record_count before/after the upsert: a direct, unambiguous signal that doesn't depend on REST GET semantics for sourcedFrom tables. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cb1kenobi

LGTM

kriszyp · 2026-05-29T04:42:36Z

Heads-up on the blobSaveRejectionContainment CI failure (the liveness-probe assertion, intermittent across Node v22/v24/v25):

It's not a flaky test — it surfaced a real receiver-side replication bug. When an injected blob save fails mid-stream, saveBlob's pipeline destroys the PassThrough and fires close, but a subsequent chunk then hits the backpressure branch, calls addPauseReason() (pausing the WS), and waits for a drain/close that already fired — stranding the pause and wedging the entire receive loop. The receiver goes silent and the sender reconnects forever, so the liveness write never lands. The v22/v25-fail / v24-pass split is the async-destroy-vs-remaining-chunks race.
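The stranded wait can be illustrated with a small guard. This is not the fix in #243, only one way to avoid waiting on events that have already fired; `waitForDrainOrClose` is a hypothetical helper, and it relies solely on the standard `destroyed` property and `drain`/`close` events of Node streams.

```javascript
// Hypothetical helper: resolves with the reason the wait ended.
function waitForDrainOrClose(stream) {
  // Once a stream is destroyed, no further 'drain' will arrive and 'close'
  // may already have fired, so waiting would strand the caller.
  if (stream.destroyed) return Promise.resolve('already-closed');

  return new Promise((resolve) => {
    const done = (reason) => {
      stream.off('drain', onDrain);
      stream.off('close', onClose);
      resolve(reason);
    };
    const onDrain = () => done('drain');
    const onClose = () => done('close');
    stream.once('drain', onDrain);
    stream.once('close', onClose);
  });
}
```

In the scenario described above, a caller would pause the WS only after confirming the stream is still live, and resume as soon as this promise settles for any of the three reasons.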

Root-cause fix is up in #243 (fix/blob-receive-stall-on-save-failure). With it, this test goes green (36 injected ENOENTs, 0 escaped, liveness restored). I also checked fix/replication-stuck-connection-watchdog against this test — it does not recover the stall within the window, so the watchdog alone isn't sufficient here; #243 is the necessary fix.

Suggested merge order: land #243 first, then this PR's CI goes green and it lands as the regression guard. (Alternatively the test could move into #243 to make that PR self-contained — your call.)

🤖 AI-generated (Claude) on behalf of Kris.

kriszyp · 2026-05-29T04:57:22Z

Superseded by #243. The blob-save root-cause fix and both receive-side stress tests (blobSaveRejectionContainment, receiveBacklogMemory + fixtures + clusterShared helpers) have been moved there — cherry-picked so your original commits/authorship are preserved — so the fix and its regression guards land together. Tracking continues on #243.

🤖 AI-generated (Claude) on behalf of Kris.

kriszyp requested a review from a team as a code owner May 15, 2026 00:03

kriszyp and others added 2 commits May 18, 2026 22:47

kriszyp force-pushed the stress/replication-receive-tests branch from 3337190 to e51f137 Compare May 19, 2026 04:48

kriszyp and others added 2 commits May 18, 2026 22:55

kriszyp mentioned this pull request May 19, 2026

test(stress): long-running soak + worker cascade + orphan + adversity #171

Open

Merge branch 'main' into stress/replication-receive-tests

ed117e2

cb1kenobi approved these changes May 26, 2026

Merge remote-tracking branch 'origin/main' into stress/replication-receive-tests

21c6b56

kriszyp mentioned this pull request May 29, 2026

fix(replication): blob-receive stall on save failure + receive-side stress tests #243

Open

kriszyp closed this May 29, 2026

kriszyp wants to merge 6 commits into main from stress/replication-receive-tests
