test: replication receive-side stress regressions (catch-up memory + blob save)#150
test: replication receive-side stress regressions (catch-up memory + blob save)#150kriszyp wants to merge 6 commits into
Conversation
|
Reviewed; no blockers found. |
…lob save) Two cluster integration tests covering the receive-side failure modes that took a prod node off the cluster: 1. receiveBacklogMemory.test.mjs — guards PR #147's RECEIVE_EVENT_HIGH_WATER_MARK fix. Kills receiver B, bursts 40 transactions of 500 records each on A (each transaction = one WS message → 500 audit entries decoded), restarts B, samples memory while it catches up, asserts peak RSS < 1.5 GB and no ERR_WORKER_OUT_OF_MEMORY in the log. 2. blobSaveRejectionContainment.test.mjs — guards PR #149's contract that a rejected saveBlob promise is logged exactly once and never escapes onCommit as uncaughtException. Installs a fault-injection component on B only that monkey-patches fs.createWriteStream to fail every 7th /blobs/ write with ENOENT, drives Location-component blob traffic from A, asserts the "Blob save failed for ..." line appears but uncaughtException lines do not, and that liveness (a fresh write) still propagates after failures. Adds shared helpers to clusterShared.mjs: readLog, waitForCatchUp, getMemoryInfo, peakMemory. The fault-injection fixture lives at integrationTests/cluster/fixture-blob-fail-injector/ and is opt-in via HARPER_TEST_BLOB_FAIL_INTERVAL env var. These exercise the same failure surface that affected wtk-ap-west-1 in May: unbounded synchronous decode on receive, and blob save rejections escaping the commit confirmation path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI sets HARPER_INTEGRATION_TEST_LOG_DIR, which causes the integration-testing
harness to redirect Harper's `logging.root` to a per-suite directory exposed
on `ctx.harper.logDir` rather than `{dataRootDir}/log/hdb.log`.
readLog() was only checking the dataRootDir path, so in CI it returned an
empty string — making the blob-fail-injector banner assertion fail even when
the component had loaded correctly. Check both locations now.
The receiveBacklogMemory test was also affected (its no-OOM assertion was
reading the wrong file) but happened to pass vacuously.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3337190 to
e51f137
Compare
The shared `fixture/` Location source produces 7,500-byte blobs, which fall under Harper's FILE_STORAGE_THRESHOLD (8192 bytes) and are stored inline in the record. With no file write, the receiver's createWriteStream is never called and the fault injector has nothing to intercept — assertion #2 ("Blob save failed line appeared") failed even though the injector loaded correctly (banner present in B's log twice). Add a dedicated `fixture-large-blob-source/` with a `LargeLocation` table whose `sourcedFrom` produces 50 KB streamed blobs — comfortably above the threshold, guaranteed to take the file-backed write path on the receiver. Switch the test to deploy/hit this fixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET routing on a `sourcedFrom` table can be subtle: hitting /LargeLocation/{id}
on the receiver of a partially-populated record may re-invoke the cache-miss
handler instead of returning the locally stored record, and the request can
fail in ways that don't show up as Harper log lines.
The previous run confirmed assertions 1-4 are airtight (35 Blob save failed
log entries, 0 uncaughtException, still connected per cluster_status) — the
test was failing only on the liveness GET timing out. Switch to comparing
describe_table.record_count before/after the upsert: a direct, unambiguous
signal that doesn't depend on REST GET semantics for sourcedFrom tables.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Heads-up on the It's not a flaky test — it surfaced a real receiver-side replication bug. When an injected blob save fails mid-stream, Root-cause fix is up in #243 ( Suggested merge order: land #243 first, then this PR's CI goes green and it lands as the regression guard. (Alternatively the test could move into #243 to make that PR self-contained — your call.) 🤖 AI-generated (Claude) on behalf of Kris. |
|
Superseded by #243. The blob-save root-cause fix and both receive-side stress tests ( 🤖 AI-generated (Claude) on behalf of Kris. |
Summary
Two cluster integration tests targeting the receive-side failure modes that knocked a prod node off the cluster earlier this month. Both are guards against regression of fixes already on
main/ pending review.receiveBacklogMemory.test.mjsRECEIVE_EVENT_HIGH_WATER_MARKbackpressure)blobSaveRejectionContainment.test.mjsoutstandingBlobsToFinishstores catch-handled promise)Why
In early May a Harper Pro 5.0.16 node OOM'd repeatedly during peer catch-up. Two distinct receive-side defects compounded:
onWSMessagedecoded every audit record in a WS message synchronously inside a tightdo { ... } while (...)loop. A single message with thousands of records ballooned heap past the 2 GB old-gen ceiling —ERR_WORKER_OUT_OF_MEMORYevery ~25 s. Fixed by PR fix: bound replication receive memory to stop worker OOM crash loops #147, which yields when the consumer queue exceedsRECEIVE_EVENT_HIGH_WATER_MARK = 100.saveBlobrejected (the same node was producingBlob save failed for X from peer Yat ~35/s, ENOENT during catch-up), the raw rejecting promise sat inoutstandingBlobsToFinishandawait Promise.all(...)insideend_txn'sonCommitpropagated it out asuncaughtException. Fixed by PR fix(replication): swallow blob save rejection in outstandingBlobsToFinish #149, which stores the catch-handled promise instead.These tests reproduce both conditions directly so future changes can't quietly reintroduce either.
What each test does
receiveBacklogMemory.test.mjskillHarper(B), then bursts 40 transactions × 500 records each on A — each transaction is a single WS message with 500 audit entries (well past the HWM of 100). Total backlog: 20 k records.system_information.memory.rssevery 500 ms while B catches up.describe_table.record_countfor an unambiguous catch-up signal (no dependency oncluster_status.lastReceivedVersionshape).ERR_WORKER_OUT_OF_MEMORYin log, peak RSS < 1.5 GB.blobSaveRejectionContainment.test.mjsLocationblob-bearing fixture deployed to both.setupHarperWithFixture) that monkey-patchesfs.createWriteStreamto emitENOENTfor every 7th call targeting/blobs/. The injector arms only whenHARPER_TEST_BLOB_FAIL_INTERVALis set, so it's inert if the fixture is ever picked up elsewhere./Location/{n}requests on A → each creates a blob viasourcedFrom→ replicates to B → every 7th blob save on B trips the injector.[error] [replication]: Blob save failed for <id> from <peer>appears (the.catchran).uncaughtExceptionlines mentioningBlob/ENOENT.*blobs— this is the regression.cluster_status.Where to look
fixture-blob-fail-injector/resources.js— the monkey-patch usescreateRequireto get the CJSfsmodule (ESM namespace objects are frozen) and replacescreateWriteStream. Harper's dist code usesrequire('node:fs').createWriteStream(...)at call time, so the patch is picked up live. Confirmed via cross-model review.BATCH_SIZE = 500andFAIL_INTERVAL = 7are tuned to comfortably exceed the relevant thresholds without blowing test time. Could be made more aggressive if CI runners turn out to be faster than expected; the bounds in the assertions stay valid either way.Risk / known-flaky areas
replicationTopology.test.mjssetup) that also affectsmain. The new tests are structurally identical toreplicationLoad.test.mjsand should run cleanly in CI even though local was noisy.blobSaveRejectionContainmentwill fail onmaintoday because PR fix(replication): swallow blob save rejection in outstandingBlobsToFinish #149 isn't merged yet. Either land that first, or this PR is the trailing half of the same change and lands after.Testing
npx oxlint --quieton all new files → 0 errors, only the pre-existingnew Array(concurrency)warning inclusterShared.mjsthatlint:requiredalready tolerates on mainnpx prettier --check→ cleannode --checkon all new mjs/js → cleanTest plan
receiveBacklogMemory(it should — PR fix: bound replication receive memory to stop worker OOM crash loops #147 is in main)blobSaveRejectionContainmentpassesgrep '[blob-fail-injector] installed' ...) — assertion 1 should fail loudly if not🤖 Generated with Claude Code