Skip to content

Reproduce and fix hnsw race #386#415

Closed
kriszyp wants to merge 3 commits into
mainfrom
reproduce-HNSW-race-386
Closed

Reproduce and fix hnsw race #386#415
kriszyp wants to merge 3 commits into
mainfrom
reproduce-HNSW-race-386

Conversation

@kriszyp
Copy link
Copy Markdown
Member

@kriszyp kriszyp commented Apr 28, 2026

Reproduction and fix for #386
Depends on HarperFast/rocksdb-js#531
This is marked as ready for review, although it is not ready to merge.

kriszyp added 2 commits April 27, 2026 05:50
…gnaling

This commit enables coordinated retry mode for RocksDB transactions and updates the commit logic to handle retry signals from the database. Instead of catching ERR_BUSY errors, transactions now check for RETRY_NOW_VALUE in the commit result and retry accordingly. Also adds retryOnBusy option to structure save transactions.
],
});
for (let w = 0; w < WORKER_COUNT; w++) {
workers.push(new Worker(__dirname + '/vectorIndex-thread.js'));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing worker file — test will always fail.

vectorIndex-thread.js is referenced here but does not exist in the PR (or anywhere in the repo). new Worker with a nonexistent path causes each worker to emit an error event; the test collects those via the worker.once('error', ...) handler and the final assert.deepEqual(errors, []) will always fail.

The existing analog for this pattern is unitTests/resources/create-thread.js. A vectorIndex-thread.js counterpart needs to be added — it should:

  • Import setupTestDBPath + setMainIsWorker, call both at startup
  • Set up the HNSWConcurrentTest table
  • Listen on parentPort for { type: 'insert', start, count, dims } messages
  • Perform the PUTs and reply with { type: 'done' } (or { type: 'error', ... } on failure)

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 28, 2026

Review: PR #415 — HNSW concurrent PUT / coordinated RocksDB retry

Summary

This PR has two independent parts:

  1. DatabaseTransaction.ts / RecordEncoder.ts — switches RocksDB conflict handling from an error-rejection model (ERR_BUSY thrown) to a coordinated-retry model (RETRY_NOW_VALUE resolved). Both new RocksTransaction(…, { coordinatedRetry: true }) call sites are updated, and RecordEncoder.saveStructures gains { retryOnBusy: true } on its sync transaction.
  2. vectorIndex.test.js — adds a multi-worker HNSW concurrent PUT test to reproduce issue HNSW concurrent PUT race — 500 'Cannot read properties of undefined (reading 0)' on harperfast/harper:5.0.1 #386.

1. Missing vectorIndex-thread.js (blocker)

File: unitTests/resources/vectorIndex.test.js:316

What: The new describe block spawns workers with new Worker(__dirname + '/vectorIndex-thread.js'), but that file is not in this PR or anywhere in the repo.

Why it matters: Each worker emits an error event immediately (module not found). The test catches those errors via worker.once('error', resolve), collects them, and then hits assert.deepEqual(errors, []) — which always fails. The test that's meant to prove issue #386 is fixed is itself structurally broken.

Suggested fix: Add unitTests/resources/vectorIndex-thread.js modeled on the existing create-thread.js. It should call setupTestDBPath() + setMainIsWorker(true) at startup, open the HNSWConcurrentTest table, and on { type: 'insert', start, count, dims } messages, perform the PUTs and reply { type: 'done' } (or { type: 'error', … } on failure).


What I traced (no other blockers found)

  • Coordinated-retry logic (DatabaseTransaction.ts:265-313): The commitResult === RETRY_NOW_VALUE branch correctly mirrors the old ERR_BUSY rejection path — same exponential back-off, same MAX_RETRIES guard, same transaction reuse. outstandingCommit tracking is safe: .finally() clears the pointer before .then() fires, so the retry call sees a clean slot.
  • RecordEncoder.saveStructures: Adding { retryOnBusy: true } to the sync transaction is independent of the async coordinated-retry path and addresses the busy conflict for structure saves. Logic inside the callback is unchanged.
  • const RETRY_NOW_VALUE placement (DatabaseTransaction.ts:12): Sits between two import statements. Imports are hoisted so this is semantically correct, and TypeStrip (erasableSyntaxOnly) handles it fine — no erasability concern.
  • LMDB guard: Both describe blocks correctly skip when HARPER_STORAGE_ENGINE=lmdb; test infrastructure (setupTestDBPath, setMainIsWorker) calls match the existing pattern in validation.test.js and resource-operation.test.js.

@kriszyp kriszyp changed the title Reproduce hnsw race 386 Reproduce and fix hnsw race #386 Apr 28, 2026
@kriszyp kriszyp added the patch label Apr 28, 2026
@kriszyp kriszyp marked this pull request as ready for review April 28, 2026 20:33
@kriszyp kriszyp requested a review from a team as a code owner April 28, 2026 20:33
@dawsontoth
Copy link
Copy Markdown
Contributor

The integration and unit tests seem proper miffed with the changes...

@cb1kenobi
Copy link
Copy Markdown
Member

The integration and unit tests seem proper miffed with the changes...

Yeah, that's because this PR is dependent on changes in rocksdb-js that haven't been merged/published.

@cb1kenobi
Copy link
Copy Markdown
Member

The fix is much simpler than introducing this RETRY_NOW logic: #423.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants