Process-global Verification Table with coordinated write-lock#526

Draft
kriszyp wants to merge 10 commits into main from feature/verification-table

@kriszyp kriszyp commented Apr 25, 2026

Summary

Adds a process-wide, lock-free Verification Table (VT) to the rocksdb-js native binding. The VT is a fixed-size array of std::atomic<uint64_t> slots addressed by hash(dbPtr, cfId, key). It solves two problems in Harper:

  1. Stale cache detection: per-thread JS caches can silently serve outdated records after another thread commits. verifyVersion / populateVersion give callers a fast O(1) way to confirm a cached record is still fresh without hitting RocksDB.

  2. Blind optimistic retry: IsBusy conflicts previously triggered quadratic-backoff retries with no coordination. coordinatedRetry: true parks the retrying JS promise on the VT slot of the conflicting key; it wakes automatically (via TSFN) the moment the blocker commits, eliminating unnecessary backoff.


Changes by phase

Phase 1: VerificationTable class: lock-free slot array, slotFor, verifyVersion, populateVersion, extractVersionFromValue, POPULATE_VERSION_FLAG / FRESH_VERSION_FLAG fast-path in getSync, RocksDatabase.config({ verificationTableEntries }), JS-side verifyVersion / populateVersion on NativeDatabase.
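
The Phase 1 slot semantics can be pictured with a small single-threaded TypeScript model. This is only a sketch: the real table is a native array of atomics, and the class name, the FNV-1a style hash, and the numeric ids used here are assumptions made for illustration, not the rocksdb-js implementation.

```typescript
// Illustrative model only; names and hash choice are assumptions.
class VerificationTableModel {
  private readonly slots: Float64Array;
  private readonly seed: number;

  constructor(entries: number, seed: number) {
    this.slots = new Float64Array(entries); // every slot starts at 0 (empty)
    this.seed = seed;
  }

  // FNV-1a style hash over (dbId, cfId, key), reduced to a slot index.
  slotFor(dbId: number, cfId: number, key: string): number {
    let h = (2166136261 ^ this.seed) >>> 0;
    const mix = (byte: number): void => {
      h = Math.imul(h ^ (byte & 0xff), 16777619) >>> 0;
    };
    for (const n of [dbId, cfId]) {
      for (let shift = 0; shift < 32; shift += 8) mix(n >>> shift);
    }
    for (let i = 0; i < key.length; i++) {
      const c = key.charCodeAt(i);
      mix(c);
      mix(c >>> 8);
    }
    return h % this.slots.length;
  }

  // Record the version that was just read from the database.
  populateVersion(slot: number, version: number): void {
    this.slots[slot] = version;
  }

  // True only when the slot holds exactly the caller's cached version.
  verifyVersion(slot: number, version: number): boolean {
    return version !== 0 && this.slots[slot] === version;
  }
}

const vt = new VerificationTableModel(8, 12345);
const slot = vt.slotFor(1, 0, "user:42");
vt.populateVersion(slot, 1777000000000);
const fresh = vt.verifyVersion(slot, 1777000000000); // cached copy is current
const stale = vt.verifyVersion(slot, 1776000000000); // older copy: take the slow path
```

A mismatch never proves the cached record is wrong; it only sends the caller back to the normal read path, which is why hash collisions are tolerable.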

Phase 2: LockTracker struct: installed in the VT slot at putSync/removeSync time (not deferred to commit). Per-CF opt-in via verificationTable: true on the DB open options. releaseIntent() CASes the slot back to 0 after commit (success or IsBusy). store.ts now forwards verificationTable through to NativeDatabase.open().
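
As a rough picture of the Phase 2 slot lifecycle, the sketch below models one slot as a small state machine. It is single-threaded and hypothetical: the intent counter stands in for whatever reference counting the native LockTracker does, and the method names only mirror the ones mentioned above.

```typescript
// Hypothetical model of a single VT slot; not the native data layout.
type SlotState =
  | { kind: "empty" }
  | { kind: "version"; version: number }
  | { kind: "locked"; intents: number };

class SlotModel {
  state: SlotState = { kind: "empty" };

  // Called when a key enters a transaction's write buffer (put or remove).
  acquireIntent(): void {
    if (this.state.kind === "locked") {
      this.state = { kind: "locked", intents: this.state.intents + 1 };
    } else {
      this.state = { kind: "locked", intents: 1 }; // drops any cached version
    }
  }

  // Called after commit, whether it succeeded or failed with IsBusy.
  releaseIntent(): void {
    if (this.state.kind !== "locked") return;
    const remaining = this.state.intents - 1;
    this.state =
      remaining === 0 ? { kind: "empty" } : { kind: "locked", intents: remaining };
  }

  // Populating is refused while any write intent is outstanding.
  populateVersion(version: number): boolean {
    if (this.state.kind === "locked") return false;
    this.state = { kind: "version", version };
    return true;
  }

  verifyVersion(version: number): boolean {
    return this.state.kind === "version" && this.state.version === version;
  }
}

const slotModel = new SlotModel();
slotModel.populateVersion(100);
slotModel.acquireIntent(); // a writer touches a key hashing to this slot
const verifiedDuringWrite = slotModel.verifyVersion(100); // readers fall through
const populatedDuringWrite = slotModel.populateVersion(101); // refused
slotModel.releaseIntent();
const kindAfterRelease = slotModel.state.kind; // slot is empty again
```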

Phase 3: coordinatedRetry: true on TransactionOptions. When IsBusy fires, the native complete callback checks savedSlots for an active lock and parks a TSFN Waiter on the tracker. LockTracker::wake() fires all waiters after releaseIntent(), resolving the commit promise with RETRY_NOW_VALUE instead of rejecting. database.ts loops immediately on RETRY_NOW.
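
The park-and-wake behaviour of Phase 3 can be sketched with plain callbacks. The model below is an assumption-laden simplification: it ignores threads, mutexes, and TSFNs, and only shows the ordering rule that a waiter registered after the tracker has already been woken fires immediately instead of parking forever.

```typescript
// Hypothetical single-threaded stand-in for the native LockTracker waiters.
class LockTrackerModel {
  private woken = false;
  private wakeCallbacks: Array<() => void> = [];

  // Park a waiter, or fire it at once if the lock was already released.
  addWakeCallback(cb: () => void): void {
    if (this.woken) {
      cb();
      return;
    }
    this.wakeCallbacks.push(cb);
  }

  // Fire every parked waiter exactly once.
  wake(): void {
    if (this.woken) return;
    this.woken = true;
    const pending = this.wakeCallbacks;
    this.wakeCallbacks = [];
    for (const cb of pending) cb();
  }
}

const events: string[] = [];
const tracker = new LockTrackerModel();

tracker.addWakeCallback(() => events.push("txn A: retry now")); // parked
events.push("blocker commits");
tracker.wake(); // releases txn A
tracker.addWakeCallback(() => events.push("txn B: retry now")); // late: fires immediately
tracker.wake(); // second wake is a no-op
```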

Phase 4: LockTracker.dbPtr field + VerificationTable::cancelForDB(dbPtr): a full-table scan called from DBDescriptor::close() after all TransactionHandle closables have been closed. Defensive safety net ensuring no TSFN waiter can park forever if DB close races a pending commit. New stress-test/vt-stress.stress.test.ts with an 8-slot VT forces hash collisions on every write, covering ABA, concurrent contention, coordinatedRetry under collisions, and the cancelForDB lifecycle.
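
A minimal sketch of the Phase 4 scan, assuming a slot is either a number (0 for empty, otherwise a version) or a tracker object tagged with its owning database. The types and the `woken` flag are illustrative stand-ins for the native CAS and wake() calls.

```typescript
// Hypothetical slot representation for illustration only.
interface TrackerModel {
  dbId: number;
  woken: boolean;
}
type Slot = number | TrackerModel; // number: 0 (empty) or a version

// Scan the whole table and clear every lock owned by the closing database.
function cancelForDB(slots: Slot[], dbId: number): number {
  let cancelled = 0;
  for (let i = 0; i < slots.length; i++) {
    const entry = slots[i];
    if (typeof entry === "object" && entry.dbId === dbId) {
      slots[i] = 0; // stands in for the CAS back to 0
      entry.woken = true; // stands in for wake(): unpark any waiters
      cancelled++;
    }
  }
  return cancelled;
}

const closingDbLock: TrackerModel = { dbId: 1, woken: false };
const otherDbLock: TrackerModel = { dbId: 2, woken: false };
const table: Slot[] = [0, closingDbLock, 1777000000000, otherDbLock];
const cancelledCount = cancelForDB(table, 1);
```

Locks belonging to other databases and plain version entries are left untouched, so the scan is safe to run while other databases remain open.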

Initial Prompt

I would like to implement a cache verification and write lock/tracking mechanism. The goal is to improve read performance and reduce contention-induced transaction retries. In our Harper application, we have used an LRU cache (https://www.npmjs.com/package/weak-lru-cache) to store deserialized objects from the RocksDB database. We would like to be able to access that LRU cache, and when there is a hit, verify that the entry in the cache is still the most recent in the database. We have many workers/threads, and the cache is specific to each one (since each lives in its own JS isolate), so the JS doesn't know if there is a change in the database from another thread. So (after a cache hit in the JS LRU cache) I would like to be able to call get/getSync, providing a version number (the version numbers are recorded in the JS LRU cache), and if that version number is verified by rocksdb-js to be fresh, then the get can return a flag/indicator indicating that the caller's cached JS object is indeed still fresh and can be used. If it is not fresh, then the normal process of retrieving the binary data and returning it is followed.

Also in our Harper application, we use a lock-free transaction mechanism, where we use RocksDB's optimistic transaction, reading from a snapshot and writing to the transaction, also recording a list of writes. If there is contention (another thread wrote to a record in the transaction), the optimistic transaction fails (IsBusy error), and we then replay/retry the transaction, re-executing all the writes into a new RocksDB transaction. However, with highly contentious writes, this could be problematic; naively retrying the transaction could lead to continued conflicts and repeated retries.

Proposed approach

I would like to propose using a fixed array, with hash-keyed access to values that represent the known fresh version or a lock indicator/reference. Using a fixed hashed-key array (a probabilistic Bloomier filter?) should give us a lock-free way to quickly check the array for the freshness of an entry, or assign the latest version number on a cache miss. I believe there should be a single cache array for the whole system, across all databases. It can be allocated at the same time as the block cache that is shared by all databases. Once created, it is fixed and access is fast. We can default to a size of 1MB. I would propose that the indexes are a hash of the database (name or pointer), column family name, and record key. I think this should give good distribution. Of course this is probabilistic, there can certainly be hash collisions, but that should just result in false positives on cache hits, and safely revert to the slow path of retrieval. Each entry should be a 64-bit word/number. When there is a cache miss, the entry can be updated with the latest version number from the retrieved object, if there are no active writes for this entry. The version number is always the first (64-bit) word of the record and can be retrieved directly from the beginning of the record. By ensuring no active writes (no open transactions that have a pending write to a record key that hashes to this index), we should be able to preserve the invariant that there is one single active version for any record whose key hashes to this index/entry (no other versions of a record in any pending transaction). Note that the version numbers are big-endian timestamps, so there is some predictability to their basic format (always positive, usually starts with 66 in this era, which could be useful for distinguishing from a write indicator that could use a flag/bit that won't match a positive version number).
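
To make the version layout concrete, here is a hedged sketch of reading the version from the first 8 bytes of a record. It assumes the version is stored as a big-endian float64 millisecond timestamp, which is consistent with the "starts with 66" remark (a timestamp from this era encodes with a leading byte of 0x42, which is 66 in decimal). The function name `readVersionPrefix` is hypothetical.

```typescript
// Hypothetical helper: reads a big-endian float64 version from a record prefix.
function readVersionPrefix(value: Uint8Array): number | undefined {
  if (value.byteLength < 8) return undefined; // too short to carry a version
  const view = new DataView(value.buffer, value.byteOffset, value.byteLength);
  const version = view.getFloat64(0, false); // false = big-endian
  // Valid versions are positive; a set sign bit is reserved for lock tags.
  return version > 0 ? version : undefined;
}

// Build a sample record: 8-byte version prefix followed by 4 payload bytes.
const record = new Uint8Array(12);
new DataView(record.buffer).setFloat64(0, 1777000000000, false);

const extractedVersion = readVersionPrefix(record);
const leadingByte = record[0];
const tooShort = readVersionPrefix(new Uint8Array(4));
```

Because a positive float64 always has its top bit clear, bit 63 stays available to distinguish a lock indicator from a version.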
When a record is written in a transaction, and there are no other active writes, we need to update the cache entry to an indicator that the record is now being written (perhaps a pointer to a lock tracking structure is recorded within the cache entry), and caching a version number is no longer permitted until all writes that hash to this entry are committed. There can be multiple writes to the same record and multiple writes to records that hash to this entry, which needs to be properly tracked. In addition, I believe this provides a means for being able to more accurately notify of when a transaction can be safely retried. So on the first attempt at a transaction, all writes will try to acquire a "lock" for each record (atomically updating the cache entry to a write status, if there is no existing write status), and we will track these writes for follow-up work. If the transaction commits successfully, then we do not need to retry. If the transaction fails, that's because there was contention. Rather than immediately calling the transaction callback/error handler with an IsBusy error, we should wait until we can acquire locks on all the entries that need to be written, and then call the callback handler once all the locks are in place. We should then be able to retry the transaction safely, replaying all the writes again, with pre-existing locks. We should probably use a signal (in the commit callback) that is separate from an IsBusy error, more explicitly indicating that a retry should now proceed. When a transaction commits, naturally we will need to remove the locks and follow up with the work of finding any transactions that are waiting for their turn to acquire the locks, acquire those locks, and notify any such waiting transactions that they can retry now.
Presumably we want to release any write locks when the transaction is in conflict, and wait until it can acquire all the locks at once (synchronized), to avoid multiple partial sets of locks creating deadlock potential. Feel free to suggest different ways of handling this if there are better approaches.
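
One possible reading of the all-at-once suggestion is sketched below: a transaction either takes every slot it needs or takes none, so two transactions can never each hold half of what the other wants. This is a single-threaded illustration with hypothetical helper names; a native version would need the check and the acquisition to happen under one synchronization point, and it is not claimed to be what the PR implements.

```typescript
// Hypothetical all-or-nothing acquisition over a set of locked slot indexes.
function tryAcquireAll(locked: Set<number>, wanted: number[]): boolean {
  const unique = Array.from(new Set(wanted)); // a txn may hit one slot twice
  if (unique.some((slot) => locked.has(slot))) return false; // take nothing
  for (const slot of unique) locked.add(slot);
  return true;
}

function releaseAll(locked: Set<number>, held: number[]): void {
  for (const slot of held) locked.delete(slot);
}

const lockedSlots = new Set<number>();
const txnA = [1, 2];
const txnB = [2, 3];

const aAcquired = tryAcquireAll(lockedSlots, txnA); // A holds slots 1 and 2
const bFirstTry = tryAcquireAll(lockedSlots, txnB); // conflicts on slot 2
const slot3HeldAfterFailure = lockedSlots.has(3); // B kept no partial set
releaseAll(lockedSlots, txnA); // A commits and releases
const bSecondTry = tryAcquireAll(lockedSlots, txnB); // B can now proceed
```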


Test coverage

  • 12 new unit tests in test/lock-tracker.test.ts (Phases 2 & 3)
  • 4 new stress tests in stress-test/vt-stress.stress.test.ts (Phase 4)
  • All 438 existing tests continue to pass

kriszyp and others added 3 commits April 25, 2026 15:05
Adds a lock-free std::atomic<uint64_t>[] Verification Table (VT) to
the native binding so JS threads can cheaply verify record-cache
freshness without touching RocksDB.

- VerificationTable: fixed-size slot array keyed by
  hash(db_ptr, cf_id, key); slot holds 0 (empty) or a float64
  version bit-pattern (sign bit 0, leaving bit 63 free for Phase 2
  lock tags)
- Database::VerifyVersion / PopulateVersion: new native methods
  exposed as db.verifyVersion(key, version) and
  db.populateVersion(key, version) on the JS NativeDatabase type
- GetSync fast-path: optional 4th arg expectedVersion; returns
  FRESH_VERSION_FLAG sentinel on slot match; POPULATE_VERSION_FLAG
  on flags seeds the slot from the first 8 bytes of the read value
- DBSettings: lazy VT materialization with random seed; config()
  accepts verificationTableEntries (frozen after first materialize)
- TypeScript: verifyVersion/populateVersion on Store and RocksDatabase;
  FRESH_VERSION_FLAG and POPULATE_VERSION_FLAG exported from constants
- 12 new tests in test/verification-table.test.ts; all 431 tests pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n commit

Stamps VT slots as "write in flight" during optimistic transaction commit
so concurrent readers see vtIsLock() and fall through to RocksDB instead
of serving stale cached versions.

Also fixes a pre-existing race: the commit complete callback was
unconditionally resetting state→Pending after IsBusy, but close() may
have already set state→Aborted and nulled txn. Guard the reset to only
apply when state is still Committing, preventing a null-txn Rollback crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pt-in

Moves VT lock installation from commit time (write-batch iteration) to
putSync/removeSync time so the slot is invalidated the moment a key
enters the transaction's write buffer, closing the window where a
cached read could observe a stale version after a write but before
commit.

Adds `verificationTable: true` per-DB open option (NativeDatabaseOptions
→ DBHandle::enableVerificationTable) so only opted-in column families
participate, keeping secondary-index CFs out of the VT.

Fixes a pre-existing race in the async commit complete callback where
`state = Pending` was set unconditionally after IsBusy, overwriting an
Aborted state set by close(), leading to Rollback() on a null txn.
Guards the reset: only overwrite Committing → Pending.
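
The guard can be stated as a tiny pure function. This is a sketch of the rule described in this commit message, with hypothetical names; the native code applies it inside the async commit complete callback.

```typescript
// Hypothetical transaction states, named after the ones in the message above.
type TxnState = "Pending" | "Committing" | "Aborted";

// After IsBusy, only a transaction that is still Committing returns to Pending.
// An Aborted state set by close() must survive, since its txn is already null.
function stateAfterIsBusy(current: TxnState): TxnState {
  return current === "Committing" ? "Pending" : current;
}

const normalRetry = stateAfterIsBusy("Committing"); // becomes "Pending"
const closedDuringCommit = stateAfterIsBusy("Aborted"); // stays "Aborted"
```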

All 435 existing tests pass; 4 new Phase 2 lock-tracker tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions Bot commented Apr 25, 2026

📊 Benchmark Results

get-sync.bench.ts

getSync() > random keys - small key size (100 records)

| Implementation | Rank | Operations/sec | Mean (µs) | Min (µs) | Max (µs) | RME (%) | Samples |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 lmdb | 1 | 23.51K ops/sec | 42.53 | 41.07 | 632.218 | 0.113 | 117,565 |
| 🥈 rocksdb | 2 | 12.53K ops/sec | 79.81 | 77.59 | 22,583.183 | 0.895 | 62,650 |

getSync() > sequential keys - small key size (100 records)

| Implementation | Rank | Operations/sec | Mean (µs) | Min (µs) | Max (µs) | RME (%) | Samples |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 lmdb | 1 | 27.55K ops/sec | 36.30 | 35.29 | 712.302 | 0.104 | 137,730 |
| 🥈 rocksdb | 2 | 12.28K ops/sec | 81.43 | 79.32 | 497.507 | 0.048 | 61,405 |

ranges.bench.ts

getRange() > small range (100 records, 50 range)

| Implementation | Rank | Operations/sec | Mean (µs) | Min (µs) | Max (µs) | RME (%) | Samples |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 lmdb | 1 | 26.23K ops/sec | 38.12 | 36.49 | 751.12 | 0.152 | 131,174 |
| 🥈 rocksdb | 2 | 3.67K ops/sec | 272.195 | 238.636 | 2,517.548 | 0.541 | 18,370 |

realistic-load.bench.ts

Realistic write load with workers > write variable records with transaction log

| Implementation | Rank | Operations/sec | Mean (µs) | Min (µs) | Max (µs) | RME (%) | Samples |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 rocksdb | 1 | 187.54 ops/sec | 5,332.184 | 64.50 | 129,571.255 | 35.21 | 384 |
| 🥈 lmdb | 2 | 26.74 ops/sec | 37,402.043 | 48.03 | 1,186,200.434 | 136.683 | 64.00 |

transaction-log.bench.ts

Transaction log > read 100 iterators while write log with 100 byte records

| Implementation | Rank | Operations/sec | Mean (µs) | Min (µs) | Max (µs) | RME (%) | Samples |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 rocksdb | 1 | 35.16K ops/sec | 28.44 | 13.54 | 14,194.142 | 0.594 | 175,783 |
| 🥈 lmdb | 2 | 445.07 ops/sec | 2,246.849 | 139.106 | 13,711.637 | 1.33 | 2,226 |

Transaction log > read one entry from random position from log with 1000 100 byte records

| Implementation | Rank | Operations/sec | Mean (µs) | Min (µs) | Max (µs) | RME (%) | Samples |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 rocksdb | 1 | 687.67K ops/sec | 1.45 | 1.26 | 3,879.424 | 0.167 | 3,438,370 |
| 🥈 lmdb | 2 | 456.01K ops/sec | 2.19 | 1.16 | 8,293.407 | 0.518 | 2,280,055 |

worker-put-sync.bench.ts

putSync() > random keys - small key size (100 records, 10 workers)

| Implementation | Rank | Operations/sec | Mean (µs) | Min (µs) | Max (µs) | RME (%) | Samples |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 rocksdb | 1 | 854.22 ops/sec | 1,170.658 | 1,018.065 | 1,853.01 | 0.352 | 1,709 |
| 🥈 lmdb | 2 | 1.15 ops/sec | 872,425.5 | 775,558.689 | 1,013,399.678 | 5.59 | 10.00 |

worker-transaction-log.bench.ts

Transaction log with workers > write log with 100 byte records

| Implementation | Rank | Operations/sec | Mean (µs) | Min (µs) | Max (µs) | RME (%) | Samples |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 rocksdb | 1 | 17.97K ops/sec | 55.65 | 30.26 | 542.543 | 0.494 | 35,939 |
| 🥈 lmdb | 2 | 825.91 ops/sec | 1,210.782 | 89.63 | 18,079.573 | 5.80 | 1,652 |

Results from commit 70f4ddd

kriszyp and others added 2 commits April 25, 2026 17:23
…OW signal

When coordinatedRetry: true on TransactionOptions, an IsBusy conflict at
commit time is resolved (not rejected) with RETRY_NOW_VALUE instead of
propagating a TransactionIsBusyError. The native layer checks whether any
VT slot our transaction locked is now held by a concurrent transaction; if
so it parks a TSFN-based Waiter on that tracker and fires resolve(RETRY_NOW)
only after the conflicting write-intent releases, eliminating blind backoff.

Key mechanics:
- LockTracker gains woken flag + mutex-protected wakeCallbacks vector with
  addWakeCallback() / wake() methods
- releaseIntent() calls t->wake() after CAS-zeroing the slot, notifying
  any parked waiters before decrementing refcount
- Execute lambda saves lockedVTSlots to CommitState.savedSlots before
  releaseIntent() clears them on IsBusy
- Complete callback's IsBusy+coordinatedRetry path: iterates savedSlots,
  finds active vtIsLock trackers, creates a one-shot TSFN (retryNowCallJs /
  retryNowFinalize) and registers it as a wake callback; if already woken
  (tracker released between execute and complete), fires TSFN immediately
- If no active lock found, resolve(RETRY_NOW) is called directly on the JS
  thread from the complete callback (no TSFN overhead)
- database.ts transaction() loop: on RETRY_NOW return value, continues
  immediately without backoff
- RETRY_NOW constant exported from transaction.ts for Harper integration

All 438 tests pass; 3 new Phase 3 correctness tests added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds uintptr_t dbPtr to LockTracker so cancelForDB can identify which
slots belong to a closing DB. DBDescriptor::close() calls cancelForDB
after the closables loop as a defensive final pass: if any VT lock
survives the normal TransactionHandle::close() → releaseIntent() → wake()
path, cancelForDB CASes it to 0 and fires wake() to unpark any TSFN
waiters, preventing a waiter from parking forever after DB close.

New stress-test/vt-stress.stress.test.ts exercises all VT paths under
an 8-slot table, which forces every write to a collision bucket, giving
full coverage of the ABA check, concurrent lock contention, coordinatedRetry
under collision, and the cancelForDB lifecycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kriszyp kriszyp changed the title from "Feature/verification table" to "Process-global Verification Table with coordinated write-lock" Apr 26, 2026
kriszyp and others added 5 commits April 25, 2026 19:51
- Transaction::Get now accepts argv[4] as expectedVersion, computing the
  VT slot and forwarding it to TransactionHandle::get() for both the
  block-cache fast path and the async libuv path.

- Database::Get already does the same (updated earlier); this brings
  parity for NativeTransaction callers.

- store.get(): 4th param changed from txnId to StoreGetOptions so
  expectedVersion flows naturally alongside the transaction id.  Passes
  expectedVersion to both the ONLY_IF_IN_MEMORY_CACHE getSync fast-path
  and the async context.get() call; propagates FRESH_VERSION_FLAG
  without clobbering VALUE_BUFFER.end.

- store.getSync(): passes options.expectedVersion as 4th arg and guards
  the FRESH_VERSION_FLAG sentinel from the VALUE_BUFFER assignment path.

- GetOptions gains an expectedVersion field.  getBinary/getBinaryFast
  pass the full options object (including expectedVersion) instead of
  only the txnId.  Return types widened to include number for the FRESH
  sentinel.  Decode paths in get()/getSync() guard against FRESH so the
  sentinel is never passed to the decoder.

- load-binding.ts: NativeDatabase.get and NativeTransaction.get type
  signatures updated to include expectedVersion; resolve callback widens
  to Buffer | number.  FRESH_VERSION_FLAG exported as a standalone value.
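
The sentinel guard described in this commit can be sketched as follows. Everything here is hypothetical scaffolding: `FRESH_SENTINEL` is a placeholder value (the real FRESH_VERSION_FLAG value is not shown in this text), and `fakeNativeGet` stands in for the native get call so the flow can run on its own.

```typescript
// Placeholder sentinel; not the actual FRESH_VERSION_FLAG value.
const FRESH_SENTINEL = -1;

interface CachedEntry<T> {
  version: number;
  value: T;
}
type NativeResult = Uint8Array | number | undefined;

function getWithCache<T>(
  cached: CachedEntry<T> | undefined,
  nativeGet: (expectedVersion?: number) => NativeResult,
  decode: (bytes: Uint8Array) => T,
): T | undefined {
  const result = nativeGet(cached?.version);
  if (typeof result === "number") {
    // A numeric result is a sentinel: never hand it to the decoder.
    return result === FRESH_SENTINEL && cached ? cached.value : undefined;
  }
  return result === undefined ? undefined : decode(result);
}

// Fake native layer: the stored record is at version 7 with payload "new".
const storedVersion = 7;
const storedBytes = new Uint8Array([110, 101, 119]);
const fakeNativeGet = (expected?: number): NativeResult =>
  expected === storedVersion ? FRESH_SENTINEL : storedBytes;

let decodeCalls = 0;
const decodeText = (bytes: Uint8Array): string => {
  decodeCalls++;
  return Array.from(bytes).map((b) => String.fromCharCode(b)).join("");
};

const hit = getWithCache({ version: 7, value: "cached" }, fakeNativeGet, decodeText);
const miss = getWithCache({ version: 6, value: "old" }, fakeNativeGet, decodeText);
```

On the fresh path the decoder is skipped entirely, which is where the read-side saving comes from.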

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add `parseExpectedVersion` and `vtSlotFor` inline helpers to database.h.

`parseExpectedVersion(env, arg, out)` consolidates the repeated typeof →
get_double → memcpy → vtIsLock/zero check that appeared in 6 functions.

`vtSlotFor(dbHandle, vt, key)` consolidates the dbPtr + cfId + slotFor
3-liner that appeared in the same 6 places.

Applied in Database::Get, Database::GetSync, Database::VerifyVersion,
Database::PopulateVersion, Transaction::Get, and Transaction::GetSync.
Removes ~35 lines of duplicated boilerplate with no behavior change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add GetOptions.populateVersion which ORs POPULATE_VERSION_FLAG into the
native getSync/get flags, letting the native layer auto-seed the VT slot
in the same call rather than requiring a separate populateVersion() call.
Also merge expectedVersion into caller options so the transaction
snapshot is preserved on VT cache misses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Handle::get

Thread vtSlot, hasExpectedVersion, expectedVersion, and wantsPopulate
through the TransactionHandle::get signature so that block-cache hits
also run the VT fast-path (FRESH signal) and auto-populate logic,
matching the behaviour of the disk-read async path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>