Process-global Verification Table with coordinated write-lock #526
Draft
kriszyp wants to merge 10 commits into
Adds a lock-free std::atomic<uint64_t>[] Verification Table (VT) to the native binding so JS threads can cheaply verify record-cache freshness without touching RocksDB.

- VerificationTable: fixed-size slot array keyed by hash(db_ptr, cf_id, key); slot holds 0 (empty) or a float64 version bit-pattern (sign bit 0, leaving bit 63 free for Phase 2 lock tags)
- Database::VerifyVersion / PopulateVersion: new native methods exposed as db.verifyVersion(key, version) and db.populateVersion(key, version) on the JS NativeDatabase type
- GetSync fast-path: optional 4th arg expectedVersion; returns FRESH_VERSION_FLAG sentinel on slot match; POPULATE_VERSION_FLAG on flags seeds the slot from the first 8 bytes of the read value
- DBSettings: lazy VT materialization with random seed; config() accepts verificationTableEntries (frozen after first materialize)
- TypeScript: verifyVersion/populateVersion on Store and RocksDatabase; FRESH_VERSION_FLAG and POPULATE_VERSION_FLAG exported from constants
- 12 new tests in test/verification-table.test.ts; all 431 tests pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
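As a rough illustration of the slot semantics in this commit, here is a single-threaded TypeScript model. It is a sketch only: the native table uses `std::atomic<uint64_t>` slots, which this model replaces with a plain `Float64Array`, and the class name `VerificationTableModel`, the numeric `dbId`/`cfId` parameters, and the FNV-1a hash are all assumptions made for the example (the PR text only says the slot is keyed by a seeded hash of database pointer, column family id, and key).

```typescript
// Single-threaded model of the Verification Table slot semantics.
// Assumption: FNV-1a stands in for the real (unspecified) seeded hash.
class VerificationTableModel {
  private readonly slots: Float64Array; // 0 means "empty"
  private readonly seed: number;

  constructor(entries: number, seed: number) {
    this.slots = new Float64Array(entries);
    this.seed = seed;
  }

  // Maps (dbId, cfId, key) onto a slot index in [0, entries).
  slotFor(dbId: number, cfId: number, key: Uint8Array): number {
    let h = (0x811c9dc5 ^ this.seed) >>> 0;
    const mix = (byte: number): void => {
      h = Math.imul(h ^ (byte & 0xff), 0x01000193) >>> 0;
    };
    const ids = [dbId, cfId];
    ids.forEach((id) => {
      for (let shift = 0; shift < 32; shift += 8) mix(id >>> shift);
    });
    key.forEach((byte) => mix(byte));
    return h % this.slots.length;
  }

  // True only when the slot currently holds exactly this version.
  // An empty slot, or a slot overwritten by a colliding key, reports "not fresh".
  verifyVersion(slot: number, version: number): boolean {
    return version !== 0 && this.slots[slot] === version;
  }

  // Seeds the slot after a real read has returned the current version.
  populateVersion(slot: number, version: number): void {
    this.slots[slot] = version;
  }
}
```

In this model a hash collision simply overwrites the slot, so the displaced key fails its next freshness check and takes the normal read path.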
…n commit

Stamps VT slots as "write in flight" during optimistic transaction commit so concurrent readers see vtIsLock() and fall through to RocksDB instead of serving stale cached versions.

Also fixes a pre-existing race: the commit complete callback was unconditionally resetting state→Pending after IsBusy, but close() may have already set state→Aborted and nulled txn. Guard the reset to only apply when state is still Committing, preventing a null-txn Rollback crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pt-in

Moves VT lock installation from commit time (write-batch iteration) to putSync/removeSync time so the slot is invalidated the moment a key enters the transaction's write buffer, closing the window where a cached read could observe a stale version after a write but before commit.

Adds `verificationTable: true` per-DB open option (NativeDatabaseOptions → DBHandle::enableVerificationTable) so only opted-in column families participate, keeping secondary-index CFs out of the VT.

Fixes a pre-existing race in the async commit complete callback where `state = Pending` was set unconditionally after IsBusy, overwriting an Aborted state set by close(), leading to Rollback() on a null txn. Guards the reset: only overwrite Committing → Pending.

All 435 existing tests pass; 4 new Phase 2 lock-tracker tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
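The write-intent lifecycle described here can be pictured with a small state model: a slot is empty, holds a version, or is locked by in-flight writes. The sketch below is illustrative and single-threaded; `IntentTable` and its per-slot `writers` counter are assumptions for the example, not the native `LockTracker` layout, and the real implementation uses CAS on atomic slots.

```typescript
// Hypothetical model of a slot: empty, holding a fresh version, or write-locked.
type Slot =
  | { kind: "empty" }
  | { kind: "version"; version: number }
  | { kind: "lock"; writers: number };

class IntentTable {
  private readonly slots: Slot[] = [];

  constructor(entries: number) {
    for (let i = 0; i < entries; i++) this.slots.push({ kind: "empty" });
  }

  // Called when a key enters a transaction's write buffer (putSync/removeSync time).
  // Any cached version in the slot is discarded immediately.
  acquireIntent(slot: number): void {
    const current = this.slots[slot];
    if (current !== undefined && current.kind === "lock") {
      current.writers += 1;
    } else {
      this.slots[slot] = { kind: "lock", writers: 1 };
    }
  }

  // Called after commit, whether it succeeded or failed with IsBusy.
  releaseIntent(slot: number): void {
    const current = this.slots[slot];
    if (current === undefined || current.kind !== "lock") return;
    current.writers -= 1;
    if (current.writers === 0) this.slots[slot] = { kind: "empty" };
  }

  // Refused while any write is in flight, so a stale version is never cached.
  populateVersion(slot: number, version: number): boolean {
    const current = this.slots[slot];
    if (current === undefined || current.kind === "lock") return false;
    this.slots[slot] = { kind: "version", version };
    return true;
  }

  // A locked or empty slot never verifies, so readers fall through to RocksDB.
  verifyVersion(slot: number, version: number): boolean {
    const current = this.slots[slot];
    return current !== undefined && current.kind === "version" && current.version === version;
  }
}
```

Installing the lock at write time rather than commit time means a reader on another thread stops trusting its cached version as soon as the write is buffered.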
📊 Benchmark Results

get-sync.bench.ts
- getSync() > random keys - small key size (100 records)
- getSync() > sequential keys - small key size (100 records)

ranges.bench.ts
- getRange() > small range (100 records, 50 range)

realistic-load.bench.ts
- Realistic write load with workers > write variable records with transaction log

transaction-log.bench.ts
- Transaction log > read 100 iterators while write log with 100 byte records
- Transaction log > read one entry from random position from log with 1000 100 byte records

worker-put-sync.bench.ts
- putSync() > random keys - small key size (100 records, 10 workers)

worker-transaction-log.bench.ts
- Transaction log with workers > write log with 100 byte records

Results from commit 70f4ddd
…OW signal

When coordinatedRetry: true on TransactionOptions, an IsBusy conflict at commit time is resolved (not rejected) with RETRY_NOW_VALUE instead of propagating a TransactionIsBusyError. The native layer checks whether any VT slot our transaction locked is now held by a concurrent transaction; if so it parks a TSFN-based Waiter on that tracker and fires resolve(RETRY_NOW) only after the conflicting write-intent releases, eliminating blind backoff.

Key mechanics:
- LockTracker gains woken flag + mutex-protected wakeCallbacks vector with addWakeCallback() / wake() methods
- releaseIntent() calls t->wake() after CAS-zeroing the slot, notifying any parked waiters before decrementing refcount
- Execute lambda saves lockedVTSlots to CommitState.savedSlots before releaseIntent() clears them on IsBusy
- Complete callback's IsBusy+coordinatedRetry path: iterates savedSlots, finds active vtIsLock trackers, creates a one-shot TSFN (retryNowCallJs / retryNowFinalize) and registers it as a wake callback; if already woken (tracker released between execute and complete), fires TSFN immediately
- If no active lock found, resolve(RETRY_NOW) is called directly on the JS thread from the complete callback (no TSFN overhead)
- database.ts transaction() loop: on RETRY_NOW return value, continues immediately without backoff
- RETRY_NOW constant exported from transaction.ts for Harper integration

All 438 tests pass; 3 new Phase 3 correctness tests added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
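The wake-callback mechanics listed above can be modelled without threads. In the following sketch, `LockTrackerModel` and `resolveBusyCommit` are hypothetical names, the string `RETRY_NOW` is a placeholder for the real `RETRY_NOW_VALUE`, and plain synchronous callbacks stand in for the one-shot TSFN; the native version additionally guards the callback list with a mutex.

```typescript
type WakeCallback = () => void;

// Placeholder signal; the real RETRY_NOW_VALUE is defined by the library.
const RETRY_NOW = "RETRY_NOW";

class LockTrackerModel {
  private woken = false;
  private wakeCallbacks: WakeCallback[] = [];

  // Registers a waiter. If the tracker was already released, the waiter fires
  // immediately, covering a release that lands between execute and complete.
  addWakeCallback(cb: WakeCallback): void {
    if (this.woken) {
      cb();
      return;
    }
    this.wakeCallbacks.push(cb);
  }

  // Called once the write intent has been released; fires every parked waiter once.
  wake(): void {
    if (this.woken) return;
    this.woken = true;
    const pending = this.wakeCallbacks;
    this.wakeCallbacks = [];
    pending.forEach((cb) => cb());
  }
}

// Models the IsBusy + coordinatedRetry branch of the commit complete callback.
function resolveBusyCommit(
  conflicting: LockTrackerModel | undefined,
  resolve: (signal: string) => void,
): void {
  if (conflicting === undefined) {
    resolve(RETRY_NOW); // nothing to wait for: retry straight away
    return;
  }
  conflicting.addWakeCallback(() => resolve(RETRY_NOW));
}
```

The point of the `woken` flag is that a waiter can never be registered after the release and then be left parked.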
Adds uintptr_t dbPtr to LockTracker so cancelForDB can identify which slots belong to a closing DB. DBDescriptor::close() calls cancelForDB after the closables loop as a defensive final pass: if any VT lock survives the normal TransactionHandle::close() → releaseIntent() → wake() path, cancelForDB CASes it to 0 and fires wake() to unpark any TSFN waiters, preventing a waiter from parking forever after DB close. New stress-test/vt-stress.stress.test.ts exercises all VT paths under an 8-slot table, which forces every write to a collision bucket, giving full coverage of the ABA check, concurrent lock contention, coordinatedRetry under collision, and the cancelForDB lifecycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Transaction::Get now accepts argv[4] as expectedVersion, computing the VT slot and forwarding it to TransactionHandle::get() for both the block-cache fast path and the async libuv path.
- Database::Get already does the same (updated earlier); this brings parity for NativeTransaction callers.
- store.get(): 4th param changed from txnId to StoreGetOptions so expectedVersion flows naturally alongside the transaction id. Passes expectedVersion to both the ONLY_IF_IN_MEMORY_CACHE getSync fast-path and the async context.get() call; propagates FRESH_VERSION_FLAG without clobbering VALUE_BUFFER.end.
- store.getSync(): passes options.expectedVersion as 4th arg and guards the FRESH_VERSION_FLAG sentinel from the VALUE_BUFFER assignment path.
- GetOptions gains an expectedVersion field. getBinary/getBinaryFast pass the full options object (including expectedVersion) instead of only the txnId. Return types widened to include number for the FRESH sentinel. Decode paths in get()/getSync() guard against FRESH so the sentinel is never passed to the decoder.
- load-binding.ts: NativeDatabase.get and NativeTransaction.get type signatures updated to include expectedVersion; resolve callback widens to Buffer | number. FRESH_VERSION_FLAG exported as a standalone value.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add `parseExpectedVersion` and `vtSlotFor` inline helpers to database.h. `parseExpectedVersion(env, arg, out)` consolidates the repeated typeof → get_double → memcpy → vtIsLock/zero check that appeared in 6 functions. `vtSlotFor(dbHandle, vt, key)` consolidates the dbPtr + cfId + slotFor 3-liner that appeared in the same 6 places. Applied in Database::Get, Database::GetSync, Database::VerifyVersion, Database::PopulateVersion, Transaction::Get, and Transaction::GetSync. Removes ~35 lines of duplicated boilerplate with no behavior change. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add GetOptions.populateVersion which ORs POPULATE_VERSION_FLAG into the native getSync/get flags, letting the native layer auto-seed the VT slot in the same call rather than requiring a separate populateVersion() call. Also merge expectedVersion into caller options so the transaction snapshot is preserved on VT cache misses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
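From the caller's side, `expectedVersion` and `populateVersion` might be combined with a per-thread cache roughly as follows. This is a sketch under stated assumptions: `StoreSketch`, `GetOptionsSketch`, `RecordValue`, and `readThroughCache` are hypothetical stand-ins rather than the rocksdb-js types, a `Map` stands in for the LRU cache, and the numeric value given to `FRESH_VERSION_FLAG` is a placeholder for the constant the library exports.

```typescript
// Placeholder value; the real FRESH_VERSION_FLAG is exported by the library.
const FRESH_VERSION_FLAG = -1;

interface GetOptionsSketch {
  expectedVersion?: number;
  populateVersion?: boolean;
}

// Hypothetical decoded record shape carrying its own version.
interface RecordValue {
  version: number;
  data: string;
}

// Hypothetical store surface: returns a decoded record, the FRESH sentinel, or undefined.
interface StoreSketch {
  getSync(key: string, options?: GetOptionsSketch): RecordValue | number | undefined;
}

function readThroughCache(
  store: StoreSketch,
  cache: Map<string, RecordValue>,
  key: string,
): RecordValue | undefined {
  const cached = cache.get(key);
  const options: GetOptionsSketch =
    cached === undefined
      ? { populateVersion: true }
      : { expectedVersion: cached.version, populateVersion: true };
  const result = store.getSync(key, options);

  // Sentinel: the cached object is still current, so no decode is needed.
  if (result === FRESH_VERSION_FLAG && cached !== undefined) return cached;

  // Missing record (or an unexpected sentinel): drop any stale cache entry.
  if (result === undefined || typeof result === "number") {
    cache.delete(key);
    return undefined;
  }

  // Full read: remember the new record and its version for next time.
  cache.set(key, result);
  return result;
}
```

Passing `populateVersion` on every call lets a cache miss seed the table in the same native call, so the next read of that key can take the fast path.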
…Handle::get Thread vtSlot, hasExpectedVersion, expectedVersion, and wantsPopulate through the TransactionHandle::get signature so that block-cache hits also run the VT fast-path (FRESH signal) and auto-populate logic, matching the behaviour of the disk-read async path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Adds a process-wide, lock-free Verification Table (VT) to the rocksdb-js native binding. The VT is a fixed-size array of `std::atomic<uint64_t>` slots addressed by `hash(dbPtr, cfId, key)`. It solves two problems in Harper:

- Stale cache detection: per-thread JS caches can silently serve outdated records after another thread commits. `verifyVersion`/`populateVersion` give callers a fast O(1) way to confirm a cached record is still fresh without hitting RocksDB.
- Blind optimistic retry: `IsBusy` conflicts previously triggered quadratic-backoff retries with no coordination. `coordinatedRetry: true` parks the retrying JS promise on the VT slot of the conflicting key; it wakes automatically (via TSFN) the moment the blocker commits, eliminating unnecessary backoff.

Changes by phase

- Phase 1: `VerificationTable` class: lock-free slot array, `slotFor`, `verifyVersion`, `populateVersion`, `extractVersionFromValue`, `POPULATE_VERSION_FLAG`/`FRESH_VERSION_FLAG` fast-path in `getSync`, `RocksDatabase.config({ verificationTableEntries })`, JS-side `verifyVersion`/`populateVersion` on `NativeDatabase`.
- Phase 2: `LockTracker` struct: installed in the VT slot at `putSync`/`removeSync` time (not deferred to commit). Per-CF opt-in via `verificationTable: true` on the DB open options. `releaseIntent()` CASes the slot back to 0 after commit (success or IsBusy). `store.ts` now forwards `verificationTable` through to `NativeDatabase.open()`.
- Phase 3: `coordinatedRetry: true` on `TransactionOptions`. When IsBusy fires, the native complete callback checks `savedSlots` for an active lock and parks a TSFN `Waiter` on the tracker. `LockTracker::wake()` fires all waiters after `releaseIntent()`, resolving the commit promise with `RETRY_NOW_VALUE` instead of rejecting. `database.ts` loops immediately on `RETRY_NOW`.
- Phase 4: `LockTracker.dbPtr` field + `VerificationTable::cancelForDB(dbPtr)`: a full-table scan called from `DBDescriptor::close()` after all `TransactionHandle` closables have been closed. Defensive safety net ensuring no TSFN waiter can park forever if DB close races a pending commit. New `stress-test/vt-stress.stress.test.ts` with an 8-slot VT forces hash collisions on every write, covering ABA, concurrent contention, coordinatedRetry under collisions, and the cancelForDB lifecycle.

Initial Prompt
I would like to implement a cache verification and write lock/tracking mechanism. The goal is to improve read performance and reduce contention-induced transaction retries. In our Harper application, we have used an LRU cache (https://www.npmjs.com/package/weak-lru-cache) to store deserialized objects from the RocksDB database. We would like to be able to access that LRU cache, and when there is a hit, verify that the entry in the cache is still the most recent in the database. We have many workers/threads, and the cache is specific to each one (since it lives in a JS isolate), so the JS doesn't know if there is a change in the database from another thread. So (after a cache hit in the JS LRU cache) I would like to be able to call get/getSync, providing a version number (the version numbers are recorded in the JS LRU cache), and if that version number is verified by rocksdb-js to be fresh, then the get can return a flag/indicator indicating that the caller's cached JS object is indeed still fresh and can be used. If it is not fresh, then the normal process of retrieving the binary data and returning it is followed.

Also in our Harper application, we use a lock-free transaction mechanism, where we use RocksDB's optimistic transactions, reading from a snapshot and writing to the transaction, also recording a list of writes. If there is contention (another thread wrote to a record in the transaction), the optimistic transaction fails (IsBusy error), and we then replay/retry the transaction, re-executing all the writes into a new RocksDB transaction. However, with highly contentious writes, this could be problematic; naively retrying the transaction could lead to continued conflicts and repeated retries.
Proposed approach
I would like to propose using a fixed array, with hash-keyed access to values that represent the known fresh version or a lock indicator/reference. Using a fixed hashed-key array (a probabilistic Bloomier filter?) should give us a lock-free way to quickly check the array for the freshness of an entry, or assign the latest version number on a cache miss. I believe there should be a single cache array for the whole system, across all databases. It can be allocated at the same time as the block cache that is shared by all databases. Once created, it is fixed and access is fast. We can default to a size of 1MB. I would propose that the indexes are a hash of the database (name or pointer), column family name, and record key. I think this should give good distribution. Of course this is probabilistic and there can certainly be hash collisions, but a collision should just result in a failed freshness check on what would otherwise be a cache hit, and a safe fallback to the slow path of retrieval. Each entry should be a 64-bit word/number. When there is a cache miss, the entry can be updated with the latest version number from the retrieved object, if there are no active writes for this entry. The version number is always the first (64-bit) word of the record and can be retrieved directly from the beginning of the record. By ensuring there are no active writes (no open transactions that have a pending write to a record key that hashes to this index), we should be able to preserve the invariant that there is one single active version for any record whose key hashes to this index/entry (no other versions of a record in any pending transaction). Note that the version numbers are big-endian timestamps, so there is some predictability to their basic format (always positive, usually starts with 66 in this era, which could be useful for distinguishing from a write indicator that could use a flag/bit that won't match a positive version number).
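The remark about version words can be checked with a few lines of TypeScript. The sketch assumes the version is a millisecond epoch timestamp stored as a big-endian float64 (the PR text describes the slot as a float64 bit-pattern, but the millisecond unit is an assumption here), and the helper names are hypothetical. For such values the first byte is `0x42`, which is 66 in decimal, and the sign bit (bit 63) is always clear, so a word with bit 63 set cannot be mistaken for a version.

```typescript
// Encodes a version as the leading 8 bytes of a record: big-endian float64.
function encodeVersion(version: number): Uint8Array {
  const bytes = new Uint8Array(8);
  new DataView(bytes.buffer).setFloat64(0, version, false);
  return bytes;
}

// Reads the version back from the first 8 bytes of a record.
function extractVersion(record: Uint8Array): number {
  const view = new DataView(record.buffer, record.byteOffset, record.byteLength);
  return view.getFloat64(0, false);
}

// Bit 63 of the 64-bit word is the top bit of the first big-endian byte.
// It is the float64 sign bit, so it is never set for a positive version.
function hasBit63(word: Uint8Array): boolean {
  const view = new DataView(word.buffer, word.byteOffset, word.byteLength);
  return (view.getUint8(0) & 0x80) !== 0;
}
```

For example, `encodeVersion(1700000000000)[0]` is `0x42`, and `hasBit63` is false for that word, which leaves bit 63 available as a write-lock tag.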
When a record is written in a transaction, and there are no other active writes, we need to update the cache entry to an indicator that the record is now being written (perhaps a pointer to a lock tracking structure is recorded within the cache entry), and caching a version number is no longer permitted until all writes that hash to this entry are committed. There can be multiple writes to the same record and multiple writes to records that hash to this entry, which need to be properly tracked. In addition, I believe this provides a means of more accurately notifying when a transaction can be safely retried. So on the first attempt at a transaction, all writes will try to acquire a "lock" for each record (atomically updating the cache entry to a write status, if there is no existing write status), and we will track these writes for follow-up work. If the transaction commits successfully, then we do not need to retry. If the transaction fails, that's because there was contention. Rather than immediately calling the transaction callback/error handler with an IsBusy error, we should wait until we can acquire locks on all the entries that need to be written, and then call the callback handler once all the locks are in place. We should then be able to retry the transaction safely, replaying all the writes again, with pre-existing locks. We should probably use a signal separate from an IsBusy error (in the commit callback), more explicitly indicating that a retry should now proceed. When a transaction commits, naturally we will need to remove the locks and follow up with the work of finding any transactions that are waiting for their turn to acquire the locks, acquire those locks, and notify any such waiting transactions that they can retry now.
Presumably we want to unlock any write locks when the transaction is in conflict, and wait until it can acquire all the locks at once (synchronized), to avoid multiple partial sets of locks creating deadlock potential. Feel free to suggest different ways of handling this if there are better approaches.
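One way to read the "all the locks at once" idea is an all-or-nothing acquisition: either every slot the transaction needs is taken, or none are kept. The sketch below is a single-threaded illustration with a `Set` standing in for the locked slots; `tryAcquireAll` is a hypothetical helper and is not taken from the PR's implementation.

```typescript
// Attempts to lock every wanted slot. On the first conflict it releases the
// slots taken so far and returns the conflicting slot, so the caller can wait
// on that slot without holding a partial set of locks.
// Returns undefined when every slot was acquired.
function tryAcquireAll(held: Set<number>, wanted: number[]): number | undefined {
  const taken: number[] = [];
  for (const slot of wanted) {
    // Two keys of the same transaction may hash to the same slot.
    if (taken.indexOf(slot) !== -1) continue;
    if (held.has(slot)) {
      for (const t of taken) held.delete(t);
      return slot;
    }
    held.add(slot);
    taken.push(slot);
  }
  return undefined;
}
```

Because a blocked transaction holds nothing while it waits, two transactions can never each hold a slot the other is waiting for.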
Test coverage
- `test/lock-tracker.test.ts` (Phases 2 & 3)
- `stress-test/vt-stress.stress.test.ts` (Phase 4)