Skip to content

Handle partial writev returns in transaction-log writeBatchToFile#573

Open
kriszyp wants to merge 2 commits into
mainfrom
fix/writev-partial-write-corruption
Open

Handle partial writev returns in transaction-log writeBatchToFile#573
kriszyp wants to merge 2 commits into
mainfrom
fix/writev-partial-write-corruption

Conversation

@kriszyp
Copy link
Copy Markdown
Member

@kriszyp kriszyp commented May 19, 2026

Summary

Fix audit-log framing corruption caused by short writev() returns in TransactionLogFile::writeBatchToFile. The previous POSIX loop advanced through iovecs by count rather than by bytes written, so any short return silently dropped the tail of one iovec and skipped the rest of the batch — but still reported the partial byte count to the caller. The on-disk file looked contiguous, a truncated entry's header claimed more data than was actually written, and the next batch appended right after the partial bytes. The reader's length-prefixed advance landed inside the next entry; downstream this surfaced as RangeError: Offset is outside the bounds of the DataView during replication catch-up.

Both backends now retry-until-complete-or-fail:

  • POSIX: extracted writev loop to TransactionLogFile::writevAll in src/binding/writev_all.cpp. Tracks byte progress through a mutable iovec copy, trims iov_base / iov_len on partial writes, retries on EINTR, bails on written == 0.
  • Windows: WriteFile partial returns now retry on the same iovec instead of breaking out of the loop with an incorrect partial count. Any error now returns -1 unconditionally (was: return partial count if any bytes had been written, which the caller treated as success).

With writeBatchToFile now write-everything-or-fail (negative on error, which writeEntriesV1 already turns into a thrown DBException), the half-built partial-resume state on TransactionLogEntryBatch is dead code. Removed currentEntryBytesWritten (only ever read by debug logs) and currentEntryHeaderWritten (never set true anywhere).

Purpose

Where to focus review attention

  1. writev loop correctness (src/binding/writev_all.cpp): handles EINTR (retry), no infinite loop on written == 0 (returns -1), advances by bytes not count, correctly chunks at MAX_IOVS = 1024.
  2. Windows parallel fix (src/binding/transaction_log_file_windows.cpp:352-407): inner while (remaining > 0) retry loop, static_cast<DWORD>(remaining) matches existing pattern, error path returns -1 instead of partial (changed semantic).
  3. Caller safety: writeEntriesV1 pre-increments batch.currentEntryIndex before the write; this is now safe because a partial write surfaces as -1 and throws DBException. The removed currentEntryBytesWritten / currentEntryHeaderWritten fields were never used for resume.
  4. Test coverage of the partial path:
    • Standalone C++ unit test (test/binding/writev_partial_test.cpp, run via scripts/test-writev-partial.mjs) compiles against writev_all.cpp with -DROCKSDB_JS_WRITEV_ALL_STANDALONE so it doesn't pull in N-API/RocksDB. Forces partial writev by sending 1 MiB through a default-sized pipe and asserts byte-exact round-trip; also covers > MAX_IOVS iovecs into a regular file, zero iovcnt, and bad fd.
    • JS integration tests in test/transaction-log.test.ts exercise the high-iovcnt batching path end-to-end (> MAX_IOVS entries in one batch, plus varied-size entries) and assert byte-exact round-trip via query().
    • JS test (test/writev-partial.test.ts) wraps the C++ test as a vitest case so it runs as part of pnpm test.
  5. ODR: writev_all.cpp defines TransactionLogFile::writevAll. Production builds include the full transaction_log_file.h; the standalone test build uses a minimal forward declaration guarded by ROCKSDB_JS_WRITEV_ALL_STANDALONE. The two builds are separate binaries, so the same translation unit doesn't see both definitions.
  6. C++ style: tabs, brace placement, naming all match surrounding code in transaction_log_file_posix.cpp / _windows.cpp.

Out of scope (intentionally not fixed)

  • POSIX uses O_APPEND (src/binding/transaction_log_file_posix.cpp:93). If a partial write happens and writevAll returns -1 (so the in-memory size is not advanced), the next process restart will fstat and pick up the stale partial-bytes-and-then-next-batch as the file size. The in-memory readers don't see this because they use the cached size, but the post-crash recovery path will. This is a pre-existing concern (same on main) and orthogonal to the writev correctness fix; not addressed here.

Test plan

  • node scripts/test-writev-partial.mjs — standalone C++ partial-write test passes
  • pnpm build:binding — clean build
  • pnpm test — full suite passes (440 passed, 1 skipped)
  • pnpm lint / pnpm fmt:check / pnpm type-check — clean
  • CI verification on Linux/macOS/Windows runners

Review notes

  • Cross-model review via Gemini confirmed loop correctness (POSIX + Windows), caller safety with vestigial fields removed, ODR setup, test coverage. No issues raised; one informational note about the orthogonal O_APPEND recovery concern documented above as out-of-scope.

PR generated by Claude (Opus 4.7).

…uption

writeBatchToFile previously advanced through iovecs by COUNT rather than by
BYTES, so a short writev() return (POSIX allows it under EINTR, mid-write
ENOSPC, NFS/FUSE, etc.) silently dropped the tail of one iovec and skipped
the rest of the batch — while still reporting the partial-byte count to the
caller. The resulting on-disk transaction log looked contiguous, but a
truncated entry's header claimed more data than was actually written; the
next batch appended directly after the partial bytes, and the reader's
length-prefixed advance landed inside the next entry. Eventually surfaced as
"RangeError: Offset is outside the bounds of the DataView" during replication
catch-up (see HarperFast/harper#610).

Both backends now retry-until-complete-or-fail:

- POSIX: the writev loop tracks byte progress through a mutable iovec copy,
  trims iov_base/iov_len on partial writes, retries on EINTR, and bails on
  written==0 to avoid an infinite loop. The loop has been extracted into a
  static helper TransactionLogFile::writevAll, broken out into
  src/binding/writev_all.cpp so a standalone unit test can link against it
  without pulling in N-API and RocksDB. The new C++ test forces partial
  writes by sending more bytes than a default pipe buffer holds; the JS
  side also adds round-trip integrity tests that exercise the > MAX_IOVS
  chunking path with varied entry sizes.

- Windows: WriteFile partial returns now retry on the same iovec instead of
  breaking out of the loop with an incorrect partial count. Any write error
  now returns -1 unconditionally (was: return partial count if any bytes
  had been written), so the caller is no longer told "everything wrote"
  when it didn't.

With writeBatchToFile now write-everything-or-fail (negative on error,
which writeEntriesV1 already turns into a thrown DBException), the
TransactionLogEntryBatch's half-built partial-resume state was dead code:
currentEntryHeaderWritten was never set true anywhere, and
currentEntryBytesWritten was only ever read by debug logs. Removed both,
along with the corresponding DEBUG_LOG argument lists.

Fixes #572. Investigation: HarperFast/harper#610.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

📊 Benchmark Results

get-sync.bench.ts

getSync() > random keys - small key size (100 records)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 24.77K ops/sec 40.37 38.87 573.767 0.113 123,857
🥈 rocksdb 2 11.82K ops/sec 84.58 82.51 22,451.288 0.891 59,116

getSync() > sequential keys - small key size (100 records)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 28.96K ops/sec 34.54 33.45 1,992.857 0.125 144,777
🥈 rocksdb 2 12.52K ops/sec 79.90 78.07 2,702.471 0.115 62,578

ranges.bench.ts

getRange() > small range (100 records, 50 range)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 25.70K ops/sec 38.90 35.92 1,929.567 0.298 128,520
🥈 rocksdb 2 17.32K ops/sec 57.73 51.77 2,005.968 0.142 86,614

realistic-load.bench.ts

Realistic write load with workers > write variable records with transaction log

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 491.94 ops/sec 2,032.752 79.89 54,211.357 9.28 984
🥈 lmdb 2 26.48 ops/sec 37,767.694 408.66 1,189,365.747 136.499 64.00

transaction-log.bench.ts

Transaction log > read 100 iterators while write log with 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 36.72K ops/sec 27.23 13.52 13,666.061 0.563 183,601
🥈 lmdb 2 436.19 ops/sec 2,292.597 157.239 22,749.859 1.49 2,181

Transaction log > read one entry from random position from log with 1000 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 757.11K ops/sec 1.32 1.15 4,826.686 0.201 3,785,572
🥈 lmdb 2 448.34K ops/sec 2.23 1.21 8,509.254 0.618 2,241,704

worker-put-sync.bench.ts

putSync() > random keys - small key size (100 records, 10 workers)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 846.02 ops/sec 1,182.011 1,029.316 1,861.028 0.277 1,693
🥈 lmdb 2 1.17 ops/sec 854,899.658 809,058.644 903,111.364 2.72 10.00

worker-transaction-log.bench.ts

Transaction log with workers > write log with 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 21.05K ops/sec 47.50 31.18 454.62 0.484 42,108
🥈 lmdb 2 802.41 ops/sec 1,246.24 83.31 14,247.24 5.80 1,605

Results from commit 962774d

@kriszyp kriszyp marked this pull request as ready for review May 20, 2026 02:14
@kriszyp kriszyp requested a review from a team as a code owner May 20, 2026 02:14
@kriszyp kriszyp requested review from cb1kenobi and removed request for a team May 21, 2026 12:20
/**
* Whether the transaction header for the current entry has been written.
*/
bool currentEntryHeaderWritten = false;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, this wasn't even used!?

* The number of bytes already written from the current entry's data
* (excluding the transaction header).
*/
uint32_t currentEntryBytesWritten = 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was purely informational and ok to remove.

return totalWritten;
// Implementation of writevAll lives in writev_all.cpp so the same loop
// can be linked into a standalone unit-test executable.
return TransactionLogFile::writevAll(this->fd, iovecs, iovcnt);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this writevAll() function. We should keep the logic in this method.

Comment thread src/binding/writev_all.cpp Outdated
// We track byte progress through the iovec array and advance into a partial
// iovec's remainder by adjusting iov_base/iov_len so a short writev does not
// silently drop the tail of an entry.
constexpr int MAX_IOVS = 1024; // IOV_MAX on most systems
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we #include <limits.h> and use IOV_MAX?

Comment thread test/writev-partial.test.ts Outdated
// Skipped on Windows; the Win32 writeBatchToFile uses WriteFile, not writev.
describe.skipIf(process.platform === 'win32')('writevAll partial-write retry', () => {
it('writes all bytes through a pipe even when writev returns short counts', () => {
const script = join(process.cwd(), 'scripts', 'test-writev-partial.mjs');
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need tests to test scenarios in the C++ code that we can't test using the JavaScript API, then we should adopt a real C++ unit test runner like gtest. This would actually be advantageous as we could test many other scenarios that are challenging to test via JavaScript. Clang can instrument the C++ code and allow us to capture code coverage reports. I would much rather get a proper solution in place than band-aid some script to invoke the compiler and run a program. I can add gtest and give us a structure for adding C++ tests.

Comment thread package.json Outdated
"rebuild": "node-gyp rebuild",
"rebuild:debug": "node-gyp rebuild --coverage --debug --verbose",
"test": "cross-env CI=1 node --expose-gc ./node_modules/vitest/vitest.mjs --reporter=verbose",
"test:writev-partial": "node scripts/test-writev-partial.mjs",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably shouldn't specifically create a script for this test.

Comment thread test/binding/writev_partial_test.cpp Outdated
// writevAll handles short writev returns when the pipe buffer fills. We send
// 1 MiB across 256 iovecs through a pipe whose default capacity is well under
// the total, so the kernel will return short counts.
void testPartialPipeWrite() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this test is actually testing a partial writevs, at least not on macOS. On macOS, blocking writev to a pipe with an active reader completes in a single syscall, so there's no short return. If it was non-blocking, then you'd have to have a retry loop that checks a returned error as EAGAIN.

The only/best thing to do is use preprocessor macros to define writev(). In release builds, writeBatchToFile() would call ROCKSDB_JS_WRITEV() which is defined as #define ROCKSDB_JS_WRITEV ::writev and in test builds would be some mocked function that returns a partial write. It would have to be crafty as to only run the mocked version for certain tests, but the real one for other tests, so you'd need some global state and a way to change it (maybe env vars?).

Inline the writev retry loop directly into writeBatchToFile instead of
extracting it to a separate writevAll helper. Use IOV_MAX from <limits.h>
rather than a hardcoded 1024. Remove the standalone C++ test scaffolding
(writev_all.cpp, writev_partial_test.cpp, test-writev-partial.mjs,
writev-partial.test.ts, test:writev-partial script) per reviewer preference
to adopt gtest for C++ unit testing instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Partial writev in writeBatchToFile causes audit-log framing corruption

2 participants