Handle partial writev returns in transaction-log writeBatchToFile#573
Handle partial writev returns in transaction-log writeBatchToFile#573kriszyp wants to merge 2 commits into
Conversation
…uption writeBatchToFile previously advanced through iovecs by COUNT rather than by BYTES, so a short writev() return (POSIX allows it under EINTR, mid-write ENOSPC, NFS/FUSE, etc.) silently dropped the tail of one iovec and skipped the rest of the batch — while still reporting the partial-byte count to the caller. The resulting on-disk transaction log looked contiguous, but a truncated entry's header claimed more data than was actually written; the next batch appended directly after the partial bytes, and the reader's length-prefixed advance landed inside the next entry. Eventually surfaced as "RangeError: Offset is outside the bounds of the DataView" during replication catch-up (see HarperFast/harper#610). Both backends now retry-until-complete-or-fail: - POSIX: the writev loop tracks byte progress through a mutable iovec copy, trims iov_base/iov_len on partial writes, retries on EINTR, and bails on written==0 to avoid an infinite loop. The loop has been extracted into a static helper TransactionLogFile::writevAll, broken out into src/binding/writev_all.cpp so a standalone unit test can link against it without pulling in N-API and RocksDB. The new C++ test forces partial writes by sending more bytes than a default pipe buffer holds; the JS side also adds round-trip integrity tests that exercise the > MAX_IOVS chunking path with varied entry sizes. - Windows: WriteFile partial returns now retry on the same iovec instead of breaking out of the loop with an incorrect partial count. Any write error now returns -1 unconditionally (was: return partial count if any bytes had been written), so the caller is no longer told "everything wrote" when it didn't. With writeBatchToFile now write-everything-or-fail (negative on error, which writeEntriesV1 already turns into a thrown DBException), the TransactionLogEntryBatch's half-built partial-resume state was dead code: currentEntryHeaderWritten was never set true anywhere, and currentEntryBytesWritten was only ever read by debug logs. Removed both, along with the corresponding DEBUG_LOG argument lists. Fixes #572. Investigation: HarperFast/harper#610. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📊 Benchmark Resultsget-sync.bench.tsgetSync() > random keys - small key size (100 records)
getSync() > sequential keys - small key size (100 records)
ranges.bench.tsgetRange() > small range (100 records, 50 range)
realistic-load.bench.tsRealistic write load with workers > write variable records with transaction log
transaction-log.bench.tsTransaction log > read 100 iterators while write log with 100 byte records
Transaction log > read one entry from random position from log with 1000 100 byte records
worker-put-sync.bench.tsputSync() > random keys - small key size (100 records, 10 workers)
worker-transaction-log.bench.tsTransaction log with workers > write log with 100 byte records
Results from commit 962774d |
| /** | ||
| * Whether the transaction header for the current entry has been written. | ||
| */ | ||
| bool currentEntryHeaderWritten = false; |
| * The number of bytes already written from the current entry's data | ||
| * (excluding the transaction header). | ||
| */ | ||
| uint32_t currentEntryBytesWritten = 0; |
There was a problem hiding this comment.
This was purely informational and ok to remove.
| return totalWritten; | ||
| // Implementation of writevAll lives in writev_all.cpp so the same loop | ||
| // can be linked into a standalone unit-test executable. | ||
| return TransactionLogFile::writevAll(this->fd, iovecs, iovcnt); |
There was a problem hiding this comment.
I don't think we need this writevAll() function. We should keep the logic in this method.
| // We track byte progress through the iovec array and advance into a partial | ||
| // iovec's remainder by adjusting iov_base/iov_len so a short writev does not | ||
| // silently drop the tail of an entry. | ||
| constexpr int MAX_IOVS = 1024; // IOV_MAX on most systems |
There was a problem hiding this comment.
Can we #include <limits.h> and use IOV_MAX?
| // Skipped on Windows; the Win32 writeBatchToFile uses WriteFile, not writev. | ||
| describe.skipIf(process.platform === 'win32')('writevAll partial-write retry', () => { | ||
| it('writes all bytes through a pipe even when writev returns short counts', () => { | ||
| const script = join(process.cwd(), 'scripts', 'test-writev-partial.mjs'); |
There was a problem hiding this comment.
If we need tests to test scenarios in the C++ code that we can't test using the JavaScript API, then we should adopt a real C++ unit test runner like gtest. This would actually be advantageous as we could test many other scenarios that are challenging to test via JavaScript. Clang can instrument the C++ code and allow us to capture code coverage reports. I would much rather get a proper solution in place than band-aid some script to invoke the compiler and run a program. I can add gtest and give us a structure for adding C++ tests.
| "rebuild": "node-gyp rebuild", | ||
| "rebuild:debug": "node-gyp rebuild --coverage --debug --verbose", | ||
| "test": "cross-env CI=1 node --expose-gc ./node_modules/vitest/vitest.mjs --reporter=verbose", | ||
| "test:writev-partial": "node scripts/test-writev-partial.mjs", |
There was a problem hiding this comment.
We probably shouldn't specifically create a script for this test.
| // writevAll handles short writev returns when the pipe buffer fills. We send | ||
| // 1 MiB across 256 iovecs through a pipe whose default capacity is well under | ||
| // the total, so the kernel will return short counts. | ||
| void testPartialPipeWrite() { |
There was a problem hiding this comment.
I don't think this test is actually testing a partial writevs, at least not on macOS. On macOS, blocking writev to a pipe with an active reader completes in a single syscall, so there's no short return. If it was non-blocking, then you'd have to have a retry loop that checks a returned error as EAGAIN.
The only/best thing to do is use preprocessor macros to define writev(). In release builds, writeBatchToFile() would call ROCKSDB_JS_WRITEV() which is defined as #define ROCKSDB_JS_WRITEV ::writev and in test builds would be some mocked function that returns a partial write. It would have to be crafty as to only run the mocked version for certain tests, but the real one for other tests, so you'd need some global state and a way to change it (maybe env vars?).
Inline the writev retry loop directly into writeBatchToFile instead of extracting it to a separate writevAll helper. Use IOV_MAX from <limits.h> rather than a hardcoded 1024. Remove the standalone C++ test scaffolding (writev_all.cpp, writev_partial_test.cpp, test-writev-partial.mjs, writev-partial.test.ts, test:writev-partial script) per reviewer preference to adopt gtest for C++ unit testing instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Fix audit-log framing corruption caused by short
writev()returns inTransactionLogFile::writeBatchToFile. The previous POSIX loop advanced through iovecs by count rather than by bytes written, so any short return silently dropped the tail of one iovec and skipped the rest of the batch — but still reported the partial byte count to the caller. The on-disk file looked contiguous, a truncated entry's header claimed more data than was actually written, and the next batch appended right after the partial bytes. The reader's length-prefixed advance landed inside the next entry; downstream this surfaced asRangeError: Offset is outside the bounds of the DataViewduring replication catch-up.Both backends now retry-until-complete-or-fail:
TransactionLogFile::writevAllinsrc/binding/writev_all.cpp. Tracks byte progress through a mutable iovec copy, trimsiov_base/iov_lenon partial writes, retries onEINTR, bails onwritten == 0.-1unconditionally (was: return partial count if any bytes had been written, which the caller treated as success).With
writeBatchToFilenow write-everything-or-fail (negative on error, whichwriteEntriesV1already turns into a thrownDBException), the half-built partial-resume state onTransactionLogEntryBatchis dead code. RemovedcurrentEntryBytesWritten(only ever read by debug logs) andcurrentEntryHeaderWritten(never set true anywhere).Purpose
writevinwriteBatchToFilecauses audit-log framing corruption #572Where to focus review attention
src/binding/writev_all.cpp): handlesEINTR(retry), no infinite loop onwritten == 0(returns-1), advances by bytes not count, correctly chunks atMAX_IOVS = 1024.src/binding/transaction_log_file_windows.cpp:352-407): innerwhile (remaining > 0)retry loop,static_cast<DWORD>(remaining)matches existing pattern, error path returns-1instead of partial (changed semantic).writeEntriesV1pre-incrementsbatch.currentEntryIndexbefore the write; this is now safe because a partial write surfaces as-1and throwsDBException. The removedcurrentEntryBytesWritten/currentEntryHeaderWrittenfields were never used for resume.test/binding/writev_partial_test.cpp, run viascripts/test-writev-partial.mjs) compiles againstwritev_all.cppwith-DROCKSDB_JS_WRITEV_ALL_STANDALONEso it doesn't pull in N-API/RocksDB. Forces partialwritevby sending 1 MiB through a default-sized pipe and asserts byte-exact round-trip; also covers> MAX_IOVSiovecs into a regular file, zero iovcnt, and bad fd.test/transaction-log.test.tsexercise the high-iovcnt batching path end-to-end (> MAX_IOVSentries in one batch, plus varied-size entries) and assert byte-exact round-trip viaquery().test/writev-partial.test.ts) wraps the C++ test as a vitest case so it runs as part ofpnpm test.writev_all.cppdefinesTransactionLogFile::writevAll. Production builds include the fulltransaction_log_file.h; the standalone test build uses a minimal forward declaration guarded byROCKSDB_JS_WRITEV_ALL_STANDALONE. The two builds are separate binaries, so the same translation unit doesn't see both definitions.transaction_log_file_posix.cpp/_windows.cpp.Out of scope (intentionally not fixed)
O_APPEND(src/binding/transaction_log_file_posix.cpp:93). If a partial write happens andwritevAllreturns-1(so the in-memorysizeis not advanced), the next process restart willfstatand pick up the stale partial-bytes-and-then-next-batch as the file size. The in-memory readers don't see this because they use the cachedsize, but the post-crash recovery path will. This is a pre-existing concern (same onmain) and orthogonal to the writev correctness fix; not addressed here.Test plan
node scripts/test-writev-partial.mjs— standalone C++ partial-write test passespnpm build:binding— clean buildpnpm test— full suite passes (440 passed, 1 skipped)pnpm lint/pnpm fmt:check/pnpm type-check— cleanReview notes
O_APPENDrecovery concern documented above as out-of-scope.PR generated by Claude (Opus 4.7).