[https://nvbugs/6293536][fix] Stage v2 KV block offsets through fresh host buffers#15607
Draft
thorjohnsen wants to merge 4 commits into
Draft
[https://nvbugs/6293536][fix] Stage v2 KV block offsets through fresh host buffers#15607thorjohnsen wants to merge 4 commits into
thorjohnsen wants to merge 4 commits into
Conversation
…host buffer KVCacheManager.copy_batch_block_offsets shipped each iteration's block-offset page tables to the device with an asynchronous H2D whose source was a single, persistently-reused pinned host buffer overwritten in place every call. An async copy reads its source at execution time, not enqueue time, so under the overlap scheduler the CPU runs an iteration ahead and the next iteration's in-place overwrite could clobber the source of the previous iteration's still-pending H2D. The attention kernel then indexed another batch's physical blocks, surfacing as memory corruption. The window is widened by v1 host KV offloading, which stalls the execution stream (syncTransfers) right in front of the offset H2D, plus concurrent prefill and primary-memory pressure. Stage the offsets through a freshly-allocated pinned host buffer on every call instead of reusing one in place. PyTorch's caching host allocator holds a freed pinned buffer until the consuming async copy completes, so the source can no longer be overwritten mid-flight -- the same protection the already-safe kv_lens / block_ids_per_seq staging paths rely on. The manager attribute is reassigned to the current buffer so same-iteration CPU readers (DSA sparse attention) and the speculative-decoding draft/target swap stay coherent; TrtllmAttentionMetadata.prepare re-grabs the reference after the copy. Add a regression test that drives the overlap window deterministically with a stream stall and asserts batch A's device offsets are not clobbered by batch B. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
Resolve copy_batch_block_offsets conflict: upstream added a cross-attention (CacheType.CROSS, beam_width>1) early-return branch that also reused the single persistent host_kv_cache_block_offsets buffer in place. Route both the cross and the self-attention paths through the fresh per-iteration pinned buffer so the nvbug 6293536 fix covers the cross path too. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
…r; stage only the async copy Address review and fix a correctness regression in the previous revision. The previous fix reassigned KVCacheManager.host_kv_cache_block_offsets to a freshly-allocated, per-call, num_seqs-sized buffer (and TrtllmAttentionMetadata .prepare re-grabbed it). That changed the layout and persistence the synchronous CPU readers of this attribute depend on -- DSA sparse attention, the speculative-decoding draft/target swap, and the KV-cache relocation path -- which broke test_modeling_llama.py::TestLlama::test_llama_verification_with_kv_cache_relocation (generation logits diverged from the reference after KV cache relocation). Separate the two roles instead of conflating them: - Keep filling the persistent host buffer in place, exactly as before, so every synchronous CPU reader observes the same persistent, max_batch-sized layout. The attribute is no longer reassigned. - Stage only the asynchronous H2D copy through a freshly-allocated pinned buffer (_stage_block_offsets_for_copy): a private snapshot of the rows this call needs. The async copy reads its source at execution time, so under the overlap scheduler the next iteration's in-place refill of the persistent buffer can no longer clobber this iteration's still-pending copy (nvbug 6293536). PyTorch's caching host allocator holds the snapshot until the consuming copy completes. The TrtllmAttentionMetadata.prepare re-grab is removed (the attribute is stable again), so trtllm.py returns to its original state and the target/draft handling is symmetric once more. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
… host buffers KVCacheManagerV2.copy_batch_block_offsets ships each iteration's block offsets to the device with copy_batch_block_offsets_to_device, an asynchronous cp.async gather kernel that reads its host inputs at execution time, not enqueue time. Two of those inputs are reused in place across iterations: host_kv_cache_block_offsets (the C++ KV cache writes page indices straight into a sequence's slot, and a freed slot is rebound to a different request on reuse) and copy_idx (a slice of the IndexMapper's single persistent copyIndex_ buffer, overwritten by the next get_copy_index call). Under the overlap scheduler the CPU runs an iteration ahead, so either could be clobbered before the previous iteration's still-pending kernel drains -> the kernel gathers another batch's physical blocks (nvbug 6293536, the v2 analogue of the v1 KVCacheManager fix). The v1 fix does not transfer: the persistent buffer cannot be reassigned (the C++ holds raw pointers into it via set_base_page_index_buf, and DSA reads it as the current page table), and a fresh per-call buffer alone is not protected by PyTorch's caching host allocator -- that keep-alive is only recorded for cudaMemcpyAsync out of pinned memory, not for a custom kernel reading host memory directly, so a freed buffer is reused immediately. Snapshot the rows each call needs into a fresh pinned buffer and feed the kernel an identity index, then retain the buffer until a CUDA event recorded after the kernel completes (event-guarded retirement, FIFO on the single stream), bounding memory to the overlap depth. The persistent host_kv_cache_block_offsets attribute is left in place for the C++ pointers and DSA readers. Also clarify (comment only) why the draft path in TrtllmAttentionMetadata.prepare needs no host-buffer re-grab: there is no metadata-cached draft host offsets attribute; consumers read draft_kv_cache_manager.host_kv_cache_block_offsets directly. Add a regression test that drives the overlap window deterministically with a stream stall and asserts batch A's device offsets are not clobbered by batch B. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
caeeaa5 to
3bdee3b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
#15546 fixed an overlap-scheduler data race in the v1
KVCacheManager.copy_batch_block_offsets. The v2 manager(
KVCacheManagerV2) has the same class of bug, surfaced by reviewer @yuxianq,but it needs a different fix.
KVCacheManagerV2.copy_batch_block_offsetsships each iteration's block offsetswith
copy_batch_block_offsets_to_device— an asynchronouscp.asyncgatherkernel (
cpp/.../kvCacheManagerV2Utils.cu) that reads its host inputs atexecution time, not enqueue time. Two of those inputs are reused in place
across iterations:
host_kv_cache_block_offsets— a persistent, per-slot buffer the C++ KVcache writes page indices into via raw pointers bound at create/resume
(
set_base_page_index_buf). A freed slot is rebound to a different request onreuse.
copy_idx—IndexMapper.get_copy_indexreturns a slice of a singlepersistent pinned
copyIndex_buffer, overwritten by the next call.Under the overlap scheduler the CPU runs an iteration ahead, so either input can
be clobbered before the previous iteration's still-pending gather kernel drains
→ the kernel gathers another batch's physical blocks (memory corruption).
Why the v1 fix does not transfer
it, and DSA sparse attention reads it as the current page table.
allocator only records a keep-alive event for
cudaMemcpyAsyncout of pinnedmemory — not for a custom kernel reading host memory directly. So a freed
buffer is reused immediately and the race persists. (Verified: the new
regression test still fails with a naive fresh-buffer fix.)
The fix
kernel an identity index, so both mutable kernel inputs are per-call and
immutable for this call.
(event, host_block_offsets, copy_index)until the event completes (FIFO onthe single stream), bounding memory to the overlap depth.
host_kv_cache_block_offsetsattribute is left in place for theC++ pointers and DSA readers.
Also adds a comment in
TrtllmAttentionMetadata.prepareclarifying why thedraft path needs no host-buffer re-grab (no metadata-cached draft host
attribute; consumers read
draft_kv_cache_manager.host_kv_cache_block_offsetsdirectly) — addressing the draft/target asymmetry @yuxianq flagged.
Test
New
tests/unittest/_torch/executor/test_kv_block_offset_overlap_race_v2.pydrives the overlap window deterministically with a stream stall and asserts batch
A's device offsets are not clobbered by batch B. Passes with the fix, fails
without it.
Local validation (H100): v1 + v2 regression tests pass;
test_dsa_indexer.py50 passed / 12 skipped; in-flight buffer list stays bounded across 50+ calls.