[https://nvbugs/6293536][fix] Stage v2 KV block offsets through fresh host buffers by thorjohnsen · Pull Request #15607 · NVIDIA/TensorRT-LLM

thorjohnsen · 2026-06-25T00:55:50Z

Stacked on #15546 — do not merge before it. This is a follow-up to
#15546 (same nvbug 6293536) that fixes the v2 KV cache manager. Because an
upstream PR must target main, the diff below currently also includes
#15546's v1 commit (a862b7ca77); once #15546 merges, rebasing this branch
on main reduces the diff to just the v2 commit. Kept as a draft until
then.

What

#15546 fixed an overlap-scheduler data race in the v1
KVCacheManager.copy_batch_block_offsets. The v2 manager
(KVCacheManagerV2) has the same class of bug, surfaced by reviewer @yuxianq,
but it needs a different fix.

KVCacheManagerV2.copy_batch_block_offsets ships each iteration's block offsets
with copy_batch_block_offsets_to_device — an asynchronous cp.async gather
kernel (cpp/.../kvCacheManagerV2Utils.cu) that reads its host inputs at
execution time, not enqueue time. Two of those inputs are reused in place
across iterations:

host_kv_cache_block_offsets — a persistent, per-slot buffer the C++ KV
cache writes page indices into via raw pointers bound at create/resume
(set_base_page_index_buf). A freed slot is rebound to a different request on
reuse.
copy_idx — IndexMapper.get_copy_index returns a slice of a single
persistent pinned copyIndex_ buffer, overwritten by the next call.

Under the overlap scheduler the CPU runs an iteration ahead, so either input can
be clobbered before the previous iteration's still-pending gather kernel drains
→ the kernel gathers another batch's physical blocks (memory corruption).

Why the v1 fix does not transfer

The persistent buffer cannot be reassigned: the C++ holds raw pointers into
it, and DSA sparse attention reads it as the current page table.
A fresh per-call buffer alone is not enough: PyTorch's caching host
allocator only records a keep-alive event for cudaMemcpyAsync out of pinned
memory — not for a custom kernel reading host memory directly. So a freed
buffer is reused immediately and the race persists. (Verified: the new
regression test still fails with a naive fresh-buffer fix.)

The fix

Snapshot the rows each call needs into a fresh pinned buffer and feed the
kernel an identity index, so both mutable kernel inputs are per-call and
immutable for this call.
Event-guarded retirement: record a CUDA event after the kernel and retain
(event, host_block_offsets, copy_index) until the event completes (FIFO on
the single stream), bounding memory to the overlap depth.
The persistent host_kv_cache_block_offsets attribute is left in place for the
C++ pointers and DSA readers.

Also adds a comment in TrtllmAttentionMetadata.prepare clarifying why the
draft path needs no host-buffer re-grab (no metadata-cached draft host
attribute; consumers read draft_kv_cache_manager.host_kv_cache_block_offsets
directly) — addressing the draft/target asymmetry @yuxianq flagged.

Test

New tests/unittest/_torch/executor/test_kv_block_offset_overlap_race_v2.py
drives the overlap window deterministically with a stream stall and asserts batch
A's device offsets are not clobbered by batch B. Passes with the fix, fails
without it.

Local validation (H100): v1 + v2 regression tests pass; test_dsa_indexer.py
50 passed / 12 skipped; in-flight buffer list stays bounded across 50+ calls.

…host buffer KVCacheManager.copy_batch_block_offsets shipped each iteration's block-offset page tables to the device with an asynchronous H2D whose source was a single, persistently-reused pinned host buffer overwritten in place every call. An async copy reads its source at execution time, not enqueue time, so under the overlap scheduler the CPU runs an iteration ahead and the next iteration's in-place overwrite could clobber the source of the previous iteration's still-pending H2D. The attention kernel then indexed another batch's physical blocks, surfacing as memory corruption. The window is widened by v1 host KV offloading, which stalls the execution stream (syncTransfers) right in front of the offset H2D, plus concurrent prefill and primary-memory pressure. Stage the offsets through a freshly-allocated pinned host buffer on every call instead of reusing one in place. PyTorch's caching host allocator holds a freed pinned buffer until the consuming async copy completes, so the source can no longer be overwritten mid-flight -- the same protection the already-safe kv_lens / block_ids_per_seq staging paths rely on. The manager attribute is reassigned to the current buffer so same-iteration CPU readers (DSA sparse attention) and the speculative-decoding draft/target swap stay coherent; TrtllmAttentionMetadata.prepare re-grabs the reference after the copy. Add a regression test that drives the overlap window deterministically with a stream stall and asserts batch A's device offsets are not clobbered by batch B. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

Resolve copy_batch_block_offsets conflict: upstream added a cross-attention (CacheType.CROSS, beam_width>1) early-return branch that also reused the single persistent host_kv_cache_block_offsets buffer in place. Route both the cross and the self-attention paths through the fresh per-iteration pinned buffer so the nvbug 6293536 fix covers the cross path too. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

…r; stage only the async copy Address review and fix a correctness regression in the previous revision. The previous fix reassigned KVCacheManager.host_kv_cache_block_offsets to a freshly-allocated, per-call, num_seqs-sized buffer (and TrtllmAttentionMetadata .prepare re-grabbed it). That changed the layout and persistence the synchronous CPU readers of this attribute depend on -- DSA sparse attention, the speculative-decoding draft/target swap, and the KV-cache relocation path -- which broke test_modeling_llama.py::TestLlama::test_llama_verification_with_kv_cache_relocation (generation logits diverged from the reference after KV cache relocation). Separate the two roles instead of conflating them: - Keep filling the persistent host buffer in place, exactly as before, so every synchronous CPU reader observes the same persistent, max_batch-sized layout. The attribute is no longer reassigned. - Stage only the asynchronous H2D copy through a freshly-allocated pinned buffer (_stage_block_offsets_for_copy): a private snapshot of the rows this call needs. The async copy reads its source at execution time, so under the overlap scheduler the next iteration's in-place refill of the persistent buffer can no longer clobber this iteration's still-pending copy (nvbug 6293536). PyTorch's caching host allocator holds the snapshot until the consuming copy completes. The TrtllmAttentionMetadata.prepare re-grab is removed (the attribute is stable again), so trtllm.py returns to its original state and the target/draft handling is symmetric once more. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

… host buffers KVCacheManagerV2.copy_batch_block_offsets ships each iteration's block offsets to the device with copy_batch_block_offsets_to_device, an asynchronous cp.async gather kernel that reads its host inputs at execution time, not enqueue time. Two of those inputs are reused in place across iterations: host_kv_cache_block_offsets (the C++ KV cache writes page indices straight into a sequence's slot, and a freed slot is rebound to a different request on reuse) and copy_idx (a slice of the IndexMapper's single persistent copyIndex_ buffer, overwritten by the next get_copy_index call). Under the overlap scheduler the CPU runs an iteration ahead, so either could be clobbered before the previous iteration's still-pending kernel drains -> the kernel gathers another batch's physical blocks (nvbug 6293536, the v2 analogue of the v1 KVCacheManager fix). The v1 fix does not transfer: the persistent buffer cannot be reassigned (the C++ holds raw pointers into it via set_base_page_index_buf, and DSA reads it as the current page table), and a fresh per-call buffer alone is not protected by PyTorch's caching host allocator -- that keep-alive is only recorded for cudaMemcpyAsync out of pinned memory, not for a custom kernel reading host memory directly, so a freed buffer is reused immediately. Snapshot the rows each call needs into a fresh pinned buffer and feed the kernel an identity index, then retain the buffer until a CUDA event recorded after the kernel completes (event-guarded retirement, FIFO on the single stream), bounding memory to the overlap depth. The persistent host_kv_cache_block_offsets attribute is left in place for the C++ pointers and DSA readers. Also clarify (comment only) why the draft path in TrtllmAttentionMetadata.prepare needs no host-buffer re-grab: there is no metadata-cached draft host offsets attribute; consumers read draft_kv_cache_manager.host_kv_cache_block_offsets directly. Add a regression test that drives the overlap window deterministically with a stream stall and asserts batch A's device offsets are not clobbered by batch B. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

thorjohnsen added 2 commits June 23, 2026 13:00

github-actions Bot assigned thorjohnsen Jun 25, 2026

thorjohnsen added 2 commits June 25, 2026 01:56

thorjohnsen force-pushed the fix/nvbug-6293536-v2-block-offset-race branch from caeeaa5 to 3bdee3b Compare June 25, 2026 01:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[https://nvbugs/6293536][fix] Stage v2 KV block offsets through fresh host buffers#15607

[https://nvbugs/6293536][fix] Stage v2 KV block offsets through fresh host buffers#15607
thorjohnsen wants to merge 4 commits into
NVIDIA:mainfrom
thorjohnsen:fix/nvbug-6293536-v2-block-offset-race

thorjohnsen commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

thorjohnsen commented Jun 25, 2026

What

Why the v1 fix does not transfer

The fix

Test

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant