Skip to content

[https://nvbugs/6293536][fix] Stage v2 KV block offsets through fresh host buffers#15607

Draft
thorjohnsen wants to merge 4 commits into
NVIDIA:mainfrom
thorjohnsen:fix/nvbug-6293536-v2-block-offset-race
Draft

[https://nvbugs/6293536][fix] Stage v2 KV block offsets through fresh host buffers#15607
thorjohnsen wants to merge 4 commits into
NVIDIA:mainfrom
thorjohnsen:fix/nvbug-6293536-v2-block-offset-race

Conversation

@thorjohnsen

Copy link
Copy Markdown
Collaborator

Stacked on #15546 — do not merge before it. This is a follow-up to
#15546 (same nvbug 6293536) that fixes the v2 KV cache manager. Because an
upstream PR must target main, the diff below currently also includes
#15546's v1 commit (a862b7ca77); once #15546 merges, rebasing this branch
on main reduces the diff to just the v2 commit. Kept as a draft until
then.

What

#15546 fixed an overlap-scheduler data race in the v1
KVCacheManager.copy_batch_block_offsets. The v2 manager
(KVCacheManagerV2) has the same class of bug, surfaced by reviewer @yuxianq,
but it needs a different fix.

KVCacheManagerV2.copy_batch_block_offsets ships each iteration's block offsets
with copy_batch_block_offsets_to_device — an asynchronous cp.async gather
kernel
(cpp/.../kvCacheManagerV2Utils.cu) that reads its host inputs at
execution time, not enqueue time. Two of those inputs are reused in place
across iterations:

  1. host_kv_cache_block_offsets — a persistent, per-slot buffer the C++ KV
    cache writes page indices into via raw pointers bound at create/resume
    (set_base_page_index_buf). A freed slot is rebound to a different request on
    reuse.
  2. copy_idxIndexMapper.get_copy_index returns a slice of a single
    persistent pinned copyIndex_ buffer, overwritten by the next call.

Under the overlap scheduler the CPU runs an iteration ahead, so either input can
be clobbered before the previous iteration's still-pending gather kernel drains
→ the kernel gathers another batch's physical blocks (memory corruption).

Why the v1 fix does not transfer

  • The persistent buffer cannot be reassigned: the C++ holds raw pointers into
    it, and DSA sparse attention reads it as the current page table.
  • A fresh per-call buffer alone is not enough: PyTorch's caching host
    allocator only records a keep-alive event for cudaMemcpyAsync out of pinned
    memory — not for a custom kernel reading host memory directly. So a freed
    buffer is reused immediately and the race persists. (Verified: the new
    regression test still fails with a naive fresh-buffer fix.)

The fix

  • Snapshot the rows each call needs into a fresh pinned buffer and feed the
    kernel an identity index, so both mutable kernel inputs are per-call and
    immutable for this call.
  • Event-guarded retirement: record a CUDA event after the kernel and retain
    (event, host_block_offsets, copy_index) until the event completes (FIFO on
    the single stream), bounding memory to the overlap depth.
  • The persistent host_kv_cache_block_offsets attribute is left in place for the
    C++ pointers and DSA readers.

Also adds a comment in TrtllmAttentionMetadata.prepare clarifying why the
draft path needs no host-buffer re-grab (no metadata-cached draft host
attribute; consumers read draft_kv_cache_manager.host_kv_cache_block_offsets
directly) — addressing the draft/target asymmetry @yuxianq flagged.

Test

New tests/unittest/_torch/executor/test_kv_block_offset_overlap_race_v2.py
drives the overlap window deterministically with a stream stall and asserts batch
A's device offsets are not clobbered by batch B. Passes with the fix, fails
without it.

Local validation (H100): v1 + v2 regression tests pass; test_dsa_indexer.py
50 passed / 12 skipped; in-flight buffer list stays bounded across 50+ calls.

…host buffer

KVCacheManager.copy_batch_block_offsets shipped each iteration's block-offset
page tables to the device with an asynchronous H2D whose source was a single,
persistently-reused pinned host buffer overwritten in place every call. An
async copy reads its source at execution time, not enqueue time, so under the
overlap scheduler the CPU runs an iteration ahead and the next iteration's
in-place overwrite could clobber the source of the previous iteration's
still-pending H2D. The attention kernel then indexed another batch's physical
blocks, surfacing as memory corruption. The window is widened by v1 host KV
offloading, which stalls the execution stream (syncTransfers) right in front of
the offset H2D, plus concurrent prefill and primary-memory pressure.

Stage the offsets through a freshly-allocated pinned host buffer on every call
instead of reusing one in place. PyTorch's caching host allocator holds a freed
pinned buffer until the consuming async copy completes, so the source can no
longer be overwritten mid-flight -- the same protection the already-safe
kv_lens / block_ids_per_seq staging paths rely on. The manager attribute is
reassigned to the current buffer so same-iteration CPU readers (DSA sparse
attention) and the speculative-decoding draft/target swap stay coherent;
TrtllmAttentionMetadata.prepare re-grabs the reference after the copy.

Add a regression test that drives the overlap window deterministically with a
stream stall and asserts batch A's device offsets are not clobbered by batch B.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
Resolve copy_batch_block_offsets conflict: upstream added a cross-attention
(CacheType.CROSS, beam_width>1) early-return branch that also reused the single
persistent host_kv_cache_block_offsets buffer in place. Route both the cross
and the self-attention paths through the fresh per-iteration pinned buffer so
the nvbug 6293536 fix covers the cross path too.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
…r; stage only the async copy

Address review and fix a correctness regression in the previous revision.

The previous fix reassigned KVCacheManager.host_kv_cache_block_offsets to a
freshly-allocated, per-call, num_seqs-sized buffer (and TrtllmAttentionMetadata
.prepare re-grabbed it). That changed the layout and persistence the synchronous
CPU readers of this attribute depend on -- DSA sparse attention, the
speculative-decoding draft/target swap, and the KV-cache relocation path -- which
broke test_modeling_llama.py::TestLlama::test_llama_verification_with_kv_cache_relocation
(generation logits diverged from the reference after KV cache relocation).

Separate the two roles instead of conflating them:
- Keep filling the persistent host buffer in place, exactly as before, so every
  synchronous CPU reader observes the same persistent, max_batch-sized layout.
  The attribute is no longer reassigned.
- Stage only the asynchronous H2D copy through a freshly-allocated pinned buffer
  (_stage_block_offsets_for_copy): a private snapshot of the rows this call needs.
  The async copy reads its source at execution time, so under the overlap
  scheduler the next iteration's in-place refill of the persistent buffer can no
  longer clobber this iteration's still-pending copy (nvbug 6293536). PyTorch's
  caching host allocator holds the snapshot until the consuming copy completes.

The TrtllmAttentionMetadata.prepare re-grab is removed (the attribute is stable
again), so trtllm.py returns to its original state and the target/draft handling
is symmetric once more.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
… host buffers

KVCacheManagerV2.copy_batch_block_offsets ships each iteration's block offsets to
the device with copy_batch_block_offsets_to_device, an asynchronous cp.async
gather kernel that reads its host inputs at execution time, not enqueue time. Two
of those inputs are reused in place across iterations: host_kv_cache_block_offsets
(the C++ KV cache writes page indices straight into a sequence's slot, and a freed
slot is rebound to a different request on reuse) and copy_idx (a slice of the
IndexMapper's single persistent copyIndex_ buffer, overwritten by the next
get_copy_index call). Under the overlap scheduler the CPU runs an iteration ahead,
so either could be clobbered before the previous iteration's still-pending kernel
drains -> the kernel gathers another batch's physical blocks (nvbug 6293536, the
v2 analogue of the v1 KVCacheManager fix).

The v1 fix does not transfer: the persistent buffer cannot be reassigned (the C++
holds raw pointers into it via set_base_page_index_buf, and DSA reads it as the
current page table), and a fresh per-call buffer alone is not protected by
PyTorch's caching host allocator -- that keep-alive is only recorded for
cudaMemcpyAsync out of pinned memory, not for a custom kernel reading host memory
directly, so a freed buffer is reused immediately.

Snapshot the rows each call needs into a fresh pinned buffer and feed the kernel
an identity index, then retain the buffer until a CUDA event recorded after the
kernel completes (event-guarded retirement, FIFO on the single stream), bounding
memory to the overlap depth. The persistent host_kv_cache_block_offsets attribute
is left in place for the C++ pointers and DSA readers.

Also clarify (comment only) why the draft path in
TrtllmAttentionMetadata.prepare needs no host-buffer re-grab: there is no
metadata-cached draft host offsets attribute; consumers read
draft_kv_cache_manager.host_kv_cache_block_offsets directly.

Add a regression test that drives the overlap window deterministically with a
stream stall and asserts batch A's device offsets are not clobbered by batch B.

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
@thorjohnsen thorjohnsen force-pushed the fix/nvbug-6293536-v2-block-offset-race branch from caeeaa5 to 3bdee3b Compare June 25, 2026 01:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant