Skip to content

[None][feat] Disaggregated KV-cache bounce transfer#15618

Open
Shixiaowei02 wants to merge 1 commit into
NVIDIA:mainfrom
Shixiaowei02:feat/kv-bounce
Open

[None][feat] Disaggregated KV-cache bounce transfer#15618
Shixiaowei02 wants to merge 1 commit into
NVIDIA:mainfrom
Shixiaowei02:feat/kv-bounce

Conversation

@Shixiaowei02

@Shixiaowei02 Shixiaowei02 commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Description

This pull request introduces a new "bounce buffer" feature to the disaggregation transfer path, enabling more efficient VRAM-to-VRAM (d2d) key-value (KV) cache transfers by coalescing scattered fragments into a single contiguous memory region. This enhancement is opt-in and configurable, and it includes new buffer management, configuration, and gather/scatter logic, as well as related changes to the transfer protocol and Python bindings.

The most important changes are:

Bounce Buffer Feature Implementation:

  • Added a new bounce package (tensorrt_llm/_torch/disaggregation/native/bounce) containing buffer management (buffer.py), configuration (config.py), gather/scatter logic (gather.py), and transport utilities, enabling contiguous VRAM buffer allocation and efficient coalescing of KV fragments for transfer. [1] [2] [3] [4]

Transfer Protocol and Task Updates:

  • Extended transfer-related data structures (RecvReqInfo, WriteMeta) and serialization logic to support the bounce buffer feature by adding a bounce_dst_base field, which tracks the base address of the contiguous bounce buffer region. [1] [2] [3]
  • Introduced a new binary struct format (_KV_RESULT_PREFIX) for agent result messages to reduce serialization overhead and support bounce buffer metadata in transfer completion signaling.

Python Bindings and Typing:

  • Updated the C++/Python bindings for GenLlmReq to allow Python code to record per-request KV transfer timing using the current steady clock, aligning Python (v2) and C++ (v1) performance metrics.
  • Added type-checking imports and improved type annotations in the transfer module to support the new bounce configuration. [1] [2]

These changes collectively enable more efficient and configurable KV cache transfers in distributed inference scenarios, with improved performance monitoring and streamlined protocol handling.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • New Features

    • Added an optional KV-cache “bounce” mode to improve handling of fragmented, concurrent transfers.
    • Added configuration support to enable and size bounce buffering.
    • Added end-to-end support for tracking KV transfer timing and transferred size more consistently.
  • Bug Fixes

    • Improved transfer behavior for fan-in cases to avoid destination overwrites and better preserve result ordering.
    • Added fallback handling when bounce buffering is unavailable or cannot be allocated.
  • Tests

    • Added coverage for bounce configuration, wire-format handling, and transfer flow behavior across supported GPU test lists.

@Shixiaowei02

Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55730 [ run ] triggered by Bot. Commit: aa358b4 Link to invocation

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds bounce-buffer configuration, allocation, copy, and transport plumbing for KV-cache disaggregation, plus receive-side completion, KV-size accounting, and steady-clock transfer timestamps. It also adds a bounce-size config field, a NIXL worker override, and related tests.

Changes

Bounce transfer flow

Layer / File(s) Summary
Bounce config surface
tensorrt_llm/_torch/disaggregation/native/bounce/__init__.py, tensorrt_llm/_torch/disaggregation/native/bounce/config.py, tensorrt_llm/llmapi/llm_args.py, tests/unittest/disaggregated/test_bounce.py, tests/unittest/llmapi/test_llm_args.py
The bounce package exports sizing policies and the bounce-size config field, and tests cover sizing and config validation.
Bounce buffer allocator
tensorrt_llm/_torch/disaggregation/native/bounce/buffer.py
A stable virtual-memory buffer and slot allocator reserve chunk-rounded contiguous regions, release them, and expose one registration descriptor.
Contiguous copy primitives
tensorrt_llm/_torch/disaggregation/native/bounce/gather.py
Contiguous gather/scatter uses a Triton batched-copy path when aligned and falls back to per-fragment cudaMemcpyAsync.
Bounce transport protocol
tensorrt_llm/_torch/disaggregation/native/bounce/transport.py, tensorrt_llm/_torch/disaggregation/native/transfer.py, tests/unittest/disaggregated/test_bounce.py, tests/integration/test_lists/test-db/l0_*.yml
Bounce transport state, request/result encoding, and transport behavior tests are added.
KV timing and size accounting
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp, tensorrt_llm/_torch/disaggregation/native/transfer.py, tensorrt_llm/_torch/disaggregation/transceiver.py
Result decoding, completion deferral, KV-size bookkeeping, and steady-clock timestamp bindings are added.

NIXL worker count override

Layer / File(s) Summary
Worker count override
tensorrt_llm/_torch/disaggregation/nixl/_agent_cpp.py
The agent reads TRTLLM_NIXL_NUM_WORKERS and writes it into backend_params["num_workers"] when present.

Sequence Diagram(s)

sequenceDiagram
  participant TransferWorker
  participant Transport
  participant Receiver
  participant Sender
  participant RxSession

  TransferWorker->>Transport: create_bounce(cfg)
  Receiver->>Transport: reserve(recv_req)
  Receiver->>Sender: send bounce request bytes
  Sender->>Receiver: KV_AGENT_RESULT + result tail
  Receiver->>RxSession: process_kv_agent_result(dst_ptrs, sizes, src_base)
  RxSession->>Transport: scatter_write_result(...)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#15378: Changes the GenericLlmRequest KV-transfer timestamp bindings in cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp.

Suggested labels

api-compatible

Suggested reviewers

  • reasonsolo
  • byshiue
  • syuoni
  • xinhe-nv
  • pcastonguay
  • VALLIS-NERIA
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly reflects the main change: adding a disaggregated KV-cache bounce transfer path.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/llmapi/llm_args.py (1)

3384-3396: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Propagate and constrain kv_cache_bounce_size_mb.

The new field is accepted on CacheTransceiverConfig, but _to_pybind() does not pass it into _CacheTransceiverConfig, so kv_cache_bounce_size_mb=384 is silently lost before the native-disagg transceiver sees it. Also reject negative sizes at the Pydantic boundary.

As per coding guidelines, user-facing Pydantic numeric fields should use built-in constraints such as Field(ge=0).

Proposed fix
-    kv_cache_bounce_size_mb: int = Field(
+    kv_cache_bounce_size_mb: int = Field(
         default=0,
+        ge=0,
         description=
         "Per-region size in MiB of the native-disagg KV-cache bounce buffer (one for send, one for recv). Bounce coalesces a request's scattered per-block KV into one contiguous fabric-VMM buffer and issues a single multi-rail NIXL write. The size doubles as the on/off switch: 0 (default) keeps the per-block path, >0 enables bounce at that capacity. Only used by the Python (v2) transceiver."
     )
@@
             max_tokens_in_buffer=self.max_tokens_in_buffer,
             kv_transfer_timeout_ms=self.kv_transfer_timeout_ms,
             kv_transfer_sender_future_timeout_ms=self.
-            kv_transfer_sender_future_timeout_ms)
+            kv_transfer_sender_future_timeout_ms,
+            kv_cache_bounce_size_mb=self.kv_cache_bounce_size_mb)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/llmapi/llm_args.py` around lines 3384 - 3396, The new
kv_cache_bounce_size_mb setting is not being propagated and is missing
validation. Update CacheTransceiverConfig so _to_pybind() passes
kv_cache_bounce_size_mb into _CacheTransceiverConfig, and add a Pydantic
non-negative constraint on the Field definition (use a built-in constraint such
as ge=0) so negative values are rejected at the boundary.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp`:
- Around line 199-202: The KV-transfer binding API now exposes no-argument
setters in `GenLlmReq`, so the Python test call sites still passing `timedelta`
are out of sync. Update the `set_kv_cache_transfer_start` and
`set_kv_cache_transfer_end` usages in the bindings test to call them with no
arguments, and keep the existing offset-based assertions unchanged so the test
still validates the timing behavior through the recorded values.

In `@tensorrt_llm/_torch/disaggregation/native/bounce/__init__.py`:
- Around line 33-49: The __all__ export list in the bounce package is not
sorted, triggering RUF022. Reorder the symbols in the __all__ definition in the
__init__ module so the exported names are alphabetized while keeping the same
set of exports; this should satisfy the lint gate without changing behavior.

In `@tensorrt_llm/_torch/disaggregation/native/bounce/buffer.py`:
- Line 38: The __slots__ tuple in the buffer class is not sorted, which triggers
RUF023; update the __slots__ definition in the native bounce buffer module so
the slot names are in sorted order, and apply the same ordering fix to the other
__slots__ occurrence mentioned in the review to keep the file lint-clean.
- Around line 81-85: The __del__ cleanup in buffer.py is swallowing all cleanup
errors with a broad Exception handler, which hides failed VMM teardown. Update
the Buffer.__del__ path to catch only the specific exception type expected from
close()/destroy(), and add a log message with the failure details instead of
silently passing. Keep the change localized to the destructor and the cleanup
call it wraps so the failure is observable without masking unrelated errors.
- Around line 131-145: The reservation logic in the allocation path of the
buffer manager is checking only the wrapped head position and can miss a free
hole earlier in the ring, causing unnecessary waits/timeouts. Update the
allocation flow in the buffer class’ reservation method to scan existing free
gaps before blocking on the condition variable, so a request can reuse an
available contiguous hole like the freed region before calling _cv.wait().

In `@tensorrt_llm/_torch/disaggregation/native/bounce/gather.py`:
- Around line 181-184: The _copy_frags helper currently uses plain zip(), which
can silently truncate if pairs and sizes get out of sync. Update the
fragment-copy loop in _copy_frags to use strict zip semantics so mismatches fail
fast, and keep the existing cudaMemcpyAsync/ CUASSERT behavior unchanged.

In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 1743-1751: The failure/cancel handling in process_kv_agent_result
and the related bounce-transfer path does not release the reserved recv slot, so
bounced slices can stay pinned in Transport._reserved forever. Add an explicit
abort/release path on the bounce transport, and invoke it from the FAILED and
cancel flows before or alongside task.fail() so reserved entries are always
freed; use the process_kv_agent_result and
Receiver.dispatch_task/scatter_write_result symbols to locate the affected
logic.
- Around line 568-580: The KV result send path is still using the old ASCII
payload on failure, which breaks Receiver._process_kv_agent_result() because it
now always expects the binary _KV_RESULT_PREFIX frame. Refactor
Sender._deliver_kv_to_agent() and _send_failed_result_to_receiver() to route all
KV result sends through one shared helper that always builds the new binary
KV_AGENT_RESULT message for both success and FAILED/CANCELLED cases, and update
tests in test_bounce.py to cover the failure-path frame format.
- Around line 2152-2162: The bounce transport currently has no explicit shutdown
owner, so TransferWorker.shutdown() must be updated to close the bounce created
by create_bounce and avoid leaking its thread and VMM buffers. Make shutdown in
TransferWorker responsible for calling the bounce close path once, and adjust
registered descriptor handling so the bounce memory descriptors are owned in one
place only and are not deregistered twice. Use the existing self._bounce,
self._registered_mem, and Transport.close behavior as the key symbols to locate
and align the cleanup flow.

In `@tensorrt_llm/_torch/disaggregation/nixl/_agent_cpp.py`:
- Around line 78-80: The TRTLLM_NIXL_NUM_WORKERS env value is being forwarded
directly in _agent_cpp.py without validation, which can let empty, non-integer,
or non-positive values reach the backend. Update the boundary handling in the
agent setup code that reads os.environ.get("TRTLLM_NIXL_NUM_WORKERS") to parse
it as an integer, reject invalid or <=0 values, and only assign
backend_params["num_workers"] when the value is valid. Keep the fix localized
around the existing nixl_num_workers and backend_params logic so the backend
always receives a safe integer.

In `@tests/unittest/disaggregated/test_bounce.py`:
- Around line 31-36: The import guard in test_bounce.py is too broad and can
hide real bugs in tensorrt_llm._torch.disaggregation.native.bounce.transport by
setting _HAVE_TRANSPORT to False on any exception. Narrow the handling in the
top-level transport import block to only the expected missing-CUDA/import
failure, or move the skip logic into the individual bounce tests with
pytest.importorskip, so genuine import regressions still fail CI. Also extend
the bounce test coverage in test_bounce.py to include a FAILED KV_AGENT_RESULT
frame to validate the wire-format/fan-in path.

In `@tests/unittest/llmapi/test_llm_args.py`:
- Around line 1758-1761: The existing cache bounce size test only checks default
and direct Pydantic construction, so extend the coverage around
CacheTransceiverConfig to verify invalid negative kv_cache_bounce_size_mb values
are rejected and that _to_pybind() passes through the kv_cache_bounce_size_mb
value unchanged; use the CacheTransceiverConfig constructor and _to_pybind() in
the tests so the config handoff path is exercised.

---

Outside diff comments:
In `@tensorrt_llm/llmapi/llm_args.py`:
- Around line 3384-3396: The new kv_cache_bounce_size_mb setting is not being
propagated and is missing validation. Update CacheTransceiverConfig so
_to_pybind() passes kv_cache_bounce_size_mb into _CacheTransceiverConfig, and
add a Pydantic non-negative constraint on the Field definition (use a built-in
constraint such as ge=0) so negative values are rejected at the boundary.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 82f06d94-cf2f-462a-8ef1-bbe3af79c0e6

📥 Commits

Reviewing files that changed from the base of the PR and between 948db5a and aa358b4.

📒 Files selected for processing (14)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
  • tensorrt_llm/_torch/disaggregation/native/bounce/__init__.py
  • tensorrt_llm/_torch/disaggregation/native/bounce/buffer.py
  • tensorrt_llm/_torch/disaggregation/native/bounce/config.py
  • tensorrt_llm/_torch/disaggregation/native/bounce/gather.py
  • tensorrt_llm/_torch/disaggregation/native/bounce/transport.py
  • tensorrt_llm/_torch/disaggregation/native/transfer.py
  • tensorrt_llm/_torch/disaggregation/nixl/_agent_cpp.py
  • tensorrt_llm/_torch/disaggregation/transceiver.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/unittest/disaggregated/test_bounce.py
  • tests/unittest/llmapi/test_llm_args.py

Comment thread cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
Comment thread tensorrt_llm/_torch/disaggregation/native/bounce/__init__.py
Comment thread tensorrt_llm/_torch/disaggregation/native/bounce/buffer.py Outdated
Comment thread tensorrt_llm/_torch/disaggregation/native/bounce/buffer.py Outdated
Comment thread tensorrt_llm/_torch/disaggregation/native/bounce/buffer.py Outdated
Comment thread tensorrt_llm/_torch/disaggregation/native/transfer.py
Comment thread tensorrt_llm/_torch/disaggregation/native/transfer.py
Comment thread tensorrt_llm/_torch/disaggregation/nixl/_agent_cpp.py Outdated
Comment thread tests/unittest/disaggregated/test_bounce.py
Comment thread tests/unittest/llmapi/test_llm_args.py Outdated
@Shixiaowei02 Shixiaowei02 force-pushed the feat/kv-bounce branch 2 times, most recently from 292c0da to 0969c7f Compare June 25, 2026 08:50
… fabric-VMM WRITE)

Coalesce a request's scattered per-block KV into one contiguous fabric-VMM buffer and
issue a single multi-rail NIXL WRITE (reliable cuda_ipc/MNNVL), scattering it back on the
receiver — replacing the per-fragment path that collapses under the TP head-mismatch
descriptor explosion. Includes TP fan-in support (N senders into one recv region via
per-writer sub-region bases), an on/off switch via the new
CacheTransceiverConfig.kv_cache_bounce_size_mb config field (the per-region size in MiB
doubles as the enable flag; 0/default keeps the per-block path; ge=0 validated), an
env-gated NIXL num_workers knob (validated), a binary KV_AGENT_RESULT prefix that cuts
per-completion GIL cost, and CPU unit tests (enabled in l0_a10/l0_h100).

The receiver scatters off a worker thread, so the KV transfer is marked complete only
AFTER that scatter has landed (cudaStreamSynchronize) — a per-completion callback rides
the scatter queue and resolves the task (complete on success, fail on scatter error)
instead of completing the instant the scatter is merely enqueued, so the gen never reads
KV before it is in place; the non-bounced path still completes inline.

Failure/lifecycle robustness: all KV_AGENT_RESULT sends (success AND failed/cancelled) go
through one binary frame so the receiver never stalls on an undecodable failure result; a
FAILED or cancelled bounced slice releases its reserved recv region (no arena leak); the
recv allocator is first-fit so an out-of-order freed hole is reused (no spurious timeout);
and TransferWorker.shutdown closes the bounce transport (stops the scatter thread and frees
the VMM buffers, single-source descriptor ownership).

Squashed from the bounce development series on measure/bounce-pr12490.

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
@Shixiaowei02

Copy link
Copy Markdown
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55781 [ run ] triggered by Bot. Commit: 19dc328 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55730 [ run ] completed with state ABORTED. Commit: aa358b4

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants