[None][feat] Disaggregated KV-cache bounce transfer#15618
Conversation
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #55730 [ run ] triggered by Bot. Commit: |
📝 WalkthroughWalkthroughThis PR adds bounce-buffer configuration, allocation, copy, and transport plumbing for KV-cache disaggregation, plus receive-side completion, KV-size accounting, and steady-clock transfer timestamps. It also adds a bounce-size config field, a NIXL worker override, and related tests. ChangesBounce transfer flow
NIXL worker count override
Sequence Diagram(s)sequenceDiagram
participant TransferWorker
participant Transport
participant Receiver
participant Sender
participant RxSession
TransferWorker->>Transport: create_bounce(cfg)
Receiver->>Transport: reserve(recv_req)
Receiver->>Sender: send bounce request bytes
Sender->>Receiver: KV_AGENT_RESULT + result tail
Receiver->>RxSession: process_kv_agent_result(dst_ptrs, sizes, src_base)
RxSession->>Transport: scatter_write_result(...)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
3384-3396: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winPropagate and constrain
kv_cache_bounce_size_mb.The new field is accepted on
CacheTransceiverConfig, but_to_pybind()does not pass it into_CacheTransceiverConfig, sokv_cache_bounce_size_mb=384is silently lost before the native-disagg transceiver sees it. Also reject negative sizes at the Pydantic boundary.As per coding guidelines, user-facing Pydantic numeric fields should use built-in constraints such as
Field(ge=0).Proposed fix
- kv_cache_bounce_size_mb: int = Field( + kv_cache_bounce_size_mb: int = Field( default=0, + ge=0, description= "Per-region size in MiB of the native-disagg KV-cache bounce buffer (one for send, one for recv). Bounce coalesces a request's scattered per-block KV into one contiguous fabric-VMM buffer and issues a single multi-rail NIXL write. The size doubles as the on/off switch: 0 (default) keeps the per-block path, >0 enables bounce at that capacity. Only used by the Python (v2) transceiver." ) @@ max_tokens_in_buffer=self.max_tokens_in_buffer, kv_transfer_timeout_ms=self.kv_transfer_timeout_ms, kv_transfer_sender_future_timeout_ms=self. - kv_transfer_sender_future_timeout_ms) + kv_transfer_sender_future_timeout_ms, + kv_cache_bounce_size_mb=self.kv_cache_bounce_size_mb)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/llmapi/llm_args.py` around lines 3384 - 3396, The new kv_cache_bounce_size_mb setting is not being propagated and is missing validation. Update CacheTransceiverConfig so _to_pybind() passes kv_cache_bounce_size_mb into _CacheTransceiverConfig, and add a Pydantic non-negative constraint on the Field definition (use a built-in constraint such as ge=0) so negative values are rejected at the boundary.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp`:
- Around line 199-202: The KV-transfer binding API now exposes no-argument
setters in `GenLlmReq`, so the Python test call sites still passing `timedelta`
are out of sync. Update the `set_kv_cache_transfer_start` and
`set_kv_cache_transfer_end` usages in the bindings test to call them with no
arguments, and keep the existing offset-based assertions unchanged so the test
still validates the timing behavior through the recorded values.
In `@tensorrt_llm/_torch/disaggregation/native/bounce/__init__.py`:
- Around line 33-49: The __all__ export list in the bounce package is not
sorted, triggering RUF022. Reorder the symbols in the __all__ definition in the
__init__ module so the exported names are alphabetized while keeping the same
set of exports; this should satisfy the lint gate without changing behavior.
In `@tensorrt_llm/_torch/disaggregation/native/bounce/buffer.py`:
- Line 38: The __slots__ tuple in the buffer class is not sorted, which triggers
RUF023; update the __slots__ definition in the native bounce buffer module so
the slot names are in sorted order, and apply the same ordering fix to the other
__slots__ occurrence mentioned in the review to keep the file lint-clean.
- Around line 81-85: The __del__ cleanup in buffer.py is swallowing all cleanup
errors with a broad Exception handler, which hides failed VMM teardown. Update
the Buffer.__del__ path to catch only the specific exception type expected from
close()/destroy(), and add a log message with the failure details instead of
silently passing. Keep the change localized to the destructor and the cleanup
call it wraps so the failure is observable without masking unrelated errors.
- Around line 131-145: The reservation logic in the allocation path of the
buffer manager is checking only the wrapped head position and can miss a free
hole earlier in the ring, causing unnecessary waits/timeouts. Update the
allocation flow in the buffer class’ reservation method to scan existing free
gaps before blocking on the condition variable, so a request can reuse an
available contiguous hole like the freed region before calling _cv.wait().
In `@tensorrt_llm/_torch/disaggregation/native/bounce/gather.py`:
- Around line 181-184: The _copy_frags helper currently uses plain zip(), which
can silently truncate if pairs and sizes get out of sync. Update the
fragment-copy loop in _copy_frags to use strict zip semantics so mismatches fail
fast, and keep the existing cudaMemcpyAsync/ CUASSERT behavior unchanged.
In `@tensorrt_llm/_torch/disaggregation/native/transfer.py`:
- Around line 1743-1751: The failure/cancel handling in process_kv_agent_result
and the related bounce-transfer path does not release the reserved recv slot, so
bounced slices can stay pinned in Transport._reserved forever. Add an explicit
abort/release path on the bounce transport, and invoke it from the FAILED and
cancel flows before or alongside task.fail() so reserved entries are always
freed; use the process_kv_agent_result and
Receiver.dispatch_task/scatter_write_result symbols to locate the affected
logic.
- Around line 568-580: The KV result send path is still using the old ASCII
payload on failure, which breaks Receiver._process_kv_agent_result() because it
now always expects the binary _KV_RESULT_PREFIX frame. Refactor
Sender._deliver_kv_to_agent() and _send_failed_result_to_receiver() to route all
KV result sends through one shared helper that always builds the new binary
KV_AGENT_RESULT message for both success and FAILED/CANCELLED cases, and update
tests in test_bounce.py to cover the failure-path frame format.
- Around line 2152-2162: The bounce transport currently has no explicit shutdown
owner, so TransferWorker.shutdown() must be updated to close the bounce created
by create_bounce and avoid leaking its thread and VMM buffers. Make shutdown in
TransferWorker responsible for calling the bounce close path once, and adjust
registered descriptor handling so the bounce memory descriptors are owned in one
place only and are not deregistered twice. Use the existing self._bounce,
self._registered_mem, and Transport.close behavior as the key symbols to locate
and align the cleanup flow.
In `@tensorrt_llm/_torch/disaggregation/nixl/_agent_cpp.py`:
- Around line 78-80: The TRTLLM_NIXL_NUM_WORKERS env value is being forwarded
directly in _agent_cpp.py without validation, which can let empty, non-integer,
or non-positive values reach the backend. Update the boundary handling in the
agent setup code that reads os.environ.get("TRTLLM_NIXL_NUM_WORKERS") to parse
it as an integer, reject invalid or <=0 values, and only assign
backend_params["num_workers"] when the value is valid. Keep the fix localized
around the existing nixl_num_workers and backend_params logic so the backend
always receives a safe integer.
In `@tests/unittest/disaggregated/test_bounce.py`:
- Around line 31-36: The import guard in test_bounce.py is too broad and can
hide real bugs in tensorrt_llm._torch.disaggregation.native.bounce.transport by
setting _HAVE_TRANSPORT to False on any exception. Narrow the handling in the
top-level transport import block to only the expected missing-CUDA/import
failure, or move the skip logic into the individual bounce tests with
pytest.importorskip, so genuine import regressions still fail CI. Also extend
the bounce test coverage in test_bounce.py to include a FAILED KV_AGENT_RESULT
frame to validate the wire-format/fan-in path.
In `@tests/unittest/llmapi/test_llm_args.py`:
- Around line 1758-1761: The existing cache bounce size test only checks default
and direct Pydantic construction, so extend the coverage around
CacheTransceiverConfig to verify invalid negative kv_cache_bounce_size_mb values
are rejected and that _to_pybind() passes through the kv_cache_bounce_size_mb
value unchanged; use the CacheTransceiverConfig constructor and _to_pybind() in
the tests so the config handoff path is exercised.
---
Outside diff comments:
In `@tensorrt_llm/llmapi/llm_args.py`:
- Around line 3384-3396: The new kv_cache_bounce_size_mb setting is not being
propagated and is missing validation. Update CacheTransceiverConfig so
_to_pybind() passes kv_cache_bounce_size_mb into _CacheTransceiverConfig, and
add a Pydantic non-negative constraint on the Field definition (use a built-in
constraint such as ge=0) so negative values are rejected at the boundary.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 82f06d94-cf2f-462a-8ef1-bbe3af79c0e6
📒 Files selected for processing (14)
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpptensorrt_llm/_torch/disaggregation/native/bounce/__init__.pytensorrt_llm/_torch/disaggregation/native/bounce/buffer.pytensorrt_llm/_torch/disaggregation/native/bounce/config.pytensorrt_llm/_torch/disaggregation/native/bounce/gather.pytensorrt_llm/_torch/disaggregation/native/bounce/transport.pytensorrt_llm/_torch/disaggregation/native/transfer.pytensorrt_llm/_torch/disaggregation/nixl/_agent_cpp.pytensorrt_llm/_torch/disaggregation/transceiver.pytensorrt_llm/llmapi/llm_args.pytests/integration/test_lists/test-db/l0_a10.ymltests/integration/test_lists/test-db/l0_h100.ymltests/unittest/disaggregated/test_bounce.pytests/unittest/llmapi/test_llm_args.py
292c0da to
0969c7f
Compare
… fabric-VMM WRITE) Coalesce a request's scattered per-block KV into one contiguous fabric-VMM buffer and issue a single multi-rail NIXL WRITE (reliable cuda_ipc/MNNVL), scattering it back on the receiver — replacing the per-fragment path that collapses under the TP head-mismatch descriptor explosion. Includes TP fan-in support (N senders into one recv region via per-writer sub-region bases), an on/off switch via the new CacheTransceiverConfig.kv_cache_bounce_size_mb config field (the per-region size in MiB doubles as the enable flag; 0/default keeps the per-block path; ge=0 validated), an env-gated NIXL num_workers knob (validated), a binary KV_AGENT_RESULT prefix that cuts per-completion GIL cost, and CPU unit tests (enabled in l0_a10/l0_h100). The receiver scatters off a worker thread, so the KV transfer is marked complete only AFTER that scatter has landed (cudaStreamSynchronize) — a per-completion callback rides the scatter queue and resolves the task (complete on success, fail on scatter error) instead of completing the instant the scatter is merely enqueued, so the gen never reads KV before it is in place; the non-bounced path still completes inline. Failure/lifecycle robustness: all KV_AGENT_RESULT sends (success AND failed/cancelled) go through one binary frame so the receiver never stalls on an undecodable failure result; a FAILED or cancelled bounced slice releases its reserved recv region (no arena leak); the recv allocator is first-fit so an out-of-order freed hole is reused (no spurious timeout); and TransferWorker.shutdown closes the bounce transport (stops the scatter thread and frees the VMM buffers, single-source descriptor ownership). Squashed from the bounce development series on measure/bounce-pr12490. Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
0969c7f to
19dc328
Compare
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #55781 [ run ] triggered by Bot. Commit: |
|
PR_Github #55730 [ run ] completed with state |
Description
This pull request introduces a new "bounce buffer" feature to the disaggregation transfer path, enabling more efficient VRAM-to-VRAM (d2d) key-value (KV) cache transfers by coalescing scattered fragments into a single contiguous memory region. This enhancement is opt-in and configurable, and it includes new buffer management, configuration, and gather/scatter logic, as well as related changes to the transfer protocol and Python bindings.
The most important changes are:
Bounce Buffer Feature Implementation:
bouncepackage (tensorrt_llm/_torch/disaggregation/native/bounce) containing buffer management (buffer.py), configuration (config.py), gather/scatter logic (gather.py), and transport utilities, enabling contiguous VRAM buffer allocation and efficient coalescing of KV fragments for transfer. [1] [2] [3] [4]Transfer Protocol and Task Updates:
RecvReqInfo,WriteMeta) and serialization logic to support the bounce buffer feature by adding abounce_dst_basefield, which tracks the base address of the contiguous bounce buffer region. [1] [2] [3]_KV_RESULT_PREFIX) for agent result messages to reduce serialization overhead and support bounce buffer metadata in transfer completion signaling.Python Bindings and Typing:
GenLlmReqto allow Python code to record per-request KV transfer timing using the current steady clock, aligning Python (v2) and C++ (v1) performance metrics.These changes collectively enable more efficient and configurable KV cache transfers in distributed inference scenarios, with improved performance monitoring and streamlined protocol handling.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.Summary by CodeRabbit
New Features
Bug Fixes
Tests