[None][perf] DSv4 follow-up: disagg routing improvements#15625
[None][perf] DSv4 follow-up: disagg routing improvements#15625lfr-0531 wants to merge 22 commits into
Conversation
…2 KV cache backends (NVIDIA#14435) Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
… HTTP client (NVIDIA#14562) Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…rving (feat/deepseek_v4) (NVIDIA#14611) Signed-off-by: Shicheng Li <shicli@nvidia.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…e TTFT (NVIDIA#14767) Signed-off-by: Zheyu Fu <zheyuf@nvidia.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…fig flag (NVIDIA#14860) Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com> Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…handshake (NVIDIA#14950) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…VIDIA#15005) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…nversation_id routing (NVIDIA#14957) Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…IA#15040) Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…ve list (NVIDIA#14867) Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…ank affinity for attention DP (NVIDIA#14983) Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…r forward path (NVIDIA#15041) Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…rite (NVIDIA#15042) Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…NVIDIA#15133) Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
NVIDIA#15199) Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…er-rank-divergent gates (NVIDIA#14560) Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…se (NVIDIA#14844) Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…request (NVIDIA#15366) Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
|
/bot run --disable-fail-fast |
|
PR_Github #55770 [ run ] triggered by Bot. Commit: |
📝 WalkthroughWalkthroughAdds request metadata and routing/config options for disaggregated serving, threads conversation IDs and cache-salt IDs through executor and protocol paths, updates KV-cache-aware routing and OpenAI disagg request handling, and expands unit and integration coverage for the new flows. ChangesDisaggregated serving
Sequence Diagram(s)sequenceDiagram
participant OpenAIHttpClient
participant OpenAIDisaggregatedService
participant KvCacheAwareRouter
participant KvCacheAwareServerState
OpenAIHttpClient->>OpenAIDisaggregatedService: send request / receive stream
OpenAIDisaggregatedService->>OpenAIDisaggregatedService: update disagg_request_id
OpenAIDisaggregatedService->>OpenAIDisaggregatedService: rewrite usage or trim messages
OpenAIDisaggregatedService->>KvCacheAwareRouter: finish_request(request, success)
KvCacheAwareRouter->>KvCacheAwareServerState: apply routed blocks / poll KV events
Estimated code review effort🎯 5 (Critical) | ⏱️ ~90+ minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tensorrt_llm/serve/openai_disagg_service.py (1)
177-180: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winReturn the generation response through the usage rewriter.
The new
_rewrite_disagg_usage()pipeline is never applied on the ctx-first generation path, so client-visibleprompt_tokens,total_tokens, andcached_tokenscan remain generation-worker values instead of the context response’s accounting.Proposed fix
gen_response = await self._gen_client.send_request( gen_req, server=gen_server, hooks=hooks ) - return gen_response + return self._rewrite_disagg_usage(gen_response, ctx_response)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/serve/openai_disagg_service.py` around lines 177 - 180, The ctx-first generation path in self._gen_client.send_request currently returns gen_response directly, so the new _rewrite_disagg_usage() accounting is skipped. Update the generation flow in openai_disagg_service to pass the generation response through _rewrite_disagg_usage() before returning it, using the existing ctx response/accounting context so prompt_tokens, total_tokens, and cached_tokens are rewritten consistently.tensorrt_llm/serve/openai_client.py (1)
205-207: 🗄️ Data Integrity & Integration | 🟠 MajorDefer streaming cleanup until retries are exhausted
tensorrt_llm/serve/openai_client.py:205-207, 230-260, 276-319_response_generator()finishes the request on every attempt, but_post_with_retry()can still retry before any bytes are yielded. That can clear router state on a failed first attempt and leave load/affinity bookkeeping wrong for the retry. Keepsuccess=Falseby default and only mark success once the stream fully drains; cancellation/GeneratorExitcurrently bypassesexcept Exceptionand otherwise gets reported as success.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/serve/openai_client.py` around lines 205 - 207, The retry path in openai_client.py is finalizing streaming too early in _response_generator and _post_with_retry, which can clear router state before retries are exhausted. Update _response_generator so success stays false by default and only flips to true after the stream fully drains, and make sure cleanup/finalization only happens once the last attempt is complete. Also handle cancellation and GeneratorExit explicitly around _response_generator and _post_with_retry so they are not treated as successful responses.tensorrt_llm/_torch/disaggregation/native/mixers/attention/peer.py (1)
338-370: 🗄️ Data Integrity & Integration | 🟠 MajorKeep
tokens_per_blockequality required. The transfer path still aligns blocks and computes mapper offsets from a single block size, so divisible-but-unequal peers can mis-copy KV.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/disaggregation/native/mixers/attention/peer.py` around lines 338 - 370, The peer compatibility check is too permissive for tokens_per_block and must require exact equality instead of only allowing divisible mismatches. Update _tpb_check in AttentionPolicy to reject any local/peer tokens_per_block difference and have check_peer_compatible fail when values differ, since the transfer path and mapper offsets assume a single shared block size. Keep the existing structure in _tpb_check and check_peer_compatible, but change the logic so unequal block sizes are incompatible.tensorrt_llm/serve/router.py (1)
820-830: 🔒 Security & Privacy | 🟠 Major
trust_remote_code=Trueis unsafe for request-supplied model IDs.
_tokenize()passesrequest.modeldirectly into_get_tokenizer(), andTransformersTokenizer.from_pretrained(..., trust_remote_code=True)can execute arbitrary code from a model repo.ChatCompletionRequest.modelis just a free-form string here, and I don’t see any server-side restriction to the served model name.Gate this on operator config and/or restrict tokenization to the configured model.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/serve/router.py` around lines 820 - 830, The tokenizer loading path in _get_tokenizer currently trusts request-supplied model IDs and always enables trust_remote_code, which is unsafe. Update Router._get_tokenizer and the _tokenize call path so tokenization is only allowed for the server-configured served model or another explicitly whitelisted model, and make trust_remote_code controlled by operator configuration rather than hardcoded. Keep the custom tokenizer branch unchanged, but guard the TransformersTokenizer.from_pretrained path with a restriction tied to the configured model name.Source: Linters/SAST tools
🧹 Nitpick comments (4)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
1741-1758: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winTest never validates the bypass signal it claims to check.
The comments state perf_metrics should grow sub-linearly when bypass works, and
last_totalis recorded across invocations, but it is only assigned — never compared. The sole assertion (assert metrics or last_total[0] > 0) just checks non-emptiness, so it passes even if the bypass logic is completely broken (i.e. one entry per request). As written this is effectively a smoke check, not a bypass-signal regression test.Consider asserting the actual invariant, e.g. recorded entries are strictly fewer than total requests issued (
num_iters * prompts_per_iter), which is what "bypassed requests are absent" implies. Also note the first invocation can spuriously fail if/perf_metricsis momentarily empty because the gen request hasn't finished.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/defs/disaggregated/test_disaggregated.py` around lines 1741 - 1758, The bypass regression check in _check_bypass_signal only verifies that /perf_metrics is non-empty, so it never proves bypassed requests are excluded. Update the test in test_disaggregated.py to assert the real invariant by comparing the collected metrics count against the total requests issued (for example, using the existing last_total tracking or the known num_iters * prompts_per_iter bound) and ensure the check fails when entries grow one-per-request. Keep the helper’s logger output and last_total state in _check_bypass_signal, but replace the smoke-style assertion with one that validates sub-linear growth after the first successful run.tests/unittest/disaggregated/test_openai_disagg_service.py (1)
140-200: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winAdd ctx-first usage rewrite regression coverage.
Coverage is insufficient for the new usage-rewrite flow: these tests validate prompt-token plumbing, but not that
_send_disagg_request_ctx_first()returns a generation response with ctx-derivedprompt_tokens,total_tokens, andcached_tokensfor non-streaming and SSE responses.As per path instructions,
tests/**: “Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/disaggregated/test_openai_disagg_service.py` around lines 140 - 200, The current tests cover prompt-token propagation, but they do not exercise the ctx-first usage rewrite path. Add regression coverage around _send_disagg_request_ctx_first() to verify that when the context response is rewritten into a generation response, the returned usage includes ctx-derived prompt_tokens, total_tokens, and cached_tokens for both non-streaming and SSE flows. Use the existing OpenAIServer and _send_disagg_request_ctx_first symbols, and assert the rewritten response payload/usage rather than only prompt_token_ids plumbing.Source: Path instructions
tests/unittest/disaggregated/test_disagg_openai_client.py (1)
480-564: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winAdd streaming retry coverage for router finalization.
Coverage is insufficient for the changed streaming path: these retry tests all use
stream=False. Please add cases intests/unittest/disaggregated/test_disagg_openai_client.pywhere an SSE request fails before yielding and then succeeds, and where it finally fails, assertingfinish_request(..., success=...)is called exactly for the terminal outcome.As per path instructions,
tests/**: “Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/disaggregated/test_disagg_openai_client.py` around lines 480 - 564, The existing retry coverage in test_disagg_openai_client only exercises non-streaming requests, so add streaming SSE retry tests for the router finalization path. Use the send_request flow with stream=True and cover one case where the request fails before yielding and then succeeds, and another where it ultimately fails after retries. Verify finish_request is called only once for the terminal outcome with the correct success value, and reference the existing helper methods like _make_client, _make_request, send_request, and finish_request to place the new assertions alongside the current retry tests.Source: Path instructions
tests/unittest/disaggregated/test_peer.py (1)
408-435: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueCoverage for the divisible/non-divisible relaxation is solid (positive + negative through the
register()contract). As a QA follow-up, consider adding a case where the local TPB is the larger value (e.g. local=32, peer=16) to lock in themax/mindirection-symmetry of_tpb_check, plus an equal-TPB baseline asserting no warning. Optional — current tests are sufficient to merge.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/disaggregated/test_peer.py` around lines 408 - 435, Add coverage around `_tpb_check` in `test_peer_registrar_tpb_divisible_warns_but_compatible` / `test_peer_registrar_tpb_not_divisible_raises` by adding a case where the local tokens_per_block is larger than the peer’s (for example local=32, peer=16) to verify the divisible logic is symmetric regardless of max/min ordering, and add an equal-tokens_per_block baseline to assert `register()` succeeds without warning; keep the checks anchored on `_make_peer_registrar` and `register()` so the behavior remains locked in if implementation details change.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 678-684: The dummy probe request construction in CppLlmRequest is
using the wrong keyword for the salt field, so update the resource_manager flow
that builds dummy_req to pass cache_salt instead of cache_salt_id (or otherwise
align it with the binding’s accepted kwarg). Verify the CppLlmRequest call site
in resource_manager uses the same naming as the binding signature so salted
probes do not fail with a TypeError.
In `@tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py`:
- Around line 867-871: The get_relax_value helper in adp_router.py should
normalize attention_dp_relax the same way as the other routers: read
py_scheduling_params from req_item.request, return True when scheduling_params
is missing or attention_dp_relax is None, and otherwise return the boolean
value. Update the logic inside get_relax_value so sorted(...,
key=get_relax_value) never compares None with bool and instead always receives a
normalized boolean key.
In `@tensorrt_llm/llmapi/llm_args.py`:
- Around line 1245-1251: The kv_cache_routing_max_sessions field in llm_args.py
currently accepts invalid non-positive values, which makes the LRU cap
nonsensical; update the llm_args model to use PositiveInt for
kv_cache_routing_max_sessions in the same Field definition so validation rejects
0 and negatives while preserving the existing default and description.
---
Outside diff comments:
In `@tensorrt_llm/_torch/disaggregation/native/mixers/attention/peer.py`:
- Around line 338-370: The peer compatibility check is too permissive for
tokens_per_block and must require exact equality instead of only allowing
divisible mismatches. Update _tpb_check in AttentionPolicy to reject any
local/peer tokens_per_block difference and have check_peer_compatible fail when
values differ, since the transfer path and mapper offsets assume a single shared
block size. Keep the existing structure in _tpb_check and check_peer_compatible,
but change the logic so unequal block sizes are incompatible.
In `@tensorrt_llm/serve/openai_client.py`:
- Around line 205-207: The retry path in openai_client.py is finalizing
streaming too early in _response_generator and _post_with_retry, which can clear
router state before retries are exhausted. Update _response_generator so success
stays false by default and only flips to true after the stream fully drains, and
make sure cleanup/finalization only happens once the last attempt is complete.
Also handle cancellation and GeneratorExit explicitly around _response_generator
and _post_with_retry so they are not treated as successful responses.
In `@tensorrt_llm/serve/openai_disagg_service.py`:
- Around line 177-180: The ctx-first generation path in
self._gen_client.send_request currently returns gen_response directly, so the
new _rewrite_disagg_usage() accounting is skipped. Update the generation flow in
openai_disagg_service to pass the generation response through
_rewrite_disagg_usage() before returning it, using the existing ctx
response/accounting context so prompt_tokens, total_tokens, and cached_tokens
are rewritten consistently.
In `@tensorrt_llm/serve/router.py`:
- Around line 820-830: The tokenizer loading path in _get_tokenizer currently
trusts request-supplied model IDs and always enables trust_remote_code, which is
unsafe. Update Router._get_tokenizer and the _tokenize call path so tokenization
is only allowed for the server-configured served model or another explicitly
whitelisted model, and make trust_remote_code controlled by operator
configuration rather than hardcoded. Keep the custom tokenizer branch unchanged,
but guard the TransformersTokenizer.from_pretrained path with a restriction tied
to the configured model name.
---
Nitpick comments:
In `@tests/integration/defs/disaggregated/test_disaggregated.py`:
- Around line 1741-1758: The bypass regression check in _check_bypass_signal
only verifies that /perf_metrics is non-empty, so it never proves bypassed
requests are excluded. Update the test in test_disaggregated.py to assert the
real invariant by comparing the collected metrics count against the total
requests issued (for example, using the existing last_total tracking or the
known num_iters * prompts_per_iter bound) and ensure the check fails when
entries grow one-per-request. Keep the helper’s logger output and last_total
state in _check_bypass_signal, but replace the smoke-style assertion with one
that validates sub-linear growth after the first successful run.
In `@tests/unittest/disaggregated/test_disagg_openai_client.py`:
- Around line 480-564: The existing retry coverage in test_disagg_openai_client
only exercises non-streaming requests, so add streaming SSE retry tests for the
router finalization path. Use the send_request flow with stream=True and cover
one case where the request fails before yielding and then succeeds, and another
where it ultimately fails after retries. Verify finish_request is called only
once for the terminal outcome with the correct success value, and reference the
existing helper methods like _make_client, _make_request, send_request, and
finish_request to place the new assertions alongside the current retry tests.
In `@tests/unittest/disaggregated/test_openai_disagg_service.py`:
- Around line 140-200: The current tests cover prompt-token propagation, but
they do not exercise the ctx-first usage rewrite path. Add regression coverage
around _send_disagg_request_ctx_first() to verify that when the context response
is rewritten into a generation response, the returned usage includes ctx-derived
prompt_tokens, total_tokens, and cached_tokens for both non-streaming and SSE
flows. Use the existing OpenAIServer and _send_disagg_request_ctx_first symbols,
and assert the rewritten response payload/usage rather than only
prompt_token_ids plumbing.
In `@tests/unittest/disaggregated/test_peer.py`:
- Around line 408-435: Add coverage around `_tpb_check` in
`test_peer_registrar_tpb_divisible_warns_but_compatible` /
`test_peer_registrar_tpb_not_divisible_raises` by adding a case where the local
tokens_per_block is larger than the peer’s (for example local=32, peer=16) to
verify the divisible logic is symmetric regardless of max/min ordering, and add
an equal-tokens_per_block baseline to assert `register()` succeeds without
warning; keep the checks anchored on `_make_peer_registrar` and `register()` so
the behavior remains locked in if implementation details change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4489bc3a-7e1e-43f1-be0b-c52a224ed3ef
📒 Files selected for processing (33)
cpp/include/tensorrt_llm/executor/executor.hcpp/tensorrt_llm/executor/request.cppcpp/tensorrt_llm/executor/requestImpl.hcpp/tensorrt_llm/nanobind/executor/request.cpptensorrt_llm/_torch/disaggregation/native/mixers/attention/peer.pytensorrt_llm/_torch/distributed/communicator.pytensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/_torch/pyexecutor/resource_manager.pytensorrt_llm/_torch/pyexecutor/scheduler/adp_router.pytensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.pytensorrt_llm/disaggregated_params.pytensorrt_llm/inputs/utils.pytensorrt_llm/llmapi/disagg_utils.pytensorrt_llm/llmapi/llm_args.pytensorrt_llm/serve/openai_client.pytensorrt_llm/serve/openai_disagg_service.pytensorrt_llm/serve/openai_protocol.pytensorrt_llm/serve/openai_server.pytensorrt_llm/serve/router.pytests/integration/defs/accuracy/test_disaggregated_serving.pytests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3_v2.yamltests/integration/defs/disaggregated/test_disaggregated.pytests/integration/test_lists/qa/llm_function_core.txttests/integration/test_lists/test-db/l0_b200.ymltests/integration/test_lists/test-db/l0_dgx_h100.ymltests/unittest/_torch/executor/test_adp_router.pytests/unittest/disaggregated/test_disagg_openai_client.pytests/unittest/disaggregated/test_disagg_utils.pytests/unittest/disaggregated/test_disaggregated_params.pytests/unittest/disaggregated/test_openai_disagg_service.pytests/unittest/disaggregated/test_openai_server_info.pytests/unittest/disaggregated/test_peer.pytests/unittest/disaggregated/test_router.py
| dummy_req = CppLlmRequest(request_id=0, | ||
| max_new_tokens=0, | ||
| input_tokens=input_tokens, | ||
| sampling_config=SamplingConfig(), | ||
| is_streaming=False, | ||
| lora_task_id=lora_task_id) | ||
| lora_task_id=lora_task_id, | ||
| cache_salt_id=cache_salt_id) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm the C++ LlmRequest binding exposes cache_salt_id as a ctor arg.
rg -nP -C3 'cache_salt_id' cpp/ | rg -i 'llmRequest|arg\(|init'
echo "----- binding init for LlmRequest -----"
rg -nP --type=cpp -C2 'nb::arg\("cache_salt_id"\)|py::arg\("cache_salt_id"\)' cpp/Repository: NVIDIA/TensorRT-LLM
Length of output: 197
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the CppLlmRequest/LlmRequest binding and inspect its constructor signature.
git ls-files | rg 'resource_manager\.py|batch_manager|llm_request|request'
echo "----- search for CppLlmRequest / LlmRequest binding -----"
rg -n --hidden --glob '!**/.git/**' 'CppLlmRequest|class LlmRequest|nb::class_|py::class_' .Repository: NVIDIA/TensorRT-LLM
Length of output: 29024
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the Python file around the reviewed line and nearby request creation helpers.
wc -l tensorrt_llm/_torch/pyexecutor/resource_manager.py
sed -n '630,710p' tensorrt_llm/_torch/pyexecutor/resource_manager.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 4342
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for the request class definition and any constructor arguments that may include cache_salt_id.
rg -n --hidden --glob '!**/.git/**' -C 3 'cache_salt_id|CppLlmRequest|LlmRequest' tensorrt_llm cpp bindings .Repository: NVIDIA/TensorRT-LLM
Length of output: 50376
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "----- constructor bindings for cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp -----"
sed -n '1,260p' cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp
echo "----- class declaration / constructors in cpp/include/tensorrt_llm/batch_manager/llmRequest.h -----"
rg -n -C 3 'LlmRequest\(' cpp/include/tensorrt_llm/batch_manager/llmRequest.h cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cppRepository: NVIDIA/TensorRT-LLM
Length of output: 12845
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "----- cache_salt_id references in batch_manager request implementation -----"
rg -n -C 2 'cache_salt_id' cpp/include/tensorrt_llm/batch_manager/llmRequest.h \
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp \
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp \
tensorrt_llm/_torch/pyexecutor/llm_request.py
echo "----- relevant ctor overloads in cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp -----"
rg -n -C 4 'nb::init|def\("__init__"|py::init|constructor' cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cppRepository: NVIDIA/TensorRT-LLM
Length of output: 234
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "----- binding section around tb::LlmRequest in cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp -----"
sed -n '320,380p' cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
echo "----- constructor declarations in cpp/include/tensorrt_llm/batch_manager/llmRequest.h -----"
rg -n -C 6 'LlmRequest\(' cpp/include/tensorrt_llm/batch_manager/llmRequest.h
echo "----- cache_salt_id in cpp/include/tensorrt_llm/batch_manager/llmRequest.h -----"
rg -n -C 3 'cache_salt_id' cpp/include/tensorrt_llm/batch_manager/llmRequest.hRepository: NVIDIA/TensorRT-LLM
Length of output: 10244
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the tail of the tb::LlmRequest nanobind constructor to see whether cache_salt_id is a named parameter.
sed -n '380,470p' cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
# Show the exact constructor tail in the C++ header as well.
sed -n '2397,2483p' cpp/include/tensorrt_llm/batch_manager/llmRequest.hRepository: NVIDIA/TensorRT-LLM
Length of output: 15002
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find any explicit cache_salt_id mention in the LlmRequest binding and nearby lines.
rg -n -C 2 'cache_salt_id|cacheSalt|cacheSaltId' cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp cpp/include/tensorrt_llm/batch_manager/llmRequest.hRepository: NVIDIA/TensorRT-LLM
Length of output: 3563
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the constructor lambda tail where cache_salt_id would appear.
sed -n '430,520p' cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
# A precise grep for the final constructor arguments around cache salt.
rg -n -C 4 'multimodal_run_lengths|cache_salt|cacheSalt' cpp/tensorrt_llm/nanobind/batch_manager/bindings.cppRepository: NVIDIA/TensorRT-LLM
Length of output: 10912
Use the binding’s cache_salt kwarg here
CppLlmRequest exposes cache_salt, not cache_salt_id; this call will raise TypeError on salted probes unless the kwarg name is aligned or aliased.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 678 - 684,
The dummy probe request construction in CppLlmRequest is using the wrong keyword
for the salt field, so update the resource_manager flow that builds dummy_req to
pass cache_salt instead of cache_salt_id (or otherwise align it with the
binding’s accepted kwarg). Verify the CppLlmRequest call site in
resource_manager uses the same naming as the binding signature so salted probes
do not fail with a TypeError.
| def get_relax_value(req_item): | ||
| scheduling_params = getattr(req_item.request, "py_scheduling_params", None) | ||
| if scheduling_params is None: | ||
| return True | ||
| return scheduling_params.attention_dp_relax |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check whether attention_dp_relax can ever be None in practice.
rg -nP --type=py 'attention_dp_relax' -C2Repository: NVIDIA/TensorRT-LLM
Length of output: 157
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant router implementation and compare it with the other routers.
git ls-files | rg '(^|/)CODING_GUIDELINES\.md$|tensorrt_llm/_torch/pyexecutor/scheduler/adp_router\.py$|tensorrt_llm/_torch/pyexecutor/scheduler/.*router.*\.py$'
echo '--- outline: adp_router.py ---'
ast-grep outline tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py --view expanded || true
echo '--- search for attention_dp_relax across repo ---'
rg -n --hidden --glob '!**/.git/**' 'attention_dp_relax|py_scheduling_params' tensorrt_llm/_torch -S || trueRepository: NVIDIA/TensorRT-LLM
Length of output: 4344
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant router implementation and compare it with the other routers.
git ls-files | rg '(^|/)CODING_GUIDELINES\.md$|tensorrt_llm/_torch/pyexecutor/scheduler/adp_router\.py$|tensorrt_llm/_torch/pyexecutor/scheduler/.*router.*\.py$'
echo '--- outline: adp_router.py ---'
ast-grep outline tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py --view expanded || true
echo '--- search for attention_dp_relax across repo ---'
rg -n --hidden --glob '!**/.git/**' 'attention_dp_relax|py_scheduling_params' tensorrt_llm/_torch -S || trueRepository: NVIDIA/TensorRT-LLM
Length of output: 4344
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- relevant slice of adp_router.py ---'
sed -n '840,890p' tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
echo '--- other router definitions of relax handling ---'
rg -n -C 3 'attention_dp_relax|get_relax_value|True if val is None else val' tensorrt_llm/_torch/pyexecutor/scheduler -SRepository: NVIDIA/TensorRT-LLM
Length of output: 6793
Normalize attention_dp_relax here as well.
This helper should mirror the other routers and coerce None to True; otherwise sorted(..., key=get_relax_value) can end up comparing None with bool values and raise TypeError.
Suggested fix
def get_relax_value(req_item):
scheduling_params = getattr(req_item.request, "py_scheduling_params", None)
if scheduling_params is None:
return True
- return scheduling_params.attention_dp_relax
+ val = scheduling_params.attention_dp_relax
+ return True if val is None else val📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def get_relax_value(req_item): | |
| scheduling_params = getattr(req_item.request, "py_scheduling_params", None) | |
| if scheduling_params is None: | |
| return True | |
| return scheduling_params.attention_dp_relax | |
| def get_relax_value(req_item): | |
| scheduling_params = getattr(req_item.request, "py_scheduling_params", None) | |
| if scheduling_params is None: | |
| return True | |
| val = scheduling_params.attention_dp_relax | |
| return True if val is None else val |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py` around lines 867 -
871, The get_relax_value helper in adp_router.py should normalize
attention_dp_relax the same way as the other routers: read py_scheduling_params
from req_item.request, return True when scheduling_params is missing or
attention_dp_relax is None, and otherwise return the boolean value. Update the
logic inside get_relax_value so sorted(..., key=get_relax_value) never compares
None with bool and instead always receives a normalized boolean key.
| kv_cache_routing_max_sessions: int = Field( | ||
| default=65536, | ||
| description= | ||
| "LRU cap on the conversation->rank map used by conversation-affinity " | ||
| "routing. The oldest conversations are evicted once more than this many " | ||
| "are tracked, bounding memory on long-running servers. Only used when " | ||
| "kv_cache_routing_conversation_affinity is True.") |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -nP '\bPositiveInt\b' tensorrt_llm/llmapi/llm_args.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 1269
Constrain kv_cache_routing_max_sessions to a positive value.
0 or a negative cap would be nonsensical for an LRU limit. Use PositiveInt for this field.
♻️ Suggested change
- kv_cache_routing_max_sessions: int = Field(
+ kv_cache_routing_max_sessions: PositiveInt = Field(
default=65536,
description=
"LRU cap on the conversation->rank map used by conversation-affinity "
"routing. The oldest conversations are evicted once more than this many "
"are tracked, bounding memory on long-running servers. Only used when "
"kv_cache_routing_conversation_affinity is True.")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| kv_cache_routing_max_sessions: int = Field( | |
| default=65536, | |
| description= | |
| "LRU cap on the conversation->rank map used by conversation-affinity " | |
| "routing. The oldest conversations are evicted once more than this many " | |
| "are tracked, bounding memory on long-running servers. Only used when " | |
| "kv_cache_routing_conversation_affinity is True.") | |
| kv_cache_routing_max_sessions: PositiveInt = Field( | |
| default=65536, | |
| description= | |
| "LRU cap on the conversation->rank map used by conversation-affinity " | |
| "routing. The oldest conversations are evicted once more than this many " | |
| "are tracked, bounding memory on long-running servers. Only used when " | |
| "kv_cache_routing_conversation_affinity is True.") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/llmapi/llm_args.py` around lines 1245 - 1251, The
kv_cache_routing_max_sessions field in llm_args.py currently accepts invalid
non-positive values, which makes the LRU cap nonsensical; update the llm_args
model to use PositiveInt for kv_cache_routing_max_sessions in the same Field
definition so validation rejects 0 and negatives while preserving the existing
default and description.
Sources: Coding guidelines, Path instructions
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Description
This PR carves the PR-2 router/OpenAI disagg routing scope from
github/feat/deepseek_v4aftera7728aaonto currentgithub/main.It includes:
PR-1 and PR-2 have no intended hard product dependency in this split. They still touch nearby runtime/disagg surfaces, so whichever lands second should be rebased and conflict-checked.
Rebase/conflict notes are recorded in
tmp/dsv4_final_prs/PR2_rebase_conflicts_20260625.md.Test Coverage
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --use_ccache --cuda_architectures "90-real;100-real" --configure_cmakebuild/tensorrt_llm-1.3.0rc20-cp312-cp312-linux_x86_64.whl..venv-3.12withpip install --force-reinstall --no-deps.tensorrt_llm,serve.router,serve.openai_disagg_service,scheduler.adp_router, and nativebindingsimport from this worktree/build.git diff --check github/main...HEADPATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 20m .venv-3.12/bin/python -m pytest tests/unittest/disaggregated/test_peer.py tests/unittest/disaggregated/test_router.py tests/unittest/disaggregated/test_disagg_openai_client.py tests/unittest/disaggregated/test_disagg_utils.py tests/unittest/disaggregated/test_disaggregated_params.py tests/unittest/disaggregated/test_openai_disagg_service.py tests/unittest/disaggregated/test_openai_server_info.py -q170 passed, 3 warningsPATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 20m .venv-3.12/bin/python -m pytest tests/unittest/_torch/executor/test_adp_router.py -q65 passed, 2 warningsPATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 10m .venv-3.12/bin/python -m pytest tests/unittest/llmapi/apps --ignore=tests/unittest/llmapi/apps/test_disagg_serving_perf_metrics.py -q -k "not test_list_tool_calls and not test_iterator_tool_calls and not test_extra_fields_raw_dict"410 passed, 3 deselected, 13 warningsNotes:
tests/unittest/servedoes not exist in this worktree.tests/unittest/llmapi/appshas 3 failures in unmodifiedchat_utilstests (MagicMockclass lacksmodel_type).tests/unittest/llmapi/apps/test_disagg_serving_perf_metrics.pywas retried with.venv-3.12/binonPATH, then interrupted after worker logs stayed at MPI pool startup for several minutes with 0% GPU utilization; no worker processes were left running and GPU memory returned to 0 MiB.PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.