Skip to content

[None][perf] DSv4 follow-up: disagg routing improvements#15625

Open
lfr-0531 wants to merge 22 commits into
NVIDIA:mainfrom
lfr-0531:user/fanrongl/dsv4-a7728aa-pr2-router-20260625
Open

[None][perf] DSv4 follow-up: disagg routing improvements#15625
lfr-0531 wants to merge 22 commits into
NVIDIA:mainfrom
lfr-0531:user/fanrongl/dsv4-a7728aa-pr2-router-20260625

Conversation

@lfr-0531

@lfr-0531 lfr-0531 commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • New Features

    • Added request token counts to the public API and Python bindings.
    • Introduced conversation-aware and cache-aware routing improvements, including optional conversation pinning and cache-salt support.
    • Added support for first-token prioritization in generation scheduling.
  • Bug Fixes

    • Improved disaggregated request handling, retry behavior, and completion tracking.
    • Better preserves conversation IDs and cache-related settings across request flows.
    • Expanded server info output to show KV-cache details like block sizing and hash mode.
  • Tests

    • Added and updated integration/unit coverage for routing, disaggregation, retries, and server info.

Description

This PR carves the PR-2 router/OpenAI disagg routing scope from github/feat/deepseek_v4 after a7728aa onto current github/main.

It includes:

  • OpenAI disaggregated service/router request handling, gen-first/context-first plumbing, retry IDs, prompt-token echo, usage rewrite, and optional generation request history trimming.
  • KV-cache-aware router improvements including salt-aware block hashing, tokenizer-cache handling, tool-calling tokenization, in-transfer accounting gate, absolute block score, and conversation affinity.
  • ADP router/request enqueue fixes and related server-info normalization used by disaggregated routing.

PR-1 and PR-2 have no intended hard product dependency in this split. They still touch nearby runtime/disagg surfaces, so whichever lands second should be rebased and conflict-checked.

Rebase/conflict notes are recorded in tmp/dsv4_final_prs/PR2_rebase_conflicts_20260625.md.

Test Coverage

  • Build/install:
    • python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --use_ccache --cuda_architectures "90-real;100-real" --configure_cmake
    • Built build/tensorrt_llm-1.3.0rc20-cp312-cp312-linux_x86_64.whl.
    • Installed into .venv-3.12 with pip install --force-reinstall --no-deps.
    • Verified tensorrt_llm, serve.router, serve.openai_disagg_service, scheduler.adp_router, and native bindings import from this worktree/build.
  • Static/scope:
    • git diff --check github/main...HEAD
    • Changed-file conflict-marker search.
    • Forbidden-scope check for DSv4 model/sparse attention/MoE/FMHA paths returned no matches.
  • Unit tests:
    • PATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 20m .venv-3.12/bin/python -m pytest tests/unittest/disaggregated/test_peer.py tests/unittest/disaggregated/test_router.py tests/unittest/disaggregated/test_disagg_openai_client.py tests/unittest/disaggregated/test_disagg_utils.py tests/unittest/disaggregated/test_disaggregated_params.py tests/unittest/disaggregated/test_openai_disagg_service.py tests/unittest/disaggregated/test_openai_server_info.py -q
      • 170 passed, 3 warnings
    • PATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 20m .venv-3.12/bin/python -m pytest tests/unittest/_torch/executor/test_adp_router.py -q
      • 65 passed, 2 warnings
    • PATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 10m .venv-3.12/bin/python -m pytest tests/unittest/llmapi/apps --ignore=tests/unittest/llmapi/apps/test_disagg_serving_perf_metrics.py -q -k "not test_list_tool_calls and not test_iterator_tool_calls and not test_extra_fields_raw_dict"
      • 410 passed, 3 deselected, 13 warnings

Notes:

  • tests/unittest/serve does not exist in this worktree.
  • Full tests/unittest/llmapi/apps has 3 failures in unmodified chat_utils tests (MagicMock class lacks model_type).
  • tests/unittest/llmapi/apps/test_disagg_serving_perf_metrics.py was retried with .venv-3.12/bin on PATH, then interrupted after worker logs stayed at MPI pool startup for several minutes with 0% GPU utilization; no worker processes were left running and GPU memory returned to 0 MiB.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

lancelly and others added 22 commits June 25, 2026 09:15
…2 KV cache backends (NVIDIA#14435)

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
… HTTP client (NVIDIA#14562)

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…rving (feat/deepseek_v4) (NVIDIA#14611)

Signed-off-by: Shicheng Li <shicli@nvidia.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…e TTFT (NVIDIA#14767)

Signed-off-by: Zheyu Fu <zheyuf@nvidia.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…fig flag (NVIDIA#14860)

Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Co-authored-by: Lanyu Liao <lancelly@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…handshake (NVIDIA#14950)

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…VIDIA#15005)

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…nversation_id routing (NVIDIA#14957)

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…IA#15040)

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…ve list (NVIDIA#14867)

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…ank affinity for attention DP (NVIDIA#14983)

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
)

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…r forward path (NVIDIA#15041)

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…rite (NVIDIA#15042)

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…NVIDIA#15133)

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
NVIDIA#15199)

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…er-rank-divergent gates (NVIDIA#14560)

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…se (NVIDIA#14844)

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
…request (NVIDIA#15366)

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
@lfr-0531 lfr-0531 requested review from a team as code owners June 25, 2026 11:18
@lfr-0531

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55770 [ run ] triggered by Bot. Commit: e85c5d5 Link to invocation

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds request metadata and routing/config options for disaggregated serving, threads conversation IDs and cache-salt IDs through executor and protocol paths, updates KV-cache-aware routing and OpenAI disagg request handling, and expands unit and integration coverage for the new flows.

Changes

Disaggregated serving

Layer / File(s) Summary
Request metadata and config threading
cpp/include/tensorrt_llm/executor/executor.h, cpp/tensorrt_llm/executor/request.cpp, cpp/tensorrt_llm/executor/requestImpl.h, cpp/tensorrt_llm/nanobind/executor/request.cpp, tensorrt_llm/disaggregated_params.py, tensorrt_llm/inputs/utils.py, tensorrt_llm/llmapi/disagg_utils.py, tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/serve/openai_protocol.py, tests/unittest/disaggregated/test_disaggregated_params.py, tests/unittest/disaggregated/test_disagg_utils.py
Request exposes num_input_tokens, DisaggregatedParams carries conversation_id, and disaggregation/config helpers thread cache_salt_id, gen_strip_message_history, and tokens_per_block through the public surfaces.
Distributed execution and ADP plumbing
tensorrt_llm/_torch/distributed/communicator.py, tensorrt_llm/_torch/disaggregation/native/mixers/attention/peer.py, tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py, tests/unittest/disaggregated/test_peer.py
safe_broadcast, peer tokens_per_block checks, cache-salt probing, and PyExecutor timeout/metrics buffering update the distributed execution path.
ADP routing and scheduling
tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py, tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py, tests/unittest/_torch/executor/test_adp_router.py
ADP routing adds shared token counting, conversation-aware placement, in-transfer accounting, cache-salt-aware prefix matching, and first-token prioritization.
Serve router KV-cache state
tensorrt_llm/serve/router.py, tensorrt_llm/serve/openai_server.py, tests/unittest/disaggregated/test_router.py, tests/unittest/disaggregated/test_openai_server_info.py
Serve routing hashes KV blocks per algorithm, tracks routed blocks, coalesces polling, and reports KV-cache settings in /server_info.
OpenAI disagg request flow
tensorrt_llm/serve/openai_client.py, tensorrt_llm/serve/openai_disagg_service.py, tests/unittest/disaggregated/test_disagg_openai_client.py, tests/unittest/disaggregated/test_openai_disagg_service.py
OpenAI disaggregation carries request success state, retries, usage rewriting, and message trimming.
Integration config and matrix updates
tests/integration/defs/accuracy/test_disaggregated_serving.py, tests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3_v2.yaml, tests/integration/defs/disaggregated/test_disaggregated.py, tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/test_lists/test-db/l0_b200.yml, tests/integration/test_lists/test-db/l0_dgx_h100.yml
New conditional V2 integration config and test selections add DeepSeek V3 Lite generation-first coverage.

Sequence Diagram(s)

sequenceDiagram
  participant OpenAIHttpClient
  participant OpenAIDisaggregatedService
  participant KvCacheAwareRouter
  participant KvCacheAwareServerState
  OpenAIHttpClient->>OpenAIDisaggregatedService: send request / receive stream
  OpenAIDisaggregatedService->>OpenAIDisaggregatedService: update disagg_request_id
  OpenAIDisaggregatedService->>OpenAIDisaggregatedService: rewrite usage or trim messages
  OpenAIDisaggregatedService->>KvCacheAwareRouter: finish_request(request, success)
  KvCacheAwareRouter->>KvCacheAwareServerState: apply routed blocks / poll KV events
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Suggested reviewers

  • yuxianq
  • byshiue
  • xinhe-nv
  • pcastonguay
  • Superjomn
  • juney-nvidia
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.30% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and matches the PR’s main focus on disaggregated routing improvements.
Description check ✅ Passed The description follows the repository template with Description, Test Coverage, and PR Checklist sections filled in.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/serve/openai_disagg_service.py (1)

177-180: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Return the generation response through the usage rewriter.

The new _rewrite_disagg_usage() pipeline is never applied on the ctx-first generation path, so client-visible prompt_tokens, total_tokens, and cached_tokens can remain generation-worker values instead of the context response’s accounting.

Proposed fix
             gen_response = await self._gen_client.send_request(
                 gen_req, server=gen_server, hooks=hooks
             )
-            return gen_response
+            return self._rewrite_disagg_usage(gen_response, ctx_response)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/serve/openai_disagg_service.py` around lines 177 - 180, The
ctx-first generation path in self._gen_client.send_request currently returns
gen_response directly, so the new _rewrite_disagg_usage() accounting is skipped.
Update the generation flow in openai_disagg_service to pass the generation
response through _rewrite_disagg_usage() before returning it, using the existing
ctx response/accounting context so prompt_tokens, total_tokens, and
cached_tokens are rewritten consistently.
tensorrt_llm/serve/openai_client.py (1)

205-207: 🗄️ Data Integrity & Integration | 🟠 Major

Defer streaming cleanup until retries are exhausted
tensorrt_llm/serve/openai_client.py:205-207, 230-260, 276-319 _response_generator() finishes the request on every attempt, but _post_with_retry() can still retry before any bytes are yielded. That can clear router state on a failed first attempt and leave load/affinity bookkeeping wrong for the retry. Keep success=False by default and only mark success once the stream fully drains; cancellation/GeneratorExit currently bypasses except Exception and otherwise gets reported as success.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/serve/openai_client.py` around lines 205 - 207, The retry path
in openai_client.py is finalizing streaming too early in _response_generator and
_post_with_retry, which can clear router state before retries are exhausted.
Update _response_generator so success stays false by default and only flips to
true after the stream fully drains, and make sure cleanup/finalization only
happens once the last attempt is complete. Also handle cancellation and
GeneratorExit explicitly around _response_generator and _post_with_retry so they
are not treated as successful responses.
tensorrt_llm/_torch/disaggregation/native/mixers/attention/peer.py (1)

338-370: 🗄️ Data Integrity & Integration | 🟠 Major

Keep tokens_per_block equality required. The transfer path still aligns blocks and computes mapper offsets from a single block size, so divisible-but-unequal peers can mis-copy KV.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/disaggregation/native/mixers/attention/peer.py` around
lines 338 - 370, The peer compatibility check is too permissive for
tokens_per_block and must require exact equality instead of only allowing
divisible mismatches. Update _tpb_check in AttentionPolicy to reject any
local/peer tokens_per_block difference and have check_peer_compatible fail when
values differ, since the transfer path and mapper offsets assume a single shared
block size. Keep the existing structure in _tpb_check and check_peer_compatible,
but change the logic so unequal block sizes are incompatible.
tensorrt_llm/serve/router.py (1)

820-830: 🔒 Security & Privacy | 🟠 Major

trust_remote_code=True is unsafe for request-supplied model IDs.

_tokenize() passes request.model directly into _get_tokenizer(), and TransformersTokenizer.from_pretrained(..., trust_remote_code=True) can execute arbitrary code from a model repo. ChatCompletionRequest.model is just a free-form string here, and I don’t see any server-side restriction to the served model name.

Gate this on operator config and/or restrict tokenization to the configured model.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/serve/router.py` around lines 820 - 830, The tokenizer loading
path in _get_tokenizer currently trusts request-supplied model IDs and always
enables trust_remote_code, which is unsafe. Update Router._get_tokenizer and the
_tokenize call path so tokenization is only allowed for the server-configured
served model or another explicitly whitelisted model, and make trust_remote_code
controlled by operator configuration rather than hardcoded. Keep the custom
tokenizer branch unchanged, but guard the TransformersTokenizer.from_pretrained
path with a restriction tied to the configured model name.

Source: Linters/SAST tools

🧹 Nitpick comments (4)
tests/integration/defs/disaggregated/test_disaggregated.py (1)

1741-1758: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Test never validates the bypass signal it claims to check.

The comments state perf_metrics should grow sub-linearly when bypass works, and last_total is recorded across invocations, but it is only assigned — never compared. The sole assertion (assert metrics or last_total[0] > 0) just checks non-emptiness, so it passes even if the bypass logic is completely broken (i.e. one entry per request). As written this is effectively a smoke check, not a bypass-signal regression test.

Consider asserting the actual invariant, e.g. recorded entries are strictly fewer than total requests issued (num_iters * prompts_per_iter), which is what "bypassed requests are absent" implies. Also note the first invocation can spuriously fail if /perf_metrics is momentarily empty because the gen request hasn't finished.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/disaggregated/test_disaggregated.py` around lines 1741
- 1758, The bypass regression check in _check_bypass_signal only verifies that
/perf_metrics is non-empty, so it never proves bypassed requests are excluded.
Update the test in test_disaggregated.py to assert the real invariant by
comparing the collected metrics count against the total requests issued (for
example, using the existing last_total tracking or the known num_iters *
prompts_per_iter bound) and ensure the check fails when entries grow
one-per-request. Keep the helper’s logger output and last_total state in
_check_bypass_signal, but replace the smoke-style assertion with one that
validates sub-linear growth after the first successful run.
tests/unittest/disaggregated/test_openai_disagg_service.py (1)

140-200: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Add ctx-first usage rewrite regression coverage.

Coverage is insufficient for the new usage-rewrite flow: these tests validate prompt-token plumbing, but not that _send_disagg_request_ctx_first() returns a generation response with ctx-derived prompt_tokens, total_tokens, and cached_tokens for non-streaming and SSE responses.

As per path instructions, tests/**: “Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/disaggregated/test_openai_disagg_service.py` around lines 140
- 200, The current tests cover prompt-token propagation, but they do not
exercise the ctx-first usage rewrite path. Add regression coverage around
_send_disagg_request_ctx_first() to verify that when the context response is
rewritten into a generation response, the returned usage includes ctx-derived
prompt_tokens, total_tokens, and cached_tokens for both non-streaming and SSE
flows. Use the existing OpenAIServer and _send_disagg_request_ctx_first symbols,
and assert the rewritten response payload/usage rather than only
prompt_token_ids plumbing.

Source: Path instructions

tests/unittest/disaggregated/test_disagg_openai_client.py (1)

480-564: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Add streaming retry coverage for router finalization.

Coverage is insufficient for the changed streaming path: these retry tests all use stream=False. Please add cases in tests/unittest/disaggregated/test_disagg_openai_client.py where an SSE request fails before yielding and then succeeds, and where it finally fails, asserting finish_request(..., success=...) is called exactly for the terminal outcome.

As per path instructions, tests/**: “Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/disaggregated/test_disagg_openai_client.py` around lines 480 -
564, The existing retry coverage in test_disagg_openai_client only exercises
non-streaming requests, so add streaming SSE retry tests for the router
finalization path. Use the send_request flow with stream=True and cover one case
where the request fails before yielding and then succeeds, and another where it
ultimately fails after retries. Verify finish_request is called only once for
the terminal outcome with the correct success value, and reference the existing
helper methods like _make_client, _make_request, send_request, and
finish_request to place the new assertions alongside the current retry tests.

Source: Path instructions

tests/unittest/disaggregated/test_peer.py (1)

408-435: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Coverage for the divisible/non-divisible relaxation is solid (positive + negative through the register() contract). As a QA follow-up, consider adding a case where the local TPB is the larger value (e.g. local=32, peer=16) to lock in the max/min direction-symmetry of _tpb_check, plus an equal-TPB baseline asserting no warning. Optional — current tests are sufficient to merge.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/disaggregated/test_peer.py` around lines 408 - 435, Add
coverage around `_tpb_check` in
`test_peer_registrar_tpb_divisible_warns_but_compatible` /
`test_peer_registrar_tpb_not_divisible_raises` by adding a case where the local
tokens_per_block is larger than the peer’s (for example local=32, peer=16) to
verify the divisible logic is symmetric regardless of max/min ordering, and add
an equal-tokens_per_block baseline to assert `register()` succeeds without
warning; keep the checks anchored on `_make_peer_registrar` and `register()` so
the behavior remains locked in if implementation details change.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 678-684: The dummy probe request construction in CppLlmRequest is
using the wrong keyword for the salt field, so update the resource_manager flow
that builds dummy_req to pass cache_salt instead of cache_salt_id (or otherwise
align it with the binding’s accepted kwarg). Verify the CppLlmRequest call site
in resource_manager uses the same naming as the binding signature so salted
probes do not fail with a TypeError.

In `@tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py`:
- Around line 867-871: The get_relax_value helper in adp_router.py should
normalize attention_dp_relax the same way as the other routers: read
py_scheduling_params from req_item.request, return True when scheduling_params
is missing or attention_dp_relax is None, and otherwise return the boolean
value. Update the logic inside get_relax_value so sorted(...,
key=get_relax_value) never compares None with bool and instead always receives a
normalized boolean key.

In `@tensorrt_llm/llmapi/llm_args.py`:
- Around line 1245-1251: The kv_cache_routing_max_sessions field in llm_args.py
currently accepts invalid non-positive values, which makes the LRU cap
nonsensical; update the llm_args model to use PositiveInt for
kv_cache_routing_max_sessions in the same Field definition so validation rejects
0 and negatives while preserving the existing default and description.

---

Outside diff comments:
In `@tensorrt_llm/_torch/disaggregation/native/mixers/attention/peer.py`:
- Around line 338-370: The peer compatibility check is too permissive for
tokens_per_block and must require exact equality instead of only allowing
divisible mismatches. Update _tpb_check in AttentionPolicy to reject any
local/peer tokens_per_block difference and have check_peer_compatible fail when
values differ, since the transfer path and mapper offsets assume a single shared
block size. Keep the existing structure in _tpb_check and check_peer_compatible,
but change the logic so unequal block sizes are incompatible.

In `@tensorrt_llm/serve/openai_client.py`:
- Around line 205-207: The retry path in openai_client.py is finalizing
streaming too early in _response_generator and _post_with_retry, which can clear
router state before retries are exhausted. Update _response_generator so success
stays false by default and only flips to true after the stream fully drains, and
make sure cleanup/finalization only happens once the last attempt is complete.
Also handle cancellation and GeneratorExit explicitly around _response_generator
and _post_with_retry so they are not treated as successful responses.

In `@tensorrt_llm/serve/openai_disagg_service.py`:
- Around line 177-180: The ctx-first generation path in
self._gen_client.send_request currently returns gen_response directly, so the
new _rewrite_disagg_usage() accounting is skipped. Update the generation flow in
openai_disagg_service to pass the generation response through
_rewrite_disagg_usage() before returning it, using the existing ctx
response/accounting context so prompt_tokens, total_tokens, and cached_tokens
are rewritten consistently.

In `@tensorrt_llm/serve/router.py`:
- Around line 820-830: The tokenizer loading path in _get_tokenizer currently
trusts request-supplied model IDs and always enables trust_remote_code, which is
unsafe. Update Router._get_tokenizer and the _tokenize call path so tokenization
is only allowed for the server-configured served model or another explicitly
whitelisted model, and make trust_remote_code controlled by operator
configuration rather than hardcoded. Keep the custom tokenizer branch unchanged,
but guard the TransformersTokenizer.from_pretrained path with a restriction tied
to the configured model name.

---

Nitpick comments:
In `@tests/integration/defs/disaggregated/test_disaggregated.py`:
- Around line 1741-1758: The bypass regression check in _check_bypass_signal
only verifies that /perf_metrics is non-empty, so it never proves bypassed
requests are excluded. Update the test in test_disaggregated.py to assert the
real invariant by comparing the collected metrics count against the total
requests issued (for example, using the existing last_total tracking or the
known num_iters * prompts_per_iter bound) and ensure the check fails when
entries grow one-per-request. Keep the helper’s logger output and last_total
state in _check_bypass_signal, but replace the smoke-style assertion with one
that validates sub-linear growth after the first successful run.

In `@tests/unittest/disaggregated/test_disagg_openai_client.py`:
- Around line 480-564: The existing retry coverage in test_disagg_openai_client
only exercises non-streaming requests, so add streaming SSE retry tests for the
router finalization path. Use the send_request flow with stream=True and cover
one case where the request fails before yielding and then succeeds, and another
where it ultimately fails after retries. Verify finish_request is called only
once for the terminal outcome with the correct success value, and reference the
existing helper methods like _make_client, _make_request, send_request, and
finish_request to place the new assertions alongside the current retry tests.

In `@tests/unittest/disaggregated/test_openai_disagg_service.py`:
- Around line 140-200: The current tests cover prompt-token propagation, but
they do not exercise the ctx-first usage rewrite path. Add regression coverage
around _send_disagg_request_ctx_first() to verify that when the context response
is rewritten into a generation response, the returned usage includes ctx-derived
prompt_tokens, total_tokens, and cached_tokens for both non-streaming and SSE
flows. Use the existing OpenAIServer and _send_disagg_request_ctx_first symbols,
and assert the rewritten response payload/usage rather than only
prompt_token_ids plumbing.

In `@tests/unittest/disaggregated/test_peer.py`:
- Around line 408-435: Add coverage around `_tpb_check` in
`test_peer_registrar_tpb_divisible_warns_but_compatible` /
`test_peer_registrar_tpb_not_divisible_raises` by adding a case where the local
tokens_per_block is larger than the peer’s (for example local=32, peer=16) to
verify the divisible logic is symmetric regardless of max/min ordering, and add
an equal-tokens_per_block baseline to assert `register()` succeeds without
warning; keep the checks anchored on `_make_peer_registrar` and `register()` so
the behavior remains locked in if implementation details change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4489bc3a-7e1e-43f1-be0b-c52a224ed3ef

📥 Commits

Reviewing files that changed from the base of the PR and between a02214a and e85c5d5.

📒 Files selected for processing (33)
  • cpp/include/tensorrt_llm/executor/executor.h
  • cpp/tensorrt_llm/executor/request.cpp
  • cpp/tensorrt_llm/executor/requestImpl.h
  • cpp/tensorrt_llm/nanobind/executor/request.cpp
  • tensorrt_llm/_torch/disaggregation/native/mixers/attention/peer.py
  • tensorrt_llm/_torch/distributed/communicator.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
  • tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py
  • tensorrt_llm/disaggregated_params.py
  • tensorrt_llm/inputs/utils.py
  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/serve/openai_client.py
  • tensorrt_llm/serve/openai_disagg_service.py
  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/serve/router.py
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/integration/defs/disaggregated/test_configs/disagg_config_conditional_deepseek_v3_v2.yaml
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/test-db/l0_b200.yml
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/unittest/_torch/executor/test_adp_router.py
  • tests/unittest/disaggregated/test_disagg_openai_client.py
  • tests/unittest/disaggregated/test_disagg_utils.py
  • tests/unittest/disaggregated/test_disaggregated_params.py
  • tests/unittest/disaggregated/test_openai_disagg_service.py
  • tests/unittest/disaggregated/test_openai_server_info.py
  • tests/unittest/disaggregated/test_peer.py
  • tests/unittest/disaggregated/test_router.py

Comment on lines 678 to +684
dummy_req = CppLlmRequest(request_id=0,
max_new_tokens=0,
input_tokens=input_tokens,
sampling_config=SamplingConfig(),
is_streaming=False,
lora_task_id=lora_task_id)
lora_task_id=lora_task_id,
cache_salt_id=cache_salt_id)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm the C++ LlmRequest binding exposes cache_salt_id as a ctor arg.
rg -nP -C3 'cache_salt_id' cpp/ | rg -i 'llmRequest|arg\(|init'
echo "----- binding init for LlmRequest -----"
rg -nP --type=cpp -C2 'nb::arg\("cache_salt_id"\)|py::arg\("cache_salt_id"\)' cpp/

Repository: NVIDIA/TensorRT-LLM

Length of output: 197


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the CppLlmRequest/LlmRequest binding and inspect its constructor signature.
git ls-files | rg 'resource_manager\.py|batch_manager|llm_request|request'
echo "----- search for CppLlmRequest / LlmRequest binding -----"
rg -n --hidden --glob '!**/.git/**' 'CppLlmRequest|class LlmRequest|nb::class_|py::class_' .

Repository: NVIDIA/TensorRT-LLM

Length of output: 29024


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the Python file around the reviewed line and nearby request creation helpers.
wc -l tensorrt_llm/_torch/pyexecutor/resource_manager.py
sed -n '630,710p' tensorrt_llm/_torch/pyexecutor/resource_manager.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 4342


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for the request class definition and any constructor arguments that may include cache_salt_id.
rg -n --hidden --glob '!**/.git/**' -C 3 'cache_salt_id|CppLlmRequest|LlmRequest' tensorrt_llm cpp bindings .

Repository: NVIDIA/TensorRT-LLM

Length of output: 50376


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "----- constructor bindings for cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp -----"
sed -n '1,260p' cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp

echo "----- class declaration / constructors in cpp/include/tensorrt_llm/batch_manager/llmRequest.h -----"
rg -n -C 3 'LlmRequest\(' cpp/include/tensorrt_llm/batch_manager/llmRequest.h cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp

Repository: NVIDIA/TensorRT-LLM

Length of output: 12845


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "----- cache_salt_id references in batch_manager request implementation -----"
rg -n -C 2 'cache_salt_id' cpp/include/tensorrt_llm/batch_manager/llmRequest.h \
  cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp \
  cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp \
  tensorrt_llm/_torch/pyexecutor/llm_request.py

echo "----- relevant ctor overloads in cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp -----"
rg -n -C 4 'nb::init|def\("__init__"|py::init|constructor' cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp

Repository: NVIDIA/TensorRT-LLM

Length of output: 234


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "----- binding section around tb::LlmRequest in cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp -----"
sed -n '320,380p' cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp

echo "----- constructor declarations in cpp/include/tensorrt_llm/batch_manager/llmRequest.h -----"
rg -n -C 6 'LlmRequest\(' cpp/include/tensorrt_llm/batch_manager/llmRequest.h

echo "----- cache_salt_id in cpp/include/tensorrt_llm/batch_manager/llmRequest.h -----"
rg -n -C 3 'cache_salt_id' cpp/include/tensorrt_llm/batch_manager/llmRequest.h

Repository: NVIDIA/TensorRT-LLM

Length of output: 10244


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the tail of the tb::LlmRequest nanobind constructor to see whether cache_salt_id is a named parameter.
sed -n '380,470p' cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp

# Show the exact constructor tail in the C++ header as well.
sed -n '2397,2483p' cpp/include/tensorrt_llm/batch_manager/llmRequest.h

Repository: NVIDIA/TensorRT-LLM

Length of output: 15002


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find any explicit cache_salt_id mention in the LlmRequest binding and nearby lines.
rg -n -C 2 'cache_salt_id|cacheSalt|cacheSaltId' cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp cpp/include/tensorrt_llm/batch_manager/llmRequest.h

Repository: NVIDIA/TensorRT-LLM

Length of output: 3563


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the constructor lambda tail where cache_salt_id would appear.
sed -n '430,520p' cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp

# A precise grep for the final constructor arguments around cache salt.
rg -n -C 4 'multimodal_run_lengths|cache_salt|cacheSalt' cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp

Repository: NVIDIA/TensorRT-LLM

Length of output: 10912


Use the binding’s cache_salt kwarg here
CppLlmRequest exposes cache_salt, not cache_salt_id; this call will raise TypeError on salted probes unless the kwarg name is aligned or aliased.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 678 - 684,
The dummy probe request construction in CppLlmRequest is using the wrong keyword
for the salt field, so update the resource_manager flow that builds dummy_req to
pass cache_salt instead of cache_salt_id (or otherwise align it with the
binding’s accepted kwarg). Verify the CppLlmRequest call site in
resource_manager uses the same naming as the binding signature so salted probes
do not fail with a TypeError.

Comment on lines +867 to +871
def get_relax_value(req_item):
scheduling_params = getattr(req_item.request, "py_scheduling_params", None)
if scheduling_params is None:
return True
return scheduling_params.attention_dp_relax

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check whether attention_dp_relax can ever be None in practice.
rg -nP --type=py 'attention_dp_relax' -C2

Repository: NVIDIA/TensorRT-LLM

Length of output: 157


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant router implementation and compare it with the other routers.
git ls-files | rg '(^|/)CODING_GUIDELINES\.md$|tensorrt_llm/_torch/pyexecutor/scheduler/adp_router\.py$|tensorrt_llm/_torch/pyexecutor/scheduler/.*router.*\.py$'
echo '--- outline: adp_router.py ---'
ast-grep outline tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py --view expanded || true
echo '--- search for attention_dp_relax across repo ---'
rg -n --hidden --glob '!**/.git/**' 'attention_dp_relax|py_scheduling_params' tensorrt_llm/_torch -S || true

Repository: NVIDIA/TensorRT-LLM

Length of output: 4344


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant router implementation and compare it with the other routers.
git ls-files | rg '(^|/)CODING_GUIDELINES\.md$|tensorrt_llm/_torch/pyexecutor/scheduler/adp_router\.py$|tensorrt_llm/_torch/pyexecutor/scheduler/.*router.*\.py$'
echo '--- outline: adp_router.py ---'
ast-grep outline tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py --view expanded || true
echo '--- search for attention_dp_relax across repo ---'
rg -n --hidden --glob '!**/.git/**' 'attention_dp_relax|py_scheduling_params' tensorrt_llm/_torch -S || true

Repository: NVIDIA/TensorRT-LLM

Length of output: 4344


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- relevant slice of adp_router.py ---'
sed -n '840,890p' tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py

echo '--- other router definitions of relax handling ---'
rg -n -C 3 'attention_dp_relax|get_relax_value|True if val is None else val' tensorrt_llm/_torch/pyexecutor/scheduler -S

Repository: NVIDIA/TensorRT-LLM

Length of output: 6793


Normalize attention_dp_relax here as well.

This helper should mirror the other routers and coerce None to True; otherwise sorted(..., key=get_relax_value) can end up comparing None with bool values and raise TypeError.

Suggested fix
         def get_relax_value(req_item):
             scheduling_params = getattr(req_item.request, "py_scheduling_params", None)
             if scheduling_params is None:
                 return True
-            return scheduling_params.attention_dp_relax
+            val = scheduling_params.attention_dp_relax
+            return True if val is None else val
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_relax_value(req_item):
scheduling_params = getattr(req_item.request, "py_scheduling_params", None)
if scheduling_params is None:
return True
return scheduling_params.attention_dp_relax
def get_relax_value(req_item):
scheduling_params = getattr(req_item.request, "py_scheduling_params", None)
if scheduling_params is None:
return True
val = scheduling_params.attention_dp_relax
return True if val is None else val
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py` around lines 867 -
871, The get_relax_value helper in adp_router.py should normalize
attention_dp_relax the same way as the other routers: read py_scheduling_params
from req_item.request, return True when scheduling_params is missing or
attention_dp_relax is None, and otherwise return the boolean value. Update the
logic inside get_relax_value so sorted(..., key=get_relax_value) never compares
None with bool and instead always receives a normalized boolean key.

Comment on lines +1245 to +1251
kv_cache_routing_max_sessions: int = Field(
default=65536,
description=
"LRU cap on the conversation->rank map used by conversation-affinity "
"routing. The oldest conversations are evicted once more than this many "
"are tracked, bounding memory on long-running servers. Only used when "
"kv_cache_routing_conversation_affinity is True.")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -nP '\bPositiveInt\b' tensorrt_llm/llmapi/llm_args.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1269


Constrain kv_cache_routing_max_sessions to a positive value.

0 or a negative cap would be nonsensical for an LRU limit. Use PositiveInt for this field.

♻️ Suggested change
-    kv_cache_routing_max_sessions: int = Field(
+    kv_cache_routing_max_sessions: PositiveInt = Field(
         default=65536,
         description=
         "LRU cap on the conversation->rank map used by conversation-affinity "
         "routing. The oldest conversations are evicted once more than this many "
         "are tracked, bounding memory on long-running servers. Only used when "
         "kv_cache_routing_conversation_affinity is True.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
kv_cache_routing_max_sessions: int = Field(
default=65536,
description=
"LRU cap on the conversation->rank map used by conversation-affinity "
"routing. The oldest conversations are evicted once more than this many "
"are tracked, bounding memory on long-running servers. Only used when "
"kv_cache_routing_conversation_affinity is True.")
kv_cache_routing_max_sessions: PositiveInt = Field(
default=65536,
description=
"LRU cap on the conversation->rank map used by conversation-affinity "
"routing. The oldest conversations are evicted once more than this many "
"are tracked, bounding memory on long-running servers. Only used when "
"kv_cache_routing_conversation_affinity is True.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/llmapi/llm_args.py` around lines 1245 - 1251, The
kv_cache_routing_max_sessions field in llm_args.py currently accepts invalid
non-positive values, which makes the LRU cap nonsensical; update the llm_args
model to use PositiveInt for kv_cache_routing_max_sessions in the same Field
definition so validation rejects 0 and negatives while preserving the existing
default and description.

Sources: Coding guidelines, Path instructions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants