[None][fix] Align GPTOSS router tokenization and disagg draft scheduling#15605
[None][fix] Align GPTOSS router tokenization and disagg draft scheduling#15605SimengLiu-nv wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughPyExecutor now synchronizes disaggregated draft-token fields before batch scheduling and uses the synchronized count in scheduled-token estimation. Router tokenization can select Harmony from ChangesDraft token synchronization
Harmony tokenization routing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/serve/router.py (1)
772-785: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winForward all worker-side chat-template inputs into router hashing.
The router now mirrors
tools, but the worker also renders withrequest.documentsandrequest.chat_template. Requests that differ only by RAG documents or a custom template can still produce identical router hashes while generating different worker prompts.Suggested fix
chat_template_kwargs = dict(request.chat_template_kwargs or {}) chat_template_kwargs["tools"] = self._tool_dicts(request) +chat_template_kwargs["documents"] = request.documents +if request.chat_template is not None: + chat_template_kwargs["chat_template"] = request.chat_template result = tokenizer.apply_chat_template(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/serve/router.py` around lines 772 - 785, Forward the remaining worker-side chat-template inputs into the router hash computation in router.py’s tokenizer.apply_chat_template path: the current hash only mirrors tools via chat_template_kwargs, but it also needs request.documents and request.chat_template so router hashing matches the worker-rendered prompt. Update the chat_template_kwargs assembly in the same block that builds result to include these request fields before calling apply_chat_template, ensuring requests that differ by RAG documents or a custom template no longer collide.
🧹 Nitpick comments (3)
tests/unittest/disaggregated/test_router.py (2)
1118-1181: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winCover the disabled-Harmony selection path.
The selection tests cover explicit
use_harmony=Trueandconfig.jsoninference, but not the server-compatible disabled path. Add amonkeypatch.setenv("DISABLE_HARMONY_ADAPTER", "1")case intests/unittest/disaggregated/test_router.pyasserting GPT-OSS requests use the tokenizer path and do not call Harmony. As per path instructions, coverage needs follow-up in this PR.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/disaggregated/test_router.py` around lines 1118 - 1181, Add a test case in test_use_harmony_flag_for_alias_model or a nearby test in KvCacheAwareRouter coverage that sets DISABLE_HARMONY_ADAPTER via monkeypatch before calling _tokenize; verify the disabled-Harmony path uses tokenizer.apply_chat_template, does not invoke harmony_adapter.get_harmony_adapter or openai_to_harmony_tokens, and leaves req.prompt_token_ids populated as expected for GPT-OSS/server-compatible requests. Use the existing test helpers and symbols like KvCacheAwareRouter, _tokenize, and ChatCompletionRequest to keep the new case aligned with the current Harmony selection tests.Source: Path instructions
1037-1117: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winAdd parity cases for
documentsand customchat_template.Coverage is insufficient for the remaining worker-side template inputs. Extend
tests/unittest/disaggregated/test_router.pywith parametrized router/server parity cases whererequest.documentsandrequest.chat_templateare set, so cache-hash tokenization stays aligned with serving tokenization. As per path instructions, this names the concrete test file and coverage gap.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/disaggregated/test_router.py` around lines 1037 - 1117, Add router/server parity coverage for the remaining worker-side template inputs by extending the existing `test_gpt_oss_router_and_server_create_same_tokenized_input` in `test_router.py`. Parametrize new cases where `ChatCompletionRequest.documents` is populated and where a custom `chat_template` is set, then verify `KvCacheAwareRouter._tokenize` and `OpenAIServer.chat_harmony` produce identical tokenized inputs. Reuse the current `fake_harmony_tokens`, `harmony.openai_to_harmony_tokens`, and `generate_async` assertions so the new cases exercise the same cache-hash/tokenization path.Source: Path instructions
tests/unittest/_torch/executor/test_py_executor.py (1)
378-408: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winCoverage is insufficient for the executor integration paths.
tests/unittest/_torch/executor/test_py_executor.pynow covers the static helper directly, but should also cover: (1) trans-completedraft_tokens=Nonewithenable_spec_decode=True/max_total_draft_tokens > 0, and (2) the PP scheduling path that bypasses_prepare_and_schedule_batch. As per path instructions, tests undertests/**should be reviewed for actionable coverage sufficiency.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_py_executor.py` around lines 378 - 408, The executor tests only cover the helper directly and miss key integration coverage. Add a test around trans-complete requests with draft_tokens=None when enable_spec_decode is true and max_total_draft_tokens is positive, and add a PP scheduling test that exercises the path bypassing _prepare_and_schedule_batch. Use the existing PyExecutor and _sync_disagg_generation_trans_complete_draft_tokens / _compute_scheduled_tokens helpers to locate the relevant flow, but assert behavior through the higher-level scheduling integration rather than only the static helper.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2617-2633: The draft-token sync in
_sync_disagg_generation_trans_complete_draft_tokens is using a different draft
count rule than _prepare_disagg_gen_transmission_complete, which can undercount
capacity when spec decode is enabled. Update the transmission-complete sync to
use the same draft-token resolver/logic as
_prepare_disagg_gen_transmission_complete, and move this sync so it runs after
the spec-decode decision is applied. Keep the behavior aligned for empty or None
context drafts and ensure py_draft_tokens, draft_tokens, and
py_draft_pages_allocated are derived from the shared resolution path.
- Around line 2651-2652: The PP executor path is missing the same disagg
generation transition sync that `_prepare_and_schedule_batch` already performs,
so `pp_size > 1` can schedule requests with stale draft-token state. Add the
`_sync_disagg_generation_trans_complete_draft_tokens` call in
`_executor_loop_pp` before it invokes `_pp_schedule_and_propagate()` or
`_schedule()`, mirroring the existing behavior in `_prepare_and_schedule_batch`
and using `self.active_requests` there as the source.
In `@tensorrt_llm/serve/router.py`:
- Around line 736-740: The Harmony tokenization decision in
`_uses_harmony_tokenization` is still defaulting GPT-OSS models to Harmony even
when the server has disabled it. Update `KvCacheAwareRouter` to use the same
resolved `use_harmony` value that `OpenAIServer` derives from
`DISABLE_HARMONY_ADAPTER`, or otherwise apply that env gate here, so router KV
hashing and worker chat template rendering stay aligned.
---
Outside diff comments:
In `@tensorrt_llm/serve/router.py`:
- Around line 772-785: Forward the remaining worker-side chat-template inputs
into the router hash computation in router.py’s tokenizer.apply_chat_template
path: the current hash only mirrors tools via chat_template_kwargs, but it also
needs request.documents and request.chat_template so router hashing matches the
worker-rendered prompt. Update the chat_template_kwargs assembly in the same
block that builds result to include these request fields before calling
apply_chat_template, ensuring requests that differ by RAG documents or a custom
template no longer collide.
---
Nitpick comments:
In `@tests/unittest/_torch/executor/test_py_executor.py`:
- Around line 378-408: The executor tests only cover the helper directly and
miss key integration coverage. Add a test around trans-complete requests with
draft_tokens=None when enable_spec_decode is true and max_total_draft_tokens is
positive, and add a PP scheduling test that exercises the path bypassing
_prepare_and_schedule_batch. Use the existing PyExecutor and
_sync_disagg_generation_trans_complete_draft_tokens / _compute_scheduled_tokens
helpers to locate the relevant flow, but assert behavior through the
higher-level scheduling integration rather than only the static helper.
In `@tests/unittest/disaggregated/test_router.py`:
- Around line 1118-1181: Add a test case in
test_use_harmony_flag_for_alias_model or a nearby test in KvCacheAwareRouter
coverage that sets DISABLE_HARMONY_ADAPTER via monkeypatch before calling
_tokenize; verify the disabled-Harmony path uses tokenizer.apply_chat_template,
does not invoke harmony_adapter.get_harmony_adapter or openai_to_harmony_tokens,
and leaves req.prompt_token_ids populated as expected for
GPT-OSS/server-compatible requests. Use the existing test helpers and symbols
like KvCacheAwareRouter, _tokenize, and ChatCompletionRequest to keep the new
case aligned with the current Harmony selection tests.
- Around line 1037-1117: Add router/server parity coverage for the remaining
worker-side template inputs by extending the existing
`test_gpt_oss_router_and_server_create_same_tokenized_input` in
`test_router.py`. Parametrize new cases where `ChatCompletionRequest.documents`
is populated and where a custom `chat_template` is set, then verify
`KvCacheAwareRouter._tokenize` and `OpenAIServer.chat_harmony` produce identical
tokenized inputs. Reuse the current `fake_harmony_tokens`,
`harmony.openai_to_harmony_tokens`, and `generate_async` assertions so the new
cases exercise the same cache-hash/tokenization path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1001c8fd-15d1-46f1-9553-f45f1bea19cf
📒 Files selected for processing (4)
tensorrt_llm/_torch/pyexecutor/py_executor.pytensorrt_llm/serve/router.pytests/unittest/_torch/executor/test_py_executor.pytests/unittest/disaggregated/test_router.py
| @staticmethod | ||
| def _sync_disagg_generation_trans_complete_draft_tokens( | ||
| requests: Iterable[LlmRequest]) -> None: | ||
| for request in requests: | ||
| if not getattr(request, | ||
| "is_disagg_generation_transmission_complete", False): | ||
| continue | ||
|
|
||
| context_phase_params = request.context_phase_params | ||
| if context_phase_params is None: | ||
| continue | ||
|
|
||
| draft_tokens = context_phase_params.draft_tokens | ||
| request.py_draft_tokens = [] if draft_tokens is None else list( | ||
| draft_tokens) | ||
| request.draft_tokens = request.py_draft_tokens | ||
| request.py_draft_pages_allocated = len(request.py_draft_tokens) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Use the same draft-token resolver before scheduling and transmission-complete setup.
Line 2630 treats None/empty ctx drafts as zero scheduled draft tokens, but Line 4478 later turns that same no-draft case into max_total_draft_tokens dummy drafts when model_engine.enable_spec_decode is true. That can undercount batch capacity before the request prepares more decode tokens. Move this sync after the current spec-decode decision and share the same resolver with _prepare_disagg_gen_transmission_complete.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 2617 - 2633, The
draft-token sync in _sync_disagg_generation_trans_complete_draft_tokens is using
a different draft count rule than _prepare_disagg_gen_transmission_complete,
which can undercount capacity when spec decode is enabled. Update the
transmission-complete sync to use the same draft-token resolver/logic as
_prepare_disagg_gen_transmission_complete, and move this sync so it runs after
the spec-decode decision is applied. Keep the behavior aligned for empty or None
context drafts and ensure py_draft_tokens, draft_tokens, and
py_draft_pages_allocated are derived from the shared resolution path.
| self._sync_disagg_generation_trans_complete_draft_tokens( | ||
| self.active_requests) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Mirror this sync in the PP scheduling path.
Line 2651 only covers _prepare_and_schedule_batch; _executor_loop_pp bypasses that helper and calls _pp_schedule_and_propagate() / _schedule() directly, so pp_size > 1 disagg generation requests can still schedule with stale draft-token fields.
Suggested placement in `_executor_loop_pp`
if self.kv_cache_transceiver:
self._check_disagg_ctx_schedulable_status(new_requests)
self._check_disagg_gen_transfer_status()
+ self._sync_disagg_generation_trans_complete_draft_tokens(
+ self.active_requests)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 2651 - 2652, The
PP executor path is missing the same disagg generation transition sync that
`_prepare_and_schedule_batch` already performs, so `pp_size > 1` can schedule
requests with stale draft-token state. Add the
`_sync_disagg_generation_trans_complete_draft_tokens` call in
`_executor_loop_pp` before it invokes `_pp_schedule_and_propagate()` or
`_schedule()`, mirroring the existing behavior in `_prepare_and_schedule_batch`
and using `self.active_requests` there as the source.
| def _uses_harmony_tokenization(self, | ||
| request: ChatCompletionRequest) -> bool: | ||
| if self._use_harmony is not None: | ||
| return self._use_harmony | ||
| return self._get_model_type(request.model) == "gpt_oss" |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Honor the server-side Harmony disable gate.
OpenAIServer resolves DISABLE_HARMONY_ADAPTER=1 to use_harmony=False, but this default router inference still enables Harmony for GPT-OSS models. That makes router KV hashes use Harmony tokens while the worker renders the non-Harmony chat template. Consider applying the same env gate here or always passing the server’s resolved use_harmony value into KvCacheAwareRouter.
Suggested alignment
def _uses_harmony_tokenization(self,
request: ChatCompletionRequest) -> bool:
+ if os.getenv("DISABLE_HARMONY_ADAPTER", "0") == "1":
+ return False
if self._use_harmony is not None:
return self._use_harmony
return self._get_model_type(request.model) == "gpt_oss"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _uses_harmony_tokenization(self, | |
| request: ChatCompletionRequest) -> bool: | |
| if self._use_harmony is not None: | |
| return self._use_harmony | |
| return self._get_model_type(request.model) == "gpt_oss" | |
| def _uses_harmony_tokenization(self, | |
| request: ChatCompletionRequest) -> bool: | |
| if os.getenv("DISABLE_HARMONY_ADAPTER", "0") == "1": | |
| return False | |
| if self._use_harmony is not None: | |
| return self._use_harmony | |
| return self._get_model_type(request.model) == "gpt_oss" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/serve/router.py` around lines 736 - 740, The Harmony
tokenization decision in `_uses_harmony_tokenization` is still defaulting
GPT-OSS models to Harmony even when the server has disabled it. Update
`KvCacheAwareRouter` to use the same resolved `use_harmony` value that
`OpenAIServer` derives from `DISABLE_HARMONY_ADAPTER`, or otherwise apply that
env gate here, so router KV hashing and worker chat template rendering stay
aligned.
Use tool-aware chat template and Harmony tokenization for KV-cache-aware router hashes without mutating forwarded OpenAI requests. Sync disaggregated generation draft tokens from context-phase params before scheduling so batch capacity accounting sees transferred draft tokens. Add unit coverage for router/server tokenization parity and disagg draft-token scheduler accounting. Signed-off-by: Simeng Liu <simengl@nvidia.com>
3081468 to
1f3dde4
Compare
Use tool-aware chat template and Harmony tokenization for KV-cache-aware router hashes without mutating forwarded OpenAI requests.
Sync disaggregated generation draft tokens from context-phase params before scheduling so batch capacity accounting sees transferred draft tokens.
Add unit coverage for router/server tokenization parity and disagg draft-token scheduler accounting.
Summary by CodeRabbit
New Features
Bug Fixes
Description
Test Coverage
tests/unittest/_torch/executor/test_py_executor.py
tests/unittest/disaggregated/test_router.py
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.