Skip to content

[TRTLLM-12622][feat] Reland: Add native post-processing hook to trtllm-serve#15631

Open
xwang233 wants to merge 1 commit into
NVIDIA:mainfrom
xwang233:reland-trtllm-12622-post-processor-hook
Open

[TRTLLM-12622][feat] Reland: Add native post-processing hook to trtllm-serve#15631
xwang233 wants to merge 1 commit into
NVIDIA:mainfrom
xwang233:reland-trtllm-12622-post-processor-hook

Conversation

@xwang233

@xwang233 xwang233 commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Description

Reland of #15239 (native per-request post-processing hook for trtllm-serve, TRTLLM-12622), which was reverted in #15629.

The original PR was reverted because it broke AutoDeploy unit tests relying on DemoLLM with AttributeError: 'DemoLLM' object has no attribute '_post_processor_hook'. DemoLLM.__init__ does not call super().__init__(), so it never picked up the _post_processor_hook attribute that BaseLLM.__init__ sets and that generate_async reads.

This reland re-applies #15239 unchanged and folds in the minimal fix from #15628: initialize self._post_processor_hook = None in DemoLLM.__init__. DemoLLM is the only LLM subclass that bypasses BaseLLM.__init__; all others chain super().__init__(), so no other class needs the attribute.

Test Coverage

PR Checklist

  • Please check this after reviewing the above items as appropriate for this PR.

Summary by CodeRabbit

  • New Features

    • Added support for a user-defined post-processing hook in server and client flows.
    • Published documentation covering setup, behavior, supported verdicts, and usage examples.
  • Bug Fixes

    • Improved handling of late or missing responses to avoid request-processing errors.
    • Made hook-driven termination end requests promptly and safely.
  • Tests

    • Added unit and end-to-end coverage for rewriting, suppression, termination, and server compatibility checks.

Reland of NVIDIA#15239 (reverted in NVIDIA#15629). Re-applies the native per-request
post-processing hook for trtllm-serve, plus the follow-up fix from NVIDIA#15628:
initialize DemoLLM._post_processor_hook to None so AutoDeploy unit tests
relying on DemoLLM no longer fail with AttributeError.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
@xwang233 xwang233 requested review from a team as code owners June 25, 2026 16:05
@xwang233 xwang233 added the api-compatible Accepted LLM API contract change that is backwards-compatible label Jun 25, 2026
@xwang233

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55825 [ run ] triggered by Bot. Commit: b315e71 Link to invocation

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a configurable post-processing hook for detokenized output, wires it through serve/LLM/executor paths, adds guardrails for tokenizer and harmony modes, and includes unit, integration, and documentation coverage.

Changes

Post-processing hook feature

Layer / File(s) Summary
Hook contract and unit behavior
tensorrt_llm/executor/postprocessor_hook.py, tests/unittest/executor/test_postprocessor_hook.py
Defines the hook payload, verdict types, loader, and in-place output mutation logic, with unit tests for emit, suppress, terminate, loader behavior, and state handling.
Result and worker plumbing
tensorrt_llm/executor/result.py, tensorrt_llm/executor/postproc_worker.py, tensorrt_llm/executor/worker.py, tensorrt_llm/executor/proxy.py, tests/unittest/executor/test_postprocessor_hook.py, tests/unittest/executor/test_proxy_postproc_terminate.py
Adds hook storage and detokenization in results, threads the hook through postprocess worker startup and record handling, updates proxy cleanup and late-response handling, and adds matching result/proxy tests.
CLI and runtime configuration
tensorrt_llm/commands/serve.py, tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/llmapi/llm.py, tensorrt_llm/_torch/auto_deploy/llm.py, tensorrt_llm/serve/openai_server.py, tests/unittest/api_stability/references/llm.yaml, tests/unittest/llmapi/test_llm_args.py, tests/unittest/executor/test_postprocessor_hook.py
Adds the post_processor_hook CLI and LLM argument plumbing, resolves the hook during LLM initialization, propagates it into request outputs and executor config, initializes the DemoLLM hook field, and rejects incompatible tokenizer-skipping and harmony combinations with schema and unit tests.
Docs and OpenAI coverage
docs/source/features/post-processor-hook.md, docs/source/index.rst, tests/integration/defs/test_e2e.py, tests/integration/test_lists/test-db/l0_a10.yml, tests/unittest/llmapi/apps/_postproc_hook_samples.py, tests/unittest/llmapi/apps/_test_openai_post_processor.py
Adds the feature documentation page and index entry, sample hook implementations, the E2E wrapper and test-list entries, and OpenAI completion/chat tests across streaming, non-streaming, and detokenize=false cases.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#15239: Introduces the same post-processor hook feature across the executor, CLI/server wiring, and tests at the same code paths.
  • NVIDIA/TensorRT-LLM#15629: Reverts the post-processor hook feature across the same hook, CLI, executor, and test paths.

Suggested reviewers

  • suyoggupta
  • JunyiXu-nv
  • chang-l
  • nv-guomingz
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.87% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the reland and the main feature change: a native post-processing hook for trtllm-serve.
Description check ✅ Passed The description covers the required Description, Test Coverage, and PR Checklist sections with clear, relevant details.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/llm.py (1)

212-226: 🎯 Functional Correctness | 🟠 Major

Reject or load post_processor_hook in DemoLLM

DemoLLM inherits LlmArgs, which already accepts post_processor_hook, but this constructor never loads it and hard-sets self._post_processor_hook = None. That silently disables the hook on the demollm path. Either reject the option here or thread it through the same loader used by LLM.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/auto_deploy/llm.py` around lines 212 - 226, DemoLLM
currently accepts post_processor_hook through LlmArgs but drops it by always
setting self._post_processor_hook to None, which disables the hook on the
demollm path. Update DemoLLM.__init__ to either explicitly reject
post_processor_hook in the same way it rejects encode_only, or load and assign
it using the same hook-loading logic as LLM so the argument is honored
consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/source/features/post-processor-hook.md`:
- Around line 30-34: The docs example in the post-processor hook section uses
the less-preferred `--extra_llm_api_options` path for a config-file setup;
update the `trtllm-serve` example to show passing the YAML via `--config`
instead. Keep the `post_processor_hook` setting associated with the same
example, and adjust the surrounding wording so it matches the preferred
configuration style used in the docs.

In `@tensorrt_llm/commands/serve.py`:
- Around line 640-652: The new post-processing hook option is being accepted by
trtllm-serve even when the visual-gen path is used, where it is not forwarded or
enforced. Update the serve command flow around the visual-gen branch and
_serve_visual_gen() to explicitly reject --post_processor_hook for visual-gen
serving, or otherwise propagate it only if that path truly supports it. Use the
existing serve.py option wiring and the visual-gen dispatch logic as the place
to add the validation so unsupported configurations fail fast instead of being
silently ignored.

In `@tensorrt_llm/executor/postproc_worker.py`:
- Around line 222-226: The fast-path in postproc_worker’s response handling is
using record._done to force is_final, but _done is also set by ordinary
stop-string handling in DetokenizedGenerationResultBase._handle_response().
Update the terminate-record check in the worker logic to key off record._aborted
instead, so only the terminate-hook path forces the request finalization and
non-aborted generations still wait for the engine’s real final response.

In `@tensorrt_llm/executor/postprocessor_hook.py`:
- Around line 181-233: The post-processor handling in the result.outputs loop
does not fully fail the chunk closed after a TERMINATE verdict, so sibling
outputs can still be serialized. Update the logic around
PostProcessorHookAction.TERMINATE in postprocessor_hook.py so that once one
output triggers terminate, the remaining outputs in the same result are
blanked/suppressed and marked stopped before the loop exits. Use the existing
result._aborted/result._done flags and output.finish_reason/output.stop_reason
handling in this loop to ensure multi-output requests cannot leak sibling text.

In `@tests/unittest/executor/test_postprocessor_hook.py`:
- Around line 82-156: Add a test in test_postprocessor_hook around
apply_post_processor_hook and terminate() that uses two CompletionOutput objects
in result.outputs, triggers termination from one output, and verifies the whole
request is cancelled. Assert the sibling output is also blanked/withheld and
gets the stopped/finished state expected from terminate(), not just the
triggering output, so the request-wide behavior is covered.

---

Outside diff comments:
In `@tensorrt_llm/_torch/auto_deploy/llm.py`:
- Around line 212-226: DemoLLM currently accepts post_processor_hook through
LlmArgs but drops it by always setting self._post_processor_hook to None, which
disables the hook on the demollm path. Update DemoLLM.__init__ to either
explicitly reject post_processor_hook in the same way it rejects encode_only, or
load and assign it using the same hook-loading logic as LLM so the argument is
honored consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 81a87132-605c-427c-8d5f-ac6decbe1786

📥 Commits

Reviewing files that changed from the base of the PR and between edb14ee and b315e71.

📒 Files selected for processing (20)
  • docs/source/features/post-processor-hook.md
  • docs/source/index.rst
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/executor/postproc_worker.py
  • tensorrt_llm/executor/postprocessor_hook.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/serve/openai_server.py
  • tests/integration/defs/test_e2e.py
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/unittest/api_stability/references/llm.yaml
  • tests/unittest/executor/test_postprocessor_hook.py
  • tests/unittest/executor/test_proxy_postproc_terminate.py
  • tests/unittest/llmapi/apps/_postproc_hook_samples.py
  • tests/unittest/llmapi/apps/_test_openai_post_processor.py
  • tests/unittest/llmapi/test_llm_args.py

Comment thread docs/source/features/post-processor-hook.md
Comment thread tensorrt_llm/commands/serve.py
Comment thread tensorrt_llm/executor/postproc_worker.py
Comment thread tensorrt_llm/executor/postprocessor_hook.py
Comment thread tests/unittest/executor/test_postprocessor_hook.py

@govind-ramnarayan govind-ramnarayan left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this for AutoDeploy!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-compatible Accepted LLM API contract change that is backwards-compatible

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants