[#15565][fix] AutoDeploy: Fix Super MTP IMA introduced by checkpointing replay by galagam · Pull Request #15622 · NVIDIA/TensorRT-LLM

galagam · 2026-06-25T08:56:25Z

Summary by CodeRabbit

Bug Fixes
- Improved reliability for auto-deploy and replay-based execution by ensuring metadata is prepared at the right time and avoiding use of uninitialized buffer data.
- Updated several platform-specific test exclusions and acceptance thresholds to better match current behavior across supported GPU configurations.
Tests
- Added coverage for replay-buffer handling in a cached execution path.
- Expanded integration test coverage for an additional 8-GPU B200 scenario.

Description

Make replay-metadata preparation symmetric across runtime and cudagraph
capture by registering prepare_replay_metadata as a host-prepare hook, so it
runs inside nest_sequences() on every path -- exactly like all other AD
attention metadata.

Previously prepare_replay_metadata was a side call only in
ADEngine._prepare_inputs (runtime). The cudagraph capture path
(SequenceInfo.set_capture_batch) synthesizes every other metadata buffer
(slot_idx, KV page metadata, in-graph prepare ops) but never invoked it, so
replay_work_items was the lone metadata buffer left unprepared at capture --
the root cause of the SuperV3 MTP capture-time IMA.

The hook is registered only when the replay buffers were actually allocated
(ssm_replay enabled), so non-replay models pay no per-forward cost and never
touch the unallocated buffers. The explicit runtime call is removed (the hook
now covers it).

In addition, zero-init replay_work_items as an extra defense mechanism.

Increase required acceptance rate by the integration test. Empirically, we're seeing acceptance rates of 53%-54% in quantized variants as well. Tighten the threshold from 40% to 50% to avoid future regressions.

Test Coverage

Added unit test:
tests/unittest/auto_deploy/singlegpu/custom_ops/mamba/test_flashinfer_mamba_cached_op.py::test_extend_replay_init_buffers
accuracy/test_llm_api_autodeploy.py::TestNemotronSuperV3::test_mtp that are being unwaived as part of this PR

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-06-25T09:02:01Z

📝 Walkthrough

Walkthrough

Replay work-items are zero-initialized, populated through a host-prepare hook before attention forward, and the related AutoDeploy tests, test list, and waiver entries are updated. A new Mamba replay regression test constructs replay buffers and runs the cached SSM path.

Changes

Replay metadata and AutoDeploy coverage

Layer / File(s)	Summary
Replay metadata lifecycle `tensorrt_llm/_torch/auto_deploy/shim/interface.py`, `tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py`	`CachedSequenceInterface` zero-initializes `_replay_work_items`, registers a host-prepare hook that calls `prepare_replay_metadata()`, and `ADEngine._prepare_inputs` stops calling `prepare_replay_metadata()` directly after `nest_sequences(...)`.
Mamba replay regression `tests/unittest/auto_deploy/singlegpu/custom_ops/mamba/test_flashinfer_mamba_cached_op.py`	The test module adds replay-related imports and helpers, builds a replay-capable `CachedSequenceInterface`, and adds `test_extend_replay_no_ima` to run `flashinfer_cached_ssm` with replay buffers and CUDA synchronization.
Acceptance threshold and waivers `tests/integration/defs/accuracy/test_llm_api_autodeploy.py`, `tests/integration/test_lists/test-db/l0_dgx_b200.yml`, `tests/integration/test_lists/waives.txt`	`test_mtp` uses a fixed `min_acceptance_rate=0.50`, the B200 post-merge list adds the `nvfp4_ws8_80gb-trtllm` case, and waiver entries are removed and replaced for the affected AutoDeploy accuracy cases.

Sequence Diagram(s)

sequenceDiagram
  participant ADEngine as ADEngine._prepare_inputs
  participant CachedSequenceInterface
  participant RegisterHook as register_host_prepare_for_attention_forward
  participant PrepareReplay as prepare_replay_metadata()
  ADEngine->>CachedSequenceInterface: info.nest_sequences(...)
  CachedSequenceInterface->>RegisterHook: register_host_prepare_for_attention_forward(...)
  RegisterHook->>PrepareReplay: prepare_replay_metadata()
  PrepareReplay-->>CachedSequenceInterface: _replay_work_items populated

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#15506 — Also updates tests/integration/test_lists/waives.txt with TestNemotronSuperV3::test_mtp waiver changes.
NVIDIA/TensorRT-LLM#15511 — Modifies integration skip/waiver lists in the same test-gating area.
NVIDIA/TensorRT-LLM#15609 — Changes waiver entries for parametrized integration cases in the same waiver file.

Suggested labels

AutoDeploy

Suggested reviewers

tcherckez-nvidia
suyoggupta
MrGeva

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise, uses the required issue/type format, and matches the main AutoDeploy replay-IMA fix.
Description check	✅ Passed	The description includes the required summary, test coverage, and checklist sections with a clear explanation of the fix.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

🧹 Nitpick comments (1)

tests/unittest/auto_deploy/singlegpu/custom_ops/mamba/test_flashinfer_mamba_cached_op.py (1)
346-468: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Good deterministic regression guard.

The poison-then-prep-then-synchronize structure reliably reproduces the capture-time IMA, and confining the poison to _replay_work_items keeps the failure deterministic. Coverage for the replay path is sufficient.

Optionally, you could strengthen the assertion to prove prep actually overwrote the buffer (not merely that the kernel didn't fault), e.g. assert the poison sentinel is gone from the consumed rows:
🧪 Optional stronger assertion
     torch.cuda.synchronize()
+    # Prep must have overwritten the poisoned rows the kernel consumes.
+    assert (replay_work_items[:num_extend] != 0x7FFFFFFF).all()
     assert out.shape == hidden_states.shape
     assert torch.isfinite(out).all()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/auto_deploy/singlegpu/custom_ops/mamba/test_flashinfer_mamba_cached_op.py`
around lines 346 - 468, The replay regression guard is good, but it only checks
that the kernel does not fault; add an assertion in test_extend_replay_no_ima
that the production prep path actually overwrites the poisoned
_replay_work_items buffer before flashinfer_cached_ssm consumes it. Use the
existing setup around interface.info.set_capture_batch(), replay_work_items, and
replay_n_writes to verify the sentinel value is gone from the relevant rows
after metadata prep and before synchronization.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@tests/unittest/auto_deploy/singlegpu/custom_ops/mamba/test_flashinfer_mamba_cached_op.py`:
- Around line 346-468: The replay regression guard is good, but it only checks
that the kernel does not fault; add an assertion in test_extend_replay_no_ima
that the production prep path actually overwrites the poisoned
_replay_work_items buffer before flashinfer_cached_ssm consumes it. Use the
existing setup around interface.info.set_capture_batch(), replay_work_items, and
replay_n_writes to verify the sentinel value is gone from the relevant rows
after metadata prep and before synchronization.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ae99a81d-7b62-467a-ab74-94a77a15c7b7

📥 Commits

Reviewing files that changed from the base of the PR and between a02214a and 0f4a221.

📒 Files selected for processing (6)

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tensorrt_llm/_torch/auto_deploy/shim/interface.py
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
tests/integration/test_lists/test-db/l0_dgx_b200.yml
tests/integration/test_lists/waives.txt
tests/unittest/auto_deploy/singlegpu/custom_ops/mamba/test_flashinfer_mamba_cached_op.py

💤 Files with no reviewable changes (2)

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tests/integration/test_lists/waives.txt

galagam · 2026-06-25T11:10:27Z

/bot run

tensorrt-cicd · 2026-06-25T11:16:41Z

PR_Github #55767 [ run ] triggered by Bot. Commit: a93dede Link to invocation

tensorrt-cicd · 2026-06-25T13:01:50Z

PR_Github #55767 [ run ] completed with state SUCCESS. Commit: a93dede
/LLM/main/L0_MergeRequest_PR pipeline #44664 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…tems buffer The SuperV3 MTP replay path allocated the shared `replay_work_items` metadata buffer with `torch.empty` in `CachedSequenceInterface._create_mamba_hybrid_cache_manager`. During CUDA graph capture/warmup the replay SSM kernel runs before `prepare_replay_metadata()` has populated this buffer, so in `persistent_main` mode the kernel reads uninitialized cache-slot indices and performs an out-of-bounds SSM state-cache access -- an illegal memory access that crashed executor init (e.g. test_mtp[nvfp4_ws4_80gb-trtllm]). Allocate the buffer with `torch.zeros` so every work item maps to slot 0 / pnat 0 (in-bounds) until real metadata is written per step. This mirrors the PyTorch backend, which already zero-inits `replay_work_items` (mamba2_metadata.py). Add a regression test that drives the real replay op (flashinfer_cached_ssm -> replay_selective_state_update) at the production SuperV3 shape (persistent_main mode) with the buffer allocated through the production path and left unprepared (capture scenario), asserting no IMA. The test deterministically faults without the fix and passes with it. Unwaive the SuperV3 test_mtp ws4 cases (bf16 / fp8 / nvfp4) under nvbugs 6316981 and 6336682, now that the capture-time IMA is fixed. Register the previously-uncovered nvfp4_ws8 trtllm MTP case on DGX-B200 (existing 8-GPU post-merge section). Raise the SuperV3 test_mtp acceptance-rate gate to a single 0.50 for all dtypes (observed >=53% on bf16/fp8/nvfp4 ws4). Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>

… during cudagraph capture Make replay-metadata preparation symmetric across runtime and cudagraph capture by registering prepare_replay_metadata as a host-prepare hook, so it runs inside nest_sequences() on every path -- exactly like all other AD attention metadata. Previously prepare_replay_metadata was a side call only in ADEngine._prepare_inputs (runtime). The cudagraph capture path (SequenceInfo.set_capture_batch) synthesizes every other metadata buffer (slot_idx, KV page metadata, in-graph prepare ops) but never invoked it, so replay_work_items was the lone metadata buffer left unprepared at capture -- the root cause of the SuperV3 MTP capture-time IMA. The hook is registered only when the replay buffers were actually allocated (ssm_replay enabled), so non-replay models pay no per-forward cost and never touch the unallocated buffers. The explicit runtime call is removed (the hook now covers it). The zero-init of replay_work_items is kept as defense-in-depth (reframed in comment), mirroring the PyTorch backend which also keeps the zeros guard alongside its per-forward prepare. Verified on 4xGB200: with the buffer deliberately poisoned at allocation, the hook overwrites it at capture and every runtime step -- test_mtp[nvfp4_ws4_80gb-trtllm] passes (acceptance 54.5%, no IMA), confirming zero-init is now optional. Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>

Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>

galagam · 2026-06-25T15:22:36Z

/bot run --post-merge

tensorrt-cicd · 2026-06-25T15:29:11Z

PR_Github #55806 [ run ] triggered by Bot. Commit: 27f1dd7 Link to invocation

galagam requested review from a team as code owners June 25, 2026 08:56

galagam requested a review from MrGeva June 25, 2026 08:56

github-actions Bot assigned galagam Jun 25, 2026

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

MrGeva approved these changes Jun 25, 2026

View reviewed changes

galagam requested review from govind-ramnarayan and hnover-nv June 25, 2026 09:07

galagam force-pushed the gagam/fix-mtp-ima branch from 7728d94 to a93dede Compare June 25, 2026 09:09

galagam added 3 commits June 25, 2026 18:21

refactor

27f1dd7

Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>

galagam force-pushed the gagam/fix-mtp-ima branch from a93dede to 27f1dd7 Compare June 25, 2026 15:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[#15565][fix] AutoDeploy: Fix Super MTP IMA introduced by checkpointing replay#15622

[#15565][fix] AutoDeploy: Fix Super MTP IMA introduced by checkpointing replay#15622
galagam wants to merge 3 commits into
NVIDIA:mainfrom
nv-auto-deploy:gagam/fix-mtp-ima

galagam commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 25, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

galagam commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

galagam commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

galagam commented Jun 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Jun 25, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

galagam commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

galagam commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

galagam commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading