[None][perf] DSv4 follow-up: autotuner updates by lfr-0531 · Pull Request #15626 · NVIDIA/TensorRT-LLM

lfr-0531 · 2026-06-25T11:27:55Z

Summary by CodeRabbit

New Features
- Added an option to keep certain autotuning results in memory without saving them to disk.
- Improved tuning behavior for some distributed and fused GPU workloads, with broader tactic selection support.
Bug Fixes
- Made autotuning cache saves safer when multiple processes write at the same time.
- Prevented cached tuning results from interfering with startup warmup behavior in specific GPU paths.
Tests
- Expanded automated coverage for autotuning shortcuts and CUDA graph replay scenarios.

Description

This PR carves the PR-4 autotuner scope from github/feat/deepseek_v4 after a7728aa onto current github/main.

It includes:

Autotuner cache merge/atomic-save updates.
Distributed tuning support and the single (runner, tactic) shortcut.
MHC autotune tactic reductions and the follow-up that disables PARALLEL distributed tuning for MhcFusedHc.

The PR is intentionally limited to autotuner/MHC tuning files and direct unit tests. The source test-list hunk for deleted l0_b200_ds.yml was left out per the split plan.

Rebase/conflict notes are recorded in tmp/dsv4_final_prs/PR4_rebase_conflicts_20260625.md.

Test Coverage

Import/scope:
- Verified tensorrt_llm, tensorrt_llm._torch.autotuner, and tensorrt_llm._torch.modules.mhc.mhc_cuda import from this worktree.
- git diff --check github/main...HEAD
- Changed-file conflict-marker search.
- PR-4 forbidden-scope check returned no matches.
Unit tests:
- PATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 20m .venv-3.12/bin/python -m pytest tests/unittest/_torch/misc/test_autotuner.py -q
  - 30 passed, 4 skipped, 10 warnings
- PATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 20m .venv-3.12/bin/python -m pytest tests/unittest/_torch/modules/test_mhc.py -q
  - 55 passed, 3 warnings

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-06-25T11:37:41Z

📝 Walkthrough

Walkthrough

The PR adds an autotuner cache opt-out path with lockfile-backed JSON persistence, a single-pair profiling shortcut, updated mHC distributed tuning and MMA tactic selection rules, and new CUDA graph coverage for fused-HC behaviors.

Changes

Autotuner and mHC tuning updates

Layer / File(s)	Summary
Cache opt-out contract `tensorrt_llm/_torch/autotuner.py`, `tensorrt_llm/_torch/custom_ops/torch_custom_ops.py`	`TuningConfig.exclude_from_cache` is documented and defaulted, `choose_one()` records excluded ops and bypasses cache lookup during tuning, and `fp8SwapABGemmRunner` opts out of disk persistence.
Cache persistence and locking `tensorrt_llm/_torch/autotuner.py`	`AutoTunerProfilingCache` tracks excluded ops, filters them from partitioned output, and saves/loads JSON with lockfile-backed merge and atomic-write handling.
Single-pair profiling shortcut `tensorrt_llm/_torch/autotuner.py`, `tests/unittest/_torch/misc/test_autotuner.py`	`_profile_runners()` precomputes candidate pairs for a one-pair fast path, and the new unit test covers shortcut and multi-tactic profiling behavior.
mHC tuning rules `tensorrt_llm/_torch/modules/mhc/mhc_cuda.py`	mHC pre-mapping uses parallel distributed tuning, fused-HC adds MMA heuristics, and tactic generation now deduplicates and revises backend thresholds and fallbacks.
CUDA graph coverage `tests/unittest/_torch/modules/test_mhc.py`	CUDA-graph helpers build fused-HC cases, compare replay with eager execution, and add high-splitK plus decode-bucket/prefill coverage.

Sequence Diagram(s)

sequenceDiagram
  participant AutoTunerProfilingCache
  participant _exclusive_cache_lock
  participant _atomic_write_json

  AutoTunerProfilingCache->>_exclusive_cache_lock: acquire lockfile
  AutoTunerProfilingCache->>AutoTunerProfilingCache: read existing cache JSON
  AutoTunerProfilingCache->>AutoTunerProfilingCache: merge shared and rank entries
  AutoTunerProfilingCache->>_atomic_write_json: write merged JSON
  AutoTunerProfilingCache->>_exclusive_cache_lock: release lockfile

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise and matches the PR’s main autotuner/MHC follow-up scope.
Description check	✅ Passed	The description follows the template with Description, Test Coverage, and a completed PR Checklist.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

tests/unittest/_torch/misc/test_autotuner.py (1)
1056-1129: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a cache opt-out regression next to this shortcut test.

Coverage is sufficient for the single-pair shortcut, but this PR also changes exclude_from_cache persistence. Please add a case in tests/unittest/_torch/misc/test_autotuner.py that seeds a stale cache entry for an excluded op, enters autotune(cache_path=...), and asserts profiling still runs and the saved JSON no longer contains that op.

As per path instructions, tests/** reviews should assess coverage and suggest concrete file names.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/misc/test_autotuner.py` around lines 1056 - 1129, The
shortcut test covers single-pair profiling, but it does not verify the new
exclude_from_cache persistence behavior. Add a regression case in test_autotuner
alongside test_single_pair_shortcut that seeds a stale profiling cache entry for
an op marked excluded, then enters autotune(cache_path=...) and confirms
AutoTuner still profiles instead of reusing the stale entry. After the run,
assert the saved cache JSON no longer contains that op so the persistence
cleanup is exercised.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/autotuner.py`:
- Around line 1083-1096: The forced cache miss in autotuning is only applied
before the early return, but the per-profile lookup in the tuning flow still
uses search_cache() and can reuse stale entries. Update the lookup path in
autotuner’s cache handling so the excluded-from-cache behavior also forces a
miss during each profile-specific search, ensuring _profile_runners() and the
JIT warmup still run when tuning_config.exclude_from_cache is set. Use the
existing custom_op, profiling_cache.search_cache(), and fallback_entry() logic
as the main touchpoints.
- Around line 596-616: The cache merge in _partition_cache_by_strategy()
preserves stale persisted entries for ops now marked exclude_from_cache, so they
are not fully removed from disk. Before calling _atomic_write_json, filter
current_cache to drop any excluded-op keys from both the shared cache and the
rank-specific cache, then merge the remaining serialized_shared_cache and
serialized_rank_cache entries as usual.
- Around line 1244-1257: The exception handlers in Autotuner are too broad and
should be narrowed to the specific CUDA/runtime failures these profiling paths
expect. Update the `except Exception as e` blocks in `Autotuner` (including the
single-pair failure path and the later profiling path) to catch the relevant
PyTorch/CUDA error types only, and explicitly handle `torch.cuda.synchronize()`
failures by logging them with context instead of swallowing them. Keep the
existing failure bookkeeping in place after the narrower catch so profiling
stats still update correctly.

In `@tests/unittest/_torch/modules/test_mhc.py`:
- Around line 1098-1106: Add the same availability guard used by the existing
CUDA-graph MMA test to the new fused-HC MMA-only tests, including
test_mhc_fused_hc_cuda_graph_high_splitk_tactics and the other fused_*_mma
CUDA-graph path test in this block, so they skip when
_mhc_fused_hc_mma_available() is false. Keep the tests’ current coverage for the
CUDA-graph replay paths, but ensure the guard is applied before creating the
runner or forcing MMA tactics so non-SM100 and non-BUILD_DEEP_GEMM builds skip
cleanly instead of failing.
- Around line 1076-1078: The output comparison loop in the test uses zip()
without strict checking, so it may silently ignore mismatched output counts.
Update the zip in the graph_out/eager_out assertion to use strict=True, and keep
the existing tuple of names so the test fails loudly if the runner returns a
different number of outputs.

---

Nitpick comments:
In `@tests/unittest/_torch/misc/test_autotuner.py`:
- Around line 1056-1129: The shortcut test covers single-pair profiling, but it
does not verify the new exclude_from_cache persistence behavior. Add a
regression case in test_autotuner alongside test_single_pair_shortcut that seeds
a stale profiling cache entry for an op marked excluded, then enters
autotune(cache_path=...) and confirms AutoTuner still profiles instead of
reusing the stale entry. After the run, assert the saved cache JSON no longer
contains that op so the persistence cleanup is exercised.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f309f66e-9e0f-4dbc-9173-c43b059aeea6

📥 Commits

Reviewing files that changed from the base of the PR and between a02214a and 23fbc68.

📒 Files selected for processing (5)

tensorrt_llm/_torch/autotuner.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/_torch/modules/mhc/mhc_cuda.py
tests/unittest/_torch/misc/test_autotuner.py
tests/unittest/_torch/modules/test_mhc.py

coderabbitai · 2026-06-25T11:37:44Z

+            with _exclusive_cache_lock(lock_path):
+                current_cache = self._read_existing_cache(file_path)

                # Merge shared cache entries (non-INDEPENDENT ops)
                if self.SHARED_CACHE_KEY not in current_cache:
                    current_cache[self.SHARED_CACHE_KEY] = {}
                current_cache[self.SHARED_CACHE_KEY].update(
                    serialized_shared_cache)

-                # Save rank-specific cache entries (INDEPENDENT ops)
-                current_cache[f"rank_{rank}"] = serialized_rank_cache
-
-                json.dump(current_cache, f, indent=2, default=str)
+                # Merge rank-specific cache entries (INDEPENDENT ops).
+                # MUST be a merge (not assignment): a concurrent writer
+                # that committed its rank_{rank} contribution between
+                # this writer's load_cache and save_cache would otherwise
+                # be silently dropped by an `=` assignment that re-writes
+                # the slot with only this writer's in-memory cache.
+                rank_key = f"rank_{rank}"
+                if rank_key not in current_cache:
+                    current_cache[rank_key] = {}
+                current_cache[rank_key].update(serialized_rank_cache)
+
+                self._atomic_write_json(file_path, current_cache)


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Purge stale excluded-op entries from the loaded JSON before writing.

_partition_cache_by_strategy() drops only this process’s outgoing entries. If current_cache already contains an older persisted entry for an op now marked exclude_from_cache, the merge at Lines 599-614 preserves it, so the op is not actually “dropped entirely” from disk.

Proposed fix

with _exclusive_cache_lock(lock_path): current_cache = self._read_existing_cache(file_path) + self._drop_excluded_entries(current_cache) # Merge shared cache entries (non-INDEPENDENT ops) if self.SHARED_CACHE_KEY not in current_cache: current_cache[self.SHARED_CACHE_KEY] = {}

+ def _drop_excluded_entries(self, cache_contents: Dict[str, Any]) -> None: + """Remove previously persisted entries for ops excluded from disk cache.""" + if not self.excluded_op: + return + + for bucket in cache_contents.values(): + if not isinstance(bucket, dict): + continue + for key_str in list(bucket.keys()): + try: + key = ast.literal_eval(key_str) + except (ValueError, SyntaxError): + continue + if key and key[0] in self.excluded_op: + del bucket[key_str]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

with _exclusive_cache_lock(lock_path):

current_cache = self._read_existing_cache(file_path)

# Merge shared cache entries (non-INDEPENDENT ops)

if self.SHARED_CACHE_KEY not in current_cache:

current_cache[self.SHARED_CACHE_KEY] = {}

current_cache[self.SHARED_CACHE_KEY].update(

serialized_shared_cache)

# Save rank-specific cache entries (INDEPENDENT ops)

current_cache[f"rank_{rank}"] = serialized_rank_cache

json.dump(current_cache, f, indent=2, default=str)

# Merge rank-specific cache entries (INDEPENDENT ops).

# MUST be a merge (not assignment): a concurrent writer

# that committed its rank_{rank} contribution between

# this writer's load_cache and save_cache would otherwise

# be silently dropped by an `=` assignment that re-writes

# the slot with only this writer's in-memory cache.

rank_key = f"rank_{rank}"

if rank_key not in current_cache:

current_cache[rank_key] = {}

current_cache[rank_key].update(serialized_rank_cache)

self._atomic_write_json(file_path, current_cache)

with _exclusive_cache_lock(lock_path):

current_cache = self._read_existing_cache(file_path)

self._drop_excluded_entries(current_cache)

# Merge shared cache entries (non-INDEPENDENT ops)

if self.SHARED_CACHE_KEY not in current_cache:

current_cache[self.SHARED_CACHE_KEY] = {}

current_cache[self.SHARED_CACHE_KEY].update(

serialized_shared_cache)

# Merge rank-specific cache entries (INDEPENDENT ops).

# MUST be a merge (not assignment): a concurrent writer

# that committed its rank_{rank} contribution between

# this writer's load_cache and save_cache would otherwise

# be silently dropped by an `=` assignment that re-writes

# the slot with only this writer's in-memory cache.

rank_key = f"rank_{rank}"

if rank_key not in current_cache:

current_cache[rank_key] = {}

current_cache[rank_key].update(serialized_rank_cache)

self._atomic_write_json(file_path, current_cache)

def _drop_excluded_entries(self, cache_contents: Dict[str, Any]) -> None:

"""Remove previously persisted entries for ops excluded from disk cache."""

if not self.excluded_op:

return

for bucket in cache_contents.values():

if not isinstance(bucket, dict):

continue

for key_str in list(bucket.keys()):

try:

key = ast.literal_eval(key_str)

except (ValueError, SyntaxError):

continue

if key and key[0] in self.excluded_op:

del bucket[key_str]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/autotuner.py` around lines 596 - 616, The cache merge in _partition_cache_by_strategy() preserves stale persisted entries for ops now marked exclude_from_cache, so they are not fully removed from disk. Before calling _atomic_write_json, filter current_cache to drop any excluded-op keys from both the shared cache and the rank-specific cache, then merge the remaining serialized_shared_cache and serialized_rank_cache entries as usual.

coderabbitai · 2026-06-25T11:37:44Z

+        if tuning_config.exclude_from_cache:
+            self.profiling_cache.add_excluded_op(custom_op)
+        if tuning_config.exclude_from_cache and self.is_tuning_mode:
+            is_cache_hit = False
+            best_runner_id, best_tactic, min_time = (
+                self.profiling_cache.fallback_entry())
+        else:
+            is_cache_hit, best_runner_id, best_tactic, min_time = (
+                self.profiling_cache.search_cache(
+                    custom_op,
+                    runners,
+                    input_shapes,
+                    tuning_config,
+                    apply_map_to_tuning_buckets=True))


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Apply the excluded-cache miss to the per-profile lookup too.

The initial forced miss avoids the early return, but the tuning loop still calls search_cache() for each profile. A stale loaded entry can therefore skip _profile_runners() and the JIT warmup this option is meant to force.

Proposed fix

tensors = self._prepare_input_tensors(p, inputs) - is_cache_hit, *_ = self.profiling_cache.search_cache( - custom_op, - runners, - p.get_opt_shapes(), - tuning_config, - apply_map_to_tuning_buckets=False, - ) + if tuning_config.exclude_from_cache: + is_cache_hit = False + else: + is_cache_hit, *_ = self.profiling_cache.search_cache( + custom_op, + runners, + p.get_opt_shapes(), + tuning_config, + apply_map_to_tuning_buckets=False, + )

Also applies to: 1141-1148

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/autotuner.py` around lines 1083 - 1096, The forced cache miss in autotuning is only applied before the early return, but the per-profile lookup in the tuning flow still uses search_cache() and can reuse stale entries. Update the lookup path in autotuner’s cache handling so the excluded-from-cache behavior also forces a miss during each profile-specific search, ensuring _profile_runners() and the JIT warmup still run when tuning_config.exclude_from_cache is set. Use the existing custom_op, profiling_cache.search_cache(), and fallback_entry() logic as the main touchpoints.

coderabbitai · 2026-06-25T11:37:44Z

+            except Exception as e:
+                shapes = self._get_input_sizes(input_tensors)
+                logger.warning_once(
+                    f"[Autotuner] Single-pair run failed for custom_op={custom_op}, runner={runner}, tactic={tac}, shapes={shapes}. Error: {e}",
+                    key=(custom_op, "warning_autotuning_single_pair_failure"),
                )
+                self.stats.failed_profiling_count[custom_op].add(
+                    self.profiling_cache.get_cache_key(
+                        custom_op,
+                        runner,
+                        profile.get_opt_shapes(),
+                        tuning_config,
+                        apply_map_to_tuning_buckets=False))
+                has_tuning_failure_occurred = True


📐 Maintainability & Code Quality | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash ruff check --select BLE001,S110 tensorrt_llm/_torch/autotuner.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 3116

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the relevant sections around the reported lines. sed -n '730,790p' tensorrt_llm/_torch/autotuner.py printf '\n---\n' sed -n '1220,1310p' tensorrt_llm/_torch/autotuner.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 7231

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect surrounding helper definitions to understand the expected exception surface. rg -n "torch.cuda.synchronize|warning_once|failed_profiling_count|profile.get_opt_shapes|RuntimeError|Exception as e" tensorrt_llm/_torch/autotuner.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1985

🏁 Script executed:

#!/bin/bash set -euo pipefail # Check how this module handles neighboring runtime failures for consistency. sed -n '930,980p' tensorrt_llm/_torch/autotuner.py printf '\n---\n' sed -n '1030,1065p' tensorrt_llm/_torch/autotuner.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 4418

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the profiling helper and nearby exception handling patterns. sed -n '680,745p' tensorrt_llm/_torch/autotuner.py printf '\n---\n' sed -n '745,785p' tensorrt_llm/_torch/autotuner.py printf '\n---\n' sed -n '1180,1310p' tensorrt_llm/_torch/autotuner.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 11101

🏁 Script executed:

#!/bin/bash set -euo pipefail # Find how this file and nearby code handle CUDA/runtime exceptions and synchronization failures. rg -n "RuntimeError|torch\.cuda\.synchronize|warning_once|except Exception as e|except RuntimeError as e|except Exception:" tensorrt_llm/_torch/autotuner.py tensorrt_llm -g '!**/__pycache__/**'

Repository: NVIDIA/TensorRT-LLM

Length of output: 50376

Narrow the profiling exception handlers. Replace the broad except Exception blocks at tensorrt_llm/_torch/autotuner.py:1244 and :1283-1291 with the specific CUDA/runtime errors these paths are meant to handle, and log torch.cuda.synchronize() failures instead of swallowing them. The current handlers still trip BLE001/S110.

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 1244-1244: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/autotuner.py` around lines 1244 - 1257, The exception handlers in Autotuner are too broad and should be narrowed to the specific CUDA/runtime failures these profiling paths expect. Update the `except Exception as e` blocks in `Autotuner` (including the single-pair failure path and the later profiling path) to catch the relevant PyTorch/CUDA error types only, and explicitly handle `torch.cuda.synchronize()` failures by logging them with context instead of swallowing them. Keep the existing failure bookkeeping in place after the narrower catch so profiling stats still update correctly.

Sources: Coding guidelines, Linters/SAST tools

coderabbitai · 2026-06-25T11:37:44Z

+    for actual, expected, name in zip(
+        graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input")
+    ):


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Add strict=True to the output comparison zip().

Ruff reports B905 here; this assertion should fail loudly if the runner returns a different number of outputs.

Proposed fix

for actual, expected, name in zip( - graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input") + graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input"), strict=True ):

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for actual, expected, name in zip(

graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input")

):

for actual, expected, name in zip(

graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input"), strict=True

):

🧰 Tools

🪛 Ruff (0.15.18)

[warning] 1076-1078: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/modules/test_mhc.py` around lines 1076 - 1078, The output comparison loop in the test uses zip() without strict checking, so it may silently ignore mismatched output counts. Update the zip in the graph_out/eager_out assertion to use strict=True, and keep the existing tuple of names so the test fails loudly if the runner returns a different number of outputs.

Source: Linters/SAST tools

coderabbitai · 2026-06-25T11:37:44Z

+def test_mhc_fused_hc_cuda_graph_high_splitk_tactics(n: int, tactic):
+    """Reduced autotune maps decode buckets to high-splitK MMA tactics.
+
+    Unlike the bit-exact ks=1 graph test above, this covers the actual
+    M=64/128 PR autotune path where splitK atomics and CUDA graph replay are
+    both active.
+    """
+    runner, inputs = _make_fused_hc_runner_case(n=n, hidden_size=4096, hc_mult=4, seed=41 + n)
+    _assert_graph_replay_matches_eager(runner, inputs, tactic)


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Skip these MMA-only tests when fused-HC MMA is unavailable.

Both new tests force fused_*_mma tactics, but unlike the existing CUDA-graph test above, they do not guard _mhc_fused_hc_mma_available(). On non-SM100 or builds without BUILD_DEEP_GEMM=ON, this will fail instead of skip. Coverage is sufficient for the new CUDA-graph paths once the environment guard is added.

Proposed fix

def test_mhc_fused_hc_cuda_graph_high_splitk_tactics(n: int, tactic): """Reduced autotune maps decode buckets to high-splitK MMA tactics. @@ both active. """ + if not _mhc_fused_hc_mma_available(): + pytest.skip("mHC fused-HC MMA kernels require SM100 and BUILD_DEEP_GEMM=ON") + runner, inputs = _make_fused_hc_runner_case(n=n, hidden_size=4096, hc_mult=4, seed=41 + n) _assert_graph_replay_matches_eager(runner, inputs, tactic) @@ def test_mhc_fused_hc_cuda_graph_decode_buckets_then_prefill(): @@ """ + if not _mhc_fused_hc_mma_available(): + pytest.skip("mHC fused-HC MMA kernels require SM100 and BUILD_DEEP_GEMM=ON") + hidden_size = 4096

As per path instructions, tests/**: keep feedback actionable and state whether coverage is sufficient.

Also applies to: 1109-1137

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/modules/test_mhc.py` around lines 1098 - 1106, Add the same availability guard used by the existing CUDA-graph MMA test to the new fused-HC MMA-only tests, including test_mhc_fused_hc_cuda_graph_high_splitk_tactics and the other fused_*_mma CUDA-graph path test in this block, so they skip when _mhc_fused_hc_mma_available() is false. Keep the tests’ current coverage for the CUDA-graph replay paths, but ensure the guard is applied before creating the runner or forcing MMA tactics so non-SM100 and non-BUILD_DEEP_GEMM builds skip cleanly instead of failing.

Source: Path instructions

Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com> Co-authored-by: Mingyang Hao <mingyangHao@users.noreply.github.com> (cherry picked from commit 93ed231) Signed-off-by: Mingyang <mhao1999@outlook.com>

…VIDIA#14441) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> (cherry picked from commit 335a418)

…actic) shortcut (NVIDIA#14458) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> (cherry picked from commit 5d59d34)

…VIDIA#14675) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> (cherry picked from commit f6f6f14)

lfr-0531 · 2026-06-25T13:54:36Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T14:04:34Z

PR_Github #55794 [ run ] triggered by Bot. Commit: 71e2338 Link to invocation

lfr-0531 requested review from a team as code owners June 25, 2026 11:27

lfr-0531 requested review from hyukn and leslie-fang25 June 25, 2026 11:27

github-actions Bot assigned lfr-0531 Jun 25, 2026

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

mingyangHao and others added 4 commits June 25, 2026 12:22

[None][chore] Reduce mhc autotune tactics (NVIDIA#14329)

c9313b6

Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com> Co-authored-by: Mingyang Hao <mingyangHao@users.noreply.github.com> (cherry picked from commit 93ed231) Signed-off-by: Mingyang <mhao1999@outlook.com>

[None][fix] Autotuner cache: merge instead of replace, atomic save (N…

71ef5db

…VIDIA#14441) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> (cherry picked from commit 335a418)

[None][perf] Autotuner: distributed tuning for mhc + single-(runner,t…

fc7b956

…actic) shortcut (NVIDIA#14458) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> (cherry picked from commit 5d59d34)

[None][fix] mhc: disable PARALLEL distributed tuning for MhcFusedHc (N…

71e2338

…VIDIA#14675) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> (cherry picked from commit f6f6f14)

lfr-0531 force-pushed the user/fanrongl/dsv4-a7728aa-pr4-autotuner-20260625 branch from 23fbc68 to 71e2338 Compare June 25, 2026 12:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[None][perf] DSv4 follow-up: autotuner updates#15626

[None][perf] DSv4 follow-up: autotuner updates#15626
lfr-0531 wants to merge 4 commits into
NVIDIA:mainfrom
lfr-0531:user/fanrongl/dsv4-a7728aa-pr4-autotuner-20260625

lfr-0531 commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 25, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

coderabbitai Bot Jun 25, 2026

Uh oh!

lfr-0531 commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

lfr-0531 commented Jun 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Jun 25, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

lfr-0531 commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lfr-0531 commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading