Skip to content

[None][perf] DSv4 follow-up: autotuner updates#15626

Open
lfr-0531 wants to merge 4 commits into
NVIDIA:mainfrom
lfr-0531:user/fanrongl/dsv4-a7728aa-pr4-autotuner-20260625
Open

[None][perf] DSv4 follow-up: autotuner updates#15626
lfr-0531 wants to merge 4 commits into
NVIDIA:mainfrom
lfr-0531:user/fanrongl/dsv4-a7728aa-pr4-autotuner-20260625

Conversation

@lfr-0531

@lfr-0531 lfr-0531 commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • New Features

    • Added an option to keep certain autotuning results in memory without saving them to disk.
    • Improved tuning behavior for some distributed and fused GPU workloads, with broader tactic selection support.
  • Bug Fixes

    • Made autotuning cache saves safer when multiple processes write at the same time.
    • Prevented cached tuning results from interfering with startup warmup behavior in specific GPU paths.
  • Tests

    • Expanded automated coverage for autotuning shortcuts and CUDA graph replay scenarios.

Description

This PR carves the PR-4 autotuner scope from github/feat/deepseek_v4 after a7728aa onto current github/main.

It includes:

  • Autotuner cache merge/atomic-save updates.
  • Distributed tuning support and the single (runner, tactic) shortcut.
  • MHC autotune tactic reductions and the follow-up that disables PARALLEL distributed tuning for MhcFusedHc.

The PR is intentionally limited to autotuner/MHC tuning files and direct unit tests. The source test-list hunk for deleted l0_b200_ds.yml was left out per the split plan.

Rebase/conflict notes are recorded in tmp/dsv4_final_prs/PR4_rebase_conflicts_20260625.md.

Test Coverage

  • Import/scope:
    • Verified tensorrt_llm, tensorrt_llm._torch.autotuner, and tensorrt_llm._torch.modules.mhc.mhc_cuda import from this worktree.
    • git diff --check github/main...HEAD
    • Changed-file conflict-marker search.
    • PR-4 forbidden-scope check returned no matches.
  • Unit tests:
    • PATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 20m .venv-3.12/bin/python -m pytest tests/unittest/_torch/misc/test_autotuner.py -q
      • 30 passed, 4 skipped, 10 warnings
    • PATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 20m .venv-3.12/bin/python -m pytest tests/unittest/_torch/modules/test_mhc.py -q
      • 55 passed, 3 warnings

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@lfr-0531 lfr-0531 requested review from a team as code owners June 25, 2026 11:27
@lfr-0531 lfr-0531 requested review from hyukn and leslie-fang25 June 25, 2026 11:27
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The PR adds an autotuner cache opt-out path with lockfile-backed JSON persistence, a single-pair profiling shortcut, updated mHC distributed tuning and MMA tactic selection rules, and new CUDA graph coverage for fused-HC behaviors.

Changes

Autotuner and mHC tuning updates

Layer / File(s) Summary
Cache opt-out contract
tensorrt_llm/_torch/autotuner.py, tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
TuningConfig.exclude_from_cache is documented and defaulted, choose_one() records excluded ops and bypasses cache lookup during tuning, and fp8SwapABGemmRunner opts out of disk persistence.
Cache persistence and locking
tensorrt_llm/_torch/autotuner.py
AutoTunerProfilingCache tracks excluded ops, filters them from partitioned output, and saves/loads JSON with lockfile-backed merge and atomic-write handling.
Single-pair profiling shortcut
tensorrt_llm/_torch/autotuner.py, tests/unittest/_torch/misc/test_autotuner.py
_profile_runners() precomputes candidate pairs for a one-pair fast path, and the new unit test covers shortcut and multi-tactic profiling behavior.
mHC tuning rules
tensorrt_llm/_torch/modules/mhc/mhc_cuda.py
mHC pre-mapping uses parallel distributed tuning, fused-HC adds MMA heuristics, and tactic generation now deduplicates and revises backend thresholds and fallbacks.
CUDA graph coverage
tests/unittest/_torch/modules/test_mhc.py
CUDA-graph helpers build fused-HC cases, compare replay with eager execution, and add high-splitK plus decode-bucket/prefill coverage.

Sequence Diagram(s)

sequenceDiagram
  participant AutoTunerProfilingCache
  participant _exclusive_cache_lock
  participant _atomic_write_json

  AutoTunerProfilingCache->>_exclusive_cache_lock: acquire lockfile
  AutoTunerProfilingCache->>AutoTunerProfilingCache: read existing cache JSON
  AutoTunerProfilingCache->>AutoTunerProfilingCache: merge shared and rank entries
  AutoTunerProfilingCache->>_atomic_write_json: write merged JSON
  AutoTunerProfilingCache->>_exclusive_cache_lock: release lockfile
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and matches the PR’s main autotuner/MHC follow-up scope.
Description check ✅ Passed The description follows the template with Description, Test Coverage, and a completed PR Checklist.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
tests/unittest/_torch/misc/test_autotuner.py (1)

1056-1129: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a cache opt-out regression next to this shortcut test.

Coverage is sufficient for the single-pair shortcut, but this PR also changes exclude_from_cache persistence. Please add a case in tests/unittest/_torch/misc/test_autotuner.py that seeds a stale cache entry for an excluded op, enters autotune(cache_path=...), and asserts profiling still runs and the saved JSON no longer contains that op.

As per path instructions, tests/** reviews should assess coverage and suggest concrete file names.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/misc/test_autotuner.py` around lines 1056 - 1129, The
shortcut test covers single-pair profiling, but it does not verify the new
exclude_from_cache persistence behavior. Add a regression case in test_autotuner
alongside test_single_pair_shortcut that seeds a stale profiling cache entry for
an op marked excluded, then enters autotune(cache_path=...) and confirms
AutoTuner still profiles instead of reusing the stale entry. After the run,
assert the saved cache JSON no longer contains that op so the persistence
cleanup is exercised.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/autotuner.py`:
- Around line 1083-1096: The forced cache miss in autotuning is only applied
before the early return, but the per-profile lookup in the tuning flow still
uses search_cache() and can reuse stale entries. Update the lookup path in
autotuner’s cache handling so the excluded-from-cache behavior also forces a
miss during each profile-specific search, ensuring _profile_runners() and the
JIT warmup still run when tuning_config.exclude_from_cache is set. Use the
existing custom_op, profiling_cache.search_cache(), and fallback_entry() logic
as the main touchpoints.
- Around line 596-616: The cache merge in _partition_cache_by_strategy()
preserves stale persisted entries for ops now marked exclude_from_cache, so they
are not fully removed from disk. Before calling _atomic_write_json, filter
current_cache to drop any excluded-op keys from both the shared cache and the
rank-specific cache, then merge the remaining serialized_shared_cache and
serialized_rank_cache entries as usual.
- Around line 1244-1257: The exception handlers in Autotuner are too broad and
should be narrowed to the specific CUDA/runtime failures these profiling paths
expect. Update the `except Exception as e` blocks in `Autotuner` (including the
single-pair failure path and the later profiling path) to catch the relevant
PyTorch/CUDA error types only, and explicitly handle `torch.cuda.synchronize()`
failures by logging them with context instead of swallowing them. Keep the
existing failure bookkeeping in place after the narrower catch so profiling
stats still update correctly.

In `@tests/unittest/_torch/modules/test_mhc.py`:
- Around line 1098-1106: Add the same availability guard used by the existing
CUDA-graph MMA test to the new fused-HC MMA-only tests, including
test_mhc_fused_hc_cuda_graph_high_splitk_tactics and the other fused_*_mma
CUDA-graph path test in this block, so they skip when
_mhc_fused_hc_mma_available() is false. Keep the tests’ current coverage for the
CUDA-graph replay paths, but ensure the guard is applied before creating the
runner or forcing MMA tactics so non-SM100 and non-BUILD_DEEP_GEMM builds skip
cleanly instead of failing.
- Around line 1076-1078: The output comparison loop in the test uses zip()
without strict checking, so it may silently ignore mismatched output counts.
Update the zip in the graph_out/eager_out assertion to use strict=True, and keep
the existing tuple of names so the test fails loudly if the runner returns a
different number of outputs.

---

Nitpick comments:
In `@tests/unittest/_torch/misc/test_autotuner.py`:
- Around line 1056-1129: The shortcut test covers single-pair profiling, but it
does not verify the new exclude_from_cache persistence behavior. Add a
regression case in test_autotuner alongside test_single_pair_shortcut that seeds
a stale profiling cache entry for an op marked excluded, then enters
autotune(cache_path=...) and confirms AutoTuner still profiles instead of
reusing the stale entry. After the run, assert the saved cache JSON no longer
contains that op so the persistence cleanup is exercised.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f309f66e-9e0f-4dbc-9173-c43b059aeea6

📥 Commits

Reviewing files that changed from the base of the PR and between a02214a and 23fbc68.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/modules/mhc/mhc_cuda.py
  • tests/unittest/_torch/misc/test_autotuner.py
  • tests/unittest/_torch/modules/test_mhc.py

Comment on lines +596 to +616
with _exclusive_cache_lock(lock_path):
current_cache = self._read_existing_cache(file_path)

# Merge shared cache entries (non-INDEPENDENT ops)
if self.SHARED_CACHE_KEY not in current_cache:
current_cache[self.SHARED_CACHE_KEY] = {}
current_cache[self.SHARED_CACHE_KEY].update(
serialized_shared_cache)

# Save rank-specific cache entries (INDEPENDENT ops)
current_cache[f"rank_{rank}"] = serialized_rank_cache

json.dump(current_cache, f, indent=2, default=str)
# Merge rank-specific cache entries (INDEPENDENT ops).
# MUST be a merge (not assignment): a concurrent writer
# that committed its rank_{rank} contribution between
# this writer's load_cache and save_cache would otherwise
# be silently dropped by an `=` assignment that re-writes
# the slot with only this writer's in-memory cache.
rank_key = f"rank_{rank}"
if rank_key not in current_cache:
current_cache[rank_key] = {}
current_cache[rank_key].update(serialized_rank_cache)

self._atomic_write_json(file_path, current_cache)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Purge stale excluded-op entries from the loaded JSON before writing.

_partition_cache_by_strategy() drops only this process’s outgoing entries. If current_cache already contains an older persisted entry for an op now marked exclude_from_cache, the merge at Lines 599-614 preserves it, so the op is not actually “dropped entirely” from disk.

Proposed fix
             with _exclusive_cache_lock(lock_path):
                 current_cache = self._read_existing_cache(file_path)
+                self._drop_excluded_entries(current_cache)

                 # Merge shared cache entries (non-INDEPENDENT ops)
                 if self.SHARED_CACHE_KEY not in current_cache:
                     current_cache[self.SHARED_CACHE_KEY] = {}
+    def _drop_excluded_entries(self, cache_contents: Dict[str, Any]) -> None:
+        """Remove previously persisted entries for ops excluded from disk cache."""
+        if not self.excluded_op:
+            return
+
+        for bucket in cache_contents.values():
+            if not isinstance(bucket, dict):
+                continue
+            for key_str in list(bucket.keys()):
+                try:
+                    key = ast.literal_eval(key_str)
+                except (ValueError, SyntaxError):
+                    continue
+                if key and key[0] in self.excluded_op:
+                    del bucket[key_str]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
with _exclusive_cache_lock(lock_path):
current_cache = self._read_existing_cache(file_path)
# Merge shared cache entries (non-INDEPENDENT ops)
if self.SHARED_CACHE_KEY not in current_cache:
current_cache[self.SHARED_CACHE_KEY] = {}
current_cache[self.SHARED_CACHE_KEY].update(
serialized_shared_cache)
# Save rank-specific cache entries (INDEPENDENT ops)
current_cache[f"rank_{rank}"] = serialized_rank_cache
json.dump(current_cache, f, indent=2, default=str)
# Merge rank-specific cache entries (INDEPENDENT ops).
# MUST be a merge (not assignment): a concurrent writer
# that committed its rank_{rank} contribution between
# this writer's load_cache and save_cache would otherwise
# be silently dropped by an `=` assignment that re-writes
# the slot with only this writer's in-memory cache.
rank_key = f"rank_{rank}"
if rank_key not in current_cache:
current_cache[rank_key] = {}
current_cache[rank_key].update(serialized_rank_cache)
self._atomic_write_json(file_path, current_cache)
with _exclusive_cache_lock(lock_path):
current_cache = self._read_existing_cache(file_path)
self._drop_excluded_entries(current_cache)
# Merge shared cache entries (non-INDEPENDENT ops)
if self.SHARED_CACHE_KEY not in current_cache:
current_cache[self.SHARED_CACHE_KEY] = {}
current_cache[self.SHARED_CACHE_KEY].update(
serialized_shared_cache)
# Merge rank-specific cache entries (INDEPENDENT ops).
# MUST be a merge (not assignment): a concurrent writer
# that committed its rank_{rank} contribution between
# this writer's load_cache and save_cache would otherwise
# be silently dropped by an `=` assignment that re-writes
# the slot with only this writer's in-memory cache.
rank_key = f"rank_{rank}"
if rank_key not in current_cache:
current_cache[rank_key] = {}
current_cache[rank_key].update(serialized_rank_cache)
self._atomic_write_json(file_path, current_cache)
def _drop_excluded_entries(self, cache_contents: Dict[str, Any]) -> None:
"""Remove previously persisted entries for ops excluded from disk cache."""
if not self.excluded_op:
return
for bucket in cache_contents.values():
if not isinstance(bucket, dict):
continue
for key_str in list(bucket.keys()):
try:
key = ast.literal_eval(key_str)
except (ValueError, SyntaxError):
continue
if key and key[0] in self.excluded_op:
del bucket[key_str]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/autotuner.py` around lines 596 - 616, The cache merge in
_partition_cache_by_strategy() preserves stale persisted entries for ops now
marked exclude_from_cache, so they are not fully removed from disk. Before
calling _atomic_write_json, filter current_cache to drop any excluded-op keys
from both the shared cache and the rank-specific cache, then merge the remaining
serialized_shared_cache and serialized_rank_cache entries as usual.

Comment on lines +1083 to +1096
if tuning_config.exclude_from_cache:
self.profiling_cache.add_excluded_op(custom_op)
if tuning_config.exclude_from_cache and self.is_tuning_mode:
is_cache_hit = False
best_runner_id, best_tactic, min_time = (
self.profiling_cache.fallback_entry())
else:
is_cache_hit, best_runner_id, best_tactic, min_time = (
self.profiling_cache.search_cache(
custom_op,
runners,
input_shapes,
tuning_config,
apply_map_to_tuning_buckets=True))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Apply the excluded-cache miss to the per-profile lookup too.

The initial forced miss avoids the early return, but the tuning loop still calls search_cache() for each profile. A stale loaded entry can therefore skip _profile_runners() and the JIT warmup this option is meant to force.

Proposed fix
                 tensors = self._prepare_input_tensors(p, inputs)
-                is_cache_hit, *_ = self.profiling_cache.search_cache(
-                    custom_op,
-                    runners,
-                    p.get_opt_shapes(),
-                    tuning_config,
-                    apply_map_to_tuning_buckets=False,
-                )
+                if tuning_config.exclude_from_cache:
+                    is_cache_hit = False
+                else:
+                    is_cache_hit, *_ = self.profiling_cache.search_cache(
+                        custom_op,
+                        runners,
+                        p.get_opt_shapes(),
+                        tuning_config,
+                        apply_map_to_tuning_buckets=False,
+                    )

Also applies to: 1141-1148

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/autotuner.py` around lines 1083 - 1096, The forced cache
miss in autotuning is only applied before the early return, but the per-profile
lookup in the tuning flow still uses search_cache() and can reuse stale entries.
Update the lookup path in autotuner’s cache handling so the excluded-from-cache
behavior also forces a miss during each profile-specific search, ensuring
_profile_runners() and the JIT warmup still run when
tuning_config.exclude_from_cache is set. Use the existing custom_op,
profiling_cache.search_cache(), and fallback_entry() logic as the main
touchpoints.

Comment on lines +1244 to +1257
except Exception as e:
shapes = self._get_input_sizes(input_tensors)
logger.warning_once(
f"[Autotuner] Single-pair run failed for custom_op={custom_op}, runner={runner}, tactic={tac}, shapes={shapes}. Error: {e}",
key=(custom_op, "warning_autotuning_single_pair_failure"),
)
self.stats.failed_profiling_count[custom_op].add(
self.profiling_cache.get_cache_key(
custom_op,
runner,
profile.get_opt_shapes(),
tuning_config,
apply_map_to_tuning_buckets=False))
has_tuning_failure_occurred = True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
ruff check --select BLE001,S110 tensorrt_llm/_torch/autotuner.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 3116


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant sections around the reported lines.
sed -n '730,790p' tensorrt_llm/_torch/autotuner.py
printf '\n---\n'
sed -n '1220,1310p' tensorrt_llm/_torch/autotuner.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 7231


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect surrounding helper definitions to understand the expected exception surface.
rg -n "torch.cuda.synchronize|warning_once|failed_profiling_count|profile.get_opt_shapes|RuntimeError|Exception as e" tensorrt_llm/_torch/autotuner.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1985


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check how this module handles neighboring runtime failures for consistency.
sed -n '930,980p' tensorrt_llm/_torch/autotuner.py
printf '\n---\n'
sed -n '1030,1065p' tensorrt_llm/_torch/autotuner.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 4418


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the profiling helper and nearby exception handling patterns.
sed -n '680,745p' tensorrt_llm/_torch/autotuner.py
printf '\n---\n'
sed -n '745,785p' tensorrt_llm/_torch/autotuner.py
printf '\n---\n'
sed -n '1180,1310p' tensorrt_llm/_torch/autotuner.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 11101


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find how this file and nearby code handle CUDA/runtime exceptions and synchronization failures.
rg -n "RuntimeError|torch\.cuda\.synchronize|warning_once|except Exception as e|except RuntimeError as e|except Exception:" tensorrt_llm/_torch/autotuner.py tensorrt_llm -g '!**/__pycache__/**'

Repository: NVIDIA/TensorRT-LLM

Length of output: 50376


Narrow the profiling exception handlers. Replace the broad except Exception blocks at tensorrt_llm/_torch/autotuner.py:1244 and :1283-1291 with the specific CUDA/runtime errors these paths are meant to handle, and log torch.cuda.synchronize() failures instead of swallowing them. The current handlers still trip BLE001/S110.

🧰 Tools
🪛 Ruff (0.15.18)

[warning] 1244-1244: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/autotuner.py` around lines 1244 - 1257, The exception
handlers in Autotuner are too broad and should be narrowed to the specific
CUDA/runtime failures these profiling paths expect. Update the `except Exception
as e` blocks in `Autotuner` (including the single-pair failure path and the
later profiling path) to catch the relevant PyTorch/CUDA error types only, and
explicitly handle `torch.cuda.synchronize()` failures by logging them with
context instead of swallowing them. Keep the existing failure bookkeeping in
place after the narrower catch so profiling stats still update correctly.

Sources: Coding guidelines, Linters/SAST tools

Comment on lines +1076 to +1078
for actual, expected, name in zip(
graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input")
):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Add strict=True to the output comparison zip().

Ruff reports B905 here; this assertion should fail loudly if the runner returns a different number of outputs.

Proposed fix
     for actual, expected, name in zip(
-        graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input")
+        graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input"), strict=True
     ):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for actual, expected, name in zip(
graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input")
):
for actual, expected, name in zip(
graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input"), strict=True
):
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 1076-1078: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/test_mhc.py` around lines 1076 - 1078, The
output comparison loop in the test uses zip() without strict checking, so it may
silently ignore mismatched output counts. Update the zip in the
graph_out/eager_out assertion to use strict=True, and keep the existing tuple of
names so the test fails loudly if the runner returns a different number of
outputs.

Source: Linters/SAST tools

Comment on lines +1098 to +1106
def test_mhc_fused_hc_cuda_graph_high_splitk_tactics(n: int, tactic):
"""Reduced autotune maps decode buckets to high-splitK MMA tactics.

Unlike the bit-exact ks=1 graph test above, this covers the actual
M=64/128 PR autotune path where splitK atomics and CUDA graph replay are
both active.
"""
runner, inputs = _make_fused_hc_runner_case(n=n, hidden_size=4096, hc_mult=4, seed=41 + n)
_assert_graph_replay_matches_eager(runner, inputs, tactic)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Skip these MMA-only tests when fused-HC MMA is unavailable.

Both new tests force fused_*_mma tactics, but unlike the existing CUDA-graph test above, they do not guard _mhc_fused_hc_mma_available(). On non-SM100 or builds without BUILD_DEEP_GEMM=ON, this will fail instead of skip. Coverage is sufficient for the new CUDA-graph paths once the environment guard is added.

Proposed fix
 def test_mhc_fused_hc_cuda_graph_high_splitk_tactics(n: int, tactic):
     """Reduced autotune maps decode buckets to high-splitK MMA tactics.
@@
     both active.
     """
+    if not _mhc_fused_hc_mma_available():
+        pytest.skip("mHC fused-HC MMA kernels require SM100 and BUILD_DEEP_GEMM=ON")
+
     runner, inputs = _make_fused_hc_runner_case(n=n, hidden_size=4096, hc_mult=4, seed=41 + n)
     _assert_graph_replay_matches_eager(runner, inputs, tactic)
@@
 def test_mhc_fused_hc_cuda_graph_decode_buckets_then_prefill():
@@
     """
+    if not _mhc_fused_hc_mma_available():
+        pytest.skip("mHC fused-HC MMA kernels require SM100 and BUILD_DEEP_GEMM=ON")
+
     hidden_size = 4096

As per path instructions, tests/**: keep feedback actionable and state whether coverage is sufficient.

Also applies to: 1109-1137

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/test_mhc.py` around lines 1098 - 1106, Add the
same availability guard used by the existing CUDA-graph MMA test to the new
fused-HC MMA-only tests, including
test_mhc_fused_hc_cuda_graph_high_splitk_tactics and the other fused_*_mma
CUDA-graph path test in this block, so they skip when
_mhc_fused_hc_mma_available() is false. Keep the tests’ current coverage for the
CUDA-graph replay paths, but ensure the guard is applied before creating the
runner or forcing MMA tactics so non-SM100 and non-BUILD_DEEP_GEMM builds skip
cleanly instead of failing.

Source: Path instructions

mingyangHao and others added 4 commits June 25, 2026 12:22
Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com>
Co-authored-by: Mingyang Hao <mingyangHao@users.noreply.github.com>
(cherry picked from commit 93ed231)
Signed-off-by: Mingyang <mhao1999@outlook.com>
…VIDIA#14441)

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
(cherry picked from commit 335a418)
…actic) shortcut (NVIDIA#14458)

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
(cherry picked from commit 5d59d34)
…VIDIA#14675)

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
(cherry picked from commit f6f6f14)
@lfr-0531 lfr-0531 force-pushed the user/fanrongl/dsv4-a7728aa-pr4-autotuner-20260625 branch from 23fbc68 to 71e2338 Compare June 25, 2026 12:23
@lfr-0531

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55794 [ run ] triggered by Bot. Commit: 71e2338 Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants