[None][perf] DSv4 follow-up: autotuner updates #15626
📝 Walkthrough
The PR adds an autotuner cache opt-out path with lockfile-backed JSON persistence, a single-pair profiling shortcut, updated mHC distributed tuning and MMA tactic selection rules, and new CUDA graph coverage for fused-HC behaviors.
Changes
Autotuner and mHC tuning updates
Sequence Diagram(s)
sequenceDiagram
participant AutoTunerProfilingCache
participant _exclusive_cache_lock
participant _atomic_write_json
AutoTunerProfilingCache->>_exclusive_cache_lock: acquire lockfile
AutoTunerProfilingCache->>AutoTunerProfilingCache: read existing cache JSON
AutoTunerProfilingCache->>AutoTunerProfilingCache: merge shared and rank entries
AutoTunerProfilingCache->>_atomic_write_json: write merged JSON
AutoTunerProfilingCache->>_exclusive_cache_lock: release lockfile
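The read-merge-write sequence in the diagram can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's implementation: `exclusive_lock`, `atomic_write_json`, `save_cache`, and the `"shared"` bucket name are hypothetical stand-ins for `_exclusive_cache_lock`, `_atomic_write_json`, and `SHARED_CACHE_KEY`, and the lock uses POSIX `fcntl.flock`, which may differ from what the PR does.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path (POSIX only)."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def atomic_write_json(path, payload):
    """Write to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(payload, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def save_cache(path, rank, shared_entries, rank_entries):
    """Read, merge, and rewrite the cache while holding the lock."""
    with exclusive_lock(path + ".lock"):
        current = {}
        if os.path.exists(path):
            with open(path) as f:
                current = json.load(f)
        # Merge (not assign) so entries from other writers survive.
        current.setdefault("shared", {}).update(shared_entries)
        current.setdefault(f"rank_{rank}", {}).update(rank_entries)
        atomic_write_json(path, current)
```

Because both buckets are merged with `update()`, a second call for the same rank adds to that rank's entries instead of replacing them, which is the property the review comment on the rank-specific merge relies on.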
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 5
🧹 Nitpick comments (1)
tests/unittest/_torch/misc/test_autotuner.py (1)
1056-1129: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Add a cache opt-out regression next to this shortcut test.
Coverage is sufficient for the single-pair shortcut, but this PR also changes `exclude_from_cache` persistence. Please add a case in `tests/unittest/_torch/misc/test_autotuner.py` that seeds a stale cache entry for an excluded op, enters `autotune(cache_path=...)`, and asserts profiling still runs and the saved JSON no longer contains that op.
As per path instructions, `tests/**` reviews should assess coverage and suggest concrete file names.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/misc/test_autotuner.py` around lines 1056 - 1129, The shortcut test covers single-pair profiling, but it does not verify the new exclude_from_cache persistence behavior. Add a regression case in test_autotuner alongside test_single_pair_shortcut that seeds a stale profiling cache entry for an op marked excluded, then enters autotune(cache_path=...) and confirms AutoTuner still profiles instead of reusing the stale entry. After the run, assert the saved cache JSON no longer contains that op so the persistence cleanup is exercised.
Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/autotuner.py`:
- Around line 1083-1096: The forced cache miss in autotuning is only applied
before the early return, but the per-profile lookup in the tuning flow still
uses search_cache() and can reuse stale entries. Update the lookup path in
autotuner’s cache handling so the excluded-from-cache behavior also forces a
miss during each profile-specific search, ensuring _profile_runners() and the
JIT warmup still run when tuning_config.exclude_from_cache is set. Use the
existing custom_op, profiling_cache.search_cache(), and fallback_entry() logic
as the main touchpoints.
- Around line 596-616: The cache merge in _partition_cache_by_strategy()
preserves stale persisted entries for ops now marked exclude_from_cache, so they
are not fully removed from disk. Before calling _atomic_write_json, filter
current_cache to drop any excluded-op keys from both the shared cache and the
rank-specific cache, then merge the remaining serialized_shared_cache and
serialized_rank_cache entries as usual.
- Around line 1244-1257: The exception handlers in Autotuner are too broad and
should be narrowed to the specific CUDA/runtime failures these profiling paths
expect. Update the `except Exception as e` blocks in `Autotuner` (including the
single-pair failure path and the later profiling path) to catch the relevant
PyTorch/CUDA error types only, and explicitly handle `torch.cuda.synchronize()`
failures by logging them with context instead of swallowing them. Keep the
existing failure bookkeeping in place after the narrower catch so profiling
stats still update correctly.
In `@tests/unittest/_torch/modules/test_mhc.py`:
- Around line 1098-1106: Add the same availability guard used by the existing
CUDA-graph MMA test to the new fused-HC MMA-only tests, including
test_mhc_fused_hc_cuda_graph_high_splitk_tactics and the other fused_*_mma
CUDA-graph path test in this block, so they skip when
_mhc_fused_hc_mma_available() is false. Keep the tests’ current coverage for the
CUDA-graph replay paths, but ensure the guard is applied before creating the
runner or forcing MMA tactics so non-SM100 and non-BUILD_DEEP_GEMM builds skip
cleanly instead of failing.
- Around line 1076-1078: The output comparison loop in the test uses zip()
without strict checking, so it may silently ignore mismatched output counts.
Update the zip in the graph_out/eager_out assertion to use strict=True, and keep
the existing tuple of names so the test fails loudly if the runner returns a
different number of outputs.
---
Nitpick comments:
In `@tests/unittest/_torch/misc/test_autotuner.py`:
- Around line 1056-1129: The shortcut test covers single-pair profiling, but it
does not verify the new exclude_from_cache persistence behavior. Add a
regression case in test_autotuner alongside test_single_pair_shortcut that seeds
a stale profiling cache entry for an op marked excluded, then enters
autotune(cache_path=...) and confirms AutoTuner still profiles instead of
reusing the stale entry. After the run, assert the saved cache JSON no longer
contains that op so the persistence cleanup is exercised.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f309f66e-9e0f-4dbc-9173-c43b059aeea6
📒 Files selected for processing (5)
tensorrt_llm/_torch/autotuner.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/_torch/modules/mhc/mhc_cuda.py
tests/unittest/_torch/misc/test_autotuner.py
tests/unittest/_torch/modules/test_mhc.py
 with _exclusive_cache_lock(lock_path):
     current_cache = self._read_existing_cache(file_path)

     # Merge shared cache entries (non-INDEPENDENT ops)
     if self.SHARED_CACHE_KEY not in current_cache:
         current_cache[self.SHARED_CACHE_KEY] = {}
     current_cache[self.SHARED_CACHE_KEY].update(
         serialized_shared_cache)

-    # Save rank-specific cache entries (INDEPENDENT ops)
-    current_cache[f"rank_{rank}"] = serialized_rank_cache
-
-    json.dump(current_cache, f, indent=2, default=str)
+    # Merge rank-specific cache entries (INDEPENDENT ops).
+    # MUST be a merge (not assignment): a concurrent writer
+    # that committed its rank_{rank} contribution between
+    # this writer's load_cache and save_cache would otherwise
+    # be silently dropped by an `=` assignment that re-writes
+    # the slot with only this writer's in-memory cache.
+    rank_key = f"rank_{rank}"
+    if rank_key not in current_cache:
+        current_cache[rank_key] = {}
+    current_cache[rank_key].update(serialized_rank_cache)
+
+    self._atomic_write_json(file_path, current_cache)
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Purge stale excluded-op entries from the loaded JSON before writing.
`_partition_cache_by_strategy()` drops only this process's outgoing entries. If `current_cache` already contains an older persisted entry for an op now marked `exclude_from_cache`, the merge at Lines 599-614 preserves it, so the op is not actually "dropped entirely" from disk.
Proposed fix
 with _exclusive_cache_lock(lock_path):
     current_cache = self._read_existing_cache(file_path)
+    self._drop_excluded_entries(current_cache)
     # Merge shared cache entries (non-INDEPENDENT ops)
     if self.SHARED_CACHE_KEY not in current_cache:
         current_cache[self.SHARED_CACHE_KEY] = {}

+def _drop_excluded_entries(self, cache_contents: Dict[str, Any]) -> None:
+    """Remove previously persisted entries for ops excluded from disk cache."""
+    if not self.excluded_op:
+        return
+
+    for bucket in cache_contents.values():
+        if not isinstance(bucket, dict):
+            continue
+        for key_str in list(bucket.keys()):
+            try:
+                key = ast.literal_eval(key_str)
+            except (ValueError, SyntaxError):
+                continue
+            if key and key[0] in self.excluded_op:
+                del bucket[key_str]
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
with _exclusive_cache_lock(lock_path):
    current_cache = self._read_existing_cache(file_path)
    self._drop_excluded_entries(current_cache)
    # Merge shared cache entries (non-INDEPENDENT ops)
    if self.SHARED_CACHE_KEY not in current_cache:
        current_cache[self.SHARED_CACHE_KEY] = {}
    current_cache[self.SHARED_CACHE_KEY].update(
        serialized_shared_cache)
    # Merge rank-specific cache entries (INDEPENDENT ops).
    # MUST be a merge (not assignment): a concurrent writer
    # that committed its rank_{rank} contribution between
    # this writer's load_cache and save_cache would otherwise
    # be silently dropped by an `=` assignment that re-writes
    # the slot with only this writer's in-memory cache.
    rank_key = f"rank_{rank}"
    if rank_key not in current_cache:
        current_cache[rank_key] = {}
    current_cache[rank_key].update(serialized_rank_cache)
    self._atomic_write_json(file_path, current_cache)

def _drop_excluded_entries(self, cache_contents: Dict[str, Any]) -> None:
    """Remove previously persisted entries for ops excluded from disk cache."""
    if not self.excluded_op:
        return
    for bucket in cache_contents.values():
        if not isinstance(bucket, dict):
            continue
        for key_str in list(bucket.keys()):
            try:
                key = ast.literal_eval(key_str)
            except (ValueError, SyntaxError):
                continue
            if key and key[0] in self.excluded_op:
                del bucket[key_str]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/autotuner.py` around lines 596 - 616, The cache merge in
_partition_cache_by_strategy() preserves stale persisted entries for ops now
marked exclude_from_cache, so they are not fully removed from disk. Before
calling _atomic_write_json, filter current_cache to drop any excluded-op keys
from both the shared cache and the rank-specific cache, then merge the remaining
serialized_shared_cache and serialized_rank_cache entries as usual.
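The stale-entry problem described in this comment can be reproduced with plain dictionaries. The sketch below is illustrative only: it assumes cache keys are persisted as the string form of a tuple whose first element is the op name (which the `ast.literal_eval(key_str)` and `key[0]` usage in the proposed fix suggests), and `merge_without_purge`, `merge_with_purge`, `op_a`, and `op_b` are hypothetical names.

```python
import ast


def merge_without_purge(on_disk, outgoing):
    """Plain update(): anything already on disk survives the merge."""
    merged = dict(on_disk)
    merged.update(outgoing)
    return merged


def merge_with_purge(on_disk, outgoing, excluded_ops):
    """Drop on-disk entries whose op is excluded, then merge as usual."""
    kept = {}
    for key_str, value in on_disk.items():
        try:
            key = ast.literal_eval(key_str)
        except (ValueError, SyntaxError):
            kept[key_str] = value  # unparseable keys are left untouched
            continue
        if isinstance(key, tuple) and key and key[0] in excluded_ops:
            continue  # stale entry for an excluded op
        kept[key_str] = value
    kept.update(outgoing)
    return kept


# One stale entry for op_a (now excluded) and one valid entry for op_b.
on_disk = {str(("op_a", 64)): "tactic_1", str(("op_b", 64)): "tactic_2"}
# This process only emits op_b entries, because op_a is filtered on the way out.
outgoing = {str(("op_b", 128)): "tactic_3"}

stale = merge_without_purge(on_disk, outgoing)
clean = merge_with_purge(on_disk, outgoing, excluded_ops={"op_a"})
```

Filtering only the outgoing entries leaves the `op_a` key in `stale`, while `clean` contains only the two `op_b` entries.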
if tuning_config.exclude_from_cache:
    self.profiling_cache.add_excluded_op(custom_op)
if tuning_config.exclude_from_cache and self.is_tuning_mode:
    is_cache_hit = False
    best_runner_id, best_tactic, min_time = (
        self.profiling_cache.fallback_entry())
else:
    is_cache_hit, best_runner_id, best_tactic, min_time = (
        self.profiling_cache.search_cache(
            custom_op,
            runners,
            input_shapes,
            tuning_config,
            apply_map_to_tuning_buckets=True))
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Apply the excluded-cache miss to the per-profile lookup too.
The initial forced miss avoids the early return, but the tuning loop still calls `search_cache()` for each profile. A stale loaded entry can therefore skip `_profile_runners()` and the JIT warmup this option is meant to force.
Proposed fix
 tensors = self._prepare_input_tensors(p, inputs)
-is_cache_hit, *_ = self.profiling_cache.search_cache(
-    custom_op,
-    runners,
-    p.get_opt_shapes(),
-    tuning_config,
-    apply_map_to_tuning_buckets=False,
-)
+if tuning_config.exclude_from_cache:
+    is_cache_hit = False
+else:
+    is_cache_hit, *_ = self.profiling_cache.search_cache(
+        custom_op,
+        runners,
+        p.get_opt_shapes(),
+        tuning_config,
+        apply_map_to_tuning_buckets=False,
+    )
Also applies to: 1141-1148
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/autotuner.py` around lines 1083 - 1096, The forced cache
miss in autotuning is only applied before the early return, but the per-profile
lookup in the tuning flow still uses search_cache() and can reuse stale entries.
Update the lookup path in autotuner’s cache handling so the excluded-from-cache
behavior also forces a miss during each profile-specific search, ensuring
_profile_runners() and the JIT warmup still run when
tuning_config.exclude_from_cache is set. Use the existing custom_op,
profiling_cache.search_cache(), and fallback_entry() logic as the main
touchpoints.
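A toy model makes the control flow in this comment easier to see. This is a sketch under stated assumptions, not the autotuner's code: `ToyTuner`, its `_lookup` method, and the cache layout are hypothetical, and "profiling" is reduced to recording which shapes were visited.

```python
class ToyTuner:
    """Minimal stand-in for a tuner with a per-profile cache lookup."""

    def __init__(self, cache):
        self.cache = dict(cache)  # (op, shape) -> tactic, possibly stale
        self.profiled = []        # shapes that were actually profiled

    def _lookup(self, op, shape, exclude_from_cache):
        # Forcing the miss here covers every caller, including the loop below.
        if exclude_from_cache:
            return False, None
        if (op, shape) in self.cache:
            return True, self.cache[(op, shape)]
        return False, None

    def tune(self, op, shapes, exclude_from_cache=False):
        for shape in shapes:
            hit, _ = self._lookup(op, shape, exclude_from_cache)
            if hit:
                continue  # a stale entry would skip profiling here
            self.profiled.append(shape)
            self.cache[(op, shape)] = f"tactic_for_{shape}"


stale_cache = {("op", 64): "old_tactic"}

reuses_stale = ToyTuner(stale_cache)
reuses_stale.tune("op", [64, 128])

forces_miss = ToyTuner(stale_cache)
forces_miss.tune("op", [64, 128], exclude_from_cache=True)
```

Without the opt-out, only shape 128 is profiled because the stale entry for 64 counts as a hit; with the opt-out applied inside the per-profile lookup, both shapes are profiled.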
except Exception as e:
    shapes = self._get_input_sizes(input_tensors)
    logger.warning_once(
        f"[Autotuner] Single-pair run failed for custom_op={custom_op}, runner={runner}, tactic={tac}, shapes={shapes}. Error: {e}",
        key=(custom_op, "warning_autotuning_single_pair_failure"),
    )
    self.stats.failed_profiling_count[custom_op].add(
        self.profiling_cache.get_cache_key(
            custom_op,
            runner,
            profile.get_opt_shapes(),
            tuning_config,
            apply_map_to_tuning_buckets=False))
    has_tuning_failure_occurred = True
📐 Maintainability & Code Quality | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
ruff check --select BLE001,S110 tensorrt_llm/_torch/autotuner.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 3116
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the relevant sections around the reported lines.
sed -n '730,790p' tensorrt_llm/_torch/autotuner.py
printf '\n---\n'
sed -n '1220,1310p' tensorrt_llm/_torch/autotuner.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 7231
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect surrounding helper definitions to understand the expected exception surface.
rg -n "torch.cuda.synchronize|warning_once|failed_profiling_count|profile.get_opt_shapes|RuntimeError|Exception as e" tensorrt_llm/_torch/autotuner.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 1985
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check how this module handles neighboring runtime failures for consistency.
sed -n '930,980p' tensorrt_llm/_torch/autotuner.py
printf '\n---\n'
sed -n '1030,1065p' tensorrt_llm/_torch/autotuner.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 4418
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the profiling helper and nearby exception handling patterns.
sed -n '680,745p' tensorrt_llm/_torch/autotuner.py
printf '\n---\n'
sed -n '745,785p' tensorrt_llm/_torch/autotuner.py
printf '\n---\n'
sed -n '1180,1310p' tensorrt_llm/_torch/autotuner.py
Repository: NVIDIA/TensorRT-LLM
Length of output: 11101
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find how this file and nearby code handle CUDA/runtime exceptions and synchronization failures.
rg -n "RuntimeError|torch\.cuda\.synchronize|warning_once|except Exception as e|except RuntimeError as e|except Exception:" tensorrt_llm/_torch/autotuner.py tensorrt_llm -g '!**/__pycache__/**'
Repository: NVIDIA/TensorRT-LLM
Length of output: 50376
Narrow the profiling exception handlers.
Replace the broad `except Exception` blocks at `tensorrt_llm/_torch/autotuner.py:1244` and `:1283-1291` with the specific CUDA/runtime errors these paths are meant to handle, and log `torch.cuda.synchronize()` failures instead of swallowing them. The current handlers still trip BLE001/S110.
🧰 Tools
🪛 Ruff (0.15.18)
[warning] 1244-1244: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/autotuner.py` around lines 1244 - 1257, The exception
handlers in Autotuner are too broad and should be narrowed to the specific
CUDA/runtime failures these profiling paths expect. Update the `except Exception
as e` blocks in `Autotuner` (including the single-pair failure path and the
later profiling path) to catch the relevant PyTorch/CUDA error types only, and
explicitly handle `torch.cuda.synchronize()` failures by logging them with
context instead of swallowing them. Keep the existing failure bookkeeping in
place after the narrower catch so profiling stats still update correctly.
Sources: Coding guidelines, Linters/SAST tools
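One way to read this suggestion is sketched below, without importing PyTorch so that it stays self-contained. It assumes that the failures the profiling paths expect surface as `RuntimeError` (PyTorch generally raises CUDA errors, including out-of-memory, as `RuntimeError` subclasses); whether that set is complete for this code path is not confirmed by the review. `profile_once` and its parameters are hypothetical names.

```python
import logging

logger = logging.getLogger("autotuner_sketch")


def profile_once(run, synchronize, failures, key):
    """Run one candidate; record expected runtime failures, let bugs propagate."""
    try:
        run()
        synchronize()
    except RuntimeError as e:  # assumed to cover CUDA/runtime failures
        logger.warning("profiling failed for %s: %s", key, e)
        failures.add(key)  # failure bookkeeping stays in place
        try:
            synchronize()
        except RuntimeError as sync_err:
            # Log the synchronize failure with context instead of swallowing it.
            logger.warning("synchronize after failure also failed for %s: %s",
                           key, sync_err)
        return False
    return True
```

With this shape, a programming error such as a `TypeError` inside `run` is no longer recorded as a tuning failure; it propagates to the caller, which is the behavioral difference from `except Exception`.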
for actual, expected, name in zip(
    graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input")
):
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Add `strict=True` to the output comparison `zip()`.
Ruff reports B905 here; this assertion should fail loudly if the runner returns a different number of outputs.
Proposed fix
 for actual, expected, name in zip(
-    graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input")
+    graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input"), strict=True
 ):
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
for actual, expected, name in zip(
    graph_out, eager_out, ("residual", "post_mix", "comb_mix", "layer_input"), strict=True
):
🧰 Tools
🪛 Ruff (0.15.18)
[warning] 1076-1078: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unittest/_torch/modules/test_mhc.py` around lines 1076 - 1078, The
output comparison loop in the test uses zip() without strict checking, so it may
silently ignore mismatched output counts. Update the zip in the
graph_out/eager_out assertion to use strict=True, and keep the existing tuple of
names so the test fails loudly if the runner returns a different number of
outputs.
Source: Linters/SAST tools
def test_mhc_fused_hc_cuda_graph_high_splitk_tactics(n: int, tactic):
    """Reduced autotune maps decode buckets to high-splitK MMA tactics.

    Unlike the bit-exact ks=1 graph test above, this covers the actual
    M=64/128 PR autotune path where splitK atomics and CUDA graph replay are
    both active.
    """
    runner, inputs = _make_fused_hc_runner_case(n=n, hidden_size=4096, hc_mult=4, seed=41 + n)
    _assert_graph_replay_matches_eager(runner, inputs, tactic)
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Skip these MMA-only tests when fused-HC MMA is unavailable.
Both new tests force `fused_*_mma` tactics, but unlike the existing CUDA-graph test above, they do not guard on `_mhc_fused_hc_mma_available()`. On non-SM100 hardware or builds without `BUILD_DEEP_GEMM=ON`, they will fail instead of skip. Coverage is sufficient for the new CUDA-graph paths once the environment guard is added.
Proposed fix
 def test_mhc_fused_hc_cuda_graph_high_splitk_tactics(n: int, tactic):
     """Reduced autotune maps decode buckets to high-splitK MMA tactics.
@@
     both active.
     """
+    if not _mhc_fused_hc_mma_available():
+        pytest.skip("mHC fused-HC MMA kernels require SM100 and BUILD_DEEP_GEMM=ON")
+
     runner, inputs = _make_fused_hc_runner_case(n=n, hidden_size=4096, hc_mult=4, seed=41 + n)
     _assert_graph_replay_matches_eager(runner, inputs, tactic)
@@
 def test_mhc_fused_hc_cuda_graph_decode_buckets_then_prefill():
@@
     """
+    if not _mhc_fused_hc_mma_available():
+        pytest.skip("mHC fused-HC MMA kernels require SM100 and BUILD_DEEP_GEMM=ON")
+
     hidden_size = 4096
As per path instructions, `tests/**`: keep feedback actionable and state whether coverage is sufficient.
Also applies to: 1109-1137
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unittest/_torch/modules/test_mhc.py` around lines 1098 - 1106, Add the
same availability guard used by the existing CUDA-graph MMA test to the new
fused-HC MMA-only tests, including
test_mhc_fused_hc_cuda_graph_high_splitk_tactics and the other fused_*_mma
CUDA-graph path test in this block, so they skip when
_mhc_fused_hc_mma_available() is false. Keep the tests’ current coverage for the
CUDA-graph replay paths, but ensure the guard is applied before creating the
runner or forcing MMA tactics so non-SM100 and non-BUILD_DEEP_GEMM builds skip
cleanly instead of failing.
Source: Path instructions
Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com> Co-authored-by: Mingyang Hao <mingyangHao@users.noreply.github.com> (cherry picked from commit 93ed231) Signed-off-by: Mingyang <mhao1999@outlook.com>
…VIDIA#14441) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> (cherry picked from commit 335a418)
…actic) shortcut (NVIDIA#14458) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> (cherry picked from commit 5d59d34)
…VIDIA#14675) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> (cherry picked from commit f6f6f14)
23fbc68 to 71e2338
/bot run --disable-fail-fast
PR_Github #55794 [ run ] triggered by Bot. Commit:
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Description
This PR carves the PR-4 autotuner scope from `github/feat/deepseek_v4` after `a7728aa` onto current `github/main`.
It includes:
… `(runner, tactic)` shortcut.
… `MhcFusedHc`.
The PR is intentionally limited to autotuner/MHC tuning files and direct unit tests. The source test-list hunk for deleted `l0_b200_ds.yml` was left out per the split plan.
Rebase/conflict notes are recorded in `tmp/dsv4_final_prs/PR4_rebase_conflicts_20260625.md`.
Test Coverage
`tensorrt_llm`, `tensorrt_llm._torch.autotuner`, and `tensorrt_llm._torch.modules.mhc.mhc_cuda` import from this worktree.
`git diff --check github/main...HEAD`
`PATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 20m .venv-3.12/bin/python -m pytest tests/unittest/_torch/misc/test_autotuner.py -q`
30 passed, 4 skipped, 10 warnings
`PATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=0 timeout 20m .venv-3.12/bin/python -m pytest tests/unittest/_torch/modules/test_mhc.py -q`
55 passed, 3 warnings
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either `api-compatible` or `api-breaking`. For `api-breaking`, include `BREAKING` in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.