[TRTLLM-12950][perf] DSv4 follow-up: DeepGEMM and MegaMoE by lfr-0531 · Pull Request #15632 · NVIDIA/TensorRT-LLM

lfr-0531 · 2026-06-25T16:13:46Z

Summary by CodeRabbit

New Features
- Added a new MoE preparation path that can directly set up selected experts, weights, and quantized activations.
- Added support for bounded SwiGLU behavior in several fused MoE kernels and Python APIs.
- Added a helper to detect when all-reduce inputs already use an NCCL window buffer.
Bug Fixes
- Improved MoE scheduling and backend selection for fused-prepare and routing-separation cases.
- Reduced unnecessary memory retention by releasing CPU-backed weights more aggressively.
- Refined all-reduce caching behavior for NCCL window-backed inputs.

Description

This PR carves the PR-5 DeepGEMM/MegaMoE follow-up scope from github/feat/deepseek_v4 after a7728aa onto current github/main.

It includes:

DeepGEMM dependency and PDL setup.
MegaMoE prepare/custom-op registration and scheduler follow-up.
MegaMoE loader/OOM and EPLB-related follow-up that can be applied without introducing the full DSv4 model file.
CuTeDSL grouped-GEMM swiglu-limit follow-up and related unit coverage.

It intentionally excludes NVFP4 MegaMoE backend/tuning commits aad9e0f008 and f90246b3d8 because equivalent material is already on main, and it excludes the modeling_deepseekv4.py hunks because that file is not present on current main and belongs to the PR-3 DSv4 model/sparse-attention stack.

Rebase/conflict notes are recorded in tmp/dsv4_final_prs/PR5_rebase_conflicts_20260625.md.

Test Coverage

Build/install:
- python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --use_ccache --cuda_architectures "90-real;100-real" --configure_cmake
- .venv-3.12/bin/pip install --force-reinstall --no-deps build/tensorrt_llm-1.3.0rc20-cp312-cp312-linux_x86_64.whl
- Verified installed libth_common.so contains the PR-5 C++ tactic replacement string.
Static/scope:
- git diff --check
- Strict changed-file conflict-marker search.
- PR-5 forbidden-scope check returned no matches.
- pre-commit run --files ... passed on touched source/test files.
Unit tests:
- Focused installed-wheel MoE/CUTLASS/MegaMoE regression:
  - 6 passed, 3 warnings
- PATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=1 timeout 12m .venv-3.12/bin/python -m pytest tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py tests/unittest/_torch/thop/parallel/test_fp8_quantize.py -q --tb=short
  - 934 passed, 9 skipped, 245 warnings

Full tests/unittest/_torch/modules/moe caveat:

One full-suite attempt before the final verification fixes exited with CUDA illegal-memory-access cascading errors after the CUDA context was poisoned.
A diagnostic rerun with CUDA_LAUNCH_BLOCKING=1, -x, and a 15 minute timeout did not reproduce a first failure before timeout.

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> (cherry picked from commit f172310)

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit 41a16de)

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit 2fbd0e1)

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> (cherry picked from commit a0d5400)

Signed-off-by: Yuhan Li <51736452+liyuhannnnn@users.noreply.github.com> Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com> Co-authored-by: Yuhan Li <51736452+liyuhannnnn@users.noreply.github.com> (cherry picked from commit 15602a2) Signed-off-by: Qi Zhang (qizh) <10434017+Tracin@users.noreply.github.com>

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com> (cherry picked from commit e0539a5) Signed-off-by: Qi Zhang (qizh) <10434017+Tracin@users.noreply.github.com>

…DIA#14673) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit e97de0d)

…VIDIA#14658) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit 2dd03e6)

… nodes (NVIDIA#14880) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com> (cherry picked from commit 524b20d) Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>

(cherry picked from commit d7f9e4a) Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com> Signed-off-by: Mingyang <mingyangh@nvidia.com> # Conflicts: # tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py # tests/unittest/_torch/modules/moe/test_moe_scheduler.py

Signed-off-by: Shicheng Li <shicli@nvidia.com> (cherry picked from commit 1264ed2) Signed-off-by: lishicheng1996-nv <shicli@nvidia.com>

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> (cherry picked from commit 435a0a2)

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

coderabbitai · 2026-06-25T16:21:08Z

Walkthrough

The PR adds a fused MegaMoE prepare CUDA op (megamoe_prepare) that converts BF16 activations to MX-FP8 in one kernel, wires it into MegaMoEDeepGemm and FusedCommMoEScheduler, adds an is_nccl_window_buffer Torch op, adds configurable SwiGLU clamping (swiglu_limit_scalar) for Blackwell grouped GEMM kernels, extracts mmap pageout helpers into mmap_utils.py, moves DeepGemm PDL init to engine construction, and updates third-party source URLs.
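For readers unfamiliar with block-scaled FP8, the idea behind the BF16 to MX-FP8 conversion can be sketched in plain Python. This is only an illustration under stated assumptions: the block size of 32, the power-of-two scale choice, and the names `power_of_two_scale` and `quantize_blocks` are hypothetical and are not taken from `megamoe_prepare`, whose exact layout and rounding are not described in this walkthrough. The sketch also skips the final rounding of each value to an FP8 bit pattern.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def power_of_two_scale(amax: float) -> float:
    """Smallest power-of-two scale s such that amax / s <= FP8_E4M3_MAX."""
    if amax == 0.0:
        return 1.0  # all-zero block: any scale works, so pick 1
    # frexp returns (m, e) with value == m * 2**e and 0.5 <= m < 1.
    mantissa, exponent = math.frexp(amax / FP8_E4M3_MAX)
    # An exact power of two has mantissa 0.5, so its ceil(log2) is e - 1.
    return math.ldexp(1.0, exponent - 1 if mantissa == 0.5 else exponent)


def quantize_blocks(row, block_size=32):
    """Return (scaled_values, scales) with one shared scale per block.

    block_size=32 is an assumption made for this sketch only.
    """
    if len(row) % block_size != 0:
        raise ValueError("row length must be a multiple of block_size")
    scaled, scales = [], []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        scale = power_of_two_scale(max(abs(v) for v in block))
        scales.append(scale)
        # After scaling, every value fits in the FP8 E4M3 range.
        scaled.extend(v / scale for v in block)
    return scaled, scales
```

Dividing by a power of two only changes the exponent of each value, which is why a shared exponent-only scale per block loses no mantissa bits before the final FP8 rounding step.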

Changes

MoE runtime and kernel plumbing

Layer / File(s)	Summary
MegaMoE prepare op `cpp/tensorrt_llm/kernels/megaMoePrepareKernel.h`, `cpp/tensorrt_llm/kernels/megaMoePrepareKernel.cu`, `cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp`, `cpp/tensorrt_llm/thop/CMakeLists.txt`, `tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py`, `tests/unittest/_torch/thop/parallel/test_fp8_quantize.py`	Adds the `MegaMoePrepareExpertType`/`ScaleType` enums, `invokeMegaMoePrepare` CUDA kernel (SM100+ only), `megaMoePrepare` Torch op with full validation, fake stub, and tests for correctness, dtype coverage, and zero tokens.
MegaMoE fused prepare backend `tensorrt_llm/_torch/modules/fused_moe/mega_moe/mega_moe_deepgemm.py`, `tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py`, `tensorrt_llm/_torch/modules/fused_moe/ops/moe_op_cutlass.py`, `tensorrt_llm/_torch/modules/fused_moe/quantization.py`, `tests/unittest/_torch/modules/moe/test_moe_scheduler.py`, `tests/unittest/_torch/modules/moe/test_moe_backend.py`	`MegaMoEDeepGemm` moves SymmBuffer allocation to `post_load_weights`, adds `supports_fused_prepare`, and `run_moe` calls `megamoe_prepare` when `x_sf` is absent; `FusedCommMoEScheduler` skips quantization for fused-prepare backends; five new scheduler unit tests.
NCCL window-buffer routing `cpp/tensorrt_llm/thop/allreduceOp.cpp`, `tensorrt_llm/_torch/custom_ops/torch_custom_ops.py`	Adds `is_nccl_window_buffer` C++ impl and Torch op; `AllReduceRunner` includes the flag in its cache key; `tunable_allreduce` gates the NCCL window pre-hook on the predicate result.
MoE tactic replacement `cpp/tensorrt_llm/thop/moeOp.cpp`	`FusedMoeRunner` constructor sanitizes `NO_SMEM` GEMM tactics by replacing them with the first supported profile when W4 group-scaling or fused finalize is active.
Blackwell SwiGLU clamp kernel and op wiring `tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py`, `tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py`, `tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_act_fusion.py`, `tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py`, `tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_swiglu_fusion.py`, `tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_act_fusion.py`, `tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py`	Adds `fclip_xorsign` PTX op, `SWIGLU_LIMIT_SCALAR_DISABLED` sentinel, and threads `swiglu_limit`/`swiglu_limit_scalar` through both Blackwell kernel constructors, runner cache keys, and custom op signatures; reference scripts and unit tests updated to match.
MoE SwiGLU backend wiring `tensorrt_llm/_torch/modules/fused_moe/create_moe.py`, `tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py`, `tensorrt_llm/_torch/modules/fused_moe/quantization.py`	`CuteDslFusedMoE` accepts and forwards `swiglu_limit_scalar`; `create_moe` validates its usage per backend; weight normalization of `swiglu_limit` moves earlier in `process_weights_after_loading`.
Memory pageout and weight release `tensorrt_llm/_torch/mmap_utils.py`, `tensorrt_llm/_torch/models/checkpoints/base_weight_loader.py`, `tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py`, `tensorrt_llm/_torch/modules/fused_moe/quantization.py`, `tensorrt_llm/_torch/pyexecutor/model_loader.py`	Extracts `madvise_range`, `pageout_file_backed_regions`, and `advise_tensor_pageout` into `mmap_utils.py`; `mark_consumed` calls `advise_tensor_pageout` per CPU tensor; `model_loader` calls `torch.cuda.empty_cache()` once after finalization.
DeepGemm PDL configuration `tensorrt_llm/_torch/custom_ops/torch_custom_ops.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py`	Removes top-level PDL init from `torch_custom_ops.py` and adds a one-time `_configure_deep_gemm_pdl()` helper called from `PyTorchModelEngine.__init__`.

Third-party source updates

Layer / File(s)	Summary
Fetch content entries `3rdparty/fetch_content.json`	`deep_ep_download` loses its `patch_file` override; `eigen` switches from GitLab to GitHub URL.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#12884: Modifies the same CuteDSL Blackwell activation-fusion MoE path in cute_dsl_custom_ops.py and related runners, adding an is_gated SwiGLU/Relu2 switch where this PR threads swiglu_limit_scalar.
NVIDIA/TensorRT-LLM#13908: Modifies the same moe_scheduler.py and mega_moe_deepgemm.py files, changing how FusedCommMoEScheduler/MegaMoEDeepGemm prepare inputs for MegaMoE execution.
NVIDIA/TensorRT-LLM#15559: Affects NCCLWindowAllocator directly, which is queried by the new is_nccl_window_buffer op added in this PR.

Suggested reviewers

xxi-nv
syuoni
hyukn
leslie-fang25
yuxianq

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 32.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title follows the required ticket/type format and clearly summarizes the DeepGEMM/MegaMoE follow-up scope.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description follows the template and includes a clear summary, test coverage, and checklist confirmation.

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (6)

tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (1)
621-621: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add no-clamp coverage for the new optional parameter.

Coverage is insufficient for the default/disabled path: these updated tests exercise swiglu_limit=1.0, but not float("inf") or the omitted-argument default for the SwiGLU op paths. Please parametrize the SwiGLU cases in tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py over at least [float("inf"), 1.0].

As per path instructions, tests/**: “Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM.”

Also applies to: 763-763
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py` at line 621,
Coverage for the new SwiGLU optional parameter is missing the no-clamp/default
path, so update the SwiGLU test cases in test_cute_dsl_moe to parameterize
swiglu_limit over both float("inf") and 1.0, and make sure the relevant op-path
assertions run with the omitted-argument/default behavior as well. Use the
existing SwiGLU test helpers and parameterized cases in
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py to cover both
enabled-clamp and disabled-clamp scenarios.
Source: Path instructions
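
To make the requested coverage concrete, here is a minimal reference sketch of a clamped SwiGLU together with the two limits the comment asks for. The function name `swiglu_ref`, the sample inputs, and the choice to clamp both the gate and the up projection symmetrically are assumptions for illustration; the review does not state where the kernels apply the clamp. In the real test file the loop over `LIMITS` would more naturally be a `pytest.mark.parametrize` decorator.

```python
import math


def swiglu_ref(gate: float, up: float, limit: float = math.inf) -> float:
    """Scalar SwiGLU reference: silu(gate) * up, with optional clamping.

    Assumption: both inputs are clamped to [-limit, limit]. A limit of
    math.inf (the default) leaves the inputs untouched.
    """
    if limit <= 0:
        raise ValueError("limit must be positive; use math.inf to disable")
    g = max(-limit, min(gate, limit))
    u = max(-limit, min(up, limit))
    return g / (1.0 + math.exp(-g)) * u


# Hypothetical sample points: one inside the clamp range, two outside it.
CASES = [(0.5, -0.25), (3.0, 2.0), (-4.0, 5.0)]
LIMITS = [math.inf, 1.0]


def run_cases() -> dict:
    """Evaluate every (limit, case) pair, mirroring a parametrized test."""
    return {
        (limit, case): swiglu_ref(case[0], case[1], limit)
        for limit in LIMITS
        for case in CASES
    }
```

The point of including `math.inf` is that it exercises the same code path as the omitted-argument default, so a regression in the disabled-clamp branch cannot hide behind the `1.0` case.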
tensorrt_llm/_torch/mmap_utils.py (2)
102-102: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Annotate advise_tensor_pageout with parameter and return types.

The sibling helpers carry full type hints, but this public function omits the tensor type and the -> None return. As per coding guidelines, always annotate functions with return types (use None if no return). Use a TYPE_CHECKING import to avoid pulling torch at module load.
✏️ Proposed annotations
 import ctypes
 import mmap
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    import torch
-def advise_tensor_pageout(tensor, mode: str = "dontneed"):
+def advise_tensor_pageout(tensor: "torch.Tensor", mode: str = "dontneed") -> None:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/mmap_utils.py` at line 102, Add full type annotations to
the public helper `advise_tensor_pageout`: annotate the `tensor` parameter with
the appropriate Torch tensor type and explicitly declare the return type as
`None`. Because this module should avoid importing `torch` at runtime, use a
`TYPE_CHECKING` guarded import for the tensor type and update the
`advise_tensor_pageout` signature to match the other typed helpers.
Source: Coding guidelines

62-72: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

libc is re-dlopened on every madvise_range call.

advise_tensor_pageout is invoked per-tensor (e.g. the loaded_shared_weights loop in moe_load_balancer.finalize_layer_weights), so a fresh ctypes.CDLL is created on each weight. Cache it lazily so the lib is opened once while preserving the current Linux-only lazy-failure behavior (a module-level CDLL would break import on non-Linux hosts).
♻️ Lazy libc cache
+_LIBC = None
+
+
+def _get_libc() -> "ctypes.CDLL":
+    global _LIBC
+    if _LIBC is None:
+        _LIBC = ctypes.CDLL("libc.so.6", use_errno=True)
+    return _LIBC
+
+
 def madvise_range(addr: int, size: int, mode: str = "dontneed") -> None:
     if size <= 0:
         return
     try:
         advice = _MADV_ADVICE_BY_MODE[mode]
     except KeyError:
         raise ValueError("mode must be 'pageout' or 'dontneed'.")
-    libc = ctypes.CDLL("libc.so.6", use_errno=True)
-    ret = libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(size), ctypes.c_int(advice))
+    ret = _get_libc().madvise(ctypes.c_void_p(addr), ctypes.c_size_t(size), ctypes.c_int(advice))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/mmap_utils.py` around lines 62 - 72, The `madvise_range`
helper in `tensorrt_llm/_torch/mmap_utils.py` re-opens `libc.so.6` on every
call, which is expensive in per-tensor paths like `advise_tensor_pageout` and
`moe_load_balancer.finalize_layer_weights`. Update the function to use a lazily
initialized module-level cache for the `ctypes.CDLL` handle so `libc` is loaded
once and reused across calls, while keeping the current Linux-only lazy failure
behavior intact. Preserve the existing `madvise_range` and
`_MADV_ADVICE_BY_MODE` flow and avoid moving the `CDLL` creation to import time.
tensorrt_llm/_torch/models/checkpoints/base_weight_loader.py (1)
99-114: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Best-effort page-out should not be able to skip key deletion.

The whole page-out block (imports, torch.cuda.synchronize(), the loop) lives in one try guarded only by except ImportError. A non-ImportError from torch.cuda.synchronize() (e.g. RuntimeError when CUDA is unavailable/uninitialized) propagates out of mark_consumed, skipping the del self._weights[key] loop at Lines 115-116 — so weights are neither paged out nor freed and the caller sees an exception. As per coding guidelines, keep the try body minimal and move the logic into an else block, and ensure the optimization can't break the core deletion.
♻️ Restructure so deletion always runs
             if keys_to_delete:
                 try:
                     import torch
-
                     from tensorrt_llm._torch.mmap_utils import \
                         advise_tensor_pageout
-                    torch.cuda.synchronize()
-                    for k in keys_to_delete:
-                        t = self._weights[k]
-                        if isinstance(t,
-                                      torch.Tensor) and t.device.type == "cpu":
-                            try:
-                                advise_tensor_pageout(t, mode="dontneed")
-                            except (OSError, ValueError):
-                                pass
                 except ImportError:
                     pass
+                else:
+                    try:
+                        torch.cuda.synchronize()
+                        for k in keys_to_delete:
+                            t = self._weights[k]
+                            if isinstance(
+                                    t, torch.Tensor) and t.device.type == "cpu":
+                                try:
+                                    advise_tensor_pageout(t, mode="dontneed")
+                                except (OSError, ValueError):
+                                    pass
+                    except RuntimeError:
+                        # CUDA unavailable: skip best-effort pageout only.
+                        pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/models/checkpoints/base_weight_loader.py` around lines 99
- 114, The page-out optimization in mark_consumed currently wraps imports,
torch.cuda.synchronize(), and the tensor loop in one try, so a non-ImportError
can abort before key deletion runs. Keep the try block limited to the torch
import path in base_weight_loader.py, move the synchronize/page-out loop into
the corresponding else block (or otherwise isolate it), and make sure the final
del self._weights[key] cleanup always executes even if page-out fails.
Source: Coding guidelines
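
The invariant behind this comment, that a best-effort optimization must never prevent cleanup, can be sketched without torch. The function below is a simplified, hypothetical stand-in for `mark_consumed`: `weights` is a plain dict, `pageout` is any callable, and the set of swallowed exception types is an assumption.

```python
from typing import Callable, Dict, Iterable


def mark_consumed(
    weights: Dict[str, object],
    keys: Iterable[str],
    pageout: Callable[[object], None],
) -> None:
    """Delete consumed keys, attempting a best-effort pageout first."""
    present = [k for k in keys if k in weights]
    try:
        for key in present:
            try:
                pageout(weights[key])
            except (OSError, ValueError, RuntimeError):
                # Best-effort only: skip this tensor and keep going.
                continue
    finally:
        # Core cleanup runs even if pageout raised something unexpected.
        for key in present:
            del weights[key]
```

With `try`/`finally`, an unexpected exception still reaches the caller, but only after the keys are gone, so the weights are never left both resident and undeleted.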
cpp/tensorrt_llm/kernels/megaMoePrepareKernel.h (1)
28-44: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Document the new public header API with Doxygen.

This adds two public enums and a public kernel entry point, but the header does not describe the contract yet. Please add //! docs here, including the expected tensor shapes/dtypes for the tensor-like arguments, so callers do not need to reverse-engineer the .cu/.cpp files.

As per coding guidelines, **/*.{h,hpp,hxx}: Follow Doxygen rules for documenting class interfaces and function prototypes; use //! for C++ style comments, and public Tensor-like arguments should document expected dimensions and allowed dtypes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/kernels/megaMoePrepareKernel.h` around lines 28 - 44, The
public API in MegaMoePrepareExpertType, MegaMoePrepareScaleType, and
invokeMegaMoePrepare lacks the required Doxygen contract. Add //! documentation
for each enum and the kernel entry point, and describe the expected tensor-like
argument shapes/dtypes for input, tokenSelectedExperts, tokenFinalScales, xOut,
xSfOut, topkIdxOut, and topkWeightsOut so users can call invokeMegaMoePrepare
without inspecting implementation details. Use the existing symbols in
megaMoePrepareKernel.h and follow the C++ header documentation style required
for public interfaces.
Source: Coding guidelines
tests/unittest/_torch/modules/moe/test_moe_scheduler.py (1)
126-237: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Coverage is adequate for the changed scheduler logic; one optional follow-up. The four tests exercise routing separation, TRTLLMGen kwarg gating, and both fused-prepare branches, and the data_ptr() assertions correctly validate the single-chunk no-copy path. The multi-chunk concatenation path (len(outputs) > 1 → torch.cat) isn't directly asserted — consider adding a case with moe_max_num_tokens small enough to force num_chunks > 1 so the concat path is covered. Also confirm this new file is discovered by the relevant CI unit-test list.

As per path instructions for tests/**: assess whether coverage is sufficient and suggest concrete follow-up.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/moe/test_moe_scheduler.py` around lines 126 -
237, Add a follow-up test for `FusedCommMoEScheduler.forward`/`_forward_chunk`
that forces `num_chunks > 1` so the `torch.cat` multi-chunk path is executed and
asserted, since current coverage only checks the single-chunk no-copy path.
Reuse the existing `_FakeMoe`, `_RecordingFusedBackend`, and
`FusedCommMoEScheduler` helpers, but choose inputs or `moe_max_num_tokens` so
multiple outputs are produced and concatenated. Also verify the new test module
is picked up by the relevant CI/unit-test discovery for `tests/**`.
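
The shape of the suggested multi-chunk test can be sketched with plain lists standing in for tensors. Everything here is hypothetical: `forward_chunked` mimics only the chunk-and-concatenate structure described above, and `RecordingBackend` is a toy analogue of the recording helper named in the comment, not the scheduler's real API.

```python
from typing import Callable, List, Sequence


def forward_chunked(
    tokens: Sequence[int],
    max_num_tokens: int,
    run_chunk: Callable[[Sequence[int]], List[int]],
) -> List[int]:
    """Run the backend per chunk and concatenate only when needed."""
    if max_num_tokens <= 0:
        raise ValueError("max_num_tokens must be positive")
    outputs = [
        run_chunk(tokens[start:start + max_num_tokens])
        for start in range(0, len(tokens), max_num_tokens)
    ]
    if len(outputs) == 1:
        return outputs[0]  # single chunk: hand back the backend output as-is
    merged: List[int] = []
    for out in outputs:  # list analogue of torch.cat over chunk outputs
        merged.extend(out)
    return merged


class RecordingBackend:
    """Toy backend that doubles each token and records chunk sizes."""

    def __init__(self) -> None:
        self.chunk_sizes: List[int] = []
        self.last_output: List[int] = []

    def __call__(self, chunk: Sequence[int]) -> List[int]:
        self.chunk_sizes.append(len(chunk))
        self.last_output = [2 * t for t in chunk]
        return self.last_output
```

Choosing a `max_num_tokens` smaller than the input length forces several backend calls, and an identity check on the single-chunk result plays the role of the existing `data_ptr()` assertions.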

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp`:
- Around line 72-78: Add explicit device guarding and same-device validation in
megaMoePrepareOp before launching the kernel. CHECK_TH_CUDA only confirms CUDA
tensors, so update the launch path around megaMoePrepareOp and its use of
at::cuda::getCurrentCUDAStream(input.get_device()) by introducing a
c10::cuda::CUDAGuard for input.device() and verifying tokenSelectedExperts,
tokenFinalScales, xOut, xSfOut, topkIdxOut, and topkWeightsOut all match
input.device(). Reject mixed-device inputs/outputs before the kernel launch to
prevent running on the wrong GPU or passing cross-device pointers.

In
`@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_act_fusion.py`:
- Around line 405-406: The swiglu_limit handling in
blockscaled_contiguous_gather_grouped_gemm_act_fusion should normalize negative
values before setting the clamp flag, because fclip_xorsign only works as
symmetric clipping for positive limits. Update the initialization around
swiglu_limit and has_swiglu_limit so negative sentinel values are treated like
disabling the limit, consistent with the custom-op sentinel behavior, and only
enable clipping when the effective limit is positive.

In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py`:
- Around line 346-350: The swiglu_limit_scalar allowlist in create_moe.py is
missing CuteDslB12xFusedMoE, which now receives that argument through the shared
CuteDSL constructor path. Update the moe_cls validation list in the
swiglu_limit_scalar check to include CuteDslB12xFusedMoE alongside
CuteDslFusedMoE and the other supported classes so B12x requests with scalar
clamp pass validation and reach the backend.

In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py`:
- Around line 66-70: Canonicalize negative swiglu_limit_scalar values in
CuteDslFusedMoE so they behave like disabled clamping instead of producing an
invalid clamp range in swiglu_fused_moe. Update the code path that stores/passes
this field in CuteDslFusedMoE to normalize any negative value to float("inf")
before it reaches swiglu_fused_moe, keeping the behavior aligned with the
custom-op contract.

In `@tests/unittest/_torch/thop/parallel/test_fp8_quantize.py`:
- Around line 237-301: Add negative coverage in test_megamoe_prepare_* and
_run_megamoe_prepare so the public megaMoePrepare op’s validation paths are
exercised, not just the happy path. Specifically, add cases in
tests/unittest/_torch/thop/parallel/test_fp8_quantize.py for unsupported
input/output dtypes, mismatched [num_tokens, top_k] shapes, hidden_size not
divisible by 128, and undersized x_out/x_sf_out buffers. Use the existing test
helpers and torch.ops.trtllm.megamoe_prepare entrypoint to keep the checks
aligned with cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp.
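
The two limit-related comments above ask for the same normalization, which can be sketched as a small helper. The names `normalize_swiglu_limit` and `has_swiglu_limit` are hypothetical, and treating `None` and zero as "disabled" is an assumption; the comments themselves only say that negative values should disable the clamp and that clipping should be enabled only for a positive effective limit.

```python
import math
from typing import Optional


def normalize_swiglu_limit(limit: Optional[float]) -> float:
    """Map every 'disabled' spelling of the limit to math.inf.

    Assumption: None, zero, and negative values all mean 'no clamp'.
    """
    if limit is None or limit <= 0:
        return math.inf
    return float(limit)


def has_swiglu_limit(limit: Optional[float]) -> bool:
    """True only when the effective limit is positive and finite."""
    return math.isfinite(normalize_swiglu_limit(limit))
```

Normalizing once, before the value is stored or forwarded, means the kernels only ever see a positive limit or infinity and never an inverted clamp range such as [1, -1].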

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cf04400e-17c8-4b74-b996-121cfebaf1f9

📥 Commits

Reviewing files that changed from the base of the PR and between edb14ee and bba0cb0.

📒 Files selected for processing (30)

3rdparty/fetch_content.json
cpp/tensorrt_llm/kernels/megaMoePrepareKernel.cu
cpp/tensorrt_llm/kernels/megaMoePrepareKernel.h
cpp/tensorrt_llm/thop/CMakeLists.txt
cpp/tensorrt_llm/thop/allreduceOp.cpp
cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp
cpp/tensorrt_llm/thop/moeOp.cpp
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_act_fusion.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py
tensorrt_llm/_torch/mmap_utils.py
tensorrt_llm/_torch/models/checkpoints/base_weight_loader.py
tensorrt_llm/_torch/modules/fused_moe/create_moe.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
tensorrt_llm/_torch/modules/fused_moe/mega_moe/mega_moe_deepgemm.py
tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py
tensorrt_llm/_torch/modules/fused_moe/ops/moe_op_cutlass.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/pyexecutor/model_loader.py
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_act_fusion.py
tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_swiglu_fusion.py
tests/unittest/_torch/modules/moe/test_moe_backend.py
tests/unittest/_torch/modules/moe/test_moe_scheduler.py
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
tests/unittest/_torch/thop/parallel/test_fp8_quantize.py

coderabbitai · 2026-06-25T16:43:00Z

+    CHECK_TH_CUDA(input);
+    CHECK_TH_CUDA(tokenSelectedExperts);
+    CHECK_TH_CUDA(tokenFinalScales);
+    CHECK_TH_CUDA(xOut);
+    CHECK_TH_CUDA(xSfOut);
+    CHECK_TH_CUDA(topkIdxOut);
+    CHECK_TH_CUDA(topkWeightsOut);


🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
printf '\n== File outline ==\n'
ast-grep outline cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp --view expanded || true
printf '\n== Relevant lines ==\n'
nl -ba cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp | sed -n '1,220p'
printf '\n== Search for similar CUDA device handling ==\n'
rg -n "CUDAGuard|CHECK_TH_CUDA|getCurrentCUDAStream|device_guard|deviceGuard|device" cpp/tensorrt_llm/thop -g '*.cpp' -g '*.h'

Repository: NVIDIA/TensorRT-LLM

Length of output: 296

🏁 Script executed:

#!/bin/bash
set -euo pipefail
nl -ba cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp | sed -n '1,220p'
rg -n "CUDAGuard|CHECK_TH_CUDA|getCurrentCUDAStream|device_guard|deviceGuard|device" cpp/tensorrt_llm/thop -g '*.cpp' -g '*.h'

Repository: NVIDIA/TensorRT-LLM

Length of output: 197

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "hello"

Repository: NVIDIA/TensorRT-LLM

Length of output: 163

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== tracked paths matching megaMoePrepare ==\n'
git ls-files | rg 'megaMoePrepare|MegaMoePrepare|megaMoe|MegaMoe' || true

printf '\n== tracked paths under cpp/tensorrt_llm/thop ==\n'
git ls-files cpp/tensorrt_llm/thop || true

printf '\n== candidate file contents (if present) ==\n'
for f in $(git ls-files | rg 'megaMoePrepare|MegaMoePrepare|megaMoe|MegaMoe' || true); do
  printf '\n--- %s ---\n' "$f"
  sed -n '1,220p' "$f"
done

Repository: NVIDIA/TensorRT-LLM

Length of output: 19222

Guard the launch device and require same-device tensors.

CHECK_TH_CUDA only proves CUDA placement; it does not ensure every tensor matches input.device(). Add a c10::cuda::CUDAGuard before at::cuda::getCurrentCUDAStream(input.get_device()) and reject mixed-device inputs/outputs, or this can launch on the wrong device and pass cross-device pointers.

Location: cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp:72-130

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp` around lines 72 - 78, Add explicit device guarding and same-device validation in megaMoePrepareOp before launching the kernel. CHECK_TH_CUDA only confirms CUDA tensors, so update the launch path around megaMoePrepareOp and its use of at::cuda::getCurrentCUDAStream(input.get_device()) by introducing a c10::cuda::CUDAGuard for input.device() and verifying tokenSelectedExperts, tokenFinalScales, xOut, xSfOut, topkIdxOut, and topkWeightsOut all match input.device(). Reject mixed-device inputs/outputs before the kernel launch to prevent running on the wrong GPU or passing cross-device pointers.
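
To illustrate the same-device rule requested above, here is a minimal Python sketch of the validation logic. It is not the op's C++ implementation: `require_same_device` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only assume a `.device` attribute, as `torch.Tensor` exposes.

```python
from types import SimpleNamespace


def require_same_device(reference, **named_tensors):
    """Raise ValueError if any tensor is on a different device than `reference`.

    Hypothetical helper: it only assumes each argument has a `.device`
    attribute that supports equality comparison.
    """
    mismatched = {
        name: tensor.device
        for name, tensor in named_tensors.items()
        if tensor.device != reference.device
    }
    if mismatched:
        details = ", ".join(
            f"{name} on {device}" for name, device in sorted(mismatched.items()))
        raise ValueError(
            f"expected every tensor on {reference.device}, but found: {details}")


# Stand-ins for tensors so the sketch runs without a GPU.
inp = SimpleNamespace(device="cuda:0")
x_out = SimpleNamespace(device="cuda:0")
topk_idx_out = SimpleNamespace(device="cuda:1")

require_same_device(inp, x_out=x_out)  # all on cuda:0, passes silently
try:
    require_same_device(inp, x_out=x_out, topk_idx_out=topk_idx_out)
except ValueError as err:
    print(err)
```

In the real op the equivalent check would presumably run before the stream lookup and kernel launch, so that a mixed-device call fails early instead of dereferencing a pointer from another GPU.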

coderabbitai · 2026-06-25T16:43:00Z

+        self.swiglu_limit = swiglu_limit
+        self.has_swiglu_limit = swiglu_limit != float("inf")


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Normalize negative swiglu_limit before enabling clamp.

fclip_xorsign only behaves like symmetric clipping when limit > 0; storing a negative value here enables clamp and changes the sign semantics instead of disabling it. Mirror the custom-op sentinel handling before deriving has_swiglu_limit.

Proposed fix

-        self.swiglu_limit = swiglu_limit
+        if swiglu_limit < 0:
+            swiglu_limit = float("inf")
+        self.swiglu_limit = swiglu_limit
         self.has_swiglu_limit = swiglu_limit != float("inf")


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_act_fusion.py` around lines 405 - 406, The swiglu_limit handling in blockscaled_contiguous_gather_grouped_gemm_act_fusion should normalize negative values before setting the clamp flag, because fclip_xorsign only works as symmetric clipping for positive limits. Update the initialization around swiglu_limit and has_swiglu_limit so negative sentinel values are treated like disabling the limit, consistent with the custom-op sentinel behavior, and only enable clipping when the effective limit is positive.
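
The normalization proposed above can be sketched in isolation. This is a minimal illustration, assuming (as the review states) that a negative limit is a "clamp disabled" sentinel; `normalize_swiglu_limit` is a hypothetical name, not a function in the PR.

```python
import math


def normalize_swiglu_limit(swiglu_limit):
    """Return (effective_limit, has_limit), mapping negatives to 'disabled'.

    Assumption: a negative value means 'no clamp', so it is replaced by
    +inf before the has_limit flag is derived.
    """
    if swiglu_limit < 0:
        swiglu_limit = math.inf
    return swiglu_limit, swiglu_limit != math.inf


for value in (-1.0, math.inf, 7.0):
    print(value, normalize_swiglu_limit(value))
```

Note that this follows the proposed fix literally: a limit of exactly 0 still counts as an active clamp, so a stricter "positive only" rule would need an additional check.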

coderabbitai · 2026-06-25T16:43:00Z

    if swiglu_limit_scalar is not None:
        assert moe_cls in [
            CutlassFusedMoE, TRTLLMGenFusedMoE, WideEPMoE, DeepGemmFusedMoE,
-            MegaMoEDeepGemm
+            MegaMoEDeepGemm, CuteDslFusedMoE
        ], f"swiglu_limit_scalar is not supported in {moe_cls.__name__}."


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Add CuteDslB12xFusedMoE to the scalar clamp allowlist.

The allowlist here omits CuteDslB12xFusedMoE, but Line 447 now forwards swiglu_limit_scalar through the shared CuteDSL constructor path. Any B12x backend request with a scalar clamp will fail validation before it reaches the backend.

Suggested fix

     if swiglu_limit_scalar is not None:
         assert moe_cls in [
             CutlassFusedMoE, TRTLLMGenFusedMoE, WideEPMoE, DeepGemmFusedMoE,
-            MegaMoEDeepGemm, CuteDslFusedMoE
+            MegaMoEDeepGemm, CuteDslFusedMoE, CuteDslB12xFusedMoE
         ], f"swiglu_limit_scalar is not supported in {moe_cls.__name__}."

Also applies to: 447-447

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py` around lines 346 - 350, The swiglu_limit_scalar allowlist in create_moe.py is missing CuteDslB12xFusedMoE, which now receives that argument through the shared CuteDSL constructor path. Update the moe_cls validation list in the swiglu_limit_scalar check to include CuteDslB12xFusedMoE alongside CuteDslFusedMoE and the other supported classes so B12x requests with scalar clamp pass validation and reach the backend.
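
A small sketch of why the allowlist needs an explicit entry: `moe_cls in [...]` compares the class object itself, so a class that is not listed is rejected even if it inherits from a listed one. The classes below are empty stand-ins, and the subclass relationship between them is an assumption made only for illustration.

```python
class CuteDslFusedMoE:
    """Stand-in for the real backend class."""


class CuteDslB12xFusedMoE(CuteDslFusedMoE):
    """Stand-in; inheriting from CuteDslFusedMoE is assumed here."""


def supports_scalar_clamp(moe_cls, allowlist):
    """Mirror the membership test used by the assertion in the diff."""
    return moe_cls in allowlist


before = [CuteDslFusedMoE]
after = [CuteDslFusedMoE, CuteDslB12xFusedMoE]

print(supports_scalar_clamp(CuteDslB12xFusedMoE, before))  # not listed
print(supports_scalar_clamp(CuteDslB12xFusedMoE, after))   # listed explicitly
print(issubclass(CuteDslB12xFusedMoE, CuteDslFusedMoE))    # inheritance is not membership
```

Listing the class explicitly, as the suggested fix does, keeps the check strict; switching to `issubclass` would be a broader design change than the review asks for.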

coderabbitai · 2026-06-25T16:43:00Z

+def swiglu_fused_moe(x, swiglu_limit_scalar: float = float("inf")):
    x, gate = x.chunk(2, dim=-1)
+    if swiglu_limit_scalar != float("inf"):
+        gate = gate.clamp(max=swiglu_limit_scalar)
+        x = x.clamp(min=-swiglu_limit_scalar, max=swiglu_limit_scalar)


🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== File list =="
git ls-files tensorrt_llm/_torch/modules/fused_moe | sed -n '1,120p'
echo
echo "== Outline: fused_moe_cute_dsl.py =="
ast-grep outline tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py --view expanded || true
echo
echo "== Relevant references =="
rg -n "swiglu_limit_scalar|swiglu_fused_moe|canonicalize" tensorrt_llm/_torch/modules/fused_moe -S

Repository: NVIDIA/TensorRT-LLM

Length of output: 10284

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== fused_moe_cute_dsl.py relevant sections =="
sed -n '430,470p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
echo
sed -n '760,780p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
echo
echo "== interface.py relevant section =="
sed -n '270,320p' tensorrt_llm/_torch/modules/fused_moe/interface.py
echo
echo "== create_moe.py relevant section =="
sed -n '330,360p' tensorrt_llm/_torch/modules/fused_moe/create_moe.py
echo
echo "== trtllm_gen relevant section =="
sed -n '460,490p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 9216

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Search for negative/sentinel handling =="
rg -n "swiglu_limit_scalar.*< 0|< 0.*swiglu_limit_scalar|float\(\"inf\"\)|disabled sentinel|canonicalize" tensorrt_llm/_torch/modules/fused_moe tensorrt_llm -S
echo
echo "== swiglu_fused_moe consumer / runner lines =="
sed -n '720,780p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
echo
echo "== Any tests mentioning swiglu_limit_scalar =="
rg -n "swiglu_limit_scalar" tests tensorrt_llm -S

Repository: NVIDIA/TensorRT-LLM

Length of output: 24654

🏁 Script executed:

python3 - <<'PY'
from math import inf

def current_behavior(swiglu_limit_scalar):
    # mirrors tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
    if swiglu_limit_scalar != inf:
        gate_clamp = swiglu_limit_scalar
        x_min = -swiglu_limit_scalar
        x_max = swiglu_limit_scalar
        return gate_clamp, x_min, x_max
    return None

for v in [None, inf, -1.0, 0.0, 3.5]:
    if v is None:
        v = inf
    print(v, current_behavior(v))
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 240

Canonicalize negative swiglu_limit_scalar in CuteDslFusedMoE.

swiglu_fused_moe() treats any non-inf value as an active clamp, so -1.0 becomes an inverted clamp range instead of disabling the path. Normalize negative values to inf when storing the module field, matching the custom-op contract.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py` around lines 66 - 70, Canonicalize negative swiglu_limit_scalar values in CuteDslFusedMoE so they behave like disabled clamping instead of producing an invalid clamp range in swiglu_fused_moe. Update the code path that stores/passes this field in CuteDslFusedMoE to normalize any negative value to float("inf") before it reaches swiglu_fused_moe, keeping the behavior aligned with the custom-op contract.
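
The inverted clamp range described above can be reproduced on plain floats. This is a sketch that mirrors only the clamping lines from the diff; the scalar `clamp` helper is an assumption that follows the documented `torch.clamp` behavior of returning `max` when `min` is greater than `max`.

```python
import math


def clamp(value, lo=-math.inf, hi=math.inf):
    """Scalar clamp; when lo > hi the result is always hi."""
    return min(max(value, lo), hi)


def clamp_swiglu_inputs(x, gate, limit=math.inf):
    """Apply the same clamping step as the diff, on single floats."""
    if limit != math.inf:
        gate = clamp(gate, hi=limit)
        x = clamp(x, lo=-limit, hi=limit)
    return x, gate


print(clamp_swiglu_inputs(0.5, 0.2))              # limit disabled
print(clamp_swiglu_inputs(8.0, 0.2, limit=3.5))   # x clipped to 3.5
print(clamp_swiglu_inputs(0.5, 0.2, limit=-1.0))  # inverted range
```

With `limit=-1.0` the range becomes `[1.0, -1.0]`, so both values collapse to `-1.0` regardless of the input, which is why the review asks for negative values to be mapped to `inf` before they reach this function.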

coderabbitai · 2026-06-25T16:43:00Z

+@pytest.mark.parametrize("m", [1, 7, 128])
+@pytest.mark.parametrize("k", [128, 512])
+@pytest.mark.parametrize("top_k", [1, 6])
+@skip_pre_blackwell_unittest
+def test_megamoe_prepare_matches_mxfp8_quantize(m, k, top_k):
+    torch.random.manual_seed(123)
+    hidden_states = (torch.randn([m, k], dtype=torch.float) * 16).to(
+        torch.bfloat16).cuda().contiguous()
+    token_selected_experts = torch.randint(0,
+                                           384, (m, top_k),
+                                           dtype=torch.int32,
+                                           device="cuda")
+    token_final_scales = torch.randn((m, top_k),
+                                     dtype=torch.float32,
+                                     device="cuda")
+
+    _run_megamoe_prepare(hidden_states, token_selected_experts,
+                         token_final_scales)
+
+
+@pytest.mark.parametrize("expert_dtype", [torch.int32, torch.int64])
+@pytest.mark.parametrize("scale_dtype",
+                         [torch.float32, torch.float16, torch.bfloat16])
+@skip_pre_blackwell_unittest
+def test_megamoe_prepare_accepts_supported_topk_dtypes(expert_dtype,
+                                                       scale_dtype):
+    torch.random.manual_seed(123)
+    m, k, top_k = 5, 256, 3
+    hidden_states = (torch.randn([m, k], dtype=torch.float) * 16).to(
+        torch.bfloat16).cuda().contiguous()
+    token_selected_experts = torch.randint(0,
+                                           384, (m, top_k),
+                                           dtype=expert_dtype,
+                                           device="cuda")
+    token_final_scales = torch.randn((m, top_k),
+                                     dtype=scale_dtype,
+                                     device="cuda")
+
+    _run_megamoe_prepare(hidden_states, token_selected_experts,
+                         token_final_scales)
+
+
+@skip_pre_blackwell_unittest
+def test_megamoe_prepare_allows_zero_tokens():
+    k, top_k = 256, 3
+    hidden_states = torch.empty((0, k), dtype=torch.bfloat16, device="cuda")
+    token_selected_experts = torch.empty((0, top_k),
+                                         dtype=torch.int32,
+                                         device="cuda")
+    token_final_scales = torch.empty((0, top_k),
+                                     dtype=torch.float32,
+                                     device="cuda")
+    x_out = torch.empty((1, k), dtype=torch.float8_e4m3fn, device="cuda")
+    x_sf_out = torch.empty((1, k // 128), dtype=torch.int32, device="cuda")
+    topk_idx_out = torch.empty((1, top_k), dtype=torch.int64, device="cuda")
+    topk_weights_out = torch.empty((1, top_k),
+                                   dtype=torch.float32,
+                                   device="cuda")
+
+    torch.ops.trtllm.megamoe_prepare(hidden_states, token_selected_experts,
+                                     token_final_scales, x_out, x_sf_out,
+                                     topk_idx_out, topk_weights_out)
+    torch.cuda.synchronize()
+
+


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Coverage is still insufficient for the public op's validation contract.

These tests cover the happy path, dtype matrix, and zero-token fast path, but cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp also rejects bad dtypes, mismatched shapes, undersized output buffers, and hidden_size % 128 != 0. None of those guards is exercised here, so a regression in the public API contract would ship unnoticed.

Please add negative cases in tests/unittest/_torch/thop/parallel/test_fp8_quantize.py for at least:

- unsupported input / output dtypes,
- mismatched [num_tokens, top_k] shapes,
- hidden_size not divisible by 128,
- undersized x_out / x_sf_out.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/thop/parallel/test_fp8_quantize.py` around lines 237 - 301, add negative coverage next to test_megamoe_prepare_* and _run_megamoe_prepare so that the validation paths of the public megamoe_prepare op are exercised, not just the happy path. Specifically, add cases in tests/unittest/_torch/thop/parallel/test_fp8_quantize.py for unsupported input/output dtypes, mismatched [num_tokens, top_k] shapes, hidden_size not divisible by 128, and undersized x_out/x_sf_out buffers. Use the existing test helpers and the torch.ops.trtllm.megamoe_prepare entrypoint to keep the checks aligned with cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp.
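
One way to organize the requested negative cases is a table of single-field overrides against a valid baseline. The sketch below uses a hypothetical pure-Python `check_prepare_shapes` as a stand-in for the op's guards; its rules are inferred from the tests in this diff and may not match megaMoePrepareOp.cpp exactly. Dtype cases are left out because they need real tensors.

```python
def check_prepare_shapes(num_tokens, hidden_size, top_k, experts_shape,
                         x_out_shape, x_sf_out_shape):
    """Hypothetical mirror of the shape guards named in the review."""
    if hidden_size % 128 != 0:
        raise ValueError("hidden_size must be divisible by 128")
    if tuple(experts_shape) != (num_tokens, top_k):
        raise ValueError("token_selected_experts must be [num_tokens, top_k]")
    if x_out_shape[0] < num_tokens or x_out_shape[1] != hidden_size:
        raise ValueError("x_out is undersized")
    if (x_sf_out_shape[0] < num_tokens
            or x_sf_out_shape[1] != hidden_size // 128):
        raise ValueError("x_sf_out is undersized")


# A valid baseline; each negative case overrides exactly one field.
VALID = dict(num_tokens=4, hidden_size=256, top_k=3, experts_shape=(4, 3),
             x_out_shape=(4, 256), x_sf_out_shape=(4, 2))
NEGATIVE_CASES = {
    "hidden_size_not_multiple_of_128": dict(hidden_size=200),
    "mismatched_topk_shape": dict(experts_shape=(4, 2)),
    "undersized_x_out": dict(x_out_shape=(3, 256)),
    "undersized_x_sf_out": dict(x_sf_out_shape=(3, 2)),
}


def rejected_cases():
    """Return the names of the negative cases that the checker rejects."""
    rejected = []
    for name, override in NEGATIVE_CASES.items():
        try:
            check_prepare_shapes(**{**VALID, **override})
        except ValueError:
            rejected.append(name)
    return rejected


check_prepare_shapes(**VALID)  # the baseline must pass
print(rejected_cases())
```

In the actual test file the same table could presumably drive `pytest.mark.parametrize`, with each case building tensors and expecting the op call to raise.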

lfr-0531 requested review from a team as code owners June 25, 2026 16:13

lfr-0531 requested review from lancelly, liji-nv, mikeiovine and syuoni June 25, 2026 16:13

github-actions Bot assigned lfr-0531 Jun 25, 2026

liji-nv and others added 13 commits June 25, 2026 16:15

[None][chore] Update DeepGEMM dependency (NVIDIA#14743)

091a9d6

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> (cherry picked from commit f172310)

[None][chore] Update DeepGEMM dependency (NVIDIA#14847)

ee97c7e

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit 41a16de)

[None][fix] Update DeepGEMM to fix paged MQA metadata OOB (NVIDIA#14940)

7fc9139

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit 2fbd0e1)

[None][fix] configure DeepGEMM PDL during engine init (NVIDIA#15004)

0715f2e

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> (cherry picked from commit a0d5400)

[None][feat] Pass swiglu limit to Cutedsl moe for DSv4 (NVIDIA#14513)

0ecf9d5

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com> (cherry picked from commit e0539a5) Signed-off-by: Qi Zhang (qizh) <10434017+Tracin@users.noreply.github.com>

[None][fix] Restore DSv4 NVFP4 routed swiglu_limit on TRTLLM-Gen (NVIDIA#14673)

9622e35

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit e97de0d)

[None][perf] DSv4 MegaMoE: free mmap pages and segments during load (NVIDIA#14658)

82c7187

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit 2dd03e6)

[None][fix] Fix host-RAM OOM during MEGAMOE weight load on small-host nodes (NVIDIA#14880)

96c7339

Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com> (cherry picked from commit 524b20d) Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>

[None][fix] MegaMoE: reject TEP topology in __init__ (NVIDIA#14222)

aa64cc6

Signed-off-by: Shicheng Li <shicli@nvidia.com> (cherry picked from commit 1264ed2) Signed-off-by: lishicheng1996-nv <shicli@nvidia.com>

[None][fix] gate NCCL window autotune pre-hook (NVIDIA#14566)

a6b7590

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> (cherry picked from commit 435a0a2)

[TRTLLM-12950][fix] stabilize MegaMoE follow-up paths

bba0cb0

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

lfr-0531 force-pushed the user/fanrongl/dsv4-a7728aa-pr5-deepgemm-megamoe-20260625 branch from 64398b4 to bba0cb0 Compare June 25, 2026 16:17

coderabbitai Bot reviewed Jun 25, 2026

lfr-0531 wants to merge 13 commits into NVIDIA:main from lfr-0531:user/fanrongl/dsv4-a7728aa-pr5-deepgemm-megamoe-20260625

7 participants
