[TRTLLM-12950][perf] DSv4 follow-up: DeepGEMM and MegaMoE#15632

Open
lfr-0531 wants to merge 13 commits into NVIDIA:main from lfr-0531:user/fanrongl/dsv4-a7728aa-pr5-deepgemm-megamoe-20260625

Conversation

@lfr-0531

@lfr-0531 lfr-0531 commented Jun 25, 2026

Collaborator

Summary by CodeRabbit

  • New Features

    • Added a new MoE preparation path that can directly set up selected experts, weights, and quantized activations.
    • Added support for bounded SwiGLU behavior in several fused MoE kernels and Python APIs.
    • Added a helper to detect when all-reduce inputs already use an NCCL window buffer.
  • Bug Fixes

    • Improved MoE scheduling and backend selection for fused-prepare and routing-separation cases.
    • Reduced unnecessary memory retention by releasing CPU-backed weights more aggressively.
    • Refined all-reduce caching behavior for NCCL window-backed inputs.

Description

This PR carves the PR-5 DeepGEMM/MegaMoE follow-up scope from github/feat/deepseek_v4 after a7728aa onto current github/main.

It includes:

  • DeepGEMM dependency and PDL setup.
  • MegaMoE prepare/custom-op registration and scheduler follow-up.
  • MegaMoE loader/OOM and EPLB-related follow-up that can be applied without introducing the full DSv4 model file.
  • CuTeDSL grouped-GEMM swiglu-limit follow-up and related unit coverage.

It intentionally excludes NVFP4 MegaMoE backend/tuning commits aad9e0f008 and f90246b3d8 because equivalent material is already on main, and it excludes the modeling_deepseekv4.py hunks because that file is not present on current main and belongs to the PR-3 DSv4 model/sparse-attention stack.

Rebase/conflict notes are recorded in tmp/dsv4_final_prs/PR5_rebase_conflicts_20260625.md.

Test Coverage

  • Build/install:
    • python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --use_ccache --cuda_architectures "90-real;100-real" --configure_cmake
    • .venv-3.12/bin/pip install --force-reinstall --no-deps build/tensorrt_llm-1.3.0rc20-cp312-cp312-linux_x86_64.whl
    • Verified installed libth_common.so contains the PR-5 C++ tactic replacement string.
  • Static/scope:
    • git diff --check
    • Strict changed-file conflict-marker search.
    • PR-5 forbidden-scope check returned no matches.
    • pre-commit run --files ... passed on touched source/test files.
  • Unit tests:
    • Focused installed-wheel MoE/CUTLASS/MegaMoE regression:
      • 6 passed, 3 warnings
    • PATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=1 timeout 12m .venv-3.12/bin/python -m pytest tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py tests/unittest/_torch/thop/parallel/test_fp8_quantize.py -q --tb=short
      • 934 passed, 9 skipped, 245 warnings

Full tests/unittest/_torch/modules/moe caveat:

  • One full-suite attempt before the final verification fixes exited with CUDA illegal-memory-access cascading errors after the CUDA context was poisoned.
  • A diagnostic rerun with CUDA_LAUNCH_BLOCKING=1, -x, and a 15 minute timeout did not reproduce a first failure before timeout.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

liji-nv and others added 13 commits June 25, 2026 16:15
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
(cherry picked from commit f172310)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
(cherry picked from commit 41a16de)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
(cherry picked from commit 2fbd0e1)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
(cherry picked from commit a0d5400)
Signed-off-by: Yuhan Li <51736452+liyuhannnnn@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: Yuhan Li <51736452+liyuhannnnn@users.noreply.github.com>
(cherry picked from commit 15602a2)
Signed-off-by: Qi Zhang (qizh) <10434017+Tracin@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
(cherry picked from commit e0539a5)
Signed-off-by: Qi Zhang (qizh) <10434017+Tracin@users.noreply.github.com>
…DIA#14673)

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
(cherry picked from commit e97de0d)
…VIDIA#14658)

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
(cherry picked from commit 2dd03e6)
… nodes (NVIDIA#14880)

Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
(cherry picked from commit 524b20d)
Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
(cherry picked from commit d7f9e4a)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Signed-off-by: Mingyang <mingyangh@nvidia.com>

# Conflicts:
#	tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py
#	tests/unittest/_torch/modules/moe/test_moe_scheduler.py
Signed-off-by: Shicheng Li <shicli@nvidia.com>
(cherry picked from commit 1264ed2)
Signed-off-by: lishicheng1996-nv <shicli@nvidia.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
(cherry picked from commit 435a0a2)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
@lfr-0531 lfr-0531 force-pushed the user/fanrongl/dsv4-a7728aa-pr5-deepgemm-megamoe-20260625 branch from 64398b4 to bba0cb0 Compare June 25, 2026 16:17
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The PR adds a fused MegaMoE prepare CUDA op (megamoe_prepare) that converts BF16 activations to MX-FP8 in one kernel, wires it into MegaMoEDeepGemm and FusedCommMoEScheduler, adds an is_nccl_window_buffer Torch op, adds configurable SwiGLU clamping (swiglu_limit_scalar) for Blackwell grouped GEMM kernels, extracts mmap pageout helpers into mmap_utils.py, moves DeepGemm PDL init to engine construction, and updates third-party source URLs.

Changes

MoE runtime and kernel plumbing

| Layer | File(s) | Summary |
| --- | --- | --- |
| MegaMoE prepare op | cpp/tensorrt_llm/kernels/megaMoePrepareKernel.h, cpp/tensorrt_llm/kernels/megaMoePrepareKernel.cu, cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp, cpp/tensorrt_llm/thop/CMakeLists.txt, tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py, tests/unittest/_torch/thop/parallel/test_fp8_quantize.py | Adds the MegaMoePrepareExpertType/ScaleType enums, invokeMegaMoePrepare CUDA kernel (SM100+ only), megaMoePrepare Torch op with full validation, fake stub, and tests for correctness, dtype coverage, and zero tokens. |
| MegaMoE fused prepare backend | tensorrt_llm/_torch/modules/fused_moe/mega_moe/mega_moe_deepgemm.py, tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py, tensorrt_llm/_torch/modules/fused_moe/ops/moe_op_cutlass.py, tensorrt_llm/_torch/modules/fused_moe/quantization.py, tests/unittest/_torch/modules/moe/test_moe_scheduler.py, tests/unittest/_torch/modules/moe/test_moe_backend.py | MegaMoEDeepGemm moves SymmBuffer allocation to post_load_weights, adds supports_fused_prepare, and run_moe calls megamoe_prepare when x_sf is absent; FusedCommMoEScheduler skips quantization for fused-prepare backends; five new scheduler unit tests. |
| NCCL window-buffer routing | cpp/tensorrt_llm/thop/allreduceOp.cpp, tensorrt_llm/_torch/custom_ops/torch_custom_ops.py | Adds is_nccl_window_buffer C++ impl and Torch op; AllReduceRunner includes the flag in its cache key; tunable_allreduce gates the NCCL window pre-hook on the predicate result. |
| MoE tactic replacement | cpp/tensorrt_llm/thop/moeOp.cpp | FusedMoeRunner constructor sanitizes NO_SMEM GEMM tactics by replacing them with the first supported profile when W4 group-scaling or fused finalize is active. |
| Blackwell SwiGLU clamp kernel and op wiring | tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py, tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py, tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_act_fusion.py, tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py, tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_swiglu_fusion.py, tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_act_fusion.py, tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py | Adds fclip_xorsign PTX op, SWIGLU_LIMIT_SCALAR_DISABLED sentinel, and threads swiglu_limit/swiglu_limit_scalar through both Blackwell kernel constructors, runner cache keys, and custom op signatures; reference scripts and unit tests updated to match. |
| MoE SwiGLU backend wiring | tensorrt_llm/_torch/modules/fused_moe/create_moe.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py, tensorrt_llm/_torch/modules/fused_moe/quantization.py | CuteDslFusedMoE accepts and forwards swiglu_limit_scalar; create_moe validates its usage per backend; weight normalization of swiglu_limit moves earlier in process_weights_after_loading. |
| Memory pageout and weight release | tensorrt_llm/_torch/mmap_utils.py, tensorrt_llm/_torch/models/checkpoints/base_weight_loader.py, tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py, tensorrt_llm/_torch/modules/fused_moe/quantization.py, tensorrt_llm/_torch/pyexecutor/model_loader.py | Extracts madvise_range, pageout_file_backed_regions, and advise_tensor_pageout into mmap_utils.py; mark_consumed calls advise_tensor_pageout per CPU tensor; model_loader calls torch.cuda.empty_cache() once after finalization. |
| DeepGemm PDL configuration | tensorrt_llm/_torch/custom_ops/torch_custom_ops.py, tensorrt_llm/_torch/pyexecutor/model_engine.py | Removes top-level PDL init from torch_custom_ops.py and adds a one-time _configure_deep_gemm_pdl() helper called from PyTorchModelEngine.__init__. |

Third-party source updates

| Layer | File(s) | Summary |
| --- | --- | --- |
| Fetch content entries | 3rdparty/fetch_content.json | deep_ep_download loses its patch_file override; eigen switches from GitLab to GitHub URL. |

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#12884: Modifies the same CuteDSL Blackwell activation-fusion MoE path in cute_dsl_custom_ops.py and related runners, adding an is_gated SwiGLU/Relu2 switch where this PR threads swiglu_limit_scalar.
  • NVIDIA/TensorRT-LLM#13908: Modifies the same moe_scheduler.py and mega_moe_deepgemm.py files, changing how FusedCommMoEScheduler/MegaMoEDeepGemm prepare inputs for MegaMoE execution.
  • NVIDIA/TensorRT-LLM#15559: Affects NCCLWindowAllocator directly, which is queried by the new is_nccl_window_buffer op added in this PR.

Suggested reviewers

  • xxi-nv
  • syuoni
  • hyukn
  • leslie-fang25
  • yuxianq
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 32.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title follows the required ticket/type format and clearly summarizes the DeepGEMM/MegaMoE follow-up scope. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description follows the template and includes a clear summary, test coverage, and checklist confirmation. |

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 5

🧹 Nitpick comments (6)
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (1)

621-621: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add no-clamp coverage for the new optional parameter.

Coverage is insufficient for the default/disabled path: these updated tests exercise swiglu_limit=1.0, but not float("inf") or the omitted-argument default for the SwiGLU op paths. Please parametrize the SwiGLU cases in tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py over at least [float("inf"), 1.0].

As per path instructions, tests/**: “Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM.”

Also applies to: 763-763

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py` at line 621,
Coverage for the new SwiGLU optional parameter is missing the no-clamp/default
path, so update the SwiGLU test cases in test_cute_dsl_moe to parameterize
swiglu_limit over both float("inf") and 1.0, and make sure the relevant op-path
assertions run with the omitted-argument/default behavior as well. Use the
existing SwiGLU test helpers and parameterized cases in
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py to cover both
enabled-clamp and disabled-clamp scenarios.

Source: Path instructions
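
The two cases requested above (clamp disabled via float("inf") and clamp enabled at 1.0) can be compared against a small pure-Python model of a bounded SwiGLU. This is only a sketch: `bounded_swiglu` is a hypothetical helper, and the clamp convention it uses (gate and up both clipped symmetrically to [-limit, limit] before the activation) is an assumption for illustration, not the confirmed behavior of the CuTeDSL kernels.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def unbounded_swiglu(gate: float, up: float) -> float:
    """Plain SwiGLU with no clamping."""
    return silu(gate) * up


def bounded_swiglu(gate: float, up: float, limit: float = float("inf")) -> float:
    """Hypothetical reference for a clamped SwiGLU.

    ASSUMPTION: both operands are clipped to [-limit, limit]; the real
    kernels may clamp gate and up differently.
    """
    gate = max(-limit, min(gate, limit))
    up = max(-limit, min(up, limit))
    return silu(gate) * up


def check_no_clamp_matches_unbounded() -> None:
    """limit=inf and the omitted-argument default must both be no-ops."""
    for gate, up in [(-3.0, 0.5), (0.25, -0.75), (2.0, 4.0)]:
        expected = unbounded_swiglu(gate, up)
        assert bounded_swiglu(gate, up, float("inf")) == expected
        assert bounded_swiglu(gate, up) == expected
```

In a pytest suite the same idea would typically be expressed by parametrizing the limit over both values, so that the disabled path and the enabled path share one test body.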

tensorrt_llm/_torch/mmap_utils.py (2)

102-102: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Annotate advise_tensor_pageout with parameter and return types.

The sibling helpers carry full type hints, but this public function omits the tensor type and the -> None return. As per coding guidelines, always annotate functions with return types (use None if no return). Use a TYPE_CHECKING import to avoid pulling torch at module load.

✏️ Proposed annotations
 import ctypes
 import mmap
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    import torch
-def advise_tensor_pageout(tensor, mode: str = "dontneed"):
+def advise_tensor_pageout(tensor: "torch.Tensor", mode: str = "dontneed") -> None:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/mmap_utils.py` at line 102, Add full type annotations to
the public helper `advise_tensor_pageout`: annotate the `tensor` parameter with
the appropriate Torch tensor type and explicitly declare the return type as
`None`. Because this module should avoid importing `torch` at runtime, use a
`TYPE_CHECKING` guarded import for the tensor type and update the
`advise_tensor_pageout` signature to match the other typed helpers.

Source: Coding guidelines


62-72: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

libc is re-dlopened on every madvise_range call.

advise_tensor_pageout is invoked per-tensor (e.g. the loaded_shared_weights loop in moe_load_balancer.finalize_layer_weights), so a fresh ctypes.CDLL is created on each weight. Cache it lazily so the lib is opened once while preserving the current Linux-only lazy-failure behavior (a module-level CDLL would break import on non-Linux hosts).

♻️ Lazy libc cache
+_LIBC = None
+
+
+def _get_libc() -> "ctypes.CDLL":
+    global _LIBC
+    if _LIBC is None:
+        _LIBC = ctypes.CDLL("libc.so.6", use_errno=True)
+    return _LIBC
+
+
 def madvise_range(addr: int, size: int, mode: str = "dontneed") -> None:
     if size <= 0:
         return
     try:
         advice = _MADV_ADVICE_BY_MODE[mode]
     except KeyError:
         raise ValueError("mode must be 'pageout' or 'dontneed'.")
-    libc = ctypes.CDLL("libc.so.6", use_errno=True)
-    ret = libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(size), ctypes.c_int(advice))
+    ret = _get_libc().madvise(ctypes.c_void_p(addr), ctypes.c_size_t(size), ctypes.c_int(advice))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/mmap_utils.py` around lines 62 - 72, The `madvise_range`
helper in `tensorrt_llm/_torch/mmap_utils.py` re-opens `libc.so.6` on every
call, which is expensive in per-tensor paths like `advise_tensor_pageout` and
`moe_load_balancer.finalize_layer_weights`. Update the function to use a lazily
initialized module-level cache for the `ctypes.CDLL` handle so `libc` is loaded
once and reused across calls, while keeping the current Linux-only lazy failure
behavior intact. Preserve the existing `madvise_range` and
`_MADV_ADVICE_BY_MODE` flow and avoid moving the `CDLL` creation to import time.
tensorrt_llm/_torch/models/checkpoints/base_weight_loader.py (1)

99-114: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Best-effort page-out should not be able to skip key deletion.

The whole page-out block (imports, torch.cuda.synchronize(), the loop) lives in one try guarded only by except ImportError. A non-ImportError from torch.cuda.synchronize() (e.g. RuntimeError when CUDA is unavailable/uninitialized) propagates out of mark_consumed, skipping the del self._weights[key] loop at Lines 115-116 — so weights are neither paged out nor freed and the caller sees an exception. As per coding guidelines, keep the try body minimal and move the logic into an else block, and ensure the optimization can't break the core deletion.

♻️ Restructure so deletion always runs
             if keys_to_delete:
                 try:
                     import torch
-
                     from tensorrt_llm._torch.mmap_utils import \
                         advise_tensor_pageout
-                    torch.cuda.synchronize()
-                    for k in keys_to_delete:
-                        t = self._weights[k]
-                        if isinstance(t,
-                                      torch.Tensor) and t.device.type == "cpu":
-                            try:
-                                advise_tensor_pageout(t, mode="dontneed")
-                            except (OSError, ValueError):
-                                pass
                 except ImportError:
                     pass
+                else:
+                    try:
+                        torch.cuda.synchronize()
+                        for k in keys_to_delete:
+                            t = self._weights[k]
+                            if isinstance(
+                                    t, torch.Tensor) and t.device.type == "cpu":
+                                try:
+                                    advise_tensor_pageout(t, mode="dontneed")
+                                except (OSError, ValueError):
+                                    pass
+                    except RuntimeError:
+                        # CUDA unavailable: skip best-effort pageout only.
+                        pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/models/checkpoints/base_weight_loader.py` around lines 99
- 114, The page-out optimization in mark_consumed currently wraps imports,
torch.cuda.synchronize(), and the tensor loop in one try, so a non-ImportError
can abort before key deletion runs. Keep the try block limited to the torch
import path in base_weight_loader.py, move the synchronize/page-out loop into
the corresponding else block (or otherwise isolate it), and make sure the final
del self._weights[key] cleanup always executes even if page-out fails.

Source: Coding guidelines
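
The requirement that key deletion always runs can be illustrated in isolation. The sketch below is a hypothetical stand-in, not the real loader: `WeightStore` and its `pageout` callback are invented names, and it uses try/finally rather than the try/else layout in the proposed diff, which gives the same guarantee that cleanup runs when the best-effort step fails.

```python
from typing import Callable, Dict, Iterable


class WeightStore:
    """Hypothetical stand-in for a loader that owns a dict of weights."""

    def __init__(self, weights: Dict[str, object],
                 pageout: Callable[[object], None]) -> None:
        self._weights = dict(weights)
        self._pageout = pageout  # best-effort optimization; may raise

    def mark_consumed(self, keys: Iterable[str]) -> None:
        keys_to_delete = [k for k in keys if k in self._weights]
        try:
            for k in keys_to_delete:
                self._pageout(self._weights[k])
        except (OSError, ValueError, RuntimeError):
            # Page-out is only an optimization; swallow its known failures.
            pass
        finally:
            # Core cleanup: runs whether or not page-out succeeded.
            for k in keys_to_delete:
                del self._weights[k]

    def keys(self) -> list:
        return sorted(self._weights)


def failing_pageout(_tensor: object) -> None:
    """Simulates torch.cuda.synchronize() failing on a CUDA-less host."""
    raise RuntimeError("CUDA unavailable")
```

With `failing_pageout` installed, `mark_consumed(["a"])` still removes `"a"` and leaves the other keys untouched, which is the property the review comment asks the real code to preserve.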

cpp/tensorrt_llm/kernels/megaMoePrepareKernel.h (1)

28-44: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Document the new public header API with Doxygen.

This adds two public enums and a public kernel entry point, but the header does not describe the contract yet. Please add //! docs here, including the expected tensor shapes/dtypes for the tensor-like arguments, so callers do not need to reverse-engineer the .cu/.cpp files.

As per coding guidelines, **/*.{h,hpp,hxx}: Follow Doxygen rules for documenting class interfaces and function prototypes; use //! for C++ style comments, and public Tensor-like arguments should document expected dimensions and allowed dtypes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/kernels/megaMoePrepareKernel.h` around lines 28 - 44, The
public API in MegaMoePrepareExpertType, MegaMoePrepareScaleType, and
invokeMegaMoePrepare lacks the required Doxygen contract. Add //! documentation
for each enum and the kernel entry point, and describe the expected tensor-like
argument shapes/dtypes for input, tokenSelectedExperts, tokenFinalScales, xOut,
xSfOut, topkIdxOut, and topkWeightsOut so users can call invokeMegaMoePrepare
without inspecting implementation details. Use the existing symbols in
megaMoePrepareKernel.h and follow the C++ header documentation style required
for public interfaces.

Source: Coding guidelines

tests/unittest/_torch/modules/moe/test_moe_scheduler.py (1)

126-237: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Coverage is adequate for the changed scheduler logic; one optional follow-up. The four tests exercise routing separation, TRTLLMGen kwarg gating, and both fused-prepare branches, and the data_ptr() assertions correctly validate the single-chunk no-copy path. The multi-chunk concatenation path (len(outputs) > 1 → torch.cat) isn't directly asserted. Consider adding a case with moe_max_num_tokens small enough to force num_chunks > 1 so the concat path is covered. Also confirm this new file is discovered by the relevant CI unit-test list.

As per path instructions for tests/**: assess whether coverage is sufficient and suggest concrete follow-up.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/moe/test_moe_scheduler.py` around lines 126 -
237, Add a follow-up test for `FusedCommMoEScheduler.forward`/`_forward_chunk`
that forces `num_chunks > 1` so the `torch.cat` multi-chunk path is executed and
asserted, since current coverage only checks the single-chunk no-copy path.
Reuse the existing `_FakeMoe`, `_RecordingFusedBackend`, and
`FusedCommMoEScheduler` helpers, but choose inputs or `moe_max_num_tokens` so
multiple outputs are produced and concatenated. Also verify the new test module
is picked up by the relevant CI/unit-test discovery for `tests/**`.
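
The shape of such a multi-chunk test can be sketched without the real scheduler. Everything here is hypothetical: `forward_in_chunks` and `fake_backend` are illustrative names, and plain Python lists stand in for tensors, with list concatenation playing the role that torch.cat plays in the actual code.

```python
from typing import Callable, List, Sequence


def forward_in_chunks(
    tokens: Sequence[int],
    max_num_tokens: int,
    run_chunk: Callable[[Sequence[int]], List[int]],
) -> List[int]:
    """Hypothetical stand-in for a chunked scheduler forward pass."""
    if max_num_tokens <= 0:
        raise ValueError("max_num_tokens must be positive")
    outputs = [
        run_chunk(tokens[start:start + max_num_tokens])
        for start in range(0, len(tokens), max_num_tokens)
    ]
    if len(outputs) == 1:
        # Single chunk: hand back the backend's own output object, no copy.
        return outputs[0]
    # Multiple chunks: concatenate in order.
    merged: List[int] = []
    for chunk_out in outputs:
        merged.extend(chunk_out)
    return merged


def test_multi_chunk_concat() -> None:
    seen_sizes: List[int] = []

    def fake_backend(chunk: Sequence[int]) -> List[int]:
        seen_sizes.append(len(chunk))
        return [t * 10 for t in chunk]

    result = forward_in_chunks(list(range(5)), max_num_tokens=2,
                               run_chunk=fake_backend)
    # Five tokens with a limit of two force three backend calls.
    assert seen_sizes == [2, 2, 1]
    # Outputs are concatenated in the original token order.
    assert result == [0, 10, 20, 30, 40]
```

The key design point is choosing a token count that is not a multiple of the chunk limit, so the test also covers a short final chunk.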
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp`:
- Around line 72-78: Add explicit device guarding and same-device validation in
megaMoePrepareOp before launching the kernel. CHECK_TH_CUDA only confirms CUDA
tensors, so update the launch path around megaMoePrepareOp and its use of
at::cuda::getCurrentCUDAStream(input.get_device()) by introducing a
c10::cuda::CUDAGuard for input.device() and verifying tokenSelectedExperts,
tokenFinalScales, xOut, xSfOut, topkIdxOut, and topkWeightsOut all match
input.device(). Reject mixed-device inputs/outputs before the kernel launch to
prevent running on the wrong GPU or passing cross-device pointers.

In
`@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_act_fusion.py`:
- Around line 405-406: The swiglu_limit handling in
blockscaled_contiguous_gather_grouped_gemm_act_fusion should normalize negative
values before setting the clamp flag, because fclip_xorsign only works as
symmetric clipping for positive limits. Update the initialization around
swiglu_limit and has_swiglu_limit so negative sentinel values are treated like
disabling the limit, consistent with the custom-op sentinel behavior, and only
enable clipping when the effective limit is positive.

In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py`:
- Around line 346-350: The swiglu_limit_scalar allowlist in create_moe.py is
missing CuteDslB12xFusedMoE, which now receives that argument through the shared
CuteDSL constructor path. Update the moe_cls validation list in the
swiglu_limit_scalar check to include CuteDslB12xFusedMoE alongside
CuteDslFusedMoE and the other supported classes so B12x requests with scalar
clamp pass validation and reach the backend.

In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py`:
- Around line 66-70: Canonicalize negative swiglu_limit_scalar values in
CuteDslFusedMoE so they behave like disabled clamping instead of producing an
invalid clamp range in swiglu_fused_moe. Update the code path that stores/passes
this field in CuteDslFusedMoE to normalize any negative value to float("inf")
before it reaches swiglu_fused_moe, keeping the behavior aligned with the
custom-op contract.

In `@tests/unittest/_torch/thop/parallel/test_fp8_quantize.py`:
- Around line 237-301: Add negative coverage in test_megamoe_prepare_* and
_run_megamoe_prepare so the public megaMoePrepare op’s validation paths are
exercised, not just the happy path. Specifically, add cases in
tests/unittest/_torch/thop/parallel/test_fp8_quantize.py for unsupported
input/output dtypes, mismatched [num_tokens, top_k] shapes, hidden_size not
divisible by 128, and undersized x_out/x_sf_out buffers. Use the existing test
helpers and torch.ops.trtllm.megamoe_prepare entrypoint to keep the checks
aligned with cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp.
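
The negative-limit normalization requested in the two swiglu_limit comments above could look roughly like the following. This is a sketch under stated assumptions: `normalize_swiglu_limit`, `has_swiglu_limit`, and `SWIGLU_LIMIT_DISABLED` are hypothetical names (the PR's own sentinel is SWIGLU_LIMIT_SCALAR_DISABLED, whose value is not shown here), and treating zero and NaN as "disabled" is an assumption that goes slightly beyond the comments, which only mention negative values.

```python
import math
from typing import Optional

# Hypothetical sentinel meaning "no clamp"; not the PR's actual constant.
SWIGLU_LIMIT_DISABLED = float("inf")


def normalize_swiglu_limit(limit: Optional[float]) -> float:
    """Map missing, NaN, or non-positive limits to the disabled sentinel.

    ASSUMPTION: zero and NaN are treated like negative values, so clipping
    is only ever enabled for a strictly positive limit.
    """
    if limit is None or math.isnan(limit) or limit <= 0:
        return SWIGLU_LIMIT_DISABLED
    return float(limit)


def has_swiglu_limit(limit: Optional[float]) -> bool:
    """True only when the effective limit is a finite positive number."""
    return math.isfinite(normalize_swiglu_limit(limit))
```

Normalizing once at the boundary means downstream code only needs to distinguish a finite positive limit from the disabled sentinel, rather than re-checking the sign at every call site.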


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cf04400e-17c8-4b74-b996-121cfebaf1f9

📥 Commits

Reviewing files that changed from the base of the PR and between edb14ee and bba0cb0.

📒 Files selected for processing (30)
  • 3rdparty/fetch_content.json
  • cpp/tensorrt_llm/kernels/megaMoePrepareKernel.cu
  • cpp/tensorrt_llm/kernels/megaMoePrepareKernel.h
  • cpp/tensorrt_llm/thop/CMakeLists.txt
  • cpp/tensorrt_llm/thop/allreduceOp.cpp
  • cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp
  • cpp/tensorrt_llm/thop/moeOp.cpp
  • tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_act_fusion.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.py
  • tensorrt_llm/_torch/mmap_utils.py
  • tensorrt_llm/_torch/models/checkpoints/base_weight_loader.py
  • tensorrt_llm/_torch/modules/fused_moe/create_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • tensorrt_llm/_torch/modules/fused_moe/mega_moe/mega_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py
  • tensorrt_llm/_torch/modules/fused_moe/ops/moe_op_cutlass.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/model_loader.py
  • tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_act_fusion.py
  • tests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_swiglu_fusion.py
  • tests/unittest/_torch/modules/moe/test_moe_backend.py
  • tests/unittest/_torch/modules/moe/test_moe_scheduler.py
  • tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py
  • tests/unittest/_torch/thop/parallel/test_fp8_quantize.py

Comment on lines +72 to +78
CHECK_TH_CUDA(input);
CHECK_TH_CUDA(tokenSelectedExperts);
CHECK_TH_CUDA(tokenFinalScales);
CHECK_TH_CUDA(xOut);
CHECK_TH_CUDA(xSfOut);
CHECK_TH_CUDA(topkIdxOut);
CHECK_TH_CUDA(topkWeightsOut);

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== File outline ==\n'
ast-grep outline cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp --view expanded || true

printf '\n== Relevant lines ==\n'
nl -ba cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp | sed -n '1,220p'

printf '\n== Search for similar CUDA device handling ==\n'
rg -n "CUDAGuard|CHECK_TH_CUDA|getCurrentCUDAStream|device_guard|deviceGuard|device\(\)" cpp/tensorrt_llm/thop -g '*.cpp' -g '*.h'

Repository: NVIDIA/TensorRT-LLM

Length of output: 296


🏁 Script executed:

#!/bin/bash
set -euo pipefail
nl -ba cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp | sed -n '1,220p'
rg -n "CUDAGuard|CHECK_TH_CUDA|getCurrentCUDAStream|device_guard|deviceGuard|device\(\)" cpp/tensorrt_llm/thop -g '*.cpp' -g '*.h'

Repository: NVIDIA/TensorRT-LLM

Length of output: 197


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "hello"

Repository: NVIDIA/TensorRT-LLM

Length of output: 163


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '== tracked paths matching megaMoePrepare ==\n'
git ls-files | rg 'megaMoePrepare|MegaMoePrepare|megaMoe|MegaMoe' || true

printf '\n== tracked paths under cpp/tensorrt_llm/thop ==\n'
git ls-files cpp/tensorrt_llm/thop || true

printf '\n== candidate file contents (if present) ==\n'
for f in $(git ls-files | rg 'megaMoePrepare|MegaMoePrepare|megaMoe|MegaMoe' || true); do
    printf '\n--- %s ---\n' "$f"
    sed -n '1,220p' "$f"
done

Repository: NVIDIA/TensorRT-LLM

Length of output: 19222


Guard the launch device and require same-device tensors.

CHECK_TH_CUDA only proves CUDA placement; it does not ensure that every tensor matches input.device(). Add a c10::cuda::CUDAGuard before at::cuda::getCurrentCUDAStream(input.get_device()) and reject mixed-device inputs/outputs; otherwise the kernel can launch on the wrong device and receive cross-device pointers.

Location: cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp:72-130

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp` around lines 72 - 78, Add
explicit device guarding and same-device validation in megaMoePrepareOp before
launching the kernel. CHECK_TH_CUDA only confirms CUDA tensors, so update the
launch path around megaMoePrepareOp and its use of
at::cuda::getCurrentCUDAStream(input.get_device()) by introducing a
c10::cuda::CUDAGuard for input.device() and verifying tokenSelectedExperts,
tokenFinalScales, xOut, xSfOut, topkIdxOut, and topkWeightsOut all match
input.device(). Reject mixed-device inputs/outputs before the kernel launch to
prevent running on the wrong GPU or passing cross-device pointers.

Comment on lines +405 to +406
self.swiglu_limit = swiglu_limit
self.has_swiglu_limit = swiglu_limit != float("inf")

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Normalize negative swiglu_limit before enabling clamp.

fclip_xorsign only behaves like symmetric clipping when limit > 0; storing a negative value here enables clamp and changes the sign semantics instead of disabling it. Mirror the custom-op sentinel handling before deriving has_swiglu_limit.

Proposed fix
-        self.swiglu_limit = swiglu_limit
+        if swiglu_limit < 0:
+            swiglu_limit = float("inf")
+        self.swiglu_limit = swiglu_limit
         self.has_swiglu_limit = swiglu_limit != float("inf")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_act_fusion.py`
around lines 405 - 406, The swiglu_limit handling in
blockscaled_contiguous_gather_grouped_gemm_act_fusion should normalize negative
values before setting the clamp flag, because fclip_xorsign only works as
symmetric clipping for positive limits. Update the initialization around
swiglu_limit and has_swiglu_limit so negative sentinel values are treated like
disabling the limit, consistent with the custom-op sentinel behavior, and only
enable clipping when the effective limit is positive.
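
The normalization requested here can be sketched in isolation. This is a minimal sketch, not the kernel class's actual code: the helper name `normalize_swiglu_limit` is hypothetical, and it assumes, as the comment states, that a negative value is a "disabled" sentinel and that `float("inf")` means no clamp. A limit of exactly 0 is passed through unchanged, matching the proposed fix.

```python
import math


def normalize_swiglu_limit(swiglu_limit: float) -> tuple[float, bool]:
    """Return (effective_limit, has_limit) for a raw swiglu_limit value.

    Hypothetical helper: negative values are treated as a "disabled"
    sentinel and mapped to +inf, mirroring the proposed fix above.
    """
    if swiglu_limit < 0:
        swiglu_limit = math.inf
    has_limit = swiglu_limit != math.inf
    return swiglu_limit, has_limit


if __name__ == "__main__":
    for raw in (-1.0, math.inf, 7.0):
        print(raw, normalize_swiglu_limit(raw))
```

Deriving the flag from the normalized value, rather than from the raw argument, keeps the stored limit and the flag from disagreeing.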

Comment on lines 346 to 350
     if swiglu_limit_scalar is not None:
         assert moe_cls in [
             CutlassFusedMoE, TRTLLMGenFusedMoE, WideEPMoE, DeepGemmFusedMoE,
-            MegaMoEDeepGemm
+            MegaMoEDeepGemm, CuteDslFusedMoE
         ], f"swiglu_limit_scalar is not supported in {moe_cls.__name__}."

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Add CuteDslB12xFusedMoE to the scalar clamp allowlist.

The allowlist here omits CuteDslB12xFusedMoE, but Line 447 now forwards swiglu_limit_scalar through the shared CuteDSL constructor path. Any B12x backend request with a scalar clamp will fail validation before it reaches the backend.

Suggested fix
     if swiglu_limit_scalar is not None:
         assert moe_cls in [
             CutlassFusedMoE, TRTLLMGenFusedMoE, WideEPMoE, DeepGemmFusedMoE,
-            MegaMoEDeepGemm, CuteDslFusedMoE
+            MegaMoEDeepGemm, CuteDslFusedMoE, CuteDslB12xFusedMoE
         ], f"swiglu_limit_scalar is not supported in {moe_cls.__name__}."

Also applies to: 447-447

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py` around lines 346 - 350,
The swiglu_limit_scalar allowlist in create_moe.py is missing
CuteDslB12xFusedMoE, which now receives that argument through the shared CuteDSL
constructor path. Update the moe_cls validation list in the swiglu_limit_scalar
check to include CuteDslB12xFusedMoE alongside CuteDslFusedMoE and the other
supported classes so B12x requests with scalar clamp pass validation and reach
the backend.

Comment on lines +66 to +70
def swiglu_fused_moe(x, swiglu_limit_scalar: float = float("inf")):
    x, gate = x.chunk(2, dim=-1)
    if swiglu_limit_scalar != float("inf"):
        gate = gate.clamp(max=swiglu_limit_scalar)
        x = x.clamp(min=-swiglu_limit_scalar, max=swiglu_limit_scalar)

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== File list =="
git ls-files tensorrt_llm/_torch/modules/fused_moe | sed -n '1,120p'

echo
echo "== Outline: fused_moe_cute_dsl.py =="
ast-grep outline tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py --view expanded || true

echo
echo "== Relevant references =="
rg -n "swiglu_limit_scalar|swiglu_fused_moe|canonicalize" tensorrt_llm/_torch/modules/fused_moe -S

Repository: NVIDIA/TensorRT-LLM

Length of output: 10284


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== fused_moe_cute_dsl.py relevant sections =="
sed -n '430,470p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
echo
sed -n '760,780p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py

echo
echo "== interface.py relevant section =="
sed -n '270,320p' tensorrt_llm/_torch/modules/fused_moe/interface.py

echo
echo "== create_moe.py relevant section =="
sed -n '330,360p' tensorrt_llm/_torch/modules/fused_moe/create_moe.py

echo
echo "== trtllm_gen relevant section =="
sed -n '460,490p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 9216


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Search for negative/sentinel handling =="
rg -n "swiglu_limit_scalar.*< 0|< 0.*swiglu_limit_scalar|float\\(\"inf\"\\)|disabled sentinel|canonicalize" tensorrt_llm/_torch/modules/fused_moe tensorrt_llm -S

echo
echo "== swiglu_fused_moe consumer / runner lines =="
sed -n '720,780p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py

echo
echo "== Any tests mentioning swiglu_limit_scalar =="
rg -n "swiglu_limit_scalar" tests tensorrt_llm -S

Repository: NVIDIA/TensorRT-LLM

Length of output: 24654


🏁 Script executed:

python3 - <<'PY'
from math import inf

def current_behavior(swiglu_limit_scalar):
    # mirrors tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
    if swiglu_limit_scalar != inf:
        gate_clamp = swiglu_limit_scalar
        x_min = -swiglu_limit_scalar
        x_max = swiglu_limit_scalar
        return gate_clamp, x_min, x_max
    return None

for v in [None, inf, -1.0, 0.0, 3.5]:
    if v is None:
        v = inf
    print(v, current_behavior(v))
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 240


Canonicalize negative swiglu_limit_scalar in CuteDslFusedMoE.

swiglu_fused_moe() treats any non-inf value as an active clamp, so -1.0 becomes an inverted clamp range instead of disabling the path. Normalize negative values to inf when storing the module field, matching the custom-op contract.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py` around lines 66
- 70, Canonicalize negative swiglu_limit_scalar values in CuteDslFusedMoE so
they behave like disabled clamping instead of producing an invalid clamp range
in swiglu_fused_moe. Update the code path that stores/passes this field in
CuteDslFusedMoE to normalize any negative value to float("inf") before it
reaches swiglu_fused_moe, keeping the behavior aligned with the custom-op
contract.
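
To make the intended semantics concrete, here is a scalar reference for the clamped activation with the sentinel canonicalized up front. It is a sketch under stated assumptions: the excerpt above does not show what `swiglu_fused_moe` returns, so the `silu(gate) * x` form is assumed, and `silu` and `swiglu_ref` are hypothetical names operating on plain floats rather than tensors.

```python
import math


def silu(v: float) -> float:
    """SiLU, v * sigmoid(v), written with tanh to avoid exp overflow."""
    return v * 0.5 * (1.0 + math.tanh(0.5 * v))


def swiglu_ref(x: float, gate: float, limit: float = math.inf) -> float:
    """Scalar reference for a clamped SwiGLU.

    Assumptions (not shown in the excerpt above): the output is
    silu(gate) * x, and a negative limit means "clamp disabled".
    """
    if limit < 0:  # canonicalize the disabled sentinel to +inf
        limit = math.inf
    if limit != math.inf:
        gate = min(gate, limit)         # gate: upper bound only
        x = max(-limit, min(x, limit))  # x: symmetric clamp
    return silu(gate) * x


if __name__ == "__main__":
    print(swiglu_ref(10.0, 2.0))        # clamp disabled (default)
    print(swiglu_ref(10.0, 2.0, -1.0))  # sentinel: same as disabled
    print(swiglu_ref(10.0, 2.0, 3.5))   # x clamped to 3.5
```

With the canonicalization in place, `-1.0` and `float("inf")` give identical results, which is the behavior the comment asks the module field to guarantee.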

Comment on lines +237 to +301
@pytest.mark.parametrize("m", [1, 7, 128])
@pytest.mark.parametrize("k", [128, 512])
@pytest.mark.parametrize("top_k", [1, 6])
@skip_pre_blackwell_unittest
def test_megamoe_prepare_matches_mxfp8_quantize(m, k, top_k):
    torch.random.manual_seed(123)
    hidden_states = (torch.randn([m, k], dtype=torch.float) * 16).to(
        torch.bfloat16).cuda().contiguous()
    token_selected_experts = torch.randint(0,
                                           384, (m, top_k),
                                           dtype=torch.int32,
                                           device="cuda")
    token_final_scales = torch.randn((m, top_k),
                                     dtype=torch.float32,
                                     device="cuda")

    _run_megamoe_prepare(hidden_states, token_selected_experts,
                         token_final_scales)


@pytest.mark.parametrize("expert_dtype", [torch.int32, torch.int64])
@pytest.mark.parametrize("scale_dtype",
                         [torch.float32, torch.float16, torch.bfloat16])
@skip_pre_blackwell_unittest
def test_megamoe_prepare_accepts_supported_topk_dtypes(expert_dtype,
                                                       scale_dtype):
    torch.random.manual_seed(123)
    m, k, top_k = 5, 256, 3
    hidden_states = (torch.randn([m, k], dtype=torch.float) * 16).to(
        torch.bfloat16).cuda().contiguous()
    token_selected_experts = torch.randint(0,
                                           384, (m, top_k),
                                           dtype=expert_dtype,
                                           device="cuda")
    token_final_scales = torch.randn((m, top_k),
                                     dtype=scale_dtype,
                                     device="cuda")

    _run_megamoe_prepare(hidden_states, token_selected_experts,
                         token_final_scales)


@skip_pre_blackwell_unittest
def test_megamoe_prepare_allows_zero_tokens():
    k, top_k = 256, 3
    hidden_states = torch.empty((0, k), dtype=torch.bfloat16, device="cuda")
    token_selected_experts = torch.empty((0, top_k),
                                         dtype=torch.int32,
                                         device="cuda")
    token_final_scales = torch.empty((0, top_k),
                                     dtype=torch.float32,
                                     device="cuda")
    x_out = torch.empty((1, k), dtype=torch.float8_e4m3fn, device="cuda")
    x_sf_out = torch.empty((1, k // 128), dtype=torch.int32, device="cuda")
    topk_idx_out = torch.empty((1, top_k), dtype=torch.int64, device="cuda")
    topk_weights_out = torch.empty((1, top_k),
                                   dtype=torch.float32,
                                   device="cuda")

    torch.ops.trtllm.megamoe_prepare(hidden_states, token_selected_experts,
                                     token_final_scales, x_out, x_sf_out,
                                     topk_idx_out, topk_weights_out)
    torch.cuda.synchronize()


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Coverage is still insufficient for the public op's validation contract.

These tests cover the happy path, dtype matrix, and zero-token fast path, but cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp also rejects bad dtypes, mismatched shapes, undersized output buffers, and hidden_size % 128 != 0. None of those guards is exercised here, so a regression in the public API contract would ship unnoticed.

Please add negative cases in tests/unittest/_torch/thop/parallel/test_fp8_quantize.py for at least:

  • unsupported input / output dtypes,
  • mismatched [num_tokens, top_k] shapes,
  • hidden_size not divisible by 128,
  • undersized x_out / x_sf_out.

As per path instructions, tests/**: Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM. Keep feedback actionable: suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/thop/parallel/test_fp8_quantize.py` around lines 237 -
301, Add negative coverage in test_megamoe_prepare_* and _run_megamoe_prepare so
the public megaMoePrepare op’s validation paths are exercised, not just the
happy path. Specifically, add cases in
tests/unittest/_torch/thop/parallel/test_fp8_quantize.py for unsupported
input/output dtypes, mismatched [num_tokens, top_k] shapes, hidden_size not
divisible by 128, and undersized x_out/x_sf_out buffers. Use the existing test
helpers and torch.ops.trtllm.megamoe_prepare entrypoint to keep the checks
aligned with cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp.
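
One way to organize the requested negative cases is a table of single-field overrides applied to a known-good argument set. The sketch below illustrates only that structure and runs without a GPU: `check_prepare_args` is a hypothetical pure-Python stand-in, not the checks in megaMoePrepareOp.cpp, and its shape rules are assumptions taken from the tests above (for example, `x_sf_out` having `k // 128` columns). Dtype cases are omitted because the stand-in models shapes only. In the real test file, each case would call `torch.ops.trtllm.megamoe_prepare` and assert on whatever exception the op actually raises.

```python
def check_prepare_args(hidden_shape, experts_shape, scales_shape,
                       x_out_shape, x_sf_out_shape):
    """Hypothetical stand-in for the op's shape guards (not the real checks)."""
    num_tokens, hidden_size = hidden_shape
    if hidden_size % 128 != 0:
        raise ValueError("hidden_size must be divisible by 128")
    if experts_shape != scales_shape or experts_shape[0] != num_tokens:
        raise ValueError("expected matching [num_tokens, top_k] shapes")
    if x_out_shape[0] < num_tokens or x_out_shape[1] != hidden_size:
        raise ValueError("x_out is undersized")
    if (x_sf_out_shape[0] < num_tokens
            or x_sf_out_shape[1] != hidden_size // 128):
        raise ValueError("x_sf_out is undersized")


# Known-good baseline; each bad case overrides exactly one field.
GOOD = dict(hidden_shape=(5, 256), experts_shape=(5, 3), scales_shape=(5, 3),
            x_out_shape=(5, 256), x_sf_out_shape=(5, 2))

BAD_CASES = {
    "hidden_not_multiple_of_128": dict(hidden_shape=(5, 200)),
    "mismatched_topk_shapes": dict(scales_shape=(5, 4)),
    "wrong_token_count": dict(experts_shape=(4, 3), scales_shape=(4, 3)),
    "undersized_x_out": dict(x_out_shape=(4, 256)),
    "undersized_x_sf_out": dict(x_sf_out_shape=(4, 2)),
}


def rejected(overrides):
    """Return True if the baseline plus overrides fails validation."""
    try:
        check_prepare_args(**{**GOOD, **overrides})
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    for name, overrides in BAD_CASES.items():
        print(name, rejected(overrides))
```

Keeping one override per case means a failure points at a single guard, and the table maps directly onto a `pytest.mark.parametrize` list.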

Source: Path instructions
