[TRTLLM-12950][perf] DSv4 follow-up: DeepGEMM and MegaMoE#15632
[TRTLLM-12950][perf] DSv4 follow-up: DeepGEMM and MegaMoE#15632lfr-0531 wants to merge 13 commits into
Conversation
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> (cherry picked from commit f172310)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit 41a16de)
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit 2fbd0e1)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> (cherry picked from commit a0d5400)
Signed-off-by: Yuhan Li <51736452+liyuhannnnn@users.noreply.github.com> Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com> Co-authored-by: Yuhan Li <51736452+liyuhannnnn@users.noreply.github.com> (cherry picked from commit 15602a2) Signed-off-by: Qi Zhang (qizh) <10434017+Tracin@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com> (cherry picked from commit e0539a5) Signed-off-by: Qi Zhang (qizh) <10434017+Tracin@users.noreply.github.com>
…DIA#14673) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit e97de0d)
…VIDIA#14658) Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> (cherry picked from commit 2dd03e6)
… nodes (NVIDIA#14880) Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com> (cherry picked from commit 524b20d) Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
(cherry picked from commit d7f9e4a) Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com> Signed-off-by: Mingyang <mingyangh@nvidia.com> # Conflicts: # tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py # tests/unittest/_torch/modules/moe/test_moe_scheduler.py
Signed-off-by: Shicheng Li <shicli@nvidia.com> (cherry picked from commit 1264ed2) Signed-off-by: lishicheng1996-nv <shicli@nvidia.com>
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> (cherry picked from commit 435a0a2)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
64398b4 to
bba0cb0
Compare
📝 WalkthroughWalkthroughThe PR adds a fused MegaMoE prepare CUDA op ( ChangesMoE runtime and kernel plumbing
Third-party source updates
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (6)
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (1)
621-621: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd no-clamp coverage for the new optional parameter.
Coverage is insufficient for the default/disabled path: these updated tests exercise
swiglu_limit=1.0, but notfloat("inf")or the omitted-argument default for the SwiGLU op paths. Please parametrize the SwiGLU cases intests/unittest/_torch/thop/parallel/test_cute_dsl_moe.pyover at least[float("inf"), 1.0].As per path instructions,
tests/**: “Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM.”Also applies to: 763-763
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py` at line 621, Coverage for the new SwiGLU optional parameter is missing the no-clamp/default path, so update the SwiGLU test cases in test_cute_dsl_moe to parameterize swiglu_limit over both float("inf") and 1.0, and make sure the relevant op-path assertions run with the omitted-argument/default behavior as well. Use the existing SwiGLU test helpers and parameterized cases in tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py to cover both enabled-clamp and disabled-clamp scenarios.Source: Path instructions
tensorrt_llm/_torch/mmap_utils.py (2)
102-102: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAnnotate
advise_tensor_pageoutwith parameter and return types.The sibling helpers carry full type hints, but this public function omits the
tensortype and the-> Nonereturn. As per coding guidelines, always annotate functions with return types (useNoneif no return). Use aTYPE_CHECKINGimport to avoid pullingtorchat module load.✏️ Proposed annotations
import ctypes import mmap +from typing import TYPE_CHECKING + +if TYPE_CHECKING: + import torch-def advise_tensor_pageout(tensor, mode: str = "dontneed"): +def advise_tensor_pageout(tensor: "torch.Tensor", mode: str = "dontneed") -> None:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/mmap_utils.py` at line 102, Add full type annotations to the public helper `advise_tensor_pageout`: annotate the `tensor` parameter with the appropriate Torch tensor type and explicitly declare the return type as `None`. Because this module should avoid importing `torch` at runtime, use a `TYPE_CHECKING` guarded import for the tensor type and update the `advise_tensor_pageout` signature to match the other typed helpers.Source: Coding guidelines
62-72: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win
libcis re-dlopened on everymadvise_rangecall.
advise_tensor_pageoutis invoked per-tensor (e.g. theloaded_shared_weightsloop inmoe_load_balancer.finalize_layer_weights), so a freshctypes.CDLLis created on each weight. Cache it lazily so the lib is opened once while preserving the current Linux-only lazy-failure behavior (a module-levelCDLLwould break import on non-Linux hosts).♻️ Lazy libc cache
+_LIBC = None + + +def _get_libc() -> "ctypes.CDLL": + global _LIBC + if _LIBC is None: + _LIBC = ctypes.CDLL("libc.so.6", use_errno=True) + return _LIBC + + def madvise_range(addr: int, size: int, mode: str = "dontneed") -> None:if size <= 0: return try: advice = _MADV_ADVICE_BY_MODE[mode] except KeyError: raise ValueError("mode must be 'pageout' or 'dontneed'.") - libc = ctypes.CDLL("libc.so.6", use_errno=True) - ret = libc.madvise(ctypes.c_void_p(addr), ctypes.c_size_t(size), ctypes.c_int(advice)) + ret = _get_libc().madvise(ctypes.c_void_p(addr), ctypes.c_size_t(size), ctypes.c_int(advice))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/mmap_utils.py` around lines 62 - 72, The `madvise_range` helper in `tensorrt_llm/_torch/mmap_utils.py` re-opens `libc.so.6` on every call, which is expensive in per-tensor paths like `advise_tensor_pageout` and `moe_load_balancer.finalize_layer_weights`. Update the function to use a lazily initialized module-level cache for the `ctypes.CDLL` handle so `libc` is loaded once and reused across calls, while keeping the current Linux-only lazy failure behavior intact. Preserve the existing `madvise_range` and `_MADV_ADVICE_BY_MODE` flow and avoid moving the `CDLL` creation to import time.tensorrt_llm/_torch/models/checkpoints/base_weight_loader.py (1)
99-114: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winBest-effort page-out should not be able to skip key deletion.
The whole page-out block (imports,
torch.cuda.synchronize(), the loop) lives in onetryguarded only byexcept ImportError. A non-ImportErrorfromtorch.cuda.synchronize()(e.g.RuntimeErrorwhen CUDA is unavailable/uninitialized) propagates out ofmark_consumed, skipping thedel self._weights[key]loop at Lines 115-116 — so weights are neither paged out nor freed and the caller sees an exception. As per coding guidelines, keep thetrybody minimal and move the logic into anelseblock, and ensure the optimization can't break the core deletion.♻️ Restructure so deletion always runs
if keys_to_delete: try: import torch - from tensorrt_llm._torch.mmap_utils import \ advise_tensor_pageout - torch.cuda.synchronize() - for k in keys_to_delete: - t = self._weights[k] - if isinstance(t, - torch.Tensor) and t.device.type == "cpu": - try: - advise_tensor_pageout(t, mode="dontneed") - except (OSError, ValueError): - pass except ImportError: pass + else: + try: + torch.cuda.synchronize() + for k in keys_to_delete: + t = self._weights[k] + if isinstance( + t, torch.Tensor) and t.device.type == "cpu": + try: + advise_tensor_pageout(t, mode="dontneed") + except (OSError, ValueError): + pass + except RuntimeError: + # CUDA unavailable: skip best-effort pageout only. + pass🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/models/checkpoints/base_weight_loader.py` around lines 99 - 114, The page-out optimization in mark_consumed currently wraps imports, torch.cuda.synchronize(), and the tensor loop in one try, so a non-ImportError can abort before key deletion runs. Keep the try block limited to the torch import path in base_weight_loader.py, move the synchronize/page-out loop into the corresponding else block (or otherwise isolate it), and make sure the final del self._weights[key] cleanup always executes even if page-out fails.Source: Coding guidelines
cpp/tensorrt_llm/kernels/megaMoePrepareKernel.h (1)
28-44: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDocument the new public header API with Doxygen.
This adds two public enums and a public kernel entry point, but the header does not describe the contract yet. Please add
//!docs here, including the expected tensor shapes/dtypes for the tensor-like arguments, so callers do not need to reverse-engineer the.cu/.cppfiles.As per coding guidelines,
**/*.{h,hpp,hxx}:Follow Doxygen rules for documenting class interfaces and function prototypes; use //! for C++ style comments, and public Tensor-like arguments should document expected dimensions and allowed dtypes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/kernels/megaMoePrepareKernel.h` around lines 28 - 44, The public API in MegaMoePrepareExpertType, MegaMoePrepareScaleType, and invokeMegaMoePrepare lacks the required Doxygen contract. Add //! documentation for each enum and the kernel entry point, and describe the expected tensor-like argument shapes/dtypes for input, tokenSelectedExperts, tokenFinalScales, xOut, xSfOut, topkIdxOut, and topkWeightsOut so users can call invokeMegaMoePrepare without inspecting implementation details. Use the existing symbols in megaMoePrepareKernel.h and follow the C++ header documentation style required for public interfaces.Source: Coding guidelines
tests/unittest/_torch/modules/moe/test_moe_scheduler.py (1)
126-237: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueCoverage is adequate for the changed scheduler logic; one optional follow-up. The four tests exercise routing separation, TRTLLMGen kwarg gating, and both fused-prepare branches, and the
data_ptr()assertions correctly validate the single-chunk no-copy path. The multi-chunk concatenation path (len(outputs) > 1→torch.cat) isn't directly asserted — consider adding a case withmoe_max_num_tokenssmall enough to forcenum_chunks > 1so the concat path is covered. Also confirm this new file is discovered by the relevant CI unit-test list.As per path instructions for
tests/**: assess whether coverage is sufficient and suggest concrete follow-up.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/modules/moe/test_moe_scheduler.py` around lines 126 - 237, Add a follow-up test for `FusedCommMoEScheduler.forward`/`_forward_chunk` that forces `num_chunks > 1` so the `torch.cat` multi-chunk path is executed and asserted, since current coverage only checks the single-chunk no-copy path. Reuse the existing `_FakeMoe`, `_RecordingFusedBackend`, and `FusedCommMoEScheduler` helpers, but choose inputs or `moe_max_num_tokens` so multiple outputs are produced and concatenated. Also verify the new test module is picked up by the relevant CI/unit-test discovery for `tests/**`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp`:
- Around line 72-78: Add explicit device guarding and same-device validation in
megaMoePrepareOp before launching the kernel. CHECK_TH_CUDA only confirms CUDA
tensors, so update the launch path around megaMoePrepareOp and its use of
at::cuda::getCurrentCUDAStream(input.get_device()) by introducing a
c10::cuda::CUDAGuard for input.device() and verifying tokenSelectedExperts,
tokenFinalScales, xOut, xSfOut, topkIdxOut, and topkWeightsOut all match
input.device(). Reject mixed-device inputs/outputs before the kernel launch to
prevent running on the wrong GPU or passing cross-device pointers.
In
`@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_act_fusion.py`:
- Around line 405-406: The swiglu_limit handling in
blockscaled_contiguous_gather_grouped_gemm_act_fusion should normalize negative
values before setting the clamp flag, because fclip_xorsign only works as
symmetric clipping for positive limits. Update the initialization around
swiglu_limit and has_swiglu_limit so negative sentinel values are treated like
disabling the limit, consistent with the custom-op sentinel behavior, and only
enable clipping when the effective limit is positive.
In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py`:
- Around line 346-350: The swiglu_limit_scalar allowlist in create_moe.py is
missing CuteDslB12xFusedMoE, which now receives that argument through the shared
CuteDSL constructor path. Update the moe_cls validation list in the
swiglu_limit_scalar check to include CuteDslB12xFusedMoE alongside
CuteDslFusedMoE and the other supported classes so B12x requests with scalar
clamp pass validation and reach the backend.
In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py`:
- Around line 66-70: Canonicalize negative swiglu_limit_scalar values in
CuteDslFusedMoE so they behave like disabled clamping instead of producing an
invalid clamp range in swiglu_fused_moe. Update the code path that stores/passes
this field in CuteDslFusedMoE to normalize any negative value to float("inf")
before it reaches swiglu_fused_moe, keeping the behavior aligned with the
custom-op contract.
In `@tests/unittest/_torch/thop/parallel/test_fp8_quantize.py`:
- Around line 237-301: Add negative coverage in test_megamoe_prepare_* and
_run_megamoe_prepare so the public megalMoePrepare op’s validation paths are
exercised, not just the happy path. Specifically, add cases in
tests/unittest/_torch/thop/parallel/test_fp8_quantize.py for unsupported
input/output dtypes, mismatched [num_tokens, top_k] shapes, hidden_size not
divisible by 128, and undersized x_out/x_sf_out buffers. Use the existing test
helpers and torch.ops.trtllm.megamoe_prepare entrypoint to keep the checks
aligned with cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp.
---
Nitpick comments:
In `@cpp/tensorrt_llm/kernels/megaMoePrepareKernel.h`:
- Around line 28-44: The public API in MegaMoePrepareExpertType,
MegaMoePrepareScaleType, and invokeMegaMoePrepare lacks the required Doxygen
contract. Add //! documentation for each enum and the kernel entry point, and
describe the expected tensor-like argument shapes/dtypes for input,
tokenSelectedExperts, tokenFinalScales, xOut, xSfOut, topkIdxOut, and
topkWeightsOut so users can call invokeMegaMoePrepare without inspecting
implementation details. Use the existing symbols in megaMoePrepareKernel.h and
follow the C++ header documentation style required for public interfaces.
In `@tensorrt_llm/_torch/mmap_utils.py`:
- Line 102: Add full type annotations to the public helper
`advise_tensor_pageout`: annotate the `tensor` parameter with the appropriate
Torch tensor type and explicitly declare the return type as `None`. Because this
module should avoid importing `torch` at runtime, use a `TYPE_CHECKING` guarded
import for the tensor type and update the `advise_tensor_pageout` signature to
match the other typed helpers.
- Around line 62-72: The `madvise_range` helper in
`tensorrt_llm/_torch/mmap_utils.py` re-opens `libc.so.6` on every call, which is
expensive in per-tensor paths like `advise_tensor_pageout` and
`moe_load_balancer.finalize_layer_weights`. Update the function to use a lazily
initialized module-level cache for the `ctypes.CDLL` handle so `libc` is loaded
once and reused across calls, while keeping the current Linux-only lazy failure
behavior intact. Preserve the existing `madvise_range` and
`_MADV_ADVICE_BY_MODE` flow and avoid moving the `CDLL` creation to import time.
In `@tensorrt_llm/_torch/models/checkpoints/base_weight_loader.py`:
- Around line 99-114: The page-out optimization in mark_consumed currently wraps
imports, torch.cuda.synchronize(), and the tensor loop in one try, so a
non-ImportError can abort before key deletion runs. Keep the try block limited
to the torch import path in base_weight_loader.py, move the synchronize/page-out
loop into the corresponding else block (or otherwise isolate it), and make sure
the final del self._weights[key] cleanup always executes even if page-out fails.
In `@tests/unittest/_torch/modules/moe/test_moe_scheduler.py`:
- Around line 126-237: Add a follow-up test for
`FusedCommMoEScheduler.forward`/`_forward_chunk` that forces `num_chunks > 1` so
the `torch.cat` multi-chunk path is executed and asserted, since current
coverage only checks the single-chunk no-copy path. Reuse the existing
`_FakeMoe`, `_RecordingFusedBackend`, and `FusedCommMoEScheduler` helpers, but
choose inputs or `moe_max_num_tokens` so multiple outputs are produced and
concatenated. Also verify the new test module is picked up by the relevant
CI/unit-test discovery for `tests/**`.
In `@tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py`:
- Line 621: Coverage for the new SwiGLU optional parameter is missing the
no-clamp/default path, so update the SwiGLU test cases in test_cute_dsl_moe to
parameterize swiglu_limit over both float("inf") and 1.0, and make sure the
relevant op-path assertions run with the omitted-argument/default behavior as
well. Use the existing SwiGLU test helpers and parameterized cases in
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py to cover both
enabled-clamp and disabled-clamp scenarios.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cf04400e-17c8-4b74-b996-121cfebaf1f9
📒 Files selected for processing (30)
3rdparty/fetch_content.jsoncpp/tensorrt_llm/kernels/megaMoePrepareKernel.cucpp/tensorrt_llm/kernels/megaMoePrepareKernel.hcpp/tensorrt_llm/thop/CMakeLists.txtcpp/tensorrt_llm/thop/allreduceOp.cppcpp/tensorrt_llm/thop/megaMoePrepareOp.cppcpp/tensorrt_llm/thop/moeOp.cpptensorrt_llm/_torch/custom_ops/cpp_custom_ops.pytensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.pytensorrt_llm/_torch/custom_ops/torch_custom_ops.pytensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_act_fusion.pytensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.pytensorrt_llm/_torch/cute_dsl_kernels/blackwell/utils.pytensorrt_llm/_torch/mmap_utils.pytensorrt_llm/_torch/models/checkpoints/base_weight_loader.pytensorrt_llm/_torch/modules/fused_moe/create_moe.pytensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.pytensorrt_llm/_torch/modules/fused_moe/mega_moe/mega_moe_deepgemm.pytensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.pytensorrt_llm/_torch/modules/fused_moe/moe_scheduler.pytensorrt_llm/_torch/modules/fused_moe/ops/moe_op_cutlass.pytensorrt_llm/_torch/modules/fused_moe/quantization.pytensorrt_llm/_torch/pyexecutor/model_engine.pytensorrt_llm/_torch/pyexecutor/model_loader.pytests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_gather_grouped_gemm_act_fusion.pytests/scripts/cute_dsl_kernels/run_blockscaled_contiguous_grouped_gemm_swiglu_fusion.pytests/unittest/_torch/modules/moe/test_moe_backend.pytests/unittest/_torch/modules/moe/test_moe_scheduler.pytests/unittest/_torch/thop/parallel/test_cute_dsl_moe.pytests/unittest/_torch/thop/parallel/test_fp8_quantize.py
| CHECK_TH_CUDA(input); | ||
| CHECK_TH_CUDA(tokenSelectedExperts); | ||
| CHECK_TH_CUDA(tokenFinalScales); | ||
| CHECK_TH_CUDA(xOut); | ||
| CHECK_TH_CUDA(xSfOut); | ||
| CHECK_TH_CUDA(topkIdxOut); | ||
| CHECK_TH_CUDA(topkWeightsOut); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== File outline ==\n'
ast-grep outline cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp --view expanded || true
printf '\n== Relevant lines ==\n'
nl -ba cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp | sed -n '1,220p'
printf '\n== Search for similar CUDA device handling ==\n'
rg -n "CUDAGuard|CHECK_TH_CUDA|getCurrentCUDAStream|device_guard|deviceGuard|device\(\)" cpp/tensorrt_llm/thop -g '*.cpp' -g '*.h'Repository: NVIDIA/TensorRT-LLM
Length of output: 296
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== File outline ==\n'
ast-grep outline cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp --view expanded || true
printf '\n== Relevant lines ==\n'
nl -ba cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp | sed -n '1,220p'
printf '\n== Search for similar CUDA device handling ==\n'
rg -n "CUDAGuard|CHECK_TH_CUDA|getCurrentCUDAStream|device_guard|deviceGuard|device\(\)" cpp/tensorrt_llm/thop -g '*.cpp' -g '*.h'Repository: NVIDIA/TensorRT-LLM
Length of output: 296
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== File outline ==\n'
ast-grep outline cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp --view expanded || true
printf '\n== Relevant lines ==\n'
nl -ba cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp | sed -n '1,220p'
printf '\n== Search for similar CUDA device handling ==\n'
rg -n "CUDAGuard|CHECK_TH_CUDA|getCurrentCUDAStream|device_guard|deviceGuard|device\(\)" cpp/tensorrt_llm/thop -g '*.cpp' -g '*.h'Repository: NVIDIA/TensorRT-LLM
Length of output: 296
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== File outline ==\n'
ast-grep outline cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp --view expanded || true
printf '\n== Relevant lines ==\n'
nl -ba cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp | sed -n '1,220p'
printf '\n== Search for similar CUDA device handling ==\n'
rg -n "CUDAGuard|CHECK_TH_CUDA|getCurrentCUDAStream|device_guard|deviceGuard|device\(\)" cpp/tensorrt_llm/thop -g '*.cpp' -g '*.h'Repository: NVIDIA/TensorRT-LLM
Length of output: 296
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== File outline ==\n'
ast-grep outline cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp --view expanded || true
printf '\n== Relevant lines ==\n'
nl -ba cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp | sed -n '1,220p'
printf '\n== Search for similar CUDA device handling ==\n'
rg -n "CUDAGuard|CHECK_TH_CUDA|getCurrentCUDAStream|device_guard|deviceGuard|device\(\)" cpp/tensorrt_llm/thop -g '*.cpp' -g '*.h'Repository: NVIDIA/TensorRT-LLM
Length of output: 296
🏁 Script executed:
#!/bin/bash
set -euo pipefail
nl -ba cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp | sed -n '1,220p'
rg -n "CUDAGuard|CHECK_TH_CUDA|getCurrentCUDAStream|device_guard|deviceGuard|device\(\)" cpp/tensorrt_llm/thop -g '*.cpp' -g '*.h'Repository: NVIDIA/TensorRT-LLM
Length of output: 197
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "hello"Repository: NVIDIA/TensorRT-LLM
Length of output: 163
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '== tracked paths matching megaMoePrepare ==\n'
git ls-files | rg 'megaMoePrepare|MegaMoePrepare|megaMoe|MegaMoe' || true
printf '\n== tracked paths under cpp/tensorrt_llm/thop ==\n'
git ls-files cpp/tensorrt_llm/thop || true
printf '\n== candidate file contents (if present) ==\n'
for f in $(git ls-files | rg 'megaMoePrepare|MegaMoePrepare|megaMoe|MegaMoe' || true); do
printf '\n--- %s ---\n' "$f"
sed -n '1,220p' "$f"
doneRepository: NVIDIA/TensorRT-LLM
Length of output: 19222
Guard the launch device and require same-device tensors. CHECK_TH_CUDA only proves CUDA placement; it does not ensure every tensor matches input.device(). Add a c10::cuda::CUDAGuard before at::cuda::getCurrentCUDAStream(input.get_device()) and reject mixed-device inputs/outputs, or this can launch on the wrong device and pass cross-device pointers. cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp:72-130
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp` around lines 72 - 78, Add
explicit device guarding and same-device validation in megaMoePrepareOp before
launching the kernel. CHECK_TH_CUDA only confirms CUDA tensors, so update the
launch path around megaMoePrepareOp and its use of
at::cuda::getCurrentCUDAStream(input.get_device()) by introducing a
c10::cuda::CUDAGuard for input.device() and verifying tokenSelectedExperts,
tokenFinalScales, xOut, xSfOut, topkIdxOut, and topkWeightsOut all match
input.device(). Reject mixed-device inputs/outputs before the kernel launch to
prevent running on the wrong GPU or passing cross-device pointers.
| self.swiglu_limit = swiglu_limit | ||
| self.has_swiglu_limit = swiglu_limit != float("inf") |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Normalize negative swiglu_limit before enabling clamp.
fclip_xorsign only behaves like symmetric clipping when limit > 0; storing a negative value here enables clamp and changes the sign semantics instead of disabling it. Mirror the custom-op sentinel handling before deriving has_swiglu_limit.
Proposed fix
- self.swiglu_limit = swiglu_limit
+ if swiglu_limit < 0:
+ swiglu_limit = float("inf")
+ self.swiglu_limit = swiglu_limit
self.has_swiglu_limit = swiglu_limit != float("inf")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| self.swiglu_limit = swiglu_limit | |
| self.has_swiglu_limit = swiglu_limit != float("inf") | |
| if swiglu_limit < 0: | |
| swiglu_limit = float("inf") | |
| self.swiglu_limit = swiglu_limit | |
| self.has_swiglu_limit = swiglu_limit != float("inf") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_act_fusion.py`
around lines 405 - 406, The swiglu_limit handling in
blockscaled_contiguous_gather_grouped_gemm_act_fusion should normalize negative
values before setting the clamp flag, because fclip_xorsign only works as
symmetric clipping for positive limits. Update the initialization around
swiglu_limit and has_swiglu_limit so negative sentinel values are treated like
disabling the limit, consistent with the custom-op sentinel behavior, and only
enable clipping when the effective limit is positive.
| if swiglu_limit_scalar is not None: | ||
| assert moe_cls in [ | ||
| CutlassFusedMoE, TRTLLMGenFusedMoE, WideEPMoE, DeepGemmFusedMoE, | ||
| MegaMoEDeepGemm | ||
| MegaMoEDeepGemm, CuteDslFusedMoE | ||
| ], f"swiglu_limit_scalar is not supported in {moe_cls.__name__}." |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Add CuteDslB12xFusedMoE to the scalar clamp allowlist.
The allowlist here omits CuteDslB12xFusedMoE, but Line 447 now forwards swiglu_limit_scalar through the shared CuteDSL constructor path. Any B12x backend request with a scalar clamp will fail validation before it reaches the backend.
Suggested fix
if swiglu_limit_scalar is not None:
assert moe_cls in [
CutlassFusedMoE, TRTLLMGenFusedMoE, WideEPMoE, DeepGemmFusedMoE,
- MegaMoEDeepGemm, CuteDslFusedMoE
+ MegaMoEDeepGemm, CuteDslFusedMoE, CuteDslB12xFusedMoE
], f"swiglu_limit_scalar is not supported in {moe_cls.__name__}."Also applies to: 447-447
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py` around lines 346 - 350,
The swiglu_limit_scalar allowlist in create_moe.py is missing
CuteDslB12xFusedMoE, which now receives that argument through the shared CuteDSL
constructor path. Update the moe_cls validation list in the swiglu_limit_scalar
check to include CuteDslB12xFusedMoE alongside CuteDslFusedMoE and the other
supported classes so B12x requests with scalar clamp pass validation and reach
the backend.
| def swiglu_fused_moe(x, swiglu_limit_scalar: float = float("inf")): | ||
| x, gate = x.chunk(2, dim=-1) | ||
| if swiglu_limit_scalar != float("inf"): | ||
| gate = gate.clamp(max=swiglu_limit_scalar) | ||
| x = x.clamp(min=-swiglu_limit_scalar, max=swiglu_limit_scalar) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== File list =="
git ls-files tensorrt_llm/_torch/modules/fused_moe | sed -n '1,120p'
echo
echo "== Outline: fused_moe_cute_dsl.py =="
ast-grep outline tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py --view expanded || true
echo
echo "== Relevant references =="
rg -n "swiglu_limit_scalar|swiglu_fused_moe|canonicalize" tensorrt_llm/_torch/modules/fused_moe -SRepository: NVIDIA/TensorRT-LLM
Length of output: 10284
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== fused_moe_cute_dsl.py relevant sections =="
sed -n '430,470p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
echo
sed -n '760,780p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
echo
echo "== interface.py relevant section =="
sed -n '270,320p' tensorrt_llm/_torch/modules/fused_moe/interface.py
echo
echo "== create_moe.py relevant section =="
sed -n '330,360p' tensorrt_llm/_torch/modules/fused_moe/create_moe.py
echo
echo "== trtllm_gen relevant section =="
sed -n '460,490p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 9216
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Search for negative/sentinel handling =="
rg -n "swiglu_limit_scalar.*< 0|< 0.*swiglu_limit_scalar|float\\(\"inf\"\\)|disabled sentinel|canonicalize" tensorrt_llm/_torch/modules/fused_moe tensorrt_llm -S
echo
echo "== swiglu_fused_moe consumer / runner lines =="
sed -n '720,780p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
echo
echo "== Any tests mentioning swiglu_limit_scalar =="
rg -n "swiglu_limit_scalar" tests tensorrt_llm -SRepository: NVIDIA/TensorRT-LLM
Length of output: 24654
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Search for negative/sentinel handling =="
rg -n "swiglu_limit_scalar.*< 0|< 0.*swiglu_limit_scalar|float\\(\"inf\"\\)|disabled sentinel|canonicalize" tensorrt_llm/_torch/modules/fused_moe tensorrt_llm -S
echo
echo "== swiglu_fused_moe consumer / runner lines =="
sed -n '720,780p' tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
echo
echo "== Any tests mentioning swiglu_limit_scalar =="
rg -n "swiglu_limit_scalar" tests tensorrt_llm -SRepository: NVIDIA/TensorRT-LLM
Length of output: 24654
🏁 Script executed:
python3 - <<'PY'
from math import inf
def current_behavior(swiglu_limit_scalar):
# mirrors tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
if swiglu_limit_scalar != inf:
gate_clamp = swiglu_limit_scalar
x_min = -swiglu_limit_scalar
x_max = swiglu_limit_scalar
return gate_clamp, x_min, x_max
return None
for v in [None, inf, -1.0, 0.0, 3.5]:
if v is None:
v = inf
print(v, current_behavior(v))
PYRepository: NVIDIA/TensorRT-LLM
Length of output: 240
Canonicalize negative swiglu_limit_scalar in CuteDslFusedMoE.
swiglu_fused_moe() treats any non-inf value as an active clamp, so -1.0 becomes an inverted clamp range instead of disabling the path. Normalize negative values to inf when storing the module field, matching the custom-op contract.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py` around lines 66
- 70, Canonicalize negative swiglu_limit_scalar values in CuteDslFusedMoE so
they behave like disabled clamping instead of producing an invalid clamp range
in swiglu_fused_moe. Update the code path that stores/passes this field in
CuteDslFusedMoE to normalize any negative value to float("inf") before it
reaches swiglu_fused_moe, keeping the behavior aligned with the custom-op
contract.
| @pytest.mark.parametrize("m", [1, 7, 128]) | ||
| @pytest.mark.parametrize("k", [128, 512]) | ||
| @pytest.mark.parametrize("top_k", [1, 6]) | ||
| @skip_pre_blackwell_unittest | ||
| def test_megamoe_prepare_matches_mxfp8_quantize(m, k, top_k): | ||
| torch.random.manual_seed(123) | ||
| hidden_states = (torch.randn([m, k], dtype=torch.float) * 16).to( | ||
| torch.bfloat16).cuda().contiguous() | ||
| token_selected_experts = torch.randint(0, | ||
| 384, (m, top_k), | ||
| dtype=torch.int32, | ||
| device="cuda") | ||
| token_final_scales = torch.randn((m, top_k), | ||
| dtype=torch.float32, | ||
| device="cuda") | ||
|
|
||
| _run_megamoe_prepare(hidden_states, token_selected_experts, | ||
| token_final_scales) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("expert_dtype", [torch.int32, torch.int64]) | ||
| @pytest.mark.parametrize("scale_dtype", | ||
| [torch.float32, torch.float16, torch.bfloat16]) | ||
| @skip_pre_blackwell_unittest | ||
| def test_megamoe_prepare_accepts_supported_topk_dtypes(expert_dtype, | ||
| scale_dtype): | ||
| torch.random.manual_seed(123) | ||
| m, k, top_k = 5, 256, 3 | ||
| hidden_states = (torch.randn([m, k], dtype=torch.float) * 16).to( | ||
| torch.bfloat16).cuda().contiguous() | ||
| token_selected_experts = torch.randint(0, | ||
| 384, (m, top_k), | ||
| dtype=expert_dtype, | ||
| device="cuda") | ||
| token_final_scales = torch.randn((m, top_k), | ||
| dtype=scale_dtype, | ||
| device="cuda") | ||
|
|
||
| _run_megamoe_prepare(hidden_states, token_selected_experts, | ||
| token_final_scales) | ||
|
|
||
|
|
||
| @skip_pre_blackwell_unittest | ||
| def test_megamoe_prepare_allows_zero_tokens(): | ||
| k, top_k = 256, 3 | ||
| hidden_states = torch.empty((0, k), dtype=torch.bfloat16, device="cuda") | ||
| token_selected_experts = torch.empty((0, top_k), | ||
| dtype=torch.int32, | ||
| device="cuda") | ||
| token_final_scales = torch.empty((0, top_k), | ||
| dtype=torch.float32, | ||
| device="cuda") | ||
| x_out = torch.empty((1, k), dtype=torch.float8_e4m3fn, device="cuda") | ||
| x_sf_out = torch.empty((1, k // 128), dtype=torch.int32, device="cuda") | ||
| topk_idx_out = torch.empty((1, top_k), dtype=torch.int64, device="cuda") | ||
| topk_weights_out = torch.empty((1, top_k), | ||
| dtype=torch.float32, | ||
| device="cuda") | ||
|
|
||
| torch.ops.trtllm.megamoe_prepare(hidden_states, token_selected_experts, | ||
| token_final_scales, x_out, x_sf_out, | ||
| topk_idx_out, topk_weights_out) | ||
| torch.cuda.synchronize() | ||
|
|
||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Coverage is still insufficient for the public op's validation contract.
These tests cover the happy path, dtype matrix, and zero-token fast path, but cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp also rejects bad dtypes, mismatched shapes, undersized output buffers, and hidden_size % 128 != 0. None of those guards is exercised here, so a regression in the public API contract would ship unnoticed.
Please add negative cases in tests/unittest/_torch/thop/parallel/test_fp8_quantize.py for at least:
- unsupported
input/ output dtypes, - mismatched
[num_tokens, top_k]shapes, hidden_sizenot divisible by 128,- undersized
x_out/x_sf_out.
As per path instructions, tests/**: Act as a QA engineer reviewing test changes and coverage for TensorRT-LLM. Keep feedback actionable: suggest concrete list file names and whether coverage is sufficient, insufficient, or needs follow-up outside the PR.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unittest/_torch/thop/parallel/test_fp8_quantize.py` around lines 237 -
301, Add negative coverage in test_megamoe_prepare_* and _run_megamoe_prepare so
the public megalMoePrepare op’s validation paths are exercised, not just the
happy path. Specifically, add cases in
tests/unittest/_torch/thop/parallel/test_fp8_quantize.py for unsupported
input/output dtypes, mismatched [num_tokens, top_k] shapes, hidden_size not
divisible by 128, and undersized x_out/x_sf_out buffers. Use the existing test
helpers and torch.ops.trtllm.megamoe_prepare entrypoint to keep the checks
aligned with cpp/tensorrt_llm/thop/megaMoePrepareOp.cpp.
Source: Path instructions
Summary by CodeRabbit
New Features
Bug Fixes
Description
This PR carves the PR-5 DeepGEMM/MegaMoE follow-up scope from
github/feat/deepseek_v4aftera7728aaonto currentgithub/main.It includes:
It intentionally excludes NVFP4 MegaMoE backend/tuning commits
aad9e0f008andf90246b3d8because equivalent material is already onmain, and it excludes themodeling_deepseekv4.pyhunks because that file is not present on currentmainand belongs to the PR-3 DSv4 model/sparse-attention stack.Rebase/conflict notes are recorded in
tmp/dsv4_final_prs/PR5_rebase_conflicts_20260625.md.Test Coverage
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --use_ccache --cuda_architectures "90-real;100-real" --configure_cmake.venv-3.12/bin/pip install --force-reinstall --no-deps build/tensorrt_llm-1.3.0rc20-cp312-cp312-linux_x86_64.whllibth_common.socontains the PR-5 C++ tactic replacement string.git diff --checkpre-commit run --files ...passed on touched source/test files.6 passed, 3 warningsPATH="$PWD/.venv-3.12/bin:$PATH" CUDA_VISIBLE_DEVICES=1 timeout 12m .venv-3.12/bin/python -m pytest tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py tests/unittest/_torch/thop/parallel/test_fp8_quantize.py -q --tb=short934 passed, 9 skipped, 245 warningsFull
tests/unittest/_torch/modules/moecaveat:CUDA_LAUNCH_BLOCKING=1,-x, and a 15 minute timeout did not reproduce a first failure before timeout.PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.