[TRTLLM-13629][test] Optimize MoE CI test-db #15624
…, quant/backend refinement
- Move full backend matrices and less-used variants to post-merge (test_moe_backend -k, test_moe_module unquantized and FP8 per-tensor QDQ, bf16 4gpus, online/static eplb, bf16_trtllm_gen, moe_comm_boundary).
- Remove always-skip dead entries: GPT-OSS test_w4_4gpus triton on Blackwell (Hopper-only), and all moe_backend=WIDEEP cases.
- Dedup equivalent platforms (GB200 == B200, SM100): drop B200-side duplicate TestDeepSeekV3Lite 4gpus cases, keep GB200.
- Diversify moe_backend coverage for DeepSeekV3Lite nvfp4_4gpus across CUTLASS/TRTLLM/CUTEDSL (per CutlassFusedMoE SM support table).
- Refine CUTLASS quant buckets by SM support: split FP8 block-scale (keep pre) vs per-tensor QDQ (move post); delete test_fused_moe.py legacy entries.

Signed-off-by: xxi <xxi@nvidia.com>
📝 Walkthrough
Integration test lists are updated to move MoE coverage between pre_merge and post_merge stages, narrow CUTLASS/FP8 selectors, and revise DeepSeekV3Lite, GPTOSS, NemotronV3Super, MoEComm, and legacy fused MoE entries across multiple GPU-specific matrices.
Changes
MoE integration test matrix updates
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 5 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/integration/test_lists/test-db/l0_b300.yml`:
- Around line 77-80: The l0_b300 test list is unintentionally broader because
the existing `test_moe_backend.py::test_moe_backend -k "CUTEDSL"` and `-k
"DEEPGEMM"` filters also match `MEGAMOE_CUTEDSL` and `MEGAMOE_DEEPGEMM`. Update
this block to keep the MegaMoE variants split out consistently with the other
list, either by adding explicit `MEGAMOE_*` entries in the
`test_moe_backend.py::test_moe_backend` set or by narrowing the `-k` filters so
they do not capture them.
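
The over-matching described above follows from how pytest's `-k` option treats a bare keyword: as a case-insensitive substring of the test name and its parametrization ID. The sketch below is a simplified model of that rule, not pytest's own implementation, and the test IDs are hypothetical stand-ins for the real parametrized IDs in `test_moe_backend.py`. It shows why `-k "CUTEDSL"` also selects a `MEGAMOE_CUTEDSL` case, and how an expression such as `-k "CUTEDSL and not MEGAMOE"` would narrow the selection.

```python
def k_matches(keyword: str, test_id: str) -> bool:
    """Simplified model of a single bare -k keyword:
    a case-insensitive substring match against the test ID."""
    return keyword.lower() in test_id.lower()


# Hypothetical test IDs, used only for illustration.
test_ids = [
    "test_moe_backend[CUTLASS]",
    "test_moe_backend[CUTEDSL]",
    "test_moe_backend[MEGAMOE_CUTEDSL]",
    "test_moe_backend[DEEPGEMM]",
    "test_moe_backend[MEGAMOE_DEEPGEMM]",
]

# Models `-k "CUTEDSL"`: the MegaMoE variant is captured as well.
broad = [t for t in test_ids if k_matches("CUTEDSL", t)]

# Models `-k "CUTEDSL and not MEGAMOE"`: the MegaMoE variant is excluded.
narrow = [
    t for t in test_ids
    if k_matches("CUTEDSL", t) and not k_matches("MEGAMOE", t)
]

print(broad)
print(narrow)
```

Whether narrowing the filter or listing explicit `MEGAMOE_*` entries is preferable depends on how the other test lists are written; the comment above asks only that the two lists stay consistent.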
📒 Files selected for processing (9)
- tests/integration/test_lists/test-db/l0_b200.yml
- tests/integration/test_lists/test-db/l0_b300.yml
- tests/integration/test_lists/test-db/l0_dgx_b200.yml
- tests/integration/test_lists/test-db/l0_dgx_b300.yml
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
- tests/integration/test_lists/test-db/l0_gb300_multi_gpus.yml
- tests/integration/test_lists/test-db/l0_h100.yml
- tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml
💤 Files with no reviewable changes (2)
- tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml
- tests/integration/test_lists/test-db/l0_gb300_multi_gpus.yml
orchestrator: mpi
tests:
# ---- non-quantized (quant=None) moved to post-merge ----
- unittest/_torch/modules/moe/test_moe_module.py::test_configurable_moe_multi_gpu_eplb -k "None"
You need to define a new post-merge stage, like the one at https://github.com/NVIDIA/TensorRT-LLM/blob/main/jenkins/L0_Test.groovy#L4469.
It seems only 2 cases will be moved to post-merge, so maybe we can revert the changes here to avoid defining a new post-merge stage.
Summary
Rebalances and trims the MoE-related entries across the per-GPU L0 test-db lists to cut redundant pre-merge cost while preserving (and in places broadening) real-world coverage. Test-list only; no source/runtime changes.
Test plan
Summary by CodeRabbit