[TRTLLM-13629][test] Optimize MoE CI test-db by xxi-nv · Pull Request #15624 · NVIDIA/TensorRT-LLM

xxi-nv · 2026-06-25T10:14:45Z

Summary

Rebalances and trims the MoE-related entries across the per-GPU L0 test-db lists to cut redundant pre-merge cost while preserving (and in places broadening) real-world coverage. Test-list only; no source/runtime changes.

Stage rebalance (pre -> post-merge): full per-backend matrices (test_moe_backend -k), unquantized (quant=None) and per-tensor FP8 (QDQ) variants of test_moe_module, bf16 4gpus, online/static EPLB, bf16_trtllm_gen, moe_comm_boundary; a representative is kept in pre.
Dead-entry removal (zero coverage loss): GPT-OSS test_w4_4gpus triton on Blackwell (Hopper-only) and all moe_backend=WIDEEP cases.
Equivalent-platform dedup (GB200 == B200, SM100): drop B200-side duplicate TestDeepSeekV3Lite 4gpus cases, keep GB200.
Backend diversification: DeepSeekV3Lite nvfp4_4gpus and the dgx_b200 post set spread moe_backend across CUTLASS/TRTLLM/CUTEDSL per the CutlassFusedMoE SM support table; CUTLASS FP8 buckets split by SM (block-scale in pre, per-tensor QDQ to post).

Test plan

pre-commit test-list validators pass (entry-existence AST check, waives duplicate check, DCO).
All 9 modified YAML files parse; no intra-block duplicate entries.
Refined -k buckets verified against CutlassFusedMoE._QUANT_SUPPORT_TABLE to target only SM-supported quants.
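The "no intra-block duplicate entries" check in the test plan can be sketched as follows. This is a minimal illustration, not the repository's actual pre-commit validator: the function name `intra_block_duplicates` and the block names are hypothetical, and the input is assumed to be already parsed from YAML into a mapping from block name to its list of test entries.

```python
from collections import Counter


def intra_block_duplicates(blocks):
    """Return {block_name: [entries listed more than once in that block]}.

    `blocks` is assumed to map a (hypothetical) block name to its list of
    test entries; duplicates across different blocks are not reported.
    """
    duplicates = {}
    for name, tests in blocks.items():
        repeated = sorted(entry for entry, count in Counter(tests).items() if count > 1)
        if repeated:
            duplicates[name] = repeated
    return duplicates


# Hypothetical block contents; the entry string is taken from the PR diff.
EPLB_NONE = (
    "unittest/_torch/modules/moe/test_moe_module.py"
    '::test_configurable_moe_multi_gpu_eplb -k "None"'
)
blocks = {
    "pre_merge": ["unittest/_torch/modules/moe/test_moe_backend.py"],
    "post_merge": [EPLB_NONE, EPLB_NONE],
}
report = intra_block_duplicates(blocks)
```

The same entry appearing once in a pre-merge block and once in a post-merge block is deliberately not flagged here, since only repeats inside a single block are counted.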

Summary by CodeRabbit

Tests
- Refined several integration test matrices across GPU environments.
- Moved some MoE-related checks to post-merge and narrowed pre-merge coverage for faster, more targeted validation.
- Updated multi-GPU coverage for DeepSeekV3Lite, GPTOSS, and NemotronV3Super, including adjusted quantization and cache variants.
- Removed a few legacy or redundant MoE test selections while keeping key backend coverage.

…, quant/backend refinement

- Move full backend matrices and less-used variants to post-merge (test_moe_backend -k, test_moe_module unquantized and FP8 per-tensor QDQ, bf16 4gpus, online/static eplb, bf16_trtllm_gen, moe_comm_boundary).
- Remove always-skip dead entries: GPT-OSS test_w4_4gpus triton on Blackwell (Hopper-only), and all moe_backend=WIDEEP cases.
- Dedup equivalent platforms (GB200 == B200, SM100): drop B200-side duplicate TestDeepSeekV3Lite 4gpus cases, keep GB200.
- Diversify moe_backend coverage for DeepSeekV3Lite nvfp4_4gpus across CUTLASS/TRTLLM/CUTEDSL (per CutlassFusedMoE SM support table).
- Refine CUTLASS quant buckets by SM support: split FP8 block-scale (keep pre) vs per-tensor QDQ (move post); delete test_fused_moe.py legacy entries.

Signed-off-by: xxi <xxi@nvidia.com>

coderabbitai · 2026-06-25T10:21:16Z

📝 Walkthrough

Integration test lists are updated to move MoE coverage between pre_merge and post_merge stages, narrow CUTLASS/FP8 selectors, and revise DeepSeekV3Lite, GPTOSS, NemotronV3Super, MoEComm, and legacy fused MoE entries across multiple GPU-specific matrices.

Changes

MoE integration test matrix updates

| Layer / File(s) | Summary |
| --- | --- |
| H100 stage split `tests/integration/test_lists/test-db/l0_h100.yml`, `tests/integration/test_lists/test-db/l0_dgx_h100.yml` | `pre_merge` MoE entries are narrowed and split from `post_merge` entries, with CUTLASS and FP8 block-scale cases separated from the remaining CUTLASS variants. |
| B200/B300 backend relocation `tests/integration/test_lists/test-db/l0_b200.yml`, `tests/integration/test_lists/test-db/l0_b300.yml` | Legacy MoE backend entries are removed from `pre_merge` sections and re-added in `post_merge` blocks with updated selectors. |
| DGX B200 selector edits `tests/integration/test_lists/test-db/l0_dgx_b200.yml` | Nemotron, DeepSeekV3Lite, GPTOSS, and MoE module entries are trimmed and reselected across the `pre_merge` and `post_merge` test lists. |
| DGX B300 and GB200 matrix edits `tests/integration/test_lists/test-db/l0_dgx_b300.yml`, `tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml` | `post_merge` DeepSeekV3Lite and NemotronV3Super entries are added, GPTOSS coverage is adjusted, and legacy MoE/MoEComm entries are removed or relocated. |
| GB300 and RTX cleanup `tests/integration/test_lists/test-db/l0_gb300_multi_gpus.yml`, `tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml` | The legacy fused MoE module test is removed from GB300, DeepSeekV3Lite online_eplb variants are narrowed, and commented-out WIDEEP entries are dropped from RTX Pro 6000. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

liji-nv
yuxianq
hyukn

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description covers the summary and testing intent, but it only loosely follows the template and omits an explicit Test Coverage section. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title is concise and accurately reflects the PR’s main change: optimizing the MoE CI test-db. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/test_lists/test-db/l0_b300.yml`:
- Around line 77-80: The l0_b300 test list is unintentionally broader because
the existing `test_moe_backend.py::test_moe_backend -k "CUTEDSL"` and `-k
"DEEPGEMM"` filters also match `MEGAMOE_CUTEDSL` and `MEGAMOE_DEEPGEMM`. Update
this block to keep the MegaMoE variants split out consistently with the other
list, either by adding explicit `MEGAMOE_*` entries in the
`test_moe_backend.py::test_moe_backend` set or by narrowing the `-k` filters so
they do not capture them.
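The over-matching described in this comment follows from how `pytest -k` selects tests: keywords are matched as case-insensitive substrings, so `CUTEDSL` also hits any ID containing `MEGAMOE_CUTEDSL`. Below is a minimal sketch of that behavior. It is a simplified model, not pytest's implementation: the helper `selected` is hypothetical, it looks only at the test ID, and the parametrized IDs are made up for illustration (the real IDs in `test_moe_backend.py` may differ).

```python
def selected(test_ids, include, exclude=()):
    """Simplified model of `pytest -k`: case-insensitive substring matching.

    Keeps IDs that contain `include` and none of the `exclude` terms.
    Real pytest also matches markers and parent names; this sketch does not.
    """
    result = []
    for test_id in test_ids:
        lowered = test_id.lower()
        if include.lower() not in lowered:
            continue
        if any(term.lower() in lowered for term in exclude):
            continue
        result.append(test_id)
    return result


# Hypothetical parametrized test IDs, for illustration only.
TEST_IDS = [
    "test_moe_backend[CUTEDSL-nvfp4]",
    "test_moe_backend[MEGAMOE_CUTEDSL-nvfp4]",
    "test_moe_backend[DEEPGEMM-fp8]",
    "test_moe_backend[MEGAMOE_DEEPGEMM-fp8]",
]

# Models -k "CUTEDSL": the MegaMoE variant is captured as well.
broad = selected(TEST_IDS, "CUTEDSL")
# Models -k "CUTEDSL and not MEGAMOE": only the plain backend remains.
narrow = selected(TEST_IDS, "CUTEDSL", exclude=("MEGAMOE",))
```

In a test list, the narrowed form would correspond to a selector such as `-k "CUTEDSL and not MEGAMOE"`, with the MegaMoE variants then listed as their own explicit entries if they should still run.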

📥 Commits

Reviewing files that changed from the base of the PR and between a02214a and dcd1f47.

📒 Files selected for processing (9)

tests/integration/test_lists/test-db/l0_b200.yml
tests/integration/test_lists/test-db/l0_b300.yml
tests/integration/test_lists/test-db/l0_dgx_b200.yml
tests/integration/test_lists/test-db/l0_dgx_b300.yml
tests/integration/test_lists/test-db/l0_dgx_h100.yml
tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
tests/integration/test_lists/test-db/l0_gb300_multi_gpus.yml
tests/integration/test_lists/test-db/l0_h100.yml
tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml

💤 Files with no reviewable changes (2)

tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml
tests/integration/test_lists/test-db/l0_gb300_multi_gpus.yml

QiJune · 2026-06-25T12:52:28Z

+      orchestrator: mpi
+  tests:
+  # ---- non-quantized (quant=None) moved to post-merge ----
+  - unittest/_torch/modules/moe/test_moe_module.py::test_configurable_moe_multi_gpu_eplb -k "None"


You need to define a new post-merge stage, like the one at https://github.com/NVIDIA/TensorRT-LLM/blob/main/jenkins/L0_Test.groovy#L4469.

It seems only 2 cases would be moved to post-merge, so maybe we can revert the changes here to avoid defining a new post-merge stage.

github-actions Bot assigned xxi-nv Jun 25, 2026

coderabbitai Bot reviewed Jun 25, 2026

Comment thread tests/integration/test_lists/test-db/l0_b300.yml

xxi-nv changed the title ~~[None][test] Optimize MoE CI test-db: stage rebalance, platform dedup, quant/backend refinement~~ [TRTLLM-13629][test] Optimize MoE CI test-db Jun 25, 2026

xxi-nv requested a review from QiJune June 25, 2026 10:42

QiJune reviewed Jun 25, 2026

xxi-nv wants to merge 1 commit into NVIDIA:main from xxi-nv:user/xxi/moe-ci-testdb-opt

2 participants
