[TRTLLM-13629][test] Optimize MoE CI test-db #15624

Open
xxi-nv wants to merge 1 commit into NVIDIA:main from xxi-nv:user/xxi/moe-ci-testdb-opt

xxi-nv (Collaborator) commented Jun 25, 2026

Summary

Rebalances and trims the MoE-related entries across the per-GPU L0 test-db lists to cut redundant pre-merge cost while preserving (and in places broadening) real-world coverage. Test-list only; no source/runtime changes.

  • Stage rebalance (pre -> post-merge): full per-backend matrices (test_moe_backend -k), unquantized (quant=None) and per-tensor FP8 (QDQ) variants of test_moe_module, bf16 4gpus, online/static EPLB, bf16_trtllm_gen, moe_comm_boundary; a representative is kept in pre.
  • Dead-entry removal (zero coverage loss): GPT-OSS test_w4_4gpus triton on Blackwell (Hopper-only) and all moe_backend=WIDEEP cases.
  • Equivalent-platform dedup (GB200 == B200, SM100): drop B200-side duplicate TestDeepSeekV3Lite 4gpus cases, keep GB200.
  • Backend diversification: DeepSeekV3Lite nvfp4_4gpus and the dgx_b200 post set spread moe_backend across CUTLASS/TRTLLM/CUTEDSL per the CutlassFusedMoE SM support table; CUTLASS FP8 buckets split by SM (block-scale in pre, per-tensor QDQ to post).

Test plan

  • pre-commit test-list validators pass (entry-existence AST check, waives duplicate check, DCO).
  • All 9 modified YAML files parse; no intra-block duplicate entries.
  • Refined -k buckets verified against CutlassFusedMoE._QUANT_SUPPORT_TABLE to target only SM-supported quants.
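The "no intra-block duplicate entries" check in the test plan can be illustrated with a small sketch. This is not the repository's pre-commit validator; it assumes the YAML has already been parsed into a list of blocks, each carrying a `tests` list, and the flattened `stage` key and the sample entries are hypothetical.

```python
from collections import Counter


def find_intra_block_duplicates(blocks):
    """Return {block_index: [repeated entries]} for blocks whose 'tests' list repeats an entry."""
    report = {}
    for index, block in enumerate(blocks):
        # Count each entry after trimming whitespace so trailing spaces do not hide repeats.
        counts = Counter(entry.strip() for entry in block.get("tests", []))
        repeated = sorted(entry for entry, n in counts.items() if n > 1)
        if repeated:
            report[index] = repeated
    return report


# Hypothetical data shaped like a parsed test-db file (structure is an assumption).
blocks = [
    {
        "stage": "pre_merge",
        "tests": [
            'test_moe_backend.py::test_moe_backend -k "CUTLASS"',
            'test_moe_backend.py::test_moe_backend -k "CUTLASS"',
        ],
    },
    {
        "stage": "post_merge",
        "tests": [
            'test_moe_backend.py::test_moe_backend -k "CUTLASS"',
            'test_moe_module.py::test_configurable_moe_multi_gpu_eplb -k "None"',
        ],
    },
]

duplicates = find_intra_block_duplicates(blocks)
print(duplicates)
```

Only repeats within a single block are reported; the same entry appearing once in a pre-merge block and once in a post-merge block is not flagged, which matches the "intra-block" wording of the test plan.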

Summary by CodeRabbit

  • Tests
    • Refined several integration test matrices across GPU environments.
    • Moved some MoE-related checks to post-merge and narrowed pre-merge coverage for faster, more targeted validation.
    • Updated multi-GPU coverage for DeepSeekV3Lite, GPTOSS, and NemotronV3Super, including adjusted quantization and cache variants.
    • Removed a few legacy or redundant MoE test selections while keeping key backend coverage.

…, quant/backend refinement

- Move full backend matrices and less-used variants to post-merge (test_moe_backend -k, test_moe_module unquantized and FP8 per-tensor QDQ, bf16 4gpus, online/static eplb, bf16_trtllm_gen, moe_comm_boundary).
- Remove always-skip dead entries: GPT-OSS test_w4_4gpus triton on Blackwell (Hopper-only), and all moe_backend=WIDEEP cases.
- Dedup equivalent platforms (GB200 == B200, SM100): drop B200-side duplicate TestDeepSeekV3Lite 4gpus cases, keep GB200.
- Diversify moe_backend coverage for DeepSeekV3Lite nvfp4_4gpus across CUTLASS/TRTLLM/CUTEDSL (per CutlassFusedMoE SM support table).
- Refine CUTLASS quant buckets by SM support: split FP8 block-scale (keep pre) vs per-tensor QDQ (move post); delete test_fused_moe.py legacy entries.

Signed-off-by: xxi <xxi@nvidia.com>
coderabbitai (Bot, Contributor) commented Jun 25, 2026

Walkthrough

Integration test lists are updated to move MoE coverage between pre_merge and post_merge stages, narrow CUTLASS/FP8 selectors, and revise DeepSeekV3Lite, GPTOSS, NemotronV3Super, MoEComm, and legacy fused MoE entries across multiple GPU-specific matrices.

Changes

MoE integration test matrix updates

| Layer | File(s) | Summary |
|---|---|---|
| H100 stage split | tests/integration/test_lists/test-db/l0_h100.yml, tests/integration/test_lists/test-db/l0_dgx_h100.yml | pre_merge MoE entries are narrowed and split from post_merge entries, with CUTLASS and FP8 block-scale cases separated from the remaining CUTLASS variants. |
| B200/B300 backend relocation | tests/integration/test_lists/test-db/l0_b200.yml, tests/integration/test_lists/test-db/l0_b300.yml | Legacy MoE backend entries are removed from pre_merge sections and re-added in post_merge blocks with updated selectors. |
| DGX B200 selector edits | tests/integration/test_lists/test-db/l0_dgx_b200.yml | Nemotron, DeepSeekV3Lite, GPTOSS, and MoE module entries are trimmed and reselected across the pre_merge and post_merge test lists. |
| DGX B300 and GB200 matrix edits | tests/integration/test_lists/test-db/l0_dgx_b300.yml, tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml | Post_merge DeepSeekV3Lite and NemotronV3Super entries are added, GPTOSS coverage is adjusted, and legacy MoE/MoEComm entries are removed or relocated. |
| GB300 and RTX cleanup | tests/integration/test_lists/test-db/l0_gb300_multi_gpus.yml, tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml | The legacy fused MoE module test is removed from GB300, DeepSeekV3Lite online_eplb variants are narrowed, and commented-out WIDEEP entries are dropped from RTX Pro 6000. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • liji-nv
  • yuxianq
  • hyukn
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description covers the summary and testing intent, but it only loosely follows the template and omits an explicit Test Coverage section. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title is concise and accurately reflects the PR’s main change: optimizing the MoE CI test-db. |

coderabbitai (Bot, Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/test_lists/test-db/l0_b300.yml`:
- Around line 77-80: The l0_b300 test list is unintentionally broader because
the existing `test_moe_backend.py::test_moe_backend -k "CUTEDSL"` and `-k
"DEEPGEMM"` filters also match `MEGAMOE_CUTEDSL` and `MEGAMOE_DEEPGEMM`. Update
this block to keep the MegaMoE variants split out consistently with the other
list, either by adding explicit `MEGAMOE_*` entries in the
`test_moe_backend.py::test_moe_backend` set or by narrowing the `-k` filters so
they do not capture them.
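
The over-matching described in this comment follows from `-k` selecting by case-insensitive substring. A minimal sketch of that behaviour, assuming hypothetical parametrized test IDs (the real IDs in `test_moe_backend.py` may differ) and a simplified stand-in for pytest's matcher rather than pytest itself:

```python
import re


def matches_k(expr: str, test_id: str) -> bool:
    """Approximate pytest -k: bare words are case-insensitive substring checks,
    combined with the keywords and / or / not and parentheses."""
    haystack = test_id.lower()
    parts = []
    for token in re.findall(r"\(|\)|[^\s()]+", expr):
        if token in ("and", "or", "not", "(", ")"):
            parts.append(token)
        else:
            parts.append(str(token.lower() in haystack))
    # parts now holds only True/False, boolean keywords, and parentheses.
    return bool(eval(" ".join(parts), {"__builtins__": {}}))


def select(expr, test_ids):
    """Return the test IDs that the expression would select, in order."""
    return [tid for tid in test_ids if matches_k(expr, tid)]


# Hypothetical IDs, for illustration only.
ids = [
    "test_moe_backend[CUTEDSL-NVFP4]",
    "test_moe_backend[MEGAMOE_CUTEDSL-NVFP4]",
    "test_moe_backend[DEEPGEMM-FP8_BLOCK_SCALES]",
    "test_moe_backend[MEGAMOE_DEEPGEMM-FP8_BLOCK_SCALES]",
]

broad = select("CUTEDSL", ids)
narrow = select("CUTEDSL and not MEGAMOE", ids)
print(broad)
print(narrow)
```

Under these assumptions, a bare `-k "CUTEDSL"` also picks up the `MEGAMOE_CUTEDSL` case, while `-k "CUTEDSL and not MEGAMOE"` keeps only the plain backend, which is one way to narrow the filter as the comment suggests.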

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c83e3041-aa9b-43c3-b649-6d6756d746b8

📥 Commits

Reviewing files that changed from the base of the PR and between a02214a and dcd1f47.

📒 Files selected for processing (9)
  • tests/integration/test_lists/test-db/l0_b200.yml
  • tests/integration/test_lists/test-db/l0_b300.yml
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
  • tests/integration/test_lists/test-db/l0_dgx_b300.yml
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
  • tests/integration/test_lists/test-db/l0_gb300_multi_gpus.yml
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml
💤 Files with no reviewable changes (2)
  • tests/integration/test_lists/test-db/l0_rtx_pro_6000.yml
  • tests/integration/test_lists/test-db/l0_gb300_multi_gpus.yml

Comment thread tests/integration/test_lists/test-db/l0_b300.yml
xxi-nv changed the title from "[None][test] Optimize MoE CI test-db: stage rebalance, platform dedup, quant/backend refinement" to "[TRTLLM-13629][test] Optimize MoE CI test-db" on Jun 25, 2026
xxi-nv requested a review from QiJune on June 25, 2026 10:42
orchestrator: mpi
tests:
# ---- non-quantized (quant=None) moved to post-merge ----
- unittest/_torch/modules/moe/test_moe_module.py::test_configurable_moe_multi_gpu_eplb -k "None"

QiJune (Collaborator), Jun 25, 2026

You need to define a new post-merge stage, like: https://github.com/NVIDIA/TensorRT-LLM/blob/main/jenkins/L0_Test.groovy#L4469.

It seems only 2 cases will be moved to post-merge, so maybe we can revert the changes here to avoid defining a new post-merge stage.
