feat(moe): add MoE inference and expert parallel support by qinyiqun · Pull Request #444 · InfiniTensor/InfiniLM

qinyiqun · 2026-06-18T02:17:45Z

Summary

Add a generic MoE layer stack under csrc/layers/moe.
Route Qwen3-MoE through the generic SparseMoeBlock, TopKRouter, FusedMoeExperts, and FusedMoE runner.
Add MoE EP dispatchers for local_allreduce and allgather_reducescatter.
Add a reserved deepep backend interface for future integration.
Move the old per-expert MoeMLP into csrc/layers/moe/legacy and keep DeepSeek-V2 on the legacy path.
Pass MoE EP config through Python args and model config instead of bench-owned environment variables.
Optimize rank-local safetensors loading for EP expert weights.
Support Qwen3/Qwen3Next GQA cases where num_key_value_heads < tp_size.
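
The last item covers GQA configurations with fewer KV heads than tensor-parallel ranks, for example 4 KV heads with tp_size=8. The PR text does not say how the heads are distributed in that case. One common approach, assumed here, is to replicate each KV head across a group of consecutive ranks. A minimal sketch, where `kv_heads_for_rank` is a hypothetical helper and not a function from this PR:

```python
def kv_heads_for_rank(num_kv_heads: int, tp_size: int, rank: int) -> list[int]:
    """Return the KV head indices one TP rank would hold (hypothetical helper)."""
    if not 0 <= rank < tp_size:
        raise ValueError("rank out of range")
    if num_kv_heads >= tp_size:
        # Usual case: the heads are split evenly across the ranks.
        if num_kv_heads % tp_size != 0:
            raise ValueError("num_kv_heads must be divisible by tp_size")
        per_rank = num_kv_heads // tp_size
        return list(range(rank * per_rank, (rank + 1) * per_rank))
    # Fewer KV heads than ranks (assumed scheme): every head is replicated
    # on tp_size // num_kv_heads consecutive ranks.
    if tp_size % num_kv_heads != 0:
        raise ValueError("tp_size must be divisible by num_kv_heads")
    replicas = tp_size // num_kv_heads
    return [rank // replicas]
```

Under this assumption, ranks 0 and 1 both hold KV head 0 when num_kv_heads=4 and tp_size=8, so each rank still has a complete KV head for its slice of query heads.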

Motivation

Closes #

InfiniLM needs a reusable MoE inference path that can support Qwen3-MoE models and provide a clear abstraction boundary for future high-performance EP backends such as DeepEP.

The current implementation focuses on correctness and data-flow alignment first:

TP-only MoE works through the standard dispatcher.
DP=1 EP uses local_allreduce as the preferred current path.
allgather_reducescatter is available as a correctness-oriented backend.
DeepEP is explicitly reserved but not implemented in this PR.
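
The data flow behind local_allreduce can be illustrated with a toy, single-process simulation. It assumes (the PR does not spell this out) that every rank sees all tokens, owns a contiguous slice of the experts, computes only the contributions of its own experts, and that the partial outputs are then summed across ranks. All function names are hypothetical, and the "expert" is a scalar multiply standing in for a real expert MLP:

```python
def expert_forward(expert_id: int, x: float) -> float:
    # Toy expert: scales its input; stands in for a real expert MLP.
    return (expert_id + 1) * x


def local_experts(rank: int, ep_size: int, num_experts: int) -> range:
    # Assumed layout: each rank owns a contiguous block of experts.
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def rank_partial(rank, ep_size, num_experts, tokens, topk_ids, topk_weights):
    # Each rank only accumulates contributions from experts it owns.
    owned = local_experts(rank, ep_size, num_experts)
    out = [0.0] * len(tokens)
    for t, x in enumerate(tokens):
        for e, w in zip(topk_ids[t], topk_weights[t]):
            if e in owned:
                out[t] += w * expert_forward(e, x)
    return out


def allreduce_sum(partials):
    # Stand-in for a collective allreduce: elementwise sum over ranks.
    return [sum(vals) for vals in zip(*partials)]


def moe_local_allreduce(ep_size, num_experts, tokens, topk_ids, topk_weights):
    partials = [
        rank_partial(r, ep_size, num_experts, tokens, topk_ids, topk_weights)
        for r in range(ep_size)
    ]
    return allreduce_sum(partials)
```

With `ep_size=1` the same function computes the unsharded reference, which gives a simple correctness check for the sharded result.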

Type of Change

feat — new feature / new model
refactor — code restructuring without behavior change
perf — performance improvement (no behavioral change)
fix — bug fix
test — adding or fixing tests only
docs — documentation only
build / ci — build system or CI configuration
chore — tooling, formatting, or other non-code changes
Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Please attach screenshots for the final tested commands.

Suggested coverage:

Qwen3-30B-A3B, TP=1, EP disabled
Qwen3-30B-A3B, TP=2, EP=2, local_allreduce
Qwen3-30B-A3B, TP=2, EP=2, allgather_reducescatter
Qwen3-235B-A22B, TP=8, EP=8, local_allreduce
Qwen3-8B-base non-MoE regression, TP=2, graph enabled
DeepSeek-V2-Lite loading/regression for legacy MoE path if applicable

Benchmark / Performance Impact

Initial measured examples on A100:

Qwen3-30B-A3B, TP=2/EP=2, local_allreduce, graph enabled:
- Prefill and decode are functional.
- Decode performance is currently limited by MoE communication and temporary fused MoE kernel quality.
Qwen3-235B-A22B, TP=8/EP=8, local_allreduce, graph enabled:
- Model loading and decode are functional.
- Nsys shows decode is dominated by communication, especially allreduce-heavy paths.

This PR does not claim final high-performance MoE EP parity with vLLM/SGLang. It establishes the correct abstraction and execution path for later DeepEP/fused MoE work.

Notes for Reviewers

local_allreduce is the recommended current EP backend for DP=1.
allgather_reducescatter is correctness-oriented and expected to be slower.
deepep is intentionally a placeholder interface.
prepare_moe_input-style CUTLASS grouped GEMM flow is not used by the current InfiniLM MoE runner.
DeepSeek-V2 remains on layers/moe/legacy and is not migrated to the new fused Qwen3-MoE path.
Non-MoE models should show MoE EP backend: disabled.

Checklist

Title, Branch, and Commits

PR title follows Conventional Commits.
Branch name follows <type>/xxx-yyyy-zzzz.
Each commit message follows Conventional Commits.
Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable.
No stray merge commits from main.
No fixup! / squash! / wip commits remain.
Existing PR/branch/commit that followed the legacy issue format.

Scope and Design

Changes are scoped to MoE inference, EP config/loading, and required model compatibility.
No debug prints or temporary MoE logs are left behind.
Public API changes are intentional and reflected in Python/C++ callers.

C++ Specific

Changed files are formatted by scripts/format.py.
Project builds cleanly on NVIDIA.

Python Specific

Changed files are formatted by scripts/format.py.

Testing

Passed single request test, or reason for skipping is documented.
Passed offline performance test, or reason for skipping is documented.
Passed sanity test, or reason for skipping is documented.
Passed service test, or reason for skipping is documented.

- add reusable MoE router, dispatcher, runner, and expert abstractions
- enable Qwen3 MoE fused inference with TP-local expert parallel routing
- add graph-safe MoE workspace handling and EP backend selection through engine config
- preserve legacy MoE path for existing DeepSeek V2 code

pengcheng888 · 2026-06-26T06:16:24Z

    struct CompiledResult {
        InfinilmModel::Input input;
        Compiled compiled;
+        std::shared_ptr<InfinilmModel::Output> replay_output;


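Could you add comments or an explanation for the new replay_output variable and for the code added and modified in graph compilation? I can't tell what it is for.

Comments added. replay_output is a plain Output handle kept for the outputs at graph capture time. compiled holds the GraphTensor/graph objects, so after replay the model outputs have to be fetched through this handle. This lets get_compiled return a reusable graph replay result directly.

This does not affect static inference; tested.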

pengcheng888 · 2026-06-26T06:24:54Z

        throw std::runtime_error(" Model object not found. ");
    }
-    return workers_.front()->state_dict_keys();
+    std::vector<std::string> keys;


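It took me a while to understand this.
It is equivalent to the version below: build a set first, then copy it into keys_vec.

    // Collect the keys of every worker into a set to drop duplicates.
    std::unordered_set<std::string> keys;
    for (const auto& worker : workers_) {
        const auto& worker_keys = worker->state_dict_keys();
        keys.insert(worker_keys.begin(), worker_keys.end());
    }

    // Copy the deduplicated keys into the returned vector.
    std::vector<std::string> keys_vec(keys.begin(), keys.end());
    return keys_vec;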

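The core purpose of this change is to deduplicate keys when merging state_dict_keys() from multiple ranks/workers while keeping a stable order.
Background: InferEngine holds multiple RankWorkers. With TP=2, for example, every worker has its own copy of the model structure, so many keys in worker->state_dict_keys() are duplicated. If InferEngine::state_dict_keys() simply concatenated the keys of all workers, the returned list would be highly redundant and would make missing/unexpected keys harder to locate, even though the Python-side check_parameters() converts it to a set in the end.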
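
The "deduplicate but keep a stable order" behavior can be sketched in Python. This is an illustration of the idea only, not the C++ code from the PR, and `merge_state_dict_keys` is a hypothetical name. It relies on the fact that Python dicts preserve insertion order:

```python
def merge_state_dict_keys(per_worker_keys: list[list[str]]) -> list[str]:
    """Merge key lists from several workers, keeping first-seen order."""
    merged: dict[str, None] = {}
    for keys in per_worker_keys:
        # Keys already present keep their original position.
        merged.update(dict.fromkeys(keys))
    return list(merged)
```

An `std::unordered_set`-based merge also removes duplicates, but its iteration order is unspecified, so it would not give the stable order described above.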

pengcheng888 · 2026-06-26T06:27:17Z

            } else if (local_cmd == Command::LOAD_BATCH) {
                try {
-                    model_->load_parameters_no_sync(local_params);
+                    model_->load_parameters_no_sync(local_params, local_params_strict);


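Is this equivalent to model_->load_parameters_no_sync(local_params, strict);?

Yes, it is now equivalent to calling model_->load_parameters_no_sync(local_params, local_params_strict) directly. strict has to be passed down here; otherwise a non-strict load requested from the Python side would not take effect for MoE packed weights.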

pengcheng888 · 2026-06-26T07:18:17Z

        self.parser.add_argument("--model", type=str, required=True)
        self.parser.add_argument("--device", type=str, default="cpu")
        self.parser.add_argument("--tp", "--tensor-parallel-size", type=int, default=1)
+        self.parser.add_argument("--dp", "--data-parallel-size", type=int, default=1)


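Please provide a test command.

Is dp only used in Python, and never passed to C++?

The following test command can be used:

    CUDA_VISIBLE_DEVICES=2,3 python examples/bench.py \
        --device=nvidia \
        --model=xxxxx \
        --enable-paged-attn \
        --attn=flash-attn \
        --tp=2 \
        --ep=2 \
        --moe-ep-backend=local_allreduce \
        --input-len=16,1024 \
        --output-len=1024 \
        --batch-size=1 \
        --enable-graph

dp is currently a reserved hook. DP is not supported yet, but DP will be tied to MoE at the execution layer, so the option is exposed ahead of time.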

pengcheng888 · 2026-06-26T07:22:50Z

    ):
        self.hf_config = read_hf_config(model_path)
        self.hf_generation_config = read_hf_generation_config(model_path)
+        self.hf_config["moe_ep_backend"] = moe_ep_backend


Why are moe_ep_backend and moe_ep_size being put into hf_config?

hf_config corresponds to the config_json variable of model_config in C++, and it should contain only the information from config.json.

Fixed.

pengcheng888 · 2026-06-26T09:18:41Z

 }


+def _is_internal_moe_packed_weight(key: str) -> bool:


Does Qwen3-235B-A22B have this weight?

w13 and w2 are one MoE naming convention; they are essentially the gate_up and down projections.
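
To make the naming concrete: in the convention used by several fused MoE implementations, w1 is the gate projection, w3 is the up projection, and w2 is the down projection, so w13 is the gate and up weights stored together. The exact memory layout used by this PR is not shown in the thread; the sketch below assumes gate rows followed by up rows, and `pack_expert_weights` is a hypothetical helper using plain nested lists:

```python
Matrix = list[list[float]]


def pack_expert_weights(
    gate_proj: Matrix, up_proj: Matrix, down_proj: Matrix
) -> tuple[Matrix, Matrix]:
    """Pack one expert's HF-style weights into (w13, w2) (assumed layout)."""
    if len(gate_proj) != len(up_proj):
        raise ValueError("gate_proj and up_proj must have the same row count")
    # w13: gate rows first, then up rows (the ordering is an assumption).
    w13 = [row[:] for row in gate_proj] + [row[:] for row in up_proj]
    # w2: the down projection, copied unchanged.
    w2 = [row[:] for row in down_proj]
    return w13, w2
```

Because such packed tensors are built at load time, they have no identically named entry in the HF checkpoint, which is why a key filter like _is_internal_moe_packed_weight is useful.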

pengcheng888 · 2026-06-26T09:23:55Z

+    if backend not in {
+        "disabled",


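Could you explain these four backends and how to use them?

The backend descriptions have been moved into a shared help text. disabled means EP is off. local_allreduce reuses the TP group for EP and is currently used when TP=EP. allgather_reducescatter is the routing scheme intended for future DP/EP setups. deepep is the DeepEP dispatcher. auto selects a default backend based on the EP/DP configuration.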
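
A backend resolver along these lines could look as follows. The concrete rule for auto is not given in the thread; the sketch assumes that EP size 1 means disabled, DP=1 picks local_allreduce, and anything else picks allgather_reducescatter. `resolve_moe_ep_backend` is a hypothetical name, not the PR's function:

```python
VALID_BACKENDS = {
    "auto",
    "disabled",
    "local_allreduce",
    "allgather_reducescatter",
    "deepep",
}


def resolve_moe_ep_backend(backend: str, ep_size: int, dp_size: int = 1) -> str:
    """Map a user-supplied backend name to the backend to run (assumed rules)."""
    name = backend.strip().lower()
    if name not in VALID_BACKENDS:
        raise ValueError(f"unknown MoE EP backend: {backend!r}")
    if ep_size <= 1:
        # Without expert parallelism there is nothing to dispatch.
        return "disabled"
    if name == "auto":
        # Assumed default: local_allreduce for DP=1, otherwise the
        # correctness-oriented allgather_reducescatter path.
        return "local_allreduce" if dp_size == 1 else "allgather_reducescatter"
    if name == "deepep":
        # The PR describes deepep as a reserved, unimplemented interface.
        raise NotImplementedError("deepep is reserved but not implemented")
    return name
```

Keeping validation and auto resolution in one function would also address the duplication concern, since every entry point could call the same helper.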

pengcheng888 · 2026-06-26T09:25:34Z

+    return "moe" in model_type or "num_experts" in config
+
+
+def configure_moe_ep_backend(


The functions configure_moe_ep_backend, _is_moe_model, and _normalize_moe_ep_backend are duplicated.

pengcheng888 · 2026-06-26T09:36:51Z

@@ -386,7 +390,8 @@ def state_dict_keyname(self):

    def load_state_dict(self, state_dict, strict=None):


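A strict parameter was added, but it does not seem to be used anywhere in this PR.

The strict parameter is now used along the batched parameter loading chain load_state_dict -> load_params -> C++ worker. The MoE/quantized paths register internal packed parameters that do not exist directly in the HF checkpoint, so non-strict loading support is needed.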
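
Why strict loading fails once internal packed parameters exist can be shown with a small key comparison. This is a generic sketch of strict versus non-strict checking, not the PR's check_parameters(); `compare_state_dict_keys` and the key names in the usage note are hypothetical:

```python
def compare_state_dict_keys(
    model_keys: list[str], checkpoint_keys: list[str], strict: bool = True
) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) keys; raise in strict mode on any mismatch."""
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    # Parameters the model registers but the checkpoint does not provide.
    missing = sorted(model_set - ckpt_set)
    # Checkpoint entries with no matching model parameter.
    unexpected = sorted(ckpt_set - model_set)
    if strict and (missing or unexpected):
        raise KeyError(f"missing={missing}, unexpected={unexpected}")
    return missing, unexpected
```

If the model registers packed keys such as `experts.w13` while the checkpoint only has per-expert keys such as `experts.0.gate_proj.weight`, a strict comparison raises, whereas `strict=False` reports the mismatch and lets loading continue.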

pengcheng888 · 2026-06-26T09:39:14Z

 namespace infinilm::models::qwen3_moe {

-class Qwen3MoeSparseMoeBlock : public infinicore::nn::Module {
+class Qwen3MoeSparseMoeBlock final : public infinilm::layers::moe::SparseMoeBlock {


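The subclass does not seem to do anything after inheriting. Could this simply be using Qwen3MoeSparseMoeBlock = infinilm::layers::moe::SparseMoeBlock?

The adapter class is kept for now. TextDecoderLayer constructs the MLP block with (config, layer_idx, device), whereas the generic SparseMoeBlock constructor takes (config, device, layer_idx). A using alias cannot adapt the constructor signature, so a very thin Qwen3Moe adapter stays here. It can be removed once the constructor signatures are unified.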

qinyiqun requested a review from a team June 18, 2026 02:17

qinyiqun force-pushed the moe branch from adb5ae9 to 4b3058a Compare June 18, 2026 09:30

qinyiqun force-pushed the moe branch from 4b3058a to f2d4861 Compare June 23, 2026 02:20

qinyiqun requested a review from pengcheng888 June 25, 2026 08:17

pengcheng888 reviewed Jun 26, 2026

qinyiqun added 2 commits June 30, 2026 02:48

Address MoE review comments

c52ce88

Fix DeepSeek MLA dense prefill attention

83102c6
