feat(moe): add MoE inference and expert parallel support#444
Conversation
- add reusable MoE router, dispatcher, runner, and expert abstractions - enable Qwen3 MoE fused inference with TP-local expert parallel routing - add graph-safe MoE workspace handling and EP backend selection through engine config - preserve legacy MoE path for existing DeepSeek V2 code
| struct CompiledResult { | ||
| InfinilmModel::Input input; | ||
| Compiled compiled; | ||
| std::shared_ptr<InfinilmModel::Output> replay_output; |
There was a problem hiding this comment.
这个新增的replay_output变量,以及graph编译时新增和修改的代码。可以注释或解释一下么,不知道啥意思
There was a problem hiding this comment.
已补充注释。这里的 replay_output 是 graph capture 时为输出保留的普通 Output handle;compiled 里保存的是 GraphTensor/graph 对象,replay 后需要通过这个 handle 拿回模型输出。这样 get_compiled 时可以直接返回可复用的 graph replay 结果。
这个不影响static 推理,已测试
| throw std::runtime_error(" Model object not found. "); | ||
| } | ||
| return workers_.front()->state_dict_keys(); | ||
| std::vector<std::string> keys; |
There was a problem hiding this comment.
这个写法,我看了好一会才看懂。
其是等价于下面的写法。先求set, 最后再赋值给key_vec.
`
std::unordered_setstd::string keys;
for (const auto& worker : workers_) {
const auto& worker_keys = worker->state_dict_keys();
keys.insert(worker_keys.begin(), worker_keys.end());
}
std::vectorstd::string keys_vec(keys.begin(), keys.end());
return keys_vec;
`
There was a problem hiding this comment.
这里改的核心目的:合并多 rank/worker 的 state_dict_keys() 时去重,但保留稳定顺序。
背景是 InferEngine 里有多个 RankWorker,比如 TP=2 时,每个 worker 都有一份模型结构,因此 worker->state_dict_keys() 里很多 key 是重复的。如果 InferEngine::state_dict_keys() 只是把所有 worker 的 key 拼起来,Python 侧 check_parameters() 虽然最后会转 set,但返回列表会非常冗余,也不利于定位 missing/unexpected key。
| } else if (local_cmd == Command::LOAD_BATCH) { | ||
| try { | ||
| model_->load_parameters_no_sync(local_params); | ||
| model_->load_parameters_no_sync(local_params, local_params_strict); |
There was a problem hiding this comment.
等价于这个写法么 model_->load_parameters_no_sync(local_params, strict);
There was a problem hiding this comment.
是的,现在等价于直接调用 model_->load_parameters_no_sync(local_params, local_params_strict)。这里需要把 strict 继续传下去,否则 Python 侧传入的 non-strict load 对 MoE packed weight 不生效。
| self.parser.add_argument("--model", type=str, required=True) | ||
| self.parser.add_argument("--device", type=str, default="cpu") | ||
| self.parser.add_argument("--tp", "--tensor-parallel-size", type=int, default=1) | ||
| self.parser.add_argument("--dp", "--data-parallel-size", type=int, default=1) |
There was a problem hiding this comment.
这个dp只有python中被使用,不会传递给c++么
There was a problem hiding this comment.
测试命令可以用:
CUDA_VISIBLE_DEVICES=2,3 python examples/bench.py
--device=nvidia
--model=xxxxx
--enable-paged-attn
--attn=flash-attn
--tp=2
--ep=2
--moe-ep-backend=local_allreduce
--input-len=16,1024
--output-len=1024
--batch-size=1
--enable-graph
dp现在是预留口,我们没有支持dp功能,但是dp会跟moe有执行层的绑定,所以先留出来了
| ): | ||
| self.hf_config = read_hf_config(model_path) | ||
| self.hf_generation_config = read_hf_generation_config(model_path) | ||
| self.hf_config["moe_ep_backend"] = moe_ep_backend |
There was a problem hiding this comment.
moe_ep_backend和moe_ep_size,怎么能放进hf_config中。
hf_config对应c++中model_config的config_json变量,内容只是 config.json中的信息。
| } | ||
|
|
||
|
|
||
| def _is_internal_moe_packed_weight(key: str) -> bool: |
There was a problem hiding this comment.
Qwen3-235B-A22B中有这个权重么
There was a problem hiding this comment.
w13和w2是moe的一种写法,本质上是gateup 和 down
| if backend not in { | ||
| "disabled", |
There was a problem hiding this comment.
已把 backend 说明抽到统一 help 文案里。disabled 表示不启用 EP;local_allreduce 表示复用 TP group 做 EP,当前用于 TP=EP 的场景;allgather_reducescatter 是后续 DP/EP 路由方案;deepep 是 DeepEP dispatcher。auto 会根据 EP/DP 配置选择默认 backend。
| return "moe" in model_type or "num_experts" in config | ||
|
|
||
|
|
||
| def configure_moe_ep_backend( |
There was a problem hiding this comment.
configure_moe_ep_backend , _is_moe_model, _normalize_moe_ep_backend 这几个函数是重复的
| @@ -386,7 +390,8 @@ def state_dict_keyname(self): | |||
|
|
|||
| def load_state_dict(self, state_dict, strict=None): | |||
There was a problem hiding this comment.
添加了strict参数,但感觉这个pr好像没有被使用到。
There was a problem hiding this comment.
strict 参数现在用于 load_state_dict -> load_params -> C++ worker 的批量参数加载链路。MoE/quantized 路径会注册内部 packed 参数,这些参数不直接存在于 HF checkpoint 中,因此需要 non-strict loading 支持。
| namespace infinilm::models::qwen3_moe { | ||
|
|
||
| class Qwen3MoeSparseMoeBlock : public infinicore::nn::Module { | ||
| class Qwen3MoeSparseMoeBlock final : public infinilm::layers::moe::SparseMoeBlock { |
There was a problem hiding this comment.
继承后貌似啥也没干。 这里直接 using Qwen3MoeSparseMoeBlock = public infinilm::layers::moe::SparseMoeBlock 可以么。
There was a problem hiding this comment.
这里暂时保留 adapter class。原因是 TextDecoderLayer 构造 MLP block 使用的是 (config, layer_idx, device),而通用 SparseMoeBlock 的构造顺序是 (config, device, layer_idx)。using alias 不能适配构造函数签名,所以这里保留一个很薄的 Qwen3Moe adapter,后续统一构造函数签名后可以删掉。
Summary
csrc/layers/moe.SparseMoeBlock,TopKRouter,FusedMoeExperts, andFusedMoErunner.local_allreduceandallgather_reducescatter.deepepbackend interface for future integration.MoeMLPintocsrc/layers/moe/legacyand keep DeepSeek-V2 on the legacy path.num_key_value_heads < tp_size.Motivation
Closes #
InfiniLM needs a reusable MoE inference path that can support Qwen3-MoE models and provide a clear abstraction boundary for future high-performance EP backends such as DeepEP.
The current implementation focuses on correctness and data-flow alignment first:
local_allreduceas the preferred current path.allgather_reducescatteris available as a correctness-oriented backend.Type of Change
feat— new feature / new modelrefactor— code restructuring without behavior changeperf— performance improvement (no behavioral change)fix— bug fixtest— adding or fixing tests onlydocs— documentation onlybuild/ci— build system or CI configurationchore— tooling, formatting, or other non-code changesTest Results of Involved Models on Supported Platforms (Please attach screenshots)
Please attach screenshots for the final tested commands.
Suggested coverage:
local_allreduceallgather_reducescatterlocal_allreduceBenchmark / Performance Impact
Initial measured examples on A100:
local_allreduce, graph enabled:local_allreduce, graph enabled:This PR does not claim final high-performance MoE EP parity with vLLM/SGLang. It establishes the correct abstraction and execution path for later DeepEP/fused MoE work.
Notes for Reviewers
local_allreduceis the recommended current EP backend for DP=1.allgather_reducescatteris correctness-oriented and expected to be slower.deepepis intentionally a placeholder interface.prepare_moe_input-style CUTLASS grouped GEMM flow is not used by the current InfiniLM MoE runner.layers/moe/legacyand is not migrated to the new fused Qwen3-MoE path.MoE EP backend: disabled.Checklist
Title, Branch, and Commits
<type>/xxx-yyyy-zzzz.main.fixup!/squash!/wipcommits remain.Scope and Design
C++ Specific
scripts/format.py.Python Specific
scripts/format.py.Testing