Add AutoEP by tohtana · Pull Request #7938 · deepspeedai/DeepSpeed

tohtana · 2026-03-31T00:11:17Z

This PR adds AutoEP (Automatic Expert Parallelism) to DeepSpeed training for HuggingFace MoE models.

AutoEP detects MoE blocks during deepspeed.initialize(), builds the required EP/EDP process groups, and replaces supported MoE blocks with an EP-enabled execution path, so expert parallelism can be enabled with DeepSpeed config only and without model code changes.

Current scope in this PR is the base AutoEP feature:

ZeRO stages 0, 1, and 2 support
checkpoint save/load support
universal checkpoint conversion support

ZeRO-3 extensions are intentionally left as follow-up work (#7928 should be merged for this work)

Supported presets in this PR:

Mixtral
Qwen3-MoE
DeepSeek-V2
DeepSeek-V3
LLaMA-4

For end-to-end benchmarking and testing, an AutoEP example is available in DeepSpeedExamples:

https://github.com/tohtana/DeepSpeedExamples/tree/tohtana/add_auto_ep/training/expert_parallel

Attribution

This implementation substantially builds on TorchTitan's MoE / expert-parallel implementation, and we want to explicitly acknowledge that prior work.

The TorchTitan-derived pieces in this PR are primarily:

deepspeed/moe/ep_router.py: adapted from TorchTitan's TokenChoiceTopKRouter
deepspeed/moe/ep_experts.py: adapted from TorchTitan's GroupedExperts and grouped-GEMM expert execution path
deepspeed/moe/ep_kernels.py: adapted from TorchTitan's TokenReorderer, generate_permute_indices, Triton fill-indices kernel, and token-group alignment / padding helpers
deepspeed/module_inject/auto_ep_layer.py: adapts the same router -> reorder -> dispatch -> local expert compute -> combine structure used in TorchTitan's MoE / EP flow

Relevant TorchTitan sources:

The DeepSpeed-specific work in this PR is the AutoEP integration layer around those building blocks:

HuggingFace MoE detection and structural validation
model-family presets and custom-config path
weight repacking from HF expert layouts into grouped expert tensors
DeepSpeed runtime group setup and module replacement
DeepSpeed checkpoint save/load and universal checkpoint support
DeepSpeed docs and tests

Design

The implementation is split into a few layers:

deepspeed/module_inject/auto_ep_config.py
- user config parsing
- built-in model presets
- validation for EP topology and per-model constraints
deepspeed/module_inject/auto_ep.py
- scans the model for MoE blocks
- validates the detected structure
- builds a MoELayerSpec for each supported MoE layer
- replaces the original HF block with AutoEPMoELayer
deepspeed/module_inject/auto_ep_layer.py
- the drop-in execution wrapper for a detected MoE block
- implements router execution, token reorder, EP dispatch/combine, local expert compute, and shared-expert merge
deepspeed/moe/ep_router.py, deepspeed/moe/ep_experts.py, deepspeed/moe/ep_kernels.py
- reusable MoE runtime pieces for routing, grouped expert compute, token permutation, and aligned grouped-GEMM execution
deepspeed/moe/ep_repack.py
- converts HF expert weights into the grouped expert layout expected by the runtime
deepspeed/runtime/engine.py and checkpoint conversion code
- wires AutoEP into deepspeed.initialize()
- handles checkpoint save/load metadata and universal checkpoint integration

At runtime, the execution path is:

detect and replace supported HF MoE blocks during initialization
route tokens with the EP router
reorder tokens by expert assignment
perform all-to-all dispatch across the EP group when autoep_size > 1
run local grouped expert compute
all-to-all combine and restore the original token order
merge shared experts if the model has them

Adding new model support

There are two supported ways to extend AutoEP to a new MoE model family.

Add a preset in PRESET_MODELS.
This is the preferred path for a model family we want to support out of the box. A preset defines:

MoE layer pattern
router child name
experts child name
expert weight names / layout
num_experts and top_k config attributes
routing defaults
optional shared-expert structure

Use the custom config path.
For models that are not yet built into DeepSpeed, AutoEP can be driven from config with:

moe_layer_pattern
router_pattern
expert_pattern
expert_w1, expert_w2, expert_w3
num_experts_attr
top_k_attr
optional shared-expert fields

Once detection can produce a valid MoELayerSpec, the replacement, execution, and checkpoint paths are shared.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana · 2026-03-31T00:36:37Z

This feature is still experimental. The next steps are:

ZeRO3 integration ([Feature] Enable AutoEP Compatibility with ZeRO-3 #7928)
Add gpt-oss to preset (Requested by @jiosephlee)

We welcome help testing and validating this on large-scale models.

jiosephlee · 2026-03-31T05:16:16Z

@tohtana I wish I could be of help, but I haven't written code on this level; if you could clarify on what you mean by a preset for gpt-oss, or if there are other first-issues kind of work I could help with, I would gladly look into it

tohtana · 2026-03-31T05:55:26Z

Hi @jiosephlee,
Thank you for offering help! It would be great if you can try gpt-oss once it is implemented with this AutoEP work.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

nathon-lee · 2026-04-08T11:14:08Z

Hi @tohtana, I noticed this PR is in draft now.
I saw your note that the feature is still experimental and that there are a few next steps pending.
Is there anything I can help with — testing, validation, or follow-up changes?

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> (cherry picked from commit cc45af3)

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana · 2026-04-11T01:59:52Z

Hi @nathon-lee,
Thank you for the kind offer! Yes, I'm still working on verification. It would be great if you could help me run more models. Current, we have these presets:

Mixtral
Qwen3-MoE
DeepSeek-V2
DeepSeek-V3
LLaMA-4

I've been using Mixtral for verification but haven't checked other models enough. If you can try and share the feedback, it would be super helpful. Reducing the number of layers would be fine.

nathon-lee · 2026-04-11T04:25:48Z

ok, @tohtana

delock · 2026-04-13T10:02:49Z

Hi @tohtana, @PKUWZP and I is testing Moonlight-16B-A6B model finetuning which is a muon pretrained model. When working on this model, we found some changes to AutoEP is necessary to make AutoEP work with noaux_ac MoE (which is used by DeepSeek and Moonlight) model and with models using Muon optimizers. The changes is in following branch, is it possible to merge this branch into your PR? Thanks! https://github.com/deepspeedai/DeepSpeed/tree/gma/autoep-muon-fixes

delock · 2026-04-13T15:18:58Z

Hi @tohtana, @PKUWZP and I is testing Moonlight-16B-A6B model finetuning which is a muon pretrained model. When working on this model, we found some changes to AutoEP is necessary to make AutoEP work with noaux_ac MoE (which is used by DeepSeek and Moonlight) model and with models using Muon optimizers. The changes is in following branch, is it possible to merge this branch into your PR? Thanks! https://github.com/deepspeedai/DeepSpeed/tree/gma/autoep-muon-fixes

Here is the model that requires this change to work with AutoEP, https://huggingface.co/delock/Moonlight-16B-A3B-finetune-fixed

tohtana · 2026-04-16T06:05:37Z

Hi @delock,

Thank you for sharing the patch! It definitely fixes an issue with DeepSeek / Moonlight.
I also found that we need to pass some HF model configs to AutoEP:

n_group -> num_expert_groups
topk_group -> num_limited_groups
routed_scaling_factor -> route_scale

Can you update our patch?

On the Muon side, I also found that it removes @compiler.compile() and replaces one batched Newton-Schulz call with a Python loop over experts. Current Muon already supports batched matrices. Can we use it?

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

nathon-lee · 2026-04-16T10:17:57Z

Hi @tohtana,

I ran some tests on the new AutoEP presets and wanted to share my findings.

1. Qwen preset works as expected
I successfully ran Qwen/Qwen1.5-MoE-A2.7B using the qwen preset. AutoEP correctly detected and partitioned the experts across 2 GPUs. You can check the run log here: autoep_qwen3_moe.log

2. There are still some issues testing other models:

Llama 4: Testing end-to-end via Hugging Face is difficult. Official repos are gated (returning 403 without a token), and community dummy models (like llamafactory/tiny-random-Llama-4) have invalid config.json files that crash the transformers library before DeepSpeed starts.
- Suggestion: It might be helpful to use pure PyTorch mock models (dummy classes mimicking the MoE structures) for unit tests to avoid HF network/auth dependencies.
DeepSeek: AutoEP partitioned DeepSeek-V2-Lite successfully. However, for DeepSeek-V3-Base, it returned AutoEP: no MoE layers detected in model. The regex for V3 might need a check against its current HF structure.

(Note: I also intentionally loaded the Qwen model but applied the llama4 preset. AutoEP correctly threw the "no MoE layers detected" error, which verifies that the regex parser isolation works correctly.)

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Two fixes addressing masahiro's review feedback on PR #7938: 1. Auto-fill AutoEPConfig from HF model config (auto_ep_config.py, auto_ep.py): add fill_autoep_config_from_hf() which maps HF field names to AutoEP internal names on AutoEP.__init__: - n_group -> num_expert_groups - topk_group -> num_limited_groups - routed_scaling_factor -> route_scale User-supplied values always take precedence. Without this, Moonlight (DeepSeek-V3) training used route_scale=1.0 instead of 2.446, producing systematically wrong MoE output magnitudes. 2. Restore batched Newton-Schulz in muon_update (original_muon.py): replace the per-expert Python loop with a single batched call to zeropower_via_newtonschulz5, which already supports ndim>=2 inputs. This restores GPU parallelism across all E experts per step. Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>

tohtana · 2026-05-14T08:12:27Z

@tjruwase @nathon-lee @PKUWZP
This PR is finally ready to review. You can also refer to an example.

Experiments on a single node (using small models with less layers) showed good performance. Please feel free to try with larger models on multiple nodes.

Model	ZeRO-3 leaf	AutoEP (+ZeRO-1)
Qwen3.5 MoE (8 layers)	42,128.05 tok/s, 34.99 GB	87,540.15 tok/s, 25.58 GB (`2.08x` throughput, `0.73x` memory vs ZeRO-3)
Llama4 (7 layers)	19,144.07 tok/s, 56.95 GB	60,178.91 tok/s, 60.08 GB (`3.14x` throughput, `1.06x` memory vs 7-layer ZeRO-3)
Mixtral 8x7B (8 layers)	32,622.11 tok/s, 50.47 GB	69,052.31 tok/s, 35.03 GB (`2.12x` throughput, `0.69x` memory vs ZeRO-3)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff4ebf1b6d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

…lier Remove AutoEP backward loss multiplier

…-groups-mpu Use active MPU for AutoEP sequence-parallel size

Three changes to make Moonlight-16B-A3B (DeepSeek-V3 MoE) work with AutoEP + Muon + ZeRO-2: 1. e_score_correction_bias: copy the pretrained noaux_tc score-correction bias from the source gate into AutoEP routers and apply it in the TokenChoiceTopKRouter forward pass so expert selection matches the pretrained checkpoint. 2. is_expert_group: mark GroupedExperts w1/w2/w3 tensors with is_expert_group=True so Muon applies Newton-Schulz independently per expert slice rather than treating the stacked (E, I, O) tensor as a single matrix. muon_update grows an is_expert_group kwarg; all four call sites inside original_muon.py and the ZeRO-2 path in stage_1_and_2.py pass getattr(p, 'is_expert_group', False). 3. Muon + MoE param groups in engine.py: flatten dict-style param groups produced by configure_moe_param_groups before filtering by use_muon; re-tag optimizer flags after AutoEP layer replacement; add name keys for MoE group splitting; call split_params_into_different_moe_groups when the model has MoE layers. Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>

Two fixes addressing masahiro's review feedback on PR deepspeedai#7938: 1. Auto-fill AutoEPConfig from HF model config (auto_ep_config.py, auto_ep.py): add fill_autoep_config_from_hf() which maps HF field names to AutoEP internal names on AutoEP.__init__: - n_group -> num_expert_groups - topk_group -> num_limited_groups - routed_scaling_factor -> route_scale User-supplied values always take precedence. Without this, Moonlight (DeepSeek-V3) training used route_scale=1.0 instead of 2.446, producing systematically wrong MoE output magnitudes. 2. Restore batched Newton-Schulz in muon_update (original_muon.py): replace the per-expert Python loop with a single batched call to zeropower_via_newtonschulz5, which already supports ndim>=2 inputs. This restores GPU parallelism across all E experts per step. Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>

Now that the per-expert Python loop is replaced with a single batched call to zeropower_via_newtonschulz5, muon_update has no dynamic control flow and can be compiled again. Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>

Signed-off-by: Guokai Ma <guokai.ma@intel.com>

1. gram_newtonschulz: replace torch.addmm (2D only) with equivalent a*Q + Z@Q to support batched 3D expert weight tensors [num_local_experts, n, m]. Also fix diagonal() to specify dim1/dim2 for 3D tensors. 2. deepseek_v3 preset: remove e_score_correction_bias from unsupported_router_bias_names since auto_ep_layer.py already copies it correctly (lines 398-402). Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>

delock · 2026-05-19T07:06:16Z

Hi @tohtana, I have a branch gma/autoep-muon-fixes on this repo with 5 commits on top of the current PR HEAD that add AutoEP + Muon compatibility for Moonlight (DeepSeek-V3 MoE):

e_score_correction_bias — Copy pretrained noaux_tc score-correction bias into AutoEP routers
is_expert_group — Muon applies Newton-Schulz independently per expert slice; fix gram_newtonschulz for batched 3D tensors
fill_autoep_config_from_hf — Auto-fill AutoEPConfig from HF model config (n_group, topk_group, routed_scaling_factor)
Muon + MoE param groups in engine.py — Flatten dict-style param groups; re-tag optimizer flags; split MoE groups for correct EP gradient sync
ns_method respect in expert group path + deepseek_v3 preset fix

Could you merge these into your branch? I tried to create a PR to your branch but there seems a lot of excessive commits from master branch. The following commands should get you cherry pick the 5 commits I mentioned.

git fetch https://github.com/deepspeedai/DeepSpeed.git gma/autoep-muon-fixes
git merge FETCH_HEAD

Thanks!

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana · 2026-05-20T05:39:01Z

Thank you @delock!
I merged the improvements. The CI is currently broken due to a HF version issue, but #8016 will unblock it.

delock · 2026-05-21T06:21:37Z

Hi @tohtana , another question, I'm testing freeze most of the parameters (including expert parameters) except expert router parameters. During the process I found that the freezed expert are not freezed any more and I got an OOM during training. Should AutoEP sync requires_grad flags after create the new EP tensors?

Signed-off-by: nathon-lee <leejianwoo@gmail.com>

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana · 2026-05-21T20:55:59Z

Great catch, @delock! I appreciate your help with testing AutoEP.
I updated to copy attributes of parameters including requires_grad, dtype, and device.

refactor(autoep-zero3): drop files already covered by deepspeedai#7938

tohtana added 7 commits February 6, 2026 20:59

add autoep

bea50ef

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

add checkpointing

2c041db

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

fix format

c2a89bc

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

add custom patterns

fd07c93

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

fix optimizer resumption

cabfebc

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

autoep: fix post-dispatch local expert permutation grouping

a2ab10d

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Merge branch 'master' into tohtana/add_autoep

e79c2e8

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana mentioned this pull request Mar 31, 2026

[Feature] Enable AutoEP Compatibility with ZeRO-3 #7928

Closed

PKUWZP self-requested a review March 31, 2026 01:17

tohtana added 4 commits March 31, 2026 18:33

fix(autoep): restore ep_count helper

046db04

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

test(autoep): make checkpoint tests cpu-safe

71a0a36

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Fix AutoEP ZeRO-2 expert gradient scaling

fae0276

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

fix(autoep): preserve manual backward parity

5f7dc1e

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana added 2 commits April 10, 2026 16:57

fix(autoep): align combine path with grouped-mm baseline

90f86c7

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> (cherry picked from commit cc45af3)

feat(autoep): add selectable combine implementations

b207c5e

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Merge branch 'master' into tohtana/add_autoep

0e872cc

Merge branch 'master' into tohtana/add_autoep

6b65486

test(autoep): update combine_from_routed calls

2c873a9

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

fix(autoep): support llama4 fused experts

240495c

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana requested review from hwchen2017, loadams and tjruwase as code owners May 14, 2026 08:07

chatgpt-codex-connector Bot reviewed May 14, 2026

View reviewed changes

Comment thread deepspeed/runtime/engine.py Outdated

Comment thread deepspeed/runtime/engine.py Outdated

Remove AutoEP backward loss multiplier

a05f092

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana mentioned this pull request May 14, 2026

Remove AutoEP backward loss multiplier tohtana/DeepSpeed#23

Merged

Use active MPU for AutoEP sequence-parallel size

71077a5

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana mentioned this pull request May 14, 2026

Use active MPU for AutoEP sequence-parallel size tohtana/DeepSpeed#24

Merged

tohtana and others added 7 commits May 14, 2026 08:55

Merge pull request #23 from tohtana/tohtana/remove-autoep-loss-multip…

045b061

…lier Remove AutoEP backward loss multiplier

Merge pull request #24 from tohtana/tohtana/fix-autoep-sp-size-before…

04b8abd

…-groups-mpu Use active MPU for AutoEP sequence-parallel size

fix(muon): restore @compiler.compile() on muon_update

6012bef

Now that the per-expert Python loop is replaced with a single batched call to zeropower_via_newtonschulz5, muon_update has no dynamic control flow and can be compiled again. Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>

fix(muon): respect ns_method in expert group path

dde07ab

Signed-off-by: Guokai Ma <guokai.ma@intel.com>

delock mentioned this pull request May 19, 2026

AutoEP + Muon fixes for Moonlight (DeepSeek-V3 MoE) tohtana/DeepSpeed#25

Closed

Merge AutoEP Muon fixes

ea73adc

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Merge branch 'master' into tohtana/add_autoep

61f3dca

Merge branch 'master' into tohtana/add_autoep

827b633

nathon-lee added a commit to nathon-lee/DeepSpeed_woo that referenced this pull request May 21, 2026

refactor(autoep-zero3): drop files already covered by deepspeedai#7938

8fd0cf4

Signed-off-by: nathon-lee <leejianwoo@gmail.com>

nathon-lee mentioned this pull request May 21, 2026

refactor(autoep-zero3): drop files already covered by #7938 nathon-lee/DeepSpeed_woo#19

Merged

tohtana added 3 commits May 21, 2026 10:12

Preserve AutoEP requires_grad flags

71205e3

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Preserve AutoEP replacement dtype and device

b2e23a8

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

Format AutoEP dtype device test

31c4bd4

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

nathon-lee added a commit to nathon-lee/DeepSpeed_woo that referenced this pull request May 22, 2026

Merge pull request #19 from nathon-lee/fix_autoEP_zero3

3d1de7c

refactor(autoep-zero3): drop files already covered by deepspeedai#7938

Conversation

tohtana commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Attribution

Design

Adding new model support

Uh oh!

tohtana commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jiosephlee commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tohtana commented Mar 31, 2026

Uh oh!

nathon-lee commented Apr 8, 2026

Uh oh!

tohtana commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nathon-lee commented Apr 11, 2026 via email • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

delock commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

delock commented Apr 13, 2026

Uh oh!

tohtana commented Apr 16, 2026

Uh oh!

nathon-lee commented Apr 16, 2026

Uh oh!

tohtana commented May 14, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

delock commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tohtana commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

delock commented May 21, 2026

Uh oh!

tohtana commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tohtana commented Mar 31, 2026 •

edited

Loading

tohtana commented Mar 31, 2026 •

edited

Loading

jiosephlee commented Mar 31, 2026 •

edited

Loading

tohtana commented Apr 11, 2026 •

edited

Loading

nathon-lee commented Apr 11, 2026 via email •

edited

Loading

delock commented Apr 13, 2026 •

edited

Loading

delock commented May 19, 2026 •

edited

Loading

tohtana commented May 20, 2026 •

edited

Loading

tohtana commented May 21, 2026 •

edited

Loading