DIstributed branch base by 3outeille · Pull Request #46269 · huggingface/transformers

3outeille · 2026-05-28T17:32:45Z

Current state of things

Training has been confirmed to work in TRL thanks to @AmineDiro, cf. "EP + Trainer integration on top of DistributedConfig (#45028)" #46126:

It was validated end-to-end with trl.SFTTrainer on Qwen3-30B-A3B at multiple shapes (16k / 32k / 64k, 2n / 4n / 8n). Healthy training (loss bit-exact match against our prod fork at LR=0) and very competitive MFU 👏 👏
- Inference works but depends on DTensor (which we don't want, as it is very slow)
- DTensor-based TP
- Native FSDP2 support in transformers
- Sequence parallelism for activations and norms via per-model _sp_plan entries
- MoE expert parallelism + packed-weight sharding for fused gate_up_proj / grouped_mm kernels
- Shard-on-read via DtensorShardOperation
- Resume under a different parallelism layout
- Standard safetensors export. model.save_pretrained(dir) (no split files checkpoint when distributed) writes a fully-gathered
- Distributed checkpointing for both model and optimizer
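
To illustrate why a fused gate_up_proj needs packing-aware sharding, here is a minimal pure-Python sketch. It assumes the fused weight stores the gate rows followed by the up rows along the output dimension; that layout, the function names, and the toy row labels are illustrative assumptions, not the code in this PR.

```python
def naive_shard(rows, world_size, rank):
    """Contiguous chunking along the output dim, ignoring the packing.

    Assumes len(rows) is divisible by world_size.
    """
    chunk = len(rows) // world_size
    return rows[rank * chunk:(rank + 1) * chunk]


def packed_shard(rows, world_size, rank, num_packed=2):
    """Shard each packed block (gate, then up) separately, then re-concatenate."""
    block = len(rows) // num_packed
    out = []
    for b in range(num_packed):
        out.extend(naive_shard(rows[b * block:(b + 1) * block], world_size, rank))
    return out


# Toy fused weight: 4 gate rows followed by 4 up rows (labels stand in for row vectors).
fused = ["g0", "g1", "g2", "g3", "u0", "u1", "u2", "u3"]

for rank in range(2):
    print(rank, naive_shard(fused, 2, rank), packed_shard(fused, 2, rank))
```

With naive chunking, rank 0 would hold only gate rows and rank 1 only up rows, so neither rank could compute its slice of the gated activation locally. The packing-aware split gives every rank matching gate and up rows.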

What's missing:

The PR "FSDP + TP & native save/load distributed" #45028 has been reverted due to:
- missing support of EP
- Inference on plain tensors for Continuous batching

TODO

Critical (blocking)
- Bring EP => ep_router + moe_experts_ep_allreduce ParallelStyles (work begun by @AmineDiro in "EP + Trainer integration on top of DistributedConfig (#45028)" #46126)
- Mix plain tensors & dtensors in Colwise/Rowwise to enable inference for continuous batching
Medium:
- push checkpoint to hf_buckets by default
- MoEExpertsParallel
  - go from module level to param level ("layers.*.mlp.experts.gate_up_proj": "grouped_gemm", "layers.*.mlp.experts.down_proj": "grouped_gemm", "layers.*.mlp.experts": "moe_experts_allreduce",)
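
As a rough sketch of what a param-level plan lookup could look like, the snippet below resolves a parameter or module name against the plan keys quoted above. The `resolve_style` helper and the rule that `*` matches exactly one dotted name component are assumptions made for illustration; the actual matching logic in transformers may differ.

```python
import re

# Plan entries copied from the TODO item above.
plan = {
    "layers.*.mlp.experts.gate_up_proj": "grouped_gemm",
    "layers.*.mlp.experts.down_proj": "grouped_gemm",
    "layers.*.mlp.experts": "moe_experts_allreduce",
}


def _to_regex(pattern):
    # Assumption: "*" stands for a single dotted name component (e.g. a layer index).
    return re.compile(r"[^.]+".join(re.escape(part) for part in pattern.split("*")))


def resolve_style(name, plan):
    """Return the style of the first pattern that fully matches `name`, or None."""
    for pattern, style in plan.items():
        if _to_regex(pattern).fullmatch(name):
            return style
    return None


print(resolve_style("layers.3.mlp.experts.gate_up_proj", plan))
print(resolve_style("layers.3.mlp.experts", plan))
print(resolve_style("layers.3.self_attn.q_proj", plan))
```

Because each key is matched against the full name, the module-level entry and the param-level entries can coexist in one plan without shadowing each other.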
Minor:
- handling sparse + dense in the sp_plan for models like cohere2_moe, mellum, qwen3_moe
- default tp_plan + ep_plan for inference / sp_plan + ep_plan for training
- Add DistributedMixin to isolate save_pretrained(distributed_checkpoint=True) from PreTrainedModel
- cleaning FSDP by unifying manual/auto path
- make sure the sharding plan is correct for a given model (e.g. if I set tp_size=8, is it valid?)
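
One simple form such a sharding-plan check could take is a divisibility test on the dimensions that TP splits. The sketch below is only an assumption about what "valid" might mean; `check_tp_size` is a hypothetical helper and the dimensions are made-up values, not those of any specific model.

```python
def check_tp_size(dims, tp_size):
    """Return the names of dimensions that tp_size does not divide evenly."""
    return [name for name, size in dims.items() if size % tp_size != 0]


# Hypothetical model dimensions, chosen only for illustration.
dims = {
    "num_attention_heads": 32,
    "num_key_value_heads": 4,
    "intermediate_size": 6144,
}

print(check_tp_size(dims, 4))  # []
print(check_tp_size(dims, 8))  # ['num_key_value_heads']
```

A real check may be more permissive, for example by replicating KV heads when there are fewer heads than TP ranks, so an entry in the returned list is a warning sign rather than a hard failure.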

PR

- Tp param level #46290 (param_level, ep_router, handling sparse + dense)
- sp + ep training / tp + ep inference #46292 (tp + ep_plan / sp + ep_plan)
  - cleaning _accumulate_local_param_grad #46394
  - tp_inference_training

This reverts commit 295cee3.

HuggingFaceDocBuilderDev · 2026-05-28T17:46:34Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions · 2026-05-28T18:51:22Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, apertus, arcee, aria, audioflamingo3, bamba, bitnet, cohere, cohere2, cohere2_moe, csm, cwm, data2vec, dbrx, deepseek_v2, deepseek_v3

AmineDiro · 2026-05-29T07:45:14Z

@3outeille & @ArthurZucker should I port #46126, or is that already in progress and I should just close it?

3outeille · 2026-05-29T17:49:51Z

@AmineDiro you can change the base to this branch!

3outeille added 2 commits May 28, 2026 17:31

Revert "[Revert] FSDP+Dtensor refactor related changes (#46246)"

18a712b

This reverts commit 295cee3.

not unifying but still better than before

863e12b

3outeille changed the title ~~DIstributed branch~~ DIstributed branch base May 28, 2026

3outeille added 2 commits May 28, 2026 18:08

mellum now inherits tp and sp plan from qwen3_moe

f914e30

handle dense + sparse mixing for mellum model in SP plan

2d01eb3

3outeille wants to merge 4 commits into main from distributed


3 participants
