DIstributed branch base#46269

Open
3outeille wants to merge 4 commits into main from distributed

@3outeille
Member

@3outeille 3outeille commented May 28, 2026

Current state of things

  • Training has been confirmed to work in TRL thanks to @AmineDiro, cf. "EP + Trainer integration on top of DistributedConfig (#45028)" #46126

    It was validated end-to-end with trl.SFTTrainer on Qwen3-30B-A3B at multiple shapes (16k / 32k / 64k, 2n / 4n / 8n). Training is healthy (loss is a bit-exact match against our prod fork at LR=0) and MFU is very competitive 👏 👏

  • Inference works but depends on DTensor (which we don't want, as it is very slow)
  • DTensor-based TP
  • Native FSDP2 support in transformers
  • Sequence parallelism for activations and norms via per-model _sp_plan entries
  • MoE expert parallelism + packed-weight sharding for fused gate_up_proj /
    grouped_mm kernels
  • Shard-on-read via DtensorShardOperation
  • Resume under a different parallelism layout.
  • Standard safetensors export. model.save_pretrained(dir) (no split files checkpoint when distributed) writes a fully-gathered
  • Distributed checkpointing for both model and optimizer.
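
For readers unfamiliar with the packed-weight sharding bullet above, here is a minimal sketch of why a fused gate_up_proj cannot simply be cut into contiguous chunks. It assumes the fused weight stores all gate rows followed by all up rows; that layout, the helper names `shard_packed` / `shard_naive`, and the toy labels are illustrative assumptions, not the implementation in this branch.

```python
def shard_packed(rows, num_packed, rank, world_size):
    """Return this rank's slice of a packed weight.

    `rows` is assumed to be laid out as `num_packed` equal blocks
    (e.g. [gate rows..., up rows...]). Each block is sharded on its own,
    so every rank receives matching slices of gate and up.
    """
    if len(rows) % num_packed != 0:
        raise ValueError("packed dimension is not divisible by num_packed")
    block = len(rows) // num_packed
    if block % world_size != 0:
        raise ValueError("block size is not divisible by world_size")
    per_rank = block // world_size
    out = []
    for b in range(num_packed):
        start = b * block + rank * per_rank
        out.extend(rows[start:start + per_rank])
    return out


def shard_naive(rows, rank, world_size):
    """Contiguous chunking that ignores the packed layout."""
    per_rank = len(rows) // world_size
    return rows[rank * per_rank:(rank + 1) * per_rank]


# Toy fused weight: 4 gate rows then 4 up rows (labels stand in for tensor rows).
fused = ["g0", "g1", "g2", "g3", "u0", "u1", "u2", "u3"]
packed_rank0 = shard_packed(fused, num_packed=2, rank=0, world_size=2)
packed_rank1 = shard_packed(fused, num_packed=2, rank=1, world_size=2)
naive_rank0 = shard_naive(fused, rank=0, world_size=2)
```

With the naive split, rank 0 ends up with only gate rows and no up rows, so it cannot compute its share of the gated activation locally. The packed-aware split gives each rank the same row range from both halves.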

What's missing:

TODO

  • Critical (blocking)
  • Medium:
    • push checkpoint to hf_buckets by default
    • MoEExpertsParallel
      • go from module level to param level ("layers.*.mlp.experts.gate_up_proj": "grouped_gemm", "layers.*.mlp.experts.down_proj": "grouped_gemm", "layers.*.mlp.experts": "moe_experts_allreduce",)
  • Minor:
    • handling sparse + dense in the sp_plan for models like cohere2_moe, mellum, qwen3_moe
    • default tp_plan + ep_plan for inference / sp_plan + ep_plan training
    • Add DistributedMixin to isolate save_pretrained(distributed_checkpoint=True) from PreTrainedModel
    • cleaning FSDP by unifying manual/auto path
    • validate that the sharding plan is correct for a given model (e.g. if I set tp_size=8, is it valid?)
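
Two of the TODO items above (the param-level plan and the "is tp_size=8 valid?" check) can be sketched in a few lines. This is only an illustration: the glob matching via `fnmatch`, the helper names `resolve` and `check_divisible`, and the toy dimension values are assumptions, not how this branch resolves or validates plans.

```python
from fnmatch import fnmatchcase

# Param-level plan copied from the TODO item; the strategy names are
# treated as opaque strings here.
PLAN = {
    "layers.*.mlp.experts.gate_up_proj": "grouped_gemm",
    "layers.*.mlp.experts.down_proj": "grouped_gemm",
    "layers.*.mlp.experts": "moe_experts_allreduce",
}

# Hypothetical model dimensions, for illustration only.
TOY_DIMS = {
    "num_attention_heads": 32,
    "num_key_value_heads": 4,
    "num_experts": 128,
}


def resolve(name, plan):
    """Return the strategy of the first pattern that matches `name`, else None."""
    for pattern, strategy in plan.items():
        if fnmatchcase(name, pattern):
            return strategy
    return None


def check_divisible(dims, degree):
    """Return the names of dimensions that `degree` does not divide evenly."""
    return [key for key, value in dims.items() if value % degree != 0]


param_strategy = resolve("layers.3.mlp.experts.gate_up_proj", PLAN)
module_strategy = resolve("layers.3.mlp.experts", PLAN)
problems_tp8 = check_divisible(TOY_DIMS, 8)
```

A non-empty result from `check_divisible` is only a warning sign in this sketch: a real check would also need to know whether the plan replicates a dimension (for example KV heads) rather than sharding it.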

PR

@3outeille changed the title from "DIstributed branch" to "DIstributed branch base" May 28, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, apertus, arcee, aria, audioflamingo3, bamba, bitnet, cohere, cohere2, cohere2_moe, csm, cwm, data2vec, dbrx, deepseek_v2, deepseek_v3

@AmineDiro
Member

@3outeille & @ArthurZucker should I port #46126, or is that already in progress, in which case I should just close it?

@3outeille
Member Author

@AmineDiro you can change the base to this branch!

3 participants