[PyTorch] Add distributed Muon optimizer #2920
vcherepanov-nv wants to merge 10 commits into NVIDIA:main
Conversation
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary: Adds a distributed Muon optimizer based on newton_schulz orthogonalization.
Confidence Score: 4/5. Safe to merge after addressing the breaking module-path removal; the optimizer logic itself is correct. One P1 finding relates to that module-path renaming.
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant MuonOptimizer
    participant DistNormalize as _distributed_normalize_p2_
    participant NS as newton_schulz_tp
    participant CusolverMp as CusolverMpCtx
    Caller->>MuonOptimizer: step()
    MuonOptimizer->>MuonOptimizer: weight decay (decoupled or L2)
    MuonOptimizer->>MuonOptimizer: momentum_buffer.lerp_(grad, 1-β)
    MuonOptimizer->>MuonOptimizer: Nesterov update = grad.lerp(buf, β)
    MuonOptimizer->>DistNormalize: _distributed_normalize_p2_(update, eps)
    DistNormalize-->>DistNormalize: local norm²
    DistNormalize->>DistNormalize: all_reduce(SUM) → global norm²
    DistNormalize-->>MuonOptimizer: update ÷= global_norm (in-place)
    MuonOptimizer->>NS: newton_schulz_tp(update, ctx, partition_dim, tp_mode=distributed)
    alt partition_dim == 0
        NS->>NS: x_t = update.mT.contiguous()
        NS->>CusolverMp: newton_schulz(x_t, ctx, num_iters)
        NS-->>NS: update.copy_(x_t.mT)
    else partition_dim == 1
        NS->>CusolverMp: newton_schulz(update, ctx, num_iters)
    end
    CusolverMp-->>NS: in-place orthogonalized shard
    NS-->>MuonOptimizer: orthogonalized update
    MuonOptimizer->>MuonOptimizer: update *= scale_factor * extra_scale
    MuonOptimizer->>Caller: p -= lr * orth_update
```python
def step(self, closure=None):
    """Perform a single optimization step."""
    loss = None
    if closure is not None:
        loss = closure()
```
**Closure called inside `@torch.no_grad()`, preventing gradient computation**

`closure()` is invoked while `torch.no_grad()` is active. Any `loss.backward()` call inside the closure will silently produce zero/no gradients. The standard PyTorch pattern (used in SGD, Adam, etc.) is to wrap the closure in `with torch.enable_grad():`.
Suggested change:

```diff
+@torch.no_grad()
 def step(self, closure=None):
     """Perform a single optimization step."""
     loss = None
     if closure is not None:
-        loss = closure()
+        with torch.enable_grad():
+            loss = closure()
```
```python
    scale_mode: str,
    extra_scale_factor: float,
    eps: float,
) -> torch.Tensor:
    global_shape = [grad.size(0), grad.size(1)]
    global_shape[partition_dim] *= world_size
```
**Reference `global_shape` incorrectly scales an already-full tensor**
`_reference_orthogonalize` receives the full matrix (shape `full_shape`) but then multiplies `global_shape[partition_dim]` by `world_size` a second time. For `partition_dim=1` with `world_size=2` and `full_shape=(96, 128)` this gives `global_shape=[96, 256]`, so `get_muon_scale_factor` returns `max(96, 256)**0.5 = 16`. The optimizer, operating on the shard `(96, 64)`, correctly reconstructs `global_shape=[96, 128]` and computes `max(96, 128)**0.5 ≈ 11.3`. This √2 discrepancy means the reference cannot correctly validate the optimizer's output.

The `global_shape[partition_dim] *= world_size` line should be removed since the input is already the full matrix.
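A quick numeric check of the discrepancy described above, using an illustrative stand-in for the spectral rule (`max(rows, cols) ** 0.5`); all names here are hypothetical, not from the PR:

```python
# Spectral scale rule assumed from the discussion: scale = max(rows, cols) ** 0.5.
def spectral_scale(rows: int, cols: int) -> float:
    return max(rows, cols) ** 0.5

world_size = 2
full_shape = (96, 128)  # full matrix, partitioned along dim 1

# Buggy reference path: multiplies the already-full dim by world_size again.
ref_scale = spectral_scale(full_shape[0], full_shape[1] * world_size)  # max(96, 256)**0.5 = 16.0

# Optimizer path: shard (96, 64) correctly reconstructed to (96, 128).
opt_scale = spectral_scale(96, 64 * world_size)  # max(96, 128)**0.5 ≈ 11.31

ratio = ref_scale / opt_scale  # ≈ 1.414, i.e. exactly sqrt(2) here
```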
```python
if mode == "unit_rms_norm":
    return (size_out / size_in) ** 0.5
```
**`unit_rms_norm` mode can divide by zero when `size_in == 0`**
`(size_out / size_in) ** 0.5` raises `ZeroDivisionError` when `size_in` is 0. While the optimizer validates that the partition dimension is non-empty, it doesn't ensure the other dimension is non-zero. Consider adding a guard or documenting that both dimensions must be strictly positive.
```python
if group["nesterov"]:
    update = grad.lerp(momentum_buffer, group["momentum"])
else:
    update = momentum_buffer
```
**Non-Nesterov `update` is an alias to `momentum_buffer`, not a copy**
`update = momentum_buffer` holds a reference. If `_orthogonalize` ever modifies its input in-place in a future refactor, the momentum buffer will be silently corrupted. `_orthogonalize` currently clones the input immediately so this is safe today, but a defensive `.clone()` or comment would make the intent explicit.
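A defensive variant of the branch above, sketched as a standalone helper (the function name is illustrative; the PR inlines this logic in the step loop):

```python
import torch


def compute_update(grad: torch.Tensor, momentum_buffer: torch.Tensor,
                   momentum: float, nesterov: bool) -> torch.Tensor:
    """Select the raw update tensor without aliasing the momentum buffer."""
    if nesterov:
        # lerp allocates a fresh tensor, so there is no aliasing on this path.
        return grad.lerp(momentum_buffer, momentum)
    # Clone so a future in-place _orthogonalize cannot corrupt the buffer.
    return momentum_buffer.clone()
```

With the clone, later in-place edits of `update` leave the buffer untouched.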
**skyw** left a comment
I'd advise NOT exposing it in the public API. Keep it in tests only, if that is the purpose.
Having an optimizer with most of its code copied invites fragmentation.
Before this, all the optimizers TE provides are more optimized fused versions. I'd say a highly optimized fused Muon with a similar concept can be justified, but it would need more consideration because it has more dependencies on other parts of the training pipeline than elementwise optimizers do.
```python
    on tensor-parallel parameter shards. The local parameter shard must represent a
    partition of a logical 2D matrix across the provided NCCL process group.

    Args:
```
Q: Does TE use numpy style docstring instead of Google style?
```python
def __init__(
    self,
    params: Iterable[torch.nn.Parameter | dict],
```
Nit: The type here doesn't match PyTorch's internal one. Should be fine for the purposes of this class.
```python
    scale_mode: MuonScaleT = "spectral",
    extra_scale_factor: float = 1.0,
    process_group: Optional[dist.ProcessGroup] = None,
    partition_dim: int = 1,
```
```python
    raise ValueError(f"Invalid weight_decay value: {weight_decay}")
if num_ns_steps < 1:
    raise ValueError(f"num_ns_steps must be at least 1, got {num_ns_steps}")
if partition_dim not in (0, 1):
```
Q: Does this class intend to support the non-distributed case? `partition_dim` would be -1 in TE in that case.
```python
if process_group is None:
    if not dist.is_initialized():
        raise RuntimeError("MuonOptimizer requires torch.distributed to be initialized.")
```
Same question as above regarding single-GPU support.
```python
if process_group is None:
    if not dist.is_initialized():
        raise RuntimeError("MuonOptimizer requires torch.distributed to be initialized.")
    process_group = dist.group.WORLD
```
Suggestion: This silent behavior is dangerous. If the user forgets to pass the correct TP group, the wrong group will be used.
```python
    global_shape[partition_dim] *= world_size

    orth_grad = grad.clone()
    transposed = partition_dim == 0
```
Attn: This assumes the common row- and column-wise tensor parallelism used in most LLMs. It would be suboptimal for anything other than that. Add a comment if this assumption is made.
The idea was to give something to users who use TE but not Megatron-LM. By fragmentation, do you mean that we want to encourage everyone to use Megatron-LM? Or that the optimizer is a relatively thin thing on top of the newton_schulz call, and users should have no trouble creating it themselves? I don't think we gain anything by putting it into tests, since we already have tests for the newton_schulz call. So we need to decide whether we want this PR, or should abandon it altogether. @cyanguwa
Fragmentation means there will be different flavors of Muon in emerging-optimizers and TE, and also a lot of copied code. TE can end up with a stalled feature when emerging-optimizers updates. Megatron-LM will always have its own version because there are implementation-specific things that need to be hooked together, for example how QKV is implemented, or fused swiglu.
Should we move newton_schulz.py to this directory? Also, how do we expect Megatron to call us for this functionality? Thanks.
> Should we move newton_schulz.py to this directory?

No, don't think so.

> Also, how do we expect Megatron to call us for this functionality?

Megatron will call newton_schulz directly from their optimizers. This one is for other users.
I'd prefer moving them into something like transformer_engine/pytorch/cusolver. But I suppose that is orthogonal to this PR.
@skyw Just following up on the discussion above - our purpose for this PR was two-fold. One was to provide an equivalent …
We should make sure to include this in the QA script: https://github.com/NVIDIA/TransformerEngine/blob/main/qa/L1_pytorch_distributed_unittest/test.sh
```python
LAUNCH_CMD = ["torchrun", f"--nproc_per_node={NUM_PROCS}"]


def _run_test(dtype: str, partition_dim: int, weight_decay_mode: str) -> None:
```
Each torchrun launch is somewhat expensive. Instead of launching a separate torchrun for each test case, it's better to launch a single torchrun instance and to perform multiple tests internally. See distributed/test_fusible_ops.py for an example.
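The amortized pattern suggested here — one torchrun launch, with the worker iterating all cases internally — can be sketched as follows. The parameter grid and function names are hypothetical; the real grid lives in the PR's test file:

```python
import itertools

# Hypothetical grid mirroring the (dtype, partition_dim, weight_decay_mode) cases.
DTYPES = ("float32", "bfloat16")
PARTITION_DIMS = (0, 1)
WEIGHT_DECAY_MODES = ("decoupled", "l2")


def run_all_cases(run_one) -> int:
    """Run every combination inside a single torchrun worker; return the failure count."""
    failures = 0
    for dtype, pdim, wd in itertools.product(DTYPES, PARTITION_DIMS, WEIGHT_DECAY_MODES):
        try:
            run_one(dtype, pdim, wd)
        except AssertionError:
            failures += 1
    return failures
```

pytest then launches torchrun once and checks the worker's exit status, instead of paying process-group initialization per parameter combination.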
Megatron is wrapped over emerging-optimizers with Megatron-specific details, like TP and how QKV is organized. Most of the optimizer logic is in emerging-optimizers. Could TE do the same? I understand introducing a new dependency may be a concern, let me know. The biggest concern is actually the large portion of duplicated code. What I would favor is having …
Description
Add a distributed Muon optimizer, based on newton_schulz orthogonalization
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: