
Commit 7932533

Refine Muon blog: convergence results, LR tuning guide, and formatting fixes
Signed-off-by: Guokai Ma <guokai.ma@intel.com>
1 parent a91018f commit 7932533

2 files changed

Lines changed: 12 additions & 38 deletions

File tree

blogs/muon-optimizer/README.md

Lines changed: 12 additions & 38 deletions
@@ -5,14 +5,14 @@ Muon optimizer has gained momentum with more and more use from community and als
## What is Muon optimizer?

Muon is an optimizer designed for the hidden 2D weights of a neural network. It takes the gradient of a weight, computes its momentum, applies Newton-Schulz iterations to orthogonalize the momentum matrix, and then uses this orthogonalized matrix to update the weight[1](https://kellerjordan.github.io/posts/muon/). Because Muon only maintains one momentum buffer (versus Adam’s two), it uses less memory for optimizer states.
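
For readers who want a concrete picture of the update, below is a minimal PyTorch sketch of the Newton-Schulz orthogonalization and of a single Muon step for one 2D weight. It follows the quintic iteration and coefficients described in the reference post[1](https://kellerjordan.github.io/posts/muon/); the function names, the 5-step default, and the shape-based rescaling are illustrative choices, not DeepSpeed's exact implementation.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximate the orthogonal factor of G (replace G = U S V^T by roughly U V^T)
    # using a quintic Newton-Schulz iteration; coefficients follow the Muon post [1].
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    transpose = X.size(0) > X.size(1)
    if transpose:
        X = X.T
    X = X / (X.norm() + 1e-7)  # scale so the spectral norm is at most ~1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transpose:
        X = X.T
    return X.to(G.dtype)

def muon_step(weight, grad, momentum_buf, lr=1e-2, beta=0.95):
    # One Muon update for a single 2D hidden weight:
    # momentum accumulation -> orthogonalization -> scaled weight update.
    momentum_buf.mul_(beta).add_(grad)                  # the single momentum buffer
    update = newton_schulz_orthogonalize(momentum_buf)
    scale = max(1.0, weight.size(0) / weight.size(1)) ** 0.5  # common shape-based rescaling
    weight.add_(update, alpha=-lr * scale)
```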

The orthogonalization step is key to Muon’s convergence advantage in pretraining. In practice, gradient updates for 2D weights in transformers tend to have very high condition numbers — they are nearly low-rank, dominated by a few large singular directions. By orthogonalizing the momentum matrix, Muon equalizes all singular values, effectively amplifying rare but important update directions that would otherwise be overshadowed. This leads to better sample efficiency: in NanoGPT speedrunning benchmarks[2](https://github.com/KellerJordan/modded-nanogpt), Muon improved training speed by 35% over AdamW, and at 1.5B parameter scale it reached GPT-2 XL level performance approximately 25% faster than AdamW[1](https://kellerjordan.github.io/posts/muon/).

Muon is used by Keller Jordan’s mod of NanoGPT[2](https://github.com/KellerJordan/modded-nanogpt) and Andrej Karpathy’s nanochat[3](https://github.com/karpathy/nanochat), and a variant of Muon (MuonClip) is used by the production-level LLM Kimi-K2 from MoonShot[4](https://arxiv.org/pdf/2507.20534). More recently, Zhipu AI’s GLM-5 (744B parameters) confirmed the use of the Muon optimizer in both GLM-4.5 and GLM-5 pretraining, along with a “Muon Split” technique that splits MLA up-projection matrices by attention head and orthogonalizes each head independently, addressing a performance gap between MLA and GQA when using Muon[5](https://arxiv.org/abs/2602.15763).

## Muon Optimizer support in DeepSpeed
One of the challenges of applying the Muon optimizer in DeepSpeed is that the existing optimizers (SGD, Adam) see gradients as flattened buffers, so it is hard to swap Muon in at the same place once the gradient buffers have already been flattened. We therefore move the Muon update into the `get_flat_partition` function of the stage 1 and 2 `DeepSpeedZeroOptimizer`, where per-parameter gradients are still unflattened and the Muon update can be applied easily.
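
Because the Newton-Schulz step needs the original 2D shape, the Muon transform has to run per parameter before gradients are flattened into a partition. The snippet below is a simplified, hypothetical illustration of that ordering only (the function name `flatten_with_muon` is made up, and the real logic lives inside DeepSpeed's `get_flat_partition`); it reuses the `newton_schulz_orthogonalize` sketch from the previous section.

```python
import torch

def flatten_with_muon(params, momentum_bufs, beta=0.95):
    # Hypothetical sketch: transform each gradient while it is still 2D,
    # then flatten afterwards. Not DeepSpeed's actual code.
    pieces = []
    for p in params:
        g = p.grad
        if getattr(p, "use_muon", False):             # tagged 2D hidden weight
            buf = momentum_bufs.setdefault(p, torch.zeros_like(g))
            buf.mul_(beta).add_(g)                    # single Muon momentum buffer
            g = newton_schulz_orthogonalize(buf)      # see the sketch above
        pieces.append(g.reshape(-1))
    return torch.cat(pieces)                          # flattening happens only at the end
```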

The Muon optimizer works on hidden 2D gradients. During model engine initialization we walk the model parameters and tag a parameter with `use_muon` if and only if it is 2D and hidden. When the Muon optimizer is selected, any parameter tagged with `use_muon` has its weight updated by Muon.
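
As an illustration, the tagging pass can be thought of as the following parameter walk. This is a sketch under the assumption that "hidden" means not an embedding or output-head weight; the exact rule DeepSpeed applies may differ.

```python
import torch

def tag_muon_params(model: torch.nn.Module, skip_keywords=("embed", "lm_head")):
    # Tag 2D hidden weights for Muon; everything else stays on the Adam path.
    for name, p in model.named_parameters():
        p.use_muon = (p.ndim == 2) and not any(k in name for k in skip_keywords)
    return model
```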

Note that Muon is a hybrid optimizer: it uses Muon updates only for 2D hidden weights and falls back to Adam for all other parameters (embeddings, layer norms, biases, lm_head). The DeepSpeed config supports separate learning rates via `muon_lr` (for Muon parameters) and `adam_lr` (for Adam parameters).
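
For reference, a config along these lines selects the hybrid optimizer and sets the two learning rates. The `muon_lr` and `adam_lr` keys are the ones described above, while the optimizer type string and the surrounding fields are assumptions that should be checked against the DeepSpeed documentation for your version.

```python
# Hypothetical DeepSpeed config sketch (values mirror the experiment below).
ds_config = {
    "train_batch_size": 32,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "gradient_clipping": 1.0,
    "optimizer": {
        "type": "Muon",              # assumed optimizer name
        "params": {
            "muon_lr": 5e-3,         # learning rate for Muon-updated 2D hidden weights
            "adam_lr": 5e-6,         # learning rate for the Adam fallback parameters
            "momentum": 0.95,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01,
        },
    },
}

# model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```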

@@ -26,60 +26,34 @@ cd deepspeed_finetune_demo

## Muon Optimizer Convergence Experiment Results

We compared the Muon optimizer with the AdamW optimizer by finetuning a Qwen2.5-3B model on the tatsu-lab/alpaca dataset. To ensure a fair comparison, we performed learning rate sweeps for both optimizers independently and report results at each optimizer’s best configuration.

**Training Configuration:**
- Model: Qwen2.5-3B
- Dataset: tatsu-lab/alpaca
- ZeRO Stage 2, bf16
- Batch size: 32 (4 per GPU), 8 GPUs (A100 40GB)
- 1 epoch (~1460 steps), eval every 100 steps
- Muon lr: 5e-3, Adam lr: 5e-6 (best Muon configuration from the sweep below)
- LR schedule: constant (no warmup, no decay)
- Gradient clipping: 1.0

**AdamW Optimizer Hyperparameters:**
- betas: (0.9, 0.999)
- eps: 1e-8
- weight_decay: 0.01

**Muon Optimizer Hyperparameters:**
- momentum: 0.95 (Muon parameters)
- betas: (0.9, 0.999) (Adam parameters)
- eps: 1e-8
- weight_decay: 0.01

![Muon optimizer convergence on Qwen2.5-3B](images/muon_loss_3b.png)

Muon optimizer converges smoothly and shows no overfitting during finetuning.

### Tuning Learning Rate for Muon Optimizer

Since Muon is a hybrid optimizer with separate `muon_lr` and `adam_lr`, finding the optimal learning rate combination requires a different approach than a single-optimizer setup.

**Learning Rate Sweep Results:**

For AdamW, we swept lr across {1e-6, 2e-6, 5e-6, 1e-5}. For Muon, we first swept muon_lr across {1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2} with adam_lr=2e-6, then swept adam_lr across {2e-6, 5e-6, 1e-5} with muon_lr=5e-3.

| Optimizer | Learning Rate | Final Eval Loss |
|-----------|---------------|-----------------|
| AdamW | lr=1e-5 | 1.2404 |
| AdamW | lr=5e-6 | 1.2001 |
| **AdamW** | **lr=2e-6** | **1.1842** |
| AdamW | lr=1e-6 | 1.1883 |
| Muon | muon_lr=5e-3, adam_lr=2e-6 | 1.1996 |
| **Muon** | **muon_lr=5e-3, adam_lr=5e-6** | **1.1966** |
| Muon | muon_lr=5e-3, adam_lr=1e-5 | 1.1970 |

**Convergence Trajectory (Best Configuration per Optimizer):**

| Step | AdamW (lr=2e-6) | Muon (muon_lr=5e-3, adam_lr=5e-6) |
|------|-----------------|-----------------------------------|
| 0 | 1.3278 | 1.3300 |
| 100 | 1.2205 | 1.2814 |
| 200 | 1.2101 | 1.2300 |
| 500 | 1.1969 | 1.2107 |
| 1000 | 1.1894 | 1.2009 |
| 1400 | **1.1842** | **1.1966** |

In this finetuning experiment, AdamW achieves a slightly lower final eval loss (1.1842) than Muon (1.1966), and AdamW also converges faster in the early training steps. This result is consistent with the observation that Muon’s strength has been demonstrated primarily in pretraining settings, while finetuning a pretrained model on a small dataset may not fully benefit from Muon’s orthogonalization approach.

We recommend the following two-step process for tuning the two learning rates (a code sketch follows the list):

1. Fix `adam_lr` as a ratio of `muon_lr` (e.g., `adam_lr = muon_lr / 50`), then sweep `muon_lr` to find the best value.
2. With the best `muon_lr` fixed, sweep `adam_lr` to find the optimal combination.
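
A minimal sketch of that two-step sweep is below. Here `run_finetune` is a hypothetical stand-in for whatever launches a finetuning run and returns the final eval loss, and the commented grids mirror the sweep reported above.

```python
def two_step_lr_sweep(run_finetune, muon_lr_grid, adam_lr_grid, ratio=50):
    # Step 1: sweep muon_lr while adam_lr is tied to it by a fixed ratio.
    best_muon_lr = min(muon_lr_grid,
                       key=lambda mlr: run_finetune(muon_lr=mlr, adam_lr=mlr / ratio))
    # Step 2: hold the best muon_lr fixed and sweep adam_lr independently.
    best_adam_lr = min(adam_lr_grid,
                       key=lambda alr: run_finetune(muon_lr=best_muon_lr, adam_lr=alr))
    return best_muon_lr, best_adam_lr

# Example grids similar to the sweep above:
# two_step_lr_sweep(run_finetune,
#                   muon_lr_grid=[1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2],
#                   adam_lr_grid=[2e-6, 5e-6, 1e-5])
```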

## Muon Optimizer Memory Savings
Muon optimizer uses less memory for optimizer states than Adam, because it maintains one momentum buffer per parameter instead of two (first and second moment).

### Memory Usage Comparison
Note that Muon is a hybrid optimizer: 2D hidden weights use Muon (1 buffer), while remaining parameters (embeddings, layer norms, lm_head) still use Adam (2 buffers). The actual memory savings depend on the fraction of parameters that are 2D hidden weights. For typical transformer models, approximately 90% of parameters are 2D hidden weights, so optimizer state memory is reduced by roughly 45%. However, because total GPU memory also includes model weights, gradients, and activations, the end-to-end memory reduction is smaller (see measured results below).
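
As a quick check of that estimate: with roughly 90% of parameters on Muon (one state buffer) and the rest on Adam (two state buffers), the back-of-envelope arithmetic gives about a 45% reduction in optimizer state memory.

```python
# Back-of-envelope check of the optimizer-state saving quoted above.
muon_fraction = 0.90                      # blog's estimate of 2D hidden-weight parameters
adam_buffers, muon_buffers = 2, 1

baseline = adam_buffers                   # every parameter on Adam
hybrid = muon_fraction * muon_buffers + (1 - muon_fraction) * adam_buffers
print(f"optimizer state reduced by {1 - hybrid / baseline:.0%}")   # -> 45%
```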

| Optimizer | State Buffers per Param | Memory per Parameter |
|-----------|------------------------|---------------------|
