
Commit 33d4211

stefpi authored and Awni Hannun committed
docs: extract data parallel training into seperate example doc

1 parent 8d93b91 · commit 33d4211

4 files changed

Lines changed: 99 additions & 84 deletions

File tree

- docs/src/examples/data_parallelism.rst
- docs/src/examples/tensor_parallelism.rst
- docs/src/index.rst
- docs/src/usage/distributed.rst

docs/src/examples/data_parallelism.rst

Lines changed: 90 additions & 0 deletions

@@ -0,0 +1,90 @@
.. _data_parallelism:

Data Parallelism
================

MLX enables efficient data parallel distributed training through its
distributed communication primitives.
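The snippets below rely on the distributed group returned by
:func:`mlx.distributed.init`. As a quick orientation, here is a minimal
sketch (assuming the script is launched with several processes) showing how
each process can query its rank and the size of the group:

.. code:: python

    import mlx.core as mx

    group = mx.distributed.init()  # initialize (or fetch) the global group
    print(f"This is rank {group.rank()} of {group.size()} processes")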
.. _training_example:

Training Example
----------------

In this section we will adapt an MLX training loop to support data parallel
distributed training. Namely, we will average the gradients across a set of
hosts before applying them to the model.

Our training loop looks like the following code snippet if we omit the model,
dataset and optimizer initialization.

.. code:: python

    model = ...
    optimizer = ...
    dataset = ...

    def step(model, x, y):
        loss, grads = loss_grad_fn(model, x, y)
        optimizer.update(model, grads)
        return loss

    for x, y in dataset:
        loss = step(model, x, y)
        mx.eval(loss, model.parameters())
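Note that for data parallelism to help, each process should also iterate over
a different portion of the data; otherwise every host computes identical
gradients. How to do this depends entirely on your data loader. A minimal
sketch, assuming ``dataset`` is a sliceable sequence of ``(x, y)`` pairs:

.. code:: python

    group = mx.distributed.init()

    # Hypothetical sharding: every rank takes a strided slice of the samples.
    dataset = dataset[group.rank()::group.size()]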
All we have to do to average the gradients across machines is perform an
:func:`all_sum` and divide by the size of the :class:`Group`. Namely, we
have to :func:`mlx.utils.tree_map` the gradients with the following function.

.. code:: python

    def all_avg(x):
        return mx.distributed.all_sum(x) / mx.distributed.init().size()
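If you need to average over an explicit sub-group rather than the global
group, :func:`all_sum` also accepts a ``group`` argument. A small sketch (the
helper name is ours):

.. code:: python

    def all_avg_in(x, group):
        # Average over the given group instead of the global one.
        return mx.distributed.all_sum(x, group=group) / group.size()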
Putting everything together, our training loop step looks as follows, with
everything else remaining the same.

.. code:: python

    from mlx.utils import tree_map

    def all_reduce_grads(grads):
        N = mx.distributed.init().size()
        if N == 1:
            return grads
        return tree_map(
            lambda x: mx.distributed.all_sum(x) / N,
            grads
        )

    def step(model, x, y):
        loss, grads = loss_grad_fn(model, x, y)
        grads = all_reduce_grads(grads)  # <--- This line was added
        optimizer.update(model, grads)
        return loss
Utilizing ``nn.average_gradients``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although the code example above works correctly, it performs one communication
per gradient. It is significantly more efficient to aggregate several gradients
together and perform fewer communication steps.
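To make the idea concrete, the following is a rough sketch of such batched
aggregation: flatten the gradient tree, concatenate everything into one
buffer, perform a single :func:`all_sum`, and split the result back. This is
purely illustrative and says nothing about how MLX implements it internally.

.. code:: python

    from itertools import accumulate

    import mlx.core as mx
    from mlx.utils import tree_flatten, tree_unflatten

    def naive_batched_average(grads):
        # Flatten the gradient tree into (path, array) pairs.
        flat = tree_flatten(grads)
        shapes = [v.shape for _, v in flat]
        sizes = [v.size for _, v in flat]

        # One communication for all gradients instead of one per gradient.
        buf = mx.concatenate([v.reshape(-1) for _, v in flat])
        buf = mx.distributed.all_sum(buf) / mx.distributed.init().size()

        # Split the summed buffer back into the original shapes.
        pieces = mx.split(buf, list(accumulate(sizes))[:-1])
        return tree_unflatten(
            [(k, p.reshape(s)) for (k, _), p, s in zip(flat, pieces, shapes)]
        )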
This is the purpose of :func:`mlx.nn.average_gradients`. The final code looks
almost identical to the example above:

.. code:: python

    model = ...
    optimizer = ...
    dataset = ...

    def step(model, x, y):
        loss, grads = loss_grad_fn(model, x, y)
        grads = mlx.nn.average_gradients(grads)  # <---- This line was added
        optimizer.update(model, grads)
        return loss

    for x, y in dataset:
        loss = step(model, x, y)
        mx.eval(loss, model.parameters())
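One practical detail when running this loop across several hosts: because all
processes apply the same averaged gradients, the model parameters stay
synchronized, so it is common to let only one rank handle logging and
checkpointing. A minimal sketch, assuming ``model`` is an
:class:`mlx.nn.Module` and the file name is a placeholder:

.. code:: python

    group = mx.distributed.init()

    if group.rank() == 0:
        # Only one process writes logs and checkpoints to avoid duplicates.
        print(f"loss: {loss.item():.3f}")
        model.save_weights("checkpoint.safetensors")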

docs/src/examples/tensor_parallelism.rst

Lines changed: 4 additions & 2 deletions
@@ -1,3 +1,5 @@
+.. _tensor_parallelism:
+
 Tensor Parallelism
 ==================

@@ -60,7 +62,7 @@ We can create partial inputs based on rank. For example, for an input with 1024
     layer = nn.ShardedToAllLinear(1024, 1024, bias=False)  # initialize the layer
     y = layer(x[part])  # process sharded input

-This code splits the 1024 input features into ``world.size()`` different groups which are assigned continuously based on ``world.rank()``. More information about distributed communication can be found in the :doc:`Distributed Communication <../usage/distributed>` page.
+This code splits the 1024 input features into ``world.size()`` different groups which are assigned continuously based on ``world.rank()``. More information about distributed communication can be found in the :ref:`Distributed Communication <usage_distributed>` page.

 :class:`QuantizedShardedToAllLinear <mlx.nn.QuantizedShardedToAllLinear>` is the quantized equivalent of :class:`mlx.nn.ShardedToAllLinear`.
 Similar to :class:`mlx.nn.QuantizedLinear`, its parameters are frozen and
@@ -117,7 +119,7 @@ LLM Inference with Tensor Parallelism
 We can apply these TP techniques to LLMs in order to enable inference for much larger models by sharding parameters from huge layers across multiple devices.

-To demonstrate this, let's apply TP to the Transformer block of our :doc:`Llama Inference <../examples/llama-inference>` example. In this example, we will use the same inference script as the Llama Inference example, which can be found in `mlx-examples`_.
+To demonstrate this, let's apply TP to the Transformer block of our :doc:`Llama Inference <llama-inference>` example. In this example, we will use the same inference script as the Llama Inference example, which can be found in `mlx-examples`_.

 Our first edit is to initialize the distributed communication group and get the current process rank:

docs/src/index.rst

Lines changed: 1 addition & 0 deletions
@@ -54,6 +54,7 @@ are the CPU and GPU.
    examples/linear_regression
    examples/mlp
    examples/llama-inference
+   examples/data_parallelism
    examples/tensor_parallelism

 .. toctree::

docs/src/usage/distributed.rst

Lines changed: 4 additions & 82 deletions
@@ -117,89 +117,11 @@ The following examples aim to clarify the backend initialization logic in MLX:
     world_ring = mx.distributed.init(backend="ring")
     world_any = mx.distributed.init()  # same as MPI because it was initialized first!

-.. _training_example:
+Distributed Program Examples
+----------------------------

-Training Example
-----------------
-
-In this section we will adapt an MLX training loop to support data parallel
-distributed training. Namely, we will average the gradients across a set of
-hosts before applying them to the model.
-
-Our training loop looks like the following code snippet if we omit the model,
-dataset and optimizer initialization.
-
-.. code:: python
-
-    model = ...
-    optimizer = ...
-    dataset = ...
-
-    def step(model, x, y):
-        loss, grads = loss_grad_fn(model, x, y)
-        optimizer.update(model, grads)
-        return loss
-
-    for x, y in dataset:
-        loss = step(model, x, y)
-        mx.eval(loss, model.parameters())
-
-All we have to do to average the gradients across machines is perform an
-:func:`all_sum` and divide by the size of the :class:`Group`. Namely we
-have to :func:`mlx.utils.tree_map` the gradients with following function.
-
-.. code:: python
-
-    def all_avg(x):
-        return mx.distributed.all_sum(x) / mx.distributed.init().size()
-
-Putting everything together our training loop step looks as follows with
-everything else remaining the same.
-
-.. code:: python
-
-    from mlx.utils import tree_map
-
-    def all_reduce_grads(grads):
-        N = mx.distributed.init().size()
-        if N == 1:
-            return grads
-        return tree_map(
-            lambda x: mx.distributed.all_sum(x) / N,
-            grads
-        )
-
-    def step(model, x, y):
-        loss, grads = loss_grad_fn(model, x, y)
-        grads = all_reduce_grads(grads) # <--- This line was added
-        optimizer.update(model, grads)
-        return loss
-
-Utilizing ``nn.average_gradients``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Although the code example above works correctly; it performs one communication
-per gradient. It is significantly more efficient to aggregate several gradients
-together and perform fewer communication steps.
-
-This is the purpose of :func:`mlx.nn.average_gradients`. The final code looks
-almost identical to the example above:
-
-.. code:: python
-
-    model = ...
-    optimizer = ...
-    dataset = ...
-
-    def step(model, x, y):
-        loss, grads = loss_grad_fn(model, x, y)
-        grads = mx.nn.average_gradients(grads) # <---- This line was added
-        optimizer.update(model, grads)
-        return loss
-
-    for x, y in dataset:
-        loss = step(model, x, y)
-        mx.eval(loss, model.parameters())
+- :ref:`Data Parallelism <data_parallelism>`
+- :ref:`Tensor Parallelism <tensor_parallelism>`

 .. _ring_section: