the first native PyTorch distributed training backend for Apple Silicon #3802
ksasso1028 started this conversation in Show and tell
I spent way too much time building MCCL - a PyTorch backend that lets you train models across multiple Macs connected with a Thunderbolt cable.
Before you get excited: it's roughly 10x slower than just using one GPU. This is not a performance hack.
I started this because I was curious if you could actually make two MacBooks work together for ML training, and I wanted to understand how PyTorch's distributed backends work. Turns out you can, but it involves a ridiculous amount of plumbing.
The setup is pretty straightforward - you connect two Macs with Thunderbolt, run standard PyTorch DDP code, and it actually works. The backend handles TCP over the Thunderbolt connection, using Accelerate for fp32 math and Metal shaders for the fp16 path.
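To make the plumbing concrete, here's a minimal sketch (not MCCL's actual code, just an illustration of the idea) of the core collective such a backend has to implement: two ranks exchange gradient buffers over a TCP socket and average them - the same allreduce DDP runs after `backward()`. Both ranks live in one process over localhost here; on real hardware the socket would sit on the Thunderbolt bridge interface.

```python
# Toy allreduce-over-TCP: the heart of syncing gradients between two Macs.
# Hypothetical helper names; real backends work on tensors, not JSON lists.
import json
import socket
import threading

def _recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def allreduce_avg(sock, grads):
    """Send local grads, receive the peer's, return the element-wise mean."""
    payload = json.dumps(grads).encode()
    sock.sendall(len(payload).to_bytes(4, "big") + payload)
    n = int.from_bytes(_recv_exact(sock, 4), "big")
    peer = json.loads(_recv_exact(sock, n))
    return [(a + b) / 2 for a, b in zip(grads, peer)]

def run_pair(grads0, grads1):
    """Simulate rank 0 (listener) and rank 1 (dialer) on localhost."""
    server = socket.create_server(("127.0.0.1", 0))
    port = server.getsockname()[1]
    out = {}

    def rank0():
        conn, _ = server.accept()
        with conn:
            out[0] = allreduce_avg(conn, grads0)

    t = threading.Thread(target=rank0)
    t.start()
    with socket.create_connection(("127.0.0.1", port)) as c:
        out[1] = allreduce_avg(c, grads1)
    t.join()
    server.close()
    return out

if __name__ == "__main__":
    result = run_pair([1.0, 2.0], [3.0, 6.0])
    print(result[0], result[1])  # both ranks converge on [2.0, 4.0]
```

After the exchange, both ranks hold identical averaged gradients, which is exactly the invariant DDP needs before the optimizer step.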
There's a demo video in the repo showing it working: https://github.com/mps-ddp/mccl
I tested it on M1 Max + M4 Max MacBooks. Getting the gradients to sync properly across machines was surprisingly satisfying, even though the whole thing is completely impractical.
Could it be faster? Maybe with RDMA over Thunderbolt 5 or better algorithms, but honestly I just wanted to see if I could make it work at all.
I'm definitely looking for additional eyes from experts who really know what they're doing; integrating into PyTorch could be interesting... let's chat!
cheers!