the first native PyTorch distributed training backend for Apple Silicon #3802
ksasso1028 started this conversation in Show and tell
I spent way too much time building MCCL - a PyTorch backend that lets you train models across multiple Macs connected with a Thunderbolt cable.
Before you get excited: it's roughly 10x slower than just using one GPU. This is not a performance hack.
I started this because I was curious if you could actually make two MacBooks work together for ML training, and I wanted to understand how PyTorch's distributed backends work. Turns out you can, but it involves a ridiculous amount of plumbing.
The setup is pretty straightforward - you connect two Macs with Thunderbolt, run standard PyTorch DDP code, and it actually works. The backend handles TCP over the Thunderbolt connection, using Accelerate for fp32 math and Metal shaders for the fp16 path.
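To make the plumbing concrete, here's a minimal sketch (not MCCL's actual code, just an illustration of the idea) of the core collective such a backend has to implement: two ranks exchange gradient buffers over a TCP socket and average them - the same allreduce DDP runs after `backward()`. Both ranks live in one process over localhost here; on real hardware the socket would sit on the Thunderbolt bridge interface.

```python
# Toy allreduce-over-TCP: the heart of syncing gradients between two Macs.
# Hypothetical helper names; real backends work on tensors, not JSON lists.
import json
import socket
import threading

def _recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def allreduce_avg(sock, grads):
    """Send local grads, receive the peer's, return the element-wise mean."""
    payload = json.dumps(grads).encode()
    sock.sendall(len(payload).to_bytes(4, "big") + payload)
    n = int.from_bytes(_recv_exact(sock, 4), "big")
    peer = json.loads(_recv_exact(sock, n))
    return [(a + b) / 2 for a, b in zip(grads, peer)]

def run_pair(grads0, grads1):
    """Simulate rank 0 (listener) and rank 1 (dialer) on localhost."""
    server = socket.create_server(("127.0.0.1", 0))
    port = server.getsockname()[1]
    out = {}

    def rank0():
        conn, _ = server.accept()
        with conn:
            out[0] = allreduce_avg(conn, grads0)

    t = threading.Thread(target=rank0)
    t.start()
    with socket.create_connection(("127.0.0.1", port)) as c:
        out[1] = allreduce_avg(c, grads1)
    t.join()
    server.close()
    return out

if __name__ == "__main__":
    result = run_pair([1.0, 2.0], [3.0, 6.0])
    print(result[0], result[1])  # both ranks converge on [2.0, 4.0]
```

After the exchange, both ranks hold identical averaged gradients, which is exactly the invariant DDP needs before the optimizer step.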
There's a demo video in the repo showing it working: https://github.com/mps-ddp/mccl
I tested it on M1 Max + M4 Max MacBooks. Getting the gradients to sync properly across machines was surprisingly satisfying, even though the whole thing is completely impractical.
Could it be faster? Maybe with RDMA over Thunderbolt 5 or better algorithms, but honestly I just wanted to see if I could make it work at all.
I'm definitely looking for additional eyes from experts who really know what they're doing; integrating into PyTorch could be interesting... let's chat!
cheers!