mlx-gpt

This is a recreation of the GPT-2 architecture using the MLX library, closely following along with the organization and commit history of Andrej Karpathy's build-nanogpt repository. While his repository focuses a lot on optimization using CUDA on PyTorch, MLX build's out optimizations using Metal and take advantage of Apple silicon's unified memory architecture. All code execution was tested on an Apple M1 Max Macbook Pro.

This project came about due to an interest in optimizing neural network model training for Apple M-Series chips and the MLX library. Obviously I don't expect to see quite as much of performance gains as you could squeeze out of using CUDA on PyTorch with 8 A100 GPUs, but I'm interested to see what is even possible on these tiny chips.

run

Using a device with Apple silicon (any M-series chip):

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

To start training the model, use python train_gpt2.py

If you would like to use distributed communication, install openmpi using brew install openmpi and execute the following:

mpirun -np <NUM_PROCESSES> -x DYLD_LIBRARY_PATH=/opt/homebrew/lib python train_gpt2.py

where <NUM_PROCESSES> is the amount of processes to distribute your work into. For example, if you were to use -np 2 on your local machine, two separate Python processes would spin up and execute at the same time to train the model.

The main use case of openmpi is to distribute workloads across different M-series machines. For more information see the MLX documentation.

notes

Alongside this repository is a notes document containing my experiences/struggles with the implementation of each commit. It hopefully should also highlight some of the key differences between implementing a neural network in MLX vs. PyTorch, and the benefits/consequences that come with it.

potential improvements

I never was able to figure out why I was unable to crush the training loss on a dataset of ~1000 lines of text. If I am to revisit this project in the future, I would compare each of the gradients to either a manual backprop implementation or the PyTorch gradients.

Gain access to another M-series system (maybe just a M1 Pro Mac Mini) to test the implementation of distributed communication. It's been verified to work locally, but not across multiple systems. This would also enable longer training since each step would also be a lot faster.

For some reason using bfloat16 actually slowed down training. Because of this, it was left out of the final implementation. There should be some way to use bfloat16 to give some kind of speed improvement during training.

references

I was able to read into how some others approached this same task via their own repositories:

vithursant: nanoGPT_mlx
pranavjad: mlx-gpt2
dx-dtran: gpt2-mlx

And of course, this repository wouldn't have been possible to create without Andrej Karpathy's original work in creating nanoGPT and the Zero-to-Hero series.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.gitignore		.gitignore
README.md		README.md
notes.md		notes.md
playground.ipynb		playground.ipynb
requirements.txt		requirements.txt
train.txt		train.txt
train_gpt2.py		train_gpt2.py
val.txt		val.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mlx-gpt

run

notes

potential improvements

references

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

mlx-gpt

run

notes

potential improvements

references

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages