A from-scratch implementation of GPT-2 in PyTorch, built as a learning exercise following Andrej Karpathy's nanoGPT.
I wanted to understand how LLMs work at the code level after learning the concepts. So I read through nanoGPT and implemented it myself, writing each component from scratch; the model itself is kept to the bare minimum.
The architecture is GPT-2 small (124M parameters): token and positional embeddings, causal self-attention with multiple heads, feedforward MLP blocks, residual connections, and layer normalization throughout.
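The pieces listed above fit together in a standard pre-norm transformer block. As a rough sketch (names like `n_embd` and `n_head` follow nanoGPT conventions and are assumptions, not this repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head attention with a causal mask (sketch, not the repo's exact code)."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection
        # lower-triangular mask: each position may only attend to itself and the past
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) so each head attends independently
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

class Block(nn.Module):
    """Pre-norm transformer block: attention and MLP, each wrapped in a residual."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # GPT-2 uses a 4x hidden expansion
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual connection around attention
        x = x + self.mlp(self.ln_2(x))   # residual connection around MLP
        return x

x = torch.randn(2, 8, 64)                          # (batch, time, channels)
y = Block(n_embd=64, n_head=4, block_size=8)(x)    # shape is preserved: (2, 8, 64)
```

GPT-2 small stacks 12 such blocks with `n_embd=768` and `n_head=12` on top of the token and positional embeddings.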
Set up the environment:

```sh
uv init
uv python pin 3.11
uv add torch --index-url https://download.pytorch.org/whl/cu124
uv add tiktoken numpy
```

Prepare the data first:

```sh
python prepare.py
```

Then train:

```sh
python train.py
```

Trained on an NVIDIA A40 (48GB) via HPC. Checkpoints are saved to `ckpt.pt` every 500 steps. An example SLURM script is included as well.
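For reference, a SLURM submission for a run like this might look as follows. This is a hypothetical sketch: partition names, GPU resource strings, and module commands vary by cluster, so check the script shipped in the repo for the real values.

```shell
#!/bin/bash
#SBATCH --job-name=gpt2-train
#SBATCH --gres=gpu:a40:1        # request one A40 (resource string is cluster-specific)
#SBATCH --cpus-per-task=8
#SBATCH --mem=48G
#SBATCH --time=24:00:00
#SBATCH --output=train_%j.log   # %j expands to the job ID

# launch training; resumes are possible because ckpt.pt is written every 500 steps
python train.py
```

Submit it with `sbatch` and monitor the queue with `squeue -u $USER`.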
- nanoGPT by Andrej Karpathy
- Attention Is All You Need