A modern language model implementation from scratch, incorporating insights from recent research papers.
This project is available under a dual license:
Free for: Academic research, educational purposes, personal projects, open source contributions, and non-commercial use.
Requires a commercial license for: Commercial products, SaaS applications, revenue-generating applications, and any profit-making use.
Contact: carlos.gutierrez@carg.dev for commercial licensing inquiries.
Please cite this work in academic publications.
See LICENSE or LICENSE.txt for the full license text.
- What is This?
- Features
- Quick Start
- Mathematical Foundations
- Architecture Explained
- Project Structure
- Installation
- Usage
- Configuration
- Diagrams
- References
A Transformer-based language model implementing autoregressive next-token prediction using multi-head self-attention, positional encoding, and modern training optimizations (mixed precision, gradient accumulation, KV caching).
The model learns to write by reading large amounts of text, discovering patterns like "after 'the cat' usually comes 'sat' or 'ran'", enabling it to generate coherent sentences. It processes text sequentially, predicting each next word based on the context provided by previous words.
- Transformer Architecture: Multi-head self-attention mechanism from "Attention Is All You Need"
- Long Context Support: Efficient handling of long sequences with Rotary Positional Encoding (RoPE)
- Training Optimizations: Mixed precision training, gradient accumulation, gradient clipping
- Modern Best Practices: Pre-norm architecture, GELU activation, weight tying
- Comprehensive Evaluation: Perplexity, accuracy metrics, and generation utilities
Option 1: Automated Setup (Recommended)
# Run the setup script - it handles everything automatically!
./setup.sh
# Then activate the virtual environment
source venv/bin/activate
# Download data and train
python3 download_large_data.py wiki --version 103
python3 train.py --data data/wikitext_103.txt --config config.json --device cuda

Option 2: Manual Setup
# 1. Create and activate virtual environment (REQUIRED on modern Linux systems)
python3 -m venv venv
source venv/bin/activate
# 2. Upgrade pip
pip install --upgrade pip
# 3. Install dependencies
pip install -r requirements.txt
# 4. Download data
python3 download_large_data.py wiki --version 103
# 5. Train
python3 train.py --data data/wikitext_103.txt --config config.json --device cuda
# 6. Generate
python3 inference.py --checkpoint checkpoints/checkpoint_epoch_10.pt \
    --prompt "The future of artificial intelligence" --device cuda

Note: On modern Debian/Ubuntu systems (especially Python 3.12+), pip prevents system-wide installations to protect system packages. Always use a virtual environment (python3 -m venv venv) before installing dependencies.
See GETTING_STARTED.md for detailed instructions.
Words are converted into numerical representations that the model can process. Each word (token) is assigned a unique ID, which is then mapped to a dense vector representation - think of it as converting words into a format the computer can understand and manipulate mathematically.
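Mathematical Formulation:
Given a vocabulary of size $V$ and an embedding dimension $d_{\text{model}}$, each token ID $t_i$ selects one row of the embedding matrix:

$$\mathbf{e}_i = \mathbf{E}[t_i]$$

where $\mathbf{E} \in \mathbb{R}^{V \times d_{\text{model}}}$ is the learned embedding matrix and $\mathbf{e}_i \in \mathbb{R}^{d_{\text{model}}}$ is the vector for token $t_i$.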
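The lookup described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `make_embedding_table` and `embed`, the table size, and the random initialization are assumptions, not the project's actual code (which lives in `models/transformer.py`).

```python
import random

def make_embedding_table(vocab_size, d_model, seed=0):
    """Build a vocab_size x d_model table of small random floats.

    Hypothetical stand-in for the learned matrix E; real weights come from training.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Return one row of the table per token ID (e_i = E[t_i])."""
    return [table[t] for t in token_ids]

E = make_embedding_table(vocab_size=300, d_model=8)
token_ids = [1, 45, 123, 67, 45, 234]  # IDs from the tokenization diagram in this README
vectors = embed(token_ids, E)
```

Note that the two occurrences of token 45 map to the same vector; only the positional encoding tells them apart.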
Word order is crucial in language - "Cat bites dog" is very different from "Dog bites cat". Since transformers process all tokens simultaneously, we need to inject positional information so the model knows which word comes first, second, third, etc.
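Mathematical Formulation:
Sinusoidal Positional Encoding (for position $pos$ and dimension index $i$):

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

The final embedding combines token and position:

$$\mathbf{h}_i = \mathbf{E}[t_i] + PE(i)$$

where $\mathbf{E}[t_i]$ is the token embedding and $PE(i)$ is the encoding of position $i$.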
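As a minimal sketch, the sinusoidal encoding can be computed with the standard library alone. The function names `positional_encoding` and `add_position` are illustrative assumptions; the loop variable `i` below walks the even indices, so it plays the role of $2i$ in the formula.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i is the even index, i.e. "2i" in the formula
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_position(embeddings, pe):
    """Element-wise sum h_i = E[t_i] + PE(i) for each position i."""
    return [[e + p for e, p in zip(row, pe[i])] for i, row in enumerate(embeddings)]

pe = positional_encoding(seq_len=4, d_model=6)
h = add_position([[0.5] * 6 for _ in range(4)], pe)
```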
Attention mechanisms allow the model to understand relationships between words in a sentence. When reading "The cat sat on the mat", the model learns to connect "cat" with "sat" because they're related. Multi-head attention enables the model to focus on different types of relationships simultaneously - syntax, semantics, and context.
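Mathematical Formulation:
Given input $\mathbf{X} \in \mathbb{R}^{n \times d_{\text{model}}}$:
- Project to Query, Key, Value:
  $$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}_V$$
- Scaled Dot-Product Attention:
  $$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$
- Multi-Head Attention splits into $h$ heads, each of size $d_k = d_{\text{model}} / h$:
  $$\text{head}_i = \text{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i), \quad \text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,\mathbf{W}_O$$
- Causal Masking (for autoregressive generation): a mask $\mathbf{M}$ is added to the scaled scores before the softmax so that position $i$ cannot attend to later positions $j$:
  $$M_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}$$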
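The following is a minimal single-head sketch of scaled dot-product attention with a causal mask, written in plain Python for readability. The function names and the toy input are assumptions for illustration; the project's real implementation is in `models/attention.py` and will differ (batching, multiple heads, learned projections).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of n row vectors. Returns (outputs, weights).
    """
    n, d_k = len(Q), len(Q[0])
    outputs, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            if j > i:
                scores.append(float("-inf"))  # future position: masked out
            else:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                scores.append(dot / math.sqrt(d_k))
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(row[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))])
    return outputs, weights

# Toy input: Q, K and V are all set to X (no learned projections in this sketch).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(X, X, X)
```

Because of the mask, the first position attends only to itself, so its output equals its own value vector.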
After attention identifies which words relate to each other, the feed-forward network performs non-linear transformations on the information. This step allows the model to synthesize and combine the attended information, similar to mixing ingredients to create a final output.
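Mathematical Formulation:

$$\text{FFN}(\mathbf{x}) = \text{GELU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$$

GELU Activation:

$$\text{GELU}(x) = x \cdot \Phi(x) = \frac{x}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution.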
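A small sketch of the feed-forward step, assuming the exact (erf-based) form of GELU; many implementations use a tanh approximation instead, and this README does not say which one the project uses. The helper names `linear` and `feed_forward` are illustrative.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """Compute xW + b for a vector x, a d_in x d_out matrix W and a bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = GELU(x W1 + b1) W2 + b2."""
    hidden = [gelu(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

# Identity weights make the result easy to inspect: the output is GELU applied to x.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
out = feed_forward([1.0, -1.0], identity, zeros, identity, zeros)
```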
For Kids: Imagine building with LEGO blocks. Each block does something special (attention or thinking), and you can stack many blocks on top of each other to build something amazing!
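Mathematical Formulation:
For a transformer block with input $\mathbf{x}$ (pre-norm: normalization is applied before each sublayer):
- Self-Attention Sublayer:
  $$\mathbf{x}' = \mathbf{x} + \text{Dropout}(\text{MultiHead}(\text{LayerNorm}(\mathbf{x})))$$
- Feed-Forward Sublayer:
  $$\mathbf{y} = \mathbf{x}' + \text{Dropout}(\text{FFN}(\text{LayerNorm}(\mathbf{x}')))$$

Layer Normalization:

$$\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance of $\mathbf{x}$ over the feature dimension, and $\gamma$, $\beta$ are learned scale and shift parameters.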
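The pre-norm wiring can be sketched for a single token vector as follows. Dropout is omitted, and `attention_fn` and `ffn_fn` are hypothetical placeholders for the two sublayers, so this only shows the order of normalization and residual addition rather than the project's `TransformerBlock`.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize a vector to zero mean and unit variance, then scale and shift."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    gamma = gamma or [1.0] * d
    beta = beta or [0.0] * d
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]

def pre_norm_block(x, attention_fn, ffn_fn):
    """x' = x + Attn(LN(x)); y = x' + FFN(LN(x')). Dropout is left out of this sketch."""
    x = [a + b for a, b in zip(x, attention_fn(layer_norm(x)))]
    x = [a + b for a, b in zip(x, ffn_fn(layer_norm(x)))]
    return x

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])

# With sublayers that output zeros, the residual path passes the input through unchanged.
def zero_sublayer(v):
    return [0.0] * len(v)

passthrough = pre_norm_block([1.0, 2.0], zero_sublayer, zero_sublayer)
```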
For Kids: Think of it like an assembly line! The words go through many stations (layers), each doing something special, until finally we get a prediction of what word comes next.
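Mathematical Formulation:
Given input token sequence $\mathbf{t} = (t_1, \ldots, t_n)$ and a stack of $L$ transformer blocks:

$$\mathbf{h}^{(0)} = \mathbf{E}[\mathbf{t}] + PE$$

$$\mathbf{h}^{(l)} = \text{Block}_l\left(\mathbf{h}^{(l-1)}\right), \quad l = 1, \ldots, L$$

$$\mathbf{y} = \text{LayerNorm}\left(\mathbf{h}^{(L)}\right)\mathbf{W}_{out}, \qquad p(t \mid \text{context}) = \text{softmax}(\mathbf{y})$$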
For Kids: When we train, we show the computer a sentence and ask "what word comes next?" If it guesses wrong, we say "try again!" and it learns from its mistake.
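Mathematical Formulation:
For a sequence of tokens $(t_1, \ldots, t_n)$, the cross-entropy loss is the average negative log-probability the model assigns to each true next token:

$$\mathcal{L} = -\frac{1}{n-1}\sum_{i=1}^{n-1} \log p(t_{i+1} \mid t_1, \ldots, t_i)$$

Perplexity (measure of model confidence):

$$\text{PPL} = \exp(\mathcal{L})$$

Lower perplexity = better model!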
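To make the relation between loss and perplexity concrete, here is a small sketch that computes both from raw logits. The function names and the toy logits are assumptions for illustration; the project's own evaluation code is in `utils.py`.

```python
import math

def cross_entropy(logits_per_step, targets):
    """Average negative log-probability of the target token at each step."""
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[target]  # equals -log softmax(logits)[target]
    return total / len(targets)

def perplexity(loss):
    """Perplexity is the exponential of the average cross-entropy loss."""
    return math.exp(loss)

# A model that guesses uniformly over 4 tokens has loss log(4) and perplexity of about 4.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = cross_entropy(uniform_logits, targets=[0, 1, 2])
ppl = perplexity(loss)
```

A useful reading of perplexity: it is the number of equally likely tokens the model is effectively choosing between at each step.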
For Kids: The computer writes one word at a time. After each word, it thinks "what word makes sense next?" and picks one.
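Mathematical Formulation:
Given prompt $\mathbf{p} = (t_1, \ldots, t_k)$, generate $n$ new tokens:
- Initialize: $\mathbf{s} = \mathbf{p}$
- For $i = k+1, \ldots, k+n$:
  $$\mathbf{y} = \text{Model}(\mathbf{s}), \quad P = \text{softmax}(\mathbf{y} / \tau), \quad t_i \sim P, \quad \mathbf{s} = \mathbf{s} \cup \{t_i\}$$

where $\tau$ is the sampling temperature.

Top-k Sampling: keep only the tokens with the highest probabilities (up to the top-k cutoff), renormalize, and sample from them.

Top-p (Nucleus) Sampling:
Find smallest set $S$ of most likely tokens such that $\sum_{t \in S} P(t) \geq p$, then renormalize and sample from $S$.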
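A minimal sketch of one sampling step with temperature, top-k and top-p filtering is shown below. The function name `sample_next_token` and its parameters are assumptions, not the interface of `inference.py`; in this sketch the nucleus mass is measured on the temperature-scaled distribution, before any renormalization.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token ID from logits after temperature, top-k and top-p filtering."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate token IDs, most likely first.
    order = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for t in order:  # smallest prefix whose probability mass reaches top_p
            kept.append(t)
            cumulative += probs[t]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices treats weights as relative, which renormalizes the kept tokens.
    weights = [probs[t] for t in order]
    return rng.choices(order, weights=weights, k=1)[0]

greedy_like = sample_next_token([0.1, 5.0, 0.2], top_k=1)
```

With `top_k=1` the step becomes greedy decoding; lower temperatures sharpen the distribution and higher ones flatten it.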
For Kids: Learning is like climbing a mountain. You take small steps in the right direction. AdamW is like having a smart guide that knows which direction is best and adjusts your step size automatically!
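Mathematical Formulation:
For parameter $\theta$ with gradient $g_t = \nabla_\theta \mathcal{L}$ at step $t$:

Momentum:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

RMSprop:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

Bias Correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Parameter Update:

$$\theta_t = \theta_{t-1} - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1}\right)$$

where $\eta$ is the learning rate, $\beta_1$ and $\beta_2$ are the decay rates of the moment estimates, $\epsilon$ is a small constant for numerical stability, and $\lambda$ is the weight decay coefficient.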
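The update can be written out as a short function over lists of floats. This is a sketch of the standard AdamW rule with decoupled weight decay; the name `adamw_step` is illustrative and the default hyperparameters are common choices, not values taken from this project's `config.json`.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update. t is the 1-based step count. Returns (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g       # momentum (first moment)
        v_i = beta2 * v_i + (1 - beta2) * g * g   # RMSprop-style second moment
        m_hat = m_i / (1 - beta1 ** t)            # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        # Weight decay is applied to the parameter directly, not mixed into the gradient.
        p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
        new_theta.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

theta, m, v = adamw_step([1.0], [0.5], [0.0], [0.0], t=1, weight_decay=0.0)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate.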
For Kids: Sometimes the computer gets too excited and tries to learn too fast (like running too fast down a hill). We slow it down so it doesn't fall!
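Mathematical Formulation:

$$\mathbf{g} \leftarrow \mathbf{g} \cdot \min\left(1, \frac{c}{\lVert \mathbf{g} \rVert_2}\right)$$

where $\mathbf{g}$ is the gradient over all parameters and $c$ is the maximum allowed gradient norm.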
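Clipping by global norm is short enough to show in full. This sketch assumes the gradient is a flat list of floats; the function name `clip_grad_norm` is illustrative, and the small constant in the denominator is an assumption added here only to avoid division by zero.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm. Returns (clipped, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # 1e-6 guards against a zero norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)
```

The direction of the gradient is preserved; only its length shrinks when it exceeds the threshold.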
For Kids:
Text Input → Turn into Numbers → Thinking Layers → Predict Next Word → Generate Text
For Scientists:
Token IDs → Embeddings → Positional Encoding → N× Transformer Blocks → Output Projection → Logits → Sampling → Text
┌─────────────────────────────────────────────────────────────────┐
│                           INPUT TEXT                            │
│                    "The cat sat on the mat"                     │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                          TOKENIZATION                           │
│         [1, 45, 123, 67, 45, 234]  (each word → number)         │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                         TOKEN EMBEDDING                         │
│    E ∈ ℝ^(V×d_model)  |  Each token → vector of size d_model    │
│     [1, 45, ...] → [[0.1, 0.3, ...], [0.2, -0.1, ...], ...]     │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                       POSITIONAL ENCODING                       │
│           PE(pos, 2i) = sin(pos / 10000^(2i/d_model))           │
│            h_i = E[t_i] + PE(i)  (add position info)            │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                       TRANSFORMER BLOCK 1                       │
│        ┌──────────────────────────────────────────────┐         │
│        │        Pre-Norm Multi-Head Attention         │         │
│        │   Attention(Q, K, V) = softmax(QK^T/√d_k)V   │         │
│        │            + Residual Connection             │         │
│        └──────────────────────────────────────────────┘         │
│        ┌──────────────────────────────────────────────┐         │
│        │        Pre-Norm Feed-Forward Network         │         │
│        │        FFN(x) = GELU(xW₁ + b₁)W₂ + b₂        │         │
│        │            + Residual Connection             │         │
│        └──────────────────────────────────────────────┘         │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                       TRANSFORMER BLOCK 2                       │
│                   (Same structure as Block 1)                   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
               ... (N-2 more blocks) ...
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                       TRANSFORMER BLOCK N                       │
│                        (Same structure)                         │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                        FINAL LAYER NORM                         │
│                         LayerNorm(h_L)                          │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                        OUTPUT PROJECTION                        │
│           y = h_L · W_out  (W_out ∈ ℝ^(d_model × V))            │
│           Output: logits for each token in vocabulary           │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                             SOFTMAX                             │
│                   p(t | context) = softmax(y)                   │
│            Probability distribution over vocabulary             │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                            SAMPLING                             │
│     t_{next} ~ Sample(p)  (with temperature, top-k, top-p)      │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                           OUTPUT TEXT                           │
│                 "The cat sat on the mat and..."                 │
└─────────────────────────────────────────────────────────────────┘
sheepOp/
├── models/                      # Model architectures
│   ├── __init__.py              # Module exports
│   ├── transformer.py           # Main transformer model
│   ├── attention.py             # Attention mechanisms
│   ├── blocks.py                # Building blocks (FFN, TransformerBlock)
│   ├── optimized_attention.py   # KV caching, optimized inference
│   └── prefetching.py           # Data prefetching utilities
├── data/                        # Data loading utilities
│   └── __init__.py              # Dataset and tokenizer
├── training/                    # Training utilities
│   ├── __init__.py              # Trainer class
│   └── metrics.py               # Training metrics and plotting
├── config.py                    # Configuration management
├── config.json                  # Example configuration file
├── train.py                     # Training script
├── inference.py                 # Inference script
├── example.py                   # Usage examples
├── utils.py                     # Evaluation utilities
└── requirements.txt             # Dependencies
Important: On modern Debian/Ubuntu systems (Python 3.12+), you must use a virtual environment. The system prevents system-wide pip installations to protect system packages.
# 1. Create virtual environment
python3 -m venv venv
# 2. Activate virtual environment
source venv/bin/activate
# 3. Upgrade pip
pip install --upgrade pip
# 4. Install dependencies
pip install -r requirements.txt
# 5. For large dataset downloads (optional)
pip install datasets

Alternative: Use the automated setup script, which handles everything:
./setup.sh
source venv/bin/activate

python3 train.py \
    --data data/amazon_reviews.txt \
    --config config.json \
    --device cuda

python3 inference.py \
    --checkpoint checkpoints/checkpoint_epoch_10.pt \
    --prompt "The future of artificial intelligence" \
    --optimized \
    --device cuda

See config.json for all available settings. Key parameters:
- Model: vocab_size, d_model, num_layers, num_heads
- Training: batch_size, learning_rate, max_epochs
- Data: max_length, data_dir
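As a rough illustration of how these parameters fit together, the sketch below parses a configuration and derives the per-head dimension. Only the key names come from the list above; the grouping into `model`, `training` and `data` sections, every value shown, and the `load_config` helper are assumptions, so check `config.json` and `config.py` for the real layout.

```python
import json

# Hypothetical example: the section names and values are illustrative only.
EXAMPLE_CONFIG = """
{
  "model": {"vocab_size": 50000, "d_model": 512, "num_layers": 6, "num_heads": 8},
  "training": {"batch_size": 32, "learning_rate": 0.0003, "max_epochs": 10},
  "data": {"max_length": 512, "data_dir": "data"}
}
"""

def load_config(text):
    """Parse the JSON text and check that attention heads divide d_model evenly."""
    config = json.loads(text)
    model = config["model"]
    if model["d_model"] % model["num_heads"] != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return config

config = load_config(EXAMPLE_CONFIG)
head_dim = config["model"]["d_model"] // config["model"]["num_heads"]
```

The divisibility check reflects how multi-head attention splits `d_model` into `num_heads` equal slices.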
graph TB
subgraph "Input Processing"
A[Text Input] --> B[Tokenization]
B --> C[Token Embedding<br/>E ∈ ℝ^V×d]
C --> D[Positional Encoding<br/>PE pos,2i = sin pos/10000^2i/d]
D --> E[Embedding + Position<br/>h = E + PE]
end
subgraph "Transformer Stack"
E --> F[Transformer Block 1]
F --> G[Transformer Block 2]
G --> H[...]
H --> I[Transformer Block N]
end
subgraph "Transformer Block Detail"
F --> F1[Layer Norm 1]
F1 --> F2[Multi-Head Attention<br/>QK^T/√d_k → softmax → V]
F2 --> F3[Residual + Dropout]
F --> F3
F3 --> F4[Layer Norm 2]
F4 --> F5[Feed-Forward<br/>GELU xW₁ + b₁W₂ + b₂]
F5 --> F6[Residual + Dropout]
F3 --> F6
end
subgraph "Output Processing"
I --> J[Final Layer Norm]
J --> K[Output Projection<br/>y = hW_out]
K --> L[Softmax<br/>p = softmax y]
L --> M[Sampling<br/>t ~ p with temp, top-k, top-p]
M --> N[Generated Text]
end
style A fill:#e1f5ff
style N fill:#fff4e1
style F fill:#e8f5e9
style F2 fill:#ffe1f5
style F5 fill:#ffe1f5
graph LR
subgraph "Input"
X[Input Tokens<br/>x₁, x₂, ..., xₙ]
end
subgraph "Q, K, V Projections"
X --> Q[Query Q<br/>Q = XW_Q]
X --> K[Key K<br/>K = XW_K]
X --> V[Value V<br/>V = XW_V]
end
subgraph "Attention Computation"
Q --> M[Matrix Multiply<br/>QK^T]
K --> M
M --> S[Scale<br/>÷√d_k]
S --> Mask{Causal Mask?}
Mask -->|Yes| CM[Mask<br/>M_ij = -∞ if i < j]
Mask -->|No| SM[Skip Mask]
CM --> Soft[Softmax<br/>exp scores / Σexp]
SM --> Soft
Soft --> WV[Weighted Sum<br/>Attention = softmax scores × V]
V --> WV
end
subgraph "Multi-Head"
WV --> Split[Split into h heads]
Split --> H1[Head 1]
Split --> H2[Head 2]
Split --> H3[...]
Split --> Hh[Head h]
H1 --> Concat[Concatenate]
H2 --> Concat
H3 --> Concat
Hh --> Concat
Concat --> Out[Output Projection<br/>WO]
end
style X fill:#e1f5ff
style Out fill:#fff4e1
style Soft fill:#ffe1f5
style WV fill:#e8f5e9
graph TD
A[Start Training] --> B[Load Data]
B --> C[Create Tokenizer]
C --> D[Build Vocabulary]
D --> E[Create DataLoader]
E --> F[Initialize Model]
F --> G[Setup Optimizer<br/>AdamW]
G --> H[Setup LR Scheduler<br/>Cosine Annealing]
H --> I[For Each Epoch]
I --> J[For Each Batch]
J --> K[Forward Pass<br/>y = Model x]
K --> L[Compute Loss<br/>L = -log p t_next]
L --> M[Backward Pass<br/>∇L]
M --> N[Gradient Accumulation]
N --> O{Accumulation<br/>Complete?}
O -->|No| J
O -->|Yes| P[Gradient Clipping<br/>Clip gradient norm]
P --> Q[Optimizer Step<br/>θ = θ - η∇L]
Q --> R[LR Scheduler Step]
R --> S{End of<br/>Epoch?}
S -->|No| J
S -->|Yes| T[Evaluate on<br/>Validation Set]
T --> U[Compute Perplexity<br/>exp L]
U --> V{Best<br/>Model?}
V -->|Yes| W[Save Checkpoint]
V -->|No| X[Save Regular Checkpoint]
W --> Y{More<br/>Epochs?}
X --> Y
Y -->|Yes| I
Y -->|No| Z[Training Complete]
style A fill:#e1f5ff
style Z fill:#fff4e1
style K fill:#e8f5e9
style L fill:#ffe1f5
style P fill:#fff4e1
graph TD
A[Load Checkpoint] --> B[Initialize Model]
B --> C[Set Eval Mode]
C --> D[Input Prompt]
D --> E[Tokenize<br/>t = t1, t2, ..., tk]
E --> F[Encode<br/>h = E t + PE]
F --> G[Forward Pass<br/>y = Model h]
G --> H[Get Logits<br/>y ∈ ℝ^V]
H --> I[Apply Temperature<br/>p = softmax y/τ]
I --> J{Top-k<br/>Filter?}
J -->|Yes| K[Keep Top-k<br/>p' = filter p, k]
J -->|No| L[p' = p]
K --> M{Top-p<br/>Filter?}
L --> M
M -->|Yes| N[Nucleus Sampling<br/>p' = filter p', p]
M -->|No| O[p'' = p']
N --> O
O --> P[Sample Token<br/>t_i ~ p'']
P --> Q[Append to Sequence<br/>s = s ∪ t_i]
Q --> R{Max Length<br/>Reached?}
R -->|No| G
R -->|Yes| S[Decode Tokens<br/>text = decode s]
S --> T[Output Text]
style A fill:#e1f5ff
style T fill:#fff4e1
style G fill:#e8f5e9
style P fill:#ffe1f5
graph TB
subgraph "Configuration"
CFG[config.py<br/>Config Classes]
CFGJSON[config.json<br/>JSON Settings]
end
subgraph "Models"
TRANS[transformer.py<br/>TransformerModel]
ATT[attention.py<br/>MultiHeadAttention]
BLOCKS[blocks.py<br/>TransformerBlock, FFN]
OPT[optimized_attention.py<br/>KV Cache, Optimized Inference]
end
subgraph "Data"
DATA[data/__init__.py<br/>TextDataset, SimpleTokenizer]
TOKEN[SimpleTokenizer<br/>encode/decode]
DATASET[TextDataset<br/>PyTorch Dataset]
end
subgraph "Training"
TRAIN[training/__init__.py<br/>Trainer]
METRICS[training/metrics.py<br/>TrainingMetrics]
end
subgraph "Scripts"
TRAIN_SCRIPT[train.py<br/>Training Entry Point]
INFER[inference.py<br/>Generation Script]
EX[example.py<br/>Usage Examples]
end
subgraph "Utils"
UTILS[utils.py<br/>Evaluation Functions]
end
CFG --> TRAIN_SCRIPT
CFGJSON --> TRAIN_SCRIPT
TRANS --> TRAIN_SCRIPT
TRANS --> INFER
ATT --> TRANS
BLOCKS --> TRANS
OPT --> TRANS
DATA --> TRAIN_SCRIPT
DATA --> INFER
TOKEN --> DATASET
DATASET --> TRAIN
TRAIN --> TRAIN_SCRIPT
METRICS --> TRAIN
UTILS --> TRAIN_SCRIPT
style TRANS fill:#e1f5ff
style TRAIN fill:#fff4e1
style DATA fill:#e8f5e9
style ATT fill:#ffe1f5
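| Concept | Formula | Description |
|---|---|---|
| Token Embedding | $\mathbf{e}_i = \mathbf{E}[t_i]$ | Maps token ID to vector |
| Positional Encoding | $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{\text{model}}})$ | Adds position information |
| Attention | $\text{softmax}(\mathbf{Q}\mathbf{K}^T / \sqrt{d_k})\,\mathbf{V}$ | Computes attention weights |
| Feed-Forward | $\text{GELU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$ | Non-linear transformation |
| Layer Norm | $\gamma \odot (\mathbf{x} - \mu) / \sqrt{\sigma^2 + \epsilon} + \beta$ | Normalizes activations |
| Loss | $\mathcal{L} = -\log p(t_{\text{next}} \mid \text{context})$, averaged over positions | Cross-entropy loss |
| Perplexity | $\exp(\mathcal{L})$ | Measure of model confidence |
| AdamW Update | $\theta_t = \theta_{t-1} - \eta\,(\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda\,\theta_{t-1})$ | Optimizer step |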
- Attention Is All You Need (Vaswani et al., 2017) - Original Transformer paper
- Optimizing LLM Inference and Retrieval - Production RAG systems optimizations
- RoPE: Rotary Position Embedding - Efficient positional encoding for long sequences
- Various papers on LLM training, hallucinations, and long context handling
For Kids: You've learned how computers can read and write! Just like you practice writing stories, computers practice by reading millions of books. The more they practice, the better they get!
For Scientists: This implementation follows modern best practices including pre-norm architecture, weight tying, mixed precision training, and efficient inference optimizations. The codebase is modular, well-documented, and optimized for both training and production deployment.