A from-scratch PyTorch implementation of the Transformer model introduced in *Attention Is All You Need*. The model is trained on the OPUS Books dataset for English → Italian machine translation.
This project implements every component of the original Transformer architecture directly in PyTorch, without relying on high-level NLP libraries for the model itself. Key components include:
- Input Embedding with $\sqrt{d_{\text{model}}}$ scaling
- Sinusoidal Positional Encoding
- Multi-Head Self-Attention and Cross-Attention
- Layer Normalization, Residual Connections, and Feed-Forward blocks
- Encoder–Decoder architecture with configurable depth ($N$ layers)
- Greedy decoding for inference

Project structure:

    .
    ├── model.py                 # Full Transformer implementation
    ├── dataset.py               # Bilingual dataset, tokenization, and causal masking
    ├── train.py                 # Training loop with validation and TensorBoard logging
    ├── config.py                # Hyperparameters and configuration
    ├── requirements.txt         # Python dependencies
    └── model_architecture.png   # Architecture diagram


Set up a virtual environment and install the dependencies:

    # Create a virtual environment
    python -m venv .venv
    source .venv/bin/activate

    # Install dependencies
    pip install -r requirements.txt

Dependencies: PyTorch ≥2.1, Hugging Face `datasets` and `tokenizers`, TensorBoard, tqdm.

Start training with:

    python train.py

By default, the model trains on English → Italian translation using the OPUS Books dataset (90/10 train/validation split). Configuration is in `config.py`:

| Parameter | Default | Description |
|---|---|---|
| `batch_size` | 8 | Training batch size |
| `num_epochs` | 20 | Number of training epochs |
| `lr` | see `config.py` | Learning rate (Adam) |
| `seq_len` | 350 | Maximum sequence length |
| `d_model` | 512 | Model dimension |
| `lang_src` / `lang_tgt` | `en` / `it` | Source and target languages |

Model weights are saved to weights/ after each epoch. Tokenizers are saved as tokenizer_en.json and tokenizer_it.json.
TensorBoard logs are written to `runs/tmodel/`. Launch with:

    tensorboard --logdir runs/

The following diagram and sections describe the model architecture in detail.
The model is assembled by build_transformer in model.py, which follows the dimensions from the original paper:
| Hyperparameter | Default | Description |
|---|---|---|
| `d_model` | 512 | Embedding / model dimension |
| `N` | 6 | Number of encoder and decoder layers |
| `h` | 8 | Number of attention heads ($d_k = d_{\text{model}} / h = 64$) |
| `d_ff` | 2048 | Hidden dimension of the feed-forward network |
| `dropout` | 0.1 | Dropout probability |

Weights are initialized with Xavier uniform initialization.
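
As a rough illustration of what Xavier (Glorot) uniform initialization does, the sketch below samples weights from $U(-a, a)$ with $a = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$. It is written in plain Python so it runs without PyTorch; `xavier_bound` and `xavier_uniform` are hypothetical helper names, not functions from `model.py`, and the repository presumably relies on PyTorch's built-in `torch.nn.init.xavier_uniform_` instead.

```python
import math
import random
from typing import List


def xavier_bound(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Half-width a of the uniform range U(-a, a) used by Xavier init."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform(fan_in: int, fan_out: int, seed: int = 0) -> List[List[float]]:
    """Sample a (fan_out x fan_in) weight matrix from U(-a, a)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    a = xavier_bound(fan_in, fan_out)
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]


# Bound for a 512 -> 2048 projection (the first feed-forward layer's shape).
bound = xavier_bound(512, 2048)
# A tiny matrix, just to show the sampling step.
weights = xavier_uniform(fan_in=4, fan_out=3)
print(f"bound for a 512 -> 2048 layer: {bound:.4f}")
```

The bound shrinks as layers get wider, which keeps the variance of activations roughly constant from layer to layer at the start of training.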
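Input embedding: each token of the original sentence is converted into a vector of size $d_{\text{model}}$:

Original Sentence (Tokens) → Input IDs (Position in Vocab) → Embedding (Vector)

In the embedding layer, weights are multiplied by $\sqrt{d_{\text{model}}}$.

Positional encoding: a vector of size $d_{\text{model}}$ is added to each embedding so that the model can use token order. It is computed with fixed sinusoids:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the position of the token in the sequence and $i$ indexes the sine/cosine dimension pair.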
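
The embedding scaling and the sinusoidal positional encoding can be sketched in a few lines. This is a dependency-free illustration in plain Python, not the PyTorch code in `model.py`; `positional_encoding` and `embed_with_position` are hypothetical names, and the input rows stand in for the output of a learned embedding table.

```python
import math
from typing import List


def positional_encoding(seq_len: int, d_model: int) -> List[List[float]]:
    """Sinusoidal table: even columns hold sin, odd columns hold cos."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i is the even column index (2i in the paper)
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def embed_with_position(rows: List[List[float]]) -> List[List[float]]:
    """Scale raw embeddings by sqrt(d_model), then add the positional encoding."""
    d_model = len(rows[0])
    scale = math.sqrt(d_model)
    pe = positional_encoding(len(rows), d_model)
    return [
        [scale * x + p for x, p in zip(row, pe_row)]
        for row, pe_row in zip(rows, pe)
    ]


# Two tokens with d_model = 4; the all-ones rows are placeholder embeddings.
encoded = embed_with_position([[1.0] * 4, [1.0] * 4])
print(encoded[0])
```

Because the table depends only on position and dimension, it can be computed once and reused for every batch.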
The encoder consists of $N$ identical layers, each containing:
- Multi-Head Self-Attention on the source sequence
- Feed-Forward Network with ReLU activation
- Two Add & Norm residual connections with layer normalization
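Layer normalization: each item is normalized independently across its features. For item $x_j$ with feature mean $\mu_j$ and variance $\sigma_j^2$:

$$\hat{x}_j = \frac{x_j - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$$

Learnable parameters gamma ($\gamma$, multiplicative) and beta ($\beta$, additive) then scale and shift the normalized values, and $\epsilon$ is a small constant that avoids division by zero.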
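
A minimal sketch of layer normalization for a single item, in plain Python. It assumes scalar `gamma` and `beta` and an `eps` of `1e-6` purely for brevity; the repository's module may use per-feature parameters and a different epsilon, and `layer_norm` is a hypothetical name.

```python
import math
from typing import List


def layer_norm(x: List[float], gamma: float = 1.0, beta: float = 0.0,
               eps: float = 1e-6) -> List[float]:
    """Normalize one item across its features, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in normalized])
```

With the default `gamma` and `beta`, the output has approximately zero mean and unit variance regardless of the scale of the input.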
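Feed-forward network: two linear transformations with a ReLU activation in between:

$$\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

Here $W_1$ maps from $d_{\text{model}}$ to $d_{ff}$ and $W_2$ maps back to $d_{\text{model}}$.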
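
The feed-forward block is small enough to trace by hand. The sketch below uses plain Python lists with toy dimensions ($d_{\text{model}} = 2$, $d_{ff} = 3$) and omits dropout; `linear`, `relu`, and `feed_forward` are hypothetical helper names, and the weights are made-up values chosen so the arithmetic is easy to follow.

```python
from typing import List


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """y_j = sum_i x_i * weight[j][i] + bias[j]; weight is (out x in)."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(weight, bias)]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def feed_forward(x, w1, b1, w2, b2) -> List[float]:
    """Expand to d_ff, apply ReLU, project back to d_model."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


x = [1.0, -2.0]                              # one token, d_model = 2
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 2 -> 3
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]      # 3 -> 2
b2 = [0.5, 0.0]

out = feed_forward(x, w1, b1, w2, b2)
print(out)  # hidden = [1, -2, -1] -> ReLU -> [1, 0, 0] -> [1.5, 2.0]
```

The same function is applied to every position independently, which is why it is often called position-wise.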
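Multi-head attention: the input is projected into query, key, and value matrices $Q$, $K$, and $V$, which are split into $h$ heads. Each head computes scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $d_k = d_{\text{model}} / h$ is the per-head dimension. The $h$ head outputs are concatenated and passed through an output projection $W^O$.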
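
To make the attention computation concrete, here is a single-head sketch of scaled dot-product attention in plain Python. It leaves out the learned projections, the split into $h$ heads, and dropout; `attention` and `softmax` are hypothetical helpers, and using `-1e9` for masked scores is an assumption about how masking is commonly done, not a statement about `model.py`.

```python
import math
from typing import List, Optional

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix,
              mask: Optional[List[List[bool]]] = None) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V for one head; mask[i][j] is True where i may attend to j."""
    d_k = len(q[0])
    out = []
    for i, q_row in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        if mask is not None:
            scores = [s if keep else -1e9 for s, keep in zip(scores, mask[i])]
        weights = softmax(scores)
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
causal = [[True, False], [True, True]]  # position 0 cannot see position 1

result = attention(q, k, v, mask=causal)
print(result[0])  # position 0 attends only to itself
```

With the defaults from the table above, each of the 8 heads would run this computation on 64-dimensional slices of the 512-dimensional projections.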
The decoder also has $N$ identical layers, each containing:
- Masked Multi-Head Self-Attention (prevents attending to future tokens)
- Multi-Head Cross-Attention where $Q$ comes from the decoder and $K, V$ come from the encoder output
- Feed-Forward Network
- Three Add & Norm residual connections
Two masks are used: the target mask for self-attention (causal + padding) and the source mask for cross-attention (padding only).
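
The two masks can be illustrated with boolean lists, where `True` means "may attend". This is a sketch under assumptions: the helper names and the padding id `0` are hypothetical, and the actual tensors built in `dataset.py` may be shaped and encoded differently.

```python
from typing import List


def padding_mask(token_ids: List[int], pad_id: int) -> List[bool]:
    """True for real tokens, False for padding positions."""
    return [t != pad_id for t in token_ids]


def causal_mask(size: int) -> List[List[bool]]:
    """Lower-triangular mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(size)] for i in range(size)]


def target_mask(token_ids: List[int], pad_id: int) -> List[List[bool]]:
    """Combine the causal and padding constraints for decoder self-attention."""
    pad = padding_mask(token_ids, pad_id)
    causal = causal_mask(len(token_ids))
    n = len(token_ids)
    return [[causal[i][j] and pad[j] for j in range(n)] for i in range(n)]


src_mask = padding_mask([7, 8, 9, 0], pad_id=0)   # last source position is padding
tgt_mask = target_mask([5, 6, 0], pad_id=0)       # last target position is padding
for row in tgt_mask:
    print(row)
```

The source mask is a single row because every decoder position is allowed to see every non-padding source token.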
During inference, the model uses greedy decoding: start with a [SOS] token, generate one token at a time, append it to the decoder input, and repeat until [EOS] or max_len is reached.
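
The greedy loop itself is independent of the model. The sketch below takes any callable that returns one score per vocabulary entry for the next token; `greedy_decode`, `toy_scores`, and the token ids are hypothetical stand-ins for illustration, not the repository's inference code.

```python
from typing import Callable, List, Sequence


def greedy_decode(next_token_scores: Callable[[List[int]], Sequence[float]],
                  sos_id: int, eos_id: int, max_len: int) -> List[int]:
    """Repeatedly append the highest-scoring token until [EOS] or max_len."""
    tokens = [sos_id]
    while len(tokens) < max_len:
        scores = next_token_scores(tokens)
        next_id = max(range(len(scores)), key=lambda i: scores[i])
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


def toy_scores(tokens: List[int]) -> List[float]:
    """Stand-in for the decoder: prefers token 2, then 3, then [EOS] (id 1)."""
    plan = {1: 2, 2: 3}
    target = plan.get(len(tokens), 1)
    return [1.0 if i == target else 0.0 for i in range(4)]


decoded = greedy_decode(toy_scores, sos_id=0, eos_id=1, max_len=10)
print(decoded)
```

Greedy decoding is fast and deterministic, but it commits to the locally best token at each step, so it can miss translations that a wider search such as beam search would find.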
- Vaswani et al., Attention Is All You Need (2017). arXiv:1706.03762
