A PyTorch implementation of a GPT-style decoder-only Transformer language model trained on the Penn Treebank (PTB) dataset with Byte-Pair Encoding (BPE) tokenization. The model generates text autoregressively with support for multiple sampling strategies.
Evaluation Metrics: NLL (Negative Log-Likelihood), PPL (Perplexity)
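
The two metrics are linked: perplexity is the exponential of the mean negative log-likelihood. The sketch below assumes the NLL is averaged per token and measured in nats; how this project aggregates it (per token or per sequence) is not stated here, and `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token NLL), with NLL measured in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every target token
# behaves like a uniform choice among 4 options, so PPL is about 4.
nlls = [-math.log(0.25)] * 8
print(perplexity(nlls))
```

Lower is better for both metrics; a perplexity of N can be read as the model being as uncertain as a uniform choice among N tokens.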
This project implements a complete GPT-style language model pipeline:
- Model: Decoder-only Transformer architecture with causal self-attention
- Tokenization: Byte-Pair Encoding (BPE) for subword tokenization
- Training: Autoregressive language modeling with teacher forcing
- Inference: Multiple decoding strategies (greedy, temperature sampling, top-K, nucleus sampling)
The model learns to predict the next token in a sequence and can generate coherent text continuations from arbitrary prompts.
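
To make "teacher forcing" and "causal self-attention" concrete, here is a minimal sketch in plain Python, so it runs without PyTorch. The function names and token ids are hypothetical and are not taken from this repository.

```python
def shift_for_teacher_forcing(token_ids):
    """Inputs drop the last token; targets are the same sequence shifted left by one."""
    return token_ids[:-1], token_ids[1:]

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

ids = [5, 17, 42, 8, 2]  # hypothetical token ids; 2 stands in for <eos>
inputs, targets = shift_for_teacher_forcing(ids)
mask = causal_mask(len(inputs))

for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

At every position the model sees only the ground-truth tokens up to that position (the lower triangle of the mask) and is trained to predict the token that follows.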
- Python 3.8+
- CUDA 11.8+ (recommended for GPU acceleration)
- 10GB+ disk space (for dataset and model checkpoints)
- 8GB+ RAM
cd <project-directory>
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Required Dependencies:
torch>=2.0.0
tokenizers>=0.13.0
pyyaml>=6.0
matplotlib>=3.5.0
nltk>=3.8
- Format: Plain text files with one sentence per line
- Structure: Space-separated tokens (e.g., "word1 word2 word3 ...")
- Split: 70% train / 15% validation / 15% test
- Preprocessing: Tokenization and vocabulary building
- Link: https://www.kaggle.com/datasets/aliakay8/penn-treebank-dataset
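
The vocabulary-building step relies on BPE, which repeatedly merges the most frequent adjacent symbol pair. The sketch below shows one merge step of the textbook algorithm on a toy corpus. It is an illustration only: the project's tokenizer comes from the `tokenizers` library, and the helper names here are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # On ties, max() keeps the pair that was seen first.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
pair = most_frequent_pair(words)
words = merge_pair(words, pair)
print(pair, words)
```

Repeating this step until the vocabulary reaches the target size (10000 in the default configuration) yields the subword inventory.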
The training pipeline expects three files of space-separated tokens:
data/
├── train.txt # Training data (70%)
├── valid.txt # Validation data (15%)
└── test.txt # Test data (15%)
Each line contains one sequence with tokens separated by spaces:
the quick brown fox jumped over the lazy dog
she went to the store and bought some milk
...
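
If the corpus arrives as a single list of sentences, a 70/15/15 split could be produced along these lines. This is a sketch under assumptions: the shuffling, the seed, and the `split_lines` helper are hypothetical, and the repository may split the data differently or ship it pre-split.

```python
import random

def split_lines(lines, train_frac=0.70, valid_frac=0.15, seed=42):
    """Shuffle sentences reproducibly and cut them into train/valid/test lists."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_train = int(len(lines) * train_frac)
    n_valid = int(len(lines) * valid_frac)
    train = lines[:n_train]
    valid = lines[n_train:n_train + n_valid]
    test = lines[n_train + n_valid:]  # whatever remains goes to test
    return train, valid, test

sentences = [f"sentence number {i}" for i in range(100)]
train, valid, test = split_lines(sentences)
print(len(train), len(valid), len(test))
```

Each resulting list can then be written out one sentence per line to `data/train.txt`, `data/valid.txt`, and `data/test.txt`.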
All hyperparameters are configured in config/main.yml. Override individual settings via CLI.
data:
  train_path: data/train.txt
  valid_path: data/valid.txt
  test_path: data/test.txt
  max_len: 50  # Maximum sequence length during preprocessing
  min_freq: 0  # Minimum token frequency for vocab inclusion
  special_tokens:
    pad: "<pad>"  # Padding token
    eos: "<eos>"  # End-of-sequence token
    unk: "<unk>"  # Unknown token
model:
  vocab_size: 10000  # BPE vocabulary size (auto-updated)
  d_model: 512  # Embedding/hidden dimension
  n_heads: 8  # Number of attention heads
  n_layers: 6  # Number of decoder layers
  max_len: 256  # Maximum sequence length for position embeddings
  d_ff: 2048  # Feedforward network dimension
  dropout: 0.1  # Dropout rate
training:
  epochs: 30  # Number of training epochs
  batch_size: 32  # Training batch size
  learning_rate: 5e-4  # Initial learning rate
  weight_decay: 1e-5  # L2 regularization strength
  grad_clip: 1.0  # Gradient clipping threshold
  label_smoothing: 0.1  # Label smoothing factor (0-1)
  patience: 5  # Early stopping patience
  warmup_steps: 1500  # Linear warmup steps
checkpoint:
  save_dir: results/  # Directory to save checkpoints
reproducibility:
  seed: 42  # Random seed for reproducibility

python train_caption.py --config config/main.yml

Uses the default configuration and saves results to the directory specified in the config.
python train_caption.py \
--config config/main.yml \
--epochs 50 \
--batch_size 64 \
--lr 3e-4 \
--seed 42 \
--decay 1e-5 \
--save_dir results/custom_run/

| Argument | Type | Description |
|---|---|---|
| `--config` | PATH | Path to YAML config file (default: config/main.yml) |
| `--epochs` | INT | Number of training epochs (overrides config) |
| `--batch_size` | INT | Batch size (overrides config) |
| `--lr` | FLOAT | Learning rate (overrides config) |
| `--seed` | INT | Random seed (overrides config) |
| `--decay` | FLOAT | Weight decay/L2 regularization (overrides config) |
| `--save_dir` | PATH | Directory for saving checkpoints (overrides config) |
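
One way such overrides can be layered on top of the YAML values is sketched below. The flag names come from the table above, but the flag-to-key mapping and the `apply_overrides` helper are assumptions for illustration; the actual logic in `train_caption.py` may differ. A plain dict stands in for the parsed config so the example runs without PyYAML.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="config/main.yml")
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--lr", type=float)
    parser.add_argument("--seed", type=int)
    parser.add_argument("--decay", type=float)
    parser.add_argument("--save_dir")
    return parser

# Hypothetical mapping from CLI flag to (config section, key)
OVERRIDES = {
    "epochs": ("training", "epochs"),
    "batch_size": ("training", "batch_size"),
    "lr": ("training", "learning_rate"),
    "decay": ("training", "weight_decay"),
    "seed": ("reproducibility", "seed"),
    "save_dir": ("checkpoint", "save_dir"),
}

def apply_overrides(config, args):
    """Copy every flag the user actually passed into the nested config dict."""
    for flag, (section, key) in OVERRIDES.items():
        value = getattr(args, flag)
        if value is not None:  # flags left unset keep the YAML value
            config.setdefault(section, {})[key] = value
    return config

# Stand-in for the dict that yaml.safe_load would return for config/main.yml
config = {
    "training": {"epochs": 30, "batch_size": 32,
                 "learning_rate": 5e-4, "weight_decay": 1e-5},
    "checkpoint": {"save_dir": "results/"},
    "reproducibility": {"seed": 42},
}
args = build_parser().parse_args(["--epochs", "50", "--lr", "3e-4"])
config = apply_overrides(config, args)
print(config["training"])
```

Leaving every override at `None` by default makes it possible to tell "not passed" apart from an explicit value, so the YAML file remains the single source of defaults.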
After training completes, the following files are saved in your save_dir:
| File | Description |
|---|---|
| `best_model.pt` | Best model checkpoint (state_dict, hyperparams, epoch) |
| `bpe_tokenizer.json` | Trained BPE tokenizer (vocabulary snapshot) |
| `best_config.yml` | Configuration file used (for reproducibility) |
| `training_curves.png` | Loss and Perplexity curves visualization |
| `train.log` | Detailed training logs with all metrics |
| `results.txt` | Summary with final metrics and model info |
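
The shape of the logged learning rate depends on the `learning_rate` and `warmup_steps` settings. A linear warmup can be sketched as follows. What happens after warmup is not specified in this README, so the sketch assumes the rate simply stays at its base value; `warmup_lr` is a hypothetical helper.

```python
def warmup_lr(step, base_lr=5e-4, warmup_steps=1500):
    """Scale the learning rate linearly up to base_lr over warmup_steps updates."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr  # assumption: constant after warmup
    # step is zero-based, so the last warmup step reaches base_lr
    return base_lr * (step + 1) / warmup_steps

for step in (0, 749, 1499, 5000):
    print(step, warmup_lr(step))
```

Starting from a small rate gives the optimizer statistics time to stabilize, which tends to help Transformers avoid diverging in the first few hundred updates.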
python generate_caption.py --result_dir results

Generates text from default prompts using the trained model.
python generate_caption.py \
--result_dir results \
--prompt "the cat" "she went to" "once upon a time" \
--max_tokens 50 \
--temperature 0.8 \
--top_p 0.95 \
--gen_num 5 \
--seed 42 \
--greedy

| Argument | Type | Default | Description |
|---|---|---|---|
| `--result_dir` | PATH | - | Path to trained model directory (required) |
| `--prompt` | LIST[STR] | See below | Text prompts to continue |
| `--max_tokens` | INT | 50 | Maximum number of tokens to generate |
| `--temperature` | FLOAT | 0.7 | Sampling temperature (controls randomness) |
| `--top_k` | INT | 0 | Top-K sampling (0 to disable) |
| `--top_p` | FLOAT | 0.9 | Nucleus sampling probability (0-1) |
| `--gen_num` | INT | 20 | Number of samples per prompt |
| `--seed` | INT | - | Random seed (optional) |
| `--greedy` | FLAG | False | Use greedy decoding (deterministic) |
| `--handle_unk` | FLAG | False | Mask unknown tokens during generation |
| `--out` | PATH | generated_text.txt | Output file path |
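
The decoding options above can be combined in a single next-token selection step. The following plain-Python sketch shows one possible ordering (temperature, then top-K, then nucleus filtering). The function names are hypothetical and the order in which `generate_caption.py` applies the filters is an assumption; only the default values are taken from the table.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_probs(probs, top_k=0, top_p=1.0):
    """Zero out tokens outside the top-K set and the top-p nucleus, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # smallest prefix whose mass reaches top_p
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def pick_token(logits, temperature=0.7, top_k=0, top_p=0.9, greedy=False, rng=random):
    if greedy:
        # Greedy decoding ignores the sampling parameters entirely.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = filter_probs(softmax(logits, temperature), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token scores
print(pick_token(logits, greedy=True))
print(pick_token(logits, rng=random.Random(42)))
```

Lower temperatures and smaller `top_k`/`top_p` values concentrate probability on the most likely tokens, trading diversity for coherence; in this sketch, greedy decoding bypasses sampling, so the sampling flags have no effect when it is enabled.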
If no prompts are specified, the following are used:
"the cat", "she went to", "it was a", "there was a",
"i think that", "the company said", "in the market",
"once upon a time", "the reason is", "at the end of the"