MostafaK2/GPT-Style_LM

GPT-Style Decoder-Only Transformer for NLP

A PyTorch implementation of a GPT-style decoder-only Transformer language model trained on the Penn Treebank (PTB) dataset with Byte-Pair Encoding (BPE) tokenization. The model generates text autoregressively with support for multiple sampling strategies.

Evaluation Metrics: NLL (Negative Log-Likelihood), PPL (Perplexity)
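
The two metrics are directly related: perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below illustrates that relationship in plain Python. The function name `perplexity` and the toy inputs are illustrative assumptions, not code from this repository.

```python
import math

def perplexity(token_nlls):
    """Return (mean NLL, perplexity) from per-token negative log-likelihoods in nats."""
    mean_nll = sum(token_nlls) / len(token_nlls)
    return mean_nll, math.exp(mean_nll)

# Toy case: a model that assigns probability 0.25 to every target token.
nlls = [-math.log(0.25)] * 8
mean_nll, ppl = perplexity(nlls)
print(f"NLL={mean_nll:.4f}  PPL={ppl:.2f}")
```

A perplexity of 4 here means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.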

Project Overview

This project implements a complete GPT-style language model pipeline:

  1. Model: Decoder-only Transformer architecture with causal self-attention
  2. Tokenization: Byte-Pair Encoding (BPE) for subword tokenization
  3. Training: Autoregressive language modeling with teacher forcing
  4. Inference: Multiple decoding strategies (greedy, temperature sampling, top-K, nucleus sampling)

The model learns to predict the next token in a sequence and can generate coherent text continuations from arbitrary prompts.
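
Two ideas from the pipeline above can be sketched without any framework: the causal mask that prevents a position from attending to later positions, and the one-token shift that turns a sequence into inputs and next-token targets for teacher forcing. The helper names and token ids below are hypothetical and do not reflect how this repository implements either step.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def shift_for_teacher_forcing(token_ids):
    """Split one sequence into model inputs and next-token targets."""
    return token_ids[:-1], token_ids[1:]

mask = causal_mask(4)
for row in mask:
    # Print a lower-triangular pattern: 1 = visible, 0 = masked.
    print(" ".join("1" if allowed else "0" for allowed in row))

# Made-up token ids; the target at each step is the token that follows the input.
inputs, targets = shift_for_teacher_forcing([11, 42, 7, 3, 2])
print(inputs, targets)
```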

Prerequisites

  • Python 3.8+
  • CUDA 11.8+ (recommended for GPU acceleration)
  • 10GB+ disk space (for dataset and model checkpoints)
  • 8GB+ RAM

Installation & Setup

1. Clone Repository

git clone <repository-url>
cd <project-directory>

2. Create Virtual Environment (Recommended)

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

Required Dependencies:

torch>=2.0.0
tokenizers>=0.13.0
pyyaml>=6.0
matplotlib>=3.5.0
nltk>=3.8

Dataset

Penn Treebank (PTB) Dataset

File Structure

The dataset expects three files with space-separated tokens:

data/
├── train.txt          # Training data (70%)
├── valid.txt          # Validation data (15%)
└── test.txt           # Test data (15%)

Each line contains one sequence with tokens separated by spaces:

the quick brown fox jumped over the lazy dog
she went to the store and bought some milk
...
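
A file in this format can be read with a few lines of Python. The sketch below is only an assumption about preprocessing: it splits each line on whitespace, skips blank lines, and truncates to a maximum length. Whether the project truncates or discards over-long sequences is not stated here, and `read_sequences` is a hypothetical helper.

```python
import io

def read_sequences(handle, max_len=50):
    """Read one whitespace-tokenized sequence per line, truncated to max_len tokens."""
    sequences = []
    for line in handle:
        tokens = line.split()
        if tokens:  # skip blank lines
            sequences.append(tokens[:max_len])
    return sequences

# An in-memory stand-in for data/train.txt so the example runs anywhere.
sample = io.StringIO(
    "the quick brown fox jumped over the lazy dog\n"
    "\n"
    "she went to the store and bought some milk\n"
)
sequences = read_sequences(sample, max_len=5)
print(sequences)
```

With a real file, the same function can be called as `read_sequences(open("data/train.txt", encoding="utf-8"))`.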

Configuration

All hyperparameters are configured in config/main.yml. Individual settings can be overridden with the command-line arguments listed under Training Arguments.

Data Configuration

data:
  train_path: data/train.txt
  valid_path: data/valid.txt
  test_path: data/test.txt
  max_len: 50              # Maximum sequence length during preprocessing
  min_freq: 0              # Minimum token frequency for vocab inclusion

special_tokens:
  pad: "<pad>"             # Padding token
  eos: "<eos>"             # End-of-sequence token
  unk: "<unk>"             # Unknown token

model:
  vocab_size: 10000        # BPE vocabulary size (auto-updated)
  d_model: 512             # Embedding/hidden dimension
  n_heads: 8               # Number of attention heads
  n_layers: 6              # Number of decoder layers
  max_len: 256             # Maximum sequence length for position embeddings
  d_ff: 2048               # Feedforward network dimension
  dropout: 0.1             # Dropout rate

training:
  epochs: 30               # Number of training epochs
  batch_size: 32           # Training batch size
  learning_rate: 5e-4      # Initial learning rate
  weight_decay: 1e-5       # L2 regularization strength
  grad_clip: 1.0           # Gradient clipping threshold
  label_smoothing: 0.1     # Label smoothing factor (0-1)
  patience: 5              # Early stopping patience
  warmup_steps: 1500       # Linear warmup steps

checkpoint:
  save_dir: results/       # Directory to save checkpoints

reproducibility:
  seed: 42                 # Random seed for reproducibility
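
The `warmup_steps` setting describes a linear warmup of the learning rate. A minimal sketch of such a schedule follows, using the `learning_rate` and `warmup_steps` values above as defaults. Holding the rate constant after warmup is an assumption made for illustration; the actual training script may decay it afterwards.

```python
def warmup_lr(step, base_lr=5e-4, warmup_steps=1500):
    """Ramp the learning rate linearly from 0 to base_lr, then hold it (assumed)."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)

for step in (0, 750, 1500, 3000):
    print(step, warmup_lr(step))
```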

Training

Minimal Command (Default Configuration)

python train_caption.py --config config/main.yml

Uses the default configuration and saves results to the directory specified in the config.

Custom Training Configuration

python train_caption.py \
    --config config/main.yml \
    --epochs 50 \
    --batch_size 64 \
    --lr 3e-4 \
    --seed 42 \
    --decay 1e-5 \
    --save_dir results/custom_run/

Training Arguments

| Argument | Type | Description |
|----------|------|-------------|
| --config | PATH | Path to YAML config file (default: config/main.yml) |
| --epochs | INT | Number of training epochs (overrides config) |
| --batch_size | INT | Batch size (overrides config) |
| --lr | FLOAT | Learning rate (overrides config) |
| --seed | INT | Random seed (overrides config) |
| --decay | FLOAT | Weight decay/L2 regularization (overrides config) |
| --save_dir | PATH | Directory for saving checkpoints (overrides config) |
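
One way such overrides can work is to parse the flags with `argparse` and write any value the user supplied into the matching config section. The sketch below assumes a flag-to-key mapping (for example `--lr` to `training.learning_rate`) inferred from the config layout; `OVERRIDES`, `build_parser`, and `apply_overrides` are hypothetical names, and the real script may wire this differently.

```python
import argparse
import copy

# Assumed mapping from CLI flag to (section, key) in the YAML config.
OVERRIDES = {
    "epochs": ("training", "epochs"),
    "batch_size": ("training", "batch_size"),
    "lr": ("training", "learning_rate"),
    "decay": ("training", "weight_decay"),
    "seed": ("reproducibility", "seed"),
    "save_dir": ("checkpoint", "save_dir"),
}

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="config/main.yml")
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--lr", type=float)
    parser.add_argument("--seed", type=int)
    parser.add_argument("--decay", type=float)
    parser.add_argument("--save_dir")
    return parser

def apply_overrides(config, args):
    """Return a copy of config where every flag the user passed replaces the file value."""
    merged = copy.deepcopy(config)
    for flag, (section, key) in OVERRIDES.items():
        value = getattr(args, flag)
        if value is not None:  # flag omitted: keep the value from the config file
            merged.setdefault(section, {})[key] = value
    return merged

# Stand-in for the dict that loading config/main.yml would produce.
base = {
    "training": {"epochs": 30, "batch_size": 32, "learning_rate": 5e-4},
    "reproducibility": {"seed": 42},
}
args = build_parser().parse_args(["--epochs", "50", "--lr", "3e-4"])
config = apply_overrides(base, args)
print(config["training"])
```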

Training Output Files

After training completes, the following files are saved in your save_dir:

| File | Description |
|------|-------------|
| best_model.pt | Best model checkpoint (state_dict, hyperparams, epoch) |
| bpe_tokenizer.json | Trained BPE tokenizer (vocabulary snapshot) |
| best_config.yml | Configuration file used (for reproducibility) |
| training_curves.png | Loss and Perplexity curves visualization |
| train.log | Detailed training logs with all metrics |
| results.txt | Summary with final metrics and model info |

Text Generation

Minimal Command

python generate_caption.py --result_dir results

Generates text from default prompts using the trained model.

Custom Generation Parameters

python generate_caption.py \
    --result_dir results \
    --prompt "the cat" "she went to" "once upon a time" \
    --max_tokens 50 \
    --temperature 0.8 \
    --top_p 0.95 \
    --gen_num 5 \
    --seed 42 \
    --greedy

Generation Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --result_dir | PATH | - | Path to trained model directory (required) |
| --prompt | LIST[STR] | See below | Text prompts to continue |
| --max_tokens | INT | 50 | Maximum number of tokens to generate |
| --temperature | FLOAT | 0.7 | Sampling temperature (controls randomness) |
| --top_k | INT | 0 | Top-K sampling (0 to disable) |
| --top_p | FLOAT | 0.9 | Nucleus sampling probability (0-1) |
| --gen_num | INT | 20 | Number of samples per prompt |
| --seed | INT | - | Random seed (optional) |
| --greedy | FLAG | False | Use greedy decoding (deterministic) |
| --handle_unk | FLAG | False | Mask unknown tokens during generation |
| --out | PATH | generated_text.txt | Output file path |
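
To show how the decoding arguments interact, here is a framework-free sketch of choosing one next token from a list of logits. It applies temperature scaling, then top-K, then nucleus (top-p) filtering, and short-circuits to argmax when greedy decoding is requested. The function names and the order of the filters are assumptions for illustration; the repository's generation script operates on PyTorch tensors and may differ in detail.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_probs(probs, top_k=0, top_p=1.0):
    """Keep the top_k tokens (0 = all), then the smallest prefix reaching top_p mass."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized survivors

def sample_next(logits, temperature=0.7, top_k=0, top_p=0.9, greedy=False, rng=random):
    """Pick the next token id from raw logits."""
    if greedy:
        return max(range(len(logits)), key=lambda i: logits[i])
    candidates = filter_probs(softmax(logits, temperature), top_k, top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(sample_next(logits, greedy=True))
print(sample_next(logits, rng=random.Random(42)))
```

Greedy decoding ignores the temperature, top-K, and top-p settings in this sketch, since it always returns the highest-scoring token.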

Default Prompts

If no prompts are specified, the following are used:

"the cat", "she went to", "it was a", "there was a", 
"i think that", "the company said", "in the market", 
"once upon a time", "the reason is", "at the end of the"
