GPT-Style Decoder-Only Transformer for NLP

A PyTorch implementation of a GPT-style decoder-only Transformer language model trained on the Penn Treebank (PTB) dataset with Byte-Pair Encoding (BPE) tokenization. The model generates text autoregressively with support for multiple sampling strategies.

Evaluation Metrics: NLL (Negative Log-Likelihood), PPL (Perplexity)
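
The two metrics are directly related: perplexity is the exponential of the mean per-token negative log-likelihood. The snippet below is a minimal, dependency-free sketch of that relationship; the function names and the probabilities are illustrative and are not taken from the project code.

```python
import math

def mean_nll(token_probs):
    """Average negative log-likelihood (in nats) over the target tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(nll):
    """Perplexity is the exponential of the mean per-token NLL."""
    return math.exp(nll)

# Probabilities a hypothetical model assigned to each correct next token.
probs = [0.25, 0.5, 0.125, 0.25]
nll = mean_nll(probs)
ppl = perplexity(nll)
print(f"NLL = {nll:.4f} nats, PPL = {ppl:.4f}")
```

A perplexity of 4 here means the model is, on average, as uncertain as if it were choosing uniformly among four tokens. If label smoothing is enabled during training, the training loss is no longer a pure NLL, so perplexity is best read from the unsmoothed evaluation loss.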

Project Overview

This project implements a complete GPT-style language model pipeline:

Model: Decoder-only Transformer architecture with causal self-attention
Tokenization: Byte-Pair Encoding (BPE) for subword tokenization
Training: Autoregressive language modeling with teacher forcing
Inference: Multiple decoding strategies (greedy, temperature sampling, top-K, nucleus sampling)

The model learns to predict the next token in a sequence and can generate coherent text continuations from arbitrary prompts.
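
Teacher forcing and causal attention can be illustrated without any framework. The sketch below is plain Python with hypothetical token ids, and only shows the idea, not how the project's model builds its tensors: targets are the inputs shifted left by one position, and position i may only attend to positions j <= i.

```python
def shift_for_teacher_forcing(token_ids):
    """Inputs drop the last token; targets are the same sequence shifted left by one."""
    return token_ids[:-1], token_ids[1:]

def causal_mask(seq_len):
    """mask[i][j] is True when position i is allowed to attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Hypothetical token ids; 2 stands in for the <eos> token.
ids = [12, 7, 93, 4, 2]
inputs, targets = shift_for_teacher_forcing(ids)
mask = causal_mask(len(inputs))

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row shows which earlier positions a token can see, so the mask is lower-triangular. This is what prevents the model from reading the token it is being trained to predict.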

Prerequisites

Python 3.8+
CUDA 11.8+ (recommended for GPU acceleration)
10GB+ disk space (for dataset and model checkpoints)
8GB+ RAM

Installation & Setup

1. Clone Repository

cd <project-directory>

2. Create Virtual Environment (Recommended)

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

Required Dependencies:

torch>=2.0.0
tokenizers>=0.13.0
pyyaml>=6.0
matplotlib>=3.5.0
nltk>=3.8

Dataset

Penn Treebank (PTB) Dataset

Format: Plain text files with one sentence per line
Structure: Token-separated text (e.g., "word1 word2 word3 ...")
Split: 70% train / 15% validation / 15% test
Preprocessing: Tokenization and vocabulary building
Link: https://www.kaggle.com/datasets/aliakay8/penn-treebank-dataset
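
For readers unfamiliar with BPE, the toy sketch below shows the core merge loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is an illustration only. The project depends on the `tokenizers` library for the real tokenizer, and the function names, the tie-breaking rule, and the tiny corpus here are assumptions made for the example.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs; `words` maps a tuple of symbols to its frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

After two merges the frequent word "low" has become a single symbol, while the rarer suffixes in "lower" and "lowest" remain split into characters. A production tokenizer applies the same idea with many more merges until the target vocabulary size is reached.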

File Structure

The dataset expects three files with space-separated tokens:

data/
├── train.txt          # Training data (70%)
├── valid.txt          # Validation data (15%)
└── test.txt           # Test data (15%)

Each line contains one sequence with tokens separated by spaces:

the quick brown fox jumped over the lazy dog
she went to the store and bought some milk
...
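
If you start from a single file of sentences, a 70/15/15 split into the three files can be produced along these lines. This is a sketch under assumptions: the helper name, the shuffling, and the seed are illustrative, and the project may split the data differently.

```python
import random

def split_lines(lines, train_frac=0.70, valid_frac=0.15, seed=42):
    """Shuffle with a fixed seed, then cut into train/validation/test portions."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    # round() avoids off-by-one sizes caused by floating-point truncation.
    n_train = round(len(lines) * train_frac)
    n_valid = round(len(lines) * valid_frac)
    train = lines[:n_train]
    valid = lines[n_train:n_train + n_valid]
    test = lines[n_train + n_valid:]
    return train, valid, test

# Stand-in sentences; in practice these would be read from the raw dataset file.
sentences = [f"sentence number {i}" for i in range(100)]
train, valid, test = split_lines(sentences)
print(len(train), len(valid), len(test))
```

Each returned list can then be written out one sentence per line to `data/train.txt`, `data/valid.txt`, and `data/test.txt`.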

Configuration

All hyperparameters are configured in config/main.yml. Override individual settings via CLI.

Data Configuration

data:
  train_path: data/train.txt
  valid_path: data/valid.txt
  test_path: data/test.txt
  max_len: 50              # Maximum sequence length during preprocessing
  min_freq: 0              # Minimum token frequency for vocab inclusion

special_tokens:
  pad: "<pad>"             # Padding token
  eos: "<eos>"             # End-of-sequence token
  unk: "<unk>"             # Unknown token

model:
  vocab_size: 10000        # BPE vocabulary size (auto-updated)
  d_model: 512             # Embedding/hidden dimension
  n_heads: 8               # Number of attention heads
  n_layers: 6              # Number of decoder layers
  max_len: 256             # Maximum sequence length for position embeddings
  d_ff: 2048               # Feedforward network dimension
  dropout: 0.1             # Dropout rate

training:
  epochs: 30               # Number of training epochs
  batch_size: 32           # Training batch size
  learning_rate: 5e-4      # Initial learning rate
  weight_decay: 1e-5       # L2 regularization strength
  grad_clip: 1.0           # Gradient clipping threshold
  label_smoothing: 0.1     # Label smoothing factor (0-1)
  patience: 5              # Early stopping patience
  warmup_steps: 1500       # Linear warmup steps

checkpoint:
  save_dir: results/       # Directory to save checkpoints

reproducibility:
  seed: 42                 # Random seed for reproducibility
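
To make `warmup_steps` concrete, the sketch below computes a linearly warmed-up learning rate from the default values above. What happens after warmup is not stated in this README, so holding the rate constant is an assumption, as is the function name.

```python
def warmup_lr(step, base_lr=5e-4, warmup_steps=1500):
    """Ramp linearly from 0 to base_lr over warmup_steps, then hold at base_lr.

    Holding the rate constant after warmup is an assumption; the project
    may apply a decay schedule instead.
    """
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)

for step in (0, 750, 1500, 10000):
    print(step, warmup_lr(step))
```

Halfway through warmup (step 750) the rate is half of `learning_rate`, and from step 1500 onward it equals the configured value.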

Training

Minimal Command (Default Configuration)

python train_caption.py --config config/main.yml

Uses the default configuration and saves results to the directory specified in the config.

Custom Training Configuration

python train_caption.py \
    --config config/main.yml \
    --epochs 50 \
    --batch_size 64 \
    --lr 3e-4 \
    --seed 42 \
    --decay 1e-5 \
    --save_dir results/custom_run/

Training Arguments

Argument	Type	Description
`--config`	PATH	Path to YAML config file (default: `config/main.yml`)
`--epochs`	INT	Number of training epochs (overrides config)
`--batch_size`	INT	Batch size (overrides config)
`--lr`	FLOAT	Learning rate (overrides config)
`--seed`	INT	Random seed (overrides config)
`--decay`	FLOAT	Weight decay/L2 regularization (overrides config)
`--save_dir`	PATH	Directory for saving checkpoints (overrides config)
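
The override behaviour in the table can be sketched with `argparse`: a flag that is not passed stays `None` and leaves the config value untouched. The mapping from flags to config keys below is inferred from the tables in this README and is an assumption about the implementation. A plain dict stands in for the parsed YAML file so the example has no dependencies.

```python
import argparse

def build_parser():
    """CLI flags mirroring the Training Arguments table; unset flags default to None."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="config/main.yml")
    parser.add_argument("--epochs", type=int)
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--lr", type=float)
    parser.add_argument("--seed", type=int)
    parser.add_argument("--decay", type=float)
    parser.add_argument("--save_dir")
    return parser

# (CLI flag, config section, config key) -- assumed mapping.
OVERRIDES = [
    ("epochs", "training", "epochs"),
    ("batch_size", "training", "batch_size"),
    ("lr", "training", "learning_rate"),
    ("decay", "training", "weight_decay"),
    ("seed", "reproducibility", "seed"),
    ("save_dir", "checkpoint", "save_dir"),
]

def apply_overrides(config, args):
    """Copy every flag that was actually passed into the config dict."""
    for flag, section, key in OVERRIDES:
        value = getattr(args, flag)
        if value is not None:
            config[section][key] = value
    return config

# Stand-in for the dict that loading config/main.yml would produce.
config = {
    "training": {"epochs": 30, "batch_size": 32,
                 "learning_rate": 5e-4, "weight_decay": 1e-5},
    "reproducibility": {"seed": 42},
    "checkpoint": {"save_dir": "results/"},
}
args = build_parser().parse_args(["--epochs", "50", "--lr", "3e-4"])
config = apply_overrides(config, args)
print(config["training"])
```

Only `epochs` and `learning_rate` change in this run; `batch_size` keeps its configured value because `--batch_size` was not passed.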

Training Output Files

After training completes, the following files are saved in your save_dir:

File	Description
`best_model.pt`	Best model checkpoint (state_dict, hyperparams, epoch)
`bpe_tokenizer.json`	Trained BPE tokenizer (vocabulary snapshot)
`best_config.yml`	Configuration file used (for reproducibility)
`training_curves.png`	Loss and Perplexity curves visualization
`train.log`	Detailed training logs with all metrics
`results.txt`	Summary with final metrics and model info
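
The "best" checkpoint and the `patience` setting are usually tied together by logic like the following. This is a generic early-stopping sketch, not the project's code: the class name and the use of validation loss as the monitored value are assumptions.

```python
class EarlyStopping:
    """Track the best validation loss and signal a stop after `patience` epochs without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, valid_loss):
        """Return (is_best, should_stop) for the epoch that just finished."""
        if valid_loss < self.best:
            self.best = valid_loss
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
# Hypothetical validation losses, one per epoch.
history = [stopper.step(loss) for loss in [5.0, 4.5, 4.6, 4.7, 4.4]]
print(history)
```

In a training loop, `is_best` would trigger saving `best_model.pt`, and `should_stop` would break out of the epoch loop. The example keeps stepping past the stop signal only to show the counter resetting on a new best.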

Text Generation

Minimal Command

python generate_caption.py --result_dir results

Generates text from default prompts using the trained model.

Custom Generation Parameters

python generate_caption.py \
    --result_dir results \
    --prompt "the cat" "she went to" "once upon a time" \
    --max_tokens 50 \
    --temperature 0.8 \
    --top_p 0.95 \
    --gen_num 5 \
    --seed 42

Generation Arguments

Argument	Type	Default	Description
`--result_dir`	PATH	-	Path to trained model directory (required)
`--prompt`	LIST[STR]	See below	Text prompts to continue
`--max_tokens`	INT	50	Maximum number of tokens to generate
`--temperature`	FLOAT	0.7	Sampling temperature (controls randomness)
`--top_k`	INT	0	Top-K sampling (0 to disable)
`--top_p`	FLOAT	0.9	Nucleus sampling probability (0-1)
`--gen_num`	INT	20	Number of samples per prompt
`--seed`	INT	-	Random seed (optional)
`--greedy`	FLAG	False	Use greedy decoding (deterministic)
`--handle_unk`	FLAG	False	Mask unknown tokens during generation
`--out`	PATH	`generated_text.txt`	Output file path
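
The decoding options in the table combine roughly as sketched below: temperature rescales the logits, top-K and nucleus (top-p) filtering restrict the candidate set, and greedy decoding bypasses sampling altogether. This is a dependency-free illustration. The function names and the order in which the filters are applied are assumptions, not the project's implementation.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs):
    total = sum(probs)
    return [p / total for p in probs]

def filter_top_k(probs, k):
    """Keep only the k most probable tokens; k <= 0 disables the filter."""
    if k <= 0 or k >= len(probs):
        return probs[:]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

def filter_top_p(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

def sample_next(logits, temperature=0.7, top_k=0, top_p=0.9, greedy=False, rng=random):
    """Pick the next token id from a list of logits."""
    if greedy:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    probs = filter_top_k(probs, top_k)
    probs = filter_top_p(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(42)
print(sample_next(logits, greedy=True))
print([sample_next(logits, top_k=2, top_p=1.0, rng=rng) for _ in range(10)])
```

Greedy decoding always returns the highest-scoring token, so it produces the same continuation on every run and ignores the temperature, top-K, and top-p settings.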

Default Prompts

If no prompts are specified, the following are used:

"the cat", "she went to", "it was a", "there was a", 
"i think that", "the company said", "in the market", 
"once upon a time", "the reason is", "at the end of the"
