plunk

A CLI tool for training language models and generating text with a variety of architectures, from simple bigrams to small transformers.

Features

  • Multiple model architectures (bigrams, trigrams, n-grams, attention-based, transformers)
  • Character-level tokenization
  • Easy-to-use CLI interface
  • Model persistence (save/load trained models)
  • Customizable hyperparameters

Installation

git clone <repository-url>
cd plunk
pip install torch  # PyTorch is the main dependency; install any other packages the scripts import

Usage

List Available Models

python src/plunk.py list-models

Available models:

  • base-bigram - Simple bigram baseline
  • bigram - Bigram with embeddings
  • trigram - Trigram model
  • ngram - N-gram model (configurable n)
  • attentive-bigram - Bigram with attention
  • computative-bigram - Bigram with computation layers
  • transformer-bigram - Full transformer architecture

Training a Model

python src/plunk.py train \
  --model transformer-bigram \
  --data data/input.txt \
  --output trained_models/my_model.pth \
  --max-iters 5000 \
  --batch-size 8 \
  --block-size 32 \
  --embedding-dim 64

Parameters:

  • --model: Model architecture to use
  • --data: Path to training text file
  • --output: Where to save the trained model
  • --max-iters: Number of training iterations (default: 10000)
  • --batch-size: Batch size (default: 4)
  • --block-size: Context length in tokens (default: 16); the sketch below shows how these two shape each training batch
  • --embedding-dim: Embedding dimension (default: 32)
  • --n: N-gram size for ngram model (default: 4)
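
For intuition about --batch-size and --block-size: training draws random windows of block-size characters from the text, batch-size of them at a time, with the next character at each position as the target. Here is a minimal sketch of that batching (illustrative only, not the project's actual code; it assumes data/input.txt exists):

import torch

text = open("data/input.txt", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

batch_size, block_size = 8, 32  # the values used in the example command above

def get_batch(data, batch_size, block_size):
    # pick one random starting offset per batch row
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # contexts
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # next-char targets
    return x, y

x, y = get_batch(data, batch_size, block_size)
print(x.shape, y.shape)  # torch.Size([8, 32]) torch.Size([8, 32])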

Generating Text

Generate a specific number of tokens:

python src/plunk.py generate \
  --model-path trained_models/my_model.pth \
  --model transformer-bigram \
  --prompt "To be or not to be" \
  --length 500 \
  --embedding-dim 64 \
  --block-size 32

Generate indefinitely (streams output until Ctrl+C):

python src/plunk.py generate \
  --model-path trained_models/my_model.pth \
  --model transformer-bigram \
  --prompt "To be or not to be" \
  --embedding-dim 64 \
  --block-size 32

Parameters:

  • --model-path: Path to saved model file
  • --model: Model architecture (must match training)
  • --prompt: Starting text (optional)
  • --length: Number of tokens to generate (omit for infinite generation)
  • --embedding-dim: Must match the value used at training time
  • --block-size: Must match the value used at training time; the sketch below shows how the context window is applied during generation
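
The --block-size value matters at generation time because the context fed to the model is cropped to the last block-size tokens before each prediction. A minimal sketch of the usual sampling loop (illustrative only; it assumes a model whose forward pass returns per-position logits of shape (batch, time, vocab)):

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, block_size, max_new_tokens=None):
    # idx: (1, T) tensor of token ids encoding the prompt
    while max_new_tokens is None or max_new_tokens > 0:
        idx_cond = idx[:, -block_size:]              # crop to the trained context window
        logits = model(idx_cond)                     # assumed signature: returns (1, T, vocab)
        probs = F.softmax(logits[:, -1, :], dim=-1)  # distribution over the next token
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)       # append and repeat
        if max_new_tokens is not None:
            max_new_tokens -= 1
    return idx

Passing max_new_tokens=None mirrors omitting --length: the loop streams tokens until interrupted.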

Data

Download training data:

wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O data/input.txt

Tokenization

This project uses character-level encoding for simplicity. For production use, a subword tokenizer (for example, byte-pair encoding) would be a better fit.
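
A character-level codec is just a bijection between the characters seen in the training text and integer ids. A minimal sketch (illustrative, not necessarily the project's exact code):

text = open("data/input.txt", encoding="utf-8").read()
chars = sorted(set(text))                      # vocabulary = every distinct character
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

encode = lambda s: [stoi[ch] for ch in s]
decode = lambda ids: "".join(itos[i] for i in ids)

assert decode(encode("To be or not to be")) == "To be or not to be"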

Examples

Train a transformer model:

python src/plunk.py train \
  --model transformer-bigram \
  --data data/preseren.txt \
  --output trained_models/transformer.pth \
  --max-iters 5000 \
  --batch-size 8 \
  --block-size 32 \
  --embedding-dim 64

Generate text:

python src/plunk.py generate \
  --model-path trained_models/transformer.pth \
  --model transformer-bigram \
  --prompt "Hello " \
  --length 300 \
  --embedding-dim 64 \
  --block-size 32

Acknowledgements

Much of this work is based directly on Andrej Karpathy's video lecture on building language models from scratch.