# plunk

*Language Modeling for Dummies*

A CLI tool for training and generating text with various language model architectures.

## Features

- Multiple model architectures (bigrams, trigrams, n-grams, attention-based, transformers)
- Character-level tokenization
- Easy-to-use CLI
- Model persistence: save and load trained models (see the sketch below)
- Customizable hyperparameters
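
Since the models are PyTorch modules, saving and loading presumably goes through standard `torch` serialization. A minimal save/load sketch (illustrative only, not plunk's actual code; the key point is that loading requires rebuilding the same architecture first, which is why `generate` repeats the architecture flags):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for any trained plunk model

# save only the learned weights
torch.save(model.state_dict(), "my_model.pth")

# later: rebuild the identical architecture, then load the weights back
restored = nn.Linear(8, 8)
restored.load_state_dict(torch.load("my_model.pth"))
```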

## Installation

```bash
git clone <repository-url>
cd plunk
pip install torch  # install dependencies (torch, etc.)
```

## Usage

### List Available Models

```bash
python src/plunk.py list-models
```

Available models:

- `base-bigram` - Simple bigram baseline (see the sketch below)
- `bigram` - Bigram with embeddings
- `trigram` - Trigram model
- `ngram` - N-gram model (configurable n)
- `attentive-bigram` - Bigram with attention
- `computative-bigram` - Bigram with computation layers
- `transformer-bigram` - Full transformer architecture
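
For intuition about the simplest entry above: a bigram model predicts the next character from the current character alone. A minimal sketch of such a baseline (hypothetical names, not plunk's actual implementation):

```python
import torch.nn as nn
import torch.nn.functional as F

class BigramBaseline(nn.Module):
    """Next-character logits are read straight out of a lookup table."""

    def __init__(self, vocab_size):
        super().__init__()
        # row i holds the logits over whatever character follows character i
        self.logits_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.logits_table(idx)  # (batch, time, vocab_size)
        if targets is None:
            return logits, None
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               targets.view(-1))
        return logits, loss
```

The other architectures in the list successively add embeddings, attention, and full transformer blocks on top of this idea.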

### Training a Model

```bash
python src/plunk.py train \
  --model transformer-bigram \
  --data data/input.txt \
  --output trained_models/my_model.pth \
  --max-iters 5000 \
  --batch-size 8 \
  --block-size 32 \
  --embedding-dim 64
```

Parameters:

- `--model`: Model architecture to use
- `--data`: Path to the training text file
- `--output`: Where to save the trained model
- `--max-iters`: Number of training iterations (default: 10000)
- `--batch-size`: Batch size (default: 4)
- `--block-size`: Context length (default: 16; see the sampling sketch below)
- `--embedding-dim`: Embedding dimension (default: 32)
- `--n`: N-gram size for the ngram model (default: 4)
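
To make `--batch-size` and `--block-size` concrete: each training step samples `batch-size` random windows of `block-size` consecutive characters, and the targets are the same windows shifted one character to the right. A sketch of this standard sampling step in the style of Karpathy's video (assumes the corpus is already encoded as a 1-D tensor of token ids; not necessarily plunk's exact code):

```python
import torch

def get_batch(data, batch_size=4, block_size=16):
    """Sample batch_size windows of block_size tokens from a 1-D id tensor."""
    starts = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in starts])           # inputs
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in starts])  # next-token targets
    return x, y
```

A larger `--block-size` gives the model more context per prediction, at the cost of memory and compute.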

### Generating Text

Generate a specific number of tokens:

```bash
python src/plunk.py generate \
  --model-path trained_models/my_model.pth \
  --model transformer-bigram \
  --prompt "To be or not to be" \
  --length 500 \
  --embedding-dim 64 \
  --block-size 32
```

Generate indefinitely (streams output until Ctrl+C):

```bash
python src/plunk.py generate \
  --model-path trained_models/my_model.pth \
  --model transformer-bigram \
  --prompt "To be or not to be" \
  --embedding-dim 64 \
  --block-size 32
```

Parameters:

- `--model-path`: Path to the saved model file
- `--model`: Model architecture (must match training)
- `--prompt`: Starting text (optional)
- `--length`: Number of tokens to generate (omit for infinite generation)
- `--embedding-dim`: Must match the training setting
- `--block-size`: Must match the training setting (see the generation sketch below)
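
The reason `--block-size` must match training: generation repeatedly feeds the model at most the last `block-size` tokens of the running sequence, samples one new token, and appends it. A sketch of the usual autoregressive loop (illustrative; plunk's actual `generate` code may differ):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """idx is a (batch, time) tensor of prompt token ids."""
    for _ in range(max_new_tokens):
        context = idx[:, -block_size:]          # crop to the context length
        logits, _ = model(context)              # model returns (logits, loss)
        logits = logits[:, -1, :]               # distribution over the next token
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)  # append and repeat
    return idx
```

Omitting `--length` presumably swaps the bounded loop for `while True`, printing each character as it is decoded.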

## Data

Download training data:

```bash
mkdir -p data
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O data/input.txt
```

## Tokenization

This project uses character-level encoding for simplicity: every unique character in the corpus gets its own integer id. For production use, consider a subword tokenizer such as byte-pair encoding (BPE) instead.
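
A minimal sketch of character-level encoding (names hypothetical):

```python
def build_codec(text):
    """Map each unique character to an integer id and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode, len(chars)

encode, decode, vocab_size = build_codec("to be or not to be")
assert decode(encode("to be")) == "to be"
```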

## Examples

Train a transformer model:

```bash
python src/plunk.py train \
  --model transformer-bigram \
  --data data/preseren.txt \
  --output trained_models/transformer.pth \
  --max-iters 5000 \
  --batch-size 8 \
  --block-size 32 \
  --embedding-dim 64
```

Generate text:

```bash
python src/plunk.py generate \
  --model-path trained_models/transformer.pth \
  --model transformer-bigram \
  --prompt "Hello " \
  --length 300 \
  --embedding-dim 64 \
  --block-size 32
```

## Acknowledgements

Much of the work done here is based directly on Andrej Karpathy's video.
