A CLI tool for training and generating text with various language model architectures.
- Multiple model architectures (bigrams, trigrams, n-grams, attention-based, transformers)
- Character-level tokenization
- Easy-to-use CLI interface
- Model persistence (save/load trained models; see the sketch after this list)
- Customizable hyperparameters
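
The .pth files produced and consumed by the commands below are, presumably, ordinary PyTorch checkpoints. A minimal sketch of that save/load cycle, assuming state_dict-based persistence (this is an illustration, not plunk's actual code):

```python
import os
import torch
import torch.nn as nn

# Hypothetical stand-in for whichever plunk architecture was trained.
model = nn.Embedding(65, 64)

# Save only the learned weights (the state_dict), not the Python class.
os.makedirs("trained_models", exist_ok=True)
torch.save(model.state_dict(), "trained_models/my_model.pth")

# To load, rebuild the model with the SAME hyperparameters, then restore the weights.
restored = nn.Embedding(65, 64)
restored.load_state_dict(torch.load("trained_models/my_model.pth"))
```

A state_dict stores only tensors, so whoever loads it must first rebuild the model with matching hyperparameters; this is why the generate command below repeats --embedding-dim and --block-size.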
To install, clone the repository and install its dependencies:

```bash
git clone <repository-url>
cd plunk
# Install dependencies (torch, etc.), e.g.:
pip install torch
```

List the available models:

```bash
python src/plunk.py list-models
```

Available models:
- base-bigram - Simple bigram baseline (see the sketch after this list)
- bigram - Bigram with embeddings
- trigram - Trigram model
- ngram - N-gram model (configurable n)
- attentive-bigram - Bigram with attention
- computative-bigram - Bigram with computation layers
- transformer-bigram - Full transformer architecture
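
For orientation, the simplest architecture in this list, base-bigram, can be written in a few lines of PyTorch: a single lookup table that maps the current character id straight to logits over the next character. This is an illustrative sketch, not the actual code in src/plunk.py:

```python
import torch.nn as nn
import torch.nn.functional as F

class BigramLM(nn.Module):
    """Each token id indexes a row of logits for the next token."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)  # (batch, time, vocab)
        if targets is None:
            return logits, None
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```

The richer variants (attention, computation layers, the full transformer) keep this interface but replace the single lookup table with deeper stacks of layers.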
Train a model:

```bash
python src/plunk.py train \
--model transformer-bigram \
--data data/input.txt \
--output trained_models/my_model.pth \
--max-iters 5000 \
--batch-size 8 \
--block-size 32 \
--embedding-dim 64
```

Parameters:
- --model: Model architecture to use
- --data: Path to training text file
- --output: Where to save the trained model
- --max-iters: Number of training iterations (default: 10000)
- --batch-size: Batch size (default: 4)
- --block-size: Context length (default: 16; see the batching sketch after this list)
- --embedding-dim: Embedding dimension (default: 32)
- --n: N-gram size for ngram model (default: 4)
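
How --batch-size and --block-size interact during training is easiest to see in code. The sketch below is an assumption about the training loop, in the spirit of the Karpathy material credited at the end of this README rather than a copy of plunk's internals: it samples batch_size random windows of block_size characters, with targets shifted by one character.

```python
import torch

def get_batch(data, block_size=16, batch_size=4):
    """Sample batch_size random contexts of length block_size from a 1-D tensor of token ids."""
    starts = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in starts])          # inputs
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in starts])  # next-char targets
    return x, y

# Example: pretend the training text encodes to 1000 token ids.
data = torch.randint(0, 65, (1000,))
xb, yb = get_batch(data, block_size=16, batch_size=4)
print(xb.shape, yb.shape)  # torch.Size([4, 16]) torch.Size([4, 16])
```

A larger --block-size gives the model more context per prediction at the cost of memory and compute; a larger --batch-size smooths the gradient estimate per iteration.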
Generate a specific number of tokens:
```bash
python src/plunk.py generate \
--model-path trained_models/my_model.pth \
--model transformer-bigram \
--prompt "To be or not to be" \
--length 500 \
--embedding-dim 64 \
--block-size 32
```

Generate indefinitely (streams output until Ctrl+C):

```bash
python src/plunk.py generate \
--model-path trained_models/my_model.pth \
--model transformer-bigram \
--prompt "To be or not to be" \
--embedding-dim 64 \
--block-size 32
```

Parameters:
- --model-path: Path to saved model file
- --model: Model architecture (must match training)
- --prompt: Starting text (optional)
- --length: Number of tokens to generate (omit for infinite generation; see the sketch after this list)
- --embedding-dim: Must match training settings
- --block-size: Must match training settings
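
Generation is autoregressive: the model predicts one character, appends it to the context, and crops the context back to the last block_size characters before predicting again. The sketch below is an assumption about how such a loop works in general (and why --length can simply be omitted for endless streaming), assuming a model whose forward returns (logits, loss) as in the bigram sketch earlier; it is not plunk's exact generate code.

```python
import itertools
import torch

def generate(model, idx, block_size, length=None):
    """Yield sampled token ids one at a time; stream forever when length is None."""
    steps = range(length) if length is not None else itertools.count()
    for _ in steps:
        with torch.no_grad():
            idx_cond = idx[:, -block_size:]              # crop context to the last block_size tokens
            logits, _ = model(idx_cond)                  # (batch, time, vocab)
            probs = torch.softmax(logits[:, -1, :], dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)           # grow the running context
        yield next_id.item()                             # caller decodes and prints; Ctrl+C ends the stream
```

This also shows why --embedding-dim and --block-size must match the training run: the tensors in the .pth file have shapes fixed by those values, and the loop above can never condition on more than block_size characters of context.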
Download training data:
```bash
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O data/input.txt
```

This project uses character-level encoding for simplicity (a short sketch after the list below shows what that means). For production use, consider:
- tiktoken by OpenAI
- sentencepiece by Google
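
Character-level encoding just means the vocabulary is the set of distinct characters in the training file, and each character maps to an integer id. A minimal sketch of the idea (not plunk's actual encoder; the path assumes the download above):

```python
# Build a character-level vocabulary from the training text.
with open("data/input.txt", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))                     # e.g. 65 distinct characters for tiny Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

print(encode("To be"))          # a short list of integer ids
print(decode(encode("To be")))  # "To be"
```

Subword tokenizers such as tiktoken and sentencepiece trade this simplicity for much shorter sequences and a vocabulary learned from data.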
Train a transformer model:
```bash
python src/plunk.py train \
--model transformer-bigram \
--data data/preseren.txt \
--output trained_models/transformer.pth \
--max-iters 5000 \
--batch-size 8 \
--block-size 32 \
--embedding-dim 64
```

Generate text:

```bash
python src/plunk.py generate \
--model-path trained_models/transformer.pth \
--model transformer-bigram \
--prompt "Hello " \
--length 300 \
--embedding-dim 64 \
--block-size 32
```

Much of the work done here is based directly on Andrej Karpathy's video.