czecher

A lightweight Transformer that learns where to place commas in Czech sentences. Trained on 11 million sentences (128 tokens each), with parallel multi-GPU training via torchrun (DDP) on 4x RTX 5090 and Wandb logging.
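Comma placement of this kind is commonly framed as per-token binary classification: for each token, the model predicts whether a comma should follow it. A minimal sketch of how such predictions turn back into text (function and label names are hypothetical, not the repo's actual API):

```python
def apply_commas(tokens, labels):
    """Reinsert commas given per-token predictions.

    `tokens` is a tokenized sentence without commas; `labels[i] == 1`
    means the model predicts a comma immediately after `tokens[i]`.
    """
    out = []
    for tok, lab in zip(tokens, labels):
        out.append(tok + ("," if lab else ""))
    return " ".join(out)

print(apply_commas(["Myslím", "že", "to", "funguje"], [1, 0, 0, 0]))
# → Myslím, že to funguje
```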

For more details about the model and the training process, please view my short PDF report here.

Try it out

You can try out the Czecher model on czecher.cz.

Dataset

To simplify my own training and to make the data easy to reuse, I uploaded a filtered version of the SYN2006PUB corpus to Hugging Face. Available here.

Quickstart - Linux

Get the repository and initialize

# Get the repository locally
git clone https://github.com/bednarjosef/czecher.git
cd czecher

# Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install packages from requirements.txt
pip install -r requirements.txt

# Run setup.py (creates directories)
python3 setup.py

Download the dataset

# Download a specified HF dataset repository into /downloads/
python3 download.py --repo josefbednar/syn2006pub-11m-128-tokens-commas

Single-GPU train

python3 train.py

Multi-GPU (4 GPUs on one node)

torchrun --standalone --nproc_per_node=4 train.py
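torchrun launches one worker process per GPU and identifies each one through environment variables (RANK, LOCAL_RANK, WORLD_SIZE); a DDP training script typically reads these to pin each process to its own device. A minimal sketch of that bookkeeping, independent of the repo's actual train.py:

```python
import os

def ddp_info():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker.
    # Defaulting to single-process values lets the same script also run
    # plainly as `python3 train.py` without torchrun.
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size

rank, local_rank, world_size = ddp_info()
# Each process would then select its GPU, e.g.:
#   torch.cuda.set_device(local_rank)
```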

Important: The max_tokens value used to build the dataset must match the model's max_tokens at train time. If you change the sequence length, rebuild both inputs.bin and labels.bin.
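Because inputs.bin and labels.bin are flat binary files, a mismatch between build-time and train-time max_tokens silently shifts every row rather than raising an error. A quick sanity check along these lines can catch it early (the fixed-row layout and item size are assumptions, not taken from the repo):

```python
import os

def check_shapes(inputs_path, labels_path, max_tokens, itemsize=2):
    """Verify both .bin files divide evenly into rows of `max_tokens` items.

    Assumes each file stores fixed-length rows of `max_tokens` items,
    `itemsize` bytes each (e.g. uint16 ids) -- a hypothetical layout.
    Returns the number of rows (sequences).
    """
    row_bytes = max_tokens * itemsize
    n_inputs, rem_i = divmod(os.path.getsize(inputs_path), row_bytes)
    n_labels, rem_l = divmod(os.path.getsize(labels_path), row_bytes)
    assert rem_i == 0 and rem_l == 0, "file size not a multiple of row size"
    assert n_inputs == n_labels, "inputs/labels row counts differ"
    return n_inputs
```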
