A lightweight Transformer that learns where to put commas in Czech sentences. Trained on 11 million sentences (128 tokens each), with multi-GPU training via torchrun (DDP) on 4x RTX 5090 and Wandb logging.
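Comma restoration can be framed as per-token binary classification: the model reads a comma-stripped sentence and predicts, for each token, whether a comma should follow it. A minimal sketch of turning such predictions back into text (the function name and label format are illustrative, not part of this repo):

```python
def restore_commas(tokens, comma_after):
    """Rebuild a sentence from word tokens plus per-token comma predictions.

    tokens      -- list of word tokens with commas stripped
    comma_after -- list of 0/1 labels; 1 means a comma follows that token
    """
    pieces = []
    for tok, label in zip(tokens, comma_after):
        pieces.append(tok + ("," if label else ""))
    return " ".join(pieces)

# Example: "Myslím, že to funguje" with the comma removed
print(restore_commas(["Myslím", "že", "to", "funguje"], [1, 0, 0, 0]))
# → Myslím, že to funguje
```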
For more details about the model and the training process, see my short PDF report here.
You can try out the Czecher model on czecher.cz.
To simplify my own training, and to make it easy for others to reuse, I uploaded a filtered version of the SYN2006PUB corpus to HuggingFace. Available here.
# Get the repository locally
git clone https://github.com/bednarjosef/czecher.git
cd czecher
# Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install packages from requirements.txt
pip install -r requirements.txt
# Run setup.py (creates directories)
python3 setup.py
# Download a specified HF dataset repository into /downloads/
python3 download.py --repo josefbednar/syn2006pub-11m-128-tokens-commas
# Train on a single GPU
python3 train.py
# Or launch distributed training across 4 GPUs (DDP)
torchrun --standalone --nproc_per_node=4 train.py
Important: the max_tokens value used to build the dataset must match the model's max_tokens at train time. If you rebuild with a different sequence length, rebuild both inputs.bin and labels.bin.