A lightweight Transformer that learns where to put commas in Czech sentences. Trained on 11 million sentences (128 tokens each), with multi-GPU training via torchrun (DDP) on 4x RTX 5090 and Wandb logging.
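Comma restoration can be framed as per-token binary classification: the model reads a comma-stripped sentence and predicts, for each token, whether a comma should follow it. A minimal sketch of turning such predictions back into text (the function name and label format are illustrative, not part of this repo):

```python
def restore_commas(tokens, comma_after):
    """Rebuild a sentence from word tokens plus per-token comma predictions.

    tokens      -- list of word tokens with commas stripped
    comma_after -- list of 0/1 labels; 1 means a comma follows that token
    """
    pieces = []
    for tok, label in zip(tokens, comma_after):
        pieces.append(tok + ("," if label else ""))
    return " ".join(pieces)

# Example: "Myslím, že to funguje" with the comma removed
print(restore_commas(["Myslím", "že", "to", "funguje"], [1, 0, 0, 0]))
# → Myslím, že to funguje
```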
For more details about the model and the training process, see my short PDF report here.
You can try out the Czecher model on czecher.cz.
To simplify my own training, and to make it easy for others to reuse, I uploaded a filtered version of the SYN2006PUB corpus to HuggingFace. Available here.
# Get the repository locally
git clone https://github.com/bednarjosef/czecher.git
cd czecher
# Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install packages from requirements.txt
pip install -r requirements.txt
# Run setup.py (creates directories)
python3 setup.py
# Download a specified HF dataset repository into /downloads/
python3 download.py --repo josefbednar/syn2006pub-11m-128-tokens-commas
# Train on a single GPU
python3 train.py
# Or launch distributed training across 4 GPUs (DDP)
torchrun --standalone --nproc_per_node=4 train.py
Important: the max_tokens value used to build the dataset must match the model's max_tokens at train time. If you rebuild with a different sequence length, rebuild both inputs.bin and labels.bin.