feat: integrate PleIAs/CommonLingua byte-level LID model by malteos · Pull Request #4 · commoncrawl/commonlid-eval

malteos · 2026-05-11T10:00:41Z

Summary

Adds CommonLingua (PleIAs/CommonLingua) as a registered LID model under model_id = "commonlingua". CommonLingua is a 2.35M-parameter byte-level model (Apache 2.0) covering 334 languages. It does not fit any existing extra because it ships a custom PyTorch architecture (no HF transformers integration, no tokenizer files).

- New optional extra [commonlingua], which pulls in only torch (no transformers stack).
- Upstream model.py is vendored at src/commonlid/vendor/commonlingua/model.py (Apache 2.0, rev 43fe88d). Vendoring is preferred over weights_only=False plus a remote model.py, so we never execute pickled remote code at load time.
- requires_preprocessing = False: the byte-level model relies on casing as a strong language signal, and the OpenLID normer's lowercasing collapses Latin-script predictions.
- Device selection mirrors AfroLID: MPS > CUDA > CPU.
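The device priority and the safe-loading rationale above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `pick_device`, `detect_device`, and `load_weights` are hypothetical names, and the sketch assumes the checkpoint is a plain `state_dict` of tensors, which the PR does not state.

```python
from __future__ import annotations


def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Return a device string using the MPS > CUDA > CPU priority."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Query torch for available backends (torch is imported lazily)."""
    import torch

    return pick_device(
        mps_available=torch.backends.mps.is_available(),
        cuda_available=torch.cuda.is_available(),
    )


def load_weights(model, checkpoint_path: str, device: str):
    """Load a checkpoint into an instance of the vendored architecture.

    Assumption (hypothetical layout): the file holds a plain state_dict.
    weights_only=True restricts unpickling to tensors and basic containers,
    so arbitrary pickled code in the file is not executed.
    """
    import torch

    state = torch.load(checkpoint_path, map_location=device, weights_only=True)
    model.load_state_dict(state)
    return model.to(device).eval()
```

Keeping the priority logic in a pure function makes it testable on machines without torch or a GPU; only `detect_device` and `load_weights` need the optional dependency.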

Eval results

| Dataset | n_samples | macro F1 (gold) | micro F1 (gold) | accuracy | runtime (MPS) |
|---|---|---|---|---|---|
| `commonlid` | 373,230 | 0.5726 | 0.8034 | 77.58% | 9 min @ 691 samples/s |
| `commonlid_nano` | 1,507 | 0.6563 | 0.7726 | 73.85% | 2.4 s @ 636 samples/s |
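A macro F1 well below micro F1 is the pattern one would expect when a few large languages are predicted well and many small ones are not. The toy example below illustrates that effect. It assumes the standard definitions (macro = unweighted mean of per-label F1, micro = F1 from pooled counts) restricted to labels that occur in the gold data; the exact "(gold)" computation in this repo may differ, and `f1_scores` is a hypothetical helper.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) over the labels present in gold."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1  # the gold label was missed
            fp[p] += 1  # the predicted label was wrong

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    labels = sorted(set(gold))
    macro = sum(f1(tp[lab], fp[lab], fn[lab]) for lab in labels) / len(labels)
    micro = f1(
        sum(tp[lab] for lab in labels),
        sum(fp[lab] for lab in labels),
        sum(fn[lab] for lab in labels),
    )
    return macro, micro


# Toy data: 8 English samples predicted correctly,
# 2 Faroese samples both mislabelled as English.
gold = ["eng"] * 8 + ["fao"] * 2
pred = ["eng"] * 10

macro, micro = f1_scores(gold, pred)
print(f"macro={macro:.3f} micro={micro:.3f}")  # macro=0.444 micro=0.800
```

The single failing low-resource label halves the macro score while barely moving the micro score, since micro weights every sample equally.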

The accuracy on the full CommonLID set (77.58%) is within 0.05 percentage points of the model card's 77.63% strict-accuracy claim.
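The reported figures can be cross-checked with a little arithmetic. The snippet below uses only numbers from the eval table and the model card claim quoted above; the variable and function names are illustrative.

```python
# Figures copied from the eval table and the model card claim quoted above.
runs = {
    "commonlid": {"n_samples": 373_230, "samples_per_s": 691},
    "commonlid_nano": {"n_samples": 1_507, "samples_per_s": 636},
}


def runtime_seconds(n_samples: int, samples_per_s: float) -> float:
    """Wall-clock time implied by a sample count and a throughput."""
    return n_samples / samples_per_s


for name, run in runs.items():
    secs = runtime_seconds(run["n_samples"], run["samples_per_s"])
    print(f"{name}: {secs:.1f} s ({secs / 60:.2f} min)")

# Accuracy gap expressed in percentage points, not as a relative percentage.
measured, claimed = 77.58, 77.63
delta_pp = round(claimed - measured, 2)
print(f"delta: {delta_pp} percentage points")
```

Both implied runtimes (about 540 s and about 2.4 s) agree with the "9 min" and "2.4 s" entries in the table.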

Test plan

- `make lint && make format-check && make typecheck` all run clean
- `make test`: 237 tests pass, coverage 94.6%
- Sanity: `commonlid predict --model commonlingua --text "..."` returns the expected language
- Full eval on commonlid + commonlid_nano produces well-formed summary.json files
- Local Gradio leaderboard (`make leaderboard`) renders both new rows

🤖 Generated with Claude Code

Adds a `commonlingua` model wired into the registry, backed by a vendored copy of upstream's `model.py` (Apache 2.0, rev 43fe88d) so we don't need `weights_only=False` to load remote pickled code. The model is exposed via a new `[commonlingua]` optional extra that pulls in only `torch` (no transformers stack); device selection mirrors AfroLID's MPS > CUDA > CPU. `requires_preprocessing = False` because the byte-level architecture relies on casing as a strong language signal, and the OpenLID normer's lowercasing collapses Latin-script predictions. Eval on the full CommonLID dataset (373,230 samples) gives an accuracy of 77.58%, within 0.05 percentage points of the model card's 77.63% claim.
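To see why lowercasing matters for a byte-level model, compare the UTF-8 bytes of a cased string with those of its lowercased form. This is an illustration in plain Python: it does not call the OpenLID normer or the CommonLingua model, `changed_byte_positions` is a hypothetical helper, and the German sample sentence is an arbitrary choice.

```python
def changed_byte_positions(text: str) -> list[int]:
    """Indices where the UTF-8 bytes differ after lowercasing.

    Assumes lowercasing keeps the byte length, which holds for ASCII
    letters and for the sample below (not for every Unicode string).
    """
    raw = text.encode("utf-8")
    lowered = text.lower().encode("utf-8")
    if len(raw) != len(lowered):
        raise ValueError("lowercasing changed the byte length")
    return [i for i, (a, b) in enumerate(zip(raw, lowered)) if a != b]


sample = "Der Hund sieht die Katze"  # German capitalizes nouns
print(changed_byte_positions(sample))  # [0, 4, 19]
```

The bytes that change are the capitals marking the sentence start and the nouns, which is the sort of casing cue the description above says the model relies on.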

malteos wants to merge 1 commit into main from feat/commonlingua
