feat: integrate PleIAs/CommonLingua byte-level LID model #4

Open
malteos wants to merge 1 commit into main from feat/commonlingua

malteos (Collaborator) commented May 11, 2026

Summary

Adds CommonLingua (PleIAs/CommonLingua) as a registered LID model under model_id = "commonlingua". CommonLingua is a 2.35M-parameter byte-level model (Apache 2.0) covering 334 languages. It does not fit any existing extra because it ships a custom PyTorch architecture (no HF transformers integration, no tokenizer files).

  • New optional extra [commonlingua] (pulls only torch, no transformers stack).
  • Upstream model.py is vendored at src/commonlid/vendor/commonlingua/model.py (Apache 2.0, rev 43fe88d). Vendoring is preferred over weights_only=False + remote model.py so we never execute pickled remote code at load time.
  • requires_preprocessing = False — byte-level model relies on casing as a strong language signal; the OpenLID normer's lowercasing collapses Latin-script predictions.
  • Device selection mirrors AfroLID: MPS > CUDA > CPU.
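
The device preference in the last bullet can be sketched as a small pure function. This is a minimal illustration, not the code in this PR: `pick_device` is a hypothetical name, and passing the availability flags in as arguments is an assumption made so the ordering can be read and tested without `torch` installed.

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Return a device string using the MPS > CUDA > CPU preference order."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    # With PyTorch installed, the flags would typically come from
    # torch.backends.mps.is_available() and torch.cuda.is_available().
    print(pick_device(mps_available=False, cuda_available=True))
```

Keeping the ordering separate from the availability probes means the preference can be unit-tested on any machine, whatever hardware it has.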

Eval results

| Dataset | n_samples | macro F1 (gold) | micro F1 (gold) | accuracy | runtime (MPS) |
| --- | --- | --- | --- | --- | --- |
| commonlid | 373,230 | 0.5726 | 0.8034 | 77.58% | 9 min @ 691 samples/s |
| commonlid_nano | 1,507 | 0.6563 | 0.7726 | 73.85% | 2.4 s @ 636 samples/s |

The full-CommonLID accuracy lines up with the model card's 77.63% strict-accuracy claim (a difference of 0.05 percentage points).
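
As a quick consistency check on the numbers above, the gap to the model card and the reported runtimes can be recomputed from the stated sample counts and throughputs. This is only arithmetic on figures quoted in this description; the variable and function names are illustrative.

```python
# Figures copied from the eval results and the model card claim quoted above.
reported_accuracy = 77.58    # percent, full CommonLID run
model_card_accuracy = 77.63  # percent, strict accuracy on the model card

# Gap in percentage points, rounded to hide floating-point noise.
delta_pp = round(model_card_accuracy - reported_accuracy, 2)


def runtime_seconds(n_samples: int, samples_per_second: float) -> float:
    """Wall-clock time implied by a sample count and a throughput."""
    return n_samples / samples_per_second


full_minutes = runtime_seconds(373_230, 691) / 60  # commonlid
nano_seconds = runtime_seconds(1_507, 636)         # commonlid_nano

if __name__ == "__main__":
    print(f"delta: {delta_pp} pp")
    print(f"commonlid: {full_minutes:.1f} min")
    print(f"commonlid_nano: {nano_seconds:.1f} s")
```

Both implied runtimes agree with the rounded values reported for the two datasets.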

Test plan

  • make lint && make format-check && make typecheck clean
  • make test — 237 tests pass, coverage 94.6%
  • Sanity: commonlid predict --model commonlingua --text "..." returns expected language
  • Full eval on commonlid + commonlid_nano produces well-formed summary.json files
  • Local Gradio leaderboard (make leaderboard) renders both new rows

🤖 Generated with Claude Code

Adds a `commonlingua` model wired into the registry, backed by a vendored
copy of upstream's `model.py` (Apache 2.0, rev 43fe88d) so we don't need
`weights_only=False` to load remote pickled code. The model is exposed via
a new `[commonlingua]` optional extra that pulls only `torch` (no
transformers stack); device selection mirrors AfroLID's MPS > CUDA > CPU.

`requires_preprocessing = False` because the byte-level architecture relies
on casing as a strong language signal — the OpenLID normer's lowercasing
collapses Latin-script predictions.
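
To illustrate why lowercasing matters for a byte-level model, the sketch below compares the UTF-8 byte sequence of a string before and after lowercasing. It assumes the model consumes raw UTF-8 bytes, which is how "byte-level" is read here; the actual input encoding in the vendored `model.py` is not shown in this description, and `to_byte_ids` is a hypothetical helper.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as a list of UTF-8 byte values (assumed model input)."""
    return list(text.encode("utf-8"))


original = "Guten Morgen"
lowercased = original.lower()

# Positions where the byte sequence changes once casing is removed.
changed_positions = [
    i
    for i, (a, b) in enumerate(zip(to_byte_ids(original), to_byte_ids(lowercased)))
    if a != b
]

if __name__ == "__main__":
    print(to_byte_ids(original))
    print(to_byte_ids(lowercased))
    print(changed_positions)
```

Every capital letter maps to a different byte value than its lowercase form, so a normalizer that lowercases text changes the input the model actually sees.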

Eval on the full CommonLID dataset (373,230 samples) gives an accuracy
of 77.58%, within 0.05 percentage points of the model card's 77.63% claim.