Add Malayalam characters to char classes so periods tokenize correctly (#12898) by vineethsaivs · Pull Request #13978 · explosion/spaCy

vineethsaivs · 2026-06-13T10:22:55Z

Description

In Malayalam text, a sentence-terminal period was not split off from the final word:

import spacy
nlp = spacy.blank("ml")
[t.text for t in nlp("ഭാഷയാണ് മലയാളം.")]
# before: ['ഭാഷയാണ്', 'മലയാളം.']
# after:  ['ഭാഷയാണ്', 'മലയാളം', '.']

Malayalam letters were not part of the _uncased character group in spacy/lang/char_classes.py, so the tokenizer did not treat them as alphabetic characters and the suffix rules never separated the trailing period.
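To illustrate the mechanism, here is a minimal sketch using only the standard `re` module. It is not spaCy's actual suffix rule: `LATIN_ONLY`, `MALAYALAM`, and `split_trailing_period` are hypothetical names, and the pattern is a simplified stand-in for a suffix rule that only fires when the period follows an alphabetic character.

```python
import re

# Hypothetical, simplified stand-ins for the character classes
# (not the real definitions in spacy/lang/char_classes.py).
LATIN_ONLY = "a-zA-Z"
MALAYALAM = "\u0D00-\u0D7F"  # Malayalam Unicode block


def split_trailing_period(word, alpha):
    """Split a final '.' off `word` only if the character before it is in `alpha`."""
    suffix = re.compile(r"(?<=[{a}])\.$".format(a=alpha))
    match = suffix.search(word)
    if match is None:
        # The lookbehind failed: the period stays attached to the word.
        return [word]
    return [word[: match.start()], "."]


word = "മലയാളം."
before = split_trailing_period(word, LATIN_ONLY)
after = split_trailing_period(word, LATIN_ONLY + MALAYALAM)
print(before)
print(after)
```

With the Latin-only class the lookbehind never matches a Malayalam character, so the period stays attached; once the Malayalam range is part of the class, the same rule splits it off.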

Following @svlandeg's suggestion on the issue, this defines the Malayalam Unicode block (U+0D00-U+0D7F) and adds it to _uncased, mirroring the existing Indic scripts (Tamil, Telugu, Kannada). A tokenizer regression test is added under spacy/tests/lang/ml/. Tamil and the existing Malayalam tests are unaffected.
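As a rough sketch of why adding the block is self-contained, the snippet below lists the four Indic Unicode blocks mentioned here, checks that they do not overlap, and builds a combined character class from them. The names `BLOCKS`, `as_range`, `uncased_sketch`, and `block_of` are illustrative assumptions and do not correspond to identifiers in spaCy.

```python
import re

# Standard Unicode block boundaries; the dictionary itself is illustrative.
BLOCKS = {
    "tamil": (0x0B80, 0x0BFF),
    "telugu": (0x0C00, 0x0C7F),
    "kannada": (0x0C80, 0x0CFF),
    "malayalam": (0x0D00, 0x0D7F),
}


def as_range(lo, hi):
    """Render a code point range as a regex character-class fragment."""
    return "\\u{:04X}-\\u{:04X}".format(lo, hi)


def block_of(ch):
    """Return the name of the block containing `ch`, or None."""
    for name, (lo, hi) in BLOCKS.items():
        if lo <= ord(ch) <= hi:
            return name
    return None


# The blocks are disjoint, so adding Malayalam cannot change how the
# other scripts' characters are classified.
ordered = sorted(BLOCKS.values())
disjoint = all(a[1] < b[0] for a, b in zip(ordered, ordered[1:]))

# A combined class in the spirit of an "uncased" group (sketch only).
uncased_sketch = "".join(as_range(lo, hi) for lo, hi in BLOCKS.values())
letter = re.compile("[{}]".format(uncased_sketch))

malayalam_word = "മലയാളം"
all_matched = all(letter.fullmatch(ch) for ch in malayalam_word)
blocks_seen = {block_of(ch) for ch in malayalam_word}
print(disjoint, all_matched, blocks_seen)
```

Every character of the sample word, including the vowel signs and the anusvara, falls inside U+0D00-U+0D7F, which is why a single block range is enough for this case.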

Types of change

Bug fix (Malayalam tokenization).

Checklist

I ran the tests, and all new and existing tests passed (spacy/tests/lang/ml/).
Contributor agreement: I have not yet added a file under .github/contributors/ because the spaCy Contributor Agreement is a copyright assignment that I, the human contributor, must sign myself; I will add .github/contributors/vineethsaivs.md before merge.

Disclosure: prepared with AI assistance; reviewed by me and I can explain every line.

Malayalam letters were not part of the `_uncased` character group, so the tokenizer did not treat them as alphabetic characters and a sentence-terminal period stayed attached to the final word (e.g. "മലയാളം." was a single token instead of "മലയാളം" + ".").

Define the Malayalam Unicode block (U+0D00-U+0D7F) and add it to `_uncased`, mirroring the other Indic scripts (Tamil, Telugu, Kannada). Adds a tokenizer regression test.

Fixes explosion#12898

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vineethsaivs wants to merge 1 commit into explosion:master from vineethsaivs:fix/ml-tokenizer-period
