
Add Malayalam characters to char classes so periods tokenize correctly (#12898) #13978

Open
vineethsaivs wants to merge 1 commit into explosion:master from vineethsaivs:fix/ml-tokenizer-period

@vineethsaivs

Description

Fixes #12898.

Malayalam sentence-terminal periods were not split off the final word:

import spacy
nlp = spacy.blank("ml")
[t.text for t in nlp("ഭാഷയാണ് മലയാളം.")]
# before: ['ഭാഷയാണ്', 'മലയാളം.']
# after:  ['ഭാഷയാണ്', 'മലയാളം', '.']

Malayalam letters were not part of the _uncased character group in spacy/lang/char_classes.py, so the tokenizer did not treat them as alphabetic characters and the suffix rules never separated the trailing period.
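The mechanism can be illustrated with a minimal sketch that uses only the standard `re` module. It assumes, as the description above implies, that the period suffix rule only fires when the preceding character belongs to an alphabetic class. The names `LATIN`, `MALAYALAM`, and `make_suffix_splitter` are hypothetical stand-ins, not spaCy's actual definitions.

```python
import re

# Hypothetical, simplified character classes (NOT spaCy's real definitions).
LATIN = "a-zA-Z"
MALAYALAM = "\u0D00-\u0D7F"  # Malayalam Unicode block, as stated in this PR


def make_suffix_splitter(alpha):
    """Return a function that splits a trailing period off a token,
    but only when the character before the period is in `alpha`."""
    pattern = re.compile(rf"(?<=[{alpha}])\.$")

    def split(token):
        match = pattern.search(token)
        if match is None:
            return [token]  # rule did not fire: token stays whole
        return [token[: match.start()], "."]

    return split


# "before": the alphabetic class lacks the Malayalam block.
before = make_suffix_splitter(LATIN)
# "after": the Malayalam block is part of the alphabetic class.
after = make_suffix_splitter(LATIN + MALAYALAM)

word = "മലയാളം."
print(before(word))  # the period stays attached
print(after(word))   # the period becomes its own token
```

In this sketch the lookbehind is what makes the rule script-sensitive: the period itself is matched the same way in both cases, and only the set of characters allowed to precede it changes.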

Following @svlandeg's suggestion on the issue, this defines the Malayalam Unicode block (U+0D00-U+0D7F) and adds it to _uncased, mirroring the existing Indic scripts (Tamil, Telugu, Kannada). A tokenizer regression test is added under spacy/tests/lang/ml/. Tamil and the existing Malayalam tests are unaffected.
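As a quick sanity check on the range itself, the following sketch confirms that every letter in the example sentence falls inside U+0D00-U+0D7F. The helper `in_malayalam_block` is a hypothetical name used for illustration and is not part of the change.

```python
import unicodedata

# Block bounds as stated in this PR description.
ML_START, ML_END = 0x0D00, 0x0D7F


def in_malayalam_block(ch):
    """True if the single character `ch` lies in the Malayalam block."""
    return ML_START <= ord(ch) <= ML_END


sentence = "ഭാഷയാണ് മലയാളം."
# Ignore the space and the period; only the script characters matter here.
letters = [ch for ch in sentence if ch not in " ."]

assert all(in_malayalam_block(ch) for ch in letters)

# List each distinct code point with its Unicode name for inspection.
for ch in sorted(set(letters)):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Note that the sentence contains combining signs (vowel signs, virama, anusvara) as well as base letters, which is why covering the whole block rather than only the base consonants and vowels matters for the character preceding the period.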

Types of change

Bug fix (Malayalam tokenization).

Checklist

  • I ran the tests, and all new and existing tests passed (spacy/tests/lang/ml/).
  • Contributor agreement: I have not yet added a file under .github/contributors/ because the spaCy Contributor Agreement is a copyright assignment that I, the human contributor, must sign myself; I will add .github/contributors/vineethsaivs.md before merge.

Disclosure: prepared with AI assistance; reviewed by me and I can explain every line.

Malayalam letters were not part of the `_uncased` character group, so the
tokenizer did not treat them as alphabetic characters and a sentence-terminal
period stayed attached to the final word (e.g. "മലയാളം." was a single token
instead of "മലയാളം" + ".").

Define the Malayalam Unicode block (U+0D00-U+0D7F) and add it to `_uncased`,
mirroring the other Indic scripts (Tamil, Telugu, Kannada). Adds a tokenizer
regression test.

Fixes explosion#12898

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Sentence-terminal periods not tokenized properly in Malayalam text
