Add Malayalam characters to char classes so periods tokenize correctly (#12898) #13978
Open
vineethsaivs wants to merge 1 commit into
Malayalam letters were not part of the `_uncased` character group, so the tokenizer did not treat them as alphabetic characters and a sentence-terminal period stayed attached to the final word (e.g. "മലയാളം." was a single token instead of "മലയാളം" + ".").
Define the Malayalam Unicode block (U+0D00-U+0D7F) and add it to `_uncased`, mirroring the other Indic scripts (Tamil, Telugu, Kannada). Adds a tokenizer regression test.
Fixes explosion#12898
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Description
Fixes #12898.
Malayalam sentence-terminal periods were not split off the final word (e.g. "മലയാളം." was a single token instead of "മലയാളം" + ".").
Malayalam letters were not part of the `_uncased` character group in `spacy/lang/char_classes.py`, so the tokenizer did not treat them as alphabetic characters and the suffix rules never separated the trailing period.
Following @svlandeg's suggestion on the issue, this defines the Malayalam Unicode block (`U+0D00-U+0D7F`) and adds it to `_uncased`, mirroring the existing Indic scripts (Tamil, Telugu, Kannada). A tokenizer regression test is added under `spacy/tests/lang/ml/`. Tamil and the existing Malayalam tests are unaffected.
Types of change
Bug fix (Malayalam tokenization).
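For reviewers who want to see the mechanism in isolation, here is a minimal sketch of why a missing Unicode range keeps the period attached. It uses only the standard `re` module and is not spaCy's actual code: the constant names, the `split_trailing_period` helper, and the single suffix pattern are hypothetical simplifications of what lives in `spacy/lang/char_classes.py` and the suffix rules. Only the Malayalam range (U+0D00-U+0D7F) is taken from this PR; the other ranges are the standard Unicode blocks for those scripts.

```python
import re

# Hypothetical, simplified stand-ins for the character groups; the real
# definitions in spacy/lang/char_classes.py cover many more scripts.
TAMIL = r"\u0B80-\u0BFF"
TELUGU = r"\u0C00-\u0C7F"
KANNADA = r"\u0C80-\u0CFF"
MALAYALAM = r"\u0D00-\u0D7F"  # block named in this PR

UNCASED_BEFORE = TAMIL + TELUGU + KANNADA     # Malayalam missing
UNCASED_AFTER = UNCASED_BEFORE + MALAYALAM    # Malayalam included


def split_trailing_period(word, alpha):
    """Split a final '.' off `word` only if it follows a character in `alpha`.

    This mimics a lookbehind-style suffix rule in a simplified form; it is
    an assumption for illustration, not the tokenizer's real rule set.
    """
    suffix = re.compile(r"(?<=[{a}])\.$".format(a=alpha))
    match = suffix.search(word)
    if match is None:
        return [word]
    return [word[: match.start()], word[match.start():]]


before = split_trailing_period("മലയാളം.", UNCASED_BEFORE)
after = split_trailing_period("മലയാളം.", UNCASED_AFTER)
print(before)  # period stays attached: one token
print(after)   # period is split off: two tokens
```

The character before the period, "ം" (U+0D02), falls inside the Malayalam block, so the lookbehind only succeeds once that range is part of the alphabetic class.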
Checklist
`spacy/tests/lang/ml/`).
`.github/contributors/` because the spaCy Contributor Agreement is a copyright assignment that must be signed by me, the human contributor; I will add `.github/contributors/vineethsaivs.md` before merge.
Disclosure: prepared with AI assistance; reviewed by me and I can explain every line.