BPE: honour every Split in a Sequence pre_tokenizer#1064
Closed
justinchuby wants to merge 2 commits into
Closed
Conversation
When the HuggingFace pre_tokenizer is a Sequence containing multiple
Split nodes (with Isolated behavior), the previous loader overwrote
pre_tokenizer_regex_ on every iteration, silently discarding all but
the last Split. That caused mis-tokenisation for models such as
tencent/Hy-MT1.5-1.8B-2bit, whose Sequence contains three Splits
(\\p{N}{1,3}, then a CJK range, then a GPT2-style fallback).
Concatenate the Split regexes with '|' instead — for Isolated behavior
a chained Split is equivalent to a single alternation, which is what
PreTokenizerWithRegEx::Compile already supports. Reject unsupported
Split behaviors (Removed / MergedWith*) explicitly rather than silently
producing wrong tokens.
Also adds a regression test in test_fast_tokenizer.py that exercises
the multi-Split path against the Hy-MT1.5 tokenizer.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes BPE tokenizer loading so that all Split regexes inside a HuggingFace Sequence pre_tokenizer are honored, instead of only the last one. The fix concatenates the Split regexes into a single alternation when all use Isolated behavior, and explicitly rejects unsupported behaviors.
Changes:
- Accumulate Split regexes from a Sequence pre_tokenizer and combine them via
|for compilation. - Reject Split behaviors other than
Isolatedwith a clear error. - Add a regression test loading
tencent/Hy-MT1.5-1.8B-2bitand comparing token IDs against the HuggingFace fast tokenizer.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| operators/tokenizer/bpe_tokenizer_model.hpp | Collect every Split regex in a Sequence, combine with |, and validate unsupported behaviors. |
| test/test_fast_tokenizer.py | New regression test using the Hy-MT1.5 tokenizer to verify multi-Split Sequence handling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+69
to
+70
| np.testing.assert_array_equal( | ||
| expected_ids[0], actual_ids[n][:expected_ids.shape[1]]) |
Comment on lines
+61
to
+62
| tokenizer = AutoTokenizer.from_pretrained( | ||
| "tencent/Hy-MT1.5-1.8B-2bit", use_fast=True, trust_remote_code=True) |
Comment on lines
+115
to
+120
| auto iter_behavior = node.find("behavior"); | ||
| if (iter_behavior != node.end() && !iter_behavior->is_null()) { | ||
| auto behavior = iter_behavior->get<std::string>(); | ||
| if (behavior != "Isolated") { | ||
| return {kOrtxErrorNotImplemented, | ||
| std::string("Unsupported Split behavior in Sequence pretokenizer: ") + behavior}; |
1. Extract ValidateSplitBehavior() and apply it to both the top-level single-Split path and the Sequence-of-Splits path. The previous patch only validated behavior inside the Sequence loop, leaving the single-Split path silently accepting Removed/MergedWith*. 2. Replace the network-dependent Hy-MT1.5 regression test (which required trust_remote_code=True and pulled a third-party HF repo at test time) with a self-contained synthetic tokenizer built from the tokenizers library. The fixture constructs a Sequence with two Isolated Splits over the standard ByteLevel alphabet, asserts the saved tokenizer.json actually has multiple Splits (so the test can not silently regress into the single-Split case), and round-trips inputs that exercise both Split patterns. 3. Assert exact token-count equality (after stripping pad_token_id) and then array equality, instead of slicing actual_ids to the expected length. The previous slice would have let extra trailing tokens slip past silently. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Justin Chu <justinchu@microsoft.com>
Collaborator
|
Hunyuan model support is now added and merged in this similar PR: #1059 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When the HuggingFace
pre_tokenizeris aSequenceof multipleSplitnodes, the loader atbpe_tokenizer_model.hpp:96-115overwritespre_tokenizer_regex_on every iteration and only the last Split survives. All earlier regexes are silently dropped.This breaks tokenisers such as
tencent/Hy-MT1.5-1.8B-2bit, whose pre-tokenizer is:{"type": "Sequence", "pretokenizers": [ {"type": "Split", "pattern": {"Regex": "\\\\p{N}{1,3}"}, "behavior": "Isolated"}, {"type": "Split", "pattern": {"Regex": "[一-龥-ゟ゠-ヿ]+"}, "behavior": "Isolated"}, {"type": "Split", "pattern": {"Regex": "[!\"#$%&'()*+,\\\\-./:;<=>?@\\\\[\\\\\\\\\\\\]^_\`{|}~][A-Za-z]+|..."}, "behavior": "Isolated"}, {"type": "ByteLevel", ...} ]}Fix
When every Split has
behavior: "Isolated", applying them in sequence is equivalent to a single alternationp1|p2|...|pN— each chunk of the final segmentation matches exactly one of the patterns.PreTokenizerWithRegEx::Compilealready accepts that union (fast-paths recognised sub-patterns, falls back tostd::regexfor the rest), so we simply concatenate the Splits with|instead of overwriting.Unsupported behaviors (
Removed,MergedWith*) are now rejected explicitly rather than silently producing wrong tokens.Test
Added a regression test
test_hy_mt_sequence_of_splitsintest/test_fast_tokenizer.pythat loads the Hy-MT1.5 tokenizer and compares ORT-Extensions output against HuggingFace fast-tokenizer output for both English and Chinese inputs.Validation
End-to-end verified with a mobius-exported Hy-MT1.5-1.8B-2bit ONNX model running through ORT GenAI:
"The cat is sleeping on the chair."→"<6c>"(single byte token)"The cat is sleeping on the chair."→"El gato está durmiendo en la silla."