Support Sequence pre-tokenizer with sequential Split steps#1059
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves BPE tokenizer compatibility with Hugging Face tokenizer.json files that use a Sequence pre-tokenizer containing multiple Split steps (plus ByteLevel), by accumulating all Split regex patterns and fusing them (at higher priority) ahead of the base GPT-2/Llama pre-tokenizer regex to avoid incorrect/degenerate tokenization.
Changes:
- Accumulate
Splitregexes from a Sequence pre-tokenizer (instead of overwriting) and fuse them ahead of the base pre-tokenizer regex. - Normalize CR/LF bytes in JSON-loaded regex strings to
\\r/\\nto align with the compile-time pattern table matching logic. - Add hardcoded matchers + compile-time pattern-table entries for Hunyuan-specific Split regex patterns (CJK, punctuation/symbol, and ASCII-punct+letter).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
operators/tokenizer/bpe_utils.hpp |
Adds new hardcoded matchers and registers Hunyuan-specific regex branches in the pattern table used by PreTokenizerWithRegEx::Compile. |
operators/tokenizer/bpe_tokenizer_model.hpp |
Updates pre-tokenizer loading to collect/normalize multiple Split regexes and returns a fused regex (Split branches + base regex) from GetPreTokenizerRegex. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
This PR's branch predates #1045, which landed on main and added a top-level Split pre-tokenizer handler, |
3c64736 to
8230491
Compare
Rebased this branch on latest microsoft/main. The conflict with #1045 is resolved, and the newer top-level Split pre-tokenizer handling, no_op_pretokenizer_, and spm_model support are preserved. |
|
Similar PR: #1064 |
|
@kunal-vaishnavi pls review. |
|
@tanzeel-amd @VishalX these changes introduce a tokenizer error in Please try to repro and fix locally with an agent that has context on this session, we can jump in if needed. |
Thanks @sayanshaw24. We'll test and fix this. |
|
@sayanshaw24 could you pls help us with the instructions to run all the tests locally? If they are already in the repo, pls point us to it. |
Here's how to build and run the tests locally: # 1. Install the package in development/editable mode (from the repo root)
pip install -e .
# 2. Install dev/test dependencies
pip install -r requirements-dev.txt
# 3. Run the specific failing test
pytest test/test_pp_api.py::test_gpt_oss_chat_template -v
# Or run the full test suite
pytest testMore details are in The editable install ( The failing test is in |
542f4a4 to
59e91bd
Compare
|
@sayanshaw24 This CI failure revealed a core flaw in our logic. Rewrote the core logic to follow HuggingFace Tokenizer. |
|
the new CI failures are from a |
59e91bd to
b573762
Compare
|
Added empty-token guard, safe token_len calculation, and changed space_dif to int64_t. Also fixed a bug in SplitIsolated() where MatchWithSTLRegEx() could silently skip characters - added a test with a non-hardcoded regex pattern ([aeiou]+) to cover this path. |
Support Sequence pre-tokenizer with sequential Split steps
Problem
HuggingFace
tokenizer.jsonfiles may specify a Sequence pre-tokenizer containing multipleSplitsteps followed byByteLevel(e.g. Hunyuan, GPT-OSS, Qwen 2.5). The existing code onmainhandles the Sequence loop by overwritingpre_tokenizer_regex_with each Split entry — only the last Split regex survives, silently discarding earlier steps.For single-Split models (like GPT-OSS) this happened to work. For multi-Split models (like Hunyuan with 3 Split steps), earlier patterns were lost and tokenization produced incorrect results.
Solution
Implement true sequential pre-tokenization matching the HuggingFace tokenizers Rust library:
Sequence::pre_tokenize— theforloop that runs each step sequentially on the samePreTokenizedStringSplit::pre_tokenize— splits text using a regex with configurable behaviorIsolatedbehavior — keeps both matched and unmatched regions as separate chunksPattern::find_matchesfor Regex — returns contiguous coverage of the entire input, nothing is droppedByteLevelwithuse_regex: false— only does byte-to-unicode mapping, no GPT-2 regex splittingChanges
bpe_utils.hppSplitIsolated()— splits text preserving both matched and unmatched regions. Add CJK/Hunyuan hardcoded matchers + pattern table entries.bpe_tokenizer_model.hppsequence_steps_vector instead of overwriting a single string. AddNormalizeJsonRegexEscapes()for CR/LF handling from JSON parsing.bpe_kernels.ccTokenize(), branch onHasSequencePreTokenizer(): Sequence models run the sequential pipeline; non-Sequence models use the existing single-regex path unchanged.SpmTokenize()is untouched.test_fast_tokenizer.pytest_hunyuan_split_sequence_tokenization— builds a synthetic Hunyuan-style tokenizer (3 Split steps + ByteLevel) and verifies ORT matches HuggingFace across CJK, numbers, tabs, newlines, and mixed content.Non-Sequence tokenizers (GPT-2, Llama, Phi, CLIP, Falcon, etc.) are completely unaffected —
HasSequencePreTokenizer()returns false and the existing code path runs.Testing
test_gpt_oss_chat_template(was failing)test_hunyuan_split_sequence_tokenization(new)tencent/Hunyuan-1.8B-Instruct(11 strings)main