Skip to content

Support Sequence pre-tokenizer with sequential Split steps#1059

Merged
kunal-vaishnavi merged 2 commits into
microsoft:mainfrom
tanzeel-amd:turrahma/split_fix
May 23, 2026
Merged

Support Sequence pre-tokenizer with sequential Split steps#1059
kunal-vaishnavi merged 2 commits into
microsoft:mainfrom
tanzeel-amd:turrahma/split_fix

Conversation

@tanzeel-amd
Copy link
Copy Markdown
Contributor

@tanzeel-amd tanzeel-amd commented May 8, 2026

Support Sequence pre-tokenizer with sequential Split steps

Problem

HuggingFace tokenizer.json files may specify a Sequence pre-tokenizer containing multiple Split steps followed by ByteLevel (e.g. Hunyuan, GPT-OSS, Qwen 2.5). The existing code on main handles the Sequence loop by overwriting pre_tokenizer_regex_ with each Split entry — only the last Split regex survives, silently discarding earlier steps.

For single-Split models (like GPT-OSS) this happened to work. For multi-Split models (like Hunyuan with 3 Split steps), earlier patterns were lost and tokenization produced incorrect results.

Solution

Implement true sequential pre-tokenization matching the HuggingFace tokenizers Rust library:

Changes

File What
bpe_utils.hpp Add SplitIsolated() — splits text preserving both matched and unmatched regions. Add CJK/Hunyuan hardcoded matchers + pattern table entries.
bpe_tokenizer_model.hpp Store Sequence Split steps in sequence_steps_ vector instead of overwriting a single string. Add NormalizeJsonRegexEscapes() for CR/LF handling from JSON parsing.
bpe_kernels.cc In Tokenize(), branch on HasSequencePreTokenizer(): Sequence models run the sequential pipeline; non-Sequence models use the existing single-regex path unchanged. SpmTokenize() is untouched.
test_fast_tokenizer.py Add test_hunyuan_split_sequence_tokenization — builds a synthetic Hunyuan-style tokenizer (3 Split steps + ByteLevel) and verifies ORT matches HuggingFace across CJK, numbers, tabs, newlines, and mixed content.

Non-Sequence tokenizers (GPT-2, Llama, Phi, CLIP, Falcon, etc.) are completely unaffected — HasSequencePreTokenizer() returns false and the existing code path runs.

Testing

Test Result
test_gpt_oss_chat_template (was failing) PASS
test_hunyuan_split_sequence_tokenization (new) PASS
tencent/Hunyuan-1.8B-Instruct (11 strings) All match HuggingFace
Full test suite (local, Windows) 155 passed, 35 skipped — zero regressions vs main
Models verified GPT-2, GPT-OSS, Llama, Mistral, Phi-3/4, CLIP, BERT, RoBERTa, T5, Falcon, Qwen, DeepSeek, Whisper, OLMa, Hunyuan

Copilot AI review requested due to automatic review settings May 8, 2026 11:09
@tanzeel-amd tanzeel-amd requested a review from a team as a code owner May 8, 2026 11:09
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves BPE tokenizer compatibility with Hugging Face tokenizer.json files that use a Sequence pre-tokenizer containing multiple Split steps (plus ByteLevel), by accumulating all Split regex patterns and fusing them (at higher priority) ahead of the base GPT-2/Llama pre-tokenizer regex to avoid incorrect/degenerate tokenization.

Changes:

  • Accumulate Split regexes from a Sequence pre-tokenizer (instead of overwriting) and fuse them ahead of the base pre-tokenizer regex.
  • Normalize CR/LF bytes in JSON-loaded regex strings to \\r/\\n to align with the compile-time pattern table matching logic.
  • Add hardcoded matchers + compile-time pattern-table entries for Hunyuan-specific Split regex patterns (CJK, punctuation/symbol, and ASCII-punct+letter).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
operators/tokenizer/bpe_utils.hpp Adds new hardcoded matchers and registers Hunyuan-specific regex branches in the pattern table used by PreTokenizerWithRegEx::Compile.
operators/tokenizer/bpe_tokenizer_model.hpp Updates pre-tokenizer loading to collect/normalize multiple Split regexes and returns a fused regex (Split branches + base regex) from GetPreTokenizerRegex.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread operators/tokenizer/bpe_utils.hpp Outdated
Comment thread operators/tokenizer/bpe_tokenizer_model.hpp Outdated
Comment thread operators/tokenizer/bpe_tokenizer_model.hpp Outdated
@sayanshaw24
Copy link
Copy Markdown
Collaborator

This PR's branch predates #1045, which landed on main and added a top-level Split pre-tokenizer handler, no_op_pretokenizer_, and the spm_model parameter to GetPreTokenizerRegex. Please rebase on main — the Sequence loop changes here conflict with that handler, and GetPreTokenizerRegex needs the spm_model parameter restored so SPM models continue to get the Llama regex.

Comment thread operators/tokenizer/bpe_tokenizer_model.hpp Outdated
Comment thread operators/tokenizer/bpe_tokenizer_model.hpp Outdated
Comment thread operators/tokenizer/bpe_utils.hpp Outdated
@tanzeel-amd tanzeel-amd force-pushed the turrahma/split_fix branch from 3c64736 to 8230491 Compare May 21, 2026 08:03
@tanzeel-amd
Copy link
Copy Markdown
Contributor Author

This PR's branch predates #1045, which landed on main and added a top-level Split pre-tokenizer handler, no_op_pretokenizer_, and the spm_model parameter to GetPreTokenizerRegex. Please rebase on main — the Sequence loop changes here conflict with that handler, and GetPreTokenizerRegex needs the spm_model parameter restored so SPM models continue to get the Llama regex.

Rebased this branch on latest microsoft/main. The conflict with #1045 is resolved, and the newer top-level Split pre-tokenizer handling, no_op_pretokenizer_, and spm_model support are preserved.

@VishalX
Copy link
Copy Markdown
Contributor

VishalX commented May 21, 2026

Similar PR: #1064

@VishalX
Copy link
Copy Markdown
Contributor

VishalX commented May 21, 2026

@kunal-vaishnavi pls review.

@sayanshaw24
Copy link
Copy Markdown
Collaborator

@tanzeel-amd @VishalX these changes introduce a tokenizer error in test_gpt_oss_chat_template inside test_pp_api.py, which is causing the CIs to fail:

test_pp_api.py:436: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (<built-in function eq>, [200006, 17360, 200008, 3575, 553, 17554, ...], [200006, 17360, 200008, 3575, 553, 17554, ...])
kwds = {'err_msg': '', 'header': 'Arrays are not equal', 'strict': False, 'verbose': True}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
                   ^^^^^^^^^^^^^^^^^^^
E           AssertionError: 
E           Arrays are not equal

Please try to repro and fix locally with an agent that has context on this session, we can jump in if needed.

@VishalX
Copy link
Copy Markdown
Contributor

VishalX commented May 22, 2026

@tanzeel-amd @VishalX these changes introduce a tokenizer error in test_gpt_oss_chat_template inside test_pp_api.py, which is causing the CIs to fail:

test_pp_api.py:436: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (<built-in function eq>, [200006, 17360, 200008, 3575, 553, 17554, ...], [200006, 17360, 200008, 3575, 553, 17554, ...])
kwds = {'err_msg': '', 'header': 'Arrays are not equal', 'strict': False, 'verbose': True}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
                   ^^^^^^^^^^^^^^^^^^^
E           AssertionError: 
E           Arrays are not equal

Please try to repro and fix locally with an agent that has context on this session, we can jump in if needed.

Thanks @sayanshaw24. We'll test and fix this.

@VishalX
Copy link
Copy Markdown
Contributor

VishalX commented May 22, 2026

@sayanshaw24 could you pls help us with the instructions to run all the tests locally? If they are already in the repo, pls point us to it.

@sayanshaw24
Copy link
Copy Markdown
Collaborator

@sayanshaw24 could you pls help us with the instructions to run all the tests locally? If they are already in the repo, pls point us to it.

Here's how to build and run the tests locally:

# 1. Install the package in development/editable mode (from the repo root)
pip install -e .

# 2. Install dev/test dependencies
pip install -r requirements-dev.txt

# 3. Run the specific failing test
pytest test/test_pp_api.py::test_gpt_oss_chat_template -v

# Or run the full test suite
pytest test

More details are in docs/development.md.

The editable install (pip install -e .) builds the C++ extensions in-place so Python changes take effect immediately. If you need debug symbols, set debug=1 in setup.cfg before building.

The failing test is in test/test_pp_api.py.

@tanzeel-amd tanzeel-amd force-pushed the turrahma/split_fix branch 4 times, most recently from 542f4a4 to 59e91bd Compare May 22, 2026 21:36
@tanzeel-amd tanzeel-amd changed the title Support Split pre-tokenizer in BPE Sequence Support Sequence pre-tokenizer with sequential Split steps May 22, 2026
@tanzeel-amd
Copy link
Copy Markdown
Contributor Author

tanzeel-amd commented May 22, 2026

@sayanshaw24 This CI failure revealed a core flaw in our logic. Rewrote the core logic to follow HuggingFace Tokenizer.
Updated the PR description for details on solution.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment thread operators/tokenizer/bpe_kernels.cc
Comment thread operators/tokenizer/bpe_kernels.cc Outdated
Comment thread operators/tokenizer/bpe_kernels.cc Outdated
Comment thread test/test_fast_tokenizer.py Outdated
Comment thread operators/tokenizer/bpe_kernels.cc
@sayanshaw24
Copy link
Copy Markdown
Collaborator

the new CI failures are from a ClipTokenizer segfault - just looked into it and it's the same size_t underflow Copilot flagged. SplitIsolated can produce space-only chunks that become empty after clean_up_spaces stripping, causing length() - 1 to wrap. Both the empty-guard and the underflow fix are needed.

@tanzeel-amd tanzeel-amd force-pushed the turrahma/split_fix branch from 59e91bd to b573762 Compare May 22, 2026 22:20
@tanzeel-amd
Copy link
Copy Markdown
Contributor Author

Added empty-token guard, safe token_len calculation, and changed space_dif to int64_t. Also fixed a bug in SplitIsolated() where MatchWithSTLRegEx() could silently skip characters - added a test with a non-hardcoded regex pattern ([aeiou]+) to cover this path.

@kunal-vaishnavi kunal-vaishnavi merged commit b077a83 into microsoft:main May 23, 2026
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants