Support Sequence pre-tokenizer with sequential Split steps by tanzeel-amd · Pull Request #1059 · microsoft/onnxruntime-extensions

tanzeel-amd · 2026-05-08T11:09:57Z

Support Sequence pre-tokenizer with sequential Split steps

Problem

HuggingFace tokenizer.json files may specify a Sequence pre-tokenizer containing multiple Split steps followed by ByteLevel (e.g. Hunyuan, GPT-OSS, Qwen 2.5). The existing code on main handles the Sequence loop by overwriting pre_tokenizer_regex_ with each Split entry — only the last Split regex survives, silently discarding earlier steps.

For single-Split models (like GPT-OSS) this happened to work. For multi-Split models (like Hunyuan with 3 Split steps), earlier patterns were lost and tokenization produced incorrect results.

Solution

Implement true sequential pre-tokenization matching the HuggingFace tokenizers Rust library:

Sequence::pre_tokenize — the for loop that runs each step sequentially on the same PreTokenizedString
Split::pre_tokenize — splits text using a regex with configurable behavior
Isolated behavior — keeps both matched and unmatched regions as separate chunks
Pattern::find_matches for Regex — returns contiguous coverage of the entire input, nothing is dropped
ByteLevel with use_regex: false — only does byte-to-unicode mapping, no GPT-2 regex splitting

Changes

File	What
`bpe_utils.hpp`	Add `SplitIsolated()` — splits text preserving both matched and unmatched regions. Add CJK/Hunyuan hardcoded matchers + pattern table entries.
`bpe_tokenizer_model.hpp`	Store Sequence Split steps in `sequence_steps_` vector instead of overwriting a single string. Add `NormalizeJsonRegexEscapes()` for CR/LF handling from JSON parsing.
`bpe_kernels.cc`	In `Tokenize()`, branch on `HasSequencePreTokenizer()`: Sequence models run the sequential pipeline; non-Sequence models use the existing single-regex path unchanged. `SpmTokenize()` is untouched.
`test_fast_tokenizer.py`	Add `test_hunyuan_split_sequence_tokenization` — builds a synthetic Hunyuan-style tokenizer (3 Split steps + ByteLevel) and verifies ORT matches HuggingFace across CJK, numbers, tabs, newlines, and mixed content.

Non-Sequence tokenizers (GPT-2, Llama, Phi, CLIP, Falcon, etc.) are completely unaffected — HasSequencePreTokenizer() returns false and the existing code path runs.

Testing

Test	Result
`test_gpt_oss_chat_template` (was failing)	PASS
`test_hunyuan_split_sequence_tokenization` (new)	PASS
`tencent/Hunyuan-1.8B-Instruct` (11 strings)	All match HuggingFace
Full test suite (local, Windows)	155 passed, 35 skipped — zero regressions vs `main`
Models verified	GPT-2, GPT-OSS, Llama, Mistral, Phi-3/4, CLIP, BERT, RoBERTa, T5, Falcon, Qwen, DeepSeek, Whisper, OLMa, Hunyuan

Copilot

Pull request overview

This PR improves BPE tokenizer compatibility with Hugging Face tokenizer.json files that use a Sequence pre-tokenizer containing multiple Split steps (plus ByteLevel), by accumulating all Split regex patterns and fusing them (at higher priority) ahead of the base GPT-2/Llama pre-tokenizer regex to avoid incorrect/degenerate tokenization.

Changes:

Accumulate Split regexes from a Sequence pre-tokenizer (instead of overwriting) and fuse them ahead of the base pre-tokenizer regex.
Normalize CR/LF bytes in JSON-loaded regex strings to \\r/\\n to align with the compile-time pattern table matching logic.
Add hardcoded matchers + compile-time pattern-table entries for Hunyuan-specific Split regex patterns (CJK, punctuation/symbol, and ASCII-punct+letter).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`operators/tokenizer/bpe_utils.hpp`	Adds new hardcoded matchers and registers Hunyuan-specific regex branches in the pattern table used by `PreTokenizerWithRegEx::Compile`.
`operators/tokenizer/bpe_tokenizer_model.hpp`	Updates pre-tokenizer loading to collect/normalize multiple Split regexes and returns a fused regex (Split branches + base regex) from `GetPreTokenizerRegex`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sayanshaw24 · 2026-05-15T21:09:38Z

This PR's branch predates #1045, which landed on main and added a top-level Split pre-tokenizer handler, no_op_pretokenizer_, and the spm_model parameter to GetPreTokenizerRegex. Please rebase on main — the Sequence loop changes here conflict with that handler, and GetPreTokenizerRegex needs the spm_model parameter restored so SPM models continue to get the Llama regex.

tanzeel-amd · 2026-05-21T08:46:53Z

This PR's branch predates #1045, which landed on main and added a top-level Split pre-tokenizer handler, no_op_pretokenizer_, and the spm_model parameter to GetPreTokenizerRegex. Please rebase on main — the Sequence loop changes here conflict with that handler, and GetPreTokenizerRegex needs the spm_model parameter restored so SPM models continue to get the Llama regex.

Rebased this branch on latest microsoft/main. The conflict with #1045 is resolved, and the newer top-level Split pre-tokenizer handling, no_op_pretokenizer_, and spm_model support are preserved.

VishalX · 2026-05-21T08:55:31Z

Similar PR: #1064

VishalX · 2026-05-21T08:56:42Z

@kunal-vaishnavi pls review.

sayanshaw24 · 2026-05-22T17:15:38Z

@tanzeel-amd @VishalX these changes introduce a tokenizer error in test_gpt_oss_chat_template inside test_pp_api.py, which is causing the CIs to fail:

test_pp_api.py:436: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (<built-in function eq>, [200006, 17360, 200008, 3575, 553, 17554, ...], [200006, 17360, 200008, 3575, 553, 17554, ...])
kwds = {'err_msg': '', 'header': 'Arrays are not equal', 'strict': False, 'verbose': True}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
                   ^^^^^^^^^^^^^^^^^^^
E           AssertionError: 
E           Arrays are not equal

Please try to repro and fix locally with an agent that has context on this session, we can jump in if needed.

VishalX · 2026-05-22T17:21:38Z

@tanzeel-amd @VishalX these changes introduce a tokenizer error in test_gpt_oss_chat_template inside test_pp_api.py, which is causing the CIs to fail:
test_pp_api.py:436: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (<built-in function eq>, [200006, 17360, 200008, 3575, 553, 17554, ...], [200006, 17360, 200008, 3575, 553, 17554, ...])
kwds = {'err_msg': '', 'header': 'Arrays are not equal', 'strict': False, 'verbose': True}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
                   ^^^^^^^^^^^^^^^^^^^
E           AssertionError: 
E           Arrays are not equal
Please try to repro and fix locally with an agent that has context on this session, we can jump in if needed.

Thanks @sayanshaw24. We'll test and fix this.

VishalX · 2026-05-22T17:28:06Z

@sayanshaw24 could you pls help us with the instructions to run all the tests locally? If they are already in the repo, pls point us to it.

sayanshaw24 · 2026-05-22T18:49:57Z

@sayanshaw24 could you pls help us with the instructions to run all the tests locally? If they are already in the repo, pls point us to it.

Here's how to build and run the tests locally:

# 1. Install the package in development/editable mode (from the repo root)
pip install -e .

# 2. Install dev/test dependencies
pip install -r requirements-dev.txt

# 3. Run the specific failing test
pytest test/test_pp_api.py::test_gpt_oss_chat_template -v

# Or run the full test suite
pytest test

More details are in docs/development.md.

The editable install (pip install -e .) builds the C++ extensions in-place so Python changes take effect immediately. If you need debug symbols, set debug=1 in setup.cfg before building.

The failing test is in test/test_pp_api.py.

tanzeel-amd · 2026-05-22T21:42:36Z

@sayanshaw24 This CI failure revealed a core flaw in our logic. Rewrote the core logic to follow HuggingFace Tokenizer.
Updated the PR description for details on solution.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

sayanshaw24 · 2026-05-22T22:15:23Z

the new CI failures are from a ClipTokenizer segfault - just looked into it and it's the same size_t underflow Copilot flagged. SplitIsolated can produce space-only chunks that become empty after clean_up_spaces stripping, causing length() - 1 to wrap. Both the empty-guard and the underflow fix are needed.

…fix and test

tanzeel-amd · 2026-05-22T22:49:20Z

Added empty-token guard, safe token_len calculation, and changed space_dif to int64_t. Also fixed a bug in SplitIsolated() where MatchWithSTLRegEx() could silently skip characters - added a test with a non-hardcoded regex pattern ([aeiou]+) to cover this path.

Copilot AI review requested due to automatic review settings May 8, 2026 11:09

tanzeel-amd requested a review from a team as a code owner May 8, 2026 11:09

Copilot started reviewing on behalf of tanzeel-amd May 8, 2026 11:10 View session

Copilot AI reviewed May 8, 2026

View reviewed changes

Comment thread operators/tokenizer/bpe_utils.hpp Outdated

Comment thread operators/tokenizer/bpe_tokenizer_model.hpp Outdated

Comment thread operators/tokenizer/bpe_tokenizer_model.hpp Outdated

kunal-vaishnavi requested a review from sayanshaw24 May 15, 2026 17:51

sayanshaw24 reviewed May 15, 2026

View reviewed changes

Comment thread operators/tokenizer/bpe_tokenizer_model.hpp Outdated

sayanshaw24 reviewed May 15, 2026

View reviewed changes

Comment thread operators/tokenizer/bpe_tokenizer_model.hpp Outdated

sayanshaw24 reviewed May 15, 2026

View reviewed changes

Comment thread operators/tokenizer/bpe_utils.hpp Outdated

tanzeel-amd force-pushed the turrahma/split_fix branch from 3c64736 to 8230491 Compare May 21, 2026 08:03

tanzeel-amd force-pushed the turrahma/split_fix branch 4 times, most recently from 542f4a4 to 59e91bd Compare May 22, 2026 21:36

tanzeel-amd changed the title ~~Support Split pre-tokenizer in BPE Sequence~~ Support Sequence pre-tokenizer with sequential Split steps May 22, 2026

tanzeel-amd requested review from Copilot and sayanshaw24 May 22, 2026 21:47

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread operators/tokenizer/bpe_kernels.cc

Comment thread operators/tokenizer/bpe_kernels.cc Outdated

Comment thread operators/tokenizer/bpe_kernels.cc Outdated

Comment thread test/test_fast_tokenizer.py Outdated

sayanshaw24 reviewed May 22, 2026

View reviewed changes

Comment thread operators/tokenizer/bpe_kernels.cc

Support Sequence pre-tokenizer with sequential Split steps

b573762

tanzeel-amd force-pushed the turrahma/split_fix branch from 59e91bd to b573762 Compare May 22, 2026 22:20

Address review: empty-token guard, signed offset, STL regex fallback …

8c455f4

…fix and test

sayanshaw24 approved these changes May 22, 2026

View reviewed changes

kunal-vaishnavi merged commit b077a83 into microsoft:main May 23, 2026
32 checks passed

sayanshaw24 mentioned this pull request May 26, 2026

BPE: honour every Split in a Sequence pre_tokenizer #1064

Closed

Conversation

tanzeel-amd commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Support Sequence pre-tokenizer with sequential Split steps

Problem

Solution

Changes

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sayanshaw24 commented May 15, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tanzeel-amd commented May 21, 2026

Uh oh!

VishalX commented May 21, 2026

Uh oh!

VishalX commented May 21, 2026

Uh oh!

sayanshaw24 commented May 22, 2026

Uh oh!

VishalX commented May 22, 2026

Uh oh!

VishalX commented May 22, 2026

Uh oh!

sayanshaw24 commented May 22, 2026

Uh oh!

tanzeel-amd commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sayanshaw24 commented May 22, 2026

Uh oh!

tanzeel-amd commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

tanzeel-amd commented May 8, 2026 •

edited

Loading

tanzeel-amd commented May 22, 2026 •

edited

Loading