BPE: honour every Split in a Sequence pre_tokenizer by justinchuby · Pull Request #1064 · microsoft/onnxruntime-extensions

justinchuby · 2026-05-19T23:44:44Z

Problem

When the HuggingFace pre_tokenizer is a Sequence of multiple Split nodes, the loader at bpe_tokenizer_model.hpp:96-115 overwrites pre_tokenizer_regex_ on every iteration and only the last Split survives. All earlier regexes are silently dropped.

This breaks tokenisers such as tencent/Hy-MT1.5-1.8B-2bit, whose pre-tokenizer is:

{"type": "Sequence", "pretokenizers": [
  {"type": "Split", "pattern": {"Regex": "\\\\p{N}{1,3}"}, "behavior": "Isolated"},
  {"type": "Split", "pattern": {"Regex": "[一-龥぀-ゟ゠-ヿ]+"}, "behavior": "Isolated"},
  {"type": "Split", "pattern": {"Regex": "[!\"#$%&'()*+,\\\\-./:;<=>?@\\\\[\\\\\\\\\\\\]^_\`{|}~][A-Za-z]+|..."}, "behavior": "Isolated"},
  {"type": "ByteLevel", ...}
]}

Fix

When every Split has behavior: "Isolated", applying them in sequence is equivalent to a single alternation p1|p2|...|pN — each chunk of the final segmentation matches exactly one of the patterns. PreTokenizerWithRegEx::Compile already accepts that union (fast-paths recognised sub-patterns, falls back to std::regex for the rest), so we simply concatenate the Splits with | instead of overwriting.

Unsupported behaviors (Removed, MergedWith*) are now rejected explicitly rather than silently producing wrong tokens.

Test

Added a regression test test_hy_mt_sequence_of_splits in test/test_fast_tokenizer.py that loads the Hy-MT1.5 tokenizer and compares ORT-Extensions output against HuggingFace fast-tokenizer output for both English and Chinese inputs.

Validation

End-to-end verified with a mobius-exported Hy-MT1.5-1.8B-2bit ONNX model running through ORT GenAI:

Before: "The cat is sleeping on the chair." → "<6c>" (single byte token)
After: "The cat is sleeping on the chair." → "El gato está durmiendo en la silla."

When the HuggingFace pre_tokenizer is a Sequence containing multiple Split nodes (with Isolated behavior), the previous loader overwrote pre_tokenizer_regex_ on every iteration, silently discarding all but the last Split. That caused mis-tokenisation for models such as tencent/Hy-MT1.5-1.8B-2bit, whose Sequence contains three Splits (\\p{N}{1,3}, then a CJK range, then a GPT2-style fallback). Concatenate the Split regexes with '|' instead — for Isolated behavior a chained Split is equivalent to a single alternation, which is what PreTokenizerWithRegEx::Compile already supports. Reject unsupported Split behaviors (Removed / MergedWith*) explicitly rather than silently producing wrong tokens. Also adds a regression test in test_fast_tokenizer.py that exercises the multi-Split path against the Hy-MT1.5 tokenizer. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Justin Chu <justinchu@microsoft.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes BPE tokenizer loading so that all Split regexes inside a HuggingFace Sequence pre_tokenizer are honored, instead of only the last one. The fix concatenates the Split regexes into a single alternation when all use Isolated behavior, and explicitly rejects unsupported behaviors.

Changes:

Accumulate Split regexes from a Sequence pre_tokenizer and combine them via | for compilation.
Reject Split behaviors other than Isolated with a clear error.
Add a regression test loading tencent/Hy-MT1.5-1.8B-2bit and comparing token IDs against the HuggingFace fast tokenizer.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
operators/tokenizer/bpe_tokenizer_model.hpp	Collect every Split regex in a Sequence, combine with `\|`, and validate unsupported behaviors.
test/test_fast_tokenizer.py	New regression test using the Hy-MT1.5 tokenizer to verify multi-Split Sequence handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+            np.testing.assert_array_equal(
+                expected_ids[0], actual_ids[n][:expected_ids.shape[1]])


+        tokenizer = AutoTokenizer.from_pretrained(
+            "tencent/Hy-MT1.5-1.8B-2bit", use_fast=True, trust_remote_code=True)


+        auto iter_behavior = node.find("behavior");
+        if (iter_behavior != node.end() && !iter_behavior->is_null()) {
+          auto behavior = iter_behavior->get<std::string>();
+          if (behavior != "Isolated") {
+            return {kOrtxErrorNotImplemented,
+                    std::string("Unsupported Split behavior in Sequence pretokenizer: ") + behavior};


1. Extract ValidateSplitBehavior() and apply it to both the top-level single-Split path and the Sequence-of-Splits path. The previous patch only validated behavior inside the Sequence loop, leaving the single-Split path silently accepting Removed/MergedWith*. 2. Replace the network-dependent Hy-MT1.5 regression test (which required trust_remote_code=True and pulled a third-party HF repo at test time) with a self-contained synthetic tokenizer built from the tokenizers library. The fixture constructs a Sequence with two Isolated Splits over the standard ByteLevel alphabet, asserts the saved tokenizer.json actually has multiple Splits (so the test can not silently regress into the single-Split case), and round-trips inputs that exercise both Split patterns. 3. Assert exact token-count equality (after stripping pad_token_id) and then array equality, instead of slicing actual_ids to the expected length. The previous slice would have let extra trailing tokens slip past silently. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Justin Chu <justinchu@microsoft.com>

sayanshaw24 · 2026-05-26T16:40:16Z

Hunyuan model support is now added and merged in this similar PR: #1059

Copilot AI review requested due to automatic review settings May 19, 2026 23:44

justinchuby requested a review from a team as a code owner May 19, 2026 23:44

Copilot AI reviewed May 19, 2026

View reviewed changes

Copilot started reviewing on behalf of justinchuby May 19, 2026 23:51 View session

VishalX mentioned this pull request May 21, 2026

Support Sequence pre-tokenizer with sequential Split steps #1059

Merged

sayanshaw24 closed this May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BPE: honour every Split in a Sequence pre_tokenizer#1064

BPE: honour every Split in a Sequence pre_tokenizer#1064
justinchuby wants to merge 2 commits into
microsoft:mainfrom
justinchuby:fail-on-sequence-multi-split

justinchuby commented May 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

sayanshaw24 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		np.testing.assert_array_equal(
		expected_ids[0], actual_ids[n][:expected_ids.shape[1]])

		tokenizer = AutoTokenizer.from_pretrained(
		"tencent/Hy-MT1.5-1.8B-2bit", use_fast=True, trust_remote_code=True)

Conversation

justinchuby commented May 19, 2026

Problem

Fix

Test

Validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

sayanshaw24 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants