Skip to content

BPE: honour every Split in a Sequence pre_tokenizer#1064

Closed
justinchuby wants to merge 2 commits into
microsoft:mainfrom
justinchuby:fail-on-sequence-multi-split
Closed

BPE: honour every Split in a Sequence pre_tokenizer#1064
justinchuby wants to merge 2 commits into
microsoft:mainfrom
justinchuby:fail-on-sequence-multi-split

Conversation

@justinchuby
Copy link
Copy Markdown
Contributor

Problem

When the HuggingFace pre_tokenizer is a Sequence of multiple Split nodes, the loader at bpe_tokenizer_model.hpp:96-115 overwrites pre_tokenizer_regex_ on every iteration and only the last Split survives. All earlier regexes are silently dropped.

This breaks tokenisers such as tencent/Hy-MT1.5-1.8B-2bit, whose pre-tokenizer is:

{"type": "Sequence", "pretokenizers": [
  {"type": "Split", "pattern": {"Regex": "\\\\p{N}{1,3}"}, "behavior": "Isolated"},
  {"type": "Split", "pattern": {"Regex": "[一-龥぀-ゟ゠-ヿ]+"}, "behavior": "Isolated"},
  {"type": "Split", "pattern": {"Regex": "[!\"#$%&'()*+,\\\\-./:;<=>?@\\\\[\\\\\\\\\\\\]^_\`{|}~][A-Za-z]+|..."}, "behavior": "Isolated"},
  {"type": "ByteLevel", ...}
]}

Fix

When every Split has behavior: "Isolated", applying them in sequence is equivalent to a single alternation p1|p2|...|pN — each chunk of the final segmentation matches exactly one of the patterns. PreTokenizerWithRegEx::Compile already accepts that union (fast-paths recognised sub-patterns, falls back to std::regex for the rest), so we simply concatenate the Splits with | instead of overwriting.

Unsupported behaviors (Removed, MergedWith*) are now rejected explicitly rather than silently producing wrong tokens.

Test

Added a regression test test_hy_mt_sequence_of_splits in test/test_fast_tokenizer.py that loads the Hy-MT1.5 tokenizer and compares ORT-Extensions output against HuggingFace fast-tokenizer output for both English and Chinese inputs.

Validation

End-to-end verified with a mobius-exported Hy-MT1.5-1.8B-2bit ONNX model running through ORT GenAI:

  • Before: "The cat is sleeping on the chair.""<6c>" (single byte token)
  • After: "The cat is sleeping on the chair.""El gato está durmiendo en la silla."

When the HuggingFace pre_tokenizer is a Sequence containing multiple
Split nodes (with Isolated behavior), the previous loader overwrote
pre_tokenizer_regex_ on every iteration, silently discarding all but
the last Split. That caused mis-tokenisation for models such as
tencent/Hy-MT1.5-1.8B-2bit, whose Sequence contains three Splits
(\\p{N}{1,3}, then a CJK range, then a GPT2-style fallback).

Concatenate the Split regexes with '|' instead — for Isolated behavior
a chained Split is equivalent to a single alternation, which is what
PreTokenizerWithRegEx::Compile already supports. Reject unsupported
Split behaviors (Removed / MergedWith*) explicitly rather than silently
producing wrong tokens.

Also adds a regression test in test_fast_tokenizer.py that exercises
the multi-Split path against the Hy-MT1.5 tokenizer.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Copilot AI review requested due to automatic review settings May 19, 2026 23:44
@justinchuby justinchuby requested a review from a team as a code owner May 19, 2026 23:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes BPE tokenizer loading so that all Split regexes inside a HuggingFace Sequence pre_tokenizer are honored, instead of only the last one. The fix concatenates the Split regexes into a single alternation when all use Isolated behavior, and explicitly rejects unsupported behaviors.

Changes:

  • Accumulate Split regexes from a Sequence pre_tokenizer and combine them via | for compilation.
  • Reject Split behaviors other than Isolated with a clear error.
  • Add a regression test loading tencent/Hy-MT1.5-1.8B-2bit and comparing token IDs against the HuggingFace fast tokenizer.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
operators/tokenizer/bpe_tokenizer_model.hpp Collect every Split regex in a Sequence, combine with |, and validate unsupported behaviors.
test/test_fast_tokenizer.py New regression test using the Hy-MT1.5 tokenizer to verify multi-Split Sequence handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/test_fast_tokenizer.py Outdated
Comment on lines +69 to +70
np.testing.assert_array_equal(
expected_ids[0], actual_ids[n][:expected_ids.shape[1]])
Comment thread test/test_fast_tokenizer.py Outdated
Comment on lines +61 to +62
tokenizer = AutoTokenizer.from_pretrained(
"tencent/Hy-MT1.5-1.8B-2bit", use_fast=True, trust_remote_code=True)
Comment on lines +115 to +120
auto iter_behavior = node.find("behavior");
if (iter_behavior != node.end() && !iter_behavior->is_null()) {
auto behavior = iter_behavior->get<std::string>();
if (behavior != "Isolated") {
return {kOrtxErrorNotImplemented,
std::string("Unsupported Split behavior in Sequence pretokenizer: ") + behavior};
1. Extract ValidateSplitBehavior() and apply it to both the top-level
   single-Split path and the Sequence-of-Splits path. The previous
   patch only validated behavior inside the Sequence loop, leaving the
   single-Split path silently accepting Removed/MergedWith*.

2. Replace the network-dependent Hy-MT1.5 regression test (which
   required trust_remote_code=True and pulled a third-party HF repo at
   test time) with a self-contained synthetic tokenizer built from the
   tokenizers library. The fixture constructs a Sequence with two
   Isolated Splits over the standard ByteLevel alphabet, asserts the
   saved tokenizer.json actually has multiple Splits (so the test can
   not silently regress into the single-Split case), and round-trips
   inputs that exercise both Split patterns.

3. Assert exact token-count equality (after stripping pad_token_id) and
   then array equality, instead of slicing actual_ids to the expected
   length. The previous slice would have let extra trailing tokens slip
   past silently.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
@sayanshaw24
Copy link
Copy Markdown
Collaborator

Hunyuan model support is now added and merged in this similar PR: #1059

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants