SentencePiece BPE produces incorrect output on Gemma 3 #3

@chonknick

Description

Reproducer

import kitoken
from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download

path = hf_hub_download("google/gemma-3-4b-it", "tokenizer.json")
ki = kitoken.Kitoken.from_tokenizers_file(path)
hf = Tokenizer.from_file(path)

text = "In [[political philosophy]], the concept of [[limited government]]"

ki_ids = ki.encode(text)
hf_ids = hf.encode(text, add_special_tokens=False).ids

print(f"kitoken: {len(ki_ids)} tokens → {ki_ids[:15]}")
print(f"HF:      {len(hf_ids)} tokens → {hf_ids[:15]}")
print(f"Match: {ki_ids == hf_ids}")

Output:

kitoken: 20 tokens → [878, 6655, 34773, 27688, 7016, 2318, 1955, 235265, 3178, 1055, 6654, 3153, 576, 6655, 17268]
HF:      16 tokens → [878, 6655, 34773, 27688, 7016, 2318, 1955, 235265, 3178, 138, 6654, 3153, 576, 6655, 17268]
Match: False
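To narrow down where the encodings drift apart, a small helper (hypothetical, not part of either library) can report the first index at which two token-id sequences differ:

```python
# Hypothetical diagnostic helper: locate the first index where two
# token-id sequences diverge. Useful for pinpointing which merge
# decision went wrong.
from itertools import zip_longest

def first_divergence(a, b):
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i
    return None  # sequences are identical

# The first 10 ids from the reproducer output above:
ki = [878, 6655, 34773, 27688, 7016, 2318, 1955, 235265, 3178, 1055]
hf = [878, 6655, 34773, 27688, 7016, 2318, 1955, 235265, 3178, 138]
print(first_divergence(ki, hf))  # → 9
```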

Details

On enwik8, kitoken produces ~13% more tokens than HuggingFace with Gemma 3's SentencePiece BPE tokenizer (3,575 vs. 3,156 tokens on a 10 KB sample). In the reproducer above, the first divergence is at token index 9, where HF produces token 138 but kitoken produces token 1055.

This suggests that some merges are being skipped, or that merge ranks are being applied in the wrong order, for SentencePiece-style BPE models.
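For reference, here is a minimal sketch of rank-based BPE merging as I understand it is supposed to work (an assumption about the intended algorithm, not kitoken's actual implementation): at every step the adjacent pair with the lowest merge rank is applied first. Skipping an applicable merge, or picking pairs in a different order, would leave the text over-segmented and produce extra tokens, matching the symptom above.

```python
def bpe_encode(word, merge_ranks):
    """Encode `word` using a rank table mapping (left, right) -> rank,
    where a lower rank means the merge was learned earlier and must
    be applied first."""
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        pairs = [(merge_ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no applicable merges remain
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table (hypothetical ranks, not Gemma's): "[[" merges
# first, then "l"+"o", then "lo"+"w".
ranks = {("[", "["): 0, ("l", "o"): 1, ("lo", "w"): 2}
print(bpe_encode("[[low", ranks))  # → ['[[', 'low']
```

If kitoken instead scanned merges in a different order, the `[[` pair in the reproducer text could miss its merge and be emitted as two tokens, which would explain the count inflation around wiki-style markup.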

Tested with kitoken 0.10.1, tokenizers 0.25.0.
