kitoken produces ~13% more tokens than HuggingFace on enwik8 with Gemma 3's SentencePiece BPE tokenizer (3,575 vs 3,156 tokens on the first 10KB). The first divergence is at token index 9, where HF produces token 138 but kitoken produces 1055 (see the output below).
This suggests some merges are being skipped or merge ranks are being applied incorrectly for SentencePiece-style BPE models.
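To make the hypothesis concrete, here is an illustrative rank-ordered BPE loop (not kitoken's actual code; `bpe_encode` and its inputs are hypothetical). A correct encoder always applies the lowest-ranked merge present anywhere in the sequence before any higher-ranked one; skipping an applicable pair, or consulting ranks in the wrong order, leaves the text under-merged and produces a longer token stream like the one above.

```python
def bpe_encode(symbols: list[str], merges: dict[tuple[str, str], int]) -> list[str]:
    """Reference rank-ordered BPE: repeatedly merge the lowest-ranked
    adjacent pair until no pair in `merges` remains."""
    while True:
        # Find the best-ranked adjacent pair still present in the sequence.
        best = min(
            (pair for pair in zip(symbols, symbols[1:]) if pair in merges),
            key=merges.get,  # lower rank = earlier in the merge list = higher priority
            default=None,
        )
        if best is None:
            return symbols
        # Merge every occurrence of that pair, left to right.
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out

# Example: bpe_encode(list("▁the"), {("▁", "t"): 0, ("▁t", "h"): 1, ("▁th", "e"): 2})
# yields ["▁the"]; dropping or reordering any of those ranks leaves extra tokens.
```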
Tested with kitoken 0.10.1, tokenizers 0.25.0.
Reproducer
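A minimal sketch of the comparison. The HF side uses the `tokenizers` API as documented; on the kitoken side, `Kitoken.from_file` and the `encode(text, specials)` signature are assumptions based on its README, and the local `tokenizer.json` and `enwik8` paths are placeholders. The script prints the head of each token stream and the first index where they diverge.

```python
from pathlib import Path

from kitoken import Kitoken        # kitoken 0.10.1
from tokenizers import Tokenizer   # tokenizers 0.25.0

# First 10KB of enwik8, decoded leniently in case the slice splits a
# multi-byte character.
text = Path("enwik8").read_bytes()[:10_000].decode("utf-8", errors="ignore")

kt = Kitoken.from_file("tokenizer.json")  # assumed loader name
hf = Tokenizer.from_file("tokenizer.json")

kt_ids = list(kt.encode(text, False))     # False = no special tokens (assumed semantics)
hf_ids = hf.encode(text, add_special_tokens=False).ids

print(f"kitoken: {len(kt_ids)} tokens → {kt_ids[:20]}")
print(f"HF: {len(hf_ids)} tokens → {hf_ids[:20]}")
print(f"Match: {kt_ids == hf_ids}")

# Locate the first index where the two streams diverge.
for i, (a, b) in enumerate(zip(kt_ids, hf_ids)):
    if a != b:
        print(f"First divergence at index {i}: HF={b}, kitoken={a}")
        break
```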
Output:

kitoken: 20 tokens → [878, 6655, 34773, 27688, 7016, 2318, 1955, 235265, 3178, 1055, 6654, 3153, 576, 6655, 17268, 3838, 7016, 2318, 1955, 235265]
HF: 16 tokens → [878, 6655, 34773, 27688, 7016, 2318, 1955, 235265, 3178, 138, 6654, 3153, 576, 6655, 17268, 3838, 7016, 2318, 1955, 235265]
Match: False