SentencePiece BPE produces incorrect output on Gemma 3 #3

@chonknick

Description

Reproducer

import kitoken
from tokenizers import Tokenizer
from huggingface_hub import hf_hub_download

path = hf_hub_download("google/gemma-3-4b-it", "tokenizer.json")
ki = kitoken.Kitoken.from_tokenizers_file(path)
hf = Tokenizer.from_file(path)

text = "In [[political philosophy]], the concept of [[limited government]]"

ki_ids = ki.encode(text)
hf_ids = hf.encode(text, add_special_tokens=False).ids

print(f"kitoken: {len(ki_ids)} tokens → {ki_ids[:15]}")
print(f"HF:      {len(hf_ids)} tokens → {hf_ids[:15]}")
print(f"Match: {ki_ids == hf_ids}")

Output:

kitoken: 20 tokens → [878, 6655, 34773, 27688, 7016, 2318, 1955, 235265, 3178, 1055, 6654, 3153, 576, 6655, 17268]
HF:      16 tokens → [878, 6655, 34773, 27688, 7016, 2318, 1955, 235265, 3178, 138, 6654, 3153, 576, 6655, 17268]
Match: False
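To narrow down where the encodings drift apart, a small helper (hypothetical, not part of either library) can report the first index at which two token-id sequences differ:

```python
# Hypothetical diagnostic helper: locate the first index where two
# token-id sequences diverge. Useful for pinpointing which merge
# decision went wrong.
from itertools import zip_longest

def first_divergence(a, b):
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i
    return None  # sequences are identical

# The first 10 ids from the reproducer output above:
ki = [878, 6655, 34773, 27688, 7016, 2318, 1955, 235265, 3178, 1055]
hf = [878, 6655, 34773, 27688, 7016, 2318, 1955, 235265, 3178, 138]
print(first_divergence(ki, hf))  # → 9
```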

Details

On enwik8, kitoken produces ~13% more tokens than HuggingFace with Gemma 3's SentencePiece BPE tokenizer (3,575 vs. 3,156 tokens on a 10 KB sample). In the reproducer above, the first divergence is at token index 9, where HF produces token 138 but kitoken produces token 1055.

This suggests that some merges are being skipped, or that merge ranks are being applied in the wrong order, for SentencePiece-style BPE models.
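For reference, here is a minimal sketch of rank-based BPE merging as I understand it is supposed to work (an assumption about the intended algorithm, not kitoken's actual implementation): at every step the adjacent pair with the lowest merge rank is applied first. Skipping an applicable merge, or picking pairs in a different order, would leave the text over-segmented and produce extra tokens, matching the symptom above.

```python
def bpe_encode(word, merge_ranks):
    """Encode `word` using a rank table mapping (left, right) -> rank,
    where a lower rank means the merge was learned earlier and must
    be applied first."""
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        pairs = [(merge_ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no applicable merges remain
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table (hypothetical ranks, not Gemma's): "[[" merges
# first, then "l"+"o", then "lo"+"w".
ranks = {("[", "["): 0, ("l", "o"): 1, ("lo", "w"): 2}
print(bpe_encode("[[low", ranks))  # → ['[[', 'low']
```

If kitoken instead scanned merges in a different order, the `[[` pair in the reproducer text could miss its merge and be emitted as two tokens, which would explain the count inflation around wiki-style markup.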

Tested with kitoken 0.10.1, tokenizers 0.25.0.
