This guide provides a high-level architectural overview of the script_bpe repository to help AI Coding Assistants work effectively in this codebase.
This repository implements SCRIPT encoding-based pre-tokenization and BPE/Unigram tokenization for multilingual text processing. The core innovation is using Unicode script properties to create more efficient and robust tokenizers for multilingual data.
Papers:
- BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
- Which Pieces Does Unigram Tokenization Really Need?
The codebase follows a layered architecture with clear separation of concerns:
```
Text Input
    ↓
[Pretokenization Layer] - Chunks text and encodes to base tokens
    ↓
[Corpus Layer] - Pretokenized, partitioned training data
    ↓
[Training Layer] - BPE or Unigram algorithm
    ↓
[Tokenizer Layer] - Trained model for encode/decode
```
Purpose: Transform raw text into sequences of "base tokens" (atomic units for tokenization)
Key Concepts:
- Base tokens: Atomic units - either bytes (UTF-8) or script/index pairs
- Pretokenization: Chunking text into segments before encoding
- Encoding: Converting characters to base token sequences
Two Main Approaches:
- UTF-8 Pretokenizer (`UTF8Pretokenizer`):
  - Encodes each byte as a base token
  - Uses regex patterns (GPT-4, GPT-4o) for chunking
  - Base token = single byte
- Script Pretokenizer (`ScriptPretokenizer`):
  - Uses Unicode script properties to encode characters
  - Each character → (script_block_id, index_within_block)
  - Base token = pair of script tokens
  - Can optionally chunk by script boundaries instead of regex
Configuration System:
- `PretokenizerConfig`: Base configuration (normalization, regex, etc.)
- `ScriptPretokenizerConfig`: Script-specific settings
- `UTF8PretokenizerConfig`: UTF-8-specific settings
- Registry in `pretokenize/__init__.py` maps names → configs
Key Variants (see `PRETOKENIZER_REGISTRY`):
- `bytes_gpt4`: Classic UTF-8 + GPT-4 regex
- `bytes_gpt4o_cb`: UTF-8 + GPT-4o regex + character boundaries
- `scriptenc_cb`: Script encoding with character boundaries (PROPOSED METHOD for BPE)
- `scriptenc_cbi`: Script encoding with character boundaries + inherited enforcement
- `scriptenc_gpt4o_cb`: Hybrid (regex chunking + script encoding)
- `scriptenc_nosplit_cb`: No regex chunking (very slow, for ablations)
- `scriptenc2_cb`, `scriptenc3_cb`: V2/V3 encoding variants
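A minimal usage sketch; `get_pretokenizer` is the documented entry point, but the chunk-and-encode method name below is an assumption (check `pretokenize/pretokenizer.py` for the real API):

```python
from script_bpe import get_pretokenizer

# Registry names come from PRETOKENIZER_REGISTRY (see the variant list above).
utf8_pt = get_pretokenizer("bytes_gpt4")
script_pt = get_pretokenizer("scriptenc_cb")

# Hypothetical method name -- both pretokenizers turn text into a list of
# base-token sequences (one sequence per chunk).
chunks = script_pt.pretokenize("Hello, 世界!")
```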
Character Boundary Enforcement (`enforce_char_boundaries`):
- Prevents merges that would create invalid UTF-8 sequences
- Three levels:
  - `False`: No restrictions (can create partial characters)
  - `True` (cb): Only complete characters allowed
  - `enforce_inherited=True` (cbi): Also prevents inherited scripts at start
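Why this matters: a single codepoint can span several UTF-8 bytes, so unconstrained byte-level merges can split a character.

```python
# 'é' is two UTF-8 bytes, so a byte-level merge can land mid-character.
print(list("né".encode("utf-8")))  # [110, 195, 169]
# With enforce_char_boundaries=True, a merge pairing 110 ('n') with 195 (the
# first byte of 'é') is rejected because the result is not a complete character.
```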
Script Encoding (`scriptencoding.py`):
- `ScriptConfig`: Defines how Unicode scripts/categories map to blocks
- `ScriptBlock`: A group of characters sharing script+category
- Three encoding versions (`ScriptEncodingV1`, `ScriptEncodingV2`, `ScriptEncodingV3`):
  - V1: More granular script categories
  - V2: High-resource scripts get categories, low-resource collapse to ALL
  - V3: Like V2, but with broader space-combining behavior (all scripts get PSF)
- Special handling: Hiragana merged with Han, space-combining scripts
- Registry variants: `scriptenc_cb` (V1), `scriptenc2_cb` (V2), `scriptenc3_cb` (V3)
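For orientation, the "category" half of script+category is available from the standard library; the script half comes from the repo's `ScriptConfig` tables, since `unicodedata` does not expose script properties.

```python
import unicodedata

# Unicode general categories for a few characters; the repo groups these further by script.
for ch in "aあ。9":
    print(ch, unicodedata.category(ch), unicodedata.name(ch))
# a Ll LATIN SMALL LETTER A
# あ Lo HIRAGANA LETTER A
# 。 Po IDEOGRAPHIC FULL STOP
# 9 Nd DIGIT NINE
```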
Purpose: Efficiently store and access pretokenized training data
Architecture:
- `PretokenizedCorpus`: Represents a pretokenized dataset
- Stored as partitioned Parquet files for parallel access
- Format: `(chunk_bytes, count)` pairs
- Metadata includes pretokenizer hash for validation
Storage Structure:
```
results/corpora/
└── {corpus_name}/
    └── {pretokenizer_hash}/
        ├── metadata.json
        ├── part_0000.parquet
        ├── part_0001.parquet
        └── ...
```
Workflow:
- Raw text → `from_texts()` → pretokenize with workers
- Group identical chunks, count frequencies
- Partition into ~128 files for parallel loading
- Workers can iterate their partition independently
Registry (`registry.py`):
- `load_corpus_by_name()`: Lazy-loads or creates corpora
- Supports HuggingFace datasets (OSCAR, CulturaX, finewiki)
- `MONOLINGUAL_DATASETS`: List of 12 language-script pairs (e.g., `eng_latn_300mb`, `kor_hang_300mb`, `zho_hans_300mb`) with ~300MB each, ordered by bytes/char
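Loading (or building) a corpus follows the same pattern as the training example later in this guide:

```python
from script_bpe import get_pretokenizer
from script_bpe.corpus import load_corpus_by_name

pretokenizer = get_pretokenizer("scriptenc_cb")
# Reuses the cached corpus if one exists for this pretokenizer hash,
# otherwise pretokenizes the source dataset and writes the partitions.
corpus = load_corpus_by_name("eng_latn_300mb", pretokenizer)
```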
Two tokenization algorithms, sharing a common base:
- `BaseToken`: Token with id and atomic_tokens sequence
- `BaseTokenizer`: Abstract interface for encode/decode/save/load
- `BaseTrainer`: Training configuration and logging
- Common stats: compression metrics, Renyi entropy, etc.
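For reference, Renyi entropy of order α over the token frequency distribution p is conventionally defined as follows (the exact normalization used by the repo's stats may differ):

$$H_\alpha(p) = \frac{1}{1-\alpha}\log\sum_i p_i^{\alpha}, \qquad \lim_{\alpha \to 1} H_\alpha(p) = -\sum_i p_i \log p_i \quad \text{(Shannon entropy)}$$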
BPE Training (`tokenizers/bpe/trainer.py`):
- Multi-worker architecture using fork-server for parallel processing
- Each worker maintains local chunks and pair counts
- Main loop:
- Find most frequent token pair
- Create new token, broadcast merge to workers
- Workers update local counts, send deltas back
- Aggregate deltas, update global pair counts
- Uses heaps for efficient top-pair selection
- Enforces merge policies from pretokenizer (char boundaries, etc.)
BPE Model (`tokenizers/bpe/tokenizer.py`):
- `MergeRule`: (token_a, token_b) → token_c mapping
- `Token`: Tracks original_count and current_count
- Encoding: Greedy heap-based merge application
- Decoding: Concatenate atomic tokens, pass to pretokenizer
Key Insight: BPE training is compute-intensive because every merge updates counts across the entire corpus. Multi-worker design parallelizes this.
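For orientation, here is a toy single-process version of that loop (a conceptual sketch only; the repo's `BPETrainer` distributes counting and delta updates across workers and additionally enforces the pretokenizer's merge policies):

```python
from collections import Counter

def toy_bpe(chunks, num_merges):
    """Single-process BPE sketch: repeatedly merge the most frequent adjacent pair.
    chunks is a list of (sequence_of_tokens, count) pairs, like the corpus format."""
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs across all chunks, weighted by chunk frequency.
        pair_counts = Counter()
        for seq, count in chunks:
            for pair in zip(seq, seq[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the merge everywhere (the repo parallelizes this step across workers).
        new_chunks = []
        for seq, count in chunks:
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    merged.append(best)  # the new token replaces the pair
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            new_chunks.append((merged, count))
        chunks = new_chunks
    return merges

# Toy run with characters standing in for base tokens.
print(toy_bpe([(list("banana"), 3), (list("bandana"), 1)], num_merges=2))
```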
Unigram Training (`tokenizers/unigram/trainer.py`):
- EM algorithm (Expectation-Maximization):
- E-step: Calculate expected token counts via forward-backward
- M-step: Update token probabilities, prune low-count tokens
- Iterate until vocabulary shrinks to target size
- Initialization (`init_algorithms.py`):
  - Extract frequent substrings from corpus
  - Four strategies: "simple", "corpus_long", "corpus_intermediate", "corpus_fallback"
  - Uses suffix arrays and LCP for efficient pattern mining
- Pruning strategies:
- Training: Viterbi-based (considers alternative segmentations)
- Final: Score-based (faster, just keeps top-N by probability)
- Defensive pruning: Keeps tokens if their alternatives would also be removed
Unigram Model (`tokenizers/unigram/model.py`, `UnigramModel` class):
- `UnigramToken`: Token with log_prob and required flag
- `Trie`: Fast prefix matching for lattice construction
- `Lattice`: Dynamic programming for Viterbi/marginal calculation
- Forward-backward algorithm for E-step
- Viterbi for best path
- Encoding: Viterbi decoding on lattice
- Decoding: Concatenate atomic tokens
- Save/load via JSON + gzip (same pattern as BPE)
Key Insight: Unigram is probabilistic - same text can tokenize differently. Training uses forward-backward to estimate marginal probabilities, but encoding uses Viterbi for deterministic output.
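A toy Viterbi segmentation over a plain string illustrates the encoding step (conceptual sketch; the real `UnigramModel` operates on base-token sequences via its `Trie` and `Lattice`):

```python
import math

def viterbi_segment(text, log_probs):
    """Best segmentation of text under a unigram model over substrings."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (score, backpointer) per position
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack from the end of the string.
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return list(reversed(pieces))

vocab = {"un": -2.0, "igram": -3.0, "u": -4.0, "n": -4.0, "i": -4.0,
         "g": -4.0, "r": -4.0, "a": -4.0, "m": -4.0}
print(viterbi_segment("unigram", vocab))  # ['un', 'igram']
```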
Command-line interface:
`uv run train --corpus <name> -n <vocab_size> --pretokenizer <name> --model bpe|unigram`

Flow:
- Get pretokenizer from registry
- Load/create corpus (cached by pretokenizer hash)
- Create trainer with config
- Train and save to `results/{bpe|unigram}_tokenizers/{corpus}/n{size}/{pretokenizer}.json.gz`
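For example, using names from the corpus and pretokenizer registries above:

```bash
uv run train --corpus eng_latn_300mb -n 64000 --pretokenizer scriptenc_cb --model bpe
```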
Utilities:
- `tokenizer_save_path()`: Consistent path naming
- `load_tokenizers_for_dataset()`: Load all pretokenizer variants for comparison
Type Definitions:
- `TokenSeq = array.array`: Efficient integer sequences
- `PretokenizedT = list[TokenSeq]`: List of chunks
- `InputTokenSeq`: Flexible input type
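Illustration (the `array` typecode is an implementation detail; `"q"` below is an assumption):

```python
import array

# One chunk of integer base-token ids (TokenSeq); a pretokenized document is a
# list of such chunks (PretokenizedT).
chunk = array.array("q", [17, 4, 17, 9])
pretokenized = [chunk]
```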
Multiprocessing:
- Uses `forkserver` context to avoid copy-on-write issues
- `gc.freeze()` before forking for memory efficiency
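A generic illustration of that pattern using only the standard library (not the repo's actual worker code):

```python
import gc
import multiprocessing as mp

def count_tokens(partition):
    # Stand-in worker: the real trainers iterate their corpus partition instead.
    return sum(len(chunk) for chunk in partition)

if __name__ == "__main__":
    ctx = mp.get_context("forkserver")    # avoids the surprises of plain fork
    gc.freeze()                           # keep the GC from dirtying shared pages after forking
    partitions = [[[1, 2, 3]], [[4, 5]]]  # toy stand-in for parquet partitions
    with ctx.Pool(processes=2) as pool:
        print(pool.map(count_tokens, partitions))
```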
Logging:
- Custom logger with uptime tracking
- Used consistently across trainers
Purpose: Utilities for evaluating tokenizers and generating paper results
Key Components:
- `metrics.py`: `evaluate_on_corpus()` - compute compression and entropy/objective metrics (works for both BPE and Unigram)
- `morphscore.py`: `MorphScore` - evaluate morphological quality of tokenization
- `experiments.py`: `get_config_hash()`, `flatten_model_metadata()` - experiment tracking helpers
- `formatting.py`: LaTeX table formatting utilities (`format_with_relchange()`, `format_latex_value()`, `format_tokens_millions()`, `format_vocab_value()`, `mark_biggest_in_group()`)
Usage:
`from script_bpe.analysis import evaluate_on_corpus, MorphScore`

Tests organized by component:
```
tests/
├── pretokenize/   # Pretokenizer correctness (encoding, merge rules, digits)
├── bpe/           # BPE training and merge logic
├── tokenizers/
│   ├── bpe/       # BPE tokenizer functionality
│   ├── unigram/   # Unigram model, lattice, training
│   └── base/      # Base tokenizer interface
├── corpus/        # Corpus creation and loading
├── cli/           # Command-line interface smoke tests
└── conftest.py    # Shared fixtures (taylorswift.txt, pretokenizers)
```
Testing Patterns:
- Use `taylorswift.txt` as small test corpus
- Fixtures for common pretokenizers
- Tests for both correctness and edge cases
- Separate tests for unit components (trie, lattice) vs integration
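A typical test in this style (sketch; the `tokenizer` fixture name here is hypothetical, see `conftest.py` for the real fixtures):

```python
def test_encode_decode_roundtrip(tokenizer):  # hypothetical fixture
    text = "Shake it off, shake it off"
    # encode/decode are part of the BaseTokenizer interface.
    assert tokenizer.decode(tokenizer.encode(text)) == text
```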
- Corpus is tied to a specific pretokenizer via hash
- Prevents accidentally mixing incompatible base token encodings
- Hash includes all config: normalization, regex, script config, etc.
- Pretokenizers: Name → Config → Class instantiation
- Corpora: Name → Load from cache or create from HuggingFace
- Allows easy experimentation with variants
- Both tokenizers save full pretokenizer config
- Enables standalone loading without external dependencies
- JSON + gzip for compression
- BPE: Workers process corpus partitions, communicate deltas
- Unigram: Single-process (EM is sequential)
- Corpus designed for efficient worker iteration
- `token_allowed()`: Checks if merge is valid per pretokenizer rules
- `bpe_merge_allowed()`: Specifically for BPE merge decisions
- Enforces char boundaries, inherited script rules, etc.
Training a tokenizer from Python:

```python
from script_bpe import get_pretokenizer, BPETrainer, BPETrainerConfig
from script_bpe.corpus import load_corpus_by_name
pretokenizer = get_pretokenizer("scriptenc_cb")
corpus = load_corpus_by_name("eng_latn_300mb", pretokenizer)
config = BPETrainerConfig(additional_vocab_size=64000, num_workers=8)
trainer = BPETrainer(pretokenizer, corpus, config)
tokenizer = trainer.train()
tokenizer.save("path/to/tokenizer.json.gz")
```

Paper utilities in `paper_utils/` batch-train all variants:

```bash
bash paper_utils/script_bpe/train_monolingual.sh  # Uses GNU parallel
```

Adding a new pretokenizer variant:
- Define config in `pretokenize/__init__.py` (e.g., new regex pattern)
- Add to `PRETOKENIZER_REGISTRY`
- Existing `Pretokenizer` classes handle instantiation via registry

Evaluating a tokenizer on a corpus:

```python
from script_bpe import BPETokenizer
from script_bpe.corpus import load_corpus_by_name

tokenizer = BPETokenizer.load("path/to/tokenizer.json.gz")
corpus = load_corpus_by_name("test_corpus", tokenizer.pretokenizer)
stats = tokenizer.corpus_performance(corpus)
# stats includes: chars_per_token, bytes_per_token, shannon_bits, etc.
```

Core Logic (read these to understand algorithms):
- `pretokenize/pretokenizer.py` (~340 lines): Pretokenization logic
- `tokenizers/bpe/trainer.py` (~250 lines): BPE training loop
- `tokenizers/bpe/tokenizer.py`: BPE model (encoding, decoding, merge rules)
- `tokenizers/unigram/trainer.py` (~375 lines): Unigram EM algorithm
- `tokenizers/unigram/model.py` (~260 lines): UnigramModel, Trie, Lattice, Viterbi
Configuration (read these to understand variants):
- `pretokenize/__init__.py`: All pretokenizer variants and registry
- `pretokenize/scriptencoding.py`: Script encoding schemes (V1, V2, V3)
- `corpus/registry.py`: Available corpora and HuggingFace dataset loaders
Entry Points:
- `train.py`: CLI training script (supports both BPE and Unigram)
- `__init__.py`: Public API exports (BPETokenizer, UnigramModel, trainers, etc.)
- Base token: Atomic unit for BPE/Unigram (byte or script/index pair)
- Atomic tokens: Synonym for base tokens (see `Token.atomic_tokens`)
- Pretokenization: Chunking and encoding to base tokens
- Chunk: A contiguous sequence of base tokens (one pretokenization unit)
- Merge rule: BPE operation combining two tokens → one token
- Script: Unicode script property (Latin, Arabic, Han, etc.)
- Category: Unicode category (Letter, Mark, Punctuation, etc.)
- Character boundaries: Merge policy preventing partial UTF-8 characters
- Inherited script: Unicode combining marks that inherit script from base char
- Lattice: Graph of possible token segmentations (Unigram)
- Viterbi: Algorithm to find best path through lattice
- Forward-backward: Algorithm to compute marginal probabilities (EM E-step)
- Adding tests: Use existing fixtures in `conftest.py`, add to relevant subdirectory
- Debugging training: Set `verbose=True` in trainer config for detailed logs
- Memory issues: BPE training is memory-intensive; reduce `num_workers` or corpus size
- Pretokenizer experiments: Clone existing config in registry, modify parameters
- Results are cached: Tokenizers saved to `results/`; delete to retrain
- Corpus is cached: Pretokenized corpora in `results/corpora/`, reused across runs
- Pretokenizer hash mismatch: Corpus created with different pretokenizer config
- Slow training: `nosplit` variants don't use regex and process entire docs as one chunk
- Invalid merges: Check `enforce_char_boundaries` and `enforce_inherited` settings
- Empty vocabulary: `additional_vocab_size=0` skips training (useful for corpus prep)
- Import errors: Use `from script_bpe import X`, not `from script_bpe.tokenizers.bpe import X`
- Forgetting uv: Use `uv run` for all Python commands to automatically run in the correct venv
- BPE training: O(n_merges × corpus_size), parallelizable, memory-intensive
- Unigram training: O(iterations × corpus_size), sequential EM, more I/O bound
- Script encoding: ~2x more base tokens than UTF-8, but better multilingual compression
- Corpus loading: Lazy-loaded from partitions, minimal memory footprint
- Serialization: gzip compression ~10x reduction in file size
- `paper_utils/script_bpe/`: BPE paper reproduction utilities (training scripts, compression notebooks)
  - Paper: BPE Stays on SCRIPT
  - Scripts: `train_monolingual.sh`, `train_multilingual.sh`
  - Notebooks: `monolingual_compression.ipynb`, `multilingual_compression.ipynb`, `blocks.ipynb`
- `paper_utils/unigram/`: Unigram paper reproduction utilities (table generation, hyperparameter tuning)
  - Paper: Which Pieces Does Unigram Tokenization Really Need?
  - Scripts: `run_all_experiments.sh`, `train_hyperparameters.py`
  - Table generators: `generate_main_tables.py`, `generate_appendix_tables.py`, `generate_finewiki_comparison.py`
  - Analysis: `generate_morphscore_examples.py`, `generate_scatterplot.py`
- `results/`: Cached tokenizers and corpora (not in git)
- `tests/data/taylorswift.txt`: Small test corpus