
Tokenizers: From Text to Numbers

How neural networks understand language.

the problem

Neural networks work with numbers. Text is strings. You need to convert "Hello world" into [1, 42, 7] somehow. But which conversion?

  • Character-level: ['H','e','l','l','o',' ','w','o','r','l','d']
  • Word-level: ['Hello', 'world']
  • BPE: ['Hel', 'lo', 'world']

Each choice shapes what your model learns.

character-level: simple and universal

class CharTokenizer:
    def __init__(self):
        self.stoi = {}  # character -> index
        self.itos = {}  # index -> character

    def build_vocab(self, texts):
        # every unique character in the corpus gets an id
        chars = set(''.join(texts))
        for i, char in enumerate(sorted(chars)):
            self.stoi[char] = i
            self.itos[i] = char

    def encode(self, text):
        return [self.stoi[char] for char in text]

    def decode(self, indices):
        return ''.join(self.itos[i] for i in indices)

Pros: Never sees unknown words. Works for any language.
Cons: Long sequences. Hard to learn word meanings.
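
A minimal round trip with the class above (the one-line corpus is made up for illustration):

tok = CharTokenizer()
tok.build_vocab(["hello world"])   # vocab: ' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w'
ids = tok.encode("hello")          # [3, 2, 4, 4, 5]
tok.decode(ids)                    # 'hello'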

word-level: semantic units

import re
from collections import Counter

class WordTokenizer:
    def __init__(self, min_freq=1):
        self.min_freq = min_freq  # drop words rarer than this
        self.stoi, self.itos = {}, {}

    def tokenize(self, text):
        # lowercase words, with punctuation as separate tokens
        return re.findall(r"\b\w+\b|[^\s\w]", text.lower())

    def build_vocab(self, texts):
        all_tokens = []
        for text in texts:
            all_tokens.extend(self.tokenize(text))

        counts = Counter(all_tokens)
        vocab = [token for token, count in counts.most_common()
                 if count >= self.min_freq]
        self.stoi = {token: i for i, token in enumerate(vocab)}
        self.itos = {i: token for i, token in enumerate(vocab)}

Pros: Captures word meanings. Shorter sequences.
Cons: Large vocabularies. Unknown words become <unk>.
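
This is where the <unk> fallback shows up in practice. A sketch of an encode method for the WordTokenizer above, assuming '<unk>' has been added to the vocabulary (see the special tokens section below):

def encode(self, text):
    # any word not seen during build_vocab maps to the <unk> id
    unk = self.stoi['<unk>']
    return [self.stoi.get(token, unk) for token in self.tokenize(text)]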

bpe: best of both worlds

Start with characters. Merge frequent pairs iteratively.

# Initial: "hello" -> ['h', 'e', 'l', 'l', 'o']
# Count pairs: ('l','l') appears most
# Merge: "hello" -> ['h', 'e', 'll', 'o'] 
# Repeat until vocab_size reached

The algorithm:

def build_vocab(self, texts):
    # word_freqs: word -> count over the training texts
    # Start with character-level splits; </w> marks the end of a word
    vocab = {}
    for word, freq in word_freqs.items():
        vocab[' '.join(word) + ' </w>'] = freq

    # Iteratively merge the most frequent adjacent pair
    # (get_pairs / merge_vocab are small helpers, sketched below)
    for _ in range(num_merges):
        pairs = defaultdict(int)
        for word, freq in vocab.items():
            word_pairs = get_pairs(word.split())
            for pair in word_pairs:
                pairs[pair] += freq

        best_pair = max(pairs, key=pairs.get)
        vocab = merge_vocab(best_pair, vocab)
        self.merges.append(best_pair)
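
The snippet above leans on two helpers, get_pairs and merge_vocab. A minimal sketch of what they could look like (the regex guard stops a pair from being merged inside a longer symbol):

import re

def get_pairs(symbols):
    # adjacent symbol pairs of a split word: ['h', 'e', 'll', 'o'] -> {('h','e'), ('e','ll'), ('ll','o')}
    return set(zip(symbols, symbols[1:]))

def merge_vocab(pair, vocab):
    # rewrite every word, joining the two symbols of `pair` wherever they sit side by side
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}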

Example walkthrough:

Text: "hello hello world"
Initial vocab: {'h e l l o </w>': 2, 'w o r l d </w>': 1}

Iteration 1: Most frequent pair is ('l', 'l')
Merge: {'h e ll o </w>': 2, 'w o r l d </w>': 1}

Iteration 2: Most frequent pair is ('h', 'e') 
Merge: {'he ll o </w>': 2, 'w o r l d </w>': 1}

Continue until vocab_size reached...
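
Training produces an ordered list of merges; encoding new text just replays them. A sketch of a per-word encode step, assuming the self.merges list built above:

def encode_word(self, word):
    # split to characters, then apply the learned merges in training order
    symbols = list(word) + ['</w>']
    for a, b in self.merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols  # subword strings, looked up in stoi for ids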

the three approaches compared

Method   Sequence Length   Unknown Words   Use Case
Char     Very Long         Never           Small data, multilingual
Word     Short             Common          Large vocab, semantic tasks
BPE      Medium            Rare            Production systems

special tokens

Every tokenizer needs these:

special_tokens = ['<pad>', '<unk>', '<bos>', '<eos>']
# <pad>: padding for batches
# <unk>: unknown words  
# <bos>: beginning of sequence
# <eos>: end of sequence
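
These mostly matter when sequences are batched. A sketch, assuming a stoi mapping that contains all four tokens:

bos, eos, pad = stoi['<bos>'], stoi['<eos>'], stoi['<pad>']

def make_batch(sequences, max_len):
    # wrap each sequence with <bos>/<eos>, then right-pad to a common length
    batch = []
    for ids in sequences:
        ids = [bos] + ids[:max_len - 2] + [eos]
        batch.append(ids + [pad] * (max_len - len(ids)))
    return batch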

implementation pattern

All tokenizers follow this interface:

class Tokenizer:
    def build_vocab(self, texts):
        # Build token <-> index mappings (self.stoi, self.itos)
        pass

    def encode(self, text):
        # text -> list of indices
        # (how `tokens` is produced depends on the tokenizer: characters, words, or subwords)
        return [self.stoi[token] for token in tokens]

    def decode(self, indices):
        # indices -> text (join method depends on tokenizer type)
        tokens = [self.itos[i] for i in indices]
        return ''.join(tokens)    # char-level: no spaces
        # return ' '.join(tokens) # word-level: spaces between words