
Llama 3 Tokenizer for Go

A pure Go implementation of the Llama 3 tokenizer, providing exact compatibility with the official tokenization used by Llama 3.0, 3.1, 3.2, and 3.3.

Features

  • Exact Compatibility: Produces identical token sequences to the official implementation
  • Full UTF-8 Support: Handles multilingual text and emojis correctly
  • All Special Tokens: Supports all 256 special tokens including <|begin_of_text|>, <|end_of_text|>, etc.
  • Thread-Safe: Safe for concurrent use with built-in caching
  • Zero Dependencies: Pure Go implementation with only standard library dependencies
  • High Performance: Optimized BPE implementation with caching

Installation

go get github.com/agentstation/tokenizer/llama3

Usage

Basic Usage

package main

import (
    "fmt"
    "github.com/agentstation/tokenizer/llama3"
)

func main() {
    // Create tokenizer with default Llama 3 vocabulary
    tokenizer, err := llama3.New()
    if err != nil {
        panic(err)
    }
    
    // Encode text to tokens
    text := "Hello world!"
    tokens := tokenizer.Encode(text, nil)
    fmt.Printf("Text: %s\n", text)
    fmt.Printf("Tokens: %v\n", tokens)
    // Output: [128000, 9906, 1917, 0, 128001]
    
    // Decode tokens back to text
    decoded := tokenizer.Decode(tokens)
    fmt.Printf("Decoded: %s\n", decoded)
    // Output: <|begin_of_text|>Hello world!<|end_of_text|>
}

Encoding Options

Control the addition of special tokens:

// Without special tokens
opts := &llama3.EncodeOptions{
    BOS: false,  // Don't add <|begin_of_text|>
    EOS: false,  // Don't add <|end_of_text|>
}
tokens := tokenizer.Encode("Hello world!", opts)
// Output: [9906, 1917, 0]

Special Tokens

Work with special tokens:

// Get special token ID
id, err := tokenizer.GetSpecialTokenID("<|end_of_text|>")
if err == nil {
    fmt.Printf("EOT token ID: %d\n", id)
}

// Encode text containing special tokens
text := "<|start_header_id|>system<|end_header_id|>You are a helpful assistant."
tokens := tokenizer.Encode(text, nil)

Advanced Options

Create a tokenizer with custom configuration:

// Create tokenizer with custom cache size
tokenizer, err := llama3.New(
    llama3.WithCacheSize(8192), // Custom cache size (default: 4096)
)
if err != nil {
    panic(err)
}

// Or with custom data files
vocabBase64 := "..." // Base64-encoded vocabulary JSON (about 1.5MB)
mergesBinary := "..." // Base64-encoded binary merge rules (about 1.5MB)
specialTokens := []string{
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    // ... other special tokens
}

tokenizer, err := llama3.New(
    llama3.WithVocabData(vocabBase64, mergesBinary, specialTokens),
)

// Example: Loading from files
vocabData, err := os.ReadFile("vocab_base64.txt")
if err != nil {
    panic(err)
}
mergesData, err := os.ReadFile("merges_binary.txt")
if err != nil {
    panic(err)
}

tokenizer, err = llama3.New(
    llama3.WithVocabData(
        string(vocabData),
        string(mergesData),
        []string{
            "<|begin_of_text|>",
            "<|end_of_text|>",
            "<|start_header_id|>",
            "<|end_header_id|>",
            "<|eot_id|>",
            "<|python_tag|>",
            // Add all 256 special tokens as needed
        },
    ),
)

Optimistic Token Counting

For fine-tuned models with custom special tokens:

// Counts any <|...|> pattern as a special token
count := tokenizer.OptimisticCount("Custom text with <|my_token|> special tokens")

Implementation Details

This implementation follows the Llama 3 tokenization specification:

  1. Pre-tokenization: Uses a state machine to split text into words, numbers, punctuation, and whitespace runs
  2. Byte-level encoding: Converts text to UTF-8 bytes with special character mappings
  3. BPE Algorithm: Applies Byte Pair Encoding with the Llama 3 merge rules
  4. Special token handling: Recognizes and preserves all Llama 3 special tokens

The tokenizer uses a vocabulary size of 128,256 tokens, including:

  • 128,000 base tokens
  • 256 special tokens

Compatibility

Compatible with:

  • Llama 3.0
  • Llama 3.1
  • Llama 3.2
  • Llama 3.3
  • Fine-tuned models based on Llama 3

Full JavaScript Compatibility

This implementation achieves 100% compatibility with the JavaScript reference implementation through a custom state machine that exactly replicates the regex behavior. All edge cases, including complex whitespace patterns, are handled correctly.

For detailed implementation notes and technical design decisions, see IMPLEMENTATION.md.

Data Files

The tokenizer requires two data files:

  • vocab_base64.txt: Base64-encoded vocabulary (1.5MB)
  • merges_binary.txt: Base64-encoded merge rules (1.5MB)

The data files are included in this repository and will be automatically loaded when you use the tokenizer.

These files were extracted from the llama3-tokenizer-js project.

Build Options

Option 1: Embedded Data (Recommended)

# Build with embedded data files
go build -tags embed

# The binary will contain the tokenizer data

Option 2: External Data Files

# Build without embedded data
go build

# Place data files in one of these locations:
# - Same directory as the binary
# - ./llama3/ subdirectory
# - Parent directory

The tokenizer will automatically try to load data from standard locations if not embedded.

Testing

Run the test suite:

go test ./llama3

Run compatibility tests (476 test cases):

go test -run TestCompatibility -v ./llama3

Run benchmarks:

go test -bench=. ./llama3

Performance

The tokenizer is optimized for production use with:

  • Object pooling: Reuses state machines and token buffers for 36% less memory usage
  • BPE caching: Caches merge operations for repeated tokens
  • Efficient data structures: Priority queue for BPE, pre-computed lookups
  • Comprehensive benchmarks: See OPTIMIZATIONS.md for implementation details

Run benchmarks:

go test -bench=. -benchmem ./llama3

License

MIT License - see LICENSE file for details.

Acknowledgments

This implementation is based on the JavaScript llama3-tokenizer-js by belladoreai. The vocabulary and merge data files were extracted from their bundled JavaScript implementation.

llama3

import "github.com/agentstation/tokenizer/llama3"

Package llama3 implements the Llama 3 tokenizer in pure Go.

This package provides exact compatibility with the official Llama 3 tokenization, supporting byte-level BPE (Byte Pair Encoding) tokenization with all special tokens. It is a faithful port of the JavaScript implementation and produces identical token sequences.

Overview

The Llama 3 tokenizer uses a three-stage process:

  1. Pre-tokenization: Text is split into words, whitespace, and punctuation using a state machine that replicates the JavaScript regex behavior
  2. Byte-level encoding: Text is converted to a custom byte representation
  3. BPE algorithm: Subword units are merged according to learned merge rules

The tokenizer uses a vocabulary of 128,256 tokens:

  • 128,000 base tokens
  • 256 special tokens (e.g., <|begin_of_text|>, <|end_of_text|>)

Architecture

┌─────────────┐
│  Input Text │
└──────┬──────┘
       │
       ▼
┌─────────────────┐     ┌─────────────────┐
│ Special Token   │────▶│ State Machine   │
│ Splitting       │     │ Pre-tokenization│
└─────────────────┘     └────────┬────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │ Byte-level      │
                        │ Encoding        │
                        └────────┬────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │ BPE Algorithm   │
                        │ (with caching)  │
                        └────────┬────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │ Token IDs       │
                        └─────────────────┘

Basic Usage

tokenizer, err := llama3.New()
if err != nil {
    log.Fatal(err)
}

// Encode text to token IDs
tokens := tokenizer.Encode("Hello, world!", nil)

// Decode token IDs back to text
text := tokenizer.Decode(tokens)

Advanced Usage

The tokenizer can be configured with various options:

// Create with custom cache size
tokenizer, err := llama3.New(
    llama3.WithCacheSize(1000),
)

// Create with custom vocabulary and merges
tokenizer, err := llama3.New(
    llama3.WithVocabulary(customVocab),
    llama3.WithMerges(customMerges),
    llama3.WithSpecialTokens(customSpecialTokens),
)

State Machine

The pre-tokenization stage uses a custom state machine that exactly replicates the JavaScript regex pattern. This ensures 100% compatibility, including edge cases like negative lookahead for whitespace patterns.

The state machine matches patterns in this order:

  1. Contractions: (?i:'s|'t|'re|'ve|'m|'ll|'d)
  2. Words with prefix: [^\r\n\p{L}\p{N}]?\p{L}+
  3. Numbers: \p{N}{1,3}
  4. Punctuation:  ?[^\s\p{L}\p{N}]+[\r\n]* (the pattern begins with an optional space)
  5. Newlines: \s*[\r\n]+
  6. Whitespace: \s+(?!\S)

Performance

The tokenizer is optimized for production use:

  • Object pooling reduces allocations by 36%
  • BPE results are cached for repeated tokens
  • State machines and token buffers are reused
  • Thread-safe design allows concurrent usage

Memory Management

The package uses sync.Pool for efficient memory management:

  • State machines are pooled and reused
  • Token buffers are pooled (up to 1024 capacity)
  • BPE merge operations use a priority queue

Pool Usage Patterns:

  1. State Machine Pooling (stateMachinePool)
     • Reuses StateMachine instances across tokenization calls
     • Reduces allocations for the input rune slice and token slice
     • The pool does not limit the number of state machines
     • State machines are reset before reuse
  2. Token Buffer Pooling (tokenBufPool)
     • Reuses []string slices for collecting tokens
     • Initial capacity: 64 tokens
     • Maximum pooled capacity: 1024 tokens
     • Buffers exceeding the maximum are not returned to the pool

Memory Lifecycle:

  1. Allocation: First call creates a new instance; subsequent calls may reuse one
  2. Usage: Instance is used for one tokenization operation
  3. Return to Pool: References cleared, slices reset, large buffers discarded
  4. Garbage Collection: Go runtime may clear pools during GC

Performance: Benchmarks show 36% memory reduction with pooling
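The token-buffer pattern described above can be sketched with sync.Pool. This is an illustrative reconstruction of the documented behavior, not the package's actual source; the names tokenBufPool and maxPooledCap follow the documentation.

```go
package main

import (
	"fmt"
	"sync"
)

// maxPooledCap mirrors the documented limit: buffers larger than this
// are discarded rather than returned to the pool.
const maxPooledCap = 1024

var tokenBufPool = sync.Pool{
	New: func() any {
		// Initial capacity of 64 matches the documented default.
		buf := make([]string, 0, 64)
		return &buf
	},
}

func getTokenBuf() *[]string {
	return tokenBufPool.Get().(*[]string)
}

func putTokenBuf(buf *[]string) {
	if cap(*buf) > maxPooledCap {
		return // too large: let the GC reclaim it
	}
	*buf = (*buf)[:0] // reset length, keep capacity
	tokenBufPool.Put(buf)
}

func main() {
	buf := getTokenBuf()
	*buf = append(*buf, "Hello", " world")
	fmt.Println(len(*buf), cap(*buf) >= 64) // 2 true
	putTokenBuf(buf)
}
```

Returning a pointer to the slice (rather than the slice itself) avoids an extra allocation when the value passes through the pool's `any` interface.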

Error Handling

The package defines custom error types for better error handling:

  • DataError: Issues with loading or processing tokenizer data
  • TokenError: Issues with specific tokens or token IDs
  • ConfigError: Issues with tokenizer configuration

All errors implement the error interface and support error wrapping.

Thread Safety

The tokenizer is safe for concurrent use. Multiple goroutines can encode and decode text simultaneously without issues. The internal cache uses read-write mutexes for efficient concurrent access.


Index

Variables

Common errors.

var (
    // ErrDataNotFound indicates that the tokenizer data files could not be found.
    ErrDataNotFound = errors.New("tokenizer data not found")

    // ErrInvalidToken indicates an invalid token was provided.
    ErrInvalidToken = errors.New("invalid token")

    // ErrTokenNotFound indicates a token was not found in the vocabulary.
    ErrTokenNotFound = errors.New("token not found")

    // ErrInvalidTokenID indicates an invalid token ID was provided.
    ErrInvalidTokenID = errors.New("invalid token ID")
)

Scanner option functions - these are re-exported from the scanner package.

var (
    // WithBufferSize sets the internal buffer size for reading.
    // Default is 4096 bytes.
    WithBufferSize = scanner.WithBufferSize

    // WithMaxBuffer sets the maximum buffer size before forcing tokenization.
    // This prevents unbounded memory growth for pathological inputs.
    // Default is 1MB.
    WithMaxBuffer = scanner.WithMaxBuffer

    // WithEncodeOptions sets encoding options for the scanner.
    WithEncodeOptions = func(opts *EncodeOptions) ScannerOption {
        return scanner.WithEncodeOptions(&scanner.EncodeOptions{
            BOS: opts.BOS,
            EOS: opts.EOS,
        })
    }
)

func NewConfigError(field string, value any, err error) error

NewConfigError creates a new ConfigError.

func NewDataError(op, path string, err error) error

NewDataError creates a new DataError.

func NewTokenError(op, token string, err error) error

NewTokenError creates a new TokenError.

func NewTokenIDError(op string, tokenID int, err error) error

NewTokenIDError creates a new TokenError with a token ID.

type BPE

BPE is the interface for Byte Pair Encoding processing. BPE merges frequently occurring character pairs to create subword tokens.

type BPE interface {
    // EncodeBPE applies byte pair encoding to a pre-tokenized string.
    // Returns a slice of token IDs representing the encoded text.
    EncodeBPE(pretoken string) []int
}

type Cache

Cache is the interface for caching BPE results. BPE tokenization can be expensive for repeated text patterns, so caching improves performance significantly.

The cache key is typically the pre-tokenized text string, and the value is the slice of token IDs produced by BPE.

Implementations should be thread-safe if the tokenizer will be used concurrently.

type Cache interface {
    // Get retrieves a cached BPE result.
    // Returns the token IDs and true if found, or nil and false if not cached.
    Get(key string) ([]int, bool)

    // Put stores a BPE result in the cache.
    // The implementation may evict old entries based on its eviction policy.
    Put(key string, value []int)
}
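A minimal thread-safe implementation of this interface might look like the following. This is an illustrative sketch using a plain map with no eviction policy; the package's built-in cache may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// mapCache is a simple thread-safe Cache with no eviction policy.
type mapCache struct {
	mu      sync.RWMutex
	entries map[string][]int
}

func newMapCache() *mapCache {
	return &mapCache{entries: make(map[string][]int)}
}

// Get retrieves a cached BPE result under a read lock.
func (c *mapCache) Get(key string) ([]int, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.entries[key]
	return v, ok
}

// Put stores a BPE result under a write lock.
func (c *mapCache) Put(key string, value []int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = value
}

func main() {
	cache := newMapCache()
	cache.Put("Hello", []int{9906})
	if ids, ok := cache.Get("Hello"); ok {
		fmt.Println(ids) // [9906]
	}
}
```

A read-write mutex fits the expected access pattern: cache hits (reads) dominate, so concurrent readers should not block each other.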

type ConfigError

ConfigError represents an error in tokenizer configuration.

type ConfigError struct {
    Field string // Configuration field that has an error
    Value any    // The invalid value
    Err   error  // Underlying error
}

func (*ConfigError) Error

func (e *ConfigError) Error() string

func (*ConfigError) Unwrap

func (e *ConfigError) Unwrap() error

type DataError

DataError represents an error related to tokenizer data loading or processing.

type DataError struct {
    Op   string // Operation that failed
    Path string // File path if applicable
    Err  error  // Underlying error
}

func (*DataError) Error

func (e *DataError) Error() string

func (*DataError) Unwrap

func (e *DataError) Unwrap() error

type Decoder

Decoder is the interface for decoding tokens to text. This interface is useful for testing and creating mock implementations.

type Decoder interface {
    // Decode converts a sequence of token IDs back to text.
    Decode(tokens []int) string
}

type DecoderFunc

DecoderFunc is an adapter to allow ordinary functions to be used as Decoders. This is useful for creating mock decoders in tests.

type DecoderFunc func(tokens []int) string

func (DecoderFunc) Decode

func (f DecoderFunc) Decode(tokens []int) string

Decode calls f(tokens).

type EncodeOptions

EncodeOptions controls the encoding behavior.

type EncodeOptions struct {
    // BOS adds the beginning-of-text token if true (default: true)
    BOS bool
    // EOS adds the end-of-text token if true (default: true)
    EOS bool
}

type Encoder

Encoder is the interface for encoding text to tokens. This interface is useful for testing and creating mock implementations.

type Encoder interface {
    // Encode converts text to a sequence of token IDs.
    Encode(text string, opts *EncodeOptions) []int
}

type EncoderFunc

EncoderFunc is an adapter to allow ordinary functions to be used as Encoders. This is useful for creating mock encoders in tests.

type EncoderFunc func(text string, opts *EncodeOptions) []int

func (EncoderFunc) Encode

func (f EncoderFunc) Encode(text string, opts *EncodeOptions) []int

Encode calls f(text, opts).

type Option

Option is a functional option for configuring a Tokenizer.

type Option func(*config) error

func WithCacheSize(size int) Option

WithCacheSize sets the maximum size of the BPE cache. Set to 0 to disable caching. Default is unlimited.

func WithDataFiles(vocabPath, mergesPath string) Option

WithDataFiles loads vocabulary and merges from files instead of embedded data. The vocabulary file should contain base64-encoded vocabulary data. The merges file should contain base64-encoded binary merge data.

func WithDataLoader(loader VocabularyDataLoader) Option

WithDataLoader sets a custom data loader for the tokenizer. This allows loading vocabulary and merges from custom sources.

func WithSpecialTokens(tokens []string) Option

WithSpecialTokens sets custom special tokens for the tokenizer. If nil, the default Llama 3 special tokens will be used.

type PreTokenizer

PreTokenizer is the interface for pre-tokenization. Pre-tokenization splits text into words, numbers, punctuation, etc. before the BPE algorithm is applied.

type PreTokenizer interface {
    // PreTokenize splits text into pre-tokens according to the tokenizer's rules.
    // Returns a slice of pre-token strings ready for BPE processing.
    PreTokenize(text string) []string
}

type Scanner

Scanner provides streaming tokenization following the bufio.Scanner pattern. It reads text incrementally and produces tokens one at a time.

type Scanner interface {
    // Scan advances to the next token. Returns false at EOF or on error.
    Scan() bool

    // Token returns the most recent token ID produced by Scan.
    // Valid only after a successful call to Scan.
    Token() int

    // Text returns the text that produced the current token.
    // Valid only after a successful call to Scan.
    Text() string

    // Err returns the first error encountered during scanning.
    Err() error
}

type ScannerOption

ScannerOption configures scanner behavior.

type ScannerOption = scanner.Option

type TokenError

TokenError represents an error related to token operations.

type TokenError struct {
    Token   string // The token that caused the error
    TokenID int    // The token ID if applicable
    Op      string // Operation that failed
    Err     error  // Underlying error
}

func (*TokenError) Error

func (e *TokenError) Error() string

func (*TokenError) Unwrap

func (e *TokenError) Unwrap() error

type Tokenizer

Tokenizer implements the Llama 3 BPE tokenizer.

type Tokenizer struct {
    // contains filtered or unexported fields
}

func New

func New(opts ...Option) (*Tokenizer, error)

New creates a new Llama 3 tokenizer with the given options. If no options are provided, the default Llama 3 vocabulary and settings will be used.

Example:

tokenizer, err := llama3.New()
if err != nil {
    return err
}

// With custom vocabulary:
tokenizer, err := llama3.New(
    llama3.WithVocabulary(customVocab),
    llama3.WithMerges(customMerges),
)

// With cache size limit:
tokenizer, err := llama3.New(
    llama3.WithCacheSize(1000),
)

func (*Tokenizer) AppendTokens

func (t *Tokenizer) AppendTokens(dst []int, text string, opts *EncodeOptions) []int

AppendTokens appends tokens to dst, avoiding allocations when possible. dst can be nil, in which case a new slice is allocated. The resulting slice is returned and may have a different backing array than dst.

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(tokenIDs []int) string

Decode converts a sequence of token IDs back into text.

Example

package main

import (
	"fmt"
	"log"

	"github.com/agentstation/tokenizer/llama3"
)

func main() {
	tokenizer, err := llama3.New()
	if err != nil {
		log.Fatal(err)
	}

	// Decode token IDs back to text
	tokens := []int{9906, 1917, 0}
	text := tokenizer.Decode(tokens)

	fmt.Printf("Decoded text: %s\n", text)
	// Output would be: Hello world!
}

func (*Tokenizer) DecodeBytes

func (t *Tokenizer) DecodeBytes(tokenIDs []int) []byte

DecodeBytes converts a sequence of token IDs back to UTF-8 bytes. This avoids string allocation and is useful for performance-critical paths.

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(text string, opts *EncodeOptions) []int

Encode converts text into a sequence of token IDs. If opts is nil, default options will be used.

Example

package main

import (
	"fmt"
	"log"

	"github.com/agentstation/tokenizer/llama3"
)

func main() {
	// Create a tokenizer
	tokenizer, err := llama3.New()
	if err != nil {
		log.Fatal(err)
	}

	// Encode some text
	text := "Hello, world!"
	tokens := tokenizer.Encode(text, nil)

	fmt.Printf("Text: %s\n", text)
	fmt.Printf("Token count: %d\n", len(tokens))
	// Note: actual output depends on having the Llama 3 data files
}

Example (Without Special Tokens)

package main

import (
	"fmt"
	"log"

	"github.com/agentstation/tokenizer/llama3"
)

func main() {
	tokenizer, err := llama3.New()
	if err != nil {
		log.Fatal(err)
	}

	// Encode without special tokens
	opts := &llama3.EncodeOptions{
		BOS: false,
		EOS: false,
	}

	text := "Hello, world!"
	tokens := tokenizer.Encode(text, opts)

	fmt.Printf("Tokens without BOS/EOS: %d\n", len(tokens))
}

func (*Tokenizer) EncodeBPE

func (t *Tokenizer) EncodeBPE(pretoken string) []int

EncodeBPE implements the BPE interface.

func (*Tokenizer) EncodeBytes

func (t *Tokenizer) EncodeBytes(data []byte, opts *EncodeOptions) []int

EncodeBytes converts bytes into a sequence of token IDs. This avoids string conversion overhead for binary data.

func (*Tokenizer) GetSpecialTokenID

func (t *Tokenizer) GetSpecialTokenID(token string) (int, error)

GetSpecialTokenID returns the token ID for a special token string.

Example

package main

import (
	"fmt"
	"log"

	"github.com/agentstation/tokenizer/llama3"
)

func main() {
	tokenizer, err := llama3.New()
	if err != nil {
		log.Fatal(err)
	}

	// Get the ID of a special token
	tokenID, err := tokenizer.GetSpecialTokenID("<|begin_of_text|>")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Begin-of-text token ID: %d\n", tokenID)
	// Output would be: 128000
}

func (*Tokenizer) NewScanner

func (t *Tokenizer) NewScanner(r io.Reader, opts ...ScannerOption) Scanner

NewScanner creates a scanner for streaming tokenization. The scanner processes input with bounded memory usage, making it suitable for large files or continuous streams.

func (*Tokenizer) OptimisticCount

func (t *Tokenizer) OptimisticCount(text string) int

OptimisticCount returns the token count assuming anything that looks like a special token is actually a special token. This is useful for fine-tuned models with modified special tokens.

func (*Tokenizer) PreTokenize

func (t *Tokenizer) PreTokenize(text string) []string

PreTokenize implements the PreTokenizer interface.

func (*Tokenizer) Process

func (t *Tokenizer) Process(r io.Reader, w io.Writer) (int64, error)

Process handles large files with controlled memory usage. It reads from r, tokenizes the content, and writes token IDs to w. Returns the number of tokens written and any error encountered.

func (*Tokenizer) TokenStream

func (t *Tokenizer) TokenStream(r io.Reader) (<-chan int, <-chan error)

TokenStream provides channel-based streaming for concurrent processing. The tokens channel will be closed when scanning completes. Any error will be sent on the error channel.

func (*Tokenizer) VocabSize

func (t *Tokenizer) VocabSize() int

VocabSize returns the size of the vocabulary including special tokens.

type VocabularyDataLoader

VocabularyDataLoader is the interface for loading tokenizer vocabulary data. This includes vocabulary and merge rules needed for tokenization.

Implementations can load data from embedded resources, files, or custom sources. The tokenizer will call LoadVocabulary first, then LoadMerges.

type VocabularyDataLoader interface {
    // LoadVocabulary loads and returns the vocabulary tokens.
    // The returned slice contains tokens indexed by their token ID.
    LoadVocabulary() ([]string, error)

    // LoadMerges loads and returns the BPE merge rules.
    // The returned map uses merge identifiers as keys and priorities as values.
    LoadMerges() (map[string]int, error)
}

type VocabularyDataLoaderFunc

VocabularyDataLoaderFunc is an adapter to allow using functions as VocabularyDataLoaders. This is useful for testing or custom data loading logic.

type VocabularyDataLoaderFunc struct {
    VocabFunc  func() ([]string, error)
    MergesFunc func() (map[string]int, error)
}

func (VocabularyDataLoaderFunc) LoadMerges

func (d VocabularyDataLoaderFunc) LoadMerges() (map[string]int, error)

LoadMerges calls the MergesFunc.

func (VocabularyDataLoaderFunc) LoadVocabulary

func (d VocabularyDataLoaderFunc) LoadVocabulary() ([]string, error)

LoadVocabulary calls the VocabFunc.
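For example, a loader backed by in-memory test data can be built from plain functions. The adapter shape is reproduced locally below so the sketch is self-contained; note that the "a b" merge-key format is an assumption for illustration, since the interface documents only that keys are merge identifiers.

```go
package main

import "fmt"

// Local copy of the documented adapter shape, for a self-contained sketch.
type VocabularyDataLoaderFunc struct {
	VocabFunc  func() ([]string, error)
	MergesFunc func() (map[string]int, error)
}

// LoadVocabulary calls the VocabFunc.
func (d VocabularyDataLoaderFunc) LoadVocabulary() ([]string, error) { return d.VocabFunc() }

// LoadMerges calls the MergesFunc.
func (d VocabularyDataLoaderFunc) LoadMerges() (map[string]int, error) { return d.MergesFunc() }

func main() {
	// A toy loader with a three-token vocabulary held in memory.
	loader := VocabularyDataLoaderFunc{
		VocabFunc: func() ([]string, error) {
			return []string{"a", "b", "ab"}, nil // token ID = slice index
		},
		MergesFunc: func() (map[string]int, error) {
			return map[string]int{"a b": 0}, nil // assumed key format, top priority
		},
	}

	vocab, _ := loader.LoadVocabulary()
	merges, _ := loader.LoadMerges()
	fmt.Println(len(vocab), len(merges)) // 3 1
}
```

In real use such a loader would be passed to llama3.New via WithDataLoader.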

Generated by gomarkdoc