A pure Go implementation of the Llama 3 tokenizer, providing exact compatibility with the official Llama 3 tokenization used in Llama 3.0, 3.1, 3.2, and 3.3 models.
- Exact Compatibility: Produces identical token sequences to the official implementation
- Full UTF-8 Support: Handles multilingual text and emojis correctly
- All Special Tokens: Supports all 256 special tokens, including <|begin_of_text|>, <|end_of_text|>, etc.
- Thread-Safe: Safe for concurrent use with built-in caching
- Zero Dependencies: Pure Go implementation with only standard library dependencies
- High Performance: Optimized BPE implementation with caching
go get github.com/agentstation/tokenizer/llama3

package main
import (
"fmt"
"github.com/agentstation/tokenizer/llama3"
)
func main() {
// Create tokenizer with default Llama 3 vocabulary
tokenizer, err := llama3.New()
if err != nil {
panic(err)
}
// Encode text to tokens
text := "Hello world!"
tokens := tokenizer.Encode(text, nil)
fmt.Printf("Text: %s\n", text)
fmt.Printf("Tokens: %v\n", tokens)
// Output: [128000, 9906, 1917, 0, 128001]
// Decode tokens back to text
decoded := tokenizer.Decode(tokens)
fmt.Printf("Decoded: %s\n", decoded)
// Output: <|begin_of_text|>Hello world!<|end_of_text|>
}

Control the addition of special tokens:
// Without special tokens
opts := &llama3.EncodeOptions{
BOS: false, // Don't add <|begin_of_text|>
EOS: false, // Don't add <|end_of_text|>
}
tokens := tokenizer.Encode("Hello world!", opts)
// Output: [9906, 1917, 0]

Work with special tokens:
// Get special token ID
id, err := tokenizer.GetSpecialTokenID("<|end_of_text|>")
if err == nil {
fmt.Printf("EOT token ID: %d\n", id)
}
// Encode text containing special tokens
text := "<|start_header_id|>system<|end_header_id|>You are a helpful assistant."
tokens := tokenizer.Encode(text, nil)

Create a tokenizer with custom configuration:
// Create tokenizer with custom cache size
tokenizer, err := llama3.New(
llama3.WithCacheSize(8192), // Custom cache size (default: 4096)
)
if err != nil {
panic(err)
}
// Or with custom data files
vocabBase64 := "..." // Base64-encoded vocabulary JSON (about 1.5MB)
mergesBinary := "..." // Base64-encoded binary merge rules (about 1.5MB)
specialTokens := []string{
"<|begin_of_text|>",
"<|end_of_text|>",
"<|start_header_id|>",
"<|end_header_id|>",
// ... other special tokens
}
tokenizer, err := llama3.New(
llama3.WithVocabData(vocabBase64, mergesBinary, specialTokens),
)
// Example: Loading from files
vocabData, err := os.ReadFile("vocab_base64.txt")
if err != nil {
panic(err)
}
mergesData, err := os.ReadFile("merges_binary.txt")
if err != nil {
panic(err)
}
tokenizer, err = llama3.New(
llama3.WithVocabData(
string(vocabData),
string(mergesData),
[]string{
"<|begin_of_text|>",
"<|end_of_text|>",
"<|start_header_id|>",
"<|end_header_id|>",
"<|eot_id|>",
"<|python_tag|>",
// Add all 256 special tokens as needed
},
),
)

For fine-tuned models with custom special tokens:
// Counts any <|...|> pattern as a special token
count := tokenizer.OptimisticCount("Custom text with <|my_token|> special tokens")

This implementation follows the Llama 3 tokenization specification:
- Pre-tokenization: Uses a state machine to split text into words and subwords
- Byte-level encoding: Converts text to UTF-8 bytes with special character mappings
- BPE Algorithm: Applies Byte Pair Encoding with the Llama 3 merge rules
- Special token handling: Recognizes and preserves all Llama 3 special tokens
The tokenizer uses a vocabulary size of 128,256 tokens, including:
- 128,000 base tokens
- 256 special tokens
Compatible with:
- Llama 3.0
- Llama 3.1
- Llama 3.2
- Llama 3.3
- Fine-tuned models based on Llama 3
This implementation achieves 100% compatibility with the JavaScript reference implementation through a custom state machine that exactly replicates the regex behavior. All edge cases, including complex whitespace patterns, are handled correctly.
For detailed implementation notes and technical design decisions, see IMPLEMENTATION.md.
The tokenizer requires two data files:
- vocab_base64.txt: Base64-encoded vocabulary (1.5MB)
- merges_binary.txt: Base64-encoded merge rules (1.5MB)
The data files are included in this repository and will be automatically loaded when you use the tokenizer.
These files were extracted from the llama3-tokenizer-js project.
Option 1: Embedded Data (Recommended)
# Build with embedded data files
go build -tags embed
# The binary will contain the tokenizer data

Option 2: External Data Files
# Build without embedded data
go build
# Place data files in one of these locations:
# - Same directory as the binary
# - ./llama3/ subdirectory
# - Parent directory

The tokenizer will automatically try to load data from standard locations if not embedded.
Run the test suite:
go test ./llama3

Run compatibility tests (476 test cases):

go test -run TestCompatibility -v ./llama3

Run benchmarks:

go test -bench=. ./llama3

The tokenizer is optimized for production use with:
- Object pooling: Reuses state machines and token buffers for 36% less memory usage
- BPE caching: Caches merge operations for repeated tokens
- Efficient data structures: Priority queue for BPE, pre-computed lookups
- Comprehensive benchmarks: See OPTIMIZATIONS.md for implementation details
Run benchmarks:
go test -bench=. -benchmem ./llama3

MIT License - see LICENSE file for details.
This implementation is based on the JavaScript llama3-tokenizer-js by belladoreai. The vocabulary and merge data files were extracted from their bundled JavaScript implementation.
import "github.com/agentstation/tokenizer/llama3"
Package llama3 implements the Llama 3 tokenizer in pure Go.
This package provides exact compatibility with the official Llama 3 tokenization, supporting byte-level BPE (Byte Pair Encoding) tokenization with all special tokens. It is a faithful port of the JavaScript implementation and produces identical token sequences.
The Llama 3 tokenizer uses a three-stage process:
- Pre-tokenization: Text is split into words, whitespace, and punctuation using a state machine that replicates the JavaScript regex behavior
- Byte-level encoding: Text is converted to a custom byte representation
- BPE algorithm: Subword units are merged according to learned merge rules
The tokenizer uses a vocabulary of 128,256 tokens:
- 128,000 base tokens
- 256 special tokens (e.g., <|begin_of_text|>, <|end_of_text|>)
┌─────────────┐
│ Input Text │
└──────┬──────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ Special Token │────▶│ State Machine │
│ Splitting │ │ Pre-tokenization│
└─────────────────┘ └────────┬────────┘
│
▼
┌─────────────────┐
│ Byte-level │
│ Encoding │
└────────┬────────┘
│
▼
┌─────────────────┐
│ BPE Algorithm │
│ (with caching) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Token IDs │
└─────────────────┘
tokenizer, err := llama3.New()
if err != nil {
log.Fatal(err)
}
// Encode text to token IDs
tokens := tokenizer.Encode("Hello, world!", nil)
// Decode token IDs back to text
text := tokenizer.Decode(tokens)
The tokenizer can be configured with various options:
// Create with custom cache size
tokenizer, err := llama3.New(
llama3.WithCacheSize(1000),
)
// Create with custom vocabulary and merges
tokenizer, err := llama3.New(
llama3.WithVocabulary(customVocab),
llama3.WithMerges(customMerges),
llama3.WithSpecialTokens(customSpecialTokens),
)
The pre-tokenization stage uses a custom state machine that exactly replicates the JavaScript regex pattern. This ensures 100% compatibility, including edge cases like negative lookahead for whitespace patterns.
The state machine matches patterns in this order:
- Contractions: (?i:'s|'t|'re|'ve|'m|'ll|'d)
- Words with prefix: [^\r\n\p{L}\p{N}]?\p{L}+
- Numbers: \p{N}{1,3}
- Punctuation: an optional leading space, then [^\s\p{L}\p{N}]+[\r\n]*
- Newlines: \s*[\r\n]+
- Whitespace: \s+(?!\S)
The tokenizer is optimized for production use:
- Object pooling reduces allocations by 36%
- BPE results are cached for repeated tokens
- State machines and token buffers are reused
- Thread-safe design allows concurrent usage
The package uses sync.Pool for efficient memory management:
- State machines are pooled and reused
- Token buffers are pooled (up to 1024 capacity)
- BPE merge operations use a priority queue
Pool Usage Patterns:
- State Machine Pooling (stateMachinePool)
- Reuses StateMachine instances across tokenization calls
- Reduces allocations for the input rune slice and token slice
- Pool never limits the number of state machines
- State machines are reset before reuse
- Token Buffer Pooling (tokenBufPool)
- Reuses []string slices for collecting tokens
- Initial capacity: 64 tokens
- Maximum pooled capacity: 1024 tokens
- Buffers exceeding the maximum are not returned to the pool
Memory Lifecycle:
1. Allocation: First call creates a new instance; subsequent calls may reuse one
2. Usage: Instance is used for one tokenization operation
3. Return to Pool: References cleared, slices reset, large buffers discarded
4. Garbage Collection: The Go runtime may clear pools during GC
Performance: Benchmarks show 36% memory reduction with pooling
The package defines custom error types for better error handling:
- DataError: Issues with loading or processing tokenizer data
- TokenError: Issues with specific tokens or token IDs
- ConfigError: Issues with tokenizer configuration
All errors implement the error interface and support error wrapping.
The tokenizer is safe for concurrent use. Multiple goroutines can encode and decode text simultaneously without issues. The internal cache uses read-write mutexes for efficient concurrent access.
- Variables
- func NewConfigError(field string, value any, err error) error
- func NewDataError(op, path string, err error) error
- func NewTokenError(op, token string, err error) error
- func NewTokenIDError(op string, tokenID int, err error) error
- type BPE
- type Cache
- type ConfigError
- type DataError
- type Decoder
- type DecoderFunc
- type EncodeOptions
- type Encoder
- type EncoderFunc
- type Option
- type PreTokenizer
- type Scanner
- type ScannerOption
- type TokenError
- type Tokenizer
- func New(opts ...Option) (*Tokenizer, error)
- func (t *Tokenizer) AppendTokens(dst []int, text string, opts *EncodeOptions) []int
- func (t *Tokenizer) Decode(tokenIDs []int) string
- func (t *Tokenizer) DecodeBytes(tokenIDs []int) []byte
- func (t *Tokenizer) Encode(text string, opts *EncodeOptions) []int
- func (t *Tokenizer) EncodeBPE(pretoken string) []int
- func (t *Tokenizer) EncodeBytes(data []byte, opts *EncodeOptions) []int
- func (t *Tokenizer) GetSpecialTokenID(token string) (int, error)
- func (t *Tokenizer) NewScanner(r io.Reader, opts ...ScannerOption) Scanner
- func (t *Tokenizer) OptimisticCount(text string) int
- func (t *Tokenizer) PreTokenize(text string) []string
- func (t *Tokenizer) Process(r io.Reader, w io.Writer) (int64, error)
- func (t *Tokenizer) TokenStream(r io.Reader) (<-chan int, <-chan error)
- func (t *Tokenizer) VocabSize() int
- type VocabularyDataLoader
- type VocabularyDataLoaderFunc
var (
// ErrDataNotFound indicates that the tokenizer data files could not be found.
ErrDataNotFound = errors.New("tokenizer data not found")
// ErrInvalidToken indicates an invalid token was provided.
ErrInvalidToken = errors.New("invalid token")
// ErrTokenNotFound indicates a token was not found in the vocabulary.
ErrTokenNotFound = errors.New("token not found")
// ErrInvalidTokenID indicates an invalid token ID was provided.
ErrInvalidTokenID = errors.New("invalid token ID")
)

Scanner option functions - these are re-exported from the scanner package.
var (
// WithBufferSize sets the internal buffer size for reading.
// Default is 4096 bytes.
WithBufferSize = scanner.WithBufferSize
// WithMaxBuffer sets the maximum buffer size before forcing tokenization.
// This prevents unbounded memory growth for pathological inputs.
// Default is 1MB.
WithMaxBuffer = scanner.WithMaxBuffer
// WithEncodeOptions sets encoding options for the scanner.
WithEncodeOptions = func(opts *EncodeOptions) ScannerOption {
return scanner.WithEncodeOptions(&scanner.EncodeOptions{
BOS: opts.BOS,
EOS: opts.EOS,
})
}
)

func NewConfigError

func NewConfigError(field string, value any, err error) error

NewConfigError creates a new ConfigError.

func NewDataError

func NewDataError(op, path string, err error) error

NewDataError creates a new DataError.

func NewTokenError

func NewTokenError(op, token string, err error) error

NewTokenError creates a new TokenError.

func NewTokenIDError

func NewTokenIDError(op string, tokenID int, err error) error

NewTokenIDError creates a new TokenError with a token ID.
type BPE
BPE is the interface for Byte Pair Encoding processing. BPE merges frequently occurring character pairs to create subword tokens.
type BPE interface {
// EncodeBPE applies byte pair encoding to a pre-tokenized string.
// Returns a slice of token IDs representing the encoded text.
EncodeBPE(pretoken string) []int
}

type Cache
Cache is the interface for caching BPE results. BPE tokenization can be expensive for repeated text patterns, so caching improves performance significantly.
The cache key is typically the pre-tokenized text string, and the value is the slice of token IDs produced by BPE.
Implementations should be thread-safe if the tokenizer will be used concurrently.
type Cache interface {
// Get retrieves a cached BPE result.
// Returns the token IDs and true if found, or nil and false if not cached.
Get(key string) ([]int, bool)
// Put stores a BPE result in the cache.
// The implementation may evict old entries based on its eviction policy.
Put(key string, value []int)
}

type ConfigError
ConfigError represents an error in tokenizer configuration.
type ConfigError struct {
Field string // Configuration field that has an error
Value any // The invalid value
Err error // Underlying error
}

func (*ConfigError) Error

func (e *ConfigError) Error() string

func (*ConfigError) Unwrap

func (e *ConfigError) Unwrap() error

type DataError
DataError represents an error related to tokenizer data loading or processing.
type DataError struct {
Op string // Operation that failed
Path string // File path if applicable
Err error // Underlying error
}

func (*DataError) Error

func (e *DataError) Error() string

func (*DataError) Unwrap

func (e *DataError) Unwrap() error

type Decoder
Decoder is the interface for decoding tokens to text. This interface is useful for testing and creating mock implementations.
type Decoder interface {
// Decode converts a sequence of token IDs back to text.
Decode(tokens []int) string
}

type DecoderFunc
DecoderFunc is an adapter to allow ordinary functions to be used as Decoders. This is useful for creating mock decoders in tests.
type DecoderFunc func(tokens []int) string

func (DecoderFunc) Decode

func (f DecoderFunc) Decode(tokens []int) string

Decode calls f(tokens).
type EncodeOptions
EncodeOptions controls the encoding behavior.
type EncodeOptions struct {
// BOS adds the beginning-of-text token if true (default: true)
BOS bool
// EOS adds the end-of-text token if true (default: true)
EOS bool
}

type Encoder
Encoder is the interface for encoding text to tokens. This interface is useful for testing and creating mock implementations.
type Encoder interface {
// Encode converts text to a sequence of token IDs.
Encode(text string, opts *EncodeOptions) []int
}

type EncoderFunc
EncoderFunc is an adapter to allow ordinary functions to be used as Encoders. This is useful for creating mock encoders in tests.
type EncoderFunc func(text string, opts *EncodeOptions) []int

func (EncoderFunc) Encode

func (f EncoderFunc) Encode(text string, opts *EncodeOptions) []int

Encode calls f(text, opts).
type Option
Option is a functional option for configuring a Tokenizer.
type Option func(*config) error

func WithCacheSize

func WithCacheSize(size int) Option

WithCacheSize sets the maximum size of the BPE cache. Set to 0 to disable caching. Default is unlimited.

func WithDataFiles

func WithDataFiles(vocabPath, mergesPath string) Option

WithDataFiles loads vocabulary and merges from files instead of embedded data. The vocabulary file should contain base64-encoded vocabulary data. The merges file should contain base64-encoded binary merge data.

func WithDataLoader

func WithDataLoader(loader VocabularyDataLoader) Option

WithDataLoader sets a custom data loader for the tokenizer. This allows loading vocabulary and merges from custom sources.

func WithSpecialTokens

func WithSpecialTokens(tokens []string) Option

WithSpecialTokens sets custom special tokens for the tokenizer. If nil, the default Llama 3 special tokens will be used.
type PreTokenizer
PreTokenizer is the interface for pre-tokenization. Pre-tokenization splits text into words, numbers, punctuation, etc. before the BPE algorithm is applied.
type PreTokenizer interface {
// PreTokenize splits text into pre-tokens according to the tokenizer's rules.
// Returns a slice of pre-token strings ready for BPE processing.
PreTokenize(text string) []string
}

type Scanner
Scanner provides streaming tokenization following the bufio.Scanner pattern. It reads text incrementally and produces tokens one at a time.
type Scanner interface {
// Scan advances to the next token. Returns false at EOF or on error.
Scan() bool
// Token returns the most recent token ID produced by Scan.
// Valid only after a successful call to Scan.
Token() int
// Text returns the text that produced the current token.
// Valid only after a successful call to Scan.
Text() string
// Err returns the first error encountered during scanning.
Err() error
}

type ScannerOption

ScannerOption configures scanner behavior.

type ScannerOption = scanner.Option

type TokenError
TokenError represents an error related to token operations.
type TokenError struct {
Token string // The token that caused the error
TokenID int // The token ID if applicable
Op string // Operation that failed
Err error // Underlying error
}

func (*TokenError) Error

func (e *TokenError) Error() string

func (*TokenError) Unwrap

func (e *TokenError) Unwrap() error

type Tokenizer
Tokenizer implements the Llama 3 BPE tokenizer.
type Tokenizer struct {
// contains filtered or unexported fields
}

func New

func New(opts ...Option) (*Tokenizer, error)

New creates a new Llama 3 tokenizer with the given options. If no options are provided, the default Llama 3 vocabulary and settings are used.
Example:
tokenizer, err := llama3.New()
if err != nil {
return err
}
// With custom vocabulary:
tokenizer, err := llama3.New(
llama3.WithVocabulary(customVocab),
llama3.WithMerges(customMerges),
)
// With cache size limit:
tokenizer, err := llama3.New(
llama3.WithCacheSize(1000),
)
func (*Tokenizer) AppendTokens
func (t *Tokenizer) AppendTokens(dst []int, text string, opts *EncodeOptions) []int

AppendTokens appends tokens to dst, avoiding allocations when possible. dst can be nil, in which case a new slice is allocated. The resulting slice is returned and may have a different backing array than dst.
func (*Tokenizer) Decode
func (t *Tokenizer) Decode(tokenIDs []int) string

Decode converts a sequence of token IDs back into text.
Example
package main
import (
"fmt"
"log"
"github.com/agentstation/tokenizer/llama3"
)
func main() {
tokenizer, err := llama3.New()
if err != nil {
log.Fatal(err)
}
// Decode token IDs back to text
tokens := []int{9906, 1917, 0}
text := tokenizer.Decode(tokens)
fmt.Printf("Decoded text: %s\n", text)
// Output would be: Hello world!
}

func (*Tokenizer) DecodeBytes

func (t *Tokenizer) DecodeBytes(tokenIDs []int) []byte

DecodeBytes converts a sequence of token IDs back to UTF-8 bytes. This avoids string allocation and is useful for performance-critical paths.
func (*Tokenizer) Encode
func (t *Tokenizer) Encode(text string, opts *EncodeOptions) []int

Encode converts text into a sequence of token IDs. If opts is nil, default options will be used.
Example
package main
import (
"fmt"
"log"
"github.com/agentstation/tokenizer/llama3"
)
func main() {
// Create a tokenizer
tokenizer, err := llama3.New()
if err != nil {
log.Fatal(err)
}
// Encode some text
text := "Hello, world!"
tokens := tokenizer.Encode(text, nil)
fmt.Printf("Text: %s\n", text)
fmt.Printf("Token count: %d\n", len(tokens))
// Note: actual output depends on having the Llama 3 data files
}

Example (Without Special Tokens)
package main
import (
"fmt"
"log"
"github.com/agentstation/tokenizer/llama3"
)
func main() {
tokenizer, err := llama3.New()
if err != nil {
log.Fatal(err)
}
// Encode without special tokens
opts := &llama3.EncodeOptions{
BOS: false,
EOS: false,
}
text := "Hello, world!"
tokens := tokenizer.Encode(text, opts)
fmt.Printf("Tokens without BOS/EOS: %d\n", len(tokens))
}

func (*Tokenizer) EncodeBPE

func (t *Tokenizer) EncodeBPE(pretoken string) []int

EncodeBPE implements the BPE interface.
func (*Tokenizer) EncodeBytes
func (t *Tokenizer) EncodeBytes(data []byte, opts *EncodeOptions) []int

EncodeBytes converts bytes into a sequence of token IDs. This avoids string conversion overhead for binary data.
func (*Tokenizer) GetSpecialTokenID
func (t *Tokenizer) GetSpecialTokenID(token string) (int, error)

GetSpecialTokenID returns the token ID for a special token string.
Example
package main
import (
"fmt"
"log"
"github.com/agentstation/tokenizer/llama3"
)
func main() {
tokenizer, err := llama3.New()
if err != nil {
log.Fatal(err)
}
// Get the ID of a special token
tokenID, err := tokenizer.GetSpecialTokenID("<|begin_of_text|>")
if err != nil {
log.Fatal(err)
}
fmt.Printf("Begin-of-text token ID: %d\n", tokenID)
// Output would be: 128000
}

func (*Tokenizer) NewScanner

func (t *Tokenizer) NewScanner(r io.Reader, opts ...ScannerOption) Scanner

NewScanner creates a scanner for streaming tokenization. The scanner processes input with bounded memory usage, making it suitable for large files or continuous streams.
func (*Tokenizer) OptimisticCount
func (t *Tokenizer) OptimisticCount(text string) int

OptimisticCount returns the token count assuming anything that looks like a special token is actually a special token. This is useful for fine-tuned models with modified special tokens.
func (*Tokenizer) PreTokenize
func (t *Tokenizer) PreTokenize(text string) []string

PreTokenize implements the PreTokenizer interface.
func (*Tokenizer) Process
func (t *Tokenizer) Process(r io.Reader, w io.Writer) (int64, error)

Process handles large files with controlled memory usage. It reads from r, tokenizes the content, and writes token IDs to w. It returns the number of tokens written and any error encountered.
func (*Tokenizer) TokenStream
func (t *Tokenizer) TokenStream(r io.Reader) (<-chan int, <-chan error)

TokenStream provides channel-based streaming for concurrent processing. The tokens channel is closed when scanning completes. Any error is sent on the error channel.
func (*Tokenizer) VocabSize
func (t *Tokenizer) VocabSize() int

VocabSize returns the size of the vocabulary including special tokens.
type VocabularyDataLoader
VocabularyDataLoader is the interface for loading tokenizer vocabulary data. This includes vocabulary and merge rules needed for tokenization.
Implementations can load data from embedded resources, files, or custom sources. The tokenizer will call LoadVocabulary first, then LoadMerges.
type VocabularyDataLoader interface {
// LoadVocabulary loads and returns the vocabulary tokens.
// The returned slice contains tokens indexed by their token ID.
LoadVocabulary() ([]string, error)
// LoadMerges loads and returns the BPE merge rules.
// The returned map uses merge identifiers as keys and priorities as values.
LoadMerges() (map[string]int, error)
}

type VocabularyDataLoaderFunc

VocabularyDataLoaderFunc is an adapter to allow using functions as VocabularyDataLoaders. This is useful for testing or custom data loading logic.
type VocabularyDataLoaderFunc struct {
VocabFunc func() ([]string, error)
MergesFunc func() (map[string]int, error)
}

func (VocabularyDataLoaderFunc) LoadMerges

func (d VocabularyDataLoaderFunc) LoadMerges() (map[string]int, error)

LoadMerges calls the MergesFunc.
func (VocabularyDataLoaderFunc) LoadVocabulary
func (d VocabularyDataLoaderFunc) LoadVocabulary() ([]string, error)

LoadVocabulary calls the VocabFunc.
Generated by gomarkdoc