27 Apr 14:38

7f1623b

Release v0.23.1 Latest

TL;DR

tokenizers 0.23.1 is the first proper stable release in the 0.23 line. 0.23.0 only ever shipped as rc0 because the release pipeline itself was broken (the Node side hadn't shipped multi-platform binaries since 2023, and the Python side was on pyo3 0.27 without free-threaded support). 0.23.1 is the version where everything actually goes out the door together: full Node multi-platform binaries for the first time in years, Python 3.14 (regular and free-threaded 3.14t), full type hints for every Python class, and a stack of measurable perf wins on the BPE / added-vocab hot paths.

There is no functional 0.23.0 published — we tag 0.23.1 directly so users don't accidentally pull a never-shipped version.

🚨 Breaking changes

Drop Python 3.9 (#1952) — requires-python = ">=3.10"; 3.9 users stay on 0.22.x.
add_tokens normalizes content at insertion (#1995) — re-saved tokenizer.json may differ in the added_tokens block. Existing files load unchanged.
Type stubs are precise (#1928, #1997): methods that returned Any now return real types; mypy --strict may surface previously-hidden errors. The stub layout also moved from tokenizers/<sub>/__init__.pyi to tokenizers/<sub>.pyi. This breaks the surface of some of the processors, such as RobertaProcessing's __init__.
3.14t-only: setters/getters return PyResult<T> because of Arc<RwLock<Tokenizer>>; a poisoned lock surfaces as PyException instead of a panic.

⚡ Performance — measured locally on this Mac, not lifted from PRs

Run with cargo bench --bench <name> -- --save-baseline v0_22_2 on v0.22.2, then --baseline v0_22_2 on v0.23.1. Numbers are point-in-time wall clock on a single laptop; relative deltas are what matter, and absolute numbers will differ on CI hardware.

Added-vocabulary deserialize — the headline win (#1995, #1999)

The added_vocab_deserialize bench, reworked in "bench: improve added_vocab_deserialize to reflect real-world workloads" (#2000), is now representative of how transformers actually loads tokenizer.json files. The combined effect of daachorse for the matching automaton plus the normalize-on-insert refactor is enormous on this workload:

benchmark	v0.22.2	v0.23.1	change
100k tokens, special, no norm	~410 ms	248 ms	−40%
100k tokens, non-special, no norm	~7.1 s	273 ms	−96%
100k tokens, special, NFKC	~395 ms	235 ms	−40%
100k tokens, non-special, NFKC	~7.4 s	290 ms	−96%
400k tokens, special, no norm	~15 s	980 ms	−94%
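
For readers who want to recompute the "change" column from their own runs, here is a minimal sketch of the arithmetic. The helper name relative_change is hypothetical, and the timings are copied from the table above; because the v0.22.2 baselines are approximate ("~"), recomputed percentages can differ from the table by about a point.

```python
def relative_change(before_ms: float, after_ms: float) -> float:
    """Percentage change from before_ms to after_ms (negative means faster)."""
    return (after_ms - before_ms) / before_ms * 100.0


# Timings copied from the table above, converted to milliseconds.
rows = {
    "100k tokens, special, no norm": (410.0, 248.0),
    "100k tokens, non-special, no norm": (7100.0, 273.0),
    "100k tokens, non-special, NFKC": (7400.0, 290.0),
}

for name, (before, after) in rows.items():
    speedup = before / after
    print(f"{name}: {relative_change(before, after):+.1f}% ({speedup:.1f}x)")
```

The speedup factor is often easier to read than the percentage once the change gets close to −100%.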

Real-world impact: loading a Llama-3-style tokenizer with a large set of added tokens dropped from "noticeable pause" to "instant".

BPE encode

benchmark	v0.22.2	v0.23.1	change
`BPE GPT2 encode batch, no cache`	530 ms	446 ms	−16%
`BPE GPT2 encode batch` (cached)	690 ms	685 ms	noise
`BPE GPT2 encode` (single)	1.95 s	1.94 s	noise
`BPE Train (small)`	32.6 ms	31.5 ms	−3%
`BPE Train (big)`	1.01 s	988 ms	−2%

The BPE per-thread cache PR (#2028) shows much larger wins on highly-parallel workloads (+47–62% at 88+ threads on a server box, per the PR's own measurements on Vera). Single-thread batch numbers above are flat or slightly improved because cache-hit overhead was already low without contention.
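
To illustrate why a per-thread cache helps mostly under contention, here is a generic Python sketch of the idea. It is not the Rust code from #2028: expensive_split and cached_split are hypothetical names, and the "merge loop" is replaced by a trivial stand-in. Each thread fills its own private dictionary, so no lock is shared between threads on a cache hit.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # each thread sees its own attributes on this object


def expensive_split(word: str) -> tuple[str, ...]:
    """Stand-in for the costly merge loop; here it just splits into characters."""
    return tuple(word)


def cached_split(word: str) -> tuple[str, ...]:
    """Look up word in the calling thread's private cache, filling it on a miss."""
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    result = cache.get(word)
    if result is None:
        result = cache[word] = expensive_split(word)
    return result


words = ["hello", "world", "hello", "tokenizers"] * 1000

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(cached_split, words))

print(len(results), results[0])
```

The trade-off is memory: every thread keeps its own copy of hot entries, which is why the benefit shows up at high thread counts rather than in single-thread runs.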

Llama-3 encode

benchmark	v0.22.2	v0.23.1	change
`llama3-encode` (single)	2.10 s	2.02 s	−4%
`llama3-batch`	438 ms	408 ms	−7%
`llama3-offsets`	410 ms	395 ms	−4%

Truncation early exit (#1990)

Right-direction truncation no longer pre-tokenizes past max_length. The new truncation_benchmark doesn't exist on v0.22.2, so there's no apples-to-apples comparison here, but the PR's own measurements on the same machine showed −20–28% across a range of max_length values for right-truncation; left-truncation is unchanged.
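
The early-exit idea can be sketched in a few lines of Python. This is a toy model, assuming whitespace pre-tokenization and one token per piece; the function names are hypothetical and the real change lives in the Rust core. The counter shows how much pre-tokenization work each variant performs.

```python
import re
from itertools import islice
from typing import Iterator

WORD = re.compile(r"\S+")


def pre_tokenize(text: str, seen: list[int]) -> Iterator[str]:
    """Lazily yield whitespace-delimited pieces, counting how many were produced."""
    for match in WORD.finditer(text):
        seen[0] += 1
        yield match.group()


def truncate_right_eager(text: str, max_length: int, seen: list[int]) -> list[str]:
    """Pre-tokenize everything, then cut: work grows with the input length."""
    return list(pre_tokenize(text, seen))[:max_length]


def truncate_right_lazy(text: str, max_length: int, seen: list[int]) -> list[str]:
    """Stop pulling pieces once max_length is reached: work grows with max_length."""
    return list(islice(pre_tokenize(text, seen), max_length))


text = "the quick brown fox " * 10_000
eager_seen, lazy_seen = [0], [0]
eager = truncate_right_eager(text, 8, eager_seen)
lazy = truncate_right_lazy(text, 8, lazy_seen)
print(eager == lazy, eager_seen[0], lazy_seen[0])  # True 40000 8
```

Left truncation cannot use the same shortcut in this toy model, because the pieces to keep are at the end of the input, which matches the note that left-truncation is unchanged.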

Other perf improvements (no direct comparable bench)

BPE::Builder::build no longer formats strings in a hot loop (#2010) — ~45% faster Tokenizer::from_file on Llama-3 in the PR's profile.
BPE per-thread cache (#2028) — see Vera numbers in PR description for parallel scale-out.

🔄 Serialization / deserialization

Loading tokenizer.json is backward-compatible: existing files load on 0.23 unchanged. Two things to know if you re-save:

added_tokens entries created via add_tokens(..., normalized=True) have their content normalized at insertion, so the re-saved added_tokens block may differ (see the breaking-change note above).
tokenizer.train(...) no longer keeps a redundant added_tokens/special_tokens Vec separate from the added_tokens_map_r. Public API surface unchanged; only the internal struct shape moved.
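
If you want to see exactly what changed after a re-save, comparing the added_tokens blocks of the two files is usually enough. The sketch below uses only the standard library; added_token_changes is a hypothetical helper, and the two JSON documents are made-up, trimmed examples that assume an NFKC normalizer turned a full-width token into its ASCII form.

```python
import json


def added_token_changes(old_json: str, new_json: str) -> list[tuple[int, str, str]]:
    """Return (id, old_content, new_content) for added tokens whose content differs."""
    old = {t["id"]: t["content"] for t in json.loads(old_json).get("added_tokens", [])}
    new = {t["id"]: t["content"] for t in json.loads(new_json).get("added_tokens", [])}
    return [
        (token_id, old[token_id], new[token_id])
        for token_id in sorted(old.keys() & new.keys())
        if old[token_id] != new[token_id]
    ]


# Hypothetical before/after documents, trimmed to the one block that matters here.
before = json.dumps({"added_tokens": [
    {"id": 0, "content": "<pad>", "normalized": False, "special": True},
    {"id": 1, "content": "ＡＰＩ", "normalized": True, "special": False},
]})
after = json.dumps({"added_tokens": [
    {"id": 0, "content": "<pad>", "normalized": False, "special": True},
    {"id": 1, "content": "API", "normalized": True, "special": False},
]})

print(added_token_changes(before, after))
```

In practice you would read the two strings from the old and the re-saved tokenizer.json files instead of building them inline.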

bench: improve added_vocab_deserialize to reflect real-world workloads (#2000) lands a more realistic micro-benchmark for this surface; if you're tracking deserialize perf in your own CI, the new bench is the one to compare against.

🐍 Python: free-threaded 3.14t support

Dedicated wheels for python3.14t (the free-threaded build introduced in PEP 703). The wheel:

Declares Py_MOD_GIL_NOT_USED, so importing tokenizers does not force the GIL back on.
Builds without the abi3 cargo feature (free-threaded Python doesn't expose the limited API).
Goes through Arc<RwLock<Tokenizer>> for the inner state so concurrent setters and encoders don't race PyO3's per-pyclass borrow check.
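
A quick way to check what your interpreter is doing is sketched below. gil_report is a hypothetical helper; it relies on sysconfig's Py_GIL_DISABLED build variable and on sys._is_gil_enabled(), which only exists on Python 3.13 and later. The tokenizers import is optional here so the snippet also runs where the package is not installed.

```python
import sys
import sysconfig


def gil_report() -> dict[str, bool]:
    """Describe whether this is a free-threaded build and whether the GIL is on."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() only exists on Python 3.13+; older versions always have a GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if probe is None else bool(probe())
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


try:
    import tokenizers  # noqa: F401  (the interesting question is the state after import)
except ImportError:
    pass

print(gil_report())
```

On a python3.14t interpreter with the dedicated wheel installed, the expectation from these notes is a free-threaded build with the GIL still disabled after the import.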

A new stress-test module tests/test_freethreaded.py exercises N-encoder × M-setter races on a single Tokenizer and asserts no RuntimeError: Already borrowed, no RwLock poisoning, and that sys._is_gil_enabled() is False post-import.
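
The general shape of such a race test can be sketched as follows. This is not the content of tests/test_freethreaded.py: FakeTokenizer is a stand-in class with a plain lock, and stress is a hypothetical harness that starts N encoder threads and M setter threads together and collects any exception they raise.

```python
import threading
from typing import Callable


class FakeTokenizer:
    """Stand-in for Tokenizer: one piece of mutable state behind a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._max_length = 8

    def set_max_length(self, value: int) -> None:
        with self._lock:
            self._max_length = value

    def encode(self, text: str) -> list[str]:
        with self._lock:
            limit = self._max_length
        return text.split()[:limit]


def stress(tok: FakeTokenizer, encoders: int, setters: int, rounds: int) -> list[BaseException]:
    """Run encoder and setter threads against one object and collect any exception."""
    errors: list[BaseException] = []
    start = threading.Barrier(encoders + setters)  # release all workers at once

    def run(work: Callable[[int], None]) -> None:
        start.wait()
        try:
            for i in range(rounds):
                work(i)
        except BaseException as exc:  # a real test would fail on any entry here
            errors.append(exc)

    def encode_work(i: int) -> None:
        assert len(tok.encode("a b c d e f g h i j")) <= 10

    def setter_work(i: int) -> None:
        tok.set_max_length(1 + i % 10)

    jobs = [encode_work] * encoders + [setter_work] * setters
    threads = [threading.Thread(target=run, args=(job,)) for job in jobs]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


errors = stress(FakeTokenizer(), encoders=8, setters=4, rounds=500)
print("errors:", errors)
```

The barrier matters: without it, short workloads tend to run one after another and the test never exercises real overlap between setters and encoders.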

For the regular CPython wheel everything is unchanged.

📦 Node.js bindings: first proper multi-platform release since 2023

The npm package now ships 13 platforms (macOS x64/arm64/universal, Windows x64/i686/arm64, Linux x64/arm64/armv7 in both glibc and musl, Android arm64/armv7) — previous workflows only built 3 of those, leaving Apple Silicon / Linux ARM / Alpine users with package-not-found errors since 2023 (#1365, #1703, #1922). Fixed via #1970 + #2034, which also bumps @napi-rs/cli to v3 and switches cross-builds to cargo-zigbuild.

🧷 Type hints & typing for all classes (#1928, #1997)

Every class in the Python bindings now ships proper .pyi stubs: Tokenizer, AddedToken, Encoding, and every decoder / model / normalizer / pre-tokenizer / processor / trainer. Editors and type checkers (mypy, pyright, ty) see real signatures with types and docstrings instead of falling back to Any.

The stubs are generated automatically from the compiled extension via tools/stub-gen (Rust binary using pyo3-introspection). Re-running make style regenerates them; CI guards against regenerated-vs-checked-in drift. If the generator ever returns 0 docstrings (e.g. because the [patch.crates-io] pin in .cargo/config.toml falls out of sync with the pyo3 dep version), it now hard-aborts with a precise diagnostic instead of silently emitting bare-bones stubs.
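
A drift guard of this kind boils down to "regenerate into a scratch directory, then diff against what is checked in". The sketch below shows only the diff half, using the standard library; stub_drift, the directory names, and the Example class in the demo stubs are all hypothetical, and the project's actual CI step may work differently.

```python
import difflib
import tempfile
from pathlib import Path


def stub_drift(checked_in: Path, regenerated: Path) -> list[str]:
    """Return unified-diff lines for every .pyi file that differs between two trees."""
    names = {p.relative_to(checked_in) for p in checked_in.rglob("*.pyi")}
    names |= {p.relative_to(regenerated) for p in regenerated.rglob("*.pyi")}
    drift: list[str] = []
    for name in sorted(names):
        old, new = checked_in / name, regenerated / name
        old_lines = old.read_text().splitlines() if old.exists() else []
        new_lines = new.read_text().splitlines() if new.exists() else []
        drift.extend(
            difflib.unified_diff(
                old_lines, new_lines, f"checked-in/{name}", f"regenerated/{name}", lineterm=""
            )
        )
    return drift


# Tiny demonstration with throwaway directories standing in for the two stub trees.
with tempfile.TemporaryDirectory() as tmp:
    checked_in_dir = Path(tmp, "checked_in")
    regenerated_dir = Path(tmp, "regenerated")
    checked_in_dir.mkdir()
    regenerated_dir.mkdir()
    (checked_in_dir / "models.pyi").write_text("class Example: ...\n")
    (regenerated_dir / "models.pyi").write_text(
        "class Example:\n    def save(self, folder: str) -> list[str]: ...\n"
    )
    report = stub_drift(checked_in_dir, regenerated_dir)

print("\n".join(report) or "stubs are in sync")
```

A CI job would exit non-zero whenever the returned list is non-empty.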

>>> from tokenizers import Tokenizer
>>> # IDEs now resolve every method, every kwarg, every return type
>>> Tokenizer.from_pretrained("bert-base-cased")

⚠️ As called out in breaking changes: stricter type info means previously-hidden type errors in user code may now surface under mypy --strict.

✨ Other features

Unigram sampling: models.Unigram now exposes alpha and nbest_size for subword regularization (parity with Google's implementation, #1994). Closes long-standing requests #730 and #849.
Weakref support on Tokenizer (#1958) — useful for long-lived caches that don't want to keep tokenizers alive.
CI benchmark regression detection on PRs (#2013) — every PR runs ci_benchmark against the stored baseline and posts a comparison chart to the PR.
Longer-context Llama-3 benchmarks (#1971) for tracking head-room on multi-thousand-token inputs.
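
For readers unfamiliar with subword regularization, here is a sketch of the n-best sampling rule as it is usually described for Unigram models: each of the n best segmentations is drawn with probability proportional to P(x) ** alpha. How models.Unigram maps alpha and nbest_size onto its internals is not spelled out in these notes, so treat this as an illustration of the concept only; the function names and the candidate scores are made up.

```python
import math
import random


def nbest_sampling_probs(log_probs: list[float], alpha: float) -> list[float]:
    """Turn n-best log-probabilities into weights proportional to P(x) ** alpha."""
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_segmentation(
    candidates: list[tuple[list[str], float]], alpha: float, rng: random.Random
) -> list[str]:
    """Pick one (pieces, log_prob) candidate according to the smoothed distribution."""
    probs = nbest_sampling_probs([lp for _, lp in candidates], alpha)
    return rng.choices([pieces for pieces, _ in candidates], weights=probs, k=1)[0]


# Hypothetical 3-best list for the word "tokenizers"; the scores are made up.
nbest = [
    (["token", "izers"], -4.0),
    (["token", "izer", "s"], -5.0),
    (["tok", "en", "izers"], -7.0),
]

print(nbest_sampling_probs([lp for _, lp in nbest], alpha=0.0))
print(sample_segmentation(nbest, alpha=0.5, rng=random.Random(0)))
```

With alpha at 0 every candidate is equally likely, and as alpha grows the sampler concentrates on the best segmentation, which is the knob that controls how much noise training sees.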

🛠 Other fixes

EncodingVisualizer: unclosed annotation span fixed (#1911), HTML escape applied to output (#1937).
DecodeStream: __copy__ / __deepcopy__ (#1930).
Pre-tokenize: removed an unnecessary to_vec() from slice (#1964).
Replace wget / norvig URL with HF Hub downloads in test data fetch (#2018).
uv support in the Python Makefile (#1977).
Several security-pin bumps on workflow SHAs (#2004, #2005, #2006, #2016, #2017).

👥 Contributors

Thanks to everyone who shipped commits between v0.22.2 and v0.23.1:

@ArthurZucker, @finnagin, @gordonmessmer, @jberg5, @kennethsible, @llukito, @MayCXC, @McPatate, @michaelfeil, @mrkm4ntr, @musicinmybrain, @ngoldbaum, @OhashiReon, @paulinebm, @podarok, @rtrompier, @sebpop, @Shivam-Bhardwaj, @threexc, @wheynelau, @xanderlent — plus @dependabot and @hf-security-analysis for keeping pins fresh.

Full Changelog: v0.22.2...v0.23.1

02 Dec 13:01

ArthurZucker

v0.22.2

f383101

Release v0.22.2

What's Changed

Mostly doing this release for these PRs:

Update deserialize of added tokens by @ArthurZucker in #1891
update stub for typing by @ArthurZucker in #1896
bump PyO3 to 0.26 by @davidhewitt in #1901

Basically: good typing (with at least ty), much faster loading (4x to 8x) of vocabularies with a lot of added tokens, and GIL-free!?

ci: add support for building Win-ARM64 wheels by @MugundanMCW in #1869
Add cargo-semver-checks to Rust CI workflow by @haixuanTao in #1875
Update indicatif dependency by @gordonmessmer in #1867
Bump node-forge from 1.3.1 to 1.3.2 in /tokenizers/examples/unstable_wasm/www by @dependabot[bot] in #1889
Bump js-yaml from 3.14.1 to 3.14.2 in /bindings/node by @dependabot[bot] in #1892
fix: used normalize_str in BaseTokenizer.normalize by @ishitab02 in #1884
[MINOR:TYPO] Update mod.rs by @cakiki in #1883
Remove runtime stderr warning from Python bindings by @Copilot in #1898
Mark immutable pyclasses as frozen by @ngoldbaum in #1861
DOCS: add add_prefix_space to processors.ByteLevel by @CloseChoice in #1878
Bump express from 4.21.2 to 4.22.1 in /tokenizers/examples/unstable_wasm/www by @dependabot[bot] in #1903

New Contributors

@MugundanMCW made their first contribution in #1869
@haixuanTao made their first contribution in #1875
@gordonmessmer made their first contribution in #1867
@ishitab02 made their first contribution in #1884
@Copilot made their first contribution in #1898
@ngoldbaum made their first contribution in #1861
@CloseChoice made their first contribution in #1878

Full Changelog: v0.22.1...v0.22.2

Contributors

davidhewitt, ngoldbaum, and 8 other contributors

19 Sep 09:52

ArthurZucker

v0.22.1

afaae08

v0.22.1

Release v0.22.1

Main changes:

Bump huggingface_hub upper version (#1866) from @Wauplin
chore(trainer): add and improve trainer signature (#1838) from @shenxiangzhuang
Some doc updates: c91d76a, 7b02178, 57eb8d7

Contributors

Wauplin and shenxiangzhuang

29 Aug 10:25

ArthurZucker

v0.22.0

4630f94

v0.22.0

What's Changed

Bump on-headers and compression in /tokenizers/examples/unstable_wasm/www by @dependabot[bot] in #1827
Implement from_bytes and read_bytes Methods in WordPiece Tokenizer for WebAssembly Compatibility by @sondalex in #1758
fix: use AHashMap to fix compile error by @b00f in #1840
New stream by @ArthurZucker in #1856
[docs] Add more decoders by @pcuenca in #1849
Fix missing parenthesis in EncodingVisualizer.calculate_label_colors by @Liam-DeVoe in #1853
Update quicktour.mdx re: Issue #1625 by @WilliamPLaCroix in #1846
remove stray comment by @sanderland in #1831
Fix typo in README by @aisk in #1808
RUSTSEC-2024-0436 - replace paste with pastey by @nystromjd in #1834
Tokenizer: Add native async bindings, via pyo3-async-runtimes. by @michaelfeil in #1843

New Contributors

@b00f made their first contribution in #1840
@pcuenca made their first contribution in #1849
@Liam-DeVoe made their first contribution in #1853
@WilliamPLaCroix made their first contribution in #1846
@sanderland made their first contribution in #1831
@aisk made their first contribution in #1808
@nystromjd made their first contribution in #1834
@michaelfeil made their first contribution in #1843

Full Changelog: v0.21.3...v0.22.0rc0

Contributors

aisk, pcuenca, and 9 other contributors

28 Jul 13:18

Narsil

v0.21.4

e892882

v0.21.4

Full Changelog: v0.21.3...v0.21.4

No changes: the 0.21.3 release failed, so this is just a re-release.

https://github.com/huggingface/tokenizers/releases/tag/v0.21.3

04 Jul 11:58

Narsil

v0.21.3

dd4fc3d

v0.21.3

What's Changed

Clippy fixes. by @Narsil in #1818
Fixed a backward-incompatible change that had been introduced in our Rust APIs.

Full Changelog: v0.21.2...v0.21.3

Contributors

Narsil

24 Jun 10:26

ArthurZucker

v0.21.2

df1e36f

v0.21.2

What's Changed

This release is focused on performance optimizations, enabling broader Python no-GIL support, and fixing some onig issues!

Update the release builds following 0.21.1. by @Narsil in #1746
replace lazy_static with stabilized std::sync::LazyLock in 1.80 by @sftse in #1739
Fix no-onig no-wasm builds by @414owen in #1772
Fix typos in strings and comments by @co63oc in #1770
Fix type notation of merges in BPE Python binding by @Coqueue in #1766
Bump http-proxy-middleware from 2.0.6 to 2.0.9 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1762
Fix data path in test_continuing_prefix_trainer_mismatch by @GaetanLepage in #1747
clippy by @ArthurZucker in #1781
Update pyo3 and rust-numpy depends for no-gil/free-threading compat by @Qubitium in #1774
Use ApiBuilder::from_env() in from_pretrained function by @BenLocal in #1737
Upgrade onig, to get it compiling with GCC 15 by @414owen in #1771
Itertools upgrade by @sftse in #1756
Bump webpack-dev-server from 4.10.0 to 5.2.1 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1792
Bump brace-expansion from 1.1.11 to 1.1.12 in /bindings/node by @dependabot in #1796
Fix features blending into a paragraph by @bionicles in #1798
Adding throughput to benches to have a more consistent measure across by @Narsil in #1800
Upgrading dependencies. by @Narsil in #1801
[docs] Whitespace by @stevhliu in #1785
Hotfixing the stub. by @Narsil in #1802
Bpe clones by @sftse in #1707
Fixed Length Pre-Tokenizer by @jonvet in #1713
Consolidated optimization ahash dary compact str by @Narsil in #1799
🚨 breaking: Fix training with special tokens by @ArthurZucker in #1617

New Contributors

@414owen made their first contribution in #1772
@co63oc made their first contribution in #1770
@Coqueue made their first contribution in #1766
@GaetanLepage made their first contribution in #1747
@Qubitium made their first contribution in #1774
@BenLocal made their first contribution in #1737
@bionicles made their first contribution in #1798
@stevhliu made their first contribution in #1785
@jonvet made their first contribution in #1713

Full Changelog: v0.21.1...v0.21.2rc0

Contributors

Narsil, Qubitium, and 11 other contributors

13 Mar 10:44

Narsil

v0.21.1

133db48

v0.21.1

What's Changed

Update dev version and pyproject.toml by @ArthurZucker in #1693
Add feature flag hint to README.md, fixes #1633 by @sftse in #1709
Upgrade to PyO3 0.23 by @Narsil in #1708
Fixing the README. by @Narsil in #1714
Fix typo in Split docstrings by @Dylan-Harden3 in #1701
Fix typos by @tinyboxvk in #1715
Update documentation of Rust feature by @sondalex in #1711
Fix panic in DecodeStream::step due to incorrect index usage by @n0gu-furiosa in #1699
Fixing the stream by removing the read_index altogether. by @Narsil in #1716
Fixing NormalizedString append when normalized is empty. by @Narsil in #1717
🚨 Support updating template processors by @ArthurZucker in #1652. Removed in this release to temporarily keep backward compatibility.
Update metadata as Python3.7 and Python3.8 support was dropped by @earlytobed in #1724
Add rustls-tls feature by @torymur in #1732

New Contributors

@Dylan-Harden3 made their first contribution in #1701
@sondalex made their first contribution in #1711
@n0gu-furiosa made their first contribution in #1699
@earlytobed made their first contribution in #1724
@torymur made their first contribution in #1732

Full Changelog: v0.21.0...v0.21.1

Contributors

Narsil, torymur, and 7 other contributors

12 Mar 09:47

Narsil

v0.21.1rc0

4722e21

v0.21.1rc0 Pre-release

What's Changed

Update dev version and pyproject.toml by @ArthurZucker in #1693
Add feature flag hint to README.md, fixes #1633 by @sftse in #1709
Upgrade to PyO3 0.23 by @Narsil in #1708
Fixing the README. by @Narsil in #1714
Fix typo in Split docstrings by @Dylan-Harden3 in #1701
Fix typos by @tinyboxvk in #1715
Update documentation of Rust feature by @sondalex in #1711
Fix panic in DecodeStream::step due to incorrect index usage by @n0gu-furiosa in #1699
Fixing the stream by removing the read_index altogether. by @Narsil in #1716
Fixing NormalizedString append when normalized is empty. by @Narsil in #1717
🚨 Support updating template processors by @ArthurZucker in #1652
Update metadata as Python3.7 and Python3.8 support was dropped by @earlytobed in #1724
Add rustls-tls feature by @torymur in #1732

New Contributors

@Dylan-Harden3 made their first contribution in #1701
@sondalex made their first contribution in #1711
@n0gu-furiosa made their first contribution in #1699
@earlytobed made their first contribution in #1724
@torymur made their first contribution in #1732

Full Changelog: v0.21.0...v0.21.1rc0

Contributors

Narsil, torymur, and 7 other contributors

15 Nov 11:12

Narsil

v0.21.0

cf102e6

Release v0.21.0

Release v0.20.4 v0.21.0

More cache options. by @Narsil in #1675
Disable caching for long strings. by @Narsil in #1676
Testing ABI3 wheels to reduce number of wheels by @Narsil in #1674
Adding an API for decode streaming. by @Narsil in #1677
Decode stream python by @Narsil in #1678
Fix encode_batch and encode_batch_fast to accept ndarrays again by @diliop in #1679

We also no longer support Python 3.7 or 3.8 (similar to transformers), as they are deprecated.

Full Changelog: v0.20.3...v0.21.0

Contributors

Narsil and diliop

Releases: huggingface/tokenizers
