
Commit 5f14d3c

Add definition fetching from web URLs

1 parent 4847c5d commit 5f14d3c

12 files changed: 168 additions & 36 deletions

Cargo.toml: 7 additions & 0 deletions

@@ -57,6 +57,7 @@ all = [
     "regex-unicode",
     "regex-perf",
     "multiversion",
+    "web",
 ]

 # Enables standard library features
@@ -100,6 +101,9 @@ regex-onig = ["dep:onig"]
 # Enables the use of multiversion for generating multiple code paths with different CPU feature utilization
 multiversion = ["dep:multiversion"]

+# Enables fetching definitions from HuggingFace or other URLs
+web = ["std", "dep:reqwest"]
+
 # Enables the use of unstable features
 unstable = []

@@ -157,6 +161,9 @@ serde_json = { version = "1.0", default-features = false, features = ["alloc"],
 # optional dependencies for the multiversion feature
 multiversion = { version = "0.8", default-features = false, optional = true }

+# optional dependencies for the web feature
+reqwest = { version = "0.13", default-features = false, features = ["blocking", "rustls", "system-proxy"], optional = true }
+
 [dev-dependencies]

 kitoken = { path = ".", default-features = false, features = [
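The new `web` feature gates a dependency on reqwest's blocking client. The `src/web.rs` module added by this commit is not shown in the diff, so the following is only a rough sketch of how fetching might look on top of these dependencies; the `fetch_definition_bytes` helper and the `hf:` shorthand expansion are illustrative assumptions, not the crate's actual API.

```rust
// Hypothetical sketch: the commit's real `src/web.rs` is not part of this
// diff, and the `hf:` resolution rule below is an assumed convention.
fn fetch_definition_bytes(source: &str) -> Result<Vec<u8>, reqwest::Error> {
    // Expand the `hf:` shorthand into a Hugging Face download URL and
    // pass plain `http(s):` URLs through unchanged.
    let url = match source.strip_prefix("hf:") {
        Some(repo) => format!("https://huggingface.co/{repo}/resolve/main/tokenizer.json"),
        None => source.to_string(),
    };
    // The `blocking` feature enabled in Cargo.toml above provides this client.
    let response = reqwest::blocking::get(url.as_str())?.error_for_status()?;
    Ok(response.bytes()?.to_vec())
}
```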

README.md: 7 additions & 7 deletions

@@ -11,7 +11,7 @@

 ```rust
 use kitoken::Kitoken;
-let encoder = Kitoken::from_file("models/llama4.kit")?;
+let encoder = Kitoken::from_web("hf:Qwen/Qwen3.5-9B")?;

 let tokens = encoder.encode("Your future belongs to me.", true)?;
 let string = String::from_utf8(encoder.decode(&tokens, true)?)?;
@@ -29,19 +29,19 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
   Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
 - **Supports input and output processing**\
   Including unicode-aware normalization, pre-tokenization and post-processing options.
-- **Compact data format**\
+- **Compact data encoding**\
   Definitions are stored in an efficient binary format and without merge list.

 See also [`kitoken-cli`](./packages/cli) for Kitoken in the command line.

 ## Compatibility

-Kitoken can load and convert many existing tokenizer formats. Every supported format is [tested](./tests) against the original implementation across a variety of inputs to ensure correctness and compatibility.
+Kitoken can load and convert most existing tokenizer formats. Every supported format is [tested](./tests) against the original implementation across a wide variety of inputs to ensure correctness and compatibility.

 > [!NOTE]
 > Most models on [Hugging Face](https://huggingface.co) are supported. Just take the `tokenizer.json` or `spiece.model` and load it into Kitoken.

-Kitoken aims to be output-identical with existing implementations for all models. See the notes below for differences in specific cases.
+Kitoken aims to be output-identical with existing implementations for all models. <sup>See the notes below for differences in specific cases.</sup>

 ### SentencePiece

@@ -59,8 +59,8 @@ If the model does not contain a trainer definition, `Unigram` is assumed as the
 <details>
 <summary>Notes</summary>

-- SentencePiece uses [different `nfkc` normalization rules in the `nmt_nfkc` and `nmt_nfkc_cf` schemes](https://github.com/google/sentencepiece/blob/master/doc/normalization.md) than during regular `nfkc` normalization. This difference is not entirely additive and prevents the normalization of `～` to `~`. Kitoken uses the regular `nfkc` normalization rules for `nmt_nfkc` and `nmt_nfkc_cf` and normalizes `～` to `~`.
-- SentencePiece's implementation of Unigram merges pieces with the same merge priority in a different order depending on preceding non-encodable pieces. For example, with `xlnet_base_cased`, SentencePiece encodes `.nnn` and `Զnnn` as `.., 8705, 180` but `ԶԶnnn` as `.., 180, 8705`. Kitoken always merges pieces with the same merge priority in the same order, resulting in `.., 180, 8705` for either case in the example and matching the behavior of Tokenizers.
+- SentencePiece uses [different `nfkc` normalization rules in the `nmt_nfkc` and `nmt_nfkc_cf` schemes](https://github.com/google/sentencepiece/blob/master/doc/normalization.md) than during regular `nfkc` normalization, preventing the normalization of `～` to `~`. Kitoken uses the regular `nfkc` normalization rules for `nmt_nfkc` and `nmt_nfkc_cf`.
+- SentencePiece's implementation of Unigram merges pieces with the same merge priority in a different order depending on preceding non-encodable pieces. Kitoken always merges pieces with the same merge priority in the same order, matching the behavior of Tokenizers.

 </details>

@@ -83,7 +83,7 @@ Some normalization, post-processing and decoding options used by Tokenizers are
 <details>
 <summary>Notes</summary>

-- When using a `BPE` definition with an incomplete vocabulary and without an `unk` token, Tokenizers skips over non-encodable pieces and attempts to merge the surrounding ones. Kitoken always considers non-encodable pieces as un-mergeable and encodes the surrounding pieces individually. This can affect models that exploit the behavior of Tokenizers with a deliberately restricted vocabulary.
+- Tokenizers skips over non-encodable pieces and attempts to merge the surrounding ones when using an incomplete vocabulary without an `unk` token. Kitoken always considers non-encodable pieces as un-mergeable and encodes the surrounding pieces individually. This can affect models that exploit the behavior of Tokenizers with a deliberately restricted vocabulary.
 - Tokenizers normalizes inputs character-by-character, while Kitoken normalizes inputs as one. This can result in differences during case-folding in some cases. For example, greek letter `Σ` has two lowercase forms, `σ` for within-word and `ς` for end-of-word use. Tokenizers will always lowercase `Σ` to `σ`, while Kitoken will lowercase it to either depending on the context.

 </details>
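Filled out into a complete program, the updated README snippet looks like the following; this is a sketch assuming `Kitoken::from_web`'s error type is convertible to `Box<dyn std::error::Error>`, as its use with `?` in the README suggests.

```rust
use kitoken::Kitoken;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the definition from the Hugging Face Hub, as in the README;
    // plain `http(s):` URLs are also accepted by the new loader.
    let encoder = Kitoken::from_web("hf:Qwen/Qwen3.5-9B")?;

    // Round-trip a string through encode and decode.
    let tokens = encoder.encode("Your future belongs to me.", true)?;
    let string = String::from_utf8(encoder.decode(&tokens, true)?)?;
    assert_eq!(string, "Your future belongs to me.");
    Ok(())
}
```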

packages/cli/Cargo.toml: 3 additions & 3 deletions

@@ -19,13 +19,13 @@ path = "src/main.rs"

 kitoken = { path = "../..", features = ["all"] }

-log = { version = "0.4.27" }
-clap = { version = "4.5.36", features = [
+log = { version = "0.4" }
+clap = { version = "4.5", features = [
     "std",
     "color",
     "suggestions",
     "derive",
     "unicode",
     "wrap_help",
 ] }
-simple_logger = { version = "5.0" }
+simple_logger = { version = "5.2" }

packages/cli/src/main.rs: 28 additions & 18 deletions

@@ -1,10 +1,10 @@
 use std::fs::File;
 use std::io::{BufReader, BufWriter, Read, Seek};
-use std::path::Path;
+use std::path::{Path, PathBuf};
 use std::sync::Once;

 use clap::Parser;
-use kitoken::{Definition, DeserializationError, Kitoken};
+use kitoken::{Definition, DeserializationError, Kitoken, WebRequestError};

 #[derive(Parser)]
 enum Command {
@@ -82,20 +82,18 @@ pub fn main() {
                 std::process::exit(1);
             }
             for path in paths {
-                convert(&path, true).unwrap_or_else(|error| {
+                convert_path(path.to_str().unwrap(), true).unwrap_or_else(|error| {
                     eprintln!("{}", error);
                     std::process::exit(1);
                 });
             }
         }
         Command::Compare { one, two } => {
-            let one = Path::new(&one);
-            let two = Path::new(&two);
-            let one = convert(one, false).unwrap_or_else(|error| {
+            let one = convert_path(&one, false).unwrap_or_else(|error| {
                 eprintln!("{}", error);
                 std::process::exit(1);
             });
-            let two = convert(two, false).unwrap_or_else(|error| {
+            let two = convert_path(&two, false).unwrap_or_else(|error| {
                 eprintln!("{}", error);
                 std::process::exit(1);
             });
@@ -129,18 +127,16 @@ pub fn main() {
             }
         }
         Command::Inspect { path } => {
-            let path = Path::new(&path);
-            let model = convert(path, false).unwrap_or_else(|error| {
+            let model = convert_path(&path, false).unwrap_or_else(|error| {
                 eprintln!("{}", error);
                 std::process::exit(1);
             });
             println!("Specials: {:#?}", model.specials);
             println!("{:#?}", model);
         }
         Command::Encode { model, input } => {
-            let model = Path::new(&model);
             let inputp = Path::new(&input);
-            let model = convert(model, false).unwrap_or_else(|error| {
+            let model = convert_path(&model, false).unwrap_or_else(|error| {
                 eprintln!("{}", error);
                 std::process::exit(1);
             });
@@ -166,9 +162,8 @@ pub fn main() {
             println!()
         }
         Command::Decode { model, input } => {
-            let model = Path::new(&model);
             let inputp = Path::new(&input);
-            let model = convert(model, false).unwrap_or_else(|error| {
+            let model = convert_path(&model, false).unwrap_or_else(|error| {
                 eprintln!("{}", error);
                 std::process::exit(1);
             });
@@ -202,10 +197,27 @@ pub fn main() {
     }
 }

-pub fn convert(path: &Path, write: bool) -> Result<Definition, DeserializationError> {
+pub fn convert_web(url: &str) -> Result<Definition, WebRequestError> {
+    let definition = Definition::from_web(url)?;
+    eprintln!("Definition loaded from {}", definition.meta.source);
+    convert(definition, None).map_err(|e| e.into())
+}
+
+pub fn convert_path(path: &str, write: bool) -> Result<Definition, WebRequestError> {
+    if path.starts_with("hf:") || path.starts_with("http:") || path.starts_with("https:") {
+        return convert_web(path);
+    }
     let mut reader = BufReader::new(File::open(path)?);
     let definition = Definition::from_reader(&mut reader)?;
-    eprintln!("Definition loaded from {}", path.display());
+    eprintln!("Definition loaded from {}", path);
+    eprintln!("Input size: {} bytes", reader.stream_position()?);
+    convert(definition, write.then(|| PathBuf::from(path).with_extension("kit")))
+        .map_err(|e| e.into())
+}
+
+pub fn convert(
+    definition: Definition, out: Option<PathBuf>,
+) -> Result<Definition, DeserializationError> {
     match definition.model {
         kitoken::Model::BytePair { .. } => eprintln!("Model type: BPE"),
         kitoken::Model::Unigram { .. } => eprintln!("Model type: Unigram"),
@@ -214,9 +226,7 @@ pub fn convert(path: &Path, write: bool) -> Result<Definition, DeserializationEr
     }
     eprintln!("Vocab size: {}", definition.model.vocab().len());
     eprintln!("Specials size: {}", definition.specials.len());
-    eprintln!("Input size: {} bytes", reader.stream_position()?);
-    if write {
-        let out = path.with_extension("kit");
+    if let Some(out) = out {
         let mut writer = BufWriter::new(File::create(&out)?);
         definition.to_writer(&mut writer)?;
         eprintln!("Definition written to {}", out.display());

packages/python/README.md: 4 additions & 4 deletions

@@ -12,7 +12,7 @@
 ```py
 from kitoken import Kitoken

-encoder = Kitoken.from_file("models/llama4.model")
+encoder = Kitoken.from_web("hf:Qwen/Qwen3.5-9B")

 tokens = encoder.encode("hello world!", True)
 string = encoder.decode(tokens).decode("utf-8")
@@ -25,12 +25,12 @@ assert string == "hello world!"
 Kitoken is a fast and versatile tokenizer for language models compatible with [SentencePiece](https://github.com/google/sentencepiece), [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers), [OpenAI Tiktoken](https://github.com/openai/tiktoken) and [Mistral Tekken](https://docs.mistral.ai/guides/tokenization), supporting BPE, Unigram and WordPiece tokenization.

 - **Fast and efficient tokenization**\
-  Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](//github.com/Systemcluster/kitoken#benchmarks) for comparisons with different datasets.
+  Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](#benchmarks) for comparisons with different datasets.
 - **Runs in all environments**\
   Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
 - **Supports input and output processing**\
-  Including unicode-aware normalization, pre-tokenization and post-decoding options.
-- **Compact data format**\
+  Including unicode-aware normalization, pre-tokenization and post-processing options.
+- **Compact data encoding**\
   Definitions are stored in an efficient binary format and without merge list.

 See the main [README](//github.com/Systemcluster/kitoken) for more information.

packages/python/kitoken.pyi: 15 additions & 2 deletions

@@ -49,7 +49,9 @@ class Kitoken:
         """
         ...

-    def encode_all(self, text: list[str], encode_specials: Optional[bool] = False) -> list[list[int]]:
+    def encode_all(
+        self, text: list[str], encode_specials: Optional[bool] = False
+    ) -> list[list[int]]:
         """
         Encodes the given texts into sequences of tokens.
         If `encode_specials` is `True`, the text is first split around special tokens which are separately encoded with the special encoder.
@@ -69,7 +71,9 @@
         """
         ...

-    def decode_all(self, data: list[list[int]], decode_specials: Optional[bool] = False) -> list[bytes]:
+    def decode_all(
+        self, data: list[list[int]], decode_specials: Optional[bool] = False
+    ) -> list[bytes]:
         """
         Decodes the given sequences of tokens into texts.
         Returns a list of lists of bytes, or an error if no byte sequence for a token exists in the decoder and no unknown token is set in the configuration.
@@ -177,3 +181,12 @@
         :param path: The path to the file.
         """
         ...
+
+    @staticmethod
+    def from_web(url: str) -> Kitoken:
+        """
+        Initializes the tokenizer from a model URL.
+
+        :param url: The URL to the file.
+        """
+        ...

packages/python/src/lib.rs: 7 additions & 0 deletions

@@ -229,6 +229,13 @@ impl Kitoken {
                 .map_err(convert_error)?,
         })
     }
+
+    #[staticmethod]
+    pub fn from_web(url: &str, py: Python<'_>) -> PyResult<Kitoken> {
+        Ok(Kitoken {
+            inner: py.detach(|| Inner::from_web(url)).map(Arc::new).map_err(convert_error)?,
+        })
+    }
 }

 #[pymodule]
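A note on the binding above: wrapping the call in `py.detach` releases the Python GIL while the definition downloads, so other Python threads keep running during the fetch, and `convert_error` maps the Rust error into a Python exception, consistent with the crate's other constructors.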

packages/python/test.py: 7 additions & 0 deletions

@@ -39,3 +39,10 @@

 encoder.to_bytes()
 print("OK")
+
+try:
+    encoder = Kitoken.from_web("hf:Qwen/Qwen3.5-9B")
+    print(encoder)
+    print("Web OK")
+except Exception as e:
+    print(f"Web ERR: {e}")

src/encoder/wordpiece.rs: 1 addition & 0 deletions

@@ -170,6 +170,7 @@ impl WordPiece {
             || self.max_word_chars > 0 && indices.clone().count() > self.max_word_chars
         {
             if fallback.peek() == Some(&Fallback::Unknown) && self.unknown.is_some() {
+                #[allow(clippy::unnecessary_unwrap)]
                 result.push(self.unknown.as_ref().unwrap().id);
             } else if fallback.peek() == Some(&Fallback::Skip) {
             } else {

src/lib.rs: 14 additions & 2 deletions

@@ -53,6 +53,7 @@
 //! - `regex-onig`: Enables use of the `oniguruma` regex engine instead of `fancy-regex`.
 //!   Generally not recommended since it has worse runtime performance and adds a dependency on the native `oniguruma` library.
 //!   However, it may be useful for certain models that require specific regex behavior that is not supported by or differs with `fancy-regex`.
+//! - `web`: Enables fetching definitions from HuggingFace or other URLs.

 #![no_std]
 #![cfg_attr(docsrs, feature(doc_auto_cfg, doc_cfg_hide))]
@@ -73,6 +74,8 @@ mod vocab;

 #[cfg(feature = "serialization")]
 mod serialization;
+#[cfg(feature = "web")]
+mod web;

 pub mod convert;

@@ -95,6 +98,11 @@ pub use crate::vocab::*;

 #[cfg(feature = "serialization")]
 pub use crate::serialization::*;
+#[cfg(feature = "web")]
+pub use crate::web::*;
+
+#[doc(hidden)]
+pub mod util;

 /// Errors encountered during initialization.
 #[non_exhaustive]
@@ -104,13 +112,17 @@ pub enum InitializationError {
     #[error("invalid config: {0}")]
     InvalidConfig(ConfigurationError),
     /// The encoder and scores must have the same length in unigram mode.
-    #[error("encoder and scores must have the same length in unigram mode and every token must have a score")]
+    #[error(
+        "encoder and scores must have the same length in unigram mode and every token must have a score"
+    )]
     InvalidScores,
     /// The encoder and decoder must have the same length and the encoder must not have duplicates.
     #[error("encoder and decoder must have the same length and vocab must not have duplicates")]
     InvalidEncoder,
     /// The special encoder and decoder must have the same length and the special encoder must not have duplicates.
-    #[error("special encoder and decoder must have the same length and specials must not have duplicates")]
+    #[error(
+        "special encoder and decoder must have the same length and specials must not have duplicates"
+    )]
     InvalidSpecialEncoder,
     /// The split regex failed to compile.
     #[error("invalid regex: {0}")]
