@@ -29,19 +29,19 @@ Kitoken is a fast and versatile tokenizer for language models compatible with [S
 Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
 
 **Supports input and output processing**\
 Including unicode-aware normalization, pre-tokenization and post-processing options.
 
-**Compact data format**\
+**Compact data encoding**\
 Definitions are stored in an efficient binary format and without merge list.
 
 See also [`kitoken-cli`](./packages/cli) for Kitoken in the command line.
 
 ## Compatibility
 
-Kitoken can load and convert many existing tokenizer formats. Every supported format is [tested](./tests) against the original implementation across a variety of inputs to ensure correctness and compatibility.
+Kitoken can load and convert most existing tokenizer formats. Every supported format is [tested](./tests) against the original implementation across a wide variety of inputs to ensure correctness and compatibility.
 
 > [!NOTE]
 > Most models on [Hugging Face](https://huggingface.co) are supported. Just take the `tokenizer.json` or `spiece.model` and load it into Kitoken.
 
-Kitoken aims to be output-identical with existing implementations for all models. See the notes below for differences in specific cases.
+Kitoken aims to be output-identical with existing implementations for all models. <sup>See the notes below for differences in specific cases.</sup>
 
 ### SentencePiece
@@ -59,8 +59,8 @@ If the model does not contain a trainer definition, `Unigram` is assumed as the
 <details>
 <summary>Notes</summary>
 
-- SentencePiece uses [different `nfkc` normalization rules in the `nmt_nfkc` and `nmt_nfkc_cf` schemes](https://github.com/google/sentencepiece/blob/master/doc/normalization.md) than during regular `nfkc` normalization. This difference is not entirely additive and prevents the normalization of `～` to `~`. Kitoken uses the regular `nfkc` normalization rules for `nmt_nfkc` and `nmt_nfkc_cf` and normalizes `～` to `~`.
-- SentencePiece's implementation of Unigram merges pieces with the same merge priority in a different order depending on preceding non-encodable pieces. For example, with `xlnet_base_cased`, SentencePiece encodes `.nnn` and `Զnnn` as `.., 8705, 180` but `ԶԶnnn` as `.., 180, 8705`. Kitoken always merges pieces with the same merge priority in the same order, resulting in `.., 180, 8705` for either case in the example and matching the behavior of Tokenizers.
+- SentencePiece uses [different `nfkc` normalization rules in the `nmt_nfkc` and `nmt_nfkc_cf` schemes](https://github.com/google/sentencepiece/blob/master/doc/normalization.md) than during regular `nfkc` normalization, preventing the normalization of `～` to `~`. Kitoken uses the regular `nfkc` normalization rules for `nmt_nfkc` and `nmt_nfkc_cf`.
+- SentencePiece's implementation of Unigram merges pieces with the same merge priority in a different order depending on preceding non-encodable pieces. Kitoken always merges pieces with the same merge priority in the same order, matching the behavior of Tokenizers.
 
 </details>
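The tilde case in the first note above can be checked with Python's standard library: plain NFKC folds U+FF5E (fullwidth tilde) to the ASCII tilde U+007E, which is exactly the mapping the `nmt_nfkc` scheme omits. This is a stdlib illustration using `unicodedata`, not Kitoken itself:

```python
import unicodedata

fullwidth_tilde = "\uff5e"  # ～ FULLWIDTH TILDE

# Plain NFKC folds the fullwidth tilde to the ASCII tilde "~" (U+007E).
folded = unicodedata.normalize("NFKC", fullwidth_tilde)
print(folded == "~")  # True
```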
@@ -83,7 +83,7 @@ Some normalization, post-processing and decoding options used by Tokenizers are
 <details>
 <summary>Notes</summary>
 
-When using a `BPE` definition with an incomplete vocabulary and without an `unk` token, Tokenizers skips over non-encodable pieces and attempts to merge the surrounding ones. Kitoken always considers non-encodable pieces as un-mergeable and encodes the surrounding pieces individually. This can affect models that exploit the behavior of Tokenizers with a deliberately restricted vocabulary.
+- Tokenizers skips over non-encodable pieces and attempts to merge the surrounding ones when using an incomplete vocabulary without an `unk` token. Kitoken always considers non-encodable pieces as un-mergeable and encodes the surrounding pieces individually. This can affect models that exploit the behavior of Tokenizers with a deliberately restricted vocabulary.
 - Tokenizers normalizes inputs character-by-character, while Kitoken normalizes inputs as one. This can result in differences during case-folding in some cases. For example, the Greek letter `Σ` has two lowercase forms, `σ` for within-word use and `ς` for end-of-word use. Tokenizers will always lowercase `Σ` to `σ`, while Kitoken will lowercase it to either depending on the context.
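The non-encodable-piece difference in the first of the notes above can be sketched with a toy merger in pure Python. This is an illustration only, with a made-up three-entry vocabulary and a simplistic leftmost-first merge loop, not Kitoken's or Tokenizers' actual BPE algorithm:

```python
# Toy vocabulary: "x" is not encodable (it has no entry).
vocab = {"a": 0, "b": 1, "ab": 2}

def merge(pieces: list[str]) -> list[int]:
    """Greedily merge adjacent pieces while their concatenation is in the vocab."""
    pieces = list(pieces)
    merged = True
    while merged:
        merged = False
        for i in range(len(pieces) - 1):
            candidate = pieces[i] + pieces[i + 1]
            if candidate in vocab:
                pieces[i:i + 2] = [candidate]
                merged = True
                break
    return [vocab[p] for p in pieces]

def encode_skipping(text: str) -> list[int]:
    # Tokenizers-style: drop non-encodable pieces first, then merge neighbors.
    return merge([c for c in text if c in vocab])

def encode_splitting(text: str) -> list[int]:
    # Kitoken-style: a non-encodable piece is an un-mergeable barrier,
    # so the runs on either side of it are encoded individually.
    out, run = [], []
    for c in text:
        if c in vocab:
            run.append(c)
        else:
            out += merge(run)
            run = []
    out += merge(run)
    return out

print(encode_skipping("axb"))   # [2]    -- "a" and "b" merge across the gap
print(encode_splitting("axb"))  # [0, 1] -- "a" and "b" stay separate
```

For inputs without non-encodable pieces the two strategies agree; they only diverge when a gap separates otherwise mergeable neighbors.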
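The `Σ` behavior in the last note can be reproduced with plain Python, since CPython's `str.lower` also applies the Unicode Final_Sigma context rule. This is a stdlib illustration of whole-input versus character-by-character folding, not Kitoken or Tokenizers code:

```python
word = "ΟΔΥΣΣΕΥΣ"  # Greek capitals; contains Σ both mid-word and word-final

# Folding the whole string is context-aware: the final Σ becomes ς.
whole = word.lower()

# Folding character-by-character loses the context: every Σ becomes σ.
per_char = "".join(c.lower() for c in word)

print(whole)     # οδυσσευς
print(per_char)  # οδυσσευσ
```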
 Kitoken is a fast and versatile tokenizer for language models compatible with [SentencePiece](https://github.com/google/sentencepiece), [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers), [OpenAI Tiktoken](https://github.com/openai/tiktoken) and [Mistral Tekken](https://docs.mistral.ai/guides/tokenization), supporting BPE, Unigram and WordPiece tokenization.
 
 **Fast and efficient tokenization**\
-Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](//github.com/Systemcluster/kitoken#benchmarks) for comparisons with different datasets.
+Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](#benchmarks) for comparisons with different datasets.
 
 **Runs in all environments**\
 Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
 
 **Supports input and output processing**\
-Including unicode-aware normalization, pre-tokenization and post-decoding options.
+Including unicode-aware normalization, pre-tokenization and post-processing options.
 
-**Compact data format**\
+**Compact data encoding**\
 Definitions are stored in an efficient binary format and without merge list.
 
 See the main [README](//github.com/Systemcluster/kitoken) for more information.