Commit 5a0391e

Bump version to 0.3.22
1 parent 16dba8c

2 files changed: 22 additions & 1 deletion

CHANGELOG.md

Lines changed: 21 additions & 0 deletions
@@ -7,6 +7,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.3.22]
+- perf (TTFT): Optimize `longest_token_prefix` with a NumPy vectorized (SIMD) comparison and a fast-fail probe (a sketch of the approach appears after the diffs below)
+  - Vectorization: Replaced the standard Python `zip` loop with a NumPy element-wise comparison for high-performance context matching.
+
+  - Fast exit: Added an O(1) probe check (`a[0] != b[0]`) that skips the NumPy conversion overhead entirely on a first-token mismatch, making the time to first token virtually instantaneous for cached sessions.
+
+  - Memory optimization: Only the overlapping region of the two sequences (`[:min_len]`) is converted to NumPy arrays, minimizing memory allocation.
+
+  - Result: Achieved a ~5x speedup (129 ms -> 25 ms) in KV cache reuse scenarios and a ~2.5x speedup (554.23 ms -> 201.62 ms) in load time while maintaining stability on long contexts.
+
+  - Comparative test results: https://github.com/JamePeng/llama-cpp-python/issues/47#issuecomment-3761094840
+
+  - This change significantly reduces latency for RAG and chat applications with long contexts.
+
+- feat: [Add support for adaptive_p and infill samplers and optimize the sampler logic.](https://github.com/JamePeng/llama-cpp-python/commit/99e7ece91a9765ae31922c2bc79f5be1e22bf61e)
+
+- feat: [Add initialization checks to the Encoder-Decoder architecture.](https://github.com/JamePeng/llama-cpp-python/commit/a43d904dbac45b87d4086b096779b139fb52a34e)
+
+- feat: Update llama.cpp to [ggml-org/llama.cpp/commit/10c98cbdf623d982f7491e8de5711e916a913192](https://github.com/ggml-org/llama.cpp/commit/10c98cbdf623d982f7491e8de5711e916a913192)
+- feat: Sync llama.cpp llama/mtmd API Binding 20260116
+
 ## [0.3.21]
 - perf: optimize tokenization and detokenization logic
   - Refactor `tokenize`, `token_to_piece`, and `detokenize` methods in `_internals.py` to significantly reduce Python loop overhead and improve the batch-processing performance and stability of `load`/`prompt-eval`.

llama_cpp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 from .llama_cpp import *
 from .llama import *
 
-__version__ = "0.3.21"
+__version__ = "0.3.22"
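
For illustration, here is a minimal sketch of the prefix-matching optimization the 0.3.22 changelog entry describes: an O(1) first-token probe followed by a single NumPy vectorized comparison over only the overlapping region of the two token sequences. The function name `longest_token_prefix` is taken from the changelog; the signature, dtype, and return convention are assumptions for this sketch, not the repository's exact implementation.

```python
import numpy as np

def longest_token_prefix(a, b):
    """Length of the longest common prefix of two token sequences.

    Sketch of the 0.3.22 approach: an O(1) first-token probe,
    then one NumPy vectorized comparison over only the
    overlapping region of the two sequences.
    """
    min_len = min(len(a), len(b))
    # Fast-fail probe: on an immediate mismatch (the common case when
    # no cached context applies), skip the NumPy conversion entirely.
    if min_len == 0 or a[0] != b[0]:
        return 0
    # Convert only the intersection [:min_len] to arrays, then compare
    # element-wise in one vectorized (SIMD-backed) pass instead of a
    # Python-level zip loop.
    a_arr = np.asarray(a[:min_len], dtype=np.int64)
    b_arr = np.asarray(b[:min_len], dtype=np.int64)
    mismatches = np.nonzero(a_arr != b_arr)[0]
    return min_len if mismatches.size == 0 else int(mismatches[0])

# Example: the first three tokens match, so three cached KV entries
# could be reused.
print(longest_token_prefix([1, 2, 3, 4], [1, 2, 3, 9]))  # -> 3
```

The probe is what makes the mismatch path effectively free: converting Python lists to NumPy arrays carries a fixed cost, and paying it only after the first tokens already agree keeps a cold (uncached) comparison at O(1).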
