v0.7.2: turbo_kv_5b_fast — near-lossless quality at fp32 parity speed
New KV type TQ_TYPE_TURBO_KV_5B_FAST. Same Variant F algorithm as
turbo_kv_5b (RHT + 32-level Lloyd-Max codebook), but stores each
5-bit index as a full byte instead of bit-packed. Wastes 3 bits per
index, but eliminates the scalar bit-extraction overhead that kept
turbo_kv_5b at -8.8% vs fp32 in v0.7.1.
Llama 3.2 3B PPL eval (3 runs each, CPU-only):
Type              Bytes  Compression  PPL    Δ vs FP32  tok/s  vs FP32
----------------  -----  -----------  -----  ---------  -----  --------
fp32              -      1×           13.56  -          17.93  baseline
turbo_kv_4b       72     7.1×         14.08  +3.8%      18.13  +1.1% ⭐
turbo_kv_5b       88     5.8×         13.65  +0.7%      16.93  -5.6%
turbo_kv_5b_fast  136    3.76×        13.65  +0.7%      17.53  -2.2% 🆕
                                                               ↑ new Pareto
The 5b_fast inner loop is pure NEON tbl with no scalar unpack: just
vld1q_u8 + vqtbl2q_s8 + int8→fp32 + scale + fma. This is the cleanest
implementation in the codebase and the closest to fp32 speed parity
among the near-lossless types.
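As a rough illustration of why byte-per-index storage helps, here is a portable scalar sketch of the same lookup + scale + fma step, next to the kind of bit extraction a packed 5-bit layout needs. The function names, the single per-block float scale, and the LSB-first packing order are assumptions for illustration only; the real kernel in src/core/tq_turbo_kv.c uses NEON and may differ in all of these.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the byte-per-index dot product.
 * codebook: 32 int8 levels (caller-supplied stand-in, not the real table).
 * idx:      one index per byte, each in [0, 31].
 * scale:    per-block dequantization scale (assumed to be one float). */
static float dot_byte_indexed(const int8_t codebook[32], const uint8_t *idx,
                              const float *q, size_t n, float scale)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        assert(idx[i] < 32);                 /* only 5 of the 8 bits are used */
        acc += (float)codebook[idx[i]] * scale * q[i];
    }
    return acc;
}

/* For contrast: fetch the i-th 5-bit index from a bit-packed buffer.
 * LSB-first packing is an assumption; the shifts and the straddle branch
 * are the per-element scalar work that a byte-per-index layout avoids. */
static uint8_t unpack_5bit(const uint8_t *packed, size_t i)
{
    size_t bit = i * 5;
    size_t byte = bit >> 3;
    unsigned shift = (unsigned)(bit & 7);
    unsigned v = (unsigned)packed[byte] >> shift;
    if (shift > 3)                           /* index straddles two bytes */
        v |= (unsigned)packed[byte + 1] << (8 - shift);
    return (uint8_t)(v & 0x1F);
}
```

In the byte-indexed form the index is already a valid table offset, which is what lets a table-lookup instruction consume 16 indices at a time without any per-element shifting.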
Trade-off: 3.76× compression vs 5.8× for turbo_kv_5b. Same +0.7% PPL
on Llama 3.2 3B (verified — both share the same 32-level codebook
and algorithm). Use case: "want near-lossless quality + parity speed,
have memory to spare for less compression".
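The compression figures can be reproduced from the block sizes alone. A minimal sketch, assuming 128 elements per block (inferred from mse_indices[128]) and 4-byte fp32 values; the helper names are hypothetical and not part of the codebase:

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK_ELEMS = 128 };  /* elements per block (inferred, see above) */

/* Compression ratio relative to storing the same block as fp32. */
static double compression_vs_fp32(size_t block_bytes)
{
    assert(block_bytes > 0);
    return (double)(BLOCK_ELEMS * sizeof(float)) / (double)block_bytes;
}

/* Effective bits stored per element, block header included. */
static double bits_per_element(size_t block_bytes)
{
    return 8.0 * (double)block_bytes / (double)BLOCK_ELEMS;
}
```

With these, 136 bytes gives 512/136 ≈ 3.76× (8.5 bits per element) and 88 bytes gives 512/88 ≈ 5.8× (5.5 bits per element), matching the table.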
Block layout (136 bytes):
  norm(2) + residual_norm(2) + inv_std(2) + _pad(2)
  + mse_indices[128]   ← one byte per 5-bit index (wastes 3 bits each)
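The layout above could be written as a C struct along these lines. This is a sketch, not the definition from tq_types.h: the struct name is hypothetical, and storing the three 2-byte header fields as raw uint16_t (for example fp16 bit patterns) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the 136-byte block described above. */
typedef struct {
    uint16_t norm;              /* 2 bytes; encoding assumed (e.g. fp16 bits) */
    uint16_t residual_norm;     /* 2 bytes */
    uint16_t inv_std;           /* 2 bytes */
    uint16_t _pad;              /* 2 bytes; keeps the header at 8 bytes */
    uint8_t  mse_indices[128];  /* one byte per 5-bit index, values 0..31 */
} block_sketch_5b_fast;

/* Compile-time size check, in the spirit of the size assertion the
 * commit adds for the real struct. */
static_assert(sizeof(block_sketch_5b_fast) == 136,
              "block must be exactly 136 bytes");
```

The 2-byte pad puts mse_indices at offset 8, so index loads start on an 8-byte boundary whenever the block itself is at least 8-byte aligned.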
Files changed:
- include/turboquant/tq_types.h: TQ_TYPE_TURBO_KV_5B_FAST enum,
  block_tq_turbo_kv_5b_fast struct (136 bytes), size assertion
- src/core/tq_turbo_kv.c: tq_turbo_kv_5b_fast_quantize_ref,
  tq_turbo_kv_5b_fast_dequantize_ref, tq_turbo_kv_5b_fast_attention_ref
  (pure NEON tbl, ~30% fewer instructions per element than 5b)
- src/core/tq_traits.c: traits table entry + format spec case
- tools/quant.c: CLI parser
- integrations/llamacpp/tq_kv_cache.cpp: ggml type registration,
  TQ_GGML_WRAPPERS / TQ_GGML_VEC_DOT, parse map
- .gitignore: add build_nometal/
35/35 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>