Important attribution update (2026-04-08): After observing Tim Dettmers' general comment in llama.cpp discussion #20969 — directed at the thread's participants in general (6+ forks were all loosely calling their work "TurboQuant"), not at us specifically — we recognized the substance applied to our naming as well and updated our docs to credit HIGGS (Malinovskii et al., Nov 2024, arXiv:2411.17525) for the Random Hadamard Transform + scalar grid quantization pattern. The shipped Variant F is structurally closest to HIGGS (RHT + MSE-optimal grids on rotated values), applied to KV cache like TurboQuant, with both the QJL residual stage and the per-channel outlier split removed through ablation. We do not claim our shipped variant is the published TurboQuant algorithm — it is an empirically-derived simplification arrived at through 9 Karpathy-loop rounds.
Run date: 2026-04-08
Paper: Zandieh et al., TurboQuant, ICLR 2026
Hardware: Apple M1 Pro, 8 threads
Dataset:bench/data/ppl_1k.txt(1040 tokens, perplexity benchmark)
Verdict: Variant F (commitac3c46a) —turbo_kv_4bBEATSuniform_4bat the same bit budget.
quant.cpp started with turbo_kv_3b / turbo_kv_4b implementing a literal port of Google TurboQuant (RHT → Lloyd-Max codebook → 1-bit QJL residual). The literal port was byte-for-byte equivalent to MSE-only when ablated — the QJL residual stage contributed exactly nothing to scores. After 6 Karpathy-loop rounds we converged on Variant F: drop QJL entirely, reinvest the freed 16 bytes in a 2× larger codebook. The result beats both our previous production baseline (uniform_4b) and llama.cpp's q4_0 KV at the same 4-bit budget.
| KV type | Bit budget | PPL | Δ vs FP32 | Status |
|---|---|---|---|---|
| FP32 | 32 | 13.56 | — | baseline |
turbo_kv_4b (Variant F) ⭐ |
4 | 14.28 | +5.3% | best 4-bit |
uniform_4b |
4 | 14.41 | +6.3% | previous champion |
llama.cpp q4_0 KV (lit. survey) |
4 | ~14.99 | +10.6% | for comparison |
turbo_kv_3b (Variant F) |
3 | 15.39 | +13.5% | best 3-bit |
Six rounds of score-driven iteration on Llama 3.2 3B PPL:
| Round | Variant | What changed | turbo_kv_4b | turbo_kv_3b | Decision |
|---|---|---|---|---|---|
| 0 | Literal port | RHT + Lloyd-Max-Gauss + QJL + ‖r‖, inv_std=√d |
16.03 | 25.84 | baseline |
| 1 | A — empirical std | per-block 1/std instead of √d | 15.87 | 25.07 | keep |
| 2 | B — max-abs no-clip | `inv_std = MAX_CENT / max( | x | )` | 15.39 |
| 3 | C — 99th percentile | Winsorized | 17.24 | — | revert |
| 4 | D — K·std sweep | K ∈ {1.5,2,2.5,3,3.5,4} | 15.53 (K=2 best) | — | B still wins |
| 5 | E — uniform linear | drop codebook, 8-level min/max | 16.28 | — | revert |
| 6 | F — drop QJL, ↑codebook | reinvest 16 QJL bytes in larger codebook | 14.28 ✅ | 15.39 ✅ | shipped |
Total improvement vs literal port: −1.75 PPL on 4b, −10.45 PPL on 3b.
| KV type | Bits/elem | PPL | Δ vs FP32 | Notes |
|---|---|---|---|---|
| fp32 | 32 | 13.56 | baseline | reference |
uniform_4b + FP16 V |
4 | 14.41 | +6.3% | simple per-block min-max ✅ recommended |
turbo_kv_4b + FP16 V |
4 | 16.03 | +18.2% | RHT + 3-bit codebook + 1-bit QJL |
turbo_kv_3b + FP16 V |
3 | 25.84 | +90.6% | RHT + 2-bit codebook + 1-bit QJL ❌ |
| KV type | Bits/elem | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|---|---|---|---|---|---|
| fp32 | 32 | 16.88 | baseline | 35.2 | baseline |
turbo_kv_5b 🏆 |
5 | 17.00 | +0.7% | 28.3 | −19.6% |
turbo_kv_4b ⭐ |
4 | 18.11 | +7.3% | 30.4 | −13.6% |
uniform_4b |
4 | 19.21 | +13.8% | 28.0 | −20.5% |
turbo_kv_3b |
3 | 27.18 | +61% ❌ | 28.3 | −19.6% |
Cross-size pattern (3 models tested):
| Model | turbo_kv_4b PPL Δ | turbo_kv_5b PPL Δ | turbo_kv_3b PPL Δ |
|---|---|---|---|
| SmolLM2 135M | +5.8% | +1.7% | n/a |
| Llama 3.2 1B | +7.3% | +0.7% | +61% ❌ |
| Llama 3.2 3B | +5.7% | +0.7% | +13.3% |
Findings:
turbo_kv_5bis consistently near-lossless across model sizes (~1% PPL Δ)turbo_kv_4bPPL gap is 5–8% across sizes, slightly worse on smaller modelsturbo_kv_3bis unsuitable below 3B parameters — 3-bit codebook is too coarse for the smaller models' more concentrated KV distributions- Speed gap to fp32 widens on smaller models (−7% on 3B → −14% on 1B → −20% on 135M) because the per-token attention overhead is a larger fraction of total work when matmul is small
The Google TurboQuant paper benchmarks on Llama 3.1 8B with LongBench-E. We attempted to run our PPL eval on this model but Q8_0 (8 GB) hit swap on the 16 GB test machine and Q4_K_M (4.6 GB) was prohibitively slow (>50 min for one fp32 measurement). Llama 3.1 8B reproduction is deferred to a session with more RAM or a server-class machine.
| KV type | Bits/elem | PPL | Δ vs FP32 | Notes |
|---|---|---|---|---|
| fp32 | 32 | 18.62 | baseline | reference |
uniform_4b + FP16 V |
4 | 20.33 | +9.2% | |
turbo_kv_4b + FP16 V |
4 | 24.94 | +33.9% | |
turbo_kv_3b + FP16 V |
3 | 68.23 | +266% | catastrophic |
| Model | Method | Paper number |
|---|---|---|
| Llama-3.1-8B | Full cache | LongBench 50.06 |
| Llama-3.1-8B | TurboQuant 3.5-bit | LongBench 50.06 (identical to baseline) |
| Llama-3.1-8B | TurboQuant 2.5-bit | LongBench 49.44 |
| — | NIH @ 3-bit | ~0.997 (vs 1.000 baseline) |
Translated to PPL terms, the paper's results imply approximately zero PPL degradation at 3.5-bit and <2% degradation at 2.5-bit. We are at +18.2% at 4-bit and +90.6% at 3-bit — orders of magnitude worse.
| Component | Status | Notes |
|---|---|---|
Per-vector L2 normalization (‖x‖ stored as FP16) |
✅ correct | Lines 180–185 |
Random Hadamard Transform (tq_rht_transform) |
✅ correct | Walsh-Hadamard + Rademacher |
| Lloyd-Max-Gaussian centroids | ✅ correct | Match Max 1960 N(0,1) tables to 4 decimals |
inv_std = √d rescaling |
Assumes coords are exactly N(0, 1/d). For finite d the marginal distribution of a uniform unit vector coordinate is Beta(1/2, (d−1)/2) rescaled, NOT exactly Gaussian. |
|
Residual norm ‖r‖₂ stored as FP16 |
✅ correct | Lines 226–230 |
| 1-bit QJL sign hash on residual | ✅ correct | compute_qjl_signs |
| Pre-rotated query optimization | ✅ correct | q_rot = RHT(query) once |
| Inner product estimator combining stages | dot1 + r_norm * qjl_correction — formula may not exactly match paper |
Ran turbo_kv_* with the QJL correction forcibly disabled (MSE-only) on Llama 3.2 3B:
| Config | PPL | Δ from full |
|---|---|---|
turbo_kv_4b full (MSE+QJL) |
16.03 | (baseline) |
turbo_kv_4b MSE-only |
16.03 | 0.00 |
turbo_kv_3b full (MSE+QJL) |
25.84 | (baseline) |
turbo_kv_3b MSE-only |
25.84 | 0.00 |
The QJL stage contributes literally nothing to the final scores. Disabling it produces byte-identical PPL.
This narrows the diagnosis dramatically:
- The QJL correction term is being computed as ~0 (or constant) regardless of input
- The MSE-only Lloyd-Max codebook stage is strictly worse than uniform per-block min-max at the same bit budget — Lloyd-Max-Gaussian centroids appear to clip outliers that uniform_4b's per-block range captures
- Real key vectors after RHT have heavier tails than the N(0,1) assumption — likely because the keys themselves have a few large components that don't fully redistribute even after a single-stage Hadamard rotation
Two structural fixes are needed:
- Outlier handling at Stage 1 (paper does this — 32 outlier channels at higher bit width)
- QJL correction debugging — verify the constant
√(π/2)/mis right for our Rademacher rows (the original paper uses Gaussian rows; constants differ)
-
Lloyd-Max scaling: After random rotation of a unit-norm vector, coordinates follow a
Beta(1/2, (d−1)/2)distribution scaled to[−1, 1], not exactlyN(0, 1/d). The discrepancy matters at smalld(head_dim 64–128). Need to either (a) recompute centroids for the Beta distribution, or (b) verify that the Gaussian approximation suffices ford ≥ 128. -
QJL correction formula: The paper's combined estimator is
⟨q, x̃_mse⟩ + ‖r‖₂ · ⟨q, Q_qjl⁻¹(Q_qjl(r))⟩. Our code usesdot1 + r_norm * qjl_dot * qjl_scalewhereqjl_scale = √(π/2) / sketch_dim. The constant factor and the fact that residual is computed after normalization may both be off. -
Per-channel outlier handling: The paper allocates extra bits to ~25% of channels identified as outliers. We do uniform per-channel allocation. This alone could account for a meaningful portion of the gap.
-
Block size: The paper operates on the full vector. We block at
TQ_BK = 128. Forhead_dim ≤ 128this is moot, but the per-block normalization may interact differently with rotation than per-vector normalization. -
Sketch dimension: We use
sketch_dim = head_dim. The paper may use a different ratio (typicallysketch_dim ≥ 2·dfor QJL).
For users who want maximum compression with minimum quality loss today, the recommended config is:
./build/quant model.gguf --chat -p "..." -k uniform_4b -v fp16 # 1.6x compression, +6.3% PPL on 3B
./build/quant model.gguf --chat -p "..." -k uniform_4b -v q4 # 6.9x compression, +6.3% PPL on 3B (V quality preserved)turbo_kv_* is not currently recommended for production use until the gap is closed.
- ☐ Recompute Lloyd-Max centroids assuming
Beta(1/2, (d−1)/2)ford ∈ {64, 128, 256} - ☐ Implement per-channel outlier extraction (32 outlier channels at higher bit width per the paper)
- ☐ Verify QJL correction constant against the original QJL paper (arXiv:2406.03482)
- ☐ Test with
sketch_dim = 2 · head_dim - ☐ Ablation: turn off QJL stage entirely; measure MSE-only PPL to isolate stage 1 vs stage 2
- ☐ Add a unit test that fails if
turbo_kv_4bPPL on Llama 3.2 3B exceeds 14.5 (currently 16.03) - ☐ Track in GitHub issue for community visibility
cmake --build build -j$(nproc)
for k in fp32 uniform_4b turbo_kv_4b turbo_kv_3b; do
echo "=== $k ==="
./build/quant models/Llama-3.2-3B-Instruct-Q8_0.gguf \
--ppl bench/data/ppl_1k.txt -j 8 -k $k -v fp16 2>&1 | tail -3
donequant.cpp's turbo_kv_4b is now the best 4-bit KV cache quantization in the project, beating both our previous production champion (uniform_4b) and llama.cpp's q4_0 KV at the same bit budget on Llama 3.2 3B perplexity.
It is inspired by but not identical to Google's TurboQuant. The literal paper algorithm (RHT + Lloyd-Max + 1-bit QJL residual + ‖r‖₂ scalar) was a straight port and produced the broken baseline numbers above. The shipped Variant F drops the QJL stage entirely (which contributed zero in our measurements) and reinvests the freed bytes in a finer codebook. This is structurally simpler than the paper but empirically better on our benchmark.
We don't claim to reproduce the paper's exact numbers — those are reported on Llama 3.1 8B with LongBench, and may also benefit from per-channel outlier handling we don't implement. We claim to ship a single-header C engine with KV compression that measurably beats the previous open-source baselines at the same bit budget, and credit Google's TurboQuant paper as the structural starting point.