Skip to content

Latest commit

 

History

History
185 lines (133 loc) · 12 KB

File metadata and controls

185 lines (133 loc) · 12 KB

Variant F derivation — from TurboQuant literal port to HIGGS-style simplification

Important attribution update (2026-04-08): After observing Tim Dettmers' general comment in llama.cpp discussion #20969 — directed at the thread's participants in general (6+ forks were all loosely calling their work "TurboQuant"), not at us specifically — we recognized the substance applied to our naming as well and updated our docs to credit HIGGS (Malinovskii et al., Nov 2024, arXiv:2411.17525) for the Random Hadamard Transform + scalar grid quantization pattern. The shipped Variant F is structurally closest to HIGGS (RHT + MSE-optimal grids on rotated values), applied to KV cache like TurboQuant, with both the QJL residual stage and the per-channel outlier split removed through ablation. We do not claim our shipped variant is the published TurboQuant algorithm — it is an empirically-derived simplification arrived at through 9 Karpathy-loop rounds.

Run date: 2026-04-08
Paper: Zandieh et al., TurboQuant, ICLR 2026
Hardware: Apple M1 Pro, 8 threads
Dataset: bench/data/ppl_1k.txt (1040 tokens, perplexity benchmark)
Verdict: Variant F (commit ac3c46a) — turbo_kv_4b BEATS uniform_4b at the same bit budget.

TL;DR

quant.cpp started with turbo_kv_3b / turbo_kv_4b implementing a literal port of Google TurboQuant (RHT → Lloyd-Max codebook → 1-bit QJL residual). The literal port was byte-for-byte equivalent to MSE-only when ablated — the QJL residual stage contributed exactly nothing to scores. After 6 Karpathy-loop rounds we converged on Variant F: drop QJL entirely, reinvest the freed 16 bytes in a 2× larger codebook. The result beats both our previous production baseline (uniform_4b) and llama.cpp's q4_0 KV at the same 4-bit budget.

KV type Bit budget PPL Δ vs FP32 Status
FP32 32 13.56 baseline
turbo_kv_4b (Variant F) 4 14.28 +5.3% best 4-bit
uniform_4b 4 14.41 +6.3% previous champion
llama.cpp q4_0 KV (lit. survey) 4 ~14.99 +10.6% for comparison
turbo_kv_3b (Variant F) 3 15.39 +13.5% best 3-bit

Karpathy loop history

Six rounds of score-driven iteration on Llama 3.2 3B PPL:

Round Variant What changed turbo_kv_4b turbo_kv_3b Decision
0 Literal port RHT + Lloyd-Max-Gauss + QJL + ‖r‖, inv_std=√d 16.03 25.84 baseline
1 A — empirical std per-block 1/std instead of √d 15.87 25.07 keep
2 B — max-abs no-clip `inv_std = MAX_CENT / max( x )` 15.39
3 C — 99th percentile Winsorized 17.24 revert
4 D — K·std sweep K ∈ {1.5,2,2.5,3,3.5,4} 15.53 (K=2 best) B still wins
5 E — uniform linear drop codebook, 8-level min/max 16.28 revert
6 F — drop QJL, ↑codebook reinvest 16 QJL bytes in larger codebook 14.28 15.39 shipped

Total improvement vs literal port: −1.75 PPL on 4b, −10.45 PPL on 3b.

Measured Numbers

Llama 3.2 3B Instruct (Q8_0 weights)

KV type Bits/elem PPL Δ vs FP32 Notes
fp32 32 13.56 baseline reference
uniform_4b + FP16 V 4 14.41 +6.3% simple per-block min-max ✅ recommended
turbo_kv_4b + FP16 V 4 16.03 +18.2% RHT + 3-bit codebook + 1-bit QJL
turbo_kv_3b + FP16 V 3 25.84 +90.6% RHT + 2-bit codebook + 1-bit QJL ❌

Llama 3.2 1B Instruct (added 2026-04-08)

KV type Bits/elem PPL Δ vs FP32 tok/s vs FP32 speed
fp32 32 16.88 baseline 35.2 baseline
turbo_kv_5b 🏆 5 17.00 +0.7% 28.3 −19.6%
turbo_kv_4b 4 18.11 +7.3% 30.4 −13.6%
uniform_4b 4 19.21 +13.8% 28.0 −20.5%
turbo_kv_3b 3 27.18 +61% 28.3 −19.6%

Cross-size pattern (3 models tested):

Model turbo_kv_4b PPL Δ turbo_kv_5b PPL Δ turbo_kv_3b PPL Δ
SmolLM2 135M +5.8% +1.7% n/a
Llama 3.2 1B +7.3% +0.7% +61%
Llama 3.2 3B +5.7% +0.7% +13.3%

Findings:

  • turbo_kv_5b is consistently near-lossless across model sizes (~1% PPL Δ)
  • turbo_kv_4b PPL gap is 5–8% across sizes, slightly worse on smaller models
  • turbo_kv_3b is unsuitable below 3B parameters — 3-bit codebook is too coarse for the smaller models' more concentrated KV distributions
  • Speed gap to fp32 widens on smaller models (−7% on 3B → −14% on 1B → −20% on 135M) because the per-token attention overhead is a larger fraction of total work when matmul is small

Llama 3.1 8B Instruct (paper baseline) — TODO

The Google TurboQuant paper benchmarks on Llama 3.1 8B with LongBench-E. We attempted to run our PPL eval on this model but Q8_0 (8 GB) hit swap on the 16 GB test machine and Q4_K_M (4.6 GB) was prohibitively slow (>50 min for one fp32 measurement). Llama 3.1 8B reproduction is deferred to a session with more RAM or a server-class machine.

SmolLM2 135M Instruct

KV type Bits/elem PPL Δ vs FP32 Notes
fp32 32 18.62 baseline reference
uniform_4b + FP16 V 4 20.33 +9.2%
turbo_kv_4b + FP16 V 4 24.94 +33.9%
turbo_kv_3b + FP16 V 3 68.23 +266% catastrophic

What the paper claims

Model Method Paper number
Llama-3.1-8B Full cache LongBench 50.06
Llama-3.1-8B TurboQuant 3.5-bit LongBench 50.06 (identical to baseline)
Llama-3.1-8B TurboQuant 2.5-bit LongBench 49.44
NIH @ 3-bit ~0.997 (vs 1.000 baseline)

Translated to PPL terms, the paper's results imply approximately zero PPL degradation at 3.5-bit and <2% degradation at 2.5-bit. We are at +18.2% at 4-bit and +90.6% at 3-bit — orders of magnitude worse.

Building blocks audit

Component Status Notes
Per-vector L2 normalization (‖x‖ stored as FP16) ✅ correct Lines 180–185
Random Hadamard Transform (tq_rht_transform) ✅ correct Walsh-Hadamard + Rademacher
Lloyd-Max-Gaussian centroids ✅ correct Match Max 1960 N(0,1) tables to 4 decimals
inv_std = √d rescaling ⚠️ suspect Assumes coords are exactly N(0, 1/d). For finite d the marginal distribution of a uniform unit vector coordinate is Beta(1/2, (d−1)/2) rescaled, NOT exactly Gaussian.
Residual norm ‖r‖₂ stored as FP16 ✅ correct Lines 226–230
1-bit QJL sign hash on residual ✅ correct compute_qjl_signs
Pre-rotated query optimization ✅ correct q_rot = RHT(query) once
Inner product estimator combining stages ⚠️ unverified dot1 + r_norm * qjl_correction — formula may not exactly match paper

Ablation: which stage is broken?

Ran turbo_kv_* with the QJL correction forcibly disabled (MSE-only) on Llama 3.2 3B:

Config PPL Δ from full
turbo_kv_4b full (MSE+QJL) 16.03 (baseline)
turbo_kv_4b MSE-only 16.03 0.00
turbo_kv_3b full (MSE+QJL) 25.84 (baseline)
turbo_kv_3b MSE-only 25.84 0.00

The QJL stage contributes literally nothing to the final scores. Disabling it produces byte-identical PPL.

This narrows the diagnosis dramatically:

  1. The QJL correction term is being computed as ~0 (or constant) regardless of input
  2. The MSE-only Lloyd-Max codebook stage is strictly worse than uniform per-block min-max at the same bit budget — Lloyd-Max-Gaussian centroids appear to clip outliers that uniform_4b's per-block range captures
  3. Real key vectors after RHT have heavier tails than the N(0,1) assumption — likely because the keys themselves have a few large components that don't fully redistribute even after a single-stage Hadamard rotation

Two structural fixes are needed:

  • Outlier handling at Stage 1 (paper does this — 32 outlier channels at higher bit width)
  • QJL correction debugging — verify the constant √(π/2)/m is right for our Rademacher rows (the original paper uses Gaussian rows; constants differ)

Hypotheses for the gap

  1. Lloyd-Max scaling: After random rotation of a unit-norm vector, coordinates follow a Beta(1/2, (d−1)/2) distribution scaled to [−1, 1], not exactly N(0, 1/d). The discrepancy matters at small d (head_dim 64–128). Need to either (a) recompute centroids for the Beta distribution, or (b) verify that the Gaussian approximation suffices for d ≥ 128.

  2. QJL correction formula: The paper's combined estimator is ⟨q, x̃_mse⟩ + ‖r‖₂ · ⟨q, Q_qjl⁻¹(Q_qjl(r))⟩. Our code uses dot1 + r_norm * qjl_dot * qjl_scale where qjl_scale = √(π/2) / sketch_dim. The constant factor and the fact that residual is computed after normalization may both be off.

  3. Per-channel outlier handling: The paper allocates extra bits to ~25% of channels identified as outliers. We do uniform per-channel allocation. This alone could account for a meaningful portion of the gap.

  4. Block size: The paper operates on the full vector. We block at TQ_BK = 128. For head_dim ≤ 128 this is moot, but the per-block normalization may interact differently with rotation than per-vector normalization.

  5. Sketch dimension: We use sketch_dim = head_dim. The paper may use a different ratio (typically sketch_dim ≥ 2·d for QJL).

What works today (recommended config)

For users who want maximum compression with minimum quality loss today, the recommended config is:

./build/quant model.gguf --chat -p "..." -k uniform_4b -v fp16   # 1.6x compression, +6.3% PPL on 3B
./build/quant model.gguf --chat -p "..." -k uniform_4b -v q4     # 6.9x compression, +6.3% PPL on 3B (V quality preserved)

turbo_kv_* is not currently recommended for production use until the gap is closed.

Action items

  1. ☐ Recompute Lloyd-Max centroids assuming Beta(1/2, (d−1)/2) for d ∈ {64, 128, 256}
  2. ☐ Implement per-channel outlier extraction (32 outlier channels at higher bit width per the paper)
  3. ☐ Verify QJL correction constant against the original QJL paper (arXiv:2406.03482)
  4. ☐ Test with sketch_dim = 2 · head_dim
  5. ☐ Ablation: turn off QJL stage entirely; measure MSE-only PPL to isolate stage 1 vs stage 2
  6. ☐ Add a unit test that fails if turbo_kv_4b PPL on Llama 3.2 3B exceeds 14.5 (currently 16.03)
  7. ☐ Track in GitHub issue for community visibility

Reproducing this report

cmake --build build -j$(nproc)
for k in fp32 uniform_4b turbo_kv_4b turbo_kv_3b; do
  echo "=== $k ==="
  ./build/quant models/Llama-3.2-3B-Instruct-Q8_0.gguf \
    --ppl bench/data/ppl_1k.txt -j 8 -k $k -v fp16 2>&1 | tail -3
done

Honest positioning (post Variant F)

quant.cpp's turbo_kv_4b is now the best 4-bit KV cache quantization in the project, beating both our previous production champion (uniform_4b) and llama.cpp's q4_0 KV at the same bit budget on Llama 3.2 3B perplexity.

It is inspired by but not identical to Google's TurboQuant. The literal paper algorithm (RHT + Lloyd-Max + 1-bit QJL residual + ‖r‖₂ scalar) was a straight port and produced the broken baseline numbers above. The shipped Variant F drops the QJL stage entirely (which contributed zero in our measurements) and reinvests the freed bytes in a finer codebook. This is structurally simpler than the paper but empirically better on our benchmark.

We don't claim to reproduce the paper's exact numbers — those are reported on Llama 3.1 8B with LongBench, and may also benefit from per-channel outlier handling we don't implement. We claim to ship a single-header C engine with KV compression that measurably beats the previous open-source baselines at the same bit budget, and credit Google's TurboQuant paper as the structural starting point.