quantumaikr · unamedkr · Apr 12, 2026 · Apr 12, 2026
diff --git a/bindings/python/quantcpp/__init__.py b/bindings/python/quantcpp/__init__.py
@@ -96,6 +96,17 @@ class ChatContextOverflow(RuntimeError):
         "llama-3.2-1b-instruct-q4_k_m.gguf",
         750,
     ),
+    # Phi-3.5-mini-instruct (3.8B params, vocab 32K).
+    # Added 2026-04-12 after end-to-end Phi-3 architecture support
+    # landed (fused QKV / fused gate+up FFN / LongRoPE). The 32K vocab
+    # is the smallest in the registry, which keeps the per-token
+    # lm_head matmul small. Combined with 3.8B params it's the best
+    # quality-per-token model we ship.
+    "Phi-3.5-mini": (
+        "bartowski/Phi-3.5-mini-instruct-GGUF",
+        "Phi-3.5-mini-instruct-Q4_K_M.gguf",
+        2400,
+    ),
 }
 
 def available_models():

diff --git a/bindings/python/quantcpp/cli.py b/bindings/python/quantcpp/cli.py
@@ -23,13 +23,17 @@
 # the recommended default. Users who explicitly want the 135M demo model
 # need to ask for it by full name.
 MODEL_ALIASES = {
-    "smollm2":      "SmolLM2-1.7B",
-    "smollm2:1.7b": "SmolLM2-1.7B",
-    "smollm2:135m": "SmolLM2-135M",
-    "qwen3.5":      "Qwen3.5-0.8B",
-    "qwen3.5:0.8b": "Qwen3.5-0.8B",
-    "llama3.2":     "Llama-3.2-1B",
-    "llama3.2:1b":  "Llama-3.2-1B",
+    "smollm2":         "SmolLM2-1.7B",
+    "smollm2:1.7b":    "SmolLM2-1.7B",
+    "smollm2:135m":    "SmolLM2-135M",
+    "qwen3.5":         "Qwen3.5-0.8B",
+    "qwen3.5:0.8b":    "Qwen3.5-0.8B",
+    "llama3.2":        "Llama-3.2-1B",
+    "llama3.2:1b":     "Llama-3.2-1B",
+    "phi3.5":          "Phi-3.5-mini",
+    "phi3.5:mini":     "Phi-3.5-mini",
+    "phi-3.5":         "Phi-3.5-mini",
+    "phi-3.5-mini":    "Phi-3.5-mini",
 }
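A minimal sketch of how the new aliases could resolve to registry keys. The helper name `resolve_model_name` and the lower-casing step are assumptions for illustration; the actual lookup in `cli.py` is not shown in this diff. Only a subset of the alias table is copied here.

```python
# Subset of MODEL_ALIASES copied from the hunk above.
MODEL_ALIASES = {
    "smollm2":      "SmolLM2-1.7B",
    "llama3.2":     "Llama-3.2-1B",
    "phi3.5":       "Phi-3.5-mini",
    "phi3.5:mini":  "Phi-3.5-mini",
    "phi-3.5":      "Phi-3.5-mini",
    "phi-3.5-mini": "Phi-3.5-mini",
}


def resolve_model_name(name: str) -> str:
    """Hypothetical helper: map a CLI alias to a registry key.

    Names that are not aliases pass through unchanged, so full
    registry names such as "SmolLM2-135M" keep working.
    """
    return MODEL_ALIASES.get(name.strip().lower(), name)


print(resolve_model_name("phi3.5"))
print(resolve_model_name("SmolLM2-135M"))
```

Under this sketch all four Phi spellings land on the single registry key `Phi-3.5-mini`, which is the entry added in `__init__.py`.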
 
 

diff --git a/docs/spikes/2026-04-12_phi3_support.md b/docs/spikes/2026-04-12_phi3_support.md
@@ -0,0 +1,167 @@
+# Spike — Phi-3 / Phi-3.5 architecture support
+
+**Date**: 2026-04-12
+**Driver**: External user feedback (`docs/feedback/2026-04-12_0900.md`, item 2.6)
+**Status**: Investigation complete; implementation gated on having a real GGUF to validate against
+**Recommendation**: do NOT merge a fix without an end-to-end validation run
+
+## Why Phi-3 matters
+
+Phi-3.5-mini is the highest-value model NOT supported by quant.cpp:
+
+- **vocab 32K** — smaller than SmolLM2 (49K), Llama-3.2-1B (128K), Gemma (256K)
+- **3.8B params** — bigger than SmolLM2-1.7B but the small vocab keeps lm_head fast
+- the tester estimated `~94 tok/s` (`60 tokens / 0.85 s`) before realizing the inference was producing garbage — that number reflects what the matmul kernels can do; only the attention path is broken
+
+If we get this working, Phi-3.5-mini becomes the new "best speed/quality" recommendation, ahead of SmolLM2-1.7B.
+
+## Current state
+
+`tq_load_gguf` (in `quant.h`, lines 11640-11680) looks for these tensor names per layer:
+
+```
+blk.N.attn_q.weight    ← required to mark layer as self_attn
+blk.N.attn_k.weight
+blk.N.attn_v.weight
+blk.N.attn_output.weight
+```
+
+When loading a Phi-3 GGUF, none of these exist — Phi-3 ships fused QKV. Phi-3's tensors (in llama.cpp's GGUF naming convention) are:
+
+```
+blk.N.attn_qkv.weight    ← shape [3 * hidden_dim, hidden_dim], fused
+blk.N.attn_output.weight
+blk.N.ffn_up.weight      ← may also be fused as ffn_up_gate, depending on converter
+blk.N.ffn_down.weight
+```
+
+Result: `is_attn_layer = 0` for every layer, so `n_attn_layers = 0`. The new hard-fail check in P0-B catches this and returns NULL with a clear error. No more garbage tokens, but no working inference either.
+
+## Two implementation strategies
+
+### Option A — Loader splits at load time
+
+After detecting `attn_qkv`, dequantize the fused tensor, slice along the output dimension into three `[hidden_dim, hidden_dim]` views, re-quantize each as a separate Q4_K (or whichever type the GGUF used), and store them in `gguf_wq`/`gguf_wk`/`gguf_wv`.
+
+**Pros**: zero forward-path changes, drops into existing `tq_matmul_gguf` calls.
+**Cons**:
+1. Doubles RAM during load (need both fused + split versions)
+2. Re-quantization is **lossy** — running the original model through Q4_K → FP32 → Q4_K introduces measurable error
+3. Won't work for tensor types we don't have a quantizer for (we'd need a quantizer for every supported GGUF type)
+4. Slow at load
+
+### Option B — Forward path dispatches fused matmul (RECOMMENDED)
+
+Add a new field `gguf_wqkv` (data + type) to `tq_layer_weights_t`. Loader sets it from `blk.N.attn_qkv.weight` directly. Forward path checks: if `gguf_wqkv` is set, do one big matmul into a temp buffer of size `3 * hidden_dim`, then split into the existing `s->q`, `s->k`, `s->v` outputs.
+
+**Pros**:
+1. No re-quantization, no precision loss
+2. No extra load-time work
+3. Works with any GGUF type we already support in `tq_matmul_gguf`
+4. Single big matmul is faster than 3 smaller ones (better cache reuse)
+
+**Cons**:
+1. Need a temp buffer for the fused output
+2. New branch in the forward path (small)
+3. Need to pass `q_dim`, `k_dim`, `v_dim` so the split knows where K starts and V starts (Phi-3 may not use GQA, but we can't assume)
+
+`tq_matmul_gguf` already accepts `(weight, type, out_dim, in_dim)` — it doesn't care whether the underlying tensor is fused or not. We can call it once with `out_dim = q_dim + k_dim + v_dim`.
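The Option B dispatch can be sketched in a few lines. This is an illustration in plain Python, not the C forward path: `matvec` is a dense stand-in for `tq_matmul_gguf`, and `fused_qkv_forward` is a hypothetical name. The toy dimensions deliberately use unequal `q_dim` and `k_dim` to show why the split offsets are passed explicitly.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def matvec(weight: Sequence[Sequence[float]], x: Sequence[float]) -> Vector:
    """Dense stand-in for a matmul: weight has out_dim rows of in_dim columns."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def fused_qkv_forward(
    w_qkv: Sequence[Sequence[float]],
    x: Sequence[float],
    q_dim: int,
    k_dim: int,
    v_dim: int,
) -> Tuple[Vector, Vector, Vector]:
    """One matmul with out_dim = q_dim + k_dim + v_dim, then slice [Q | K | V]."""
    if len(w_qkv) != q_dim + k_dim + v_dim:
        raise ValueError("fused weight rows must equal q_dim + k_dim + v_dim")
    tmp = matvec(w_qkv, x)            # temp buffer holding the fused output
    q = tmp[:q_dim]                   # Q occupies the first q_dim outputs
    k = tmp[q_dim:q_dim + k_dim]      # K starts where Q ends
    v = tmp[q_dim + k_dim:]           # V takes the remainder
    return q, k, v


# Toy weights: hidden size 2, with a GQA-like shape (q_dim=2, k_dim=1, v_dim=1).
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0, 2.0]]
w_v = [[-1.0, 3.0]]
w_qkv = w_q + w_k + w_v               # rows concatenated along the output axis
x = [0.5, -1.5]

q, k, v = fused_qkv_forward(w_qkv, x, q_dim=2, k_dim=1, v_dim=1)
print(q, k, v)
```

Because each output row depends only on its own weight row, the sliced results match three separate matmuls exactly. The same pattern would apply to a fused gate+up FFN tensor.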
+
+## Inspection results (2026-04-12)
+
+Used `tools/gguf_inspect.c` against `bartowski/Phi-3.5-mini-instruct-Q4_K_M.gguf` (2.39 GB). Findings:
+
+### Per-layer tensors (32 layers, 6 tensors each)
+
+```
+blk.N.attn_norm.weight    F32   [3072]
+blk.N.attn_qkv.weight     Q5_K  [3072, 9216]    ← FUSED QKV (3 * 3072)
+blk.N.attn_output.weight  Q4_K  [3072, 3072]
+blk.N.ffn_norm.weight     F32   [3072]
+blk.N.ffn_up.weight       Q4_K  [3072, 16384]   ← FUSED gate+up (2 * 8192)
+blk.N.ffn_down.weight     Q6_K  [8192, 3072]
+```
+
+### Global tensors
+
+```
+token_embd.weight              Q4_K  [3072, 32064]
+output.weight                  Q6_K  [3072, 32064]
+output_norm.weight             F32   [3072]
+rope_factors_long.weight       F32   [48]      ← LongRoPE
+rope_factors_short.weight      F32   [48]      ← LongRoPE
+```
+
+### Metadata
+
+- arch: `phi3`
+- embedding_length: 3072 (hidden_dim)
+- block_count: 32
+- head_count: 32
+- head_count_kv: 32 (NO GQA)
+- rope.dimension_count: 96 (head_dim = embedding_length / head_count = 3072 / 32)
+- rope.freq_base: 10000
+- rope.scaling.original_context_length: 4096 (LongRoPE switch point)
+- rope.scaling.attn_factor: 1.19024 (Q/K magnitude scaling for long context)
+- context_length: 131072
+- feed_forward_length: 8192
+- vocab_size: 32064
+- bos_token_id: 1, eos_token_id: 32000
+
+### Conclusions
+
+1. **Fused QKV** confirmed. Layout `[Q | K | V]` along the output axis. Each section spans `hidden_dim = 3072` output elements, for a total of `9216 = 3 * 3072`.
+2. **Fused FFN** ALSO confirmed. `ffn_up.weight` is `[hidden, 2*ff]`, not `[hidden, ff]`. The order of the two halves is still to be confirmed by validation, but llama.cpp's reference loads them as `[gate, up]` chunks of this single tensor.
+3. **LongRoPE present**: separate `rope_factors_short` and `rope_factors_long` tables of size 48 = head_dim/2. Used to rescale per-frequency RoPE rotations for sequences past the 4096-token original context.
+4. **No special tokens for ChatML**. Phi-3 uses `<|user|>`, `<|assistant|>`, `<|end|>` (text strings, not BPE special tokens). Chat template differs from Llama-3 / ChatML.
+5. **Vocab 32K** confirms the speed advantage — `lm_head` matmul is `3072 × 32064` vs Llama-3.2-1B's `2048 × 128256`. About 2.7× smaller per-token cost.
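The 2.7× figure in conclusion 5 can be checked with a short calculation. It counts weights touched per token by the lm_head matmul, using the dimensions quoted above; the helper name `lm_head_weights` is hypothetical, and weight count is only a rough proxy for latency since the two models may use different quantization types.

```python
def lm_head_weights(hidden_dim: int, vocab_size: int) -> int:
    """Weights read by one lm_head matmul (one logit per vocab entry)."""
    return hidden_dim * vocab_size


phi35 = lm_head_weights(3072, 32064)        # Phi-3.5-mini
llama32_1b = lm_head_weights(2048, 128256)  # Llama-3.2-1B
ratio = llama32_1b / phi35

print(f"Phi-3.5-mini lm_head weights: {phi35:,}")
print(f"Llama-3.2-1B lm_head weights: {llama32_1b:,}")
print(f"ratio: {ratio:.2f}x")
```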
+
+## What's still unknown (resolved by trial)
+
+I need a real Phi-3 GGUF to verify:
+
+1. **Exact tensor names**. llama.cpp's GGUF converter has changed conventions over the years. The fused tensor might be named:
+   - `blk.N.attn_qkv.weight`
+   - `blk.N.attn_qkv_proj.weight`
+   - `blk.N.qkv.weight`
+   - …and there may be a separate bias tensor
+
+2. **Shape ordering**. Is the fused tensor `[Q | K | V]` along axis 0, or some other layout? Phi-3 has `n_heads = 32` and `n_kv_heads = 32` (no GQA in the 3.8B variant), so all three sub-tensors are the same size — but I want to verify.
+
+3. **FFN fusion**. Does this Phi-3 GGUF use `ffn_up` + `ffn_gate` as separate tensors (llama-style) or `ffn_up_gate` (Phi-style fused)? If the latter, we have a second fused-tensor problem to solve in the same PR.
+
+4. **RoPE config**. Phi-3 long-context variants use LongRoPE with two scaling factors (`short_factor`, `long_factor`). Phi-3-mini's 4K context might use vanilla RoPE — but Phi-3.5-mini's 128K context definitely uses LongRoPE. We'd need to read these from GGUF metadata and add them to `tq_rope`.
+
+5. **Sliding window**. Phi-3 uses `n_block_sparse_window` (varies by layer in some variants). Whether the `mini` variant uses it is unclear.
+
+6. **Special tokens**. Phi-3 uses `<|user|>`, `<|assistant|>`, `<|end|>` instead of ChatML — the chat template needs to know.
+
+## Estimated effort once we have a GGUF
+
+| Step | Effort |
+|---|---|
+| Tensor name detection (`attn_qkv` + variants) | XS — 20 lines |
+| `gguf_wqkv` field + forward dispatch | S — 60 lines |
+| `ffn_up_gate` if needed | S — 40 lines |
+| LongRoPE if Phi-3.5-mini | M — 100-150 lines, needs careful validation |
+| Sliding window detection | S — 30 lines (we have the infrastructure for Gemma) |
+| Phi-3 chat template in `cli.py` | XS — 10 lines |
+| Validation: load + 100 tokens + manual quality check | M — needs the GGUF |
+
+**Total**: maybe 300-400 lines of focused code. Most of it is mechanical once we know the exact names.
+
+## Recommendation
+
+**Option B**, but only after one of:
+
+1. **Tester provides** the exact Phi-3.5-mini-instruct-Q8 GGUF they used. Best path — same file the user already has running.
+2. **Tester runs** a small inspector script we provide that dumps tensor names + shapes from their GGUF, so we can validate our assumptions without shipping the file.
+3. **We pick** a specific bartowski Phi-3.5-mini Q4_K_M variant ourselves, download it, dump tensor names, and proceed. This is the slowest path because the failure modes (LongRoPE, sliding window) are subtle and easy to miss without ground-truth output to compare.
+
+Until then: do NOT implement. The hard-fail in P0-B is the right transition state — users see a clear error and know to wait, instead of debugging garbage.
+
+## Open questions for the human
+
+1. Do we have access to the same Phi-3.5-mini GGUF the tester used? (`Phi-3.5-mini-instruct-Q8_0.gguf`, 3.9 GB)
+2. If not, are we OK downloading one and using it as the reference? Storage / bandwidth?
+3. Should I write the GGUF inspector script (path 2) so the tester can run it for us?
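For open question 3, a tester-side inspector could be as small as the sketch below. It assumes the GGUF v2/v3 little-endian layout as I understand it (magic, version, 64-bit tensor and KV counts, metadata KVs, then tensor infos), skips all metadata, and lists tensor names, shapes, and ggml type ids. The function names are hypothetical, and this is not the `tools/gguf_inspect.c` used for the inspection above. The demo parses a tiny in-memory file rather than a real model.

```python
import io
import struct
from typing import BinaryIO, List, Tuple

# Fixed-width scalar value types in GGUF metadata; 8 is string, 9 is array.
_SCALAR_FMT = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
               6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}


def _read(f: BinaryIO, fmt: str):
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("truncated GGUF file")
    return struct.unpack(fmt, data)[0]


def _read_string(f: BinaryIO) -> str:
    length = _read(f, "<Q")               # strings are uint64 length + bytes
    data = f.read(length)
    if len(data) != length:
        raise EOFError("truncated GGUF string")
    return data.decode("utf-8", errors="replace")


def _skip_value(f: BinaryIO, vtype: int) -> None:
    if vtype in _SCALAR_FMT:
        _read(f, _SCALAR_FMT[vtype])
    elif vtype == 8:
        _read_string(f)
    elif vtype == 9:                      # array: element type, count, elements
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        for _ in range(count):
            _skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def list_tensors(f: BinaryIO) -> List[Tuple[str, List[int], int]]:
    """Return (name, dims, ggml_type_id) for every tensor in a GGUF stream."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError("GGUF v1 is not handled in this sketch")
    n_tensors = _read(f, "<Q")
    n_kv = _read(f, "<Q")
    for _ in range(n_kv):
        _read_string(f)                   # metadata key, ignored here
        _skip_value(f, _read(f, "<I"))
    tensors = []
    for _ in range(n_tensors):
        name = _read_string(f)
        n_dims = _read(f, "<I")
        dims = [_read(f, "<Q") for _ in range(n_dims)]
        ggml_type = _read(f, "<I")
        _read(f, "<Q")                    # data offset, unused
        tensors.append((name, dims, ggml_type))
    return tensors


def _gguf_str(text: str) -> bytes:
    raw = text.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


# Synthetic one-KV, one-tensor file so the sketch can run without a model.
sample = (
    b"GGUF" + struct.pack("<IQQ", 3, 1, 1)
    + _gguf_str("general.architecture") + struct.pack("<I", 8) + _gguf_str("phi3")
    + _gguf_str("blk.0.attn_qkv.weight") + struct.pack("<I", 2)
    + struct.pack("<QQ", 3072, 9216) + struct.pack("<IQ", 13, 0)
)
tensors = list_tensors(io.BytesIO(sample))
for name, dims, ggml_type in tensors:
    print(name, dims, ggml_type)
```

Against a real file the tester would call `list_tensors(open(path, "rb"))` and send back the printed lines, which avoids shipping the GGUF itself.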
diff --git a/docs/supported_models.md b/docs/supported_models.md
@@ -8,7 +8,8 @@ tracks what works, what loads-but-fails, and how to pick a model.
 
 | Use case | Model | Why |
 |---|---|---|
-| **First-time install** | `SmolLM2-1.7B` (Q8) | Fastest end-to-end on a laptop. Vocab 49K keeps the lm_head matmul small (~12 tok/s on Apple M3). |
+| **Best speed + quality** | `Phi-3.5-mini` (Q4_K_M) | 3.8B params with a 32K vocab, the smallest in the registry, which keeps the lm_head matmul small. Coherent multi-paragraph output. |
+| **Lightweight all-rounder** | `SmolLM2-1.7B` (Q8) | Fastest small model on a laptop. Vocab 49K keeps the lm_head matmul small (~12 tok/s on Apple M3). |
 | Smaller download | `Llama-3.2-1B` (Q4_K_M) | 750 MB vs 1.7 GB, but ~5x slower at inference time due to 128K vocab. |
 | Quick smoke test | `SmolLM2-135M` (Q8) | 138 MB download to verify the install path. Output quality is poor — not for real use. |
 
@@ -32,12 +33,12 @@ print(m.ask("What is gravity?"))
 |---|:---:|:---:|:---:|:---:|---|
 | **llama** (SmolLM2, Llama-3.x, Mistral) | ✅ | ✅ | ✅ | ✅ | **Fully supported** |
 | llama with 128K vocab (Llama-3.2-1B) | ✅ | ✅ | ✅ | slow | Supported, vocab is the bottleneck |
+| **phi3** / **phi3.5** (fused QKV + LongRoPE) | ✅ | ✅ | ✅ | ✅ | **Fully supported** (since 2026-04-12) |
 | **gemma** (Gemma 2) | ✅ | ✅ | ✅ | ✅ | Supported |
 | **gemma3** | ✅ | ✅ | ✅ | ✅ | Supported with hybrid sliding-window attention |
 | **gemma4** (Gemma-4-E2B / E4B) | ✅ | ✅ | ⚠️ | ⚠️ | Partial — some Q4_K_M variants produce garbage; report with file SHA256 |
 | **qwen** / **qwen2** | ✅ | ✅ | ✅ | ✅ | Supported |
 | **qwen3.5** (DeltaNet hybrid) | ✅ | ✅ | partial | ⚠️ | Partial — pure-attention layers work, DeltaNet hybrid still being validated |
-| **phi3** / **phi3.5** (fused QKV) | ❌ | — | — | — | **Not supported** — uses `attn_qkv`, see "Why phi3 is hard" below |
 
 ✅ = works · ⚠️ = loads but inference is unreliable · ❌ = load fails fast with a clear error (since 2026-04-12)
 
@@ -78,31 +79,38 @@ benchmarks on Apple M3 (8-core CPU, 16 GB RAM):
 vocab size is a better predictor of interactive latency than parameter
 count. Pick the smallest vocab that produces output you're happy with.
 
-## Why phi3 is hard
+## How Phi-3 support works
 
-Phi-3 / Phi-3.5 uses a *fused* QKV projection: instead of three separate
-tensors `attn_q.weight`, `attn_k.weight`, `attn_v.weight`, it ships one
-`attn_qkv.weight` with all three projections concatenated along the
-output dimension.
+Phi-3 / Phi-3.5 uses fused weight tensors instead of llama-style separate ones:
 
-quant.cpp's GGUF loader currently looks for the three-tensor layout
-(`blk.N.attn_q.weight` etc.). When it loads a Phi-3 GGUF, none of those
-names match → 0 self_attn layers detected → forward pass runs against
-zero-initialized attention weights → garbage tokens.
-
-Adding Phi-3 support requires either:
-
-1. **Loader splits** `attn_qkv.weight` into the three views at load time
-   and writes them into the existing `wq`/`wk`/`wv` slots, OR
-2. **Forward path** learns to dispatch a fused QKV matmul when the
-   loader detects the fused tensor.
-
-Option (1) is simpler but doubles the working set during load. Option
-(2) is the right long-term answer. There's a tracking issue / spike in
-progress; until then Phi-3 is the highest-value missing architecture for
-quant.cpp's "speed + quality" target (Phi-3.5-mini has vocab 32K plus
-3.8B params — it would beat both SmolLM2-1.7B and Llama-3.2-1B at
-interactive use).
+| Tensor | Shape | What's inside |
+|---|---|---|
+| `blk.N.attn_qkv.weight` | `[hidden, 3*hidden]` | Q ‖ K ‖ V along the output axis |
+| `blk.N.ffn_up.weight` | `[hidden, 2*ff]` | gate ‖ up along the output axis |
+
+The loader detects these by name and stores the raw quantized pointers
+in new fields (`gguf_w_qkv`, `gguf_w_up_gate`). For each one, the
+forward path dispatches a single matmul into a temp buffer, then uses
+`memcpy` to split the result into the existing per-section state buffers.
+
+Phi-3 also uses **LongRoPE** with two per-frequency-pair rescaling
+tables (`rope_factors_short`, `rope_factors_long`) and a separate
+attention magnitude factor (`rope.scaling.attn_factor`). These extend
+RoPE rotation from the original 4096-token training context out to
+131K. The forward path picks the short or long table based on
+position, applies the rescaled rotation in **NeoX-style** layout (pairs
+are `(q[i], q[i+half])`, not `(q[2i], q[2i+1])`), and multiplies Q by
+`attn_factor` only when `pos >= original_context_length`.
+
+Why NeoX-style for Phi-3 specifically: llama.cpp's GGUF converter
+pre-permutes separate `attn_q/k/v` tensors so the standard interleaved
+RoPE works for Llama-family models. The fused `attn_qkv` tensor is NOT
+permuted, so we have to apply rotation in its native NeoX form.
+
+Phi-3.5-mini at the recommended Q4_K_M quantization pairs a **~32K
+vocab with 3.8B params**. The 32K vocab is the smallest in the
+registry, which keeps the per-token lm_head matmul small and makes
+this the best speed/quality combo quant.cpp ships.
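The LongRoPE steps described above can be sketched for a single head as follows. This is an illustration in Python, not the C forward path. Two details are assumptions rather than facts from this document: the per-pair factor divides the rotation angle (as I understand llama.cpp's reference does), and the long table is selected with the same `pos >= original_context_length` test used for `attn_factor`. Function names are hypothetical.

```python
import math
from typing import List, Sequence


def longrope_neox(
    vec: Sequence[float],
    pos: int,
    short_factors: Sequence[float],
    long_factors: Sequence[float],
    original_context_length: int = 4096,
    freq_base: float = 10000.0,
) -> List[float]:
    """Rotate one head vector in NeoX layout: pairs are (v[i], v[i + half])."""
    head_dim = len(vec)
    half = head_dim // 2
    if len(short_factors) != half or len(long_factors) != half:
        raise ValueError("factor tables must have head_dim / 2 entries")
    # Assumption: table choice uses the same threshold as attn_factor.
    factors = long_factors if pos >= original_context_length else short_factors
    out = list(vec)
    for i in range(half):
        # Assumption: the factor divides the standard RoPE angle.
        theta = pos * freq_base ** (-2.0 * i / head_dim) / factors[i]
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]      # NeoX pair, not (vec[2i], vec[2i+1])
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def scale_q_for_long_context(
    q: Sequence[float],
    pos: int,
    attn_factor: float,
    original_context_length: int = 4096,
) -> List[float]:
    """Multiply Q by attn_factor only past the original training context."""
    if pos >= original_context_length:
        return [x * attn_factor for x in q]
    return list(q)


# Toy head with head_dim = 4, so each factor table has 2 entries.
q = [1.0, 0.0, 0.0, 0.0]
short, long_ = [1.0, 1.0], [2.0, 2.0]
print(longrope_neox(q, 1, short, long_))
print(scale_q_for_long_context(q, 4096, 1.19024))
```

For Phi-3.5-mini the tables would have 48 entries each (`head_dim / 2` with `head_dim = 96`), matching the `rope_factors_short` and `rope_factors_long` tensors.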
 
 ## Reporting an unsupported model