
feat(phi3): end-to-end Phi-3 / Phi-3.5 architecture support#65

Merged
unamedkr merged 1 commit into main from feat/phi3-support
Apr 12, 2026

Conversation

@unamedkr
Collaborator

Summary

Adds full inference support for Phi-3 family GGUFs, validated end-to-end against bartowski/Phi-3.5-mini-instruct-Q4_K_M.gguf. Phi-3.5-mini is now the registry's recommended "best speed + quality" model: its 32K vocab and 3.8B params make the lm_head matmul the fastest of any model we ship.

Validation

Coherent multi-paragraph output on plain prompts AND chat template:

```
$ phi3_infer_test ~/.cache/quantcpp/Phi-3.5-mini-instruct-Q4_K_M.gguf
The capital of France is Paris.
The Eiffel Tower, located in the city center, stands as a symbolic
landmark for both locals and tourists alike. The tower's iconic
silhouette is visible from various points around the city, offering
a panoramic view of Parisian life and its vibrant culture.
The Seine River meanders through this historic metropolis...
```

```
$ phi3_infer_test ... "What is 2+2?"
<|user|>
What is 2+2?<|end|>
<|assistant|>
The sum of 2 + 2 equals to four.
```

Regression checks:

  • ctest --test-dir build → 35/35 passed
  • Llama-3.2-1B end-to-end → still coherent
  • SmolLM2-135M end-to-end → still coherent
  • Full build clean (no new warnings)

Why Phi-3 is hard

Phi-3 ships fused weight tensors instead of llama-style separate ones, plus a long-context RoPE variant:

| Tensor | Shape | What's inside |
| --- | --- | --- |
| `blk.N.attn_qkv.weight` | `[hidden, 3*hidden]` | Q ‖ K ‖ V along the output axis |
| `blk.N.ffn_up.weight` | `[hidden, 2*ff]` | gate ‖ up along the output axis |
| `rope_factors_short` | `[head_dim/2]` | LongRoPE per-pair rescaling |
| `rope_factors_long` | `[head_dim/2]` | LongRoPE per-pair rescaling |

Before this PR: load reported `0 self_attn` and the forward pass ran against zero-initialized attention weights → garbage tokens.

Implementation

Loader (`tq_load_gguf`)

  • Detects `attn_qkv.weight` → `gguf_w_qkv` field, marks layer as attention
  • Detects `ffn_up.weight` with `shape[1] == 2 * intermediate_dim` AND no separate `ffn_gate.weight` → `gguf_w_up_gate`
  • Reads `phi3.rope.scaling.original_context_length` and `attn_factor`
  • Locates `rope_factors_short`/`rope_factors_long` global tensors
  • The hard-fail path from #59 (feat(feedback): Quick Wins from the 2026-04-12 external user report) now correctly identifies Phi-3 as supported (32 attn layers detected)

Forward (`self_attn_forward` + `tq_forward` dispatcher)

  • New `if (layer->gguf_w_qkv)` branch: one matmul into a temp buffer of size `q_dim + 2*kv_dim`, memcpy splits into `s->q/s->k/s->v`
  • New `if (layer->gguf_w_up_gate)` FFN branch: one matmul of size `2*inter`, memcpy second half to `s->hb2`. Layout is `[gate | up]` (HuggingFace convention)
  • The dispatcher in `tq_forward`'s layer loop ALSO calls `self_attn_forward` when `gguf_w_qkv` is set — without this the new branches were unreachable
  • Same for FFN dispatcher: accept `gguf_w_up_gate` as a valid gate/up source
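
The fused-QKV branch above boils down to one matmul into a temp buffer followed by three memcpys. A minimal sketch of the split step in C (the function name and the flat-float layout are illustrative, not quant.cpp's actual API):

```c
#include <string.h>

/* Split one fused QKV output vector of size q_dim + 2*kv_dim into
 * separate q/k/v buffers, in the Q ‖ K ‖ V order described above. */
static void split_fused_qkv(const float *qkv, float *q, float *k, float *v,
                            int q_dim, int kv_dim)
{
    memcpy(q, qkv,                  (size_t)q_dim  * sizeof(float));
    memcpy(k, qkv + q_dim,          (size_t)kv_dim * sizeof(float));
    memcpy(v, qkv + q_dim + kv_dim, (size_t)kv_dim * sizeof(float));
}
```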

LongRoPE (NeoX-style)

  • New branch in full RoPE: `factor[i] = (pos < orig_ctx_len) ? short[i] : long[i]`, `freq[i] = 1 / (rope_base^(2i/head_dim) * factor[i])`
  • Uses NeoX-style pair layout `(q[i], q[i + half])` rather than the interleaved `(q[2i], q[2i+1])` that quant.cpp's existing `tq_rope` uses for Llama
  • Why NeoX: llama.cpp's GGUF converter pre-permutes separate Q/K weights so interleaved RoPE works for Llama-family models. The fused `attn_qkv` is NOT permuted, so we apply rotation in its native NeoX form.
  • `rope_attn_factor` multiplied into Q only when `pos >= orig_ctx_len`

State allocation

  • `max_dim` bumped to cover `q_dim + 2*kv_dim` (fused QKV temp buffer reuses `s->xb2`)
  • `s->hb` bumped to `2 * inter` when fused gate||up is detected

Tokenizer BOS

  • `tq_encode` adds `<s>` to the BOS lookup chain alongside `<bos>` and `<|im_start|>`
  • `quant_generate` enables `add_bos=1` when vocab has `<s>`. Phi-3 specifically degrades into garbage without it; Llama-3 behavior is unchanged because Llama-3 uses `<|begin_of_text|>`.

Debugging journey (TIL)

  1. First implementation: garbage output. Loader correct, forward path wasn't being hit.
  2. The dispatcher's `if (layer->gguf_wq)` check guarded the call to `self_attn_forward`. Phi-3 has `gguf_wq == NULL`. Fixed.
  3. Still garbage. Forward branch firing but output is fragmented English.
  4. Tried NeoX vs GPT-J RoPE — both garbage but different.
  5. Verified Q5_K dequant + matmul produces bit-perfect output against an FP32 reference.
  6. Verified tokenizer produces correct token IDs (`The capital of France is` → 5 correct tokens).
  7. Tried `[gate | up]` vs `[up | gate]` for the fused FFN. `[gate | up]` was the answer. Output went from "Block...attacker...non...meteor..." to "Paris. The Eiffel Tower..." instantly.

The bug was that I had the FFN split order backwards. HuggingFace's `gate, up = gate_up_proj(x).chunk(2, dim=-1)` produces `[gate, up]` along the output axis, which is what the GGUF preserves. I had assumed `[up, gate]` based on a half-remembered llama.cpp comment.
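
The winning split order can be sketched in C (illustrative names, flat-float layout): the first half of the fused output is gate, the second half is up, matching HuggingFace's `chunk(2, dim=-1)`.

```c
#include <string.h>

/* Split a fused gate||up FFN output of size 2*inter, in the
 * [gate | up] order that fixed the garbage output. */
static void split_fused_gate_up(const float *fused, float *gate, float *up,
                                int inter)
{
    memcpy(gate, fused,         (size_t)inter * sizeof(float)); /* first half  */
    memcpy(up,   fused + inter, (size_t)inter * sizeof(float)); /* second half */
}
```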

New tools

  • `tools/gguf_inspect.c` — dumps tensor names, shapes, types, and metadata from any GGUF. Used to verify Phi-3.5's layout before writing loader code. General-purpose, kept for future architecture work.
  • `tools/phi3_infer_test.c` — minimal end-to-end inference test. Doubles as a smoke test for any future Phi-3 changes.

Registry / docs updates

  • `Phi-3.5-mini` added to `_MODEL_REGISTRY` with bartowski's Q4_K_M variant (~2.4 GB)
  • New CLI aliases: `phi3.5`, `phi3.5:mini`, `phi-3.5`, `phi-3.5-mini`
  • `docs/supported_models.md` updated: phi3 now ✅ Fully supported. Recommended-models table puts Phi-3.5-mini at the top.
  • `docs/spikes/2026-04-12_phi3_support.md` updated with the inspection findings + conclusions that drove implementation.

Test plan

  • Unit tests pass (35/35)
  • Phi-3.5-mini end-to-end inference produces coherent paragraphs
  • Phi-3 chat template works ("What is 2+2?" → "...equals to four")
  • Llama-3.2-1B regression — still coherent
  • SmolLM2-135M regression — still coherent
  • Manual: `quantcpp run phi-3.5-mini` after wheel build
  • Manual: long-context test (>4096 tokens) to exercise the LongRoPE `pos >= orig_ctx_len` branch

🤖 Generated with Claude Code

Adds full inference support for Phi-3 family GGUFs (validated against
bartowski's `Phi-3.5-mini-instruct-Q4_K_M.gguf`). Output is coherent
multi-paragraph English. Phi-3.5-mini becomes the new "best speed +
quality" recommendation in the model registry.

## Why Phi-3 is hard

Phi-3 ships fused weight tensors instead of llama-style separate ones,
plus a long-context RoPE variant:

  blk.N.attn_qkv.weight   shape [hidden, 3*hidden]   Q ‖ K ‖ V
  blk.N.ffn_up.weight     shape [hidden, 2*ff]       gate ‖ up
  rope_factors_short      [head_dim/2]               LongRoPE
  rope_factors_long       [head_dim/2]               LongRoPE
  + rope.scaling.attn_factor                          long-context Q scaling

Before this PR: load reported `0 self_attn` and the forward pass ran
against zero-initialized attention weights → garbage tokens.

## What this PR adds

### Loader (`tq_load_gguf`)

- Detects `blk.N.attn_qkv.weight` and stores its raw quantized pointer
  in a new field `gguf_w_qkv` (+ type). Marks the layer as attention.
- Detects `blk.N.ffn_up.weight` with `shape[1] == 2 * intermediate_dim`
  AND no separate `ffn_gate.weight` → fused gate||up. Stores in
  `gguf_w_up_gate`.
- Reads `phi3.rope.scaling.original_context_length` and
  `phi3.rope.scaling.attn_factor` via the existing arch-prefix macro.
- Locates `rope_factors_short.weight` / `rope_factors_long.weight` as
  global tensors and stores raw F32 pointers in the model config.
- The hard-fail path from the previous PR now correctly identifies
  Phi-3 as a *supported* architecture (n_attn_layers == 32, not 0).

### Forward (`self_attn_forward` + tq_forward dispatcher)

- New `if (layer->gguf_w_qkv)` branch in self_attn_forward: one
  `tq_matmul_gguf` call into a temp buffer of size `q_dim + 2*kv_dim`,
  then memcpy splits into `s->q`, `s->k`, `s->v`. The K and V projection
  blocks below are skipped when fused.
- New `if (layer->gguf_w_up_gate)` branch in the FFN section: one
  matmul of size `2*inter` into `s->hb`, memcpy second half to `s->hb2`.
  Layout is `[gate | up]` (HuggingFace convention).
- The dispatcher (`tq_forward` layer loop) now also calls
  `self_attn_forward` when `layer->gguf_w_qkv != NULL` — without this
  the new branch was unreachable because the existing condition checks
  for `gguf_wq` separately.
- Same for the FFN dispatcher: accept `gguf_w_up_gate` as a valid
  gate/up source.

### LongRoPE

- New branch in the full RoPE path: when `rope_factors_short` or
  `rope_factors_long` is set, apply per-frequency-pair rescaling using
  `factor[i] = (pos < orig_ctx_len) ? short[i] : long[i]` and
  `freq[i] = 1 / (rope_base^(2i/head_dim) * factor[i])`.
- Uses **NeoX-style** pair layout `(q[i], q[i + half])` rather than
  the interleaved `(q[2i], q[2i+1])` that quant.cpp's `tq_rope` uses
  for Llama. The reason: llama.cpp's GGUF converter pre-permutes
  separate Q/K weights so interleaved RoPE produces equivalent
  results, but the *fused* `attn_qkv` tensor is NOT permuted.
- `rope_attn_factor` is multiplied into Q only when
  `pos >= orig_ctx_len` — no scaling at short context.

### State allocation (`tq_create_state_ex`)

- Bumps `max_dim` to cover `q_dim + 2*kv_dim` (the fused QKV temp
  buffer reuses `s->xb2`) when `has_fused_qkv` is set.
- Bumps the `s->hb` allocation to `2 * inter` (fused gate||up output)
  when `has_fused_up_gate` is set. `s->hb2` stays at `inter`.

### Tokenizer BOS

- `tq_encode` adds Phi-3 / Llama's `<s>` to its BOS lookup chain
  alongside `<bos>` (Gemma) and `<|im_start|>` (Qwen).
- `quant_generate` also enables `add_bos=1` when the vocab has `<s>`
  — Phi-3 specifically degrades into garbage without it. Existing
  Llama-3 behavior is unchanged because Llama-3 uses `<|begin_of_text|>`
  which the lookup chain also handles.
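
A first-match lookup chain over those candidates can be sketched as follows (function name and candidate order are illustrative; the text above names the tokens, not their precedence):

```c
#include <string.h>
#include <stddef.h>

/* Return the first BOS candidate present anywhere in the vocab,
 * or NULL when none of the known BOS tokens exist. */
static const char *pick_bos_token(const char **vocab, int n_vocab)
{
    static const char *candidates[] = {
        "<s>", "<bos>", "<|im_start|>", "<|begin_of_text|>"
    };
    for (size_t c = 0; c < sizeof(candidates) / sizeof(candidates[0]); c++)
        for (int t = 0; t < n_vocab; t++)
            if (strcmp(vocab[t], candidates[c]) == 0)
                return candidates[c];
    return NULL;
}
```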

## Validation

End-to-end inference test (`tools/phi3_infer_test.c`):

```
$ ./phi3_infer_test ~/.cache/quantcpp/Phi-3.5-mini-instruct-Q4_K_M.gguf
The capital of France is
 Paris.
The Eiffel Tower, located in the city center, stands as a symbolic
landmark for both locals and tourists alike. The tower's iconic
silhouette is visible from various points around the city, offering
a panoramic view of Parisian life and its vibrant culture.
The Seine River meanders through this historic metropolis...
```

Chat template:

```
$ ./phi3_infer_test ... "What is 2+2?"
<|user|>
What is 2+2?<|end|>
<|assistant|>
The sum of 2 + 2 equals to four.
```

Regression checks:
- ctest --test-dir build → 35/35 passed
- Llama-3.2-1B end-to-end → still coherent
- SmolLM2-135M end-to-end → still coherent
- Full build clean (no new warnings)

## Registry / docs

- `Phi-3.5-mini` added to `_MODEL_REGISTRY` with the bartowski Q4_K_M
  variant (~2.4 GB). Listed as "best speed + quality" in
  `docs/supported_models.md`.
- New aliases `phi3.5`, `phi3.5:mini`, `phi-3.5`, `phi-3.5-mini`.
- Architecture matrix updated: phi3 now ✅ Fully supported.
- Docs section "Why phi3 is hard" replaced with "How Phi-3 support
  works" — explains fused tensors + LongRoPE + NeoX rotation choice.
- Spike doc `docs/spikes/2026-04-12_phi3_support.md` updated with
  inspection findings and the conclusions that drove implementation.

## New tools

- `tools/gguf_inspect.c` — dump tensor names, shapes, types, and
  metadata from a GGUF file. Used to verify Phi-3.5's layout before
  writing loader code. General-purpose, kept for future architecture
  work.
- `tools/phi3_infer_test.c` — minimal end-to-end inference test.
  Doubles as a smoke test for any future Phi-3 changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit 1e1ea2c into main Apr 12, 2026
2 of 3 checks passed
@unamedkr unamedkr deleted the feat/phi3-support branch April 12, 2026 02:57
unamedkr added a commit that referenced this pull request Apr 12, 2026
Phi-3 architecture support landed in #65 and validated end-to-end as
the best speed/quality combo we ship (vocab 32K + 3.8B params makes
the lm_head matmul the fastest of any registered model). Promote it
to the default everywhere.

## Code

- `_MODEL_REGISTRY` reordered with Phi-3.5-mini first; comment block
  marks it as the default and explains the reasoning
- `cmd_chat_default` (no-subcommand chat) now picks Phi-3.5-mini
- Module docstring + `Model.from_pretrained` example use Phi-3.5-mini
- CLI `--help` epilog: examples lead with `phi-3.5-mini` and the
  backwards-compat block mentions `smollm2` / `llama3.2:1b` as
  alternatives instead

## Docs

- README.md: Quick Start renamed Phi-3.5-mini as the recommended
  default; CLI examples and Python `from_pretrained` example updated.
  Benchmark/perf sections still reference SmolLM2/Llama models because
  those are historical measurement data.
- README.ko.md: same changes mirrored in Korean.
- bindings/python/README.md (PyPI README): replaced "Basic question
  answering" with "Quick start (auto-download)" using `from_pretrained`.
  Added a multi-turn chat example using `m.chat()` + KV cache reuse,
  and an API reference entry for `Model.chat()` and
  `Model.from_pretrained()`.

## Verified

- ctest --test-dir build → 35/35 passed
- Full build clean (no new warnings)
- Phi-3.5-mini end-to-end inference test still produces coherent
  multi-paragraph output ("Name three planets..." → Earth, Mars,
  Jupiter with descriptions)
- `available_models()` returns Phi-3.5-mini in the list
- `MODEL_ALIASES['phi-3.5-mini']` and friends resolve correctly
- `cmd_chat_default` source confirms `args.model = "Phi-3.5-mini"`
- `quantcpp --help` epilog reflects the new defaults

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request Apr 12, 2026
…rces

Ports the Phi-3/Phi-3.5 architecture support from quant.h (PR #65)
to the split source files used by libturboquant and quant-server.

Changes:
- tq_model.c: fused attn_qkv detection, LongRoPE factor loading,
  fused gate||up FFN detection
- tq_transformer.c: fused QKV matmul + split, NeoX-style LongRoPE
  rotation, fused gate||up FFN path, expanded state allocation
- tq_generate.c: Phi-3 BOS token handling
- tq_tokenizer.c: <s> BOS lookup
- tq_server.c: Phi-3 chat template support
- tq_engine.h: new fields for fused weights and LongRoPE config
- cli.py: Phi-3.5 default model + alias updates

quant-server now detects Phi-3.5 correctly:
  loaded 32 layers (32 self_attn) + LongRoPE

Note: server crashes during inference (segfault in forward pass).
The fused QKV → split memcpy or LongRoPE computation likely has
a buffer size issue in the server path. Tracked in #67.

35/35 unit tests still pass.

Fixes #67 (partial — loader works, inference needs debugging)
Refs #69, #70

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request Apr 12, 2026
Closes #67 (Phi-3 not propagated to libturboquant).
Fixes #62 root cause for Phi-3 (Metal buffer overflow → garbage on 2nd request).
Addresses #61 GQA — confirmed working on CPU-only server for Llama-3.2-1B.

## Root cause

PR #65 added full Phi-3 support to `quant.h` (single-header) but the
split source files used by `libturboquant` and `quant-server` were
never updated. Models loaded correctly (32 self_attn layers) but the
forward pass ran through code paths that didn't know about fused
tensors → garbage output.

Additionally, the Metal GPU backend intercepted Q4_K matmul calls via
`tq_metal_matmul_gguf()` even for fused tensors. The Metal kernels
assume standard separate-tensor layouts and pre-allocated buffer sizes.
When called with fused FFN output dim (16384 vs expected 8192), the
GPU produced corrupted output → garbage even after CPU code was fixed.

## Changes

### tq_engine.h (struct definitions)
- Added `rope_orig_ctx_len`, `rope_attn_factor`, `rope_factors_short`,
  `rope_factors_long` to `tq_model_config_t`
- Added `has_fused_qkv`, `has_fused_up_gate` flags
- Added `gguf_w_qkv`, `gguf_w_up_gate` fields to `tq_layer_weights_t`

### tq_model.c (GGUF loader)
- Phi-3 fused QKV detection: `blk.N.attn_qkv.weight` → `gguf_w_qkv`
- Phi-3 fused FFN detection: `ffn_up` with 2× intermediate_dim → `gguf_w_up_gate`
- LongRoPE config + factor loading from GGUF metadata/tensors
- Hard-fail when `n_attn_layers == 0 && delta_n_heads == 0`
- Metal GPU init SKIPPED for fused-tensor models (CPU is faster anyway)

### tq_transformer.c (forward pass)
- Fused QKV branch: one matmul + memcpy split into s->q/s->k/s->v
- K and V projection blocks skipped when fused
- NeoX-style LongRoPE rotation with per-frequency rescaling
- Fused FFN gate||up: one matmul + memcpy split (layout: [gate | up])
- Layer dispatcher: `layer->gguf_w_qkv` triggers self_attn_forward
- FFN dispatcher: accepts `gguf_w_up_gate` as valid gate+up source
- State buffer sizing: max_dim covers fused QKV, hb covers 2×inter

### tq_generate.c (BOS token)
- add_bos=1 when vocab has `<s>` in first 8 entries (Phi-3 needs BOS)

### tq_tokenizer.c (BOS lookup chain)
- Added `<s>` and `<|begin_of_text|>` to the BOS token lookup chain

### tq_metal_dispatch.m (Metal safety)
- Added `tq_metal_disable()` function
- `tq_metal_available()` returns 0 when disabled
- Loader calls `tq_metal_disable()` for fused-tensor models
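
The kill-switch described above can be sketched with a single global flag (only the function names come from the commit text; the bodies are illustrative, and the real probe for a Metal device is elided):

```c
/* Metal disable flag: once set by the loader for fused-tensor models,
 * tq_metal_available() reports no GPU and all matmuls stay on the CPU. */
static int g_metal_disabled = 0;

void tq_metal_disable(void)
{
    g_metal_disabled = 1;
}

int tq_metal_available(void)
{
    if (g_metal_disabled)
        return 0;
    /* ... real device probe would go here; assume a device exists ... */
    return 1;
}
```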

## Verified

- ctest --test-dir build → 35/35 passed (Metal build)
- quant-server + Phi-3.5-mini (Metal build, Metal disabled) → coherent
  output: "hat can help with math problems..."
- quant-server + Phi-3.5-mini (CPU-only build) → coherent output:
  "capital city of France is Paris"
- quant-server + Llama-3.2-1B GQA (CPU build) → "Answer: 4"
- quant-server + SmolLM2-135M (Metal build, Metal active) → unchanged
- quant.h single-header path → unchanged (no regression)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>