feat(phi3): end-to-end Phi-3 / Phi-3.5 architecture support #65
Merged
Adds full inference support for Phi-3 family GGUFs (validated against bartowski's `Phi-3.5-mini-instruct-Q4_K_M.gguf`). Output is coherent multi-paragraph English. Phi-3.5-mini becomes the new "best speed + quality" recommendation in the model registry.

## Why Phi-3 is hard

Phi-3 ships fused weight tensors instead of llama-style separate ones, plus a long-context RoPE variant:

| Tensor | Shape | Contents |
|---|---|---|
| `blk.N.attn_qkv.weight` | `[hidden, 3*hidden]` | Q ‖ K ‖ V |
| `blk.N.ffn_up.weight` | `[hidden, 2*ff]` | gate ‖ up |
| `rope_factors_short` | `[head_dim/2]` | LongRoPE |
| `rope_factors_long` | `[head_dim/2]` | LongRoPE |
| `rope.scaling.attn_factor` | — | long-context Q scaling |

Before this PR: load reported `0 self_attn` and the forward pass ran against zero-initialized attention weights → garbage tokens.

## What this PR adds

### Loader (`tq_load_gguf`)

- Detects `blk.N.attn_qkv.weight` and stores its raw quantized pointer in a new field `gguf_w_qkv` (+ type). Marks the layer as attention.
- Detects `blk.N.ffn_up.weight` with `shape[1] == 2 * intermediate_dim` AND no separate `ffn_gate.weight` → fused gate||up. Stores in `gguf_w_up_gate`.
- Reads `phi3.rope.scaling.original_context_length` and `phi3.rope.scaling.attn_factor` via the existing arch-prefix macro.
- Locates `rope_factors_short.weight` / `rope_factors_long.weight` as global tensors and stores raw F32 pointers in the model config.
- The hard-fail path from the previous PR now correctly identifies Phi-3 as a *supported* architecture (n_attn_layers == 32, not 0).

### Forward (`self_attn_forward` + `tq_forward` dispatcher)

- New `if (layer->gguf_w_qkv)` branch in `self_attn_forward`: one `tq_matmul_gguf` call into a temp buffer of size `q_dim + 2*kv_dim`, then memcpy splits into `s->q`, `s->k`, `s->v`. The K and V projection blocks below are skipped when fused.
- New `if (layer->gguf_w_up_gate)` branch in the FFN section: one matmul of size `2*inter` into `s->hb`, memcpy second half to `s->hb2`. Layout is `[gate | up]` (HuggingFace convention).
- The dispatcher (`tq_forward` layer loop) now also calls `self_attn_forward` when `layer->gguf_w_qkv != NULL` — without this, the new branch was unreachable because the existing condition checks for `gguf_wq` separately.
- Same for the FFN dispatcher: accept `gguf_w_up_gate` as a valid gate/up source.

### LongRoPE

- New branch in the full RoPE path: when `rope_factors_short` or `rope_factors_long` is set, apply per-frequency-pair rescaling using `factor[i] = (pos < orig_ctx_len) ? short[i] : long[i]` and `freq[i] = 1 / (rope_base^(2i/head_dim) * factor[i])`.
- Uses **NeoX-style** pair layout `(q[i], q[i + half])` rather than the interleaved `(q[2i], q[2i+1])` that quant.cpp's `tq_rope` uses for Llama. The reason: llama.cpp's GGUF converter pre-permutes separate Q/K weights so interleaved RoPE produces equivalent results, but the *fused* `attn_qkv` tensor is NOT permuted.
- `rope_attn_factor` is multiplied into Q only when `pos >= orig_ctx_len` — no scaling at short context.

### State allocation (`tq_create_state_ex`)

- Bumps `max_dim` to cover `q_dim + 2*kv_dim` (the fused QKV temp buffer reuses `s->xb2`) when `has_fused_qkv` is set.
- Bumps the `s->hb` allocation to `2 * inter` (fused gate||up output) when `has_fused_up_gate` is set. `s->hb2` stays at `inter`.

### Tokenizer BOS

- `tq_encode` adds Phi-3 / Llama's `<s>` to its BOS lookup chain alongside `<bos>` (Gemma) and `<|im_start|>` (Qwen).
- `quant_generate` also enables `add_bos=1` when the vocab has `<s>` — Phi-3 specifically degrades into garbage without it. Existing Llama-3 behavior is unchanged because Llama-3 uses `<|begin_of_text|>`, which the lookup chain also handles.

## Validation

End-to-end inference test (`tools/phi3_infer_test.c`):

```
$ ./phi3_infer_test ~/.cache/quantcpp/Phi-3.5-mini-instruct-Q4_K_M.gguf
The capital of France is Paris. The Eiffel Tower, located in the city
center, stands as a symbolic landmark for both locals and tourists alike.
The tower's iconic silhouette is visible from various points around the
city, offering a panoramic view of Parisian life and its vibrant culture.
The Seine River meanders through this historic metropolis...
```

Chat template:

```
$ ./phi3_infer_test ... "What is 2+2?"
<|user|>
What is 2+2?<|end|>
<|assistant|>
The sum of 2 + 2 equals to four.
```

Regression checks:

- ctest --test-dir build → 35/35 passed
- Llama-3.2-1B end-to-end → still coherent
- SmolLM2-135M end-to-end → still coherent
- Full build clean (no new warnings)

## Registry / docs

- `Phi-3.5-mini` added to `_MODEL_REGISTRY` with the bartowski Q4_K_M variant (~2.4 GB). Listed as "best speed + quality" in `docs/supported_models.md`.
- New aliases `phi3.5`, `phi3.5:mini`, `phi-3.5`, `phi-3.5-mini`.
- Architecture matrix updated: phi3 is now ✅ Fully supported.
- Docs section "Why phi3 is hard" replaced with "How Phi-3 support works" — explains fused tensors + LongRoPE + the NeoX rotation choice.
- Spike doc `docs/spikes/2026-04-12_phi3_support.md` updated with inspection findings and the conclusions that drove implementation.

## New tools

- `tools/gguf_inspect.c` — dump tensor names, shapes, types, and metadata from a GGUF file. Used to verify Phi-3.5's layout before writing loader code. General-purpose, kept for future architecture work.
- `tools/phi3_infer_test.c` — minimal end-to-end inference test. Doubles as a smoke test for any future Phi-3 changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request — Apr 12, 2026
Phi-3 architecture support landed in #65 and was validated end-to-end as the best speed/quality combo we ship (vocab 32K + 3.8B params makes the lm_head matmul the fastest of any registered model). Promote it to the default everywhere.

## Code

- `_MODEL_REGISTRY` reordered with Phi-3.5-mini first; a comment block marks it as the default and explains the reasoning
- `cmd_chat_default` (no-subcommand chat) now picks Phi-3.5-mini
- Module docstring + `Model.from_pretrained` example use Phi-3.5-mini
- CLI `--help` epilog: examples lead with `phi-3.5-mini`, and the backwards-compat block mentions `smollm2` / `llama3.2:1b` as alternatives instead

## Docs

- README.md: Quick Start now names Phi-3.5-mini as the recommended default; CLI examples and the Python `from_pretrained` example updated. Benchmark/perf sections still reference SmolLM2/Llama models because those are historical measurement data.
- README.ko.md: the same changes mirrored in Korean.
- bindings/python/README.md (PyPI README): replaced "Basic question answering" with "Quick start (auto-download)" using `from_pretrained`. Added a multi-turn chat example using `m.chat()` + KV cache reuse, and an API reference entry for `Model.chat()` and `Model.from_pretrained()`.

## Verified

- ctest --test-dir build → 35/35 passed
- Full build clean (no new warnings)
- Phi-3.5-mini end-to-end inference test still produces coherent multi-paragraph output ("Name three planets..." → Earth, Mars, Jupiter with descriptions)
- `available_models()` returns Phi-3.5-mini in the list
- `MODEL_ALIASES['phi-3.5-mini']` and friends resolve correctly
- `cmd_chat_default` source confirms `args.model = "Phi-3.5-mini"`
- `quantcpp --help` epilog reflects the new defaults

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request — Apr 12, 2026
…rces

Ports the Phi-3/Phi-3.5 architecture support from quant.h (PR #65) to the split source files used by libturboquant and quant-server.

Changes:

- tq_model.c: fused attn_qkv detection, LongRoPE factor loading, fused gate||up FFN detection
- tq_transformer.c: fused QKV matmul + split, NeoX-style LongRoPE rotation, fused gate||up FFN path, expanded state allocation
- tq_generate.c: Phi-3 BOS token handling
- tq_tokenizer.c: `<s>` BOS lookup
- tq_server.c: Phi-3 chat template support
- tq_engine.h: new fields for fused weights and LongRoPE config
- cli.py: Phi-3.5 default model + alias updates

quant-server now detects Phi-3.5 correctly: loaded 32 layers (32 self_attn) + LongRoPE.

Note: the server crashes during inference (segfault in the forward pass). The fused QKV → split memcpy or the LongRoPE computation likely has a buffer size issue in the server path. Tracked in #67.

35/35 unit tests still pass.

Fixes #67 (partial — loader works, inference needs debugging)
Refs #69, #70

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request — Apr 12, 2026
Closes #67 (Phi-3 not propagated to libturboquant). Fixes the #62 root cause for Phi-3 (Metal buffer overflow → garbage on 2nd request). Addresses #61 GQA — confirmed working on the CPU-only server for Llama-3.2-1B.

## Root cause

PR #65 added full Phi-3 support to `quant.h` (single-header), but the split source files used by `libturboquant` and `quant-server` were never updated. Models loaded correctly (32 self_attn layers), but the forward pass ran through code paths that didn't know about fused tensors → garbage output.

Additionally, the Metal GPU backend intercepted Q4_K matmul calls via `tq_metal_matmul_gguf()` even for fused tensors. The Metal kernels assume standard separate-tensor layouts and pre-allocated buffer sizes. When called with the fused FFN output dim (16384 vs the expected 8192), the GPU produced corrupted output → garbage even after the CPU code was fixed.

## Changes

### tq_engine.h (struct definitions)

- Added `rope_orig_ctx_len`, `rope_attn_factor`, `rope_factors_short`, `rope_factors_long` to `tq_model_config_t`
- Added `has_fused_qkv`, `has_fused_up_gate` flags
- Added `gguf_w_qkv`, `gguf_w_up_gate` fields to `tq_layer_weights_t`

### tq_model.c (GGUF loader)

- Phi-3 fused QKV detection: `blk.N.attn_qkv.weight` → `gguf_w_qkv`
- Phi-3 fused FFN detection: `ffn_up` with 2× intermediate_dim → `gguf_w_up_gate`
- LongRoPE config + factor loading from GGUF metadata/tensors
- Hard-fail when `n_attn_layers == 0 && delta_n_heads == 0`
- Metal GPU init SKIPPED for fused-tensor models (CPU is faster anyway)

### tq_transformer.c (forward pass)

- Fused QKV branch: one matmul + memcpy split into `s->q`/`s->k`/`s->v`
- K and V projection blocks skipped when fused
- NeoX-style LongRoPE rotation with per-frequency rescaling
- Fused FFN gate||up: one matmul + memcpy split (layout: `[gate | up]`)
- Layer dispatcher: `layer->gguf_w_qkv` triggers `self_attn_forward`
- FFN dispatcher: accepts `gguf_w_up_gate` as a valid gate+up source
- State buffer sizing: max_dim covers fused QKV, hb covers 2×inter

### tq_generate.c (BOS token)

- add_bos=1 when the vocab has `<s>` in the first 8 entries (Phi-3 needs BOS)

### tq_tokenizer.c (BOS lookup chain)

- Added `<s>` and `<|begin_of_text|>` to the BOS token lookup chain

### tq_metal_dispatch.m (Metal safety)

- Added a `tq_metal_disable()` function
- `tq_metal_available()` returns 0 when disabled
- Loader calls `tq_metal_disable()` for fused-tensor models

## Verified

- ctest --test-dir build → 35/35 passed (Metal build)
- quant-server + Phi-3.5-mini (Metal build, Metal disabled) → coherent output: "hat can help with math problems..."
- quant-server + Phi-3.5-mini (CPU-only build) → coherent output: "capital city of France is Paris"
- quant-server + Llama-3.2-1B GQA (CPU build) → "Answer: 4"
- quant-server + SmolLM2-135M (Metal build, Metal active) → unchanged
- quant.h single-header path → unchanged (no regression)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`tq_encode` adds Phi-3 / Llama's `<s>` to the BOS lookup chain alongside `<bos>` (Gemma) and `<|im_start|>` (Qwen). Phi-3 specifically degrades into garbage without it; Llama-3 behavior is unchanged because Llama-3 uses `<|begin_of_text|>`.

## Debugging journey (TIL)
The bug was that I had the FFN split order backwards. HuggingFace's `gate, up = gate_up_proj(x).chunk(2, dim=-1)` produces `[gate, up]` along the output axis, which is what the GGUF preserves. I had assumed `[up, gate]` based on a half-remembered llama.cpp comment.
🤖 Generated with Claude Code