
feat(phi3): end-to-end Phi-3 / Phi-3.5 architecture support#65

Merged
unamedkr merged 1 commit into main from feat/phi3-support
Apr 12, 2026

Conversation

@unamedkr
Collaborator

Summary

Adds full inference support for Phi-3 family GGUFs, validated end-to-end against bartowski/Phi-3.5-mini-instruct-Q4_K_M.gguf. Phi-3.5-mini is now the registry's recommended "best speed + quality" model: its 32K vocab and 3.8B params make the lm_head matmul the fastest of any model we ship.

Validation

Coherent multi-paragraph output on plain prompts AND chat template:

```
$ phi3_infer_test ~/.cache/quantcpp/Phi-3.5-mini-instruct-Q4_K_M.gguf
The capital of France is Paris.
The Eiffel Tower, located in the city center, stands as a symbolic
landmark for both locals and tourists alike. The tower's iconic
silhouette is visible from various points around the city, offering
a panoramic view of Parisian life and its vibrant culture.
The Seine River meanders through this historic metropolis...
```

```
$ phi3_infer_test ... "What is 2+2?"
<|user|>
What is 2+2?<|end|>
<|assistant|>
The sum of 2 + 2 equals to four.
```

Regression checks:

  • ctest --test-dir build → 35/35 passed
  • Llama-3.2-1B end-to-end → still coherent
  • SmolLM2-135M end-to-end → still coherent
  • Full build clean (no new warnings)

Why Phi-3 is hard

Phi-3 ships fused weight tensors instead of llama-style separate ones, plus a long-context RoPE variant:

| Tensor | Shape | What's inside |
| --- | --- | --- |
| `blk.N.attn_qkv.weight` | `[hidden, 3*hidden]` | Q ‖ K ‖ V along the output axis |
| `blk.N.ffn_up.weight` | `[hidden, 2*ff]` | gate ‖ up along the output axis |
| `rope_factors_short` | `[head_dim/2]` | LongRoPE per-pair rescaling |
| `rope_factors_long` | `[head_dim/2]` | LongRoPE per-pair rescaling |

Before this PR: load reported `0 self_attn` and the forward pass ran against zero-initialized attention weights → garbage tokens.

Implementation

Loader (`tq_load_gguf`)

  • Detects `attn_qkv.weight` → `gguf_w_qkv` field, marks layer as attention
  • Detects `ffn_up.weight` with `shape[1] == 2 * intermediate_dim` AND no separate `ffn_gate.weight` → `gguf_w_up_gate`
  • Reads `phi3.rope.scaling.original_context_length` and `attn_factor`
  • Locates `rope_factors_short`/`rope_factors_long` global tensors
  • The hard-fail path from #59 (feat(feedback): Quick Wins from the 2026-04-12 external user report) now correctly identifies Phi-3 as supported (32 attn layers detected)

Forward (`self_attn_forward` + `tq_forward` dispatcher)

  • New `if (layer->gguf_w_qkv)` branch: one matmul into a temp buffer of size `q_dim + 2*kv_dim`, memcpy splits into `s->q/s->k/s->v`
  • New `if (layer->gguf_w_up_gate)` FFN branch: one matmul of size `2*inter`, memcpy second half to `s->hb2`. Layout is `[gate | up]` (HuggingFace convention)
  • The dispatcher in `tq_forward`'s layer loop ALSO calls `self_attn_forward` when `gguf_w_qkv` is set — without this the new branches were unreachable
  • Same for FFN dispatcher: accept `gguf_w_up_gate` as a valid gate/up source
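
The fused-QKV branch above boils down to one matmul into a temp buffer followed by three memcpys. A minimal sketch of the split step in C (the function name and the flat-float layout are illustrative, not quant.cpp's actual API):

```c
#include <string.h>

/* Split one fused QKV output vector of size q_dim + 2*kv_dim into
 * separate q/k/v buffers, in the Q ‖ K ‖ V order described above. */
static void split_fused_qkv(const float *qkv, float *q, float *k, float *v,
                            int q_dim, int kv_dim)
{
    memcpy(q, qkv,                  (size_t)q_dim  * sizeof(float));
    memcpy(k, qkv + q_dim,          (size_t)kv_dim * sizeof(float));
    memcpy(v, qkv + q_dim + kv_dim, (size_t)kv_dim * sizeof(float));
}
```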

LongRoPE (NeoX-style)

  • New branch in full RoPE: `factor[i] = (pos < orig_ctx_len) ? short[i] : long[i]`, `freq[i] = 1 / (rope_base^(2i/head_dim) * factor[i])`
  • Uses NeoX-style pair layout `(q[i], q[i + half])` rather than the interleaved `(q[2i], q[2i+1])` that quant.cpp's existing `tq_rope` uses for Llama
  • Why NeoX: llama.cpp's GGUF converter pre-permutes separate Q/K weights so interleaved RoPE works for Llama-family models. The fused `attn_qkv` is NOT permuted, so we apply rotation in its native NeoX form.
  • `rope_attn_factor` multiplied into Q only when `pos >= orig_ctx_len`

State allocation

  • `max_dim` bumped to cover `q_dim + 2*kv_dim` (fused QKV temp buffer reuses `s->xb2`)
  • `s->hb` bumped to `2 * inter` when fused gate||up is detected

Tokenizer BOS

  • `tq_encode` adds `<s>` to the BOS lookup chain alongside `<bos>` and `<|im_start|>`
  • `quant_generate` enables `add_bos=1` when vocab has `<s>`. Phi-3 specifically degrades into garbage without it; Llama-3 behavior is unchanged because Llama-3 uses `<|begin_of_text|>`.

Debugging journey (TIL)

  1. First implementation: garbage output. Loader correct, forward path wasn't being hit.
  2. The dispatcher's `if (layer->gguf_wq)` check guarded the call to `self_attn_forward`. Phi-3 has `gguf_wq == NULL`. Fixed.
  3. Still garbage. Forward branch firing but output is fragmented English.
  4. Tried NeoX vs GPT-J RoPE — both garbage but different.
  5. Verified Q5_K dequant + matmul produces bit-perfect output against an FP32 reference.
  6. Verified tokenizer produces correct token IDs (`The capital of France is` → 5 correct tokens).
  7. Tried `[gate | up]` vs `[up | gate]` for the fused FFN. `[gate | up]` was the answer. Output went from "Block...attacker...non...meteor..." to "Paris. The Eiffel Tower..." instantly.

The bug was that I had the FFN split order backwards. HuggingFace's `gate, up = gate_up_proj(x).chunk(2, dim=-1)` produces `[gate, up]` along the output axis, which is what the GGUF preserves. I had assumed `[up, gate]` based on a half-remembered llama.cpp comment.
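
The winning split order can be sketched in C (illustrative names, flat-float layout): the first half of the fused output is gate, the second half is up, matching HuggingFace's `chunk(2, dim=-1)`.

```c
#include <string.h>

/* Split a fused gate||up FFN output of size 2*inter, in the
 * [gate | up] order that fixed the garbage output. */
static void split_fused_gate_up(const float *fused, float *gate, float *up,
                                int inter)
{
    memcpy(gate, fused,         (size_t)inter * sizeof(float)); /* first half  */
    memcpy(up,   fused + inter, (size_t)inter * sizeof(float)); /* second half */
}
```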

New tools

  • `tools/gguf_inspect.c` — dumps tensor names, shapes, types, and metadata from any GGUF. Used to verify Phi-3.5's layout before writing loader code. General-purpose, kept for future architecture work.
  • `tools/phi3_infer_test.c` — minimal end-to-end inference test. Doubles as a smoke test for any future Phi-3 changes.

Registry / docs updates

  • `Phi-3.5-mini` added to `_MODEL_REGISTRY` with bartowski's Q4_K_M variant (~2.4 GB)
  • New CLI aliases: `phi3.5`, `phi3.5:mini`, `phi-3.5`, `phi-3.5-mini`
  • `docs/supported_models.md` updated: phi3 now ✅ Fully supported. Recommended-models table puts Phi-3.5-mini at the top.
  • `docs/spikes/2026-04-12_phi3_support.md` updated with the inspection findings + conclusions that drove implementation.

Test plan

  • Unit tests pass (35/35)
  • Phi-3.5-mini end-to-end inference produces coherent paragraphs
  • Phi-3 chat template works ("What is 2+2?" → "...equals to four")
  • Llama-3.2-1B regression — still coherent
  • SmolLM2-135M regression — still coherent
  • Manual: `quantcpp run phi-3.5-mini` after wheel build
  • Manual: long-context test (>4096 tokens) to exercise the LongRoPE `pos >= orig_ctx_len` branch

🤖 Generated with Claude Code

Adds full inference support for Phi-3 family GGUFs (validated against
bartowski's `Phi-3.5-mini-instruct-Q4_K_M.gguf`). Output is coherent
multi-paragraph English. Phi-3.5-mini becomes the new "best speed +
quality" recommendation in the model registry.

## Why Phi-3 is hard

Phi-3 ships fused weight tensors instead of llama-style separate ones,
plus a long-context RoPE variant:

  blk.N.attn_qkv.weight   shape [hidden, 3*hidden]   Q ‖ K ‖ V
  blk.N.ffn_up.weight     shape [hidden, 2*ff]       gate ‖ up
  rope_factors_short      [head_dim/2]               LongRoPE
  rope_factors_long       [head_dim/2]               LongRoPE
  + rope.scaling.attn_factor                          long-context Q scaling

Before this PR: load reported `0 self_attn` and the forward pass ran
against zero-initialized attention weights → garbage tokens.

## What this PR adds

### Loader (`tq_load_gguf`)

- Detects `blk.N.attn_qkv.weight` and stores its raw quantized pointer
  in a new field `gguf_w_qkv` (+ type). Marks the layer as attention.
- Detects `blk.N.ffn_up.weight` with `shape[1] == 2 * intermediate_dim`
  AND no separate `ffn_gate.weight` → fused gate||up. Stores in
  `gguf_w_up_gate`.
- Reads `phi3.rope.scaling.original_context_length` and
  `phi3.rope.scaling.attn_factor` via the existing arch-prefix macro.
- Locates `rope_factors_short.weight` / `rope_factors_long.weight` as
  global tensors and stores raw F32 pointers in the model config.
- The hard-fail path from the previous PR now correctly identifies
  Phi-3 as a *supported* architecture (n_attn_layers == 32, not 0).

### Forward (`self_attn_forward` + tq_forward dispatcher)

- New `if (layer->gguf_w_qkv)` branch in self_attn_forward: one
  `tq_matmul_gguf` call into a temp buffer of size `q_dim + 2*kv_dim`,
  then memcpy splits into `s->q`, `s->k`, `s->v`. The K and V projection
  blocks below are skipped when fused.
- New `if (layer->gguf_w_up_gate)` branch in the FFN section: one
  matmul of size `2*inter` into `s->hb`, memcpy second half to `s->hb2`.
  Layout is `[gate | up]` (HuggingFace convention).
- The dispatcher (`tq_forward` layer loop) now also calls
  `self_attn_forward` when `layer->gguf_w_qkv != NULL` — without this
  the new branch was unreachable because the existing condition checks
  for `gguf_wq` separately.
- Same for the FFN dispatcher: accept `gguf_w_up_gate` as a valid
  gate/up source.

### LongRoPE

- New branch in the full RoPE path: when `rope_factors_short` or
  `rope_factors_long` is set, apply per-frequency-pair rescaling using
  `factor[i] = (pos < orig_ctx_len) ? short[i] : long[i]` and
  `freq[i] = 1 / (rope_base^(2i/head_dim) * factor[i])`.
- Uses **NeoX-style** pair layout `(q[i], q[i + half])` rather than
  the interleaved `(q[2i], q[2i+1])` that quant.cpp's `tq_rope` uses
  for Llama. The reason: llama.cpp's GGUF converter pre-permutes
  separate Q/K weights so interleaved RoPE produces equivalent
  results, but the *fused* `attn_qkv` tensor is NOT permuted.
- `rope_attn_factor` is multiplied into Q only when
  `pos >= orig_ctx_len` — no scaling at short context.

### State allocation (`tq_create_state_ex`)

- Bumps `max_dim` to cover `q_dim + 2*kv_dim` (the fused QKV temp
  buffer reuses `s->xb2`) when `has_fused_qkv` is set.
- Bumps the `s->hb` allocation to `2 * inter` (fused gate||up output)
  when `has_fused_up_gate` is set. `s->hb2` stays at `inter`.

### Tokenizer BOS

- `tq_encode` adds Phi-3 / Llama's `<s>` to its BOS lookup chain
  alongside `<bos>` (Gemma) and `<|im_start|>` (Qwen).
- `quant_generate` also enables `add_bos=1` when the vocab has `<s>`
  — Phi-3 specifically degrades into garbage without it. Existing
  Llama-3 behavior is unchanged because Llama-3 uses `<|begin_of_text|>`
  which the lookup chain also handles.
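
A first-match lookup chain over those candidates can be sketched as follows (function name and candidate order are illustrative; the text above names the tokens, not their precedence):

```c
#include <string.h>
#include <stddef.h>

/* Return the first BOS candidate present anywhere in the vocab,
 * or NULL when none of the known BOS tokens exist. */
static const char *pick_bos_token(const char **vocab, int n_vocab)
{
    static const char *candidates[] = {
        "<s>", "<bos>", "<|im_start|>", "<|begin_of_text|>"
    };
    for (size_t c = 0; c < sizeof(candidates) / sizeof(candidates[0]); c++)
        for (int t = 0; t < n_vocab; t++)
            if (strcmp(vocab[t], candidates[c]) == 0)
                return candidates[c];
    return NULL;
}
```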

## Validation

End-to-end inference test (`tools/phi3_infer_test.c`):

```
$ ./phi3_infer_test ~/.cache/quantcpp/Phi-3.5-mini-instruct-Q4_K_M.gguf
The capital of France is
 Paris.
The Eiffel Tower, located in the city center, stands as a symbolic
landmark for both locals and tourists alike. The tower's iconic
silhouette is visible from various points around the city, offering
a panoramic view of Parisian life and its vibrant culture.
The Seine River meanders through this historic metropolis...
```

Chat template:

```
$ ./phi3_infer_test ... "What is 2+2?"
<|user|>
What is 2+2?<|end|>
<|assistant|>
The sum of 2 + 2 equals to four.
```

Regression checks:
- ctest --test-dir build → 35/35 passed
- Llama-3.2-1B end-to-end → still coherent
- SmolLM2-135M end-to-end → still coherent
- Full build clean (no new warnings)

## Registry / docs

- `Phi-3.5-mini` added to `_MODEL_REGISTRY` with the bartowski Q4_K_M
  variant (~2.4 GB). Listed as "best speed + quality" in
  `docs/supported_models.md`.
- New aliases `phi3.5`, `phi3.5:mini`, `phi-3.5`, `phi-3.5-mini`.
- Architecture matrix updated: phi3 now ✅ Fully supported.
- Docs section "Why phi3 is hard" replaced with "How Phi-3 support
  works" — explains fused tensors + LongRoPE + NeoX rotation choice.
- Spike doc `docs/spikes/2026-04-12_phi3_support.md` updated with
  inspection findings and the conclusions that drove implementation.

## New tools

- `tools/gguf_inspect.c` — dump tensor names, shapes, types, and
  metadata from a GGUF file. Used to verify Phi-3.5's layout before
  writing loader code. General-purpose, kept for future architecture
  work.
- `tools/phi3_infer_test.c` — minimal end-to-end inference test.
  Doubles as a smoke test for any future Phi-3 changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit 1e1ea2c into main Apr 12, 2026
2 of 3 checks passed
@unamedkr unamedkr deleted the feat/phi3-support branch April 12, 2026 02:57
unamedkr added a commit that referenced this pull request Apr 12, 2026
Phi-3 architecture support landed in #65 and validated end-to-end as
the best speed/quality combo we ship (vocab 32K + 3.8B params makes
the lm_head matmul the fastest of any registered model). Promote it
to the default everywhere.

## Code

- `_MODEL_REGISTRY` reordered with Phi-3.5-mini first; comment block
  marks it as the default and explains the reasoning
- `cmd_chat_default` (no-subcommand chat) now picks Phi-3.5-mini
- Module docstring + `Model.from_pretrained` example use Phi-3.5-mini
- CLI `--help` epilog: examples lead with `phi-3.5-mini` and the
  backwards-compat block mentions `smollm2` / `llama3.2:1b` as
  alternatives instead

## Docs

- README.md: Quick Start renamed Phi-3.5-mini as the recommended
  default; CLI examples and Python `from_pretrained` example updated.
  Benchmark/perf sections still reference SmolLM2/Llama models because
  those are historical measurement data.
- README.ko.md: same changes mirrored in Korean.
- bindings/python/README.md (PyPI README): replaced "Basic question
  answering" with "Quick start (auto-download)" using `from_pretrained`.
  Added a multi-turn chat example using `m.chat()` + KV cache reuse,
  and an API reference entry for `Model.chat()` and
  `Model.from_pretrained()`.

## Verified

- ctest --test-dir build → 35/35 passed
- Full build clean (no new warnings)
- Phi-3.5-mini end-to-end inference test still produces coherent
  multi-paragraph output ("Name three planets..." → Earth, Mars,
  Jupiter with descriptions)
- `available_models()` returns Phi-3.5-mini in the list
- `MODEL_ALIASES['phi-3.5-mini']` and friends resolve correctly
- `cmd_chat_default` source confirms `args.model = "Phi-3.5-mini"`
- `quantcpp --help` epilog reflects the new defaults

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request Apr 12, 2026
…rces

Ports the Phi-3/Phi-3.5 architecture support from quant.h (PR #65)
to the split source files used by libturboquant and quant-server.

Changes:
- tq_model.c: fused attn_qkv detection, LongRoPE factor loading,
  fused gate||up FFN detection
- tq_transformer.c: fused QKV matmul + split, NeoX-style LongRoPE
  rotation, fused gate||up FFN path, expanded state allocation
- tq_generate.c: Phi-3 BOS token handling
- tq_tokenizer.c: <s> BOS lookup
- tq_server.c: Phi-3 chat template support
- tq_engine.h: new fields for fused weights and LongRoPE config
- cli.py: Phi-3.5 default model + alias updates

quant-server now detects Phi-3.5 correctly:
  loaded 32 layers (32 self_attn) + LongRoPE

Note: server crashes during inference (segfault in forward pass).
The fused QKV → split memcpy or LongRoPE computation likely has
a buffer size issue in the server path. Tracked in #67.

35/35 unit tests still pass.

Fixes #67 (partial — loader works, inference needs debugging)
Refs #69, #70

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request Apr 12, 2026
Closes #67 (Phi-3 not propagated to libturboquant).
Fixes #62 root cause for Phi-3 (Metal buffer overflow → garbage on 2nd request).
Addresses #61 GQA — confirmed working on CPU-only server for Llama-3.2-1B.

## Root cause

PR #65 added full Phi-3 support to `quant.h` (single-header) but the
split source files used by `libturboquant` and `quant-server` were
never updated. Models loaded correctly (32 self_attn layers) but the
forward pass ran through code paths that didn't know about fused
tensors → garbage output.

Additionally, the Metal GPU backend intercepted Q4_K matmul calls via
`tq_metal_matmul_gguf()` even for fused tensors. The Metal kernels
assume standard separate-tensor layouts and pre-allocated buffer sizes.
When called with fused FFN output dim (16384 vs expected 8192), the
GPU produced corrupted output → garbage even after CPU code was fixed.

## Changes

### tq_engine.h (struct definitions)
- Added `rope_orig_ctx_len`, `rope_attn_factor`, `rope_factors_short`,
  `rope_factors_long` to `tq_model_config_t`
- Added `has_fused_qkv`, `has_fused_up_gate` flags
- Added `gguf_w_qkv`, `gguf_w_up_gate` fields to `tq_layer_weights_t`

### tq_model.c (GGUF loader)
- Phi-3 fused QKV detection: `blk.N.attn_qkv.weight` → `gguf_w_qkv`
- Phi-3 fused FFN detection: `ffn_up` with 2× intermediate_dim → `gguf_w_up_gate`
- LongRoPE config + factor loading from GGUF metadata/tensors
- Hard-fail when `n_attn_layers == 0 && delta_n_heads == 0`
- Metal GPU init SKIPPED for fused-tensor models (CPU is faster anyway)

### tq_transformer.c (forward pass)
- Fused QKV branch: one matmul + memcpy split into s->q/s->k/s->v
- K and V projection blocks skipped when fused
- NeoX-style LongRoPE rotation with per-frequency rescaling
- Fused FFN gate||up: one matmul + memcpy split (layout: [gate | up])
- Layer dispatcher: `layer->gguf_w_qkv` triggers self_attn_forward
- FFN dispatcher: accepts `gguf_w_up_gate` as valid gate+up source
- State buffer sizing: max_dim covers fused QKV, hb covers 2×inter

### tq_generate.c (BOS token)
- add_bos=1 when vocab has `<s>` in first 8 entries (Phi-3 needs BOS)

### tq_tokenizer.c (BOS lookup chain)
- Added `<s>` and `<|begin_of_text|>` to the BOS token lookup chain

### tq_metal_dispatch.m (Metal safety)
- Added `tq_metal_disable()` function
- `tq_metal_available()` returns 0 when disabled
- Loader calls `tq_metal_disable()` for fused-tensor models
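
The kill-switch described above can be sketched with a single global flag (only the function names come from the commit text; the bodies are illustrative, and the real probe for a Metal device is elided):

```c
/* Metal disable flag: once set by the loader for fused-tensor models,
 * tq_metal_available() reports no GPU and all matmuls stay on the CPU. */
static int g_metal_disabled = 0;

void tq_metal_disable(void)
{
    g_metal_disabled = 1;
}

int tq_metal_available(void)
{
    if (g_metal_disabled)
        return 0;
    /* ... real device probe would go here; assume a device exists ... */
    return 1;
}
```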

## Verified

- ctest --test-dir build → 35/35 passed (Metal build)
- quant-server + Phi-3.5-mini (Metal build, Metal disabled) → coherent
  output: "hat can help with math problems..."
- quant-server + Phi-3.5-mini (CPU-only build) → coherent output:
  "capital city of France is Paris"
- quant-server + Llama-3.2-1B GQA (CPU build) → "Answer: 4"
- quant-server + SmolLM2-135M (Metal build, Metal active) → unchanged
- quant.h single-header path → unchanged (no regression)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>