fix(libturboquant): sync Phi-3 + BOS + Metal safety to split sources by unamedkr · Pull Request #72 · quantumaikr/quant.cpp

unamedkr · 2026-04-12T04:48:34Z

Summary

Closes #67. Addresses #62 (Metal root cause). Addresses #61 (GQA confirmed working on CPU).

PR #65 added Phi-3 to quant.h (single-header) but the split source files (libturboquant) were never updated. This PR ports all Phi-3 changes + discovers and fixes a Metal GPU buffer overflow that caused garbage output even after the CPU code was correct.

Root cause chain

Phi-3 support not propagated from quant.h to libturboquant (quant-server broken) #67: split sources missing fused QKV/FFN/LongRoPE → quant-server loads Phi-3 correctly (32 self_attn) but forward pass produces garbage
Metal buffer overflow: tq_matmul_gguf in tq_gguf_quants.c dispatches Q4_K matmuls to Metal GPU when `out_dim >= 512`. Fused FFN calls it with `out_dim=16384` (2x intermediate_dim) but Metal kernels assume standard layout → GPU output buffer overflow → garbage
Fix: disable Metal dispatch for fused-tensor models. CPU NEON is already faster for sub-4B models (measured: 95 tok/s CPU vs 38 tok/s GPU).

Files changed (7 files, +328/-25)

File	Change
tq_engine.h	Added LongRoPE config + fused tensor fields to structs
tq_model.c	Fused QKV/FFN detection, LongRoPE loading, hard-fail on unsupported arch, Metal skip for fused models
tq_transformer.c	Fused QKV matmul+split, NeoX LongRoPE rotation, fused FFN gate\|\|up, layer/FFN dispatcher updates, state buffer sizing
tq_generate.c	BOS=1 when vocab has `~~` (Phi-3 needs it)~~
tq_tokenizer.c	`~~` + `<\|begin_of_text\|>` added to BOS lookup chain~~
tq_metal_dispatch.m	`tq_metal_disable()` function + `tq_metal_available()` respects it
tools/phi3_kvcomp_test.c	KV compression validation harness (5 modes)

Verified

ctest → 35/35 passed (Metal build)
quant-server + Phi-3.5-mini (Metal build) → coherent output ✅
quant-server + Phi-3.5-mini (CPU-only build) → "capital of France is Paris" ✅
quant-server + Llama-3.2-1B GQA (CPU build) → "Answer: 4" ✅
quant-server + SmolLM2-135M (Metal build) → unchanged (Metal still active for non-fused)
quant.h single-header path → no regression

Test plan

Unit tests pass (35/35)
Phi-3 quant-server produces coherent output (Metal build)
Phi-3 quant-server produces coherent output (CPU build)
Llama-3.2-1B GQA works on quant-server (CPU build)
SmolLM2-135M Metal path unchanged
Manual: `quantcpp serve phi-3.5-mini` end-to-end with Python ctypes

🤖 Generated with Claude Code

Closes #67 (Phi-3 not propagated to libturboquant). Fixes #62 root cause for Phi-3 (Metal buffer overflow → garbage on 2nd request). Addresses #61 GQA — confirmed working on CPU-only server for Llama-3.2-1B. ## Root cause PR #65 added full Phi-3 support to `quant.h` (single-header) but the split source files used by `libturboquant` and `quant-server` were never updated. Models loaded correctly (32 self_attn layers) but the forward pass ran through code paths that didn't know about fused tensors → garbage output. Additionally, the Metal GPU backend intercepted Q4_K matmul calls via `tq_metal_matmul_gguf()` even for fused tensors. The Metal kernels assume standard separate-tensor layouts and pre-allocated buffer sizes. When called with fused FFN output dim (16384 vs expected 8192), the GPU produced corrupted output → garbage even after CPU code was fixed. ## Changes ### tq_engine.h (struct definitions) - Added `rope_orig_ctx_len`, `rope_attn_factor`, `rope_factors_short`, `rope_factors_long` to `tq_model_config_t` - Added `has_fused_qkv`, `has_fused_up_gate` flags - Added `gguf_w_qkv`, `gguf_w_up_gate` fields to `tq_layer_weights_t` ### tq_model.c (GGUF loader) - Phi-3 fused QKV detection: `blk.N.attn_qkv.weight` → `gguf_w_qkv` - Phi-3 fused FFN detection: `ffn_up` with 2× intermediate_dim → `gguf_w_up_gate` - LongRoPE config + factor loading from GGUF metadata/tensors - Hard-fail when `n_attn_layers == 0 && delta_n_heads == 0` - Metal GPU init SKIPPED for fused-tensor models (CPU is faster anyway) ### tq_transformer.c (forward pass) - Fused QKV branch: one matmul + memcpy split into s->q/s->k/s->v - K and V projection blocks skipped when fused - NeoX-style LongRoPE rotation with per-frequency rescaling - Fused FFN gate||up: one matmul + memcpy split (layout: [gate | up]) - Layer dispatcher: `layer->gguf_w_qkv` triggers self_attn_forward - FFN dispatcher: accepts `gguf_w_up_gate` as valid gate+up source - State buffer sizing: max_dim covers fused QKV, hb covers 2×inter ### tq_generate.c (BOS token) - add_bos=1 when vocab has `<s>` in first 8 entries (Phi-3 needs BOS) ### tq_tokenizer.c (BOS lookup chain) - Added `<s>` and `<|begin_of_text|>` to the BOS token lookup chain ### tq_metal_dispatch.m (Metal safety) - Added `tq_metal_disable()` function - `tq_metal_available()` returns 0 when disabled - Loader calls `tq_metal_disable()` for fused-tensor models ## Verified - ctest --test-dir build → 35/35 passed (Metal build) - quant-server + Phi-3.5-mini (Metal build, Metal disabled) → coherent output: "hat can help with math problems..." - quant-server + Phi-3.5-mini (CPU-only build) → coherent output: "capital city of France is Paris" - quant-server + Llama-3.2-1B GQA (CPU build) → "Answer: 4" - quant-server + SmolLM2-135M (Metal build, Metal active) → unchanged - quant.h single-header path → unchanged (no regression) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

unamedkr · 2026-04-12T04:49:34Z

Superseded by commits 08e8661 + 221efbb which ported the same changes. Verified main is up to date.

unamedkr closed this Apr 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(libturboquant): sync Phi-3 + BOS + Metal safety to split sources#72

fix(libturboquant): sync Phi-3 + BOS + Metal safety to split sources#72
unamedkr wants to merge 1 commit intomainfrom
fix/libturboquant-phi3-gqa-sync

unamedkr commented Apr 12, 2026

Uh oh!

unamedkr commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

unamedkr commented Apr 12, 2026

Summary

Root cause chain

Files changed (7 files, +328/-25)

Verified

Test plan

Uh oh!

unamedkr commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant