Skip to content

fix(libturboquant): sync Phi-3 + BOS + Metal safety to split sources#72

Closed
unamedkr wants to merge 1 commit intomainfrom
fix/libturboquant-phi3-gqa-sync
Closed

fix(libturboquant): sync Phi-3 + BOS + Metal safety to split sources#72
unamedkr wants to merge 1 commit intomainfrom
fix/libturboquant-phi3-gqa-sync

Conversation

@unamedkr
Copy link
Copy Markdown
Collaborator

Summary

Closes #67. Addresses #62 (Metal root cause). Addresses #61 (GQA confirmed working on CPU).

PR #65 added Phi-3 to quant.h (single-header) but the split source files (libturboquant) were never updated. This PR ports all Phi-3 changes + discovers and fixes a Metal GPU buffer overflow that caused garbage output even after the CPU code was correct.

Root cause chain

  1. Phi-3 support not propagated from quant.h to libturboquant (quant-server broken) #67: split sources missing fused QKV/FFN/LongRoPE → quant-server loads Phi-3 correctly (32 self_attn) but forward pass produces garbage
  2. Metal buffer overflow: tq_matmul_gguf in tq_gguf_quants.c dispatches Q4_K matmuls to Metal GPU when `out_dim >= 512`. Fused FFN calls it with `out_dim=16384` (2x intermediate_dim) but Metal kernels assume standard layout → GPU output buffer overflow → garbage
  3. Fix: disable Metal dispatch for fused-tensor models. CPU NEON is already faster for sub-4B models (measured: 95 tok/s CPU vs 38 tok/s GPU).

Files changed (7 files, +328/-25)

File Change
tq_engine.h Added LongRoPE config + fused tensor fields to structs
tq_model.c Fused QKV/FFN detection, LongRoPE loading, hard-fail on unsupported arch, Metal skip for fused models
tq_transformer.c Fused QKV matmul+split, NeoX LongRoPE rotation, fused FFN gate||up, layer/FFN dispatcher updates, state buffer sizing
tq_generate.c BOS=1 when vocab has `` (Phi-3 needs it)
tq_tokenizer.c `` + `<|begin_of_text|>` added to BOS lookup chain
tq_metal_dispatch.m `tq_metal_disable()` function + `tq_metal_available()` respects it
tools/phi3_kvcomp_test.c KV compression validation harness (5 modes)

Verified

  • ctest → 35/35 passed (Metal build)
  • quant-server + Phi-3.5-mini (Metal build) → coherent output
  • quant-server + Phi-3.5-mini (CPU-only build) → "capital of France is Paris"
  • quant-server + Llama-3.2-1B GQA (CPU build) → "Answer: 4"
  • quant-server + SmolLM2-135M (Metal build) → unchanged (Metal still active for non-fused)
  • quant.h single-header path → no regression

Test plan

  • Unit tests pass (35/35)
  • Phi-3 quant-server produces coherent output (Metal build)
  • Phi-3 quant-server produces coherent output (CPU build)
  • Llama-3.2-1B GQA works on quant-server (CPU build)
  • SmolLM2-135M Metal path unchanged
  • Manual: `quantcpp serve phi-3.5-mini` end-to-end with Python ctypes

🤖 Generated with Claude Code

Closes #67 (Phi-3 not propagated to libturboquant).
Fixes #62 root cause for Phi-3 (Metal buffer overflow → garbage on 2nd request).
Addresses #61 GQA — confirmed working on CPU-only server for Llama-3.2-1B.

## Root cause

PR #65 added full Phi-3 support to `quant.h` (single-header) but the
split source files used by `libturboquant` and `quant-server` were
never updated. Models loaded correctly (32 self_attn layers) but the
forward pass ran through code paths that didn't know about fused
tensors → garbage output.

Additionally, the Metal GPU backend intercepted Q4_K matmul calls via
`tq_metal_matmul_gguf()` even for fused tensors. The Metal kernels
assume standard separate-tensor layouts and pre-allocated buffer sizes.
When called with fused FFN output dim (16384 vs expected 8192), the
GPU produced corrupted output → garbage even after CPU code was fixed.

## Changes

### tq_engine.h (struct definitions)
- Added `rope_orig_ctx_len`, `rope_attn_factor`, `rope_factors_short`,
  `rope_factors_long` to `tq_model_config_t`
- Added `has_fused_qkv`, `has_fused_up_gate` flags
- Added `gguf_w_qkv`, `gguf_w_up_gate` fields to `tq_layer_weights_t`

### tq_model.c (GGUF loader)
- Phi-3 fused QKV detection: `blk.N.attn_qkv.weight` → `gguf_w_qkv`
- Phi-3 fused FFN detection: `ffn_up` with 2× intermediate_dim → `gguf_w_up_gate`
- LongRoPE config + factor loading from GGUF metadata/tensors
- Hard-fail when `n_attn_layers == 0 && delta_n_heads == 0`
- Metal GPU init SKIPPED for fused-tensor models (CPU is faster anyway)

### tq_transformer.c (forward pass)
- Fused QKV branch: one matmul + memcpy split into s->q/s->k/s->v
- K and V projection blocks skipped when fused
- NeoX-style LongRoPE rotation with per-frequency rescaling
- Fused FFN gate||up: one matmul + memcpy split (layout: [gate | up])
- Layer dispatcher: `layer->gguf_w_qkv` triggers self_attn_forward
- FFN dispatcher: accepts `gguf_w_up_gate` as valid gate+up source
- State buffer sizing: max_dim covers fused QKV, hb covers 2×inter

### tq_generate.c (BOS token)
- add_bos=1 when vocab has `<s>` in first 8 entries (Phi-3 needs BOS)

### tq_tokenizer.c (BOS lookup chain)
- Added `<s>` and `<|begin_of_text|>` to the BOS token lookup chain

### tq_metal_dispatch.m (Metal safety)
- Added `tq_metal_disable()` function
- `tq_metal_available()` returns 0 when disabled
- Loader calls `tq_metal_disable()` for fused-tensor models

## Verified

- ctest --test-dir build → 35/35 passed (Metal build)
- quant-server + Phi-3.5-mini (Metal build, Metal disabled) → coherent
  output: "hat can help with math problems..."
- quant-server + Phi-3.5-mini (CPU-only build) → coherent output:
  "capital city of France is Paris"
- quant-server + Llama-3.2-1B GQA (CPU build) → "Answer: 4"
- quant-server + SmolLM2-135M (Metal build, Metal active) → unchanged
- quant.h single-header path → unchanged (no regression)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr
Copy link
Copy Markdown
Collaborator Author

Superseded by commits 08e8661 + 221efbb which ported the same changes. Verified main is up to date.

@unamedkr unamedkr closed this Apr 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phi-3 support not propagated from quant.h to libturboquant (quant-server broken)

1 participant