fix(libturboquant): sync Phi-3 + BOS + Metal safety to split sources#72
Closed
fix(libturboquant): sync Phi-3 + BOS + Metal safety to split sources#72
Conversation
Closes #67 (Phi-3 not propagated to libturboquant). Fixes #62 root cause for Phi-3 (Metal buffer overflow → garbage on 2nd request). Addresses #61 GQA — confirmed working on CPU-only server for Llama-3.2-1B. ## Root cause PR #65 added full Phi-3 support to `quant.h` (single-header) but the split source files used by `libturboquant` and `quant-server` were never updated. Models loaded correctly (32 self_attn layers) but the forward pass ran through code paths that didn't know about fused tensors → garbage output. Additionally, the Metal GPU backend intercepted Q4_K matmul calls via `tq_metal_matmul_gguf()` even for fused tensors. The Metal kernels assume standard separate-tensor layouts and pre-allocated buffer sizes. When called with fused FFN output dim (16384 vs expected 8192), the GPU produced corrupted output → garbage even after CPU code was fixed. ## Changes ### tq_engine.h (struct definitions) - Added `rope_orig_ctx_len`, `rope_attn_factor`, `rope_factors_short`, `rope_factors_long` to `tq_model_config_t` - Added `has_fused_qkv`, `has_fused_up_gate` flags - Added `gguf_w_qkv`, `gguf_w_up_gate` fields to `tq_layer_weights_t` ### tq_model.c (GGUF loader) - Phi-3 fused QKV detection: `blk.N.attn_qkv.weight` → `gguf_w_qkv` - Phi-3 fused FFN detection: `ffn_up` with 2× intermediate_dim → `gguf_w_up_gate` - LongRoPE config + factor loading from GGUF metadata/tensors - Hard-fail when `n_attn_layers == 0 && delta_n_heads == 0` - Metal GPU init SKIPPED for fused-tensor models (CPU is faster anyway) ### tq_transformer.c (forward pass) - Fused QKV branch: one matmul + memcpy split into s->q/s->k/s->v - K and V projection blocks skipped when fused - NeoX-style LongRoPE rotation with per-frequency rescaling - Fused FFN gate||up: one matmul + memcpy split (layout: [gate | up]) - Layer dispatcher: `layer->gguf_w_qkv` triggers self_attn_forward - FFN dispatcher: accepts `gguf_w_up_gate` as valid gate+up source - State buffer sizing: max_dim covers fused QKV, hb covers 2×inter ### tq_generate.c (BOS token) - add_bos=1 when vocab has `<s>` in first 8 entries (Phi-3 needs BOS) ### tq_tokenizer.c (BOS lookup chain) - Added `<s>` and `<|begin_of_text|>` to the BOS token lookup chain ### tq_metal_dispatch.m (Metal safety) - Added `tq_metal_disable()` function - `tq_metal_available()` returns 0 when disabled - Loader calls `tq_metal_disable()` for fused-tensor models ## Verified - ctest --test-dir build → 35/35 passed (Metal build) - quant-server + Phi-3.5-mini (Metal build, Metal disabled) → coherent output: "hat can help with math problems..." - quant-server + Phi-3.5-mini (CPU-only build) → coherent output: "capital city of France is Paris" - quant-server + Llama-3.2-1B GQA (CPU build) → "Answer: 4" - quant-server + SmolLM2-135M (Metal build, Metal active) → unchanged - quant.h single-header path → unchanged (no regression) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #67. Addresses #62 (Metal root cause). Addresses #61 (GQA confirmed working on CPU).
PR #65 added Phi-3 to
quant.h(single-header) but the split source files (libturboquant) were never updated. This PR ports all Phi-3 changes + discovers and fixes a Metal GPU buffer overflow that caused garbage output even after the CPU code was correct.Root cause chain
quant-serverloads Phi-3 correctly (32 self_attn) but forward pass produces garbagetq_matmul_ggufintq_gguf_quants.cdispatches Q4_K matmuls to Metal GPU when `out_dim >= 512`. Fused FFN calls it with `out_dim=16384` (2x intermediate_dim) but Metal kernels assume standard layout → GPU output buffer overflow → garbageFiles changed (7 files, +328/-25)
` (Phi-3 needs it)` + `<|begin_of_text|>` added to BOS lookup chainVerified
Test plan
🤖 Generated with Claude Code