Summary
Complete the Phi-3 architecture port from quant.h to src/engine/*.c so that quant-server (libturboquant) can serve Phi-3.5 with Metal GPU acceleration.
Current State
- quant.h: Phi-3.5 works perfectly (6.5 tok/s CPU NEON)
- libturboquant: Phi-3.5 crashes or produces garbage
- Workaround: quant-server-unified (compiles quant.h directly, no GPU)
What Needs Porting
| Feature | quant.h | src/engine/ | Status |
|---|---|---|---|
| Fused `attn_qkv` detection | ✅ | Partial | loader works, forward crashes |
| Fused `ffn_up_gate` detection | ✅ | Partial | loader works |
| LongRoPE NeoX rotation | ✅ | Written | untested |
| K/V projection skip for fused QKV | ✅ | Missing | root cause of crash |
| State allocation (xb2, hb sizing) | ✅ | Written | needs verification |
| BOS token handling | ✅ | Written | untested |
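To make the "K/V projection skip" row concrete, here is a minimal sketch of what the forward pass could look like once a fused QKV tensor has been detected. The struct, field names, and `project_qkv` are hypothetical and not taken from quant.h or src/engine/. The Q, K, V row order inside the fused tensor is an assumption that should be checked against the loader.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-layer weights; names are illustrative only. */
typedef struct {
    const float *wq;    /* separate Q weights, NULL when the layer is fused */
    const float *wk;    /* separate K weights, NULL when the layer is fused */
    const float *wv;    /* separate V weights, NULL when the layer is fused */
    const float *wqkv;  /* fused weights, (q_dim + 2*kv_dim) rows, or NULL */
    int n_embd;         /* input width (columns of every weight matrix) */
    int q_dim;          /* rows belonging to Q */
    int kv_dim;         /* rows belonging to K, and again to V */
} layer_weights;

/* Plain row-major matrix-vector product: out[r] = dot(w[r, :], x). */
static void matvec(float *out, const float *w, const float *x,
                   int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++) {
            acc += w[(size_t)r * cols + c] * x[c];
        }
        out[r] = acc;
    }
}

/* Fills q, k, v from x. With a fused tensor, K and V are row slices of
 * wqkv and the separate wk/wv projections are skipped entirely.
 * Assumed row order inside wqkv: Q first, then K, then V. */
static void project_qkv(const layer_weights *l, const float *x,
                        float *q, float *k, float *v) {
    if (l->wqkv != NULL) {
        const float *wk_fused = l->wqkv + (size_t)l->q_dim * l->n_embd;
        const float *wv_fused = wk_fused + (size_t)l->kv_dim * l->n_embd;
        matvec(q, l->wqkv, x, l->q_dim, l->n_embd);
        matvec(k, wk_fused, x, l->kv_dim, l->n_embd);
        matvec(v, wv_fused, x, l->kv_dim, l->n_embd);
        return; /* wk and wv are NULL here and must not be touched */
    }
    /* Unfused path, as used by models with separate Q/K/V tensors. */
    assert(l->wq != NULL && l->wk != NULL && l->wv != NULL);
    matvec(q, l->wq, x, l->q_dim, l->n_embd);
    matvec(k, l->wk, x, l->kv_dim, l->n_embd);
    matvec(v, l->wv, x, l->kv_dim, l->n_embd);
}
```

Keeping the unfused branch unchanged and returning early from the fused branch is one way to limit the blast radius for models that never set the fused tensor.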
Previous Attempt
PR #71 attempted this but caused a SmolLM2 regression: the K/V projection skip interacted badly with existing code paths. The port needs to be done one component at a time, with regression tests after each step.
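One possible shape for such a regression test is to compare logits from the quant.h reference path with logits from the ported path for the same prompt. The helpers below are a hypothetical sketch, not existing project code. The tolerance is an assumption that would need tuning per quantization format.

```c
#include <float.h>
#include <stddef.h>

/* Largest absolute element-wise difference between two logit vectors.
 * Returns FLT_MAX if any difference is NaN, so garbage output from a
 * broken forward pass can never pass the check below. */
static float max_abs_diff(const float *ref, const float *got, size_t n) {
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = ref[i] - got[i];
        if (d != d) {
            return FLT_MAX; /* NaN compares unequal to itself */
        }
        if (d < 0.0f) {
            d = -d;
        }
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}

/* Returns 1 when the ported path agrees with the reference within tol. */
static int logits_match(const float *ref, const float *got, size_t n,
                        float tol) {
    return max_abs_diff(ref, got, n) <= tol;
}
```

Running this for both SmolLM2 and Phi-3.5 after each ported component would surface a regression like the one from PR #71 at the step that introduced it.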
Recommended Approach
1. Port K/V skip only → test SmolLM2 + Phi-3.5
2. Port FFN fused path → test
3. Port LongRoPE → test
4. Metal dispatch → benchmark
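For the LongRoPE step, the NeoX-style rotation can be unit-tested in isolation before it is wired into the forward pass. The sketch below assumes the per-position cos/sin tables are built elsewhere, with the LongRoPE per-dimension rescale factors and any magnitude scaling already applied. `rope_neox_apply` and its parameters are hypothetical names, not the functions in quant.h.

```c
#include <assert.h>

/* NeoX-style rotary embedding for one attention head, in place.
 * Element i is paired with element i + head_dim/2, unlike the
 * interleaved style that pairs (2i, 2i+1).
 * cos_t and sin_t each hold head_dim/2 values for the current position;
 * they are assumed to already include the LongRoPE frequency factors. */
static void rope_neox_apply(float *x, int head_dim,
                            const float *cos_t, const float *sin_t) {
    assert(head_dim % 2 == 0);
    int half = head_dim / 2;
    for (int i = 0; i < half; i++) {
        float a = x[i];
        float b = x[i + half];
        x[i]        = a * cos_t[i] - b * sin_t[i];
        x[i + half] = a * sin_t[i] + b * cos_t[i];
    }
}
```

A test that feeds the same vector and tables to this routine and to the quant.h implementation would settle the "untested" status without needing a full model run.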
Impact
Metal GPU acceleration for Phi-3.5 could improve speed from 6.5 to ~15-20 tok/s on Apple Silicon (3-4B models benefit from Metal, unlike 1B).
Priority: P3
This is blocked by #85 (Single Source of Truth) — once that's done, porting is automatic.