Port Phi-3 fused QKV/FFN from quant.h to libturboquant (enables Metal GPU) #91

@unamedkr

Summary

Complete the Phi-3 architecture port from quant.h to src/engine/*.c so that quant-server (libturboquant) can serve Phi-3.5 with Metal GPU acceleration.

Current State

  • quant.h: Phi-3.5 works perfectly (6.5 tok/s CPU NEON)
  • libturboquant: Phi-3.5 crashes or produces garbage
  • Workaround: quant-server-unified (compiles quant.h directly, no GPU)

What Needs Porting

| Feature | quant.h | src/engine/ | Status |
|---|---|---|---|
| Fused attn_qkv detection | | Partial | loader works, forward crashes |
| Fused ffn_up_gate detection | | Partial | loader works |
| LongRoPE NeoX rotation | | Written | untested |
| K/V projection skip for fused QKV | | Missing | root cause of crash |
| State allocation (xb2, hb sizing) | | Written | needs verification |
| BOS token handling | | Written | untested |
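To make the "K/V projection skip" row concrete, here is a minimal sketch of how a forward pass might branch on a fused QKV weight. It is not the code in quant.h or src/engine/. The struct `attn_weights`, the helpers `matvec` and `project_qkv`, and all field names are hypothetical, and the Q-then-K-then-V row order of the fused tensor is an assumption based on the usual Phi-3 layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-layer attention weights; names are illustrative only. */
typedef struct {
    int dim;            /* model (input) dimension */
    int q_dim;          /* n_heads * head_dim */
    int kv_dim;         /* n_kv_heads * head_dim */
    const float *w_qkv; /* fused [q_dim + 2*kv_dim, dim] weight, or NULL */
    const float *w_q;   /* separate weights, used only when w_qkv == NULL */
    const float *w_k;
    const float *w_v;
} attn_weights;

/* out[rows] = W[rows x cols] * x[cols], with W stored row-major. */
static void matvec(float *out, const float *w, const float *x,
                   int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++) {
            acc += w[(size_t)r * cols + c] * x[c];
        }
        out[r] = acc;
    }
}

/* Fills qkv (q_dim + 2*kv_dim floats) and returns views into it.
 * Assumed layout: Q rows first, then K rows, then V rows. */
static void project_qkv(const attn_weights *w, const float *x, float *qkv,
                        float **q, float **k, float **v) {
    assert(w != NULL && x != NULL && qkv != NULL);
    *q = qkv;
    *k = qkv + w->q_dim;
    *v = qkv + w->q_dim + w->kv_dim;

    if (w->w_qkv != NULL) {
        /* Fused path: one matmul produces Q, K and V together, so the
         * separate K/V projections must be skipped entirely. */
        matvec(qkv, w->w_qkv, x, w->q_dim + 2 * w->kv_dim, w->dim);
        return;
    }

    /* Non-fused path: unchanged behaviour for models with separate
     * Q/K/V weights. */
    matvec(*q, w->w_q, x, w->q_dim, w->dim);
    matvec(*k, w->w_k, x, w->kv_dim, w->dim);
    matvec(*v, w->w_v, x, w->kv_dim, w->dim);
}
```

Keeping the fused check in one place, with the separate-weight branch left untouched, is one way to limit the risk of the skip leaking into code paths used by non-fused models.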

Previous Attempt

PR #71 attempted this but caused a SmolLM2 regression: the K/V projection skip interacted badly with existing code paths. The port needs to be done one component at a time, with regression tests after each step.

Recommended Approach

  1. Port K/V skip only → test SmolLM2 + Phi-3.5
  2. Port FFN fused path → test
  3. Port LongRoPE → test
  4. Metal dispatch → benchmark
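The "test" after each step could be as simple as comparing the ported path's logits against quant.h for the same prompt. The sketch below shows one possible check. The helper names `logits_match` and `argmax` are hypothetical, and the tolerance is an assumption that would need tuning, since quantized CPU and GPU kernels are unlikely to agree bit for bit.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every element of got is within tol of ref, else 0.
 * The comparison is written so that a NaN in either vector fails. */
static int logits_match(const float *ref, const float *got, size_t n,
                        float tol) {
    for (size_t i = 0; i < n; i++) {
        float d = ref[i] - got[i];
        if (d < 0.0f) {
            d = -d;
        }
        if (!(d <= tol)) {
            return 0; /* out of tolerance, or NaN */
        }
    }
    return 1;
}

/* Index of the largest logit, i.e. the greedy next token.
 * Ties resolve to the lowest index. */
static size_t argmax(const float *v, size_t n) {
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (v[i] > v[best]) {
            best = i;
        }
    }
    return best;
}
```

A regression run would feed the same tokens to both implementations for SmolLM2 and Phi-3.5, then require `logits_match` to pass and `argmax` to agree at every position before moving on to the next step.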

Impact

Metal GPU acceleration for Phi-3.5 could improve speed from 6.5 to ~15-20 tok/s on Apple Silicon (3-4B models benefit from Metal, unlike 1B).

Priority: P3

This is blocked by #85 (Single Source of Truth) — once that's done, porting is automatic.
