
Port Phi-3 architecture support to libturboquant + Qwen3.5 issues #71

Merged
unamedkr merged 1 commit into main from fix/qwen35-quant-ask
Apr 12, 2026
Conversation

@unamedkr
Collaborator

Summary

Changes

| File | What |
| --- | --- |
| `src/engine/tq_model.c` | Fused attn_qkv detection, LongRoPE factor loading, fused gate\|\|up FFN |
| `src/engine/tq_transformer.c` | Fused QKV matmul + split, NeoX-style LongRoPE, fused FFN path, expanded state allocation |
| `src/engine/tq_generate.c` | Phi-3 BOS token handling |
| `src/engine/tq_tokenizer.c` | `<s>` BOS lookup chain |
| `src/server/tq_server.c` | Phi-3 chat template support |
| `include/turboquant/tq_engine.h` | New fields for fused weights and LongRoPE config |
| `bindings/python/quantcpp/cli.py` | Phi-3.5 default model + alias updates |
| `quant.h` | Minor improvements |

Test plan

  • 35/35 unit tests pass (ctest --test-dir build-metal)
  • Phi-3.5 loader: 32 self_attn + LongRoPE detected correctly
  • Phi-3.5 server inference: crashes during forward pass (segfault in fused QKV path — needs buffer size debugging)
  • Qwen3.5-4B CLI inference: coherent output ("I am Qwen3.5...")
  • Qwen3.5-4B server inference: empty tokens via ctypes (see #69: quant_generate works but quant_ask produces empty/garbage output)

Known issues (tracked separately)

🤖 Generated with Claude Code

…rces

Ports the Phi-3/Phi-3.5 architecture support from quant.h (PR #65)
to the split source files used by libturboquant and quant-server.

Changes:
- tq_model.c: fused attn_qkv detection, LongRoPE factor loading,
  fused gate||up FFN detection
- tq_transformer.c: fused QKV matmul + split, NeoX-style LongRoPE
  rotation, fused gate||up FFN path, expanded state allocation
- tq_generate.c: Phi-3 BOS token handling
- tq_tokenizer.c: <s> BOS lookup
- tq_server.c: Phi-3 chat template support
- tq_engine.h: new fields for fused weights and LongRoPE config
- cli.py: Phi-3.5 default model + alias updates

quant-server now detects Phi-3.5 correctly:
  loaded 32 layers (32 self_attn) + LongRoPE

Note: server crashes during inference (segfault in forward pass).
The fused QKV → split memcpy or LongRoPE computation likely has
a buffer size issue in the server path. Tracked in #67.

35/35 unit tests still pass.

Fixes #67 (partial — loader works, inference needs debugging)
Refs #69, #70

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit 08e8661 into main Apr 12, 2026
2 of 3 checks passed
@unamedkr
Collaborator Author

Update: Reverted — Phi-3 port caused regression in SmolLM2

The Phi-3 port to libturboquant caused a regression: SmolLM2-1.7B (previously working) started crashing on the same server build. The fused QKV skip logic in self_attn_forward interacted badly with the existing K/V projection code paths.

What worked

  • Phi-3.5 loader: 32 self_attn detected correctly
  • Phi-3.5 server: no longer crashes (fused QKV skip fix)
  • Phi-3 chat template detection in server

What broke

  • SmolLM2-1.7B: crashed after the changes (regression)
  • Phi-3.5: output was garbage (FFN or RoPE issue in server path)

Root cause

The ported changes touched too many code paths simultaneously. The split source tq_transformer.c has subtle differences from quant.h (state management, buffer allocation, Metal dispatch) that make a direct port non-trivial.

Recommendation

This PR should be closed. The Phi-3 port needs to be done more carefully:

  1. Port one component at a time (loader → RoPE → QKV → FFN)
  2. Verify SmolLM2 regression after each step
  3. Add a Phi-3.5 end-to-end test to CI

The Python quant.h-based server (phi35_server.py) remains a working workaround for Phi-3.5 serving.



