
Port Phi-3 architecture support to libturboquant + Qwen3.5 issues #71

Merged
unamedkr merged 1 commit into main from fix/qwen35-quant-ask
Apr 12, 2026
Conversation

@unamedkr
Collaborator

Summary

Changes

| File | What |
| --- | --- |
| `src/engine/tq_model.c` | Fused attn_qkv detection, LongRoPE factor loading, fused gate\|\|up FFN |
| `src/engine/tq_transformer.c` | Fused QKV matmul + split, NeoX-style LongRoPE, fused FFN path, expanded state allocation |
| `src/engine/tq_generate.c` | Phi-3 BOS token handling |
| `src/engine/tq_tokenizer.c` | `<s>` BOS lookup chain |
| `src/server/tq_server.c` | Phi-3 chat template support |
| `include/turboquant/tq_engine.h` | New fields for fused weights and LongRoPE config |
| `bindings/python/quantcpp/cli.py` | Phi-3.5 default model + alias updates |
| `quant.h` | Minor improvements |

Test plan

  • 35/35 unit tests pass (ctest --test-dir build-metal)
  • Phi-3.5 loader: 32 self_attn + LongRoPE detected correctly
  • Phi-3.5 server inference: crashes during forward pass (segfault in fused QKV path — needs buffer size debugging)
  • Qwen3.5-4B CLI inference: coherent output ("I am Qwen3.5...")
  • Qwen3.5-4B server inference: empty tokens via ctypes (see #69: quant_generate works but quant_ask produces empty/garbage output)

Known issues (tracked separately)

🤖 Generated with Claude Code

…rces

Ports the Phi-3/Phi-3.5 architecture support from quant.h (PR #65)
to the split source files used by libturboquant and quant-server.

Changes:
- tq_model.c: fused attn_qkv detection, LongRoPE factor loading,
  fused gate||up FFN detection
- tq_transformer.c: fused QKV matmul + split, NeoX-style LongRoPE
  rotation, fused gate||up FFN path, expanded state allocation
- tq_generate.c: Phi-3 BOS token handling
- tq_tokenizer.c: <s> BOS lookup
- tq_server.c: Phi-3 chat template support
- tq_engine.h: new fields for fused weights and LongRoPE config
- cli.py: Phi-3.5 default model + alias updates

quant-server now detects Phi-3.5 correctly:
  loaded 32 layers (32 self_attn) + LongRoPE

Note: server crashes during inference (segfault in forward pass).
The fused QKV → split memcpy or LongRoPE computation likely has
a buffer size issue in the server path. Tracked in #67.

35/35 unit tests still pass.

Fixes #67 (partial — loader works, inference needs debugging)
Refs #69, #70

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit 08e8661 into main Apr 12, 2026
2 of 3 checks passed
@unamedkr
Collaborator Author

Update: Reverted — Phi-3 port caused regression in SmolLM2

The Phi-3 port to libturboquant caused a regression: SmolLM2-1.7B (previously working) started crashing on the same server build. The fused QKV skip logic in self_attn_forward interacted badly with the existing K/V projection code paths.

What worked

  • Phi-3.5 loader: 32 self_attn detected correctly
  • Phi-3.5 server: no longer crashes (fused QKV skip fix)
  • Phi-3 chat template detection in server

What broke

  • SmolLM2-1.7B: crashed after the changes (regression)
  • Phi-3.5: output was garbage (FFN or RoPE issue in server path)

Root cause

The ported changes touched too many code paths simultaneously. The split source tq_transformer.c has subtle differences from quant.h (state management, buffer allocation, Metal dispatch) that make a direct port non-trivial.

Recommendation

This PR should be closed. The Phi-3 port needs to be done more carefully:

  1. Port one component at a time (loader → RoPE → QKV → FFN)
  2. Verify SmolLM2 regression after each step
  3. Add a Phi-3.5 end-to-end test to CI

The Python quant.h-based server (phi35_server.py) remains a working workaround for Phi-3.5 serving.



