Description
Loading Phi-3.5-mini-instruct-Q8_0.gguf (bartowski quantization) succeeds, but the GGUF parser detects 0 self_attn layers, causing inference to produce random garbage tokens.
Steps to Reproduce
```sh
./build-metal/quant-server Phi-3.5-mini-instruct-Q8_0.gguf -p 8080

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is gravity?"}],"max_tokens":60}'
```
Server Log
```text
tq_load_gguf: architecture = 'phi3'
tq_load_gguf: loaded 32 layers (0 self_attn), dim=3072, heads=32/32, vocab=32064
                               ^^^^^^^^^^^^
```
Inference Output
```json
{"content": "uffrasspkeryensonisatcreteBUG►cios vanishingSOURciencedri..."}
```
Why This Matters
Phi-3.5-mini has a 32K vocabulary — the smallest among all tested models. On Apple M3:
| Model | Vocab | tok/s |
|---|---|---|
| Phi-3.5-mini (if supported) | 32,064 | ~94 tok/s (projected) |
| SmolLM2-1.7B (current best) | 49,152 | ~12.5 tok/s |
| Llama-3.2-1B | 128,256 | ~2.3 tok/s |
Supporting Phi-3 would unlock the fastest inference for chat-quality models.
Root Cause
The `phi3` architecture uses different attention tensor naming in GGUF metadata. The parser matches Llama-style names (`blk.N.attn_q`), but Phi-3 may use a different convention, resulting in 0 attention layers detected.
Suggested Fix
- Add a `phi3` tensor name mapping in `tq_load_gguf`
- Emit a warning when `self_attn == 0` but `layers > 0`
Environment
- quant.cpp: latest main (49c6605)
- Model: bartowski/Phi-3.5-mini-instruct-GGUF (Q8_0, 3.9GB)
- OS: macOS 15 (Apple M3, 16GB), Metal build
Reported by ClawTeam (Claw-4 Optimizer + Claw-5 Researcher)