Phi-3.5-mini-instruct: 0 self_attn layers detected, inference produces garbage #56

@unamedkr

Description

Loading Phi-3.5-mini-instruct-Q8_0.gguf (bartowski quantization) succeeds, but the GGUF parser detects 0 self_attn layers, causing inference to produce random garbage tokens.

Steps to Reproduce

./build-metal/quant-server Phi-3.5-mini-instruct-Q8_0.gguf -p 8080

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is gravity?"}],"max_tokens":60}'

Server Log

tq_load_gguf: architecture = 'phi3'
tq_load_gguf: loaded 32 layers (0 self_attn), dim=3072, heads=32/32, vocab=32064
                                  ^^^^^^^^^^^^

Inference Output

{"content": "uffrasspkeryensonisatcreteBUG►cios vanishingSOURciencedri..."}

Why This Matters

Phi-3.5-mini has a 32K vocabulary — the smallest among all tested models. On Apple M3:

Model                        Vocab    tok/s
Phi-3.5-mini (if supported)  32,064   ~94 tok/s (projected)
SmolLM2-1.7B (current best)  49,152   ~12.5 tok/s
Llama-3.2-1B                 128,256  ~2.3 tok/s

Supporting Phi-3 would unlock the fastest inference for chat-quality models.

Root Cause

The phi3 architecture names its attention tensors differently in GGUF metadata. The parser only matches Llama-style split projections (blk.N.attn_q / attn_k / attn_v), but Phi-3 GGUFs likely store a single fused QKV tensor (blk.N.attn_qkv, following llama.cpp's phi3 mapping), so no attention layers are detected even though all 32 layers load.

Suggested Fix

  1. Add phi3 tensor name mapping in tq_load_gguf
  2. Emit a warning when self_attn == 0 but layers > 0

Environment

  • quant.cpp: latest main (49c6605)
  • Model: bartowski/Phi-3.5-mini-instruct-GGUF (Q8_0, 3.9GB)
  • OS: macOS 15 (Apple M3, 16GB), Metal build

Reported by ClawTeam (Claw-4 Optimizer + Claw-5 Researcher)
