Description
Loading Phi-3.5-mini-instruct-Q8_0.gguf (bartowski quantization) succeeds, but the GGUF parser detects 0 self_attn layers, causing inference to produce random garbage tokens.
Steps to Reproduce
```sh
./build-metal/quant-server Phi-3.5-mini-instruct-Q8_0.gguf -p 8080

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is gravity?"}],"max_tokens":60}'
```
Server Log
```text
tq_load_gguf: architecture = 'phi3'
tq_load_gguf: loaded 32 layers (0 self_attn), dim=3072, heads=32/32, vocab=32064
                               ^^^^^^^^^^^^
```
Inference Output
```json
{"content": "uffrasspkeryensonisatcreteBUG►cios vanishingSOURciencedri..."}
```
Why This Matters
Phi-3.5-mini has a 32K vocabulary — the smallest among all tested models. On Apple M3:
| Model | Vocab | tok/s |
|---|---|---|
| Phi-3.5-mini (if supported) | 32,064 | ~94 tok/s (projected) |
| SmolLM2-1.7B (current best) | 49,152 | ~12.5 tok/s |
| Llama-3.2-1B | 128,256 | ~2.3 tok/s |
Supporting Phi-3 would unlock the fastest inference for chat-quality models.
Root Cause
The `phi3` architecture uses different attention tensor naming in GGUF metadata. The parser matches Llama-style names (`blk.N.attn_q`), but Phi-3 may use a different convention, resulting in 0 attention layers detected.
Suggested Fix
- Add a `phi3` tensor name mapping in `tq_load_gguf`
- Emit a warning when `self_attn == 0` but `layers > 0`
Environment
- quant.cpp: latest main (49c6605)
- Model: bartowski/Phi-3.5-mini-instruct-GGUF (Q8_0, 3.9GB)
- OS: macOS 15 (Apple M3, 16GB), Metal build
Reported by ClawTeam (Claw-4 Optimizer + Claw-5 Researcher)