
feat(feedback): Quick Wins from 2026-04-12 external user report#59

Merged
unamedkr merged 1 commit into main from feat/feedback-quick-wins-2026-04-12
Apr 12, 2026

Conversation

@unamedkr
Collaborator

Summary

Acts on the external user feedback in docs/feedback/2026-04-12_0900.md. Four scoped fixes targeting the highest-impact items. Phi-3 architecture support is intentionally deferred to a separate PR (it needs prototyping).

Bugs / improvements addressed

| ID | Item | Severity | Notes |
|------|------|----------|-------|
| P0-A | SmolLM2-1.7B as recommended default | High | Tester measured 5x speedup vs Llama-3.2-1B on M3 (vocab 49K vs 128K) |
| P0-B | Hard-fail load on unsupported architecture | High | Phi-3 was loading silently, returning a model that produced garbage |
| P0-C | ChatML template marker filter | High | `<|im_start|>` etc. were leaking into chat output |
| P1-C | docs/supported_models.md | Medium | Architecture matrix + vocab-size guidance |

P0-A — SmolLM2-1.7B default

Same llama arch family as the existing models, just bigger. The lm_head matmul (vocab × hidden_dim per token) is the bottleneck on CPU — fewer params don't help if the vocab is bigger. External tester benchmark on Apple M3:

| Model | vocab | tok/s |
|-------|-------|-------|
| SmolLM2-1.7B (Q8) | 49K | ~12.5 |
| Llama-3.2-1B (Q4_K_M) | 128K | ~2.3 |
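The vocab effect on the lm_head matmul is easy to see with back-of-envelope arithmetic. A minimal sketch (the `hidden_dim` value of 2048 is an illustrative assumption for both models, not a figure from this PR):

```python
# Per generated token, lm_head is one (hidden_dim x vocab) matmul,
# i.e. roughly 2 * hidden_dim * vocab multiply-adds.
def lm_head_flops(hidden_dim: int, vocab: int) -> int:
    return 2 * hidden_dim * vocab

smollm2 = lm_head_flops(2048, 49_152)    # ~201M FLOPs per token
llama32 = lm_head_flops(2048, 128_256)   # ~525M FLOPs per token
print(f"lm_head cost ratio: ~{llama32 / smollm2:.1f}x")
```

At equal hidden size the lm_head cost scales linearly with vocab, which is why the 128K-vocab model pays roughly 2.6x more per token in that one matmul alone, on top of any other differences.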
  • Added to _MODEL_REGISTRY in quantcpp/__init__.py
  • New CLI alias smollm2:1.7b. Bare smollm2 now points to 1.7B (was 135M); the 135M demo is now smollm2:135m.
  • cmd_chat_default switched
  • Module + class docstrings + CLI help epilog updated

P0-B — Hard-fail on unsupported architecture

Previously: loading a Phi-3 GGUF reported loaded N layers (0 self_attn) in the success log and returned a model that ran the forward pass against zero-initialized attention weights. The user got pages of garbage tokens with no clear error to debug.

Now: when tq_load_gguf finishes a model with zero standard self_attn layers AND no DeltaNet weights, it logs a clear ERROR naming the architecture and returns NULL.

```
tq_load_gguf: ERROR — model architecture 'phi3' is not supported.
  Detected 0 self_attn layers and no DeltaNet weights.
  This usually means the model uses fused QKV projection
  (e.g., Phi-3 `attn_qkv`) which quant.cpp does not yet handle.
  See docs/supported_models.md for the architecture support matrix.
```

tq_free_model is used for cleanup so we don't leak the partial state.
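The validation logic itself is simple. A minimal Python sketch of the check described above (the real code is C inside `tq_load_gguf`; the dict fields here are hypothetical stand-ins for the loader's internal state):

```python
def validate_loaded_model(model: dict):
    """Return the model, or None if no usable attention weights were loaded."""
    # Zero standard self_attn layers AND no DeltaNet weights means the
    # architecture (e.g. Phi-3's fused attn_qkv) was not recognized.
    if model["self_attn_layers"] == 0 and not model["has_deltanet"]:
        print(f"tq_load_gguf: ERROR — model architecture "
              f"'{model['arch']}' is not supported.")
        return None  # caller sees the failure immediately
    return model

phi3 = {"arch": "phi3", "self_attn_layers": 0, "has_deltanet": False}
llama = {"arch": "llama", "self_attn_layers": 16, "has_deltanet": False}
assert validate_loaded_model(phi3) is None
assert validate_loaded_model(llama) is llama
```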

P0-C — ChatML marker filter

External tester reported `<|im_start|>`, `<|im_end|>`, etc. leaking into chat output. Root cause: BPE tokenizers fragment these markers across multiple tokens, so the existing per-token strstr check in the generation loop never matches.

Fix: a 32-byte lookahead filter inside chat_accum_callback. The filter buffers the most recent text, scans for known markers, and:

  • `<|im_start|>` at the very start of the response → strip the `<|im_start|>assistant\n` header (model is echoing the chat prompt)
  • any END marker (`<|im_end|>`, `<|eot_id|>`, `<end_of_turn>`, `<|endoftext|>`, `<|im_start|>` mid-response, `<|start_header_id|>`, `<|eom_id|>`) → emit clean prefix, set `stop_requested`; the fast-path loop checks the flag and breaks

Cost: ~32 bytes of in-flight streaming buffer (small latency penalty before first token; steady state is unchanged).

quant.h and src/engine/tq_generate.c kept in lockstep (filter mirrored byte-for-byte).
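The buffering idea can be sketched in a few lines. This is an illustrative Python model of the lookahead filter, not the C implementation in `chat_accum_callback`; the marker list is abbreviated:

```python
LOOKAHEAD = 32
MARKERS = ["<|im_end|>", "<|eot_id|>", "<end_of_turn>", "<|endoftext|>"]

class MarkerFilter:
    def __init__(self):
        self.buf = ""       # in-flight tail that may hold a partial marker
        self.out = []       # text already safe to emit
        self.stopped = False

    def feed(self, token_text: str):
        if self.stopped:
            return
        self.buf += token_text
        for m in MARKERS:
            i = self.buf.find(m)
            if i != -1:
                self.out.append(self.buf[:i])  # emit clean prefix
                self.stopped = True            # generation loop breaks on this
                return
        # Hold back the last LOOKAHEAD bytes: a marker split across BPE
        # tokens may still be completing there. Emit the rest.
        if len(self.buf) > LOOKAHEAD:
            self.out.append(self.buf[:-LOOKAHEAD])
            self.buf = self.buf[-LOOKAHEAD:]

    def flush(self) -> str:
        if not self.stopped:
            self.out.append(self.buf)
        return "".join(self.out)
```

Feeding `["Hello", " world", "<|im_", "end|>"]` emits only `"Hello world"`, even though no single token contains the full marker — exactly the multi-token case the per-token strstr check missed.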

Filter test results

A standalone harness drives the filter with simulated token streams:

[PASS] plain text
[PASS] header strip at start
[PASS] im_end mid-response stops
[PASS] multi-token im_end stops          ← the original complaint
[PASS] eot_id stops
[PASS] end_of_turn stops
[PASS] long stream
[PASS] im_start mid-response stops

8/8.
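The "multi-token im_end" case is the one worth internalizing. A self-contained demonstration of why the old per-token check could never catch it (simulated token stream, not the actual harness):

```python
# BPE splits "<|im_end|>" across tokens, so no single token contains it.
tokens = ["The answer", " is 4.", "<|im_", "end|>"]

# Old approach: substring check on each token individually — never matches.
per_token_hit = any("<|im_end|>" in t for t in tokens)

# New approach: scan the accumulated stream — matches once the halves join.
accumulated_hit = "<|im_end|>" in "".join(tokens)

assert per_token_hit is False
assert accumulated_hit is True
```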

P1-C — Supported models docs

docs/supported_models.md covers:

  • Recommended models for each use case
  • Architecture compatibility matrix (llama / gemma / qwen / phi3)
  • Why vocab size dominates speed (with the M3 benchmarks)
  • Why phi3 is hard (the fused QKV story)
  • How to report an unsupported model

Verification

  • ctest --test-dir build → 35/35 passed
  • cmake --build build → all targets clean (no new warnings)
  • wasm/build.sh → 320K bundle rebuilt
  • Standalone chat_accum filter test → 8/8 passed
  • Python available_models() returns SmolLM2-1.7B
  • quantcpp --help epilog reflects new defaults

Deferred (separate PRs)

  • P1-A Phi-3 support (attn_qkv / gate_up_proj) — needs a prototyping spike before committing to a design
  • P1-B Pure-Python server fallback — so quantcpp serve works without a CMake build
  • P2-A Server request queueing / 429 — needs queueing-policy decisions
  • 2.5/2.7 Gemma-4 / Qwen3.5 garbage — need the user's exact model file SHA256 to reproduce

Test plan

  • Unit tests pass
  • Filter harness passes
  • Manual: `quantcpp run smollm2` downloads SmolLM2-1.7B and chats cleanly
  • Manual: load a Phi-3.5-mini GGUF → confirms the new error message and NULL return
  • Manual: chat with a model that historically leaked `<|im_end|>` → confirms the marker is stripped from the stream

🤖 Generated with Claude Code

Four scoped fixes addressing the highest-impact items in
docs/feedback/2026-04-12_0900.md. Each is independently useful and
none is experimental — Phi-3 architecture support is intentionally
deferred to a separate PR.

## Changes

### P0-A — SmolLM2-1.7B as the recommended default

External tester measured SmolLM2-1.7B at ~12.5 tok/s vs Llama-3.2-1B at
~2.3 tok/s on Apple M3. Same llama arch family, but vocab 49K vs 128K.
The lm_head matmul (vocab × hidden_dim per token) is the bottleneck —
fewer params don't help if the vocab is bigger.

- Add SmolLM2-1.7B-Instruct (Q8) to `_MODEL_REGISTRY`
- Add `smollm2:1.7b` and bare `smollm2` aliases (the bare alias now
  points at 1.7B; users wanting the demo model ask for `smollm2:135m`)
- `cmd_chat_default` now uses SmolLM2-1.7B
- Module + class docstrings + CLI help epilog all updated to reflect
  the new recommendation

### P0-B — Hard-fail load on unsupported architecture

Previously: loading a Phi-3 GGUF reported `loaded N layers (0 self_attn)`
in the success log and returned a model that produced page after page of
garbage tokens. Phi-3 uses fused `attn_qkv` projection which the loader
doesn't recognize.

Now: when `tq_load_gguf` finishes a model with zero standard self_attn
layers AND no DeltaNet weights, it logs a clear ERROR naming the
architecture and returns NULL. Callers see the failure immediately
instead of debugging garbage output.

```
tq_load_gguf: ERROR — model architecture 'phi3' is not supported.
  Detected 0 self_attn layers and no DeltaNet weights.
  ...
```

### P0-C — ChatML template marker filter

External tester reported `<|im_start|>`, `<|im_end|>`, stray `assistant`
header text, etc. leaking into chat output. Root cause: BPE tokenizers
fragment these markers across multiple tokens, so the existing per-token
strstr check in the generation loop never matches.

Fix: a 32-byte lookahead filter inside `chat_accum_callback`. The filter
buffers the most recent text, scans for known markers, and:

- `<|im_start|>` at the very start of the response → strip the
  `<|im_start|>assistant\n` header (model is echoing the chat prompt)
- any END marker (`<|im_end|>`, `<|eot_id|>`, `<end_of_turn>`,
  `<|endoftext|>`, `<|im_start|>` mid-response, `<|start_header_id|>`,
  `<|eom_id|>`) → emit clean prefix, set `stop_requested`, fast-path
  loop checks the flag and breaks

Streaming latency cost: ~CHAT_LOOKAHEAD bytes (32) of in-flight buffer.
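The stop handshake between callback and loop can be sketched as follows. Names here (`GenState`, `generate`, `on_token`) are illustrative, not the actual quant.cpp API:

```python
# The streaming callback sets a flag; the generation loop tests it
# before committing the next token — no extra scan on the hot path.
class GenState:
    def __init__(self):
        self.stop_requested = False

def generate(state: GenState, tokens, on_token):
    emitted = []
    for t in tokens:
        on_token(state, t)        # filter runs here; may set stop_requested
        if state.stop_requested:  # fast-path check: break immediately
            break
        emitted.append(t)
    return emitted

state = GenState()
def cb(s, t):
    if t == "<|im_end|>":
        s.stop_requested = True

assert generate(state, ["a", "b", "<|im_end|>", "c"], cb) == ["a", "b"]
```

The flag check costs one branch per token, so steady-state throughput is unaffected; only the ~32 bytes of lookahead buffering adds any latency, and only before the first emitted token.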

Verified by a standalone harness that drives the filter with simulated
token streams (8 cases including BPE-split markers — all pass).

### P1-C — `docs/supported_models.md`

New page documenting the architecture compatibility matrix, the vocab
size → speed relationship, why Phi-3 is hard, and how to report a
broken model. Linked from the feedback file.

## Verified

- ctest --test-dir build → 35/35 passed
- cmake --build build → all targets clean (no new warnings)
- wasm/build.sh → 320K bundle rebuilt
- Standalone chat_accum filter test → 8/8 passed
- Python `from quantcpp import Model` + `available_models()` works
- `quantcpp --help` epilog reflects new defaults

quant.h and src/engine/tq_generate.c kept in lockstep (filter logic
mirrored byte-for-byte).

## Deferred

- Phi-3 (`attn_qkv` / `gate_up_proj`) loader support — separate PR with
  prototype + validation gate
- Server fallback in pure Python (so `quantcpp serve` works without a
  CMake build) — separate PR
- Server request queueing / 429 — separate PR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit a7795a5 into main Apr 12, 2026
3 checks passed
@unamedkr unamedkr deleted the feat/feedback-quick-wins-2026-04-12 branch April 12, 2026 02:04
