feat(feedback): Quick Wins from 2026-04-12 external user report #59
Merged
Four scoped fixes addressing the highest-impact items in `docs/feedback/2026-04-12_0900.md`. Each is independently useful and none of it is experimental — Phi-3 architecture support is intentionally deferred to a separate PR.

## Changes

### P0-A — SmolLM2-1.7B as the recommended default

The external tester measured SmolLM2-1.7B at ~12.5 tok/s vs Llama-3.2-1B at ~2.3 tok/s on an Apple M3. Both are in the same llama arch family, but the vocabularies differ: 49K vs 128K. The lm_head matmul (vocab × hidden_dim per token) is the bottleneck — fewer params don't help if the vocab is bigger.

- Add SmolLM2-1.7B-Instruct (Q8) to `_MODEL_REGISTRY`
- Add `smollm2:1.7b` and bare `smollm2` aliases (the bare alias now points at 1.7B; users wanting the demo model ask for `smollm2:135m`)
- `cmd_chat_default` now uses SmolLM2-1.7B
- Module and class docstrings plus the CLI help epilog all updated to reflect the new recommendation

### P0-B — Hard-fail load on unsupported architecture

Previously: loading a Phi-3 GGUF reported `loaded N layers (0 self_attn)` in the success log and returned a model that produced page after page of garbage tokens. Phi-3 uses a fused `attn_qkv` projection which the loader doesn't recognize.

Now: when `tq_load_gguf` finishes a model with zero standard self_attn layers AND no DeltaNet weights, it logs a clear ERROR naming the architecture and returns NULL. Callers see the failure immediately instead of debugging garbage output.

```
tq_load_gguf: ERROR — model architecture 'phi3' is not supported. Detected 0 self_attn layers and no DeltaNet weights. ...
```

### P0-C — ChatML template marker filter

The external tester reported `<|im_start|>`, `<|im_end|>`, `<line>assistant`, etc. leaking into chat output. Root cause: BPE tokenizers fragment these markers across multiple tokens, so the existing per-token strstr check in the generation loop never matches.

Fix: a 32-byte lookahead filter inside `chat_accum_callback`.
The filter buffers the most recent text, scans for known markers, and:

- `<|im_start|>` at the very start of the response → strip the `<|im_start|>assistant\n` header (the model is echoing the chat prompt)
- any END marker (`<|im_end|>`, `<|eot_id|>`, `<end_of_turn>`, `<|endoftext|>`, `<|im_start|>` mid-response, `<|start_header_id|>`, `<|eom_id|>`) → emit the clean prefix and set `stop_requested`; the fast-path loop checks the flag and breaks

Streaming latency cost: ~CHAT_LOOKAHEAD bytes (32) of in-flight buffer. Verified by a standalone harness that drives the filter with simulated token streams (8 cases including BPE-split markers — all pass).

### P1-C — `docs/supported_models.md`

New page documenting the architecture compatibility matrix, the vocab size → speed relationship, why Phi-3 is hard, and how to report a broken model. Linked from the feedback file.

## Verified

- `ctest --test-dir build` → 35/35 passed
- `cmake --build build` → all targets clean (no new warnings)
- `wasm/build.sh` → 320K bundle rebuilt
- Standalone chat_accum filter test → 8/8 passed
- Python `from quantcpp import Model` + `available_models()` works
- `quantcpp --help` epilog reflects new defaults

`quant.h` and `src/engine/tq_generate.c` kept in lockstep (filter logic mirrored byte-for-byte).

## Deferred

- Phi-3 (`attn_qkv` / `gate_up_proj`) loader support — separate PR with prototype + validation gate
- Server fallback in pure Python (so `quantcpp serve` works without a CMake build) — separate PR
- Server request queueing / 429 — separate PR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
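For intuition on the P0-A numbers, here is a back-of-envelope sketch of the per-token lm_head cost. The ~49K vs ~128K vocab sizes come from the description above; the shared hidden_dim of 2048 is an illustrative assumption, not a value read from the GGUF files:

```python
# Per-token lm_head cost is one (vocab_size x hidden_dim) matvec.
# hidden_dim=2048 is an illustrative assumption, not read from the GGUFs.
def lm_head_macs(vocab_size: int, hidden_dim: int = 2048) -> int:
    return vocab_size * hidden_dim

smollm2 = lm_head_macs(49_000)    # ~49K vocab
llama32 = lm_head_macs(128_000)   # ~128K vocab
print(f"Llama-3.2 does {llama32 / smollm2:.1f}x the lm_head work per token")  # 2.6x
```

A ~2.6× heavier output projection on every decoded token, regardless of how small the rest of the network is — consistent with the smaller-vocab model decoding faster despite having more parameters.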
## Summary
Acts on the external user feedback in `docs/feedback/2026-04-12_0900.md`. Four scoped fixes targeting the highest-impact items. Phi-3 architecture support is intentionally deferred to a separate PR (it needs prototyping).

### Bugs / improvements addressed

- `<|im_start|>` etc. were leaking into chat output
- New page: `docs/supported_models.md`

### P0-A — SmolLM2-1.7B default
Same llama arch family as the existing models, just bigger. The lm_head matmul (`vocab × hidden_dim` per token) is the bottleneck on CPU — fewer params don't help if the vocab is bigger. External tester benchmark on Apple M3: SmolLM2-1.7B ~12.5 tok/s vs Llama-3.2-1B ~2.3 tok/s.

- Added SmolLM2-1.7B-Instruct to `_MODEL_REGISTRY` in `quantcpp/__init__.py` as `smollm2:1.7b`. Bare `smollm2` now points to 1.7B (was 135M); the 135M demo is now `smollm2:135m`.
- `cmd_chat_default` switched to SmolLM2-1.7B.

### P0-B — Hard-fail on unsupported architecture
Previously: loading a Phi-3 GGUF reported `loaded N layers (0 self_attn)` in the success log and returned a model that ran the forward pass against zero-initialized attention weights. The user got pages of garbage tokens with no clear error to debug.

Now: when `tq_load_gguf` finishes a model with zero standard self_attn layers AND no DeltaNet weights, it logs a clear ERROR naming the architecture and returns NULL. `tq_free_model` is used for cleanup so we don't leak the partial state.

### P0-C — ChatML marker filter
External tester reported `<|im_start|>`, `<|im_end|>`, etc. leaking into chat output. Root cause: BPE tokenizers fragment these markers across multiple tokens, so the existing per-token strstr check in the generation loop never matches.

Fix: a 32-byte lookahead filter inside `chat_accum_callback`. The filter buffers the most recent text, scans for known markers, and:

- `<|im_start|>` at the very start of the response → strip the `<|im_start|>assistant\n` header (the model is echoing the chat prompt)
- any END marker (`<|im_end|>`, `<|eot_id|>`, `<end_of_turn>`, `<|endoftext|>`, `<|im_start|>` mid-response, `<|start_header_id|>`, `<|eom_id|>`) → emit the clean prefix and set `stop_requested`; the fast-path loop checks the flag and breaks

Cost: ~32 bytes of in-flight streaming buffer (small latency penalty before first token; steady state is unchanged).
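For reviewers, the buffering idea can be sketched in Python. This is a simplified model of the end-marker path only — the header-strip rule, the full marker set, and the actual C implementation in `chat_accum_callback` are not reproduced here:

```python
LOOKAHEAD = 32  # bytes held in flight, matching the cost noted above
END_MARKERS = ("<|im_end|>", "<|eot_id|>", "<end_of_turn>", "<|endoftext|>")

class MarkerFilter:
    """Streams text through a small buffer so a marker split across
    several BPE tokens is still caught before it reaches the user."""

    def __init__(self) -> None:
        self.buf = ""
        self.stop_requested = False

    def feed(self, piece: str) -> str:
        """Accept one decoded token's text; return the text safe to emit."""
        if self.stop_requested:
            return ""
        self.buf += piece
        for marker in END_MARKERS:
            cut = self.buf.find(marker)
            if cut != -1:
                # Emit the clean prefix and ask the generation loop to stop.
                self.stop_requested = True
                clean, self.buf = self.buf[:cut], ""
                return clean
        # Hold back the last LOOKAHEAD bytes: they may be the start of a
        # marker whose remainder arrives in a later token.
        clean, self.buf = self.buf[:-LOOKAHEAD], self.buf[-LOOKAHEAD:]
        return clean

    def flush(self) -> str:
        """End of generation: release whatever is still buffered."""
        clean, self.buf = self.buf, ""
        return clean


# A BPE-split marker never reaches the output:
f = MarkerFilter()
out = "".join(f.feed(t) for t in ["Hello wor", "ld<|im_", "end|>junk"]) + f.flush()
print(out)               # Hello world
print(f.stop_requested)  # True
```

A plain per-token `strstr` sees only `"ld<|im_"` and `"end|>junk"`, neither of which contains a complete marker — which is exactly the failure mode described above.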
`quant.h` and `src/engine/tq_generate.c` kept in lockstep (filter mirrored byte-for-byte).

### Filter test results

A standalone harness drives the filter with simulated token streams: 8/8 passed.
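One of the cases the harness covers — stripping a prompt-echoed `<|im_start|>assistant\n` header — reduces in isolation to something like the following. This is a hypothetical standalone Python version for illustration; the real check lives inside the C filter:

```python
# ChatML assistant header that a prompt-echoing model emits first.
ASSISTANT_HEADER = "<|im_start|>assistant\n"

def strip_echoed_header(response: str) -> str:
    """Drop the header only when the response starts with it verbatim."""
    if response.startswith(ASSISTANT_HEADER):
        return response[len(ASSISTANT_HEADER):]
    return response

print(strip_echoed_header("<|im_start|>assistant\nHi!"))  # Hi!
print(strip_echoed_header("Hi!"))                         # Hi!
```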
### P1-C — Supported models docs

`docs/supported_models.md` covers the architecture compatibility matrix, the vocab size → speed relationship, why Phi-3 is hard, and how to report a broken model.

### Verification
- `ctest --test-dir build` → 35/35 passed
- `cmake --build build` → all targets clean (no new warnings)
- `wasm/build.sh` → 320K bundle rebuilt
- `available_models()` returns SmolLM2-1.7B
- `quantcpp --help` epilog reflects new defaults

### Deferred (separate PRs)
- Phi-3 loader support (`attn_qkv` / `gate_up_proj`) — needs a prototyping spike before committing to a design
- Server fallback in pure Python, so `quantcpp serve` works without a CMake build

### Test plan
🤖 Generated with Claude Code