
feat(feedback): Quick Wins from 2026-04-12 external user report#59

Merged
unamedkr merged 1 commit into main from feat/feedback-quick-wins-2026-04-12
Apr 12, 2026

Conversation

@unamedkr
Collaborator

Summary

Acts on the external user feedback in docs/feedback/2026-04-12_0900.md. Four scoped fixes targeting the highest-impact items. Phi-3 architecture support is intentionally deferred to a separate PR (it needs prototyping).

Bugs / improvements addressed

| ID | Item | Severity | Notes |
|------|------|----------|-------|
| P0-A | SmolLM2-1.7B as recommended default | High | Tester measured 5x speedup vs Llama-3.2-1B on M3 (vocab 49K vs 128K) |
| P0-B | Hard-fail load on unsupported architecture | High | Phi-3 was loading silently, returning a model that produced garbage |
| P0-C | ChatML template marker filter | High | `<|im_start|>` etc. were leaking into chat output |
| P1-C | docs/supported_models.md | Medium | Architecture matrix + vocab-size guidance |

P0-A — SmolLM2-1.7B default

Same llama arch family as the existing models, just bigger. The lm_head matmul (vocab × hidden_dim per token) is the bottleneck on CPU — fewer params don't help if the vocab is bigger. External tester benchmark on Apple M3:

| Model | vocab | tok/s |
|-------|-------|-------|
| SmolLM2-1.7B (Q8) | 49K | ~12.5 |
| Llama-3.2-1B (Q4_K_M) | 128K | ~2.3 |
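The vocab effect on the lm_head matmul is easy to see with back-of-envelope arithmetic. A minimal sketch (the `hidden_dim` value of 2048 is an illustrative assumption for both models, not a figure from this PR):

```python
# Per generated token, lm_head is one (hidden_dim x vocab) matmul,
# i.e. roughly 2 * hidden_dim * vocab multiply-adds.
def lm_head_flops(hidden_dim: int, vocab: int) -> int:
    return 2 * hidden_dim * vocab

smollm2 = lm_head_flops(2048, 49_152)    # ~201M FLOPs per token
llama32 = lm_head_flops(2048, 128_256)   # ~525M FLOPs per token
print(f"lm_head cost ratio: ~{llama32 / smollm2:.1f}x")
```

At equal hidden size the lm_head cost scales linearly with vocab, which is why the 128K-vocab model pays roughly 2.6x more per token in that one matmul alone, on top of any other differences.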
  • Added to _MODEL_REGISTRY in quantcpp/__init__.py
  • New CLI alias smollm2:1.7b. Bare smollm2 now points to 1.7B (was 135M); the 135M demo is now smollm2:135m.
  • cmd_chat_default switched
  • Module + class docstrings + CLI help epilog updated

P0-B — Hard-fail on unsupported architecture

Previously: loading a Phi-3 GGUF reported loaded N layers (0 self_attn) in the success log and returned a model that ran the forward pass against zero-initialized attention weights. The user got pages of garbage tokens with no clear error to debug.

Now: when tq_load_gguf finishes a model with zero standard self_attn layers AND no DeltaNet weights, it logs a clear ERROR naming the architecture and returns NULL.

```
tq_load_gguf: ERROR — model architecture 'phi3' is not supported.
  Detected 0 self_attn layers and no DeltaNet weights.
  This usually means the model uses fused QKV projection
  (e.g., Phi-3 `attn_qkv`) which quant.cpp does not yet handle.
  See docs/supported_models.md for the architecture support matrix.
```

tq_free_model is used for cleanup so we don't leak the partial state.
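The validation logic itself is simple. A minimal Python sketch of the check described above (the real code is C inside `tq_load_gguf`; the dict fields here are hypothetical stand-ins for the loader's internal state):

```python
def validate_loaded_model(model: dict):
    """Return the model, or None if no usable attention weights were loaded."""
    # Zero standard self_attn layers AND no DeltaNet weights means the
    # architecture (e.g. Phi-3's fused attn_qkv) was not recognized.
    if model["self_attn_layers"] == 0 and not model["has_deltanet"]:
        print(f"tq_load_gguf: ERROR — model architecture "
              f"'{model['arch']}' is not supported.")
        return None  # caller sees the failure immediately
    return model

phi3 = {"arch": "phi3", "self_attn_layers": 0, "has_deltanet": False}
llama = {"arch": "llama", "self_attn_layers": 16, "has_deltanet": False}
assert validate_loaded_model(phi3) is None
assert validate_loaded_model(llama) is llama
```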

P0-C — ChatML marker filter

External tester reported `<|im_start|>`, `<|im_end|>`, etc. leaking into chat output. Root cause: BPE tokenizers fragment these markers across multiple tokens, so the existing per-token strstr check in the generation loop never matches.

Fix: a 32-byte lookahead filter inside chat_accum_callback. The filter buffers the most recent text, scans for known markers, and:

  • `<|im_start|>` at the very start of the response → strip the `<|im_start|>assistant\n` header (model is echoing the chat prompt)
  • any END marker (`<|im_end|>`, `<|eot_id|>`, `<end_of_turn>`, `<|endoftext|>`, `<|im_start|>` mid-response, `<|start_header_id|>`, `<|eom_id|>`) → emit clean prefix, set `stop_requested`; the fast-path loop checks the flag and breaks

Cost: ~32 bytes of in-flight streaming buffer (small latency penalty before first token; steady state is unchanged).

quant.h and src/engine/tq_generate.c kept in lockstep (filter mirrored byte-for-byte).
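The buffering idea can be sketched in a few lines. This is an illustrative Python model of the lookahead filter, not the C implementation in `chat_accum_callback`; the marker list is abbreviated:

```python
LOOKAHEAD = 32
MARKERS = ["<|im_end|>", "<|eot_id|>", "<end_of_turn>", "<|endoftext|>"]

class MarkerFilter:
    def __init__(self):
        self.buf = ""       # in-flight tail that may hold a partial marker
        self.out = []       # text already safe to emit
        self.stopped = False

    def feed(self, token_text: str):
        if self.stopped:
            return
        self.buf += token_text
        for m in MARKERS:
            i = self.buf.find(m)
            if i != -1:
                self.out.append(self.buf[:i])  # emit clean prefix
                self.stopped = True            # generation loop breaks on this
                return
        # Hold back the last LOOKAHEAD bytes: a marker split across BPE
        # tokens may still be completing there. Emit the rest.
        if len(self.buf) > LOOKAHEAD:
            self.out.append(self.buf[:-LOOKAHEAD])
            self.buf = self.buf[-LOOKAHEAD:]

    def flush(self) -> str:
        if not self.stopped:
            self.out.append(self.buf)
        return "".join(self.out)
```

Feeding `["Hello", " world", "<|im_", "end|>"]` emits only `"Hello world"`, even though no single token contains the full marker — exactly the multi-token case the per-token strstr check missed.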

Filter test results

A standalone harness drives the filter with simulated token streams:

[PASS] plain text
[PASS] header strip at start
[PASS] im_end mid-response stops
[PASS] multi-token im_end stops          ← the original complaint
[PASS] eot_id stops
[PASS] end_of_turn stops
[PASS] long stream
[PASS] im_start mid-response stops

8/8.
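The "multi-token im_end" case is the one worth internalizing. A self-contained demonstration of why the old per-token check could never catch it (simulated token stream, not the actual harness):

```python
# BPE splits "<|im_end|>" across tokens, so no single token contains it.
tokens = ["The answer", " is 4.", "<|im_", "end|>"]

# Old approach: substring check on each token individually — never matches.
per_token_hit = any("<|im_end|>" in t for t in tokens)

# New approach: scan the accumulated stream — matches once the halves join.
accumulated_hit = "<|im_end|>" in "".join(tokens)

assert per_token_hit is False
assert accumulated_hit is True
```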

P1-C — Supported models docs

docs/supported_models.md covers:

  • Recommended models for each use case
  • Architecture compatibility matrix (llama / gemma / qwen / phi3)
  • Why vocab size dominates speed (with the M3 benchmarks)
  • Why phi3 is hard (the fused QKV story)
  • How to report an unsupported model

Verification

  • ctest --test-dir build → 35/35 passed
  • cmake --build build → all targets clean (no new warnings)
  • wasm/build.sh → 320K bundle rebuilt
  • Standalone chat_accum filter test → 8/8 passed
  • Python available_models() returns SmolLM2-1.7B
  • quantcpp --help epilog reflects new defaults

Deferred (separate PRs)

  • P1-A Phi-3 support (attn_qkv / gate_up_proj) — needs a prototyping spike before committing to a design
  • P1-B Pure-Python server fallback — so quantcpp serve works without a CMake build
  • P2-A Server request queueing / 429 — needs queueing-policy decisions
  • 2.5/2.7 Gemma-4 / Qwen3.5 garbage — need the user's exact model file SHA256 to reproduce

Test plan

  • Unit tests pass
  • Filter harness passes
  • Manual: `quantcpp run smollm2` downloads SmolLM2-1.7B and chats cleanly
  • Manual: load a Phi-3.5-mini GGUF → confirms the new error message and NULL return
  • Manual: chat with a model that historically leaked `<|im_end|>` → confirms the marker is stripped from the stream

🤖 Generated with Claude Code

Four scoped fixes addressing the highest-impact items in
docs/feedback/2026-04-12_0900.md. Each is independently useful and
none is experimental — Phi-3 architecture support is intentionally
deferred to a separate PR.

## Changes

### P0-A — SmolLM2-1.7B as the recommended default

External tester measured SmolLM2-1.7B at ~12.5 tok/s vs Llama-3.2-1B at
~2.3 tok/s on Apple M3. Same llama arch family, but vocab 49K vs 128K.
The lm_head matmul (vocab × hidden_dim per token) is the bottleneck —
fewer params don't help if the vocab is bigger.

- Add SmolLM2-1.7B-Instruct (Q8) to `_MODEL_REGISTRY`
- Add `smollm2:1.7b` and bare `smollm2` aliases (the bare alias now
  points at 1.7B; users wanting the demo model ask for `smollm2:135m`)
- `cmd_chat_default` now uses SmolLM2-1.7B
- Module + class docstrings + CLI help epilog all updated to reflect
  the new recommendation

### P0-B — Hard-fail load on unsupported architecture

Previously: loading a Phi-3 GGUF reported `loaded N layers (0 self_attn)`
in the success log and returned a model that produced page after page of
garbage tokens. Phi-3 uses fused `attn_qkv` projection which the loader
doesn't recognize.

Now: when `tq_load_gguf` finishes a model with zero standard self_attn
layers AND no DeltaNet weights, it logs a clear ERROR naming the
architecture and returns NULL. Callers see the failure immediately
instead of debugging garbage output.

```
tq_load_gguf: ERROR — model architecture 'phi3' is not supported.
  Detected 0 self_attn layers and no DeltaNet weights.
  ...
```

### P0-C — ChatML template marker filter

External tester reported `<|im_start|>`, `<|im_end|>`, stray `assistant`
header text, etc. leaking into chat output. Root cause: BPE tokenizers
fragment these markers across multiple tokens, so the existing per-token
strstr check in the generation loop never matches.

Fix: a 32-byte lookahead filter inside `chat_accum_callback`. The filter
buffers the most recent text, scans for known markers, and:

- `<|im_start|>` at the very start of the response → strip the
  `<|im_start|>assistant\n` header (model is echoing the chat prompt)
- any END marker (`<|im_end|>`, `<|eot_id|>`, `<end_of_turn>`,
  `<|endoftext|>`, `<|im_start|>` mid-response, `<|start_header_id|>`,
  `<|eom_id|>`) → emit clean prefix, set `stop_requested`, fast-path
  loop checks the flag and breaks

Streaming latency cost: ~CHAT_LOOKAHEAD bytes (32) of in-flight buffer.
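The stop handshake between callback and loop can be sketched as follows. Names here (`GenState`, `generate`, `on_token`) are illustrative, not the actual quant.cpp API:

```python
# The streaming callback sets a flag; the generation loop tests it
# before committing the next token — no extra scan on the hot path.
class GenState:
    def __init__(self):
        self.stop_requested = False

def generate(state: GenState, tokens, on_token):
    emitted = []
    for t in tokens:
        on_token(state, t)        # filter runs here; may set stop_requested
        if state.stop_requested:  # fast-path check: break immediately
            break
        emitted.append(t)
    return emitted

state = GenState()
def cb(s, t):
    if t == "<|im_end|>":
        s.stop_requested = True

assert generate(state, ["a", "b", "<|im_end|>", "c"], cb) == ["a", "b"]
```

The flag check costs one branch per token, so steady-state throughput is unaffected; only the ~32 bytes of lookahead buffering adds any latency, and only before the first emitted token.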

Verified by a standalone harness that drives the filter with simulated
token streams (8 cases including BPE-split markers — all pass).

### P1-C — `docs/supported_models.md`

New page documenting the architecture compatibility matrix, the vocab
size → speed relationship, why Phi-3 is hard, and how to report a
broken model. Linked from the feedback file.

## Verified

- ctest --test-dir build → 35/35 passed
- cmake --build build → all targets clean (no new warnings)
- wasm/build.sh → 320K bundle rebuilt
- Standalone chat_accum filter test → 8/8 passed
- Python `from quantcpp import Model` + `available_models()` works
- `quantcpp --help` epilog reflects new defaults

quant.h and src/engine/tq_generate.c kept in lockstep (filter logic
mirrored byte-for-byte).

## Deferred

- Phi-3 (`attn_qkv` / `gate_up_proj`) loader support — separate PR with
  prototype + validation gate
- Server fallback in pure Python (so `quantcpp serve` works without a
  CMake build) — separate PR
- Server request queueing / 429 — separate PR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@unamedkr unamedkr merged commit a7795a5 into main Apr 12, 2026
3 checks passed
@unamedkr unamedkr deleted the feat/feedback-quick-wins-2026-04-12 branch April 12, 2026 02:04
