fix(chat-cache): second audit pass — 9 more hidden bugs eliminated#53
Merged
Follow-up to PR #52. A fresh code-reading audit found another batch of hidden bugs in the chat KV cache path. None had visible symptoms in the happy path; all were latent failure modes that would surface under load, on long histories, or on memory pressure.

Bugs fixed:

CRITICAL
- B1: `tq_generate_continue`'s sliding-window truncation silently desynced `cached_text`. `cached_text` claimed the FULL prompt was committed, but `cached_tokens` only held the truncated tail — next turn's text-prefix match mapped text bytes to the wrong KV positions. Fix: continue now returns -2 on overflow instead of truncating.
- B2: `cached_text` was updated even when generation returned an error, leaving the cache claiming committed bytes that weren't.
- B3: `chat_accum_callback` realloc failure silently dropped tokens AND skipped the user's stream callback — broken UX + corrupted `cached_text`. Fix: always pass tokens to `user_cb`; mark the accumulator tainted on realloc failure; skip the `cached_text` update if tainted.
- B4: server's `get_or_create_session` didn't NULL-check `tq_create_state_ex`. An OOM made the next call dereference a NULL `kv_state`.
- B5: CLI `cmd_run` interactive loop ignored the `quant_chat` return code, so context overflow produced an infinite stream of empty replies. Fix: catch `ChatContextOverflow`, drop the oldest turn, retry.

HIGH
- B6: server streaming path only branched on rc == -2; rc == -1 produced HTTP 200 with finish_reason "stop" and no error info. Now sends an error delta + finish_reason "error".
- B7: server reused an existing session even when the request changed kv_type / value_quant_bits — old quantized blocks would be misinterpreted. Now detects the change and rebuilds state.
- B8: WASM `wasm_load_model` didn't reset `g_generating`. After a page reload mid-stream, every subsequent generate call early-returned -1 forever.
- B9: `rep_penalty` was silently ignored in `tq_generate_chat_text`'s FAST path (the slow path applied it). Now mirrors the slow path.
- B10: Python `Model.chat()` ignored the C return value; -2 / -1 surfaced as empty token streams. Now raises `ChatContextOverflow` / `RuntimeError`.

MEDIUM
- Removed dead `update_cache:` label.
- Synced `bindings/python/quant.h` (sdist staging snapshot).

Verification:
- ctest --test-dir build → 35/35 passed
- cmake --build build → all targets clean
- wasm/build.sh → quant.wasm rebuilt clean (320K)
- Python imports + `cli --help` work

`quant.h` and `src/engine/tq_generate.c` are kept in lockstep (every chat-cache change applied to both).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
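The B5/B10 fix pattern (catch the overflow, drop the oldest turn, retry) can be sketched in a few lines. This is a self-contained toy, not the quantcpp implementation: `ChatContextOverflow`, `chat`, and `MAX_TURNS` are stand-ins defined locally to illustrate the control flow.

```python
class ChatContextOverflow(Exception):
    """Stand-in for the exception the Python binding now raises on rc == -2."""

MAX_TURNS = 3  # toy stand-in for a token budget

def chat(history, user_msg):
    # Stub model call: overflows whenever the history is too long.
    if len(history) + 1 > MAX_TURNS:
        raise ChatContextOverflow
    return f"reply to {user_msg!r} with {len(history)} prior turns"

def chat_with_retry(history, user_msg):
    """Drop the oldest turn and retry until the prompt fits."""
    while True:
        try:
            return chat(history, user_msg)
        except ChatContextOverflow:
            if not history:
                raise  # even an empty history overflows: give up
            history.pop(0)  # drop the oldest turn, then retry

history = ["turn1", "turn2", "turn3", "turn4"]
print(chat_with_retry(history, "hello"))  # succeeds after two drops
print(len(history))                       # oldest turns were dropped in place
```

The key point is that overflow is now an explicit signal the caller can act on, instead of a silent truncation inside the engine.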
Summary
Follow-up to #52. A fresh code-reading audit found another batch of hidden bugs in the chat KV cache path. None had visible symptoms in the happy path; all were latent failure modes that would surface under load, on long histories, or on memory pressure.
Audit method: code reading only — no behavior testing on the broken paths. Every finding was verified by reading the actual source before fixing. Per the principle that experimenting with buggy code is meaningless — the audit must catch what tests can't.
Bugs fixed (9)
Critical
- **B1:** `tq_generate_continue`'s sliding-window truncation silently desynced `cached_text` from KV positions. `cached_text` claimed the FULL prompt was committed, but `cached_tokens` only held the truncated tail → next turn's text-prefix match mapped text bytes to the wrong KV positions → garbage output after long histories. Fix: continue now returns `-2` on overflow instead of truncating; the caller resets and reports.
- **B2:** `cached_text` was updated even when generation returned an error (`generated < 0`), leaving the cache claiming committed bytes that weren't.
- **B3:** `chat_accum_callback` realloc failure silently dropped tokens AND skipped the user's stream callback → broken UX + corrupted `cached_text`. Fix: always pass tokens to the user callback first; mark the accumulator `tainted` on realloc failure; skip the `cached_text` update if tainted.
- **B4:** the server's `get_or_create_session` didn't NULL-check `tq_create_state_ex`. An OOM made the next call dereference a NULL `kv_state`.
- **B5:** the CLI `cmd_run` interactive loop ignored the `quant_chat` return code, so context overflow produced an infinite stream of empty replies. Fix: catch `ChatContextOverflow`, drop the oldest turn, retry.

High

- **B6:** the server streaming path only branched on `rc == -2`; `rc == -1` produced HTTP 200 with `finish_reason: stop` and no error info. Now sends an error delta + `finish_reason: error`.
- **B7:** the server reused an existing session even when the request changed `kv_type` / `value_quant_bits` — old quantized blocks would be misinterpreted. Now detects the change and rebuilds state.
- **B8:** WASM `wasm_load_model` didn't reset `g_generating`. After a page reload mid-stream, every subsequent generate call early-returned -1 forever.
- **B9:** `rep_penalty` was silently ignored in `tq_generate_chat_text`'s FAST path (the slow path applied it). Now mirrors the slow path, including the in-loop reapplication after each forward pass.

Medium / hygiene

- Removed the dead `update_cache:` label (refactoring leftover).
- Synced `bindings/python/quant.h` (sdist staging snapshot).

The most insidious bug — B1 walkthrough
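Before the step-by-step walkthrough, a toy simulation of the desync. All names here are invented for illustration; the real logic lives in `tq_generate_continue`. It models the old silent-truncation commit against the new strict commit that returns `-2` on overflow, and shows how the text side and the token side of the cache drift apart.

```python
def tokenize(text):
    return list(text)  # 1 byte == 1 token, for the sketch

MAX_SEQ_LEN = 4  # toy context window

def commit_truncating(cache, text):
    """Old behavior: record the full text but silently truncate tokens."""
    cache["text"] = text
    cache["tokens"] = tokenize(text)[-MAX_SEQ_LEN:]

def commit_strict(cache, text):
    """New behavior: refuse to commit on overflow (return -2)."""
    if len(tokenize(text)) > MAX_SEQ_LEN:
        return -2
    cache["text"] = text
    cache["tokens"] = tokenize(text)
    return 0

cache = {"text": "", "tokens": []}
commit_truncating(cache, "abcdef")
# The fast path would believe len(cache["text"]) bytes are resident in KV,
# but only len(cache["tokens"]) positions actually are: desynced.
print(len(cache["text"]), len(cache["tokens"]))

cache2 = {"text": "", "tokens": []}
rc = commit_strict(cache2, "abcdef")
print(rc, cache2)  # overflow is reported, cache left untouched
```

Under the strict variant the two sides can never disagree, which is exactly the invariant the fix restores.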
The user-visible failure mode:
1. The conversation grows past `max_seq_len`.
2. `tq_generate_continue`'s "prevent catastrophic failure" branch silently drops the oldest tokens, sets `*n_cached_io = 0`, and re-prefills the truncated tail at positions [0..k).
3. `tq_generate_chat_text` updates `cached_text` to the full prompt + generated text.
4. The next turn's prompt matches `cached_text` byte-for-byte → fast path → assumes the first `len(cached_text)` bytes correspond to KV positions [0..n_cached) → prefills new tokens at the wrong positions.

This bug never reproduces in short happy-path tests. It only surfaces after a chat long enough to trigger overflow. The fix removes the dangerous silent truncation entirely — overflow now propagates as `-2` so callers can handle it explicitly (the server returns 413, WASM auto-resets, Python raises `ChatContextOverflow`, the CLI drops the oldest turn and retries).

Verification
- `ctest --test-dir build` → 35/35 passed
- `cmake --build build` → all targets clean
- `wasm/build.sh` → `quant.wasm` rebuilt clean (320K)
- Python imports + `quantcpp --help` work
- `quant.h` and `src/engine/tq_generate.c` kept in lockstep (every chat-cache change applied to both, since they hold parallel implementations)

Test plan
- `from quantcpp import ChatContextOverflow, Model` works
- `curl` with two clients alternating `kv_type` on the same `user` field → confirms the rebuild log message
- `quantcpp run llama3.2:1b`, chat for 20+ turns until overflow → confirms a graceful "dropped oldest turn" message instead of empty replies

🤖 Generated with Claude Code