fix(chat-cache): second audit pass — 9 more hidden bugs eliminated#53
Merged
Follow-up to PR #52. A fresh code-reading audit found another batch of hidden bugs in the chat KV cache path. None had visible symptoms in the happy path; all were latent failure modes that would surface under load, on long histories, or on memory pressure.

Bugs fixed:

CRITICAL
- B1: `tq_generate_continue`'s sliding-window truncation silently desynced `cached_text`. `cached_text` claimed the FULL prompt was committed, but `cached_tokens` only held the truncated tail — next turn's text-prefix match mapped text bytes to the wrong KV positions. Fix: continue now returns -2 on overflow instead of truncating.
- B2: `cached_text` was updated even when generation returned an error, leaving the cache claiming committed bytes that weren't.
- B3: `chat_accum_callback` realloc failure silently dropped tokens AND skipped the user's stream callback — broken UX + corrupted `cached_text`. Fix: always pass tokens to `user_cb`; mark the accumulator tainted on realloc failure; skip the `cached_text` update if tainted.
- B4: server's `get_or_create_session` didn't NULL-check `tq_create_state_ex`. An OOM made the next call dereference a NULL `kv_state`.
- B5: CLI `cmd_run` interactive loop ignored the `quant_chat` return code, so context overflow produced an infinite stream of empty replies. Fix: catch `ChatContextOverflow`, drop the oldest turn, retry.

HIGH
- B6: server streaming path only branched on rc == -2; rc == -1 produced HTTP 200 with finish_reason "stop" and no error info. Now sends an error delta + finish_reason "error".
- B7: server reused an existing session even when the request changed kv_type / value_quant_bits — old quantized blocks would be misinterpreted. Now detects the change and rebuilds state.
- B8: WASM `wasm_load_model` didn't reset `g_generating`. After a page reload mid-stream, every subsequent generate call early-returned -1 forever.
- B9: `rep_penalty` was silently ignored in `tq_generate_chat_text`'s FAST path (the slow path applied it). Now mirrors the slow path.
- B10: Python `Model.chat()` ignored the C return value; -2 / -1 surfaced as empty token streams. Now raises `ChatContextOverflow` / `RuntimeError`.

MEDIUM
- Removed dead `update_cache:` label.
- Synced `bindings/python/quant.h` (sdist staging snapshot).

Verification:
- ctest --test-dir build → 35/35 passed
- cmake --build build → all targets clean
- wasm/build.sh → quant.wasm rebuilt clean (320K)
- Python imports + `cli --help` work

`quant.h` and `src/engine/tq_generate.c` are kept in lockstep (every chat-cache change applied to both).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
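The B5/B10 fix pattern (catch the overflow, drop the oldest turn, retry) can be sketched in a few lines. This is a self-contained toy, not the quantcpp implementation: `ChatContextOverflow`, `chat`, and `MAX_TURNS` are stand-ins defined locally to illustrate the control flow.

```python
class ChatContextOverflow(Exception):
    """Stand-in for the exception the Python binding now raises on rc == -2."""

MAX_TURNS = 3  # toy stand-in for a token budget

def chat(history, user_msg):
    # Stub model call: overflows whenever the history is too long.
    if len(history) + 1 > MAX_TURNS:
        raise ChatContextOverflow
    return f"reply to {user_msg!r} with {len(history)} prior turns"

def chat_with_retry(history, user_msg):
    """Drop the oldest turn and retry until the prompt fits."""
    while True:
        try:
            return chat(history, user_msg)
        except ChatContextOverflow:
            if not history:
                raise  # even an empty history overflows: give up
            history.pop(0)  # drop the oldest turn, then retry

history = ["turn1", "turn2", "turn3", "turn4"]
print(chat_with_retry(history, "hello"))  # succeeds after two drops
print(len(history))                       # oldest turns were dropped in place
```

The key point is that overflow is now an explicit signal the caller can act on, instead of a silent truncation inside the engine.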
Summary
Follow-up to #52. A fresh code-reading audit found another batch of hidden bugs in the chat KV cache path. None had visible symptoms in the happy path; all were latent failure modes that would surface under load, on long histories, or on memory pressure.
Audit method: code reading only — no behavior testing on the broken paths. Every finding was verified by reading the actual source before fixing. Per the principle that experimenting with buggy code is meaningless — the audit must catch what tests can't.
Bugs fixed (9)
Critical
- **B1:** `tq_generate_continue`'s sliding-window truncation silently desynced `cached_text` from KV positions. `cached_text` claimed the FULL prompt was committed, but `cached_tokens` only held the truncated tail → next turn's text-prefix match mapped text bytes to the wrong KV positions → garbage output after long histories. Fix: continue now returns `-2` on overflow instead of truncating; the caller resets and reports.
- **B2:** `cached_text` was updated even when generation returned an error (`generated < 0`), leaving the cache claiming committed bytes that weren't.
- **B3:** `chat_accum_callback` realloc failure silently dropped tokens AND skipped the user's stream callback → broken UX + corrupted `cached_text`. Fix: always pass tokens to the user callback first; mark the accumulator `tainted` on realloc failure; skip the `cached_text` update if tainted.
- **B4:** the server's `get_or_create_session` didn't NULL-check `tq_create_state_ex`. An OOM made the next call dereference a NULL `kv_state`.
- **B5:** the CLI `cmd_run` interactive loop ignored the `quant_chat` return code, so context overflow produced an infinite stream of empty replies. Fix: catch `ChatContextOverflow`, drop the oldest turn, retry.

High

- **B6:** the server streaming path only branched on `rc == -2`; `rc == -1` produced HTTP 200 with `finish_reason: stop` and no error info. Now sends an error delta + `finish_reason: error`.
- **B7:** the server reused an existing session even when the request changed `kv_type` / `value_quant_bits` — old quantized blocks would be misinterpreted. Now detects the change and rebuilds state.
- **B8:** WASM `wasm_load_model` didn't reset `g_generating`. After a page reload mid-stream, every subsequent generate call early-returned -1 forever.
- **B9:** `rep_penalty` was silently ignored in `tq_generate_chat_text`'s FAST path (the slow path applied it). Now mirrors the slow path, including the in-loop reapplication after each forward pass.

Medium / hygiene

- Removed the dead `update_cache:` label (refactoring leftover).
- Synced `bindings/python/quant.h` (sdist staging snapshot).

The most insidious bug — B1 walkthrough
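Before the step-by-step walkthrough, a toy simulation of the desync. All names here are invented for illustration; the real logic lives in `tq_generate_continue`. It models the old silent-truncation commit against the new strict commit that returns `-2` on overflow, and shows how the text side and the token side of the cache drift apart.

```python
def tokenize(text):
    return list(text)  # 1 byte == 1 token, for the sketch

MAX_SEQ_LEN = 4  # toy context window

def commit_truncating(cache, text):
    """Old behavior: record the full text but silently truncate tokens."""
    cache["text"] = text
    cache["tokens"] = tokenize(text)[-MAX_SEQ_LEN:]

def commit_strict(cache, text):
    """New behavior: refuse to commit on overflow (return -2)."""
    if len(tokenize(text)) > MAX_SEQ_LEN:
        return -2
    cache["text"] = text
    cache["tokens"] = tokenize(text)
    return 0

cache = {"text": "", "tokens": []}
commit_truncating(cache, "abcdef")
# The fast path would believe len(cache["text"]) bytes are resident in KV,
# but only len(cache["tokens"]) positions actually are: desynced.
print(len(cache["text"]), len(cache["tokens"]))

cache2 = {"text": "", "tokens": []}
rc = commit_strict(cache2, "abcdef")
print(rc, cache2)  # overflow is reported, cache left untouched
```

Under the strict variant the two sides can never disagree, which is exactly the invariant the fix restores.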
The user-visible failure mode:
1. The conversation grows past `max_seq_len`.
2. `tq_generate_continue`'s "prevent catastrophic failure" branch silently drops the oldest tokens, sets `*n_cached_io = 0`, and re-prefills the truncated tail at positions [0..k).
3. `tq_generate_chat_text` updates `cached_text` to the full prompt + generated text.
4. The next turn's prompt matches `cached_text` byte-for-byte → fast path → assumes the first `len(cached_text)` bytes correspond to KV positions [0..n_cached) → prefills new tokens at the wrong positions.

This bug never reproduces in short happy-path tests. It only surfaces after a chat long enough to trigger overflow. The fix removes the dangerous silent truncation entirely — overflow now propagates as `-2` so callers can handle it explicitly (the server returns 413, WASM auto-resets, Python raises `ChatContextOverflow`, the CLI drops the oldest turn and retries).

Verification
- `ctest --test-dir build` → 35/35 passed
- `cmake --build build` → all targets clean
- `wasm/build.sh` → `quant.wasm` rebuilt clean (320K)
- Python imports + `quantcpp --help` work
- `quant.h` and `src/engine/tq_generate.c` kept in lockstep (every chat-cache change applied to both, since they hold parallel implementations)

Test plan
- `from quantcpp import ChatContextOverflow, Model` works
- `curl` with two clients alternating `kv_type` on the same `user` field → confirms the rebuild log message
- `quantcpp run llama3.2:1b`, chat for 20+ turns until overflow → confirms a graceful "dropped oldest turn" message instead of empty replies

🤖 Generated with Claude Code