
fix(chat-cache): second audit pass — 9 more hidden bugs eliminated #53

Merged: unamedkr merged 1 commit into main from fix/chat-cache-audit-2 on Apr 12, 2026

Conversation

@unamedkr (Collaborator)

Summary

Follow-up to #52. A fresh code-reading audit found another batch of hidden bugs in the chat KV cache path. None had visible symptoms in the happy path; all were latent failure modes that would surface under load, on long histories, or on memory pressure.

Audit method: code reading only — no behavior testing on the broken paths. Every finding was verified by reading the actual source before fixing. Per the principle that experimenting with buggy code is meaningless — the audit must catch what tests can't.

Bugs fixed (9)

Critical

  • B1 — tq_generate_continue's sliding-window truncation silently desynced cached_text from KV positions. cached_text claimed the FULL prompt was committed, but cached_tokens only held the truncated tail → next turn's text-prefix match mapped text bytes to the wrong KV positions → garbage output after long histories. Fix: continue now returns -2 on overflow instead of truncating; the caller resets and reports.
  • B2 — cached_text was updated even when generation returned an error (generated < 0), leaving the cache claiming committed bytes that weren't.
  • B3 — chat_accum_callback realloc failure silently dropped tokens AND skipped the user's stream callback → broken UX + corrupted cached_text. Fix: always pass tokens to the user callback first; mark accumulator tainted on realloc failure; skip cached_text update if tainted.
  • B4 — server's get_or_create_session didn't NULL-check tq_create_state_ex. An OOM made the next call dereference a NULL kv_state.
  • B5 — CLI cmd_run interactive loop ignored quant_chat return code, so context overflow produced an infinite stream of empty replies. Fix: catch ChatContextOverflow, drop oldest turn, retry.
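The B1/B2 fix pattern reduces to a few lines of C. This is a minimal sketch, not the real engine API: `generate_continue_sketch` and `chat_text_sketch` are hypothetical stand-ins, and token counting is flattened to plain ints. The point is the shape of the fix: overflow propagates as -2, and cached_text is committed only on success.

```c
#include <assert.h>
#include <string.h>

#define TQ_ERR_OVERFLOW (-2)

/* Hypothetical stand-in for tq_generate_continue: returns the generated
   token count, or -2 on overflow. B1 fix: propagate, never truncate silently. */
static int generate_continue_sketch(int n_cached, int n_new, int max_seq_len) {
    if (n_cached + n_new > max_seq_len)
        return TQ_ERR_OVERFLOW;
    return n_new;                /* stand-in for the real decode loop */
}

/* Hypothetical stand-in for the chat-text path. B2 fix: on any error,
   leave cached_text untouched so it never claims uncommitted bytes. */
static void chat_text_sketch(char *cached_text, size_t cap, const char *full_prompt,
                             int n_cached, int n_new, int max_seq_len) {
    int generated = generate_continue_sketch(n_cached, n_new, max_seq_len);
    if (generated < 0)
        return;                  /* error: do not commit cached_text */
    strncpy(cached_text, full_prompt, cap - 1);
    cached_text[cap - 1] = '\0'; /* success: commit the full prompt text */
}
```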

High

  • B6 — server streaming path only branched on rc == -2; rc == -1 produced HTTP 200 with finish_reason: stop and no error info. Now sends an error delta + finish_reason: error.
  • B7 — server reused an existing session even when the request changed kv_type / value_quant_bits — old quantized blocks would be misinterpreted. Now detects the change and rebuilds state.
  • B8 — WASM wasm_load_model didn't reset g_generating. After a page-reload mid-stream, every subsequent generate call early-returned -1 forever.
  • B9 — rep_penalty was silently ignored in tq_generate_chat_text's FAST path (slow path applied it). Now mirrors the slow path including the in-loop reapplication after each forward pass.
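A minimal sketch of the B8 fix (the `g_generating` flag is named in the PR; the surrounding functions here are hypothetical stand-ins): the reentrancy guard must be cleared on model load, because a page reload can interrupt a stream and leave the flag stuck.

```c
#include <assert.h>

static int g_generating = 0;          /* set while a streaming generate runs */

/* Hypothetical stand-in for the WASM generate entry point. */
static int wasm_generate_sketch(void) {
    if (g_generating) return -1;      /* reentrancy guard */
    g_generating = 1;
    /* ... decode loop would run here ... */
    g_generating = 0;
    return 0;
}

/* Hypothetical stand-in for wasm_load_model. The B8 fix: reset the
   guard, since a reload may have interrupted a stream mid-flight. */
static int wasm_load_model_sketch(void) {
    g_generating = 0;
    /* ... weight loading would run here ... */
    return 0;
}
```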

Medium / hygiene

  • Removed dead update_cache: label (refactoring leftover).
  • Synced bindings/python/quant.h (sdist staging snapshot).

The most insidious bug — B1 walkthrough

The user-visible failure mode:

  1. User has a long conversation (10+ turns).
  2. Accumulated ChatML history + new user message exceeds max_seq_len.
  3. tq_generate_continue's "prevent catastrophic failure" branch silently drops the oldest tokens, sets *n_cached_io = 0, re-prefills the truncated tail at positions [0..k).
  4. Function returns successfully, so tq_generate_chat_text updates cached_text to the full prompt + generated text.
  5. Next turn: new prompt matches cached_text byte-for-byte → fast path → assumes the first len(cached_text) bytes correspond to KV positions [0..n_cached) → prefills new tokens at the wrong positions.
  6. Reality: KV positions [0..k) hold the tail of the original prompt, not the head. The text↔position mapping is now broken. Next response is garbage.
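The desync in steps 3–6 can be modeled in a few lines of C. This is a toy (one token per text byte; the struct and function names are invented for illustration), but it shows exactly which invariant the silent truncation breaks:

```c
#include <assert.h>
#include <string.h>

/* Toy cache: hypothetical fields, not the engine's real layout. */
struct cache {
    char text[64];  /* cached_text: what the cache claims is committed */
    int  n_kv;      /* number of KV positions actually populated */
};

/* The buggy path: KV keeps only the truncated tail, but cached_text
   still claims the full prompt was committed. */
static void commit_with_silent_truncation(struct cache *c, const char *full, int max_kv) {
    int n = (int)strlen(full);
    c->n_kv = n > max_kv ? max_kv : n;  /* positions [0..k) hold the tail */
    strcpy(c->text, full);              /* text claims everything */
}

/* The invariant the next turn's fast path assumes: the first
   strlen(cached_text) bytes map to KV positions [0..n_kv). */
static int text_kv_in_sync(const struct cache *c) {
    return (int)strlen(c->text) == c->n_kv;
}
```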

This bug never reproduces in short happy-path tests. It only surfaces after a long enough chat that overflow triggers. The fix removes the dangerous silent truncation entirely — overflow now propagates as -2 so callers can handle it explicitly (server returns 413, WASM auto-resets, Python raises ChatContextOverflow, CLI drops the oldest turn and retries).
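The CLI's drop-oldest-and-retry policy can be sketched as follows (the helper name and the flat per-turn token counts are hypothetical simplifications of what the real loop would do with chat turns):

```c
#include <assert.h>

/* On overflow, drop the oldest turn and retry until the history fits.
   Returns the fitting total, or -2 if even a single turn can't fit
   (mirroring the -2 overflow code the engine now propagates). */
static int fit_history_or_drop_oldest(int *turn_lens, int *n_turns, int max_seq_len) {
    for (;;) {
        int total = 0;
        for (int i = 0; i < *n_turns; i++) total += turn_lens[i];
        if (total <= max_seq_len) return total;   /* fits: proceed to generate */
        if (*n_turns <= 1) return -2;             /* even one turn can't fit */
        for (int i = 1; i < *n_turns; i++)        /* drop oldest, shift left */
            turn_lens[i - 1] = turn_lens[i];
        (*n_turns)--;
    }
}
```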

Verification

  • ctest --test-dir build → 35/35 passed
  • cmake --build build → all targets clean
  • wasm/build.sh → quant.wasm rebuilt clean (320K)
  • Python imports + quantcpp --help work

quant.h and src/engine/tq_generate.c kept in lockstep (every chat-cache change applied to both, since they hold parallel implementations).

Test plan

  • Existing unit tests pass (35/35)
  • Library, server, WASM, Python all build clean
  • Python from quantcpp import ChatContextOverflow, Model works
  • Manual: long-conversation chat in WASM demo until overflow → confirms auto-reset message instead of garbage
  • Manual: server curl with two clients alternating kv_type on the same user field → confirms rebuild log message
  • Manual: quantcpp run llama3.2:1b, chat for 20+ turns until overflow → confirms graceful "dropped oldest turn" message instead of empty replies

🤖 Generated with Claude Code

unamedkr merged commit 0623b16 into main on Apr 12, 2026 (3 checks passed).
unamedkr deleted the fix/chat-cache-audit-2 branch on April 12, 2026 at 01:36.