test: lock the prompt-encode wiring through the #9 seam — fake tokenizer, decidable clamp, prefill switch (#12) by kiki830621 · Pull Request #22 · PsychQuant/bestASR

kiki830621 · 2026-07-02T09:29:13Z

Refs #12

Summary

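A FakeTokenizer (non-identity 1000+byte mapping; unused stubs fail fast) is injected into the #9 spy pipeline to lock the actual encode→clamp→makeDecodeOptions wiring: leading space in-band (1032), trailing-224 clamp (heterogeneous data, with a direction-decidable assertion plus a != prefix assertion), the usePrefillPrompt switch, and a canary that is no longer vacuous. Zero changes to production code.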

Verification

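The 6-AI verify master report is in issue #12: one MEDIUM finding (homogeneous test data made the direction claim vacuous) and 2 LOW findings were fixed in the same round. All four mutations were judged red by exit code (drop leading space / prefix clamp / always nil / prefill off). 161/161 green. As a side effect, this caught and hotfixed the #20 dist-sync leftover on main (BestASRVersion 0.3.1→0.4.0, ac191fb).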

Checklist

Diagnose ✓ (Simple / A_parallel_safe)
Implement + verify fixes (2 commits)
Verify ✓ (post-fix 0 blocking)
Verify-gated: ready to merge → after merge, run /idd-close manually

🤖 Generated by /idd-all. Do NOT add a GitHub close trailer.

…okenizer

The #9 spy injected tokenizer: nil, so the encode → clamp → makeDecodeOptions branch never ran under the seam. The DA proved that deleting it (or dropping the leading space, or always passing nil) kept every test green, and the promptTokens == nil canary was vacuously true.

A deterministic FakeTokenizer (UTF-8 bytes as ids, ~25 lines of inert stubbing) now rides the spy pipeline. The pipeline must receive exactly encode(" " + prompt), with the leading space in-band. An overlong prompt must clamp to the TRAILING 224 tokens (nearest context wins). The no-prompt canary runs WITH a tokenizer, so nil now proves the gate is the absent prompt.

Mutation probes: all three DA mutations (drop leading space, prefix-clamp, force nil) now go red. Production code untouched. Refs #12
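To make the canary argument concrete, here is a minimal sketch of the wiring this commit describes. It is not the PR's code: `FakeTokenizer`, `SpyOptions`, and `makeOptions` are hypothetical stand-ins, and the fake uses the identity byte mapping that this first commit mentions.

```swift
// Hypothetical stand-ins; none of these declarations are taken from the PR or from WhisperKit.
struct FakeTokenizer {
    // Deterministic: one id per UTF-8 byte (identity mapping, as in this first commit).
    func encode(_ text: String) -> [Int] { text.utf8.map { Int($0) } }
}

// What the spy is assumed to capture at the seam.
struct SpyOptions { var promptTokens: [Int]? }

// Assumed shape of the wiring under test: space + prompt, encode, keep the trailing 224.
func makeOptions(prompt: String?, tokenizer: FakeTokenizer?) -> SpyOptions {
    guard let prompt = prompt, let tokenizer = tokenizer else {
        return SpyOptions(promptTokens: nil)
    }
    let tokens = tokenizer.encode(" " + prompt)
    return SpyOptions(promptTokens: Array(tokens.suffix(224)))
}

let fake = FakeTokenizer()
// With a tokenizer and a prompt, the leading space (byte 32) must be in-band.
let withPrompt = makeOptions(prompt: "hi", tokenizer: fake)
// Canary: a tokenizer is present, so nil can only come from the absent prompt.
let canary = makeOptions(prompt: nil, tokenizer: fake)
// The old spy: tokenizer nil, so nil is produced no matter what the encode branch does.
let oldSpy = makeOptions(prompt: "hi", tokenizer: nil)
```

In this sketch `oldSpy.promptTokens` is nil even though a prompt was supplied, which is why a nil-only assertion could not tell a working encode branch from a deleted one.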

…ntity fake, prefill switch locked

The clamp test's homogeneous 'aaa…' data made suffix ≡ prefix, so its direction claim was vacuous. It also exposed that the earlier mutation-probe evidence for that direction was contaminated: the grep-based red-count was measuring noise, not the assertion. Heterogeneous 150a+150b data now makes every 224-window distinct, with an explicit != prefix guard.

The fake's encoding moves off identity UTF-8 to a 1000+byte mapping, so a production path that hardcoded raw bytes without calling the injected tokenizer cannot match. The DA's mutation find is locked too: usePrefillPrompt, the switch that makes WhisperKit actually CONSUME promptTokens, is now asserted at the seam. Unused fake stubs fail fast instead of returning plausible zeros.

All four mutations re-verified red by exit code (drop leading space / prefix clamp / force nil / prefill off). Refs #12
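The direction argument can be sketched as follows, under the assumption that the wiring is "prepend a space, encode, keep the trailing 224 tokens". `FakeTokenizer`, `SpyOptions`, and `makeOptions` are hypothetical names for illustration and are not the PR's actual test code; only `promptTokens` and `usePrefillPrompt` are names from the commit message.

```swift
// Hypothetical stand-ins for illustration only.
struct FakeTokenizer {
    // Non-identity mapping: byte b becomes 1000 + b, so hardcoded raw bytes can never match.
    func encode(_ text: String) -> [Int] { text.utf8.map { 1000 + Int($0) } }
    // Unused stub: trap loudly instead of returning a plausible value.
    func decode(_ tokens: [Int]) -> String {
        fatalError("FakeTokenizer.decode is not expected to be called")
    }
}

// What the spy is assumed to capture at the seam.
struct SpyOptions {
    var promptTokens: [Int]?
    var usePrefillPrompt: Bool
}

// Assumed shape of the wiring under test.
func makeOptions(prompt: String, tokenizer: FakeTokenizer, limit: Int = 224) -> SpyOptions {
    let tokens = tokenizer.encode(" " + prompt)
    return SpyOptions(promptTokens: Array(tokens.suffix(limit)), usePrefillPrompt: true)
}

let fake = FakeTokenizer()
// Heterogeneous data: 150 'a' followed by 150 'b'.
let prompt = String(repeating: "a", count: 150) + String(repeating: "b", count: 150)
let all = fake.encode(" " + prompt)            // 1 space + 300 letters = 301 tokens
let options = makeOptions(prompt: prompt, tokenizer: fake)

let expectedSuffix = Array(all.suffix(224))    // 74 'a' tokens, then 150 'b' tokens
let wrongPrefix = Array(all.prefix(224))       // space, 150 'a' tokens, then 73 'b' tokens
```

Asserting both `options.promptTokens == expectedSuffix` and `options.promptTokens != wrongPrefix` is what makes the direction decidable: a prefix-clamp mutation satisfies neither.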

kiki830621 added 2 commits July 2, 2026 17:11

kiki830621 merged commit 210b8d1 into main Jul 2, 2026

kiki830621 deleted the idd/12-lock-prompt-encode-wiring branch July 2, 2026 20:03

This was referenced Jul 2, 2026

Lock the prompt-encode wiring in transcribeRaw — fake WhisperTokenizer through the #9 seam #12

Closed

feature: speaker identification — map SPEAKER_N to the real names in names[] (stage two of two, blocked by #25) #26

Closed

kiki830621 merged 2 commits into main from idd/12-lock-prompt-encode-wiring
