Decode-quality primitives: scaffolding-token mask and argmax-is-EOG stop signal #10
Merged
…top signal
Two zero-hot-loop-cost additions for autocomplete decode quality:
- buildTokenMasks now probes each token's special-rendered piece and hard-masks single-token chat/instruct/FIM scaffolding (<|im_end|>, <start_of_turn>, [INST], FIM families) that the GGUF did not flag as a control token, plus an unflagged BOS. EOG tokens stay exempt so natural stops keep firing. Well-formed GGUFs already flag these as control, so the common count is 0; the rule is insurance against vocabularies that ship them unflagged. Exposed via getMaskedScaffoldingTokenCount for tests/diagnostics.
- SampleResult gains argmax_is_eog: whether the raw distribution's single most-likely token at this position is an end-of-generation token. Stochastic sampling can draw past the point where the model wants to stop; this lets callers detect that stop intent on the exact step it appears. Computed in C++ while the logits row is hot (one O(vocab) pass per token, tens of microseconds); the seed token's verdict is captured at decodePrompt while its row is still resident. Field is appended, so existing Swift call sites that only read members keep compiling; SamplingConfig is untouched.
Summary
Two additions for autocomplete decode quality, both zero cost on the hot sampling path and both source-compatible with the app's existing call sites (SamplingConfig untouched; SampleResult gains an appended field that callers only read).
Scaffolding-token mask (load time). buildTokenMasks now probes each token's special-rendered piece and adds single-token chat/instruct/FIM scaffolding (<|im_end|>, <start_of_turn>, [INST], FIM marker families) to the existing -inf logit-bias table when the GGUF ships them without the control attribute, plus an unflagged BOS. EOG tokens stay exempt so natural stops keep firing. Well-formed GGUFs flag these as control already, so the expected count is 0 on the catalog models; this is insurance against vocabularies that do not, surfaced via getMaskedScaffoldingTokenCount for tests and diagnostics.
argmax_is_eog on SampleResult. True when the raw distribution's single most-likely token at this position is an end-of-generation token. Stochastic sampling can draw past the point where the model wants to stop; this lets the caller detect the stop intent on the exact step it appears and finalize cleanly. Computed in C++ while the logits row is hot: one O(vocab) pass per sampled token (tens of microseconds; the row is unmutated because llama_sampler_sample works on a copied candidate array). The seed token's verdict is captured at decodePrompt while its logits row is still resident, mirroring seed_logprob.
Validation
Note: testEndToEndWithModel has a pre-existing, model-specific failure on Qwen3.5-0.8B-Base (partial trimKV returns false on that model's attention layout; KV count assertion follows). Verified identical on unmodified main; unrelated to this change, and the app already falls back to a fresh prompt build when trimKV reports failure.
Risk / rollout
Mask additions only ever remove tokens that should never appear as autocomplete text; EOG exemption is covered by tests. The argmax flag is informational until a consumer opts in app-side.
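
To make the two primitives concrete, here is a minimal, self-contained C++ sketch of the idea. It is not the code from this PR: TokenInfo, looksLikeScaffolding, buildScaffoldingBias, and argmaxIsEog are hypothetical names, the marker list is a small illustrative sample, and the vocabulary is a plain vector rather than a loaded GGUF. It assumes the mask is a per-token bias of 0 or -inf built once at load time, and that the stop signal is taken from the unmodified logits row.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for one vocabulary entry. The real code would read
// these properties from the loaded model's vocabulary.
struct TokenInfo {
    std::string piece;  // special-rendered text of the token
    bool is_control;    // GGUF control attribute is set
    bool is_eog;        // end-of-generation token
};

// Illustrative sample only; the actual scaffolding families are broader.
inline bool looksLikeScaffolding(const std::string& piece) {
    static const std::unordered_set<std::string> kMarkers = {
        "<|im_end|>", "<start_of_turn>", "[INST]"};
    return kMarkers.count(piece) > 0;
}

// Load-time pass: returns a per-token bias (0 or -inf) and reports how many
// tokens were masked by this rule alone.
inline std::vector<float> buildScaffoldingBias(const std::vector<TokenInfo>& vocab,
                                               std::size_t& masked_count) {
    std::vector<float> bias(vocab.size(), 0.0f);
    masked_count = 0;
    for (std::size_t i = 0; i < vocab.size(); ++i) {
        const TokenInfo& t = vocab[i];
        // EOG tokens stay exempt so natural stops keep firing; control-flagged
        // tokens are assumed to be handled by the pre-existing mask.
        if (t.is_eog || t.is_control) continue;
        if (looksLikeScaffolding(t.piece)) {
            bias[i] = -std::numeric_limits<float>::infinity();
            ++masked_count;
        }
    }
    return bias;
}

// Per-step pass: one O(vocab) scan of the raw logits row, reporting whether
// the single most-likely token is an end-of-generation token.
inline bool argmaxIsEog(const std::vector<float>& logits,
                        const std::vector<TokenInfo>& vocab) {
    assert(logits.size() == vocab.size());
    if (logits.empty()) return false;
    std::size_t best = 0;
    for (std::size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) best = i;
    }
    return vocab[best].is_eog;
}
```

In this sketch a caller would build the bias once per model load and call argmaxIsEog once per decoded position before sampling mutates or discards the row; a consumer that opts in could then stop generation on the first step where the flag is true, even if the sampler drew a different token.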