fix(decode): ship the confidence floor OFF by default (#688 regression) #693
Merged
#688 enabled a -1.5 mean-logprob confidence floor by default. It won an offline eval sweep, but that golden set under-represented real typing: production logs show the same floor withholding ~56% of completions, most of them perfectly usable, and under streaming it paints a partial then clears it (suggestions appear to flicker away). It also forces per-token logprob computation, adding latency. Set defaultConfidenceFloor to -.infinity so the gate (and its logprob cost) are off unless opted into via cotabbyConfidenceFloorOverride. The eval harness sets that key to measure with the gate on. Recalibrate against a representative real-usage distribution before re-enabling by default.
Summary
#688 turned on a -1.5 mean-per-token-logprob confidence floor by default, which silently suppresses completions the model was "unsure" about. It won an offline eval sweep, but that golden set badly under-represented real free-form typing. In production logs the same floor withheld ~56% of completions (the lowConfidence suppression path) the moment a build with it started running.
Most suppressed completions were perfectly usable: passed and dropped completions sat on either side of -1.5 with no quality difference. Under streaming the gate also paints a partial completion and then clears it, which reads as suggestions flickering away. The floor additionally forces per-token logprob computation, adding latency.
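As a rough illustration of what a mean-per-token-logprob floor does, here is a minimal sketch. It is not the code in LlamaSuggestionEngine.swift: the names meanLogprob, shouldSuppress, and confidenceFloor are hypothetical, and the real gate may treat empty or partial completions differently.

```swift
import Foundation

/// Hypothetical helper: arithmetic mean of per-token log-probabilities.
/// Returns nil for an empty completion so the caller decides what to do.
func meanLogprob(_ tokenLogprobs: [Double]) -> Double? {
    guard !tokenLogprobs.isEmpty else { return nil }
    return tokenLogprobs.reduce(0, +) / Double(tokenLogprobs.count)
}

/// Hypothetical gate: returns true when the completion should be withheld.
/// The logprobs are passed lazily (@autoclosure) so that a floor of
/// -infinity skips the per-token logprob work entirely.
func shouldSuppress(tokenLogprobs: @autoclosure () -> [Double],
                    confidenceFloor: Double) -> Bool {
    // Gate off: nothing can score below -infinity, so do not compute anything.
    if confidenceFloor == -Double.infinity { return false }
    // Assumption: an empty completion is not suppressed by this gate.
    guard let mean = meanLogprob(tokenLogprobs()) else { return false }
    return mean < confidenceFloor
}

// Illustrative values only: the mean of these three logprobs is -1.7.
let logprobs = [-0.4, -2.9, -1.8]
print(shouldSuppress(tokenLogprobs: logprobs, confidenceFloor: -1.5))             // true
print(shouldSuppress(tokenLogprobs: logprobs, confidenceFloor: -Double.infinity)) // false
```

With the floor at -1.5, a completion whose mean sits just below the threshold is dropped even when it would have been usable, which is the behaviour the production logs exposed.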
This sets defaultConfidenceFloor to -.infinity, so the gate and its logprob cost are off unless explicitly opted into via the existing cotabbyConfidenceFloorOverride user default. The eval harness still sets that key to measure with the gate on. The floor should be recalibrated against a representative real-usage distribution before being re-enabled by default.
Validation
Added test_confidenceFloor_shippedOff_byDefault to lock the off-by-default decision (asserts both the constant and the resolved value are -.infinity).
Linked issues
Risk / rollout notes
defaultConfidenceFloor: -1.5 → -.infinity. No schema/settings migration. The lowConfidence suppression path, the override key, and the eval harness wiring are all untouched and still function when the override is set.
main: the confidence gate stops firing for everyone by default, so coverage goes back up (no more vanishing/flickering suggestions) and the per-token logprob cost is removed.
Greptile Summary
Reverts the confidence gate from an aggressive on-by-default -1.5 floor (introduced in #688) back to disabled (-.infinity) after production data showed it suppressing ~56% of completions with no measurable quality benefit.
LlamaSuggestionEngine.swift: defaultConfidenceFloor changed from -1.5 to -.infinity; the doc comment is updated with the production evidence that motivated the revert, the eval context, and the opt-in path via cotabbyConfidenceFloorOverride.
LlamaDecodeGateDefaultsTests.swift: New test_confidenceFloor_shippedOff_byDefault regression-locks both the raw constant and resolvedConfidenceFloor to -.infinity, so any future accidental re-enable of the default will surface immediately.
Confidence Score: 5/5
Safe to merge — one constant flipped to -.infinity, all other paths untouched.
The change is a single-constant revert backed by production log evidence and a dedicated regression-lock test. The suppression path, the override key, and the eval harness wiring are all unchanged. The new test correctly uses XCTAssertEqual with -.infinity, which compares exactly under IEEE 754. No logic paths were added or removed.
No files require special attention.
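The resolution and the exact-equality point above can be sketched as follows. The constant name, the key, and the function name come from this PR, but the function body is an assumption rather than the project's implementation, and the suite name is hypothetical. The sketch uses plain assert instead of XCTAssertEqual so it runs as a standalone script.

```swift
import Foundation

// Names taken from the PR; the values and logic below are a hedged sketch.
let defaultConfidenceFloor: Double = -.infinity
let overrideKey = "cotabbyConfidenceFloorOverride"

/// Assumed resolution: a stored override wins, otherwise the shipped default.
/// Note: double(forKey:) yields 0 if the stored value is not numeric.
func resolvedConfidenceFloor(defaults: UserDefaults) -> Double {
    guard defaults.object(forKey: overrideKey) != nil else {
        return defaultConfidenceFloor
    }
    return defaults.double(forKey: overrideKey)
}

// Use an isolated suite (hypothetical name) so real settings are not touched.
let suite = "confidence-floor-sketch"
let defaults = UserDefaults(suiteName: suite)!
defaults.removePersistentDomain(forName: suite)

// No override set: both the constant and the resolved value are exactly -infinity.
// Infinity compares equal to itself under IEEE 754, so no tolerance is needed.
assert(defaultConfidenceFloor == -Double.infinity)
assert(resolvedConfidenceFloor(defaults: defaults) == -Double.infinity)

// No finite score is below -infinity, so the gate can never fire by default.
assert(!(-Double.greatestFiniteMagnitude < -Double.infinity))

// Opt-in path: an override such as the eval harness's -0.8 takes precedence.
defaults.set(-0.8, forKey: overrideKey)
assert(resolvedConfidenceFloor(defaults: defaults) == -0.8)

defaults.removePersistentDomain(forName: suite)
```

Locking both the constant and the resolved value matters because either one could drift independently: the constant through an edit, the resolved value through a change in the override lookup.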
Important Files Changed
LlamaSuggestionEngine.swift: Changes defaultConfidenceFloor from -1.5 to -.infinity (disabling the gate by default), with an updated doc comment explaining the production regression. No logic changes beyond the constant.
LlamaDecodeGateDefaultsTests.swift: Adds test_confidenceFloor_shippedOff_byDefault to lock both the constant and the resolved value at -.infinity, preventing silent reversion.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[makeGenerationOptions] --> B[resolvedConfidenceFloor]
    B --> C{cotabbyConfidenceFloorOverride set in UserDefaults?}
    C -- No --> D["defaultConfidenceFloor\n(-.infinity ← this PR)\nwas: -1.5"]
    C -- Yes --> E[Use override value\ne.g. -0.8 for eval harness]
    D --> F{floor == -.infinity?}
    E --> F
    F -- Yes --> G[Gate OFF: no logprob computation, all completions passed through]
    F -- No --> H[Gate ON: per-token logprob computed, completions below floor suppressed]
Reviews (1): Last reviewed commit: "fix(decode): ship the confidence floor O..."