fix(decode): ship the confidence floor OFF by default (#688 regression) #693

Merged: FuJacob merged 1 commit into main from fix/confidence-floor-off-by-default on Jun 12, 2026
Conversation

@FuJacob FuJacob commented Jun 12, 2026

Owner

Summary

#688 turned on a -1.5 mean-per-token-logprob confidence floor by default, which silently suppresses completions the model was "unsure" about. It won an offline eval sweep, but that golden set badly under-represented real free-form typing. In production logs the same floor withheld ~56% of completions the moment a build with it started running:

| window | generations | suppressed lowConfidence |
|---|---|---|
| 06-11 (build without it active) | 434 | 0% |
| 06-12 (rebuilt off latest main) | 97 | 56% |

Most suppressed completions were perfectly usable (passed and dropped completions sat on either side of -1.5 with no quality difference). Under streaming the gate also paints a partial and then clears it, which reads as suggestions flickering away. The floor additionally forces per-token logprob computation, adding latency.

This sets defaultConfidenceFloor to -.infinity, so the gate and its logprob cost are off unless explicitly opted into via the existing cotabbyConfidenceFloorOverride default. The eval harness still sets that key to measure with the gate on. The floor should be recalibrated against a representative real-usage distribution before being re-enabled by default.
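
The default-plus-override resolution described above could look roughly like the sketch below. This is not the code from LlamaSuggestionEngine.swift: the `ConfidenceFloorSketch` enum, the `Double` type, and the way the key is read from `UserDefaults` are assumptions made for illustration. Only the names `defaultConfidenceFloor`, `resolvedConfidenceFloor`, and `cotabbyConfidenceFloorOverride` come from this PR.

```swift
import Foundation

// Hypothetical stand-in for the relevant part of the engine; the real
// implementation is not shown in this PR description.
enum ConfidenceFloorSketch {
    /// Key name taken from the PR text; how the app reads it is assumed here.
    static let overrideKey = "cotabbyConfidenceFloorOverride"

    /// Shipped default after this PR: negative infinity, i.e. gate off.
    static let defaultConfidenceFloor: Double = -.infinity

    /// Returns the override when the key is present, otherwise the default.
    static func resolvedConfidenceFloor(defaults: UserDefaults = .standard) -> Double {
        // object(forKey:) lets us tell "unset" apart from an explicit 0.0,
        // which double(forKey:) alone cannot do (it returns 0 for both).
        guard let number = defaults.object(forKey: overrideKey) as? NSNumber else {
            return defaultConfidenceFloor
        }
        return number.doubleValue
    }
}

// Usage: an isolated suite keeps the sketch from touching real app settings.
let sketchSuite = UserDefaults(suiteName: "confidence-floor-sketch")!
sketchSuite.removeObject(forKey: ConfidenceFloorSketch.overrideKey)
let shipped = ConfidenceFloorSketch.resolvedConfidenceFloor(defaults: sketchSuite)

sketchSuite.set(-1.5, forKey: ConfidenceFloorSketch.overrideKey)  // opt in, as the eval harness does
let optedIn = ConfidenceFloorSketch.resolvedConfidenceFloor(defaults: sketchSuite)
sketchSuite.removeObject(forKey: ConfidenceFloorSketch.overrideKey)
```

Under these assumptions `shipped` is `-.infinity` and `optedIn` is `-1.5`, which matches the "off unless explicitly opted into" behavior described above.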

Validation

xcodebuild ... build -derivedDataPath build/DerivedData      # ** BUILD SUCCEEDED **
xcodebuild ... test -only-testing:CotabbyTests/LlamaDecodeGateDefaultsTests \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests \
  -only-testing:CotabbyTests/ModelAndPresentationValueTests \
  -only-testing:CotabbyTests/SuggestionQualityMetricsStoreTests \
  CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO          # ** TEST SUCCEEDED **
swiftlint lint --quiet                                       # exit 0

Added test_confidenceFloor_shippedOff_byDefault to lock the off-by-default decision (asserts both the constant and the resolved value are -.infinity).

Linked issues

Risk / rollout notes

  • Pure default change: defaultConfidenceFloor -1.5 → -.infinity. No schema/settings migration. The lowConfidence suppression path, the override key, and the eval harness wiring are all untouched and still function when the override is set.
  • Behavior change vs. current main: the confidence gate stops firing for everyone by default, so coverage goes back up (no more vanishing/flickering suggestions) and the per-token logprob cost is removed.

Greptile Summary

Reverts the confidence gate from an aggressive on-by-default -1.5 floor (introduced in #688) back to disabled (-.infinity) after production data showed it suppressing ~56% of completions with no measurable quality benefit.

  • LlamaSuggestionEngine.swift: defaultConfidenceFloor changed from -1.5 to -.infinity; the doc comment is updated with the production evidence that motivated the revert, the eval context, and the opt-in path via cotabbyConfidenceFloorOverride.
  • LlamaDecodeGateDefaultsTests.swift: New test_confidenceFloor_shippedOff_byDefault regression-locks both the raw constant and resolvedConfidenceFloor to -.infinity, so any future accidental re-enable of the default will surface immediately.

Confidence Score: 5/5

Safe to merge — one constant flipped to -.infinity, all other paths untouched.

The change is a single-constant revert backed by production log evidence and a dedicated regression-lock test. The suppression path, the override key, and the eval harness wiring are all unchanged. The new test correctly uses XCTAssertEqual with -.infinity, which compares exactly under IEEE 754. No logic paths were added or removed.
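
The exact-comparison point can be checked in isolation. The snippet below is a small standalone illustration, not the project's test. The helper names are hypothetical, and it assumes the floor is a `Double` and that suppression uses a strict `<` comparison.

```swift
// Hypothetical helpers for illustration only.

/// True when the floor is exactly the "off" sentinel.
func isGateOff(_ floor: Double) -> Bool {
    // Infinities of the same sign compare equal under IEEE 754,
    // so no tolerance is needed for this check.
    return floor == -.infinity
}

/// True when a mean logprob falls strictly below the floor.
func isBelowFloor(_ mean: Double, floor: Double) -> Bool {
    return mean < floor
}

let offFloor = -Double.infinity
let sentinelMatches = isGateOff(offFloor)
// No finite value is below negative infinity, so nothing can be suppressed.
let hugeNegativeBelow = isBelowFloor(-Double.greatestFiniteMagnitude, floor: offFloor)
// Ordered comparisons involving NaN are always false.
let nanBelow = isBelowFloor(.nan, floor: offFloor)
```

Here `sentinelMatches` is `true` while `hugeNegativeBelow` and `nanBelow` are both `false`, which is why `-.infinity` works as a sentinel for "never suppress".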

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| Cotabby/Services/Runtime/LlamaSuggestionEngine.swift | Changed defaultConfidenceFloor from -1.5 to -.infinity (disabling the gate by default), with updated doc comment explaining the production regression. No logic changes beyond the constant. |
| CotabbyTests/LlamaDecodeGateDefaultsTests.swift | Added test_confidenceFloor_shippedOff_byDefault to lock both the constant and the resolved value at -.infinity, preventing silent reversion. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[makeGenerationOptions] --> B[resolvedConfidenceFloor]
    B --> C{cotabbyConfidenceFloorOverride set in UserDefaults?}
    C -- No --> D["defaultConfidenceFloor\n(-.infinity  ← this PR)\nwas: -1.5"]
    C -- Yes --> E[Use override value\ne.g. -0.8 for eval harness]
    D --> F{floor == -.infinity?}
    E --> F
    F -- Yes --> G[Gate OFF: no logprob computation, all completions passed through]
    F -- No --> H[Gate ON: per-token logprob computed, completions below floor suppressed]
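
The decision at node F in the flowchart can be sketched as a single function. This is an illustration under stated assumptions rather than the engine's actual code: `shouldSuppress` and `computeLogprobs` are hypothetical names, the logprobs are assumed to be `Double` values, and the handling of an empty completion is a guess.

```swift
/// Returns true when a completion should be withheld as low-confidence.
/// `computeLogprobs` is passed as a closure so the per-token logprob work
/// only happens when the gate is actually on.
func shouldSuppress(floor: Double, computeLogprobs: () -> [Double]) -> Bool {
    // Gate OFF: skip the logprob computation and pass everything through.
    guard floor != -.infinity else { return false }

    let logprobs = computeLogprobs()
    // Assumption: an empty completion has no mean and is not suppressed here.
    guard !logprobs.isEmpty else { return false }

    // Gate ON: compare the mean per-token logprob against the floor.
    let mean = logprobs.reduce(0, +) / Double(logprobs.count)
    return mean < floor
}

// Usage: count how often the (potentially expensive) logprob closure runs.
var logprobCalls = 0
let sampleLogprobs: () -> [Double] = {
    logprobCalls += 1
    return [-0.5, -2.0, -3.0]  // mean is about -1.83
}

let suppressedWhenOff = shouldSuppress(floor: -.infinity, computeLogprobs: sampleLogprobs)
let callsAfterOff = logprobCalls
let suppressedAtOldDefault = shouldSuppress(floor: -1.5, computeLogprobs: sampleLogprobs)
```

With the floor at `-.infinity` the closure is never invoked, which mirrors the latency saving the PR describes. With the previous `-1.5` floor, the same sample completion would be suppressed.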

Reviews (1): Last reviewed commit: "fix(decode): ship the confidence floor O..."

#688 enabled a -1.5 mean-logprob confidence floor by default. It won an offline
eval sweep, but that golden set under-represented real typing: production logs
show the same floor withholding ~56% of completions, most of them perfectly
usable, and under streaming it paints a partial then clears it (suggestions
appear to flicker away). It also forces per-token logprob computation, adding
latency.

Set defaultConfidenceFloor to -.infinity so the gate (and its logprob cost) are
off unless opted into via cotabbyConfidenceFloorOverride. The eval harness sets
that key to measure with the gate on. Recalibrate against a representative
real-usage distribution before re-enabling by default.
@FuJacob FuJacob merged commit f4d2db1 into main Jun 12, 2026
4 checks passed
@FuJacob FuJacob deleted the fix/confidence-floor-off-by-default branch June 12, 2026 05:43