Decode gates, quality telemetry, adaptive debounce by FuJacob · Pull Request #688 · FuJacob/cotabby

FuJacob · 2026-06-12T02:45:28Z

Summary

Stacked on #686. Adds the decode-side "showing nothing beats showing garbage" gates, the local counters that prove whether they help, and a latency-keyed debounce. Consumes the engine's new decode-quality primitives (cotabbyinference#10, on middleware main).

Confidence floor, shipped at -1.5. The runtime returns a typed output (text, mean per-token log-probability, withheld flag) so a confidence-suppressed completion is attributed as lowConfidence instead of "the model produced nothing". The -1.5 default came from a nine-point eval sweep (table below). Enabling the floor turns on the per-token logprob work that the energy pass had gated off; that cost is measured and bounded (eval p50 +13ms), and cotabbyConfidenceFloorOverride adjusts the floor or disables it (-inf) without a rebuild.
argmax-EOG early stop. The decode loop stops the moment the raw distribution's most-likely next token is end-of-generation (engine computes it while the logits row is hot; cotabbyinference#10). At temperature 0.1 sampling is near-greedy, so the eval shows no delta; this is tail insurance for the cases where the dist sampler draws past the model's intended stop. cotabbyArgmaxStopDisabled switches it off.
Quality telemetry (always on, counters only). Generations, shown, withheld-by-reason histogram (normalizer, confidence floor, seam guard), and accepted suggestions (once per suggestion, so word-by-word Tab walks do not inflate the rate), persisted across restarts and shown in the Performance pane with a reset control. Acceptance rate over the suppression histogram is the on-device ground truth for whether any of this stack actually helps.
Adaptive debounce: 15/25/55ms keyed to the last generation latency, configured value as fallback. Snappier on fast machines, calmer on slow ones.
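
As a rough illustration of the typed output and the floor check described in the first item, here is a minimal sketch. The type name GenerationOutput, its field names, and makeOutput are hypothetical stand-ins, not the PR's LlamaGenerationOutput; the sketch only assumes that the floor compares against the mean per-token log-probability.

```swift
// Hypothetical stand-in for the PR's typed runtime output; field names are assumptions.
struct GenerationOutput: Equatable {
    var text: String
    var meanLogprob: Double?   // nil when no tokens were generated
    var withheld: Bool
}

/// Builds the output from per-token log-probabilities and withholds the text
/// when the mean falls below `floor`. A floor of -infinity never fires.
func makeOutput(text: String, tokenLogprobs: [Double], floor: Double) -> GenerationOutput {
    guard !tokenLogprobs.isEmpty else {
        return GenerationOutput(text: text, meanLogprob: nil, withheld: false)
    }
    let mean = tokenLogprobs.reduce(0, +) / Double(tokenLogprobs.count)
    let withheld = mean < floor
    return GenerationOutput(text: withheld ? "" : text, meanLogprob: mean, withheld: withheld)
}
```

Because the withheld flag travels with the text, a caller can attribute an empty result to the floor rather than to the model.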

Validation

xcodebuild build-for-testing ... CODE_SIGNING_ALLOWED=NO   # ** TEST BUILD SUCCEEDED **
xcodebuild test-without-building ...                       # FULL suite: 1128 tests, 1123 passed, 0 failed, 5 skipped (gated evals)
swiftlint lint --quiet <changed files>                     # exit 0
xcodegen generate                                          # pbxproj additions only

Confidence-floor sweep (eval, Gemma E2B Q6_K, 117 cases, fixed seed; floor read at runtime via the override key so the sweep is rebuild-free):

floor	qualityScore	precisionWhenShown	coverage	wrongShowRate
off / -4 / -3 / -2.5 / -2	0.734	0.780	0.866	0.188
-1.5 (shipped)	0.744	0.820	0.794	0.137
-1.0	0.624	0.849	0.474	0.068
-0.5	0.459	0.857	0.124	0.017
-0.3	0.407	1.000	0.031	0.000

Floors at or below -2 never fire on this model at temperature 0.1 (the model is confident even about garbage); -1.5 is the unique point where the composite rose while wrong-shows fell 27% relative and no must-show case was lost. A -0.05 sanity floor suppressed 117/117, proving the wiring end to end. Latency at -1.5: p50 187→200ms, p95 1036→1124ms (the logprob passes; engine-side cost is two O(vocab) scans per token).

Linked issues

Refs #486 (quality/context usage now visible in the Performance window), #546 (adaptive debounce stops thrashing slow machines).

Risk / rollout notes

LlamaRuntimeGenerating.generate now returns LlamaGenerationOutput instead of String (internal protocol; both conformers and the test fake updated).
The floor turns per-token logprob computation back on for the llama path. If field reports blame it, defaults write com.jacobfu.tabby cotabbyConfidenceFloorOverride -float -1e9 restores the old posture without a build; telemetry's lowConfidence count shows exactly how often the gate fires in real use.
Quality counters write one small UserDefaults blob per suggestion event; no content, no timestamps beyond a first-recorded date.
New SuggestionQualityMetricsStore carries nonisolated deinit {} for the known app-hosted-test double-free; without it the store's tests crash with SIGABRT in full-suite runs (reproduced, fixed, full suite green).
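
The defaults-write escape hatch above could be read along these lines. This is a sketch under assumptions: the function name resolvedConfidenceFloor and the "absent key means shipped value" rule are illustrative, and only the key name cotabbyConfidenceFloorOverride comes from the PR text.

```swift
import Foundation

/// Hypothetical resolver: reads the floor override from the given defaults,
/// falling back to the shipped value when the key is absent.
func resolvedConfidenceFloor(defaults: UserDefaults, shipped: Double = -1.5) -> Double {
    let key = "cotabbyConfidenceFloorOverride"   // key name taken from the PR text
    guard defaults.object(forKey: key) != nil else { return shipped }
    return defaults.double(forKey: key)
}
```

Passing the UserDefaults instance in, rather than reading the standard defaults directly, is what lets a test exercise the override against an isolated suite.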

Greptile Summary

This PR adds four new features on top of #686: a confidence floor gate and argmax-EOG early-stop at the decode layer, always-on quality telemetry counters (generated / shown / withheld-by-reason / accepted), and a latency-keyed adaptive debounce (15/25/55 ms). It also adds surface-context conditioning for the base model and a host-font caret advance measurement that fixes the overlay slide accuracy.

Decode gates: LlamaRuntimeCore breaks the token loop before appending when argmax_is_eog; LlamaSuggestionEngine routes confidence-suppressed completions to a SuggestionNormalizationResult(.lowConfidence) so the coordinator attributes them correctly instead of reading as empty-model output. resolvedStopAtArgmaxEOG now accepts injectable UserDefaults, closing the testability gap noted in the prior review.
Quality telemetry: SuggestionQualityMetricsStore is a lean @MainActor ObservableObject persisting lifetime counters via a JSON UserDefaults blob; each coordinator gate records its own named suppression reason, and SuggestionEngineRouter.recordQualityOutcome is the single accounting point for engine-level outcomes.
Adaptive debounce + surface context: DebouncePolicy selects a debounce window from the last observed generation latency (fallback to configured value before first data); SurfaceContextComposer produces a sanitized writing-surface descriptor that suppresses itself for code editors and terminals.
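
The argmax-EOG stop in the first item can be sketched as a loop that breaks before appending. DecodeStep and decode are hypothetical; the real engine exposes its own primitives, and this only illustrates that the discarded token contributes to neither the text nor the log-probability average.

```swift
// Hypothetical per-step record; the real engine exposes its own primitives.
struct DecodeStep {
    var piece: String        // text of the sampled token
    var logprob: Double      // log-probability of the sampled token
    var argmaxIsEOG: Bool    // raw distribution's most-likely token is end-of-generation
}

/// Consumes steps until one reports argmax-EOG, excluding that step from both
/// the text and the log-probability sum.
func decode(_ steps: [DecodeStep]) -> (text: String, meanLogprob: Double?) {
    var text = ""
    var sumLogprob = 0.0
    var count = 0
    for step in steps {
        if step.argmaxIsEOG { break }   // stop before appending the sampled token
        text += step.piece
        sumLogprob += step.logprob
        count += 1
    }
    return (text, count > 0 ? sumLogprob / Double(count) : nil)
}
```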

Confidence Score: 5/5

Safe to merge; no new defects introduced on any path changed by this PR.

All four features are cleanly isolated, covered by targeted tests, and validated against a full 1128-test suite. The argmax-EOG token is correctly excluded from both the generated text and the sumLogprob accumulation before the confidence average is computed. The resolvedStopAtArgmaxEOG injection gap called out in the previous review is fixed. The suppression accounting is structurally sound for results that pass through only one accounting layer; concerns about results that simultaneously hit an engine-level and a lifecycle gate were covered in the prior review and are pre-existing rather than newly introduced here.

No files require special attention; all changes are internally consistent and well-tested.

Important Files Changed

Filename	Overview
Cotabby/Models/SuggestionQualityMetricsStore.swift	New always-on quality counter store (generated/shown/suppressed-by-reason/accepted), persisted via UserDefaults JSON; correct @MainActor isolation with the standard nonisolated-deinit workaround.
Cotabby/Services/Runtime/LlamaSuggestionEngine.swift	Adds confidence floor and argmax-EOG stop wiring; resolvedStopAtArgmaxEOG now accepts injectable UserDefaults (fixes the testability gap flagged in the previous review), and LlamaGenerationOutput carries suppression attribution correctly.
Cotabby/Services/Runtime/LlamaRuntimeCore.swift	Decode loop gains the argmax-EOG early stop (break before extractPiece so the discarded token is excluded from both generatedText and sumLogprob) and returns LlamaGenerationOutput with averaged logprob and confidence-suppressed flag.
Cotabby/App/Coordinators/SuggestionCoordinator+Prediction.swift	All coordinator early-exit gates now record their own suppression reasons; the empty-result gate guards against double-counting engine-attributed reasons; recordShown placed correctly at the single path where a generation becomes visible.
Cotabby/App/Coordinators/SuggestionCoordinator+Acceptance.swift	Acceptance path records accepted suggestion on first chunk; InsertedTextAdvance provides host-font caret measurement for overlay sliding; predictedCaretRect now takes fieldStyle for improved accuracy.
Cotabby/Services/Runtime/SuggestionEngineRouter.swift	Single recordQualityOutcome method handles generated + engine-attributed suppression accounting for all engine paths; qualityMetricsStore correctly injected.
Cotabby/Support/DebouncePolicy.swift	Simple latency-keyed debounce (15/25/55ms), nonisolated enum, clean guard on nil/zero latency; the configured value is explicitly the pre-data fallback per the documented design.
Cotabby/Support/SurfaceContextComposer.swift	New surface-context composer with clear omission-beats-noise invariants; sanitizedTitle suffix-stripping uses .anchored+.backwards correctly; domain extraction strips path/query/www; no prompt injection via controlChar+quote filtering.
Cotabby/Support/AppSurfaceClassifier.swift	Clean bundle-ID classifier; terminal check runs on the original case before lowercasing for prefix tables; integrated-terminal flag takes precedence per the comment.
Cotabby/Support/InsertedTextAdvance.swift	Host-font caret advance measurement; correct nil guards; falls back to system font with host point size when face name is unavailable, which is better than the fixed 14pt ghost-font fallback it replaces.
Cotabby/Services/Focus/SurfaceContextCache.swift	Session-scoped AX capture cache keyed on PID:elementID:focusSeq; @MainActor isolation with nonisolated-deinit workaround; caches negative results to avoid re-probing unresponsive hosts.
Cotabby/UI/Settings/Panes/PerformancePaneView.swift	New suggestion quality section with counters, acceptance rate, top-4 suppression reasons, and a Reset control; raw enum key display was flagged in the prior outside-diff comment.
Cotabby/Support/SuggestionOverlayStabilityGate.swift	isAwaitingPostInsertionSync short-circuit correctly placed after the text-change check but before geometry checks, preventing the pre-publish stale caret from moving the overlay.
CotabbyTests/SuggestionQualityMetricsStoreTests.swift	Good coverage: accumulation, acceptance rate, cross-instance persistence via isolated UserDefaults suite, and reset. Avoids process-global state.
CotabbyTests/LlamaDecodeGateDefaultsTests.swift	Covers the defaults-write escape hatches for both confidence floor and argmax-stop via isolated UserDefaults suite; resolvedStopAtArgmaxEOG is now testable because the previous injection gap was fixed.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Engine Result] --> R[Router: recordGenerated]
    R --> RS{suppressionReason?}
    RS -- yes --> RSR[recordSuppressed engine reason]
    RS -- no --> C[Coordinator]
    RSR --> C
    C --> SD{Stale drop?}
    SD -- yes --> SDR[recordSuppressed discardedStaleContext]
    SDR --> END1([Drop])
    SD -- no --> ER{result.text empty?}
    ER -- yes, no engine reason --> ERR[recordSuppressed emptyUnattributed]
    ERR --> END2([Drop])
    ER -- yes, has engine reason --> END2
    ER -- no --> ST{Selected text?}
    ST -- yes --> STR[recordSuppressed discardedSelection]
    STR --> END3([Drop])
    ST -- no --> AE{Stale accept echo?}
    AE -- yes --> AER[recordSuppressed discardedAcceptEcho]
    AER --> END4([Drop])
    AE -- no --> SG{Seam guard?}
    SG -- yes --> SGR[recordSuppressed seamMisspelling or seamJunkPunctuationRun]
    SGR --> END5([Drop])
    SG -- no --> SH[recordShown]
    SH --> SHOW([Show overlay])
    SHOW --> ACC{First Tab accept?}
    ACC -- yes, consumedCount == 0 --> ACCR[recordAcceptedSuggestion]
    ACC -- no --> NOOP([Already counted])

Comments Outside Diff (1)

Cotabby/UI/Settings/Panes/PerformancePaneView.swift, lines 902-908

Raw enum key names surfaced verbatim in the UI

topSuppressionReasons maps the raw suppressedByReason dictionary keys directly to the displayed string. Those keys are CompletionSuppressionReason raw values ("lowConfidence", "emptyGeneration", "seamMisspelling") plus coordinator-level strings ("seamJunkPunctuationRun"). A non-developer user reading the Performance pane would see "lowConfidence 12, seamJunkPunctuationRun 3" with no indication of what those mean. A small display-name map (or at least title-casing / spacing) would make the section usable outside of debugging contexts.
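
One way the suggested display-name map might look, as a sketch only: the function name and the labels are illustrative, and the fallback simply spaces out camelCase keys that have no explicit label.

```swift
/// Hypothetical display-name lookup; the labels are illustrative, not the PR's.
func displayName(forSuppressionReason rawKey: String) -> String {
    let known: [String: String] = [
        "lowConfidence": "Low confidence",
        "emptyGeneration": "Empty generation",
        "seamMisspelling": "Misspelling at seam",
        "seamJunkPunctuationRun": "Junk punctuation at seam",
    ]
    if let label = known[rawKey] { return label }
    // Fallback: split camelCase into spaced words and capitalize the first letter.
    var spaced = ""
    for ch in rawKey {
        if ch.isUppercase && !spaced.isEmpty { spaced.append(" ") }
        spaced.append(contentsOf: spaced.isEmpty ? ch.uppercased() : ch.lowercased())
    }
    return spaced
}
```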

Reviews (4): Last reviewed commit: "ci: retrigger checks after force-push sy..."

greptile-apps · 2026-06-12T03:33:21Z

        }
    }

+    /// Marks the session's suggestion accepted in the quality counters, once per suggestion: only
+    /// the first chunk counts, so word-by-word walks of one suggestion add nothing further and the
+    /// acceptance rate stays suggestions-accepted over suggestions-shown.
+    private func recordSuggestionAcceptedIfFirstChunk(of session: ActiveSuggestionSession) {
+        guard session.consumedCharacterCount == 0 else { return }


Correction acceptances inflate acceptedSuggestions without a matching shown increment

recordSuggestionAcceptedIfFirstChunk only guards on consumedCharacterCount == 0; it fires for any session kind, including .correction(typoWord:). presentCorrection never calls recordShown(), so every accepted correction increments acceptedSuggestions without a corresponding shown entry. Since acceptanceRate = acceptedSuggestions / shown, accepted corrections can push the rate above 100% whenever they outnumber the shown completions that went unaccepted (for example, one shown completion accepted plus one accepted correction gives 2/1), making the metric actively misleading.

Add guard case .continuation = session.kind else { return } before the recordAcceptedSuggestion() call to restrict the counter to generated completions only.
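
A simplified sketch of the suggested guard follows. SessionKind, Session, and Counters are hypothetical reductions of the PR's types, kept only detailed enough to show that corrections and later chunks are both excluded.

```swift
// Hypothetical, simplified session and counter types for illustration only.
enum SessionKind { case continuation, correction(typoWord: String) }

struct Session {
    var kind: SessionKind
    var consumedCharacterCount: Int
}

struct Counters {
    var shown = 0
    var acceptedSuggestions = 0

    /// Counts an acceptance once per generated completion: corrections never
    /// reach the `shown` counter, so they are excluded here as well.
    mutating func recordAcceptedIfFirstChunk(of session: Session) {
        guard session.consumedCharacterCount == 0 else { return }
        guard case .continuation = session.kind else { return }
        acceptedSuggestions += 1
    }
}
```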

greptile-apps · 2026-06-12T03:47:05Z

        guard liveContext.generation == result.generation else {

            latestRawModelOutput = SuggestionDebugLogger.debugPreview(result.rawText)
+            // Lifecycle discards are counted under their own reasons so `generated` always equals
+            // `shown` plus the suppression histogram; without this, every drop here silently


Double-counted suppressions when engine suppression meets coordinator discard

The stale-drop, selected-text, and stale-accept-echo gates unconditionally call recordSuppressed, but the router already called recordSuppressed(reason: result.suppressionReason) for the same result whenever the engine attributed a suppression (e.g., "lowConfidence" or a normalizer reason). Any result that was engine-suppressed AND then discarded by a coordinator gate (common during fast typing at ~187 ms p50 latency, where ~21% of results hit the confidence floor and stale-drop fires frequently) will increment suppressedTotal twice, breaking the invariant stated in the adjacent comment ("generated always equals shown plus the suppression histogram"). The empty-result gate already guards with if result.suppressionReason == nil — the three lifecycle-discard gates need the same guard.
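
The invariant and the requested guard can be illustrated with a small ledger. QualityLedger and its method names are hypothetical; the reason strings follow the names used in this review.

```swift
// Hypothetical ledger; reason strings follow the names used in the review.
struct QualityLedger {
    var generated = 0
    var shown = 0
    var suppressedByReason: [String: Int] = [:]

    var suppressedTotal: Int { suppressedByReason.values.reduce(0, +) }
    var isBalanced: Bool { generated == shown + suppressedTotal }

    /// Router side: every finished result counts as generated, plus the engine's reason if any.
    mutating func recordEngineOutcome(engineReason: String?) {
        generated += 1
        if let reason = engineReason { suppressedByReason[reason, default: 0] += 1 }
    }

    /// Coordinator side: a lifecycle discard is recorded only when the engine
    /// has not already attributed a suppression to the same result.
    mutating func recordLifecycleDiscard(_ reason: String, engineReason: String?) {
        guard engineReason == nil else { return }
        suppressedByReason[reason, default: 0] += 1
    }
}
```

With the guard in place, an engine-suppressed result that is later dropped as stale is counted once, so generated stays equal to shown plus the histogram total.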

FuJacob · 2026-06-12T03:55:21Z

Merging on local CI-equivalent validation: GitHub Actions did not register runs for this branch's pushes (repo Actions is otherwise healthy; sibling PRs triggered normally), so the full suite was run locally on the exact head tree: build-for-testing succeeded, 1542 tests with 0 failures, SwiftLint clean, xcodegen drift-free. Review fixes included: the quality ledger now balances at every lifecycle exit, and the argmax toggle gained an injectable-defaults seam with tests.

The runtime now returns a typed output (text, average logprob, withheld flag) instead of a bare string, so a confidence-suppressed completion is attributed as lowConfidence rather than reading as 'the model produced nothing'. The shipped floor of -1.5 mean per-token log-probability came from an eval sweep over nine values: floors at or below -2 never fire on this model at temperature 0.1, -1 and tighter buy precision at a brutal coverage cost, and -1.5 is the unique point where the composite quality score rose (0.734 to 0.744), wrong-shows fell 27% relative (0.188 to 0.137), and zero must-show cases were lost. Enabling the floor turns on per-token logprob computation (eval p50 187ms to 200ms); cotabbyConfidenceFloorOverride adjusts it without a rebuild, and -infinity restores the old posture entirely. The decode loop also stops the moment the raw distribution's most-likely next token is end-of-generation (computed by the engine while the logits row is hot). At temperature 0.1 sampling is near-greedy so the eval shows no delta; the stop exists for the sampling tail where the dist sampler draws past the model's intended stop. cotabbyArgmaxStopDisabled switches it off.

…ce pane readout Local lifetime counters answering 'is quality improving for real use': generations, suggestions shown, why withheld ones were withheld (reason histogram spanning the normalizer, the confidence floor, and the seam guard), and how many shown suggestions were accepted (counted once per suggestion, so word-by-word walks do not inflate the rate). The router counts generation outcomes because it is the single point every finished result passes through; the coordinator records the display-time and acceptance events only it can see. Counters carry no content, so unlike the per-request latency log there is no opt-in gate; the Performance pane shows the counts, acceptance rate, and top withhold reasons with a reset control, indexed for settings search. Acceptance rate over the suppression histogram is the on-device ground truth that decides whether future decode changes actually help.
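
A content-free counter blob persisted through UserDefaults might be sketched as below. The struct, the storage key, and the helper functions are assumptions for illustration; the actual SuggestionQualityMetricsStore is an ObservableObject and may differ in shape.

```swift
import Foundation

// Hypothetical counter blob; the storage key and field names are assumptions.
struct QualityCounters: Codable, Equatable {
    var generated = 0
    var shown = 0
    var acceptedSuggestions = 0
    var suppressedByReason: [String: Int] = [:]

    /// Accepted over shown; nil until something has been shown.
    var acceptanceRate: Double? {
        shown > 0 ? Double(acceptedSuggestions) / Double(shown) : nil
    }
}

let countersKey = "sketch.qualityCounters"   // illustrative key

func saveCounters(_ counters: QualityCounters, to defaults: UserDefaults) {
    // Encoding a plain Codable struct of ints and strings is not expected to fail.
    if let data = try? JSONEncoder().encode(counters) {
        defaults.set(data, forKey: countersKey)
    }
}

func loadCounters(from defaults: UserDefaults) -> QualityCounters {
    guard let data = defaults.data(forKey: countersKey),
          let decoded = try? JSONDecoder().decode(QualityCounters.self, from: data)
    else { return QualityCounters() }   // missing or unreadable blob: start from zero
    return decoded
}
```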

…n latency A fixed debounce serves two masters badly: on fast hardware it adds avoidable delay before every suggestion, and on slow hardware it lets keystrokes pile doomed generations onto a model that cannot keep up (every cancel still costs decode setup and teardown). The debounce now keys off the most recent generation latency: 15ms when the model answers within 70ms, 25ms within 140ms, 55ms beyond that, with the configured value as the fallback until a first latency exists.
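
The selection rule described here can be sketched as a pure function. DebounceSketch is a hypothetical name, and treating the 70ms and 140ms boundaries as inclusive is an assumption; the thresholds and windows themselves come from the commit message.

```swift
/// Hypothetical mirror of the policy described above; thresholds come from the PR text.
enum DebounceSketch {
    /// Returns the debounce window in milliseconds.
    static func windowMs(lastLatencyMs: Double?, configuredMs: Double) -> Double {
        // No usable latency yet: fall back to the configured value.
        guard let latency = lastLatencyMs, latency > 0 else { return configuredMs }
        if latency <= 70 { return 15 }    // fast machine: shortest window
        if latency <= 140 { return 25 }
        return 55                         // slow machine: let typing settle first
    }
}
```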

…max toggle The stale-drop, unattributed-empty, selected-text, and accept-echo exits returned without recording either shown or suppressed, so the generated counter silently outgrew the others; each now records a lifecycle discard reason (engine-attributed empties stay counted by the router alone). The argmax-EOG toggle now mirrors the confidence floor's injectable-defaults seam, with tests covering both escape hatches against an isolated suite.

…n) (#693)

greptile-apps Bot reviewed Jun 12, 2026

Comment thread Cotabby/Services/Runtime/SuggestionEngineRouter.swift

Comment thread Cotabby/Services/Runtime/LlamaSuggestionEngine.swift Outdated

FuJacob mentioned this pull request Jun 12, 2026

Responsive lifecycle: anchor reuse cache + speculative post-acceptance prefetch #689

Merged

FuJacob force-pushed the quality/context-relevance branch from b95e651 to d2ea9a1 Compare June 12, 2026 03:22

FuJacob force-pushed the quality/decode-gates-and-telemetry branch from c92d558 to a0b89ef Compare June 12, 2026 03:22

FuJacob force-pushed the quality/context-relevance branch 2 times, most recently from c351bdf to 5623a71 Compare June 12, 2026 03:24

FuJacob force-pushed the quality/decode-gates-and-telemetry branch from a0b89ef to 39d5f2d Compare June 12, 2026 03:27

greptile-apps Bot reviewed Jun 12, 2026

FuJacob force-pushed the quality/decode-gates-and-telemetry branch from 39d5f2d to c1e408f Compare June 12, 2026 03:36

greptile-apps Bot reviewed Jun 12, 2026

FuJacob closed this Jun 12, 2026

FuJacob reopened this Jun 12, 2026

FuJacob added 5 commits June 11, 2026 20:57

ci: retrigger checks after force-push synchronize was dropped

98ca097

FuJacob force-pushed the quality/decode-gates-and-telemetry branch from 8f36e25 to 98ca097 Compare June 12, 2026 03:57

FuJacob changed the base branch from quality/context-relevance to main June 12, 2026 04:59

FuJacob merged commit f2de07d into main Jun 12, 2026

FuJacob mentioned this pull request Jun 12, 2026

fix(decode): ship the confidence floor OFF by default (#688 regression) #693

Merged

FuJacob added a commit that referenced this pull request Jun 12, 2026

fix(decode): ship the confidence floor OFF by default (#688 regressio…

f4d2db1

…n) (#693)

FuJacob merged 5 commits into main from quality/decode-gates-and-telemetry
