
Decode gates, quality telemetry, adaptive debounce #688

Merged: FuJacob merged 5 commits into main from quality/decode-gates-and-telemetry on Jun 12, 2026

Conversation


@FuJacob FuJacob commented Jun 12, 2026

Owner

Summary

Stacked on #686. The decode-side "showing nothing beats showing garbage" gates, the local counters that prove whether they help, and a latency-keyed debounce. Consumes the engine's new decode-quality primitives (cotabbyinference#10, on middleware main).

  1. Confidence floor, shipped at -1.5. The runtime returns a typed output (text, mean per-token log-probability, withheld flag) so a confidence-suppressed completion is attributed as lowConfidence instead of "the model produced nothing". The -1.5 default came from a nine-point eval sweep (table below). Enabling the floor turns on the per-token logprob work the energy pass gated off; that cost is priced and bounded (p50 +13ms), and cotabbyConfidenceFloorOverride adjusts or disables (-inf) without a rebuild.
  2. argmax-EOG early stop. The decode loop stops the moment the raw distribution's most-likely next token is end-of-generation (engine computes it while the logits row is hot; cotabbyinference#10). At temperature 0.1 sampling is near-greedy, so the eval shows no delta; this is tail insurance for the cases where the dist sampler draws past the model's intended stop. cotabbyArgmaxStopDisabled switches it off.
  3. Quality telemetry (always on, counters only). Generations, shown, withheld-by-reason histogram (normalizer, confidence floor, seam guard), and accepted suggestions (once per suggestion, so word-by-word Tab walks do not inflate the rate), persisted across restarts and shown in the Performance pane with a reset control. Acceptance rate over the suppression histogram is the on-device ground truth for whether any of this stack actually helps.
  4. Adaptive debounce: 15/25/55ms keyed to the last generation latency, configured value as fallback. Snappier on fast machines, calmer on slow ones.
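
The first two items can be sketched together as one decode loop. This is a minimal illustration, not the shipped code: `DecodeStep`, `GenerationOutput`, and `decode` are hypothetical names, and the choice to return empty text for a withheld completion is an assumption.

```swift
/// Hypothetical per-step report from the engine (names are illustrative, not the engine's API).
struct DecodeStep {
    let piece: String       // text of the sampled token
    let logprob: Double     // log-probability of the sampled token
    let argmaxIsEOG: Bool   // the raw distribution's most-likely token is end-of-generation
}

/// Typed output in the spirit of the summary: text, mean per-token logprob, withheld flag.
struct GenerationOutput {
    let text: String
    let meanLogprob: Double?
    let withheldForLowConfidence: Bool
}

func decode(steps: [DecodeStep], confidenceFloor: Double, stopAtArgmaxEOG: Bool = true) -> GenerationOutput {
    var text = ""
    var sumLogprob = 0.0
    var tokenCount = 0
    for step in steps {
        // Stop before appending, so the discarded token affects neither the text nor the mean.
        if stopAtArgmaxEOG && step.argmaxIsEOG { break }
        text += step.piece
        sumLogprob += step.logprob
        tokenCount += 1
    }
    guard tokenCount > 0 else {
        return GenerationOutput(text: "", meanLogprob: nil, withheldForLowConfidence: false)
    }
    let mean = sumLogprob / Double(tokenCount)
    // No mean is below -infinity, so that floor disables the gate entirely.
    if mean < confidenceFloor {
        // Assumption: a withheld completion carries no text, only its attribution.
        return GenerationOutput(text: "", meanLogprob: mean, withheldForLowConfidence: true)
    }
    return GenerationOutput(text: text, meanLogprob: mean, withheldForLowConfidence: false)
}
```

With the stop enabled, a tail drawn past the model's intended stop never reaches the text or the average; with it disabled, the same tail can drag the mean under the floor and the completion is withheld as low confidence instead.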

Validation

xcodebuild build-for-testing ... CODE_SIGNING_ALLOWED=NO   # ** TEST BUILD SUCCEEDED **
xcodebuild test-without-building ...                       # FULL suite: 1128 tests, 1123 passed, 0 failed, 5 skipped (gated evals)
swiftlint lint --quiet <changed files>                     # exit 0
xcodegen generate                                          # pbxproj additions only

Confidence-floor sweep (eval, Gemma E2B Q6_K, 117 cases, fixed seed; floor read at runtime via the override key so the sweep is rebuild-free):

| floor | qualityScore | precisionWhenShown | coverage | wrongShowRate | mustShow misses |
| --- | --- | --- | --- | --- | --- |
| off / -4 / -3 / -2.5 / -2 | 0.734 | 0.780 | 0.866 | 0.188 | 0 |
| -1.5 (shipped) | 0.744 | 0.820 | 0.794 | 0.137 | 0 |
| -1.0 | 0.624 | 0.849 | 0.474 | 0.068 | 0 |
| -0.5 | 0.459 | 0.857 | 0.124 | 0.017 | 0 |
| -0.3 | 0.407 | 1.000 | 0.031 | 0.000 | 0 |

Floors at or below -2 never fire on this model at temperature 0.1 (the model is confident even about garbage); -1.5 is the unique point where the composite rose while wrong-shows fell 27% relative and no must-show case was lost. A -0.05 sanity floor suppressed 117/117, proving the wiring end to end. Latency at -1.5: p50 187→200ms, p95 1036→1124ms (the logprob passes; engine-side cost is two O(vocab) scans per token).
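
For reference, the relative and latency figures quoted above follow directly from the sweep numbers (wrongShowRate 0.188 with the floor off, 0.137 at -1.5):

```latex
\[
\frac{0.188 - 0.137}{0.188} = \frac{0.051}{0.188} \approx 0.271 \quad (\text{about } 27\%\ \text{relative}),
\qquad
\Delta p50 = 200 - 187 = 13\ \text{ms},
\qquad
\Delta p95 = 1124 - 1036 = 88\ \text{ms}.
\]
```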

Linked issues

Refs #486 (quality/context usage now visible in the Performance window), #546 (adaptive debounce stops thrashing slow machines).

Risk / rollout notes

  • LlamaRuntimeGenerating.generate now returns LlamaGenerationOutput instead of String (internal protocol; both conformers and the test fake updated).
  • The floor turns per-token logprob computation back on for the llama path. If field reports blame it, defaults write com.jacobfu.tabby cotabbyConfidenceFloorOverride -float -1e9 restores the old posture without a build; telemetry's lowConfidence count shows exactly how often the gate fires in real use.
  • Quality counters write one small UserDefaults blob per suggestion event; no content, no timestamps beyond a first-recorded date.
  • New SuggestionQualityMetricsStore carries nonisolated deinit {} for the known app-hosted-test double-free; without it the store's tests crash with SIGABRT in full-suite runs (reproduced, fixed, full suite green).
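
A rough sketch of how a rebuild-free override with an injectable defaults seam might look. The key name and the -1.5 default come from the text above; `DecodeGateDefaults` and `resolvedConfidenceFloor` are hypothetical names, not the project's actual API.

```swift
import Foundation

/// Hypothetical resolver: only the key name and the shipped default are taken from the PR text.
enum DecodeGateDefaults {
    static let floorKey = "cotabbyConfidenceFloorOverride"
    static let shippedFloor = -1.5

    /// Accepts any UserDefaults so tests can pass an isolated suite instead of `.standard`.
    static func resolvedConfidenceFloor(defaults: UserDefaults = .standard) -> Double {
        // object(forKey:) distinguishes "never set" from an explicit numeric override.
        guard defaults.object(forKey: floorKey) != nil else { return shippedFloor }
        return defaults.double(forKey: floorKey)
    }
}

// Usage against an isolated suite, so nothing leaks into process-global defaults.
let sketchSuiteName = "decode-gate-sketch-\(UUID().uuidString)"
let sketchDefaults = UserDefaults(suiteName: sketchSuiteName)!
let floorBeforeOverride = DecodeGateDefaults.resolvedConfidenceFloor(defaults: sketchDefaults)
sketchDefaults.set(-1e9, forKey: DecodeGateDefaults.floorKey)
let floorAfterOverride = DecodeGateDefaults.resolvedConfidenceFloor(defaults: sketchDefaults)
sketchDefaults.removePersistentDomain(forName: sketchSuiteName)
```

A very negative override such as -1e9 sits below any realistic mean log-probability, so the gate effectively never fires.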

Greptile Summary

This PR adds four new features on top of #686: a confidence floor gate and argmax-EOG early-stop at the decode layer, always-on quality telemetry counters (generated / shown / withheld-by-reason / accepted), and a latency-keyed adaptive debounce (15/25/55 ms). It also adds surface-context conditioning for the base model and a host-font caret advance measurement that fixes the overlay slide accuracy.

  • Decode gates: LlamaRuntimeCore breaks the token loop before appending when argmax_is_eog; LlamaSuggestionEngine routes confidence-suppressed completions to a SuggestionNormalizationResult(.lowConfidence) so the coordinator attributes them correctly instead of reading as empty-model output. resolvedStopAtArgmaxEOG now accepts injectable UserDefaults, closing the testability gap noted in the prior review.
  • Quality telemetry: SuggestionQualityMetricsStore is a lean @MainActor ObservableObject persisting lifetime counters via a JSON UserDefaults blob; each coordinator gate records its own named suppression reason, and SuggestionEngineRouter.recordQualityOutcome is the single accounting point for engine-level outcomes.
  • Adaptive debounce + surface context: DebouncePolicy selects a debounce window from the last observed generation latency (fallback to configured value before first data); SurfaceContextComposer produces a sanitized writing-surface descriptor that suppresses itself for code editors and terminals.

Confidence Score: 5/5

Safe to merge; no new defects introduced on any path changed by this PR.

All four features are cleanly isolated, covered by targeted tests, and validated against a full 1128-test suite. The argmax-EOG token is correctly excluded from both the generated text and the sumLogprob accumulation before the confidence average is computed. The resolvedStopAtArgmaxEOG injection gap called out in the previous review is fixed. The suppression accounting is structurally sound for results that pass through only one accounting layer; concerns about results that simultaneously hit an engine-level and a lifecycle gate were covered in the prior review and are pre-existing rather than newly introduced here.

No files require special attention; all changes are internally consistent and well-tested.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Models/SuggestionQualityMetricsStore.swift | New always-on quality counter store (generated/shown/suppressed-by-reason/accepted), persisted via UserDefaults JSON; correct @MainActor isolation with the standard nonisolated-deinit workaround. |
| Cotabby/Services/Runtime/LlamaSuggestionEngine.swift | Adds confidence floor and argmax-EOG stop wiring; resolvedStopAtArgmaxEOG now accepts injectable UserDefaults (fixes the testability gap flagged in the previous review), and LlamaGenerationOutput carries suppression attribution correctly. |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Decode loop gains the argmax-EOG early stop (break before extractPiece so the discarded token is excluded from both generatedText and sumLogprob) and returns LlamaGenerationOutput with averaged logprob and confidence-suppressed flag. |
| Cotabby/App/Coordinators/SuggestionCoordinator+Prediction.swift | All coordinator early-exit gates now record their own suppression reasons; the empty-result gate guards against double-counting engine-attributed reasons; recordShown placed correctly at the single path where a generation becomes visible. |
| Cotabby/App/Coordinators/SuggestionCoordinator+Acceptance.swift | Acceptance path records accepted suggestion on first chunk; InsertedTextAdvance provides host-font caret measurement for overlay sliding; predictedCaretRect now takes fieldStyle for improved accuracy. |
| Cotabby/Services/Runtime/SuggestionEngineRouter.swift | Single recordQualityOutcome method handles generated + engine-attributed suppression accounting for all engine paths; qualityMetricsStore correctly injected. |
| Cotabby/Support/DebouncePolicy.swift | Simple latency-keyed debounce (15/25/55ms), nonisolated enum, clean guard on nil/zero latency; the configured value is explicitly the pre-data fallback per the documented design. |
| Cotabby/Support/SurfaceContextComposer.swift | New surface-context composer with clear omission-beats-noise invariants; sanitizedTitle suffix-stripping uses .anchored+.backwards correctly; domain extraction strips path/query/www; no prompt injection via controlChar+quote filtering. |
| Cotabby/Support/AppSurfaceClassifier.swift | Clean bundle-ID classifier; terminal check runs on the original case before lowercasing for prefix tables; integrated-terminal flag takes precedence per the comment. |
| Cotabby/Support/InsertedTextAdvance.swift | Host-font caret advance measurement; correct nil guards; falls back to system font with host point size when face name is unavailable, which is better than the fixed 14pt ghost-font fallback it replaces. |
| Cotabby/Services/Focus/SurfaceContextCache.swift | Session-scoped AX capture cache keyed on PID:elementID:focusSeq; @MainActor isolation with nonisolated-deinit workaround; caches negative results to avoid re-probing unresponsive hosts. |
| Cotabby/UI/Settings/Panes/PerformancePaneView.swift | New suggestion quality section with counters, acceptance rate, top-4 suppression reasons, and a Reset control; raw enum key display was flagged in the prior outside-diff comment. |
| Cotabby/Support/SuggestionOverlayStabilityGate.swift | isAwaitingPostInsertionSync short-circuit correctly placed after the text-change check but before geometry checks, preventing the pre-publish stale caret from moving the overlay. |
| CotabbyTests/SuggestionQualityMetricsStoreTests.swift | Good coverage: accumulation, acceptance rate, cross-instance persistence via isolated UserDefaults suite, and reset. Avoids process-global state. |
| CotabbyTests/LlamaDecodeGateDefaultsTests.swift | Covers the defaults-write escape hatches for both confidence floor and argmax-stop via isolated UserDefaults suite; resolvedStopAtArgmaxEOG is now testable because the previous injection gap was fixed. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Engine Result] --> R[Router: recordGenerated]
    R --> RS{suppressionReason?}
    RS -- yes --> RSR[recordSuppressed engine reason]
    RS -- no --> C[Coordinator]
    RSR --> C
    C --> SD{Stale drop?}
    SD -- yes --> SDR[recordSuppressed discardedStaleContext]
    SDR --> END1([Drop])
    SD -- no --> ER{result.text empty?}
    ER -- yes, no engine reason --> ERR[recordSuppressed emptyUnattributed]
    ERR --> END2([Drop])
    ER -- yes, has engine reason --> END2
    ER -- no --> ST{Selected text?}
    ST -- yes --> STR[recordSuppressed discardedSelection]
    STR --> END3([Drop])
    ST -- no --> AE{Stale accept echo?}
    AE -- yes --> AER[recordSuppressed discardedAcceptEcho]
    AER --> END4([Drop])
    AE -- no --> SG{Seam guard?}
    SG -- yes --> SGR[recordSuppressed seamMisspelling or seamJunkPunctuationRun]
    SGR --> END5([Drop])
    SG -- no --> SH[recordShown]
    SH --> SHOW([Show overlay])
    SHOW --> ACC{First Tab accept?}
    ACC -- yes, consumedCount == 0 --> ACCR[recordAcceptedSuggestion]
    ACC -- no --> SKIP([No further count])

Comments Outside Diff (1)

  1. Cotabby/UI/Settings/Panes/PerformancePaneView.swift, line 902-908 (link)

    P2 Raw enum key names surfaced verbatim in the UI

    topSuppressionReasons maps the raw suppressedByReason dictionary keys directly to the displayed string. Those keys are CompletionSuppressionReason raw values ("lowConfidence", "emptyGeneration", "seamMisspelling") plus coordinator-level strings ("seamJunkPunctuationRun"). A non-developer user reading the Performance pane would see "lowConfidence 12, seamJunkPunctuationRun 3" with no indication of what those mean. A small display-name map (or at least title-casing / spacing) would make the section usable outside of debugging contexts.
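
One possible shape for such a display-name map, as a sketch only. The raw keys are the ones quoted in this comment; the labels, `suppressionDisplayNames`, and `displayName(forSuppressionKey:)` are hypothetical.

```swift
/// Hypothetical labels for the raw keys quoted in the comment above.
let suppressionDisplayNames: [String: String] = [
    "lowConfidence": "Low confidence",
    "emptyGeneration": "Empty generation",
    "seamMisspelling": "Seam misspelling",
    "seamJunkPunctuationRun": "Junk punctuation at seam",
]

/// Returns a mapped label, or spaces out the camelCase key so unmapped reasons stay readable.
func displayName(forSuppressionKey key: String) -> String {
    if let mapped = suppressionDisplayNames[key] { return mapped }
    var spaced = ""
    for character in key {
        // Insert a space before each interior uppercase letter, then lowercase it.
        if character.isUppercase && !spaced.isEmpty { spaced.append(" ") }
        spaced.append(contentsOf: character.lowercased())
    }
    return spaced.prefix(1).uppercased() + String(spaced.dropFirst())
}
```

The fallback means a newly added reason still renders as words even before anyone writes a label for it.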


Reviews (4): Last reviewed commit: "ci: retrigger checks after force-push sy..."

Comment thread Cotabby/Services/Runtime/SuggestionEngineRouter.swift
Comment thread Cotabby/Services/Runtime/LlamaSuggestionEngine.swift Outdated
@FuJacob FuJacob force-pushed the quality/context-relevance branch from b95e651 to d2ea9a1 Compare June 12, 2026 03:22
@FuJacob FuJacob force-pushed the quality/decode-gates-and-telemetry branch from c92d558 to a0b89ef Compare June 12, 2026 03:22
@FuJacob FuJacob force-pushed the quality/context-relevance branch 2 times, most recently from c351bdf to 5623a71 Compare June 12, 2026 03:24
@FuJacob FuJacob force-pushed the quality/decode-gates-and-telemetry branch from a0b89ef to 39d5f2d Compare June 12, 2026 03:27
Comment on lines 563 to +570
}
}

/// Marks the session's suggestion accepted in the quality counters, once per suggestion: only
/// the first chunk counts, so word-by-word walks of one suggestion add nothing further and the
/// acceptance rate stays suggestions-accepted over suggestions-shown.
private func recordSuggestionAcceptedIfFirstChunk(of session: ActiveSuggestionSession) {
    guard session.consumedCharacterCount == 0 else { return }

Contributor


P1 Correction acceptances inflate acceptedSuggestions without a matching shown increment

recordSuggestionAcceptedIfFirstChunk only guards on consumedCharacterCount == 0; it fires for any session kind, including .correction(typoWord:). presentCorrection never calls recordShown(), so every accepted correction increments acceptedSuggestions without a corresponding shown entry. Since acceptanceRate = acceptedSuggestions / shown, a session where the user accepts even one correction while the shown count is in single digits produces a rate above 100%, making the metric actively misleading.

Add guard case .continuation = session.kind else { return } before the recordAcceptedSuggestion() call to restrict the counter to generated completions only.
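
A minimal sketch of the suggested guard, using hypothetical stand-ins (`SessionKind`, `Session`, `QualityCounters`) for the real session and store types, whose definitions are not shown in this thread:

```swift
/// Hypothetical stand-ins for the session and counter types named in the review.
enum SessionKind {
    case continuation
    case correction(typoWord: String)
}

struct Session {
    let kind: SessionKind
    var consumedCharacterCount: Int
}

final class QualityCounters {
    private(set) var shown = 0
    private(set) var acceptedSuggestions = 0
    func recordShown() { shown += 1 }
    func recordAcceptedSuggestion() { acceptedSuggestions += 1 }
    /// nil until something has been shown, so the rate is never computed over zero.
    var acceptanceRate: Double? {
        shown == 0 ? nil : Double(acceptedSuggestions) / Double(shown)
    }
}

func recordAcceptedIfFirstChunk(of session: Session, in counters: QualityCounters) {
    // Corrections never pass through recordShown, so they must not count as accepted either.
    guard case .continuation = session.kind else { return }
    // Only the first chunk counts; word-by-word walks of one suggestion add nothing further.
    guard session.consumedCharacterCount == 0 else { return }
    counters.recordAcceptedSuggestion()
}
```

With both guards in place, the numerator can only grow for sessions that also contributed to the denominator.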


@FuJacob FuJacob force-pushed the quality/decode-gates-and-telemetry branch from 39d5f2d to c1e408f Compare June 12, 2026 03:36
Comment on lines 503 to +507
guard liveContext.generation == result.generation else {

latestRawModelOutput = SuggestionDebugLogger.debugPreview(result.rawText)
// Lifecycle discards are counted under their own reasons so `generated` always equals
// `shown` plus the suppression histogram; without this, every drop here silently

Contributor


P1 Double-counted suppressions when engine suppression meets coordinator discard

The stale-drop, selected-text, and stale-accept-echo gates unconditionally call recordSuppressed, but the router already called recordSuppressed(reason: result.suppressionReason) for the same result whenever the engine attributed a suppression (e.g., "lowConfidence" or a normalizer reason). Any result that was engine-suppressed AND then discarded by a coordinator gate (common during fast typing at ~187 ms p50 latency, where ~21% of results hit the confidence floor and stale-drop fires frequently) will increment suppressedTotal twice, breaking the invariant stated in the adjacent comment ("generated always equals shown plus the suppression histogram"). The empty-result gate already guards with if result.suppressionReason == nil — the three lifecycle-discard gates need the same guard.
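
The invariant and the proposed guard can be illustrated with a small ledger. This is a sketch under assumptions: `QualityLedger`, `EngineResult`, `routerAccount`, and `coordinatorDiscard` are hypothetical names, and the real store is a `@MainActor` class rather than a value type.

```swift
/// Hypothetical value-type ledger mirroring the counters discussed above.
struct QualityLedger {
    private(set) var generated = 0
    private(set) var shown = 0
    private(set) var suppressedByReason: [String: Int] = [:]

    var suppressedTotal: Int { suppressedByReason.values.reduce(0, +) }
    /// The invariant from the code comment: generated == shown + suppression histogram.
    var isBalanced: Bool { generated == shown + suppressedTotal }

    mutating func recordGenerated() { generated += 1 }
    mutating func recordShown() { shown += 1 }
    mutating func recordSuppressed(reason: String) { suppressedByReason[reason, default: 0] += 1 }
}

struct EngineResult {
    let text: String
    let suppressionReason: String?
}

/// Router step: every finished result is counted once, plus its engine-attributed reason if any.
func routerAccount(_ result: EngineResult, in ledger: inout QualityLedger) {
    ledger.recordGenerated()
    if let reason = result.suppressionReason { ledger.recordSuppressed(reason: reason) }
}

/// Coordinator lifecycle discard: records only when the engine has not already attributed the drop.
func coordinatorDiscard(_ result: EngineResult, reason: String, in ledger: inout QualityLedger) {
    guard result.suppressionReason == nil else { return }
    ledger.recordSuppressed(reason: reason)
}
```

Without the guard in `coordinatorDiscard`, an engine-suppressed result that is later dropped as stale would add two histogram entries for one generation.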



FuJacob commented Jun 12, 2026

Owner Author

Merging on local CI-equivalent validation: GitHub Actions did not register runs for this branch's pushes (repo Actions is otherwise healthy; sibling PRs triggered normally), so the full suite was run locally on the exact head tree: build-for-testing succeeded, 1542 tests with 0 failures, SwiftLint clean, xcodegen drift-free. Review fixes included: the quality ledger now balances at every lifecycle exit, and the argmax toggle gained an injectable-defaults seam with tests.

@FuJacob FuJacob closed this Jun 12, 2026
@FuJacob FuJacob reopened this Jun 12, 2026
FuJacob added 5 commits June 11, 2026 20:57
The runtime now returns a typed output (text, average logprob, withheld flag)
instead of a bare string, so a confidence-suppressed completion is attributed
as lowConfidence rather than reading as 'the model produced nothing'. The
shipped floor of -1.5 mean per-token log-probability came from an eval sweep
over nine values: floors at or below -2 never fire on this model at
temperature 0.1, -1 and tighter buy precision at a brutal coverage cost, and
-1.5 is the unique point where the composite quality score rose (0.734 to
0.744), wrong-shows fell 27% relative (0.188 to 0.137), and zero must-show
cases were lost. Enabling the floor turns on per-token logprob computation
(eval p50 187ms to 200ms); cotabbyConfidenceFloorOverride adjusts it without
a rebuild, and -infinity restores the old posture entirely.

The decode loop also stops the moment the raw distribution's most-likely next
token is end-of-generation (computed by the engine while the logits row is
hot). At temperature 0.1 sampling is near-greedy so the eval shows no delta;
the stop exists for the sampling tail where the dist sampler draws past the
model's intended stop. cotabbyArgmaxStopDisabled switches it off.
…ce pane readout

Local lifetime counters answering 'is quality improving for real use':
generations, suggestions shown, why withheld ones were withheld (reason
histogram spanning the normalizer, the confidence floor, and the seam guard),
and how many shown suggestions were accepted (counted once per suggestion, so
word-by-word walks do not inflate the rate). The router counts generation
outcomes because it is the single point every finished result passes through;
the coordinator records the display-time and acceptance events only it can
see. Counters carry no content, so unlike the per-request latency log there
is no opt-in gate; the Performance pane shows the counts, acceptance rate,
and top withhold reasons with a reset control, indexed for settings search.
Acceptance rate over the suppression histogram is the on-device ground truth
that decides whether future decode changes actually help.
…n latency

A fixed debounce serves two masters badly: on fast hardware it adds avoidable
delay before every suggestion, and on slow hardware it lets keystrokes pile
doomed generations onto a model that cannot keep up (every cancel still costs
decode setup and teardown). The debounce now keys off the most recent
generation latency: 15ms when the model answers within 70ms, 25ms within
140ms, 55ms beyond that, with the configured value as the fallback until a
first latency exists.
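
The tiers described in this commit message could be sketched as follows. `DebounceSketch` and its function name are hypothetical, and treating "within 70ms" and "within 140ms" as inclusive bounds is an assumption.

```swift
/// Hypothetical latency-keyed debounce; tier values and thresholds come from the commit message.
enum DebounceSketch {
    /// Returns the debounce window in milliseconds for the most recent generation latency.
    static func debounceMilliseconds(lastLatencyMs: Double?, configuredMs: Int) -> Int {
        // No usable latency yet: fall back to the configured value.
        guard let latency = lastLatencyMs, latency > 0 else { return configuredMs }
        if latency <= 70 { return 15 }    // fast hardware: minimal added delay
        if latency <= 140 { return 25 }
        return 55                         // slow hardware: let keystrokes settle first
    }
}
```

For example, the 187ms p50 measured in the sweep would land in the 55ms tier, while a 60ms generation would get 15ms.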
…max toggle

The stale-drop, unattributed-empty, selected-text, and accept-echo exits
returned without recording either shown or suppressed, so the generated
counter silently outgrew the others; each now records a lifecycle
discard reason (engine-attributed empties stay counted by the router
alone). The argmax-EOG toggle now mirrors the confidence floor's
injectable-defaults seam, with tests covering both escape hatches
against an isolated suite.
@FuJacob FuJacob force-pushed the quality/decode-gates-and-telemetry branch from 8f36e25 to 98ca097 Compare June 12, 2026 03:57
@FuJacob FuJacob changed the base branch from quality/context-relevance to main June 12, 2026 04:59
@FuJacob FuJacob merged commit f2de07d into main Jun 12, 2026