Decode gates, quality telemetry, adaptive debounce#688
Conversation
b95e651 to
d2ea9a1
Compare
c92d558 to
a0b89ef
Compare
c351bdf to
5623a71
Compare
a0b89ef to
39d5f2d
Compare
| } | ||
| } | ||
|
|
||
| /// Marks the session's suggestion accepted in the quality counters, once per suggestion: only | ||
| /// the first chunk counts, so word-by-word walks of one suggestion add nothing further and the | ||
| /// acceptance rate stays suggestions-accepted over suggestions-shown. | ||
| private func recordSuggestionAcceptedIfFirstChunk(of session: ActiveSuggestionSession) { | ||
| guard session.consumedCharacterCount == 0 else { return } |
There was a problem hiding this comment.
Correction acceptances inflate
acceptedSuggestions without a matching shown increment
recordSuggestionAcceptedIfFirstChunk only guards on consumedCharacterCount == 0; it fires for any session kind, including .correction(typoWord:). presentCorrection never calls recordShown(), so every accepted correction increments acceptedSuggestions without a corresponding shown entry. Since acceptanceRate = acceptedSuggestions / shown, a session where the user accepts even one correction while the shown count is in single digits produces a rate above 100%, making the metric actively misleading.
Add guard case .continuation = session.kind else { return } before the recordAcceptedSuggestion() call to restrict the counter to generated completions only.
39d5f2d to
c1e408f
Compare
| guard liveContext.generation == result.generation else { | ||
|
|
||
| latestRawModelOutput = SuggestionDebugLogger.debugPreview(result.rawText) | ||
| // Lifecycle discards are counted under their own reasons so `generated` always equals | ||
| // `shown` plus the suppression histogram; without this, every drop here silently |
There was a problem hiding this comment.
Double-counted suppressions when engine suppression meets coordinator discard
The stale-drop, selected-text, and stale-accept-echo gates unconditionally call recordSuppressed, but the router already called recordSuppressed(reason: result.suppressionReason) for the same result whenever the engine attributed a suppression (e.g., "lowConfidence" or a normalizer reason). Any result that was engine-suppressed AND then discarded by a coordinator gate (common during fast typing at ~187 ms p50 latency, where ~21% of results hit the confidence floor and stale-drop fires frequently) will increment suppressedTotal twice, breaking the invariant stated in the adjacent comment ("generated always equals shown plus the suppression histogram"). The empty-result gate already guards with if result.suppressionReason == nil — the three lifecycle-discard gates need the same guard.
|
Merging on local CI-equivalent validation: GitHub Actions did not register runs for this branch's pushes (repo Actions is otherwise healthy; sibling PRs triggered normally), so the full suite was run locally on the exact head tree: build-for-testing succeeded, 1542 tests with 0 failures, SwiftLint clean, xcodegen drift-free. Review fixes included: the quality ledger now balances at every lifecycle exit, and the argmax toggle gained an injectable-defaults seam with tests. |
The runtime now returns a typed output (text, average logprob, withheld flag) instead of a bare string, so a confidence-suppressed completion is attributed as lowConfidence rather than reading as 'the model produced nothing'. The shipped floor of -1.5 mean per-token log-probability came from an eval sweep over nine values: floors at or below -2 never fire on this model at temperature 0.1, -1 and tighter buy precision at a brutal coverage cost, and -1.5 is the unique point where the composite quality score rose (0.734 to 0.744), wrong-shows fell 27% relative (0.188 to 0.137), and zero must-show cases were lost. Enabling the floor turns on per-token logprob computation (eval p50 187ms to 200ms); cotabbyConfidenceFloorOverride adjusts it without a rebuild, and -infinity restores the old posture entirely. The decode loop also stops the moment the raw distribution's most-likely next token is end-of-generation (computed by the engine while the logits row is hot). At temperature 0.1 sampling is near-greedy so the eval shows no delta; the stop exists for the sampling tail where the dist sampler draws past the model's intended stop. cotabbyArgmaxStopDisabled switches it off.
…ce pane readout Local lifetime counters answering 'is quality improving for real use': generations, suggestions shown, why withheld ones were withheld (reason histogram spanning the normalizer, the confidence floor, and the seam guard), and how many shown suggestions were accepted (counted once per suggestion, so word-by-word walks do not inflate the rate). The router counts generation outcomes because it is the single point every finished result passes through; the coordinator records the display-time and acceptance events only it can see. Counters carry no content, so unlike the per-request latency log there is no opt-in gate; the Performance pane shows the counts, acceptance rate, and top withhold reasons with a reset control, indexed for settings search. Acceptance rate over the suppression histogram is the on-device ground truth that decides whether future decode changes actually help.
…n latency A fixed debounce serves two masters badly: on fast hardware it adds avoidable delay before every suggestion, and on slow hardware it lets keystrokes pile doomed generations onto a model that cannot keep up (every cancel still costs decode setup and teardown). The debounce now keys off the most recent generation latency: 15ms when the model answers within 70ms, 25ms within 140ms, 55ms beyond that, with the configured value as the fallback until a first latency exists.
…max toggle The stale-drop, unattributed-empty, selected-text, and accept-echo exits returned without recording either shown or suppressed, so the generated counter silently outgrew the others; each now records a lifecycle discard reason (engine-attributed empties stay counted by the router alone). The argmax-EOG toggle now mirrors the confidence floor's injectable-defaults seam, with tests covering both escape hatches against an isolated suite.
8f36e25 to
98ca097
Compare
Summary
Stacked on #686. The decode-side "showing nothing beats showing garbage" gates, the local counters that prove whether they help, and a latency-keyed debounce. Consumes the engine's new decode-quality primitives (cotabbyinference#10, on middleware
main).lowConfidenceinstead of "the model produced nothing". The -1.5 default came from a nine-point eval sweep (table below). Enabling the floor turns on the per-token logprob work the energy pass gated off; that cost is priced and bounded (p50 +13ms), andcotabbyConfidenceFloorOverrideadjusts or disables (-inf) without a rebuild.cotabbyArgmaxStopDisabledswitches it off.Validation
Confidence-floor sweep (eval, Gemma E2B Q6_K, 117 cases, fixed seed; floor read at runtime via the override key so the sweep is rebuild-free):
Floors at or below -2 never fire on this model at temperature 0.1 (the model is confident even about garbage); -1.5 is the unique point where the composite rose while wrong-shows fell 27% relative and no must-show case was lost. A -0.05 sanity floor suppressed 117/117, proving the wiring end to end. Latency at -1.5: p50 187→200ms, p95 1036→1124ms (the logprob passes; engine-side cost is two O(vocab) scans per token).
Linked issues
Refs #486 (quality/context usage now visible in the Performance window), #546 (adaptive debounce stops thrashing slow machines).
Risk / rollout notes
LlamaRuntimeGenerating.generatenow returnsLlamaGenerationOutputinstead ofString(internal protocol; both conformers and the test fake updated).defaults write com.jacobfu.tabby cotabbyConfidenceFloorOverride -float -1e9restores the old posture without a build; telemetry'slowConfidencecount shows exactly how often the gate fires in real use.SuggestionQualityMetricsStorecarriesnonisolated deinit {}for the known app-hosted-test double-free; without it the store's tests crash with SIGABRT in full-suite runs (reproduced, fixed, full suite green).Greptile Summary
This PR adds four new features on top of #686: a confidence floor gate and argmax-EOG early-stop at the decode layer, always-on quality telemetry counters (generated / shown / withheld-by-reason / accepted), and a latency-keyed adaptive debounce (15/25/55 ms). It also adds surface-context conditioning for the base model and a host-font caret advance measurement that fixes the overlay slide accuracy.
LlamaRuntimeCorebreaks the token loop before appending whenargmax_is_eog;LlamaSuggestionEngineroutes confidence-suppressed completions to aSuggestionNormalizationResult(.lowConfidence)so the coordinator attributes them correctly instead of reading as empty-model output.resolvedStopAtArgmaxEOGnow accepts injectableUserDefaults, closing the testability gap noted in the prior review.SuggestionQualityMetricsStoreis a lean@MainActorObservableObjectpersisting lifetime counters via a JSON UserDefaults blob; each coordinator gate records its own named suppression reason, andSuggestionEngineRouter.recordQualityOutcomeis the single accounting point for engine-level outcomes.DebouncePolicyselects a debounce window from the last observed generation latency (fallback to configured value before first data);SurfaceContextComposerproduces a sanitized writing-surface descriptor that suppresses itself for code editors and terminals.Confidence Score: 5/5
Safe to merge; no new defects introduced on any path changed by this PR.
All four features are cleanly isolated, covered by targeted tests, and validated against a full 1128-test suite. The argmax-EOG token is correctly excluded from both the generated text and the sumLogprob accumulation before the confidence average is computed. The resolvedStopAtArgmaxEOG injection gap called out in the previous review is fixed. The suppression accounting is structurally sound for results that pass through only one accounting layer; concerns about results that simultaneously hit an engine-level and a lifecycle gate were covered in the prior review and are pre-existing rather than newly introduced here.
No files require special attention; all changes are internally consistent and well-tested.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[Engine Result] --> R[Router: recordGenerated] R --> RS{suppressionReason?} RS -- yes --> RSR[recordSuppressed engine reason] RS -- no --> C[Coordinator] RSR --> C C --> SD{Stale drop?} SD -- yes --> SDR[recordSuppressed discardedStaleContext] SDR --> END1([Drop]) SD -- no --> ER{result.text empty?} ER -- yes, no engine reason --> ERR[recordSuppressed emptyUnattributed] ERR --> END2([Drop]) ER -- yes, has engine reason --> END2 ER -- no --> ST{Selected text?} ST -- yes --> STR[recordSuppressed discardedSelection] STR --> END3([Drop]) ST -- no --> AE{Stale accept echo?} AE -- yes --> AER[recordSuppressed discardedAcceptEcho] AER --> END4([Drop]) AE -- no --> SG{Seam guard?} SG -- yes --> SGR[recordSuppressed seamMisspelling or seamJunkPunctuationRun] SGR --> END5([Drop]) SG -- no --> SH[recordShown] SH --> SHOW([Show overlay]) SHOW --> ACC{First Tab accept?} ACC -- yes, consumedCount == 0 --> ACCR[recordAcceptedSuggestion] ACC -- no --> ACCRComments Outside Diff (1)
Cotabby/UI/Settings/Panes/PerformancePaneView.swift, line 902-908 (link)topSuppressionReasonsmaps the rawsuppressedByReasondictionary keys directly to the displayed string. Those keys areCompletionSuppressionReasonraw values ("lowConfidence","emptyGeneration","seamMisspelling") plus coordinator-level strings ("seamJunkPunctuationRun"). A non-developer user reading the Performance pane would see"lowConfidence 12, seamJunkPunctuationRun 3"with no indication of what those mean. A small display-name map (or at least title-casing / spacing) would make the section usable outside of debugging contexts.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Reviews (4): Last reviewed commit: "ci: retrigger checks after force-push sy..." | Re-trigger Greptile