Context relevance: surface conditioning, token-true budgeting, wider prefix, OCR correction by FuJacob · Pull Request #686 · FuJacob/cotabby

FuJacob · 2026-06-12T02:04:24Z

Summary

Stacked on #683 (the eval harness measures everything here). Three changes to what the model sees:

Surface conditioning (the headline). The llama prompt previously carried zero situational context: on Mail, Slack, or Google Docs the base model had no idea what surface it was continuing, so completions read generic. The preface now states the surface class (An email being written in Mail.), the sanitized window title (the email subject / document name / channel is the highest-signal cue Accessibility offers), the web domain, and the field placeholder. Code editors and terminals are deliberately excluded (app metadata biases small base models toward code/numbers exactly where the text already makes that obvious), as are anonymous generic apps. The FM prompt states the same sanitized facts. Capture is one AX read per field session, cached and frozen for the session so the prompt bytes ahead of the prefix stay byte-stable and llama KV prefix reuse keeps absorbing them. Secure fields are never probed. New "Include App Context" toggle, default on, settings-search indexed; everything stays on device.
Token-true budgeting + FM-parity prefix window. The token-aware allocator existed with zero callers; it is now the shipped path (budget = 2048-token per-sequence KV capacity minus output ceiling and margin, per-section char caps retained as a second bound). With that in place, the llama prefix window rises from 1000 chars / 50 words to 2500 / 150. The old cap predates KV prefix reuse: the larger window's prefill is paid once per focused field, not per keystroke.
Vision language correction on for visual-context OCR (once per field, off the hot path; garbled recognitions die at the source instead of relying on post-hoc filters).

Bundle classification moves to a shared AppSurfaceClassifier so the FM tone hints and the new surface preface can never disagree (FM behavior unchanged, pinned by tests).
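
As a rough illustration of the shared-classifier idea, here is a minimal sketch. The enum cases, the prefix table, and the function names are assumptions made for this example; the PR does not show the real `AppSurfaceClassifier` source. It case-folds the bundle identifier before prefix matching and suppresses the preface for code editors, terminals, and anonymous generic apps.

```swift
import Foundation

// Hypothetical surface classes; the real enum in AppSurfaceClassifier is not shown in the PR.
enum SurfaceClass: Equatable {
    case email, chat, document, browser, codeEditor, terminal, generic
}

// Illustrative bundle-ID prefixes only, stored lowercased so they match the folded input.
let surfacePrefixTable: [(prefix: String, surface: SurfaceClass)] = [
    ("com.apple.mail", .email),
    ("com.tinyspeck.slackmacgap", .chat),
    ("com.apple.terminal", .terminal),
    ("com.microsoft.vscode", .codeEditor),
    ("com.apple.safari", .browser),
]

// Case-fold first, then prefix-match; unknown apps fall through to .generic.
func classifySurface(bundleID: String) -> SurfaceClass {
    let folded = bundleID.lowercased()
    for entry in surfacePrefixTable where folded.hasPrefix(entry.prefix) {
        return entry.surface
    }
    return .generic
}

// Code editors, terminals, and anonymous generic apps get no surface preface.
func shouldEmitSurfacePreface(for surface: SurfaceClass) -> Bool {
    switch surface {
    case .codeEditor, .terminal, .generic:
        return false
    case .email, .chat, .document, .browser:
        return true
    }
}
```

Because both prompt renderers would call the same function, a disagreement between the FM tone hint and the llama preface becomes impossible by construction rather than by convention.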

Validation

xcodebuild build-for-testing ... CODE_SIGNING_ALLOWED=NO   # ** TEST BUILD SUCCEEDED **, no new warnings
xcodebuild test-without-building ...                       # FULL suite: 1119 tests, 0 failures (5 gated evals skipped)
swiftlint lint --quiet <changed files>                     # exit 0
xcodegen generate                                          # pbxproj additions only

Eval (Gemma E2B Q6_K, fixed seed, same dataset as the #683 baseline):

114-case core: qualityScore 0.727 vs baseline 0.742, wrongShowRate 0.193 vs 0.184 — within single-sample noise. A per-case diff showed exactly 4 flips: 2 adverse single-draw flips, 1 improvement (a misspelled-splice now suppressed), and 1 case where the new output was strictly better but needed a reference-list addition ("grab lunch at " → "12:30?"). The dataset's prefixes are long and topical by construction, which is exactly where surface context adds least; its value case is short/ambiguous prefixes and on-topic vocabulary, which the new factory/renderer tests pin structurally.
New long-document cases (the wider window's purpose): 3/3 correctInsert with completions referencing content 1500+ characters before the caret (a chat reply that correctly recalls a checklist from the top of the scroll; an email that flags the right risk from two paragraphs up). Cold-start latency 344/382/734ms, inside the pre-existing p95 (1028ms). Overall p50 187ms vs 171ms baseline.

Linked issues

Refs #660 (Google Docs quality: browser surfaces now carry domain + document title context).

Risk / rollout notes

New persisted setting cotabbySurfaceContextEnabled (default on). Snapshot/data/store all migrated with write-back; fresh installs and upgrades both resolve to on.
Focus capture does 2-3 extra AX reads once per field session (window title, placeholder, URL for browsers), cached by the same pattern as the existing field-style read; the steady poll is untouched. The URL read previously ran per poll when per-site disable was on; it is now also cached per field session (navigation recreates the focused web element, so a stale URL cannot outlive its page).
Prompt-shape change invalidates KV prefix reuse once per field on first run after update, then amortizes as before.
Wider prefix raises first-suggestion prefill in long documents; measured at 344-734ms on the 2B model, and the token budget hard-bounds the prompt. If field reports say otherwise, the constants are in SuggestionConfiguration.standard.
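
The once-per-field-session capture described in these notes can be pictured as a single-entry cache keyed by the focus-change sequence. This is a sketch under assumptions: the type names, the `UInt64` key, and the closure-based reader are invented for illustration, and the real `SurfaceContextCache` may be shaped differently.

```swift
// Hypothetical value type; the real captured fields are only described in prose.
struct SurfaceContext: Equatable {
    var windowTitle: String?
    var placeholder: String?
    var domain: String?
}

// Single-entry cache: one frozen value per focus session.
final class SingleEntrySurfaceCache {
    private var entry: (key: UInt64, value: SurfaceContext)?
    private(set) var readCount = 0

    // `read` stands in for the Accessibility reads; it runs at most once per key.
    func context(forFocusSequence key: UInt64,
                 isSecure: Bool,
                 read: () -> SurfaceContext) -> SurfaceContext? {
        // Secure fields are never probed and never cached.
        if isSecure { return nil }
        if let cached = entry, cached.key == key {
            return cached.value
        }
        readCount += 1
        let value = read()
        entry = (key: key, value: value)
        return value
    }
}
```

In this sketch a window that retitles itself mid-session does not change the cached value, which is what keeps the prompt bytes ahead of the prefix stable for KV prefix reuse.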

Greptile Summary

This PR adds three coordinated improvements to what the local completion model sees: a surface-conditioning preface that tells the model which app, window title, domain, and field placeholder the user is writing in; a token-true prompt budget derived from the runtime's actual KV capacity; and a wider prefix window (1000 → 2500 chars) that amortises its prefill cost via KV prefix reuse.

Surface conditioning (AppSurfaceClassifier, SurfaceContextComposer, SurfaceContextCache) introduces a shared classifier so the llama preface and the FM tone hint always agree about the current app class, with code editors and terminals explicitly suppressed to avoid biasing base models. Metadata is captured once per field session, frozen for its lifetime to keep prompt bytes byte-stable, and never probed on secure fields.
Token-aware budgeting (SuggestionModels, SuggestionRequestFactory) wires up the previously zero-caller tokenBudget path by deriving the budget from LlamaRuntimeConfiguration.default.contextWindowTokens at compile time, preventing the budget from silently drifting if the context window constant changes.
OCR language correction (ScreenTextExtractor) enables usesLanguageCorrection = true for the once-per-field visual context capture, cutting garbled recognitions at the source rather than relying on downstream filters.
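
The budget derivation is simple arithmetic, sketched below. The 2048-token window and the resulting 1934-token budget both appear in this PR's notes, which implies the output ceiling and safety margin sum to 114 tokens. The individual split used here (64 + 50) and all identifier names are assumptions for illustration only.

```swift
// Stand-in for LlamaRuntimeConfiguration.default; only the 2048 figure comes from the PR.
struct RuntimeConfig {
    static let contextWindowTokens = 2048
}

// Hypothetical split: the PR implies ceiling + margin = 114 (2048 - 1934), not each value.
let outputCeilingTokens = 64
let safetyMarginTokens = 50

// Prompt budget = per-sequence context window minus output ceiling minus safety margin.
func promptTokenBudget(contextWindow: Int, outputCeiling: Int, safetyMargin: Int) -> Int {
    max(0, contextWindow - outputCeiling - safetyMargin)
}

let promptBudget = promptTokenBudget(
    contextWindow: RuntimeConfig.contextWindowTokens,
    outputCeiling: outputCeilingTokens,
    safetyMargin: safetyMarginTokens
)
```

Deriving the value from the runtime constant, rather than hard-coding the result, is what lets a later context-window change flow through without a second edit.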

Confidence Score: 5/5

Safe to merge — all new behaviour is gated behind the defaulted-on isSurfaceContextEnabled toggle, AX reads are cached per field session, secure fields are never probed, and 1119 tests pass with zero failures.

The surface-conditioning, token-budget, and OCR-correction changes are well-isolated: surface metadata is frozen per field session and sanitized before reaching any prompt, the token budget is now derived from the runtime constant so it cannot drift silently, and the wider prefix window is bounded by the same budget. The previously flagged Unicode suffix-strip and fieldPlaceholder parity issues are both resolved. The only remaining note is a doc-comment naming mismatch on registrableDomain which does not affect runtime behaviour.

No files require special attention.

Important Files Changed

Filename	Overview
Cotabby/Support/AppSurfaceClassifier.swift	New single-source-of-truth classifier; well-structured with correct case-folding before prefix matching. Tests cover all surface classes and the integrated-terminal-beats-everything precedence rule.
Cotabby/Support/SurfaceContextComposer.swift	Sanitization pipeline is thorough and well-tested; Unicode suffix-strip bug fixed. Minor: `registrableDomain` name implies eTLD+1 but retains full subdomain intentionally.
Cotabby/Services/Focus/SurfaceContextCache.swift	Simple LRU-of-1 cache; @MainActor isolation and nonisolated deinit match FieldStyleCache pattern.
Cotabby/Services/Focus/FocusSnapshotResolver.swift	Secure fields correctly bypass cache; focusChangeSequence key invalidates on real focus changes.
Cotabby/Models/SuggestionModels.swift	Token budget now derived from LlamaRuntimeConfiguration.default.contextWindowTokens; cannot silently drift.
Cotabby/Support/SuggestionRequestFactory.swift	Surface context gated on isSurfaceContextEnabled; tokenBudget wired to llama path; FM path receives same surfaceContext.
Cotabby/Support/FoundationModelPromptRenderer.swift	Bundle-prefix tables removed in favour of AppSurfaceClassifier; fieldPlaceholder parity gap closed.
Cotabby/Support/SuggestionSettingsStore.swift	New cotabbySurfaceContextEnabled key defaults to true on miss; write-back added to saveAll.
Cotabby/Services/Visual/ScreenTextExtractor.swift	usesLanguageCorrection = true is safe for once-per-field path; improves OCR quality at source.
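
The "defaults to true on miss" behaviour noted for the settings store can be sketched with `UserDefaults`. The key name comes from the PR; the helper function and the write-back step as written here are assumptions about how such a migration is commonly done, not the store's actual code.

```swift
import Foundation

let surfaceContextKey = "cotabbySurfaceContextEnabled"

// Resolve the toggle: a missing key means "on", and the resolved value is written
// back so later reads see an explicit value (assumed behaviour, for illustration).
func resolveSurfaceContextEnabled(in defaults: UserDefaults) -> Bool {
    let resolved = defaults.object(forKey: surfaceContextKey) as? Bool ?? true
    defaults.set(resolved, forKey: surfaceContextKey)
    return resolved
}
```

Using `object(forKey:)` rather than `bool(forKey:)` matters here: `bool(forKey:)` returns `false` for a missing key, which would silently turn the feature off on upgrade.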

Reviews (3): Last reviewed commit: "ci: retrigger checks after force-push sy..."

OCR text conditions the prompt, and the downstream hygiene filters can only drop garbled lines, not repair them. Language correction cuts the garbling at the source; the capture is once per focused field, so the extra Vision work is off every hot path.

The base model previously received bare prefix text: on Mail, Slack, or Docs it had no idea what surface it was continuing, so completions read generic. The prompt preface now states the surface (An email being written in Mail.), the window title (the subject, document name, or channel is the highest-signal cue Accessibility offers), the web domain, and the field placeholder, all sanitized and length-capped. Code editors and terminals are deliberately excluded: app metadata biases small base models toward code and numbers exactly where the text already makes the language obvious. The Foundation Models prompt states the same sanitized facts. Capture is one Accessibility read per field session, cached and frozen for the session so the prompt bytes ahead of the prefix stay stable and llama KV prefix reuse keeps absorbing them; a retitling browser tab cannot thrash the cache. Secure fields are never probed. Classification moves to a shared AppSurfaceClassifier so both engines agree about what kind of app the user is in. New Include App Context toggle (default on, indexed for settings search); everything stays on device.
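
To make the preface shape concrete, here is a minimal sketch of sanitizing and composing the surface facts. The 80-character cap, the `Title:` / `Site:` / `Field:` labels, and the function names are all assumptions; the PR states only that each fact is sanitized and length-capped.

```swift
import Foundation

// Hypothetical cap; the PR says fields are length-capped but gives no number.
let maxFieldLength = 80

// Replace control characters and newlines with spaces, collapse whitespace, cap length.
func sanitize(_ raw: String?) -> String? {
    guard let raw = raw else { return nil }
    let cleaned = raw.unicodeScalars.map { scalar -> String in
        let isBreak = CharacterSet.controlCharacters.contains(scalar)
            || CharacterSet.newlines.contains(scalar)
        return isBreak ? " " : String(scalar)
    }.joined()
    let collapsed = cleaned.split(separator: " ").joined(separator: " ")
    guard !collapsed.isEmpty else { return nil }
    return String(collapsed.prefix(maxFieldLength))
}

// Compose the preface; the line labels are illustrative, not the PR's wording.
func surfacePreface(surfaceSentence: String,
                    windowTitle: String?,
                    domain: String?,
                    placeholder: String?) -> String {
    var lines = [surfaceSentence]
    if let title = sanitize(windowTitle) { lines.append("Title: \(title)") }
    if let site = sanitize(domain) { lines.append("Site: \(site)") }
    if let field = sanitize(placeholder) { lines.append("Field: \(field)") }
    return lines.joined(separator: "\n")
}
```

The composer is a pure function of its inputs, so feeding it the frozen per-session capture yields identical bytes on every keystroke.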

…window The token-aware section allocator existed but nothing called it; the shipped path budgeted 2400 characters flat, which misjudges code, CJK, and punctuation-heavy text. The factory now passes a budget derived from the runtime's per-sequence context window (2048) minus the output ceiling and a safety margin, with per-section character caps retained as a second bound. With token-true budgeting in place, the llama prefix window rises from 1000 characters / 50 words to Foundation Models parity (2500 / 150). The old cap predates KV prefix reuse: prefill for a larger window is now paid once per focused field rather than per keystroke, and the extra preceding sentences carry the topic and voice that multi-paragraph email and docs continuations need. New long-document eval cases show completions correctly referencing content 1500+ characters before the caret at 344-734ms cold start, well inside the existing p95.
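
A prefix window with both a character cap and a word cap can be sketched as follows. How the shipped code combines the two caps is not shown in the PR; this example assumes both apply and the tighter one wins, and the function name is hypothetical.

```swift
// Keep the text nearest the caret, bounded by both a character cap and a word cap.
func trailingWindow(of text: String, maxCharacters: Int, maxWords: Int) -> String {
    // Character cap first: the last `maxCharacters` characters before the caret.
    let byCharacters = text.suffix(maxCharacters)
    // Word cap second: split on any whitespace, dropping empty pieces.
    let words = byCharacters.split(whereSeparator: { $0.isWhitespace })
    guard maxWords > 0, words.count > maxWords else {
        return String(byCharacters)
    }
    // Substring indices are shared with the base string, so slice from the
    // start of the first word we keep.
    let firstKept = words[words.count - maxWords]
    return String(byCharacters[firstKept.startIndex...])
}

// The PR's new llama window would correspond to maxCharacters: 2500, maxWords: 150.
```

The token budget still acts as the hard bound after this window is cut, which is why the character and word caps are described as a second bound rather than the only one.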

@MainActor

…test teardown Stored-property @MainActor classes deallocated inside app-hosted tests double-free without an explicitly nonisolated deinit; FieldStyleCache carries the same workaround. Surfaced by the live resolver tests once this branch rebased onto them.

…udget, FM placeholder parity The lowercased hasSuffix paired with an original-string dropLast count could clip the wrong amount for characters that expand under case folding; the strip now uses an anchored backwards case-insensitive range. The 1934 token budget is now derived from LlamaRuntimeConfiguration.default so a context-window change cannot silently desynchronize it, with the output ceiling and safety margin as named constants. The FM prompt now states the field placeholder exactly like the llama preface, and the prefix-window comment states the real latency contract on trim-rejecting catalog models instead of assuming reuse.
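
The suffix-strip fix can be illustrated with Foundation's `range(of:options:)`. The function name and the example suffixes are assumptions; the point is that the match range is computed on the original string, so no character count is carried over from a case-folded copy.

```swift
import Foundation

// Strip `suffix` from the end of `text`, ignoring case.
// .anchored + .backwards restricts the match to the very end of the string,
// and the returned range indexes the original text, so the cut is always exact.
func stripSuffix(_ suffix: String, from text: String) -> String {
    guard let range = text.range(
        of: suffix,
        options: [.caseInsensitive, .anchored, .backwards]
    ) else {
        return text
    }
    return String(text[..<range.lowerBound])
}
```

For example, `stripSuffix(" - Google Docs", from: "Roadmap - GOOGLE DOCS")` removes the trailing app name regardless of its casing, while a string that does not end in the suffix is returned unchanged.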

greptile-apps Bot reviewed Jun 12, 2026

Comment thread Cotabby/Support/SurfaceContextComposer.swift

Comment thread Cotabby/Models/SuggestionModels.swift Outdated

Comment thread Cotabby/Support/FoundationModelPromptRenderer.swift

FuJacob mentioned this pull request Jun 12, 2026

Decode gates, quality telemetry, adaptive debounce #688

Merged

FuJacob force-pushed the quality/eval-and-output-hygiene branch from b0e6238 to f06caeb Compare June 12, 2026 03:16

FuJacob added 4 commits June 11, 2026 20:22

FuJacob force-pushed the quality/context-relevance branch 2 times, most recently from d2ea9a1 to c351bdf Compare June 12, 2026 03:23

FuJacob force-pushed the quality/context-relevance branch from c351bdf to 5623a71 Compare June 12, 2026 03:24

FuJacob changed the base branch from quality/eval-and-output-hygiene to main June 12, 2026 03:25

ci: retrigger checks after force-push synchronize was dropped

94a4439

FuJacob merged commit c24b1b1 into main Jun 12, 2026
4 checks passed

FuJacob merged 6 commits into main from quality/context-relevance
