
Context relevance: surface conditioning, token-true budgeting, wider prefix, OCR correction #686

Merged
FuJacob merged 6 commits into main from quality/context-relevance on Jun 12, 2026
FuJacob (Owner) commented Jun 12, 2026

Summary

Stacked on #683 (the eval harness measures everything here). Three changes to what the model sees:

  1. Surface conditioning (the headline). The llama prompt previously carried zero situational context: on Mail, Slack, or Google Docs the base model had no idea what surface it was continuing, so completions read generic. The preface now states the surface class (An email being written in Mail.), the sanitized window title (the email subject / document name / channel is the highest-signal cue Accessibility offers), the web domain, and the field placeholder. Code editors and terminals are deliberately excluded (app metadata biases small base models toward code/numbers exactly where the text already makes that obvious), as are anonymous generic apps. The FM prompt states the same sanitized facts. Capture is one AX read per field session, cached and frozen for the session so the prompt bytes ahead of the prefix stay byte-stable and llama KV prefix reuse keeps absorbing them. Secure fields are never probed. New "Include App Context" toggle, default on, settings-search indexed; everything stays on device.
  2. Token-true budgeting + FM-parity prefix window. The token-aware allocator existed with zero callers; it is now the shipped path (budget = 2048-token per-sequence KV capacity minus output ceiling and margin, per-section char caps retained as a second bound). With that in place, the llama prefix window rises from 1000 chars / 50 words to 2500 / 150. The old cap predates KV prefix reuse: the larger window's prefill is paid once per focused field, not per keystroke.
  3. Vision language correction on for visual-context OCR (once per field, off the hot path; garbled recognitions die at the source instead of relying on post-hoc filters).
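
The budget arithmetic in item 2 can be sketched as below. Only the 2048-token per-sequence window and the resulting 1934-token budget (named in a later commit message) come from this PR. The split between output ceiling and safety margin (64 and 50 here) is an assumption chosen so the numbers add up, and every type and function name except `contextWindowTokens` is hypothetical.

```swift
// Hypothetical stand-in for the runtime configuration. Only the 2048-token
// window and the 1934-token result are taken from the PR text.
struct RuntimeConfiguration {
    let contextWindowTokens: Int
    static let standard = RuntimeConfiguration(contextWindowTokens: 2048)
}

enum PromptBudget {
    // Assumed values: the PR says these are named constants but does not
    // state them. 64 + 50 = 114 is the only part implied by 2048 -> 1934.
    static let outputCeilingTokens = 64
    static let safetyMarginTokens = 50

    // Derived from the window, so changing the window moves the budget with it.
    static func promptTokens(for config: RuntimeConfiguration) -> Int {
        max(0, config.contextWindowTokens - outputCeilingTokens - safetyMarginTokens)
    }
}

let budget = PromptBudget.promptTokens(for: .standard)
print(budget) // 1934
```

Deriving the number instead of hard-coding it is the point: a literal 1934 would keep compiling after a context-window change and silently overrun or underuse the KV capacity.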

Bundle classification moves to a shared AppSurfaceClassifier so the FM tone hints and the new surface preface can never disagree (FM behavior unchanged, pinned by tests).
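
A minimal sketch of the shared-classifier idea, assuming a small bundle-prefix table. The enum cases, the table entries, and the "Title:" wording are illustrative only; the one phrase taken from this PR is "An email being written in Mail.", and the real `AppSurfaceClassifier` may be shaped quite differently.

```swift
// Hypothetical surface classes, for illustration only.
enum SurfaceClass {
    case email, chat, codeEditor, terminal, generic
}

func classify(bundleID: String) -> SurfaceClass {
    // Case-fold once, then prefix-match against a small table.
    let id = bundleID.lowercased()
    let table: [(prefix: String, surface: SurfaceClass)] = [
        (prefix: "com.apple.mail", surface: SurfaceClass.email),
        (prefix: "com.tinyspeck.slackmacgap", surface: SurfaceClass.chat),
        (prefix: "com.apple.terminal", surface: SurfaceClass.terminal),
        (prefix: "com.microsoft.vscode", surface: SurfaceClass.codeEditor),
    ]
    return table.first(where: { id.hasPrefix($0.prefix) })?.surface ?? .generic
}

// Returns nil where the PR says context is withheld: code editors,
// terminals, and anonymous generic apps.
func preface(bundleID: String, appName: String, windowTitle: String?) -> String? {
    let lead: String
    switch classify(bundleID: bundleID) {
    case .email: lead = "An email being written in \(appName)."
    case .chat: lead = "A chat message being written in \(appName)."
    case .codeEditor, .terminal, .generic: return nil
    }
    guard let title = windowTitle, !title.isEmpty else { return lead }
    return "\(lead) Title: \(title)."
}

print(preface(bundleID: "com.apple.mail", appName: "Mail", windowTitle: nil) ?? "none")
```

Because both prompt renderers would call the same `classify`, a bundle cannot be an email surface for one engine and a generic app for the other.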

Validation

xcodebuild build-for-testing ... CODE_SIGNING_ALLOWED=NO   # ** TEST BUILD SUCCEEDED **, no new warnings
xcodebuild test-without-building ...                       # FULL suite: 1119 tests, 0 failures (5 gated evals skipped)
swiftlint lint --quiet <changed files>                     # exit 0
xcodegen generate                                          # pbxproj additions only

Eval (Gemma E2B Q6_K, fixed seed, same dataset as the #683 baseline):

  • 114-case core: qualityScore 0.727 vs baseline 0.742, wrongShowRate 0.193 vs 0.184 — within single-sample noise. A per-case diff showed exactly 4 flips: 2 adverse single-draw flips, 1 improvement (a misspelled-splice now suppressed), and 1 case where the new output was strictly better but needed a reference-list addition ("grab lunch at " → "12:30?"). The dataset's prefixes are long and topical by construction, which is exactly where surface context adds least; its value case is short/ambiguous prefixes and on-topic vocabulary, which the new factory/renderer tests pin structurally.
  • New long-document cases (the wider window's purpose): 3/3 correctInsert with completions referencing content 1500+ characters before the caret (a chat reply that correctly recalls a checklist from the top of the scroll; an email that flags the right risk from two paragraphs up). Cold-start latency 344/382/734ms, inside the pre-existing p95 (1028ms). Overall p50 187ms vs 171ms baseline.

Linked issues

Refs #660 (Google Docs quality: browser surfaces now carry domain + document title context).

Risk / rollout notes

  • New persisted setting cotabbySurfaceContextEnabled (default on). Snapshot/data/store all migrated with write-back; fresh installs and upgrades both resolve to on.
  • Focus capture does 2-3 extra AX reads once per field session (window title, placeholder, URL for browsers), cached by the same pattern as the existing field-style read; the steady poll is untouched. The URL read previously ran per poll when per-site disable was on; it is now also cached per field session (navigation recreates the focused web element, so a stale URL cannot outlive its page).
  • Prompt-shape change invalidates KV prefix reuse once per field on first run after update, then amortizes as before.
  • Wider prefix raises first-suggestion prefill in long documents; measured at 344-734ms on the 2B model, and the token budget hard-bounds the prompt. If field reports say otherwise, the constants are in SuggestionConfiguration.standard.
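
The capture-once, freeze-for-the-session behaviour described in these notes can be sketched as a one-entry cache keyed by a focus-change sequence number. All names and fields below are hypothetical; the PR only states that the cache is an LRU of one, keyed so real focus changes invalidate it, and that secure fields bypass it.

```swift
// Hypothetical shape of the captured metadata.
struct SurfaceContext: Equatable {
    var windowTitle: String?
    var placeholder: String?
    var domain: String?
}

final class SessionCache {
    private var key: UInt64?
    private var value: SurfaceContext?

    // Runs `capture` at most once per focus sequence. Later calls in the same
    // session return the frozen value even if the live title has changed.
    func context(forFocusSequence sequence: UInt64,
                 isSecureField: Bool,
                 capture: () -> SurfaceContext) -> SurfaceContext? {
        if isSecureField { return nil } // secure fields are never probed
        if key == sequence, let cached = value { return cached }
        let fresh = capture()
        key = sequence
        value = fresh
        return fresh
    }
}

var axReads = 0
let cache = SessionCache()
let first = cache.context(forFocusSequence: 7, isSecureField: false) {
    axReads += 1
    return SurfaceContext(windowTitle: "Q3 plan", placeholder: nil, domain: nil)
}
let second = cache.context(forFocusSequence: 7, isSecureField: false) {
    axReads += 1
    return SurfaceContext(windowTitle: "Q3 plan (renamed)", placeholder: nil, domain: nil)
}
print(axReads, first == second) // 1 true
```

Freezing the value is what keeps the prompt bytes ahead of the prefix identical between keystrokes, which is the precondition for KV prefix reuse.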

Greptile Summary

This PR adds three coordinated improvements to what the local completion model sees: a surface-conditioning preface that tells the model which app, window title, domain, and field placeholder the user is writing in; a token-true prompt budget derived from the runtime's actual KV capacity; and a wider prefix window (1000 → 2500 chars) that amortises its prefill cost via KV prefix reuse.

  • Surface conditioning (AppSurfaceClassifier, SurfaceContextComposer, SurfaceContextCache) introduces a shared classifier so the llama preface and the FM tone hint always agree about the current app class, with code editors and terminals explicitly suppressed to avoid biasing base models. Metadata is captured once per field session, frozen for its lifetime to keep prompt bytes byte-stable, and never probed on secure fields.
  • Token-aware budgeting (SuggestionModels, SuggestionRequestFactory) wires up the previously zero-caller tokenBudget path by deriving the budget from LlamaRuntimeConfiguration.default.contextWindowTokens at compile time, preventing the budget from silently drifting if the context window constant changes.
  • OCR language correction (ScreenTextExtractor) enables usesLanguageCorrection = true for the once-per-field visual context capture, cutting garbled recognitions at the source rather than relying on downstream filters.
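
For readers unfamiliar with the Vision flag, a minimal sketch of a text-recognition request with language correction enabled follows. The function names and the `.accurate` recognition level are assumptions made for the example; the PR only states that `usesLanguageCorrection = true` is set in `ScreenTextExtractor`.

```swift
import Vision
import CoreGraphics

// Builds a text-recognition request with language correction on.
// `.accurate` is an assumption; the PR does not say which level is used.
func makeTextRequest(onText: @escaping ([String]) -> Void) -> VNRecognizeTextRequest {
    let request = VNRecognizeTextRequest { finished, _ in
        let observations = finished.results as? [VNRecognizedTextObservation] ?? []
        // Keep the single best candidate string per recognized line.
        onText(observations.compactMap { $0.topCandidates(1).first?.string })
    }
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = true
    return request
}

// Runs the request synchronously against one captured image.
func recognizeText(in image: CGImage, onText: @escaping ([String]) -> Void) throws {
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([makeTextRequest(onText: onText)])
}
```

Language correction adds work per recognition, which is why it matters that the capture runs once per focused field and not on the per-keystroke path.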

Confidence Score: 5/5

Safe to merge — all new behaviour is gated behind the defaulted-on isSurfaceContextEnabled toggle, AX reads are cached per field session, secure fields are never probed, and 1119 tests pass with zero failures.

The surface-conditioning, token-budget, and OCR-correction changes are well-isolated: surface metadata is frozen per field session and sanitized before reaching any prompt, the token budget is now derived from the runtime constant so it cannot drift silently, and the wider prefix window is bounded by the same budget. The previously flagged Unicode suffix-strip and fieldPlaceholder parity issues are both resolved. The only remaining note is a doc-comment naming mismatch on registrableDomain which does not affect runtime behaviour.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Support/AppSurfaceClassifier.swift | New single-source-of-truth classifier; well-structured with correct case-folding before prefix matching. Tests cover all surface classes and the integrated-terminal-beats-everything precedence rule. |
| Cotabby/Support/SurfaceContextComposer.swift | Sanitization pipeline is thorough and well-tested; Unicode suffix-strip bug fixed. Minor: the name registrableDomain implies eTLD+1, but the value intentionally retains the full subdomain. |
| Cotabby/Services/Focus/SurfaceContextCache.swift | Simple LRU-of-1 cache; @MainActor isolation and nonisolated deinit match the FieldStyleCache pattern. |
| Cotabby/Services/Focus/FocusSnapshotResolver.swift | Secure fields correctly bypass the cache; the focusChangeSequence key invalidates on real focus changes. |
| Cotabby/Models/SuggestionModels.swift | Token budget now derived from LlamaRuntimeConfiguration.default.contextWindowTokens; cannot silently drift. |
| Cotabby/Support/SuggestionRequestFactory.swift | Surface context gated on isSurfaceContextEnabled; tokenBudget wired to the llama path; FM path receives the same surfaceContext. |
| Cotabby/Support/FoundationModelPromptRenderer.swift | Bundle-prefix tables removed in favour of AppSurfaceClassifier; fieldPlaceholder parity gap closed. |
| Cotabby/Support/SuggestionSettingsStore.swift | New cotabbySurfaceContextEnabled key defaults to true on miss; write-back added to saveAll. |
| Cotabby/Services/Visual/ScreenTextExtractor.swift | usesLanguageCorrection = true is safe for the once-per-field path; improves OCR quality at the source. |

Reviews (3): Last reviewed commit: "ci: retrigger checks after force-push sy..."

Comment thread Cotabby/Support/SurfaceContextComposer.swift
Comment thread Cotabby/Models/SuggestionModels.swift Outdated
Comment thread Cotabby/Support/FoundationModelPromptRenderer.swift
FuJacob added 4 commits June 11, 2026 20:22
OCR text conditions the prompt, and the downstream hygiene filters can only
drop garbled lines, not repair them. Language correction cuts the garbling at
the source; the capture is once per focused field, so the extra Vision work
is off every hot path.
The base model previously received bare prefix text: on Mail, Slack, or Docs
it had no idea what surface it was continuing, so completions read generic.
The prompt preface now states the surface (An email being written in Mail.),
the window title (the subject, document name, or channel is the highest-signal
cue Accessibility offers), the web domain, and the field placeholder, all
sanitized and length-capped. Code editors and terminals are deliberately
excluded: app metadata biases small base models toward code and numbers
exactly where the text already makes the language obvious. The Foundation
Models prompt states the same sanitized facts.

Capture is one Accessibility read per field session, cached and frozen for the
session so the prompt bytes ahead of the prefix stay stable and llama KV
prefix reuse keeps absorbing them; a retitling browser tab cannot thrash the
cache. Secure fields are never probed. Classification moves to a shared
AppSurfaceClassifier so both engines agree about what kind of app the user is
in. New Include App Context toggle (default on, indexed for settings search);
everything stays on device.

…window

The token-aware section allocator existed but nothing called it; the shipped
path budgeted 2400 characters flat, which misjudges code, CJK, and
punctuation-heavy text. The factory now passes a budget derived from the
runtime's per-sequence context window (2048) minus the output ceiling and a
safety margin, with per-section character caps retained as a second bound.

With token-true budgeting in place, the llama prefix window rises from 1000
characters / 50 words to Foundation Models parity (2500 / 150). The old cap
predates KV prefix reuse: prefill for a larger window is now paid once per
focused field rather than per keystroke, and the extra preceding sentences
carry the topic and voice that multi-paragraph email and docs continuations
need. New long-document eval cases show completions correctly referencing
content 1500+ characters before the caret at 344-734ms cold start, well
inside the existing p95.
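
One plausible reading of the "2500 / 150" window is "keep the trailing text that satisfies both a character cap and a word cap". The PR does not spell out how the two caps interact, so the sketch below is an assumption, and the function name is hypothetical.

```swift
// Keeps the tail of `text` that fits within both caps. Assumed semantics:
// apply the character cap first, then drop leading words until at most
// `maxWords` remain.
func trailingWindow(of text: String, maxCharacters: Int, maxWords: Int) -> Substring {
    precondition(maxCharacters >= 0 && maxWords > 0, "caps must be positive")
    let byCharacters = text.suffix(maxCharacters)
    let words = byCharacters.split(whereSeparator: { $0.isWhitespace })
    guard words.count > maxWords else { return byCharacters }
    // Start at the first of the last `maxWords` words.
    let start = words[words.count - maxWords].startIndex
    return byCharacters[start...]
}

let window = trailingWindow(of: "one two three four", maxCharacters: 2500, maxWords: 2)
print(window) // three four
```

Under this reading the character cap bounds prefill cost, while the word cap keeps whitespace-heavy text from filling the window with very little content.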
…test teardown

Stored-property @MainActor classes deallocated inside app-hosted tests
double-free without an explicitly nonisolated deinit; FieldStyleCache carries
the same workaround. Surfaced by the live resolver tests once this branch
rebased onto them.
FuJacob force-pushed the quality/context-relevance branch 2 times, most recently from d2ea9a1 to c351bdf, June 12, 2026 03:23
…udget, FM placeholder parity

The lowercased hasSuffix paired with an original-string dropLast count
could clip the wrong amount for characters that expand under case
folding; the strip now uses an anchored backwards case-insensitive
range. The 1934 token budget is now derived from
LlamaRuntimeConfiguration.default so a context-window change cannot
silently desynchronize it, with the output ceiling and safety margin as
named constants. The FM prompt now states the field placeholder exactly
like the llama preface, and the prefix-window comment states the real
latency contract on trim-rejecting catalog models instead of assuming
reuse.
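
The suffix-strip fix described in this commit message can be illustrated with Foundation's anchored, backwards, case-insensitive search. The function name and the " - Google Docs" suffix are hypothetical examples; the PR does not list which suffixes are stripped.

```swift
import Foundation

// Strips `suffix` from the end of `title`, ignoring case. The matched range is
// located in the original string, so no character count is carried over from
// a separately lowercased copy.
func strippingSuffix(_ suffix: String, from title: String) -> String {
    guard let range = title.range(of: suffix,
                                  options: [.caseInsensitive, .anchored, .backwards]) else {
        return title // no match at the end: leave the title untouched
    }
    return String(title[..<range.lowerBound])
}

print(strippingSuffix(" - Google Docs", from: "Q3 Plan - google docs")) // Q3 Plan
```

With `.anchored` and `.backwards` together the search only matches at the very end of the string, which is the same condition `hasSuffix` checks, but the returned range already belongs to the original string.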
FuJacob force-pushed the quality/context-relevance branch from c351bdf to 5623a71, June 12, 2026 03:24
FuJacob changed the base branch from quality/eval-and-output-hygiene to main, June 12, 2026 03:25
FuJacob merged commit c24b1b1 into main, Jun 12, 2026
4 checks passed