test(cli): screenshot tests asserting each command loads the right skill#682
Open
sarahxsanders wants to merge 6 commits into
Open
test(cli): screenshot tests asserting each command loads the right skill#682sarahxsanders wants to merge 6 commits into
sarahxsanders wants to merge 6 commits into
Conversation
Adds scripts/cli-screenshots.mjs (pnpm screens:cli): for each command it runs
the real binary in a pseudo-terminal (via `script`, since Ink needs a TTY),
captures the raw ANSI output, and diffs it against a committed golden dump under
scripts/__screenshots__/. Catches regressions where a command falls through to
the default flow, renders the wrong intro screen, or doesn't render at all —
the class of bug we hit this cycle.
Goldens are raw ANSI ("jank screenshots"); normalize() strips volatile
cursor/spinner sequences before comparing so a live TUI doesn't flake.
Complements the component-level scripts/check-screens.tsx.
Includes a workflow_dispatch CI job. NOT yet wired to gate PRs: the goldens
need seeding once in a real terminal/CI (`pnpm build && pnpm screens:cli
--update`, commit scripts/__screenshots__/), then enable the pull_request/push
triggers. (Couldn't seed here — the sandbox has no pty.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🧙 Wizard CIRun the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands: Test all apps:
Test all apps in a directory:
Test an individual app:
Show more apps
Results will be posted here when complete. |
SIGTERM to `script` left the inner wizard running (script doesn't forward signals), and a stray wizard holds resources like the OAuth-callback port — so the first capture worked and every one after it came back empty. Spawn detached and SIGINT the process group (Ink then restores the terminal; the inner node dies), with a SIGKILL fallback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…byte-exact) Replaces the `script`-based capture, which painted blank frames because the pty had no window size when detached. node-pty gives a real pty with a fixed size (COLS×ROWS), forced colour (stable local↔CI), raw-byte capture (encoding: null), and a clean kill. Capture waits for the screen to settle, then snapshots; goldens are raw binary ANSI dumps compared byte-for-byte. Adds node-pty as a devDependency (not bundled into the shipped wizard) and to pnpm.onlyBuiltDependencies so its native build runs on install. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eenshot Run each command in a sized pty (node-pty), render the settled screen through a headless emulator (@xterm/headless), and assert it contains a source-grounded marker — the program/skill id the command wires into its intro screen. Catches command -> screen wiring regressions. - marker match, not byte-for-byte golden: survives spinners, async fetches, and project-dependent intros that make whole-frame matching flake - self-heal node-pty's spawn-helper +x bit (pnpm drops it on install) - CI: path-filtered PR gate + daily drift monitor, Slack on scheduled fail Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The integration (default) and slack flows render different screens depending on the directory, detection speed, and login state — cold CI catches `default` still detecting, and logged-out CI shows slack's connect intro rather than the login wait. A single static marker can't match all of them. Allow `marker` to be an array (pass if any appears); every accepted marker is still flow-specific, so a fall-through to the wrong flow fails. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gewenyu99
approved these changes
Jun 18, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Catches command → intro-screen wiring regressions: a command falling through to the default flow, routing to the wrong skill, or rendering nothing. For each command in the surface, the harness (
scripts/cli-screenshots.mjs):node-pty),@xterm/headless) and snapshots the settled screen — an actual screenshot, colour + layout,Program ✔ …/Skill ✔ …).Run it with
pnpm screens:cli. Verified locally (macOS) and in a Linux container mirroring CI — 12/12 green.Decisions & gotchas worth your attention
Why a native dependency (
node-pty) instead of thescriptcommandInk asks the terminal "how many rows/columns do you have?" before it draws — no size, blank screen.
node-ptylets us set that size (100×40);scriptonly inherits it from an attached terminal, and there's none when spawned headlessly or in CI, so Ink paints nothing (this is exactly the empty-capture dead end we hit first).The deeper reason: allocating a pty is an OS syscall (
openpty/forkpty) — there's no pure-JS way to do it. Your only options are a native addon that exposes size control (node-pty) or shelling out to a native binary that doesn't (script). So even on Linux,scriptwouldn't have worked. The native dep buys the one thing Ink requires.We assert a string match, not a full-screenshot match
Tempting to diff the whole rendered frame against a committed golden — but several screens animate (spinners) or wait on an async fetch, and some intros are project-dependent (run in this repo,
revenue-analyticshits a "no Stripe SDK" preflight block instead of its normal intro). Whole-frame matching flakes on all of that.So we still read the rendered screenshot — we just assert the one line that proves the routing (the source-wired program/skill id), which is stable across spinners, ephemeral ports, and project state. Each marker carries a
srcfield pointing at the wizard source it comes from, so it stays a deliberate assertion, not an eyeballed string. There are no committed goldens — markers live in the harness; rendered frames are gitignored CI artifacts.Two markers are intentionally weaker, and the code says so:
revenue-analytics/upload-source-mapspreflight-block in the wizard's own repo, so they don't reach the skill-id intro — their marker proves the command routed to the right flow, not the exact skill id.slack addstarts OAuth immediately, so its settled screen is the auth wait — the marker proves it didn't fall through to the default flow.Ink workarounds
cols/rows) — see above; without it Ink renders blank.FORCE_COLOR=3+TERM=xterm-256colorpinned so colour output is identical local ↔ CI (otherwise the render differs by environment).SETTLE_MS), with a hard cap (MAX_CAPTURE_MS) for screens that never settle (live spinners). The emulator then collapses all the intermediate spinner/wait frames into the final screen.node-ptypackagingspawn-helper, sopty.spawndies withposix_spawnp failedon a fresh install. The harness re-adds+xitself (self-heals; no-op once fixed upstream). Needsnode-ptyinpnpm.onlyBuiltDependencies.CI (
cli-screenshots.yml)pull_request(path-filtered) — pre-merge gate, runs only when files that decide wiring change (bin.ts,src/commands/**,src/ui/tui/**,src/lib/programs/**, the harness).schedule(daily) — drift monitor: this is partly an integration test against context-mill's liveskill-menu.json, so an upstream change can break a command with no wizard PR. Slack-notifies on scheduled failure (reusesSLACK_WEBHOOK_WIZARD_CHANNEL); PR failures self-announce via the red ❌.workflow_dispatch— manual runs.🤖 Generated with Claude Code