[STG-1745] Add ui-test skill — adversarial UI testing with browse CLI#56
Conversation
Builds on #52 with three key additions:

1. Local/remote mode selection — localhost uses a local browser (no API key); deployed sites use Browserbase via cookie-sync for authenticated testing
2. Diff-driven testing — analyze the git diff, generate targeted tests for what changed, execute with before/after snapshot comparison
3. Structured assertion protocol — STEP_PASS/STEP_FAIL markers with evidence, deterministic checks (axe-core, console errors, overflow detection), and adversarial testing patterns (XSS, empty submit, rapid click, keyboard-only)

Smoke-tested against a local Next.js app: found real bugs (Escape not closing modals, undersized mobile touch targets) that confirmed the adversarial patterns work. Fixed browse eval recipes (no top-level await, console capture on-page, not about:blank).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ssions

Enables concurrent test execution by leveraging browse CLI's --session flag to spin up independent Browserbase browsers per test group, with fan-out via the Agent tool and merged result reporting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents how to add Bash(browse:*) to project or user settings so users don't get prompted on every browse snapshot/click/eval. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t figure it out

- Remove .ui-tests/suite.yml format and generation pipeline
- Replace Workflow B (8-step codebase analysis) with lightweight exploratory testing
- Simplify references/codebase-analysis.md to quick hints (framework detection, route finding)
- Remove example YAML suite file
- Update README to reflect no-artifacts philosophy
- Drop Write tool from allowed-tools (no files to generate)

The codegen/suite approach can ship as v2 later.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- XSS check: replace false-positive inline script count with input value check
- Console capture: preserve original console.error in Example 6 snippets
- Form labels: use native i.labels API in browser-recipes.md (matches SKILL.md)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Also strengthens auto-select rule: localhost → browse env local, deployed URLs → browse env remote, applied consistently across all workflows including parallel sessions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
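The auto-select rule above can be sketched as a small predicate. This is illustrative only — the function name and the exact hostname list are assumptions; the skill expresses the rule in prose:

```javascript
// Sketch of the localhost → local / deployed → remote rule described above.
// The helper name and hostname list are illustrative, not part of the skill.
function selectBrowseEnv(rawUrl) {
  const host = new URL(rawUrl).hostname;
  const localHosts = ['localhost', '127.0.0.1', '[::1]', '0.0.0.0'];
  return localHosts.includes(host) ? 'local' : 'remote';
}
```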
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove "remote only" restriction — named sessions work with local mode
- Add BROWSE_SESSION=* permission pattern to avoid approval fatigue on parallel runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If references/design-system.md exists, use it as ground truth. Otherwise, screenshot 2-3 existing pages to establish baseline patterns (spacing, radii, colors, typography) and compare the changed page against them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a test step fails, the skill now instructs the agent to take a screenshot and save it to .context/ui-test-screenshots/<step-id>.png, referenced in the STEP_FAIL marker and final report so developers can see exactly what went wrong. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
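A marker that carries the screenshot path might be assembled like this — the `[screenshot: ...]` suffix syntax is an assumption; the skill only specifies that the STEP_FAIL marker and the final report reference the saved file:

```javascript
// Illustrative helper producing a STEP_FAIL marker that references the
// screenshot saved under .context/ui-test-screenshots/. Suffix syntax is
// a hypothetical convention, not taken from the skill.
function stepFailMarker(stepId, expected, actual) {
  const shot = `.context/ui-test-screenshots/${stepId}.png`;
  return `STEP_FAIL|${stepId}|${expected} → ${actual} [screenshot: ${shot}]`;
}
```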
Extracted from browserbase-brand-guidelines skill: colors, typography, border radii, spacing grid, component patterns, and visual principles. The ui-test skill checks changed pages against this when it exists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite Budget & Limits section to use a coordinator/sub-agent model: main agent plans and delegates, sub-agents do the actual testing with a 20-step hard cap each. Wall clock target ~10 min for default runs.
Integrates two external frameworks into the testing skill:

- Judgement (Emil Kowalski + Josh Puckett + UI Wiki): 9 reference files covering animations, forms, touch/a11y, typography, polish, component design, marketing, performance, and 152 UI wiki rules. Adds deterministic eval checks for touch targets, iOS zoom, transition:all, z-index abuse, and form labels. Adds screenshot-based critique methodology.
- Luck (soleio): Assembly Theory meta-evaluation lens — 7 facets adapted to UI (solvency, gradient coupling, compatibility, niche construction, circulation, integration, path sensitivity) for "will this UI thrive?"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip out the Craft Quality Judgement section, Luck Lens meta-evaluation, and all references/judgement/ files. Keep the skill focused on functional testing, accessibility, and UX heuristics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename design-system.md → design-system.example.md with instructions for users to copy and fill in their own brand tokens. The skill reads design-system.md (user-created), not the example. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add browse wait timeout 3000 after axe-core script injection (SKILL.md, browser-recipes.md)
- Fix form label check to include aria-label and aria-labelledby (SKILL.md)
- Fix focus ring detection to check box-shadow too, not just outline (browser-recipes.md)
- Fix window.__capturedErrors → window.__logs in Example 8 (EXAMPLES.md)
- design-system.md already fixed in prior commit (renamed to .example.md)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add aria-labelledby to hasLabel check in browser-recipes.md
- Add browse wait timeout 3000 after axe-core injection in Examples 4 and 7
- hasFocus box-shadow check was already fixed in prior commit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
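The broadened label check can be sketched as follows — a duck-typed element is used here for illustration; in the actual recipe the argument would be a DOM input element inside a `browse eval`:

```javascript
// Sketch of the label check after the fix: native labels, aria-label, or
// aria-labelledby all count as labelling an input.
function hasLabel(input) {
  return (input.labels && input.labels.length > 0) ||
         Boolean(input.getAttribute('aria-label')) ||
         Boolean(input.getAttribute('aria-labelledby'));
}
```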
- Fix allowed-tools delimiter: commas → spaces per spec
- Extract adversarial patterns → references/adversarial-patterns.md
- Extract parallel testing → references/parallel-testing.md
- Extract design consistency → references/design-consistency.md
- Replace inline deterministic checks with summary table + link
- Move rules/ux-heuristics.md → references/ (eliminate non-standard dir)
- Add conditional loading triggers for all reference files
- Update README.md file tree

Zero content lost — all extracted sections live in reference files with "when to load" guidance for progressive disclosure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop the "~30 seconds per browse command" claim and all derived wall clock estimates. Budget is now defined purely in steps/turns, which is the actual constraint that caps spend. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a self-contained HTML report that reviewers can open in a browser. Screenshots are base64-embedded so the file works offline as a single artifact. Failed tests render open by default with inline screenshots; passed tests are collapsed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Light theme with warm off-white (#F9F6F4) background, Inter font via Google Fonts, brand orange-red (#F03603) for failures, brand green (#90C94D) for passes, brand blue (#4DA9E4) for suggestions. Inline Browserbase logo SVG in header and footer. Borders over shadows per brand design system. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restructure test runs into 2-3 sweeps from different angles:

1. Deterministic sweep: fixed checklist (axe-core, console errors, etc.) — same results every run
2. Exploratory-functional: interactions, edge cases, adversarial inputs
3. Exploratory-visual: responsive layout, keyboard nav, design consistency

Also adds a mandatory page discovery step so every route gets covered. This addresses the inconsistency where repeated runs surface different bugs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit 379e4af.
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>UI Test Report — {{TITLE}}</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">
Report template uses CDN violating offline requirement
Medium Severity
The report template loads Google Fonts from fonts.googleapis.com via two <link> tags, but SKILL.md explicitly states "The report must work offline — no CDN links, no external assets." This directly contradicts the offline requirement. Reports generated from this template will make external network requests and won't render the intended font when opened offline. The CSS fallback stack (Inter, -apple-system, ...) helps, but the external request itself violates the stated contract and may cause slow/blocked rendering in air-gapped environments.
Additional Locations (1)
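One way to bring the template back in line with the offline contract — sketched here as a post-processing pass rather than a template edit, with an illustrative function name — is to strip any Google Fonts `<link>` tags and let the existing system-font fallback stack take over:

```javascript
// Hypothetical post-processing pass: remove <link> tags pointing at Google
// Fonts so the generated report makes no external requests. The CSS fallback
// stack (Inter, -apple-system, ...) then resolves to a locally installed font.
function stripFontCdnLinks(html) {
  return html.replace(/[ \t]*<link[^>]*fonts\.(?:googleapis|gstatic)\.com[^>]*>\s*\n?/g, '');
}
```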
…ning

- Replace rigid budget tables (20 steps, 5 agents, quick/default/thorough modes) with guiding principles the agent uses to size effort itself
- Add multi-angle planning flow: plan from 3 perspectives, deduplicate, execute once — produces broad coverage without re-running the same checks
- Keep 20-step safety valve as a runaway cap, not a target
- Remove adjusting-the-budget modes entirely — one way of testing

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Make the three planning rounds mandatory and output-visible — the agent must write out all three rounds and the merged plan before it's allowed to launch any sub-agents. Prevents skipping straight to execution. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
# Tab through elements one at a time
browse press Tab
browse eval "JSON.stringify({tag: document.activeElement?.tagName, text: document.activeElement?.textContent?.trim().slice(0,40), role: document.activeElement?.getAttribute('role'), ariaLabel: document.activeElement?.getAttribute('aria-label'), hasFocus: (() => { const s = window.getComputedStyle(document.activeElement); return s.outlineStyle !== 'none' || s.boxShadow !== 'none'; })()})"
Focus ring detection produces false positives from decorative shadows
Medium Severity
The hasFocus check uses s.boxShadow !== 'none' as an OR condition to detect visible focus rings. This returns true for any element with a decorative box-shadow (elevation, depth effects), which is extremely common in modern UIs. The agent is then told "hasFocus should be true for every element," so elements with decorative shadows will always pass the focus ring check even when no actual focus indicator exists. This masks real accessibility issues — the core purpose of this deterministic recipe.
Additional Locations (1)
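A hypothetical refinement that avoids this false positive: count a focus ring only when the focused style differs from the blurred style, so a decorative box-shadow (identical in both states) no longer passes. Styles are passed as plain objects here; in a `browse eval` they would come from `getComputedStyle` captured before and after blurring the element:

```javascript
// Hypothetical fix sketch: a focus indicator is visible only if focusing the
// element actually changes its outline or box-shadow. Decorative elevation
// shadows are the same in both states and therefore return false.
function focusRingVisible(focusedStyle, blurredStyle) {
  return focusedStyle.outlineStyle !== blurredStyle.outlineStyle ||
         focusedStyle.boxShadow !== blurredStyle.boxShadow;
}
```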
- Planning rounds happen in the main agent's own response, never delegated
- Sub-agents receive a specific numbered test list and run only those tests
- Sub-agents do not explore or plan — execute assigned tests and stop
- Add explicit step budget requirement in every sub-agent prompt
- Step heuristics (25/40/75) are starting points, not rules

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sub-agents must return only STEP_PASS/STEP_FAIL markers as text. Only the coordinating main agent generates the final HTML report. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…etrics
- Report header: Tests | Passed | Failed | Agents (not Total Steps)
- Replace {{TOTAL_STEPS}} with {{TOTAL_TESTS}}, add {{AGENT_COUNT}}
- Remove redundant {{TEST_COUNT}} placeholder
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Sub-agents report STEP_SKIP|<id>|budget reached for unfinished tests
- Main agent accepts partial results as-is, never re-runs sub-agents
- Add 25/40/75 step heuristics to sub-agent budget guidance
- Skipped tests appear in final report so developer knows coverage gaps

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 4 total unresolved issues (including 2 from previous reviews).
As a rough heuristic: ~25 steps for a few targeted checks, ~40 for a full page with functional + adversarial + a11y, ~75 for multiple pages or a broad category. **Adjust based on what the assigned tests actually require** — these are starting points, not rules.

As a rough heuristic: ~25 steps for a few targeted checks, ~40 for a full page with functional + adversarial + a11y, ~75 for multiple pages or a broad category. **Adjust based on what the assigned tests actually require** — these are starting points, not rules.
Duplicated paragraph in step budget guidance
Low Severity
The "As a rough heuristic: ~25 steps for a few targeted checks…" paragraph is duplicated verbatim on consecutive lines. This appears to be a copy-paste artifact. Since the SKILL.md is read by the agent at runtime to determine behavior, redundant instructions add noise and could confuse the agent's interpretation of the budget guidance.
|-------------|-------|
| `{{TITLE}}` | Report title (e.g., "UI Test: PR #1234 — OAuth Settings") |
| `{{META}}` | One-line context: date, app URL, user, branch |
| `{{TOTAL_TESTS}}` | Total STEP_PASS + STEP_FAIL count |
HTML report definition excludes skipped tests from totals
Medium Severity
{{TOTAL_TESTS}} is defined as "Total STEP_PASS + STEP_FAIL count," explicitly excluding STEP_SKIP. But the text report format on line 81 counts total as pass + fail + skipped (20 = 14 + 4 + 2) and computes pass rate accordingly (70%). The HTML template also has no placeholder or stat card for skipped tests. This means the HTML report will show different totals and an inflated pass rate compared to the text report for the same test run.
Additional Locations (2)
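A tally consistent with the text report's arithmetic would count skips into the total — sketched here with a hypothetical helper name:

```javascript
// Sketch of a marker tally matching the text report: skipped tests count
// toward the total, so the pass rate is not inflated by excluding them.
function tallyMarkers(lines) {
  const counts = { pass: 0, fail: 0, skip: 0 };
  for (const line of lines) {
    if (line.startsWith('STEP_PASS')) counts.pass++;
    else if (line.startsWith('STEP_FAIL')) counts.fail++;
    else if (line.startsWith('STEP_SKIP')) counts.skip++;
  }
  const total = counts.pass + counts.fail + counts.skip;
  const passRate = total ? Math.round((counts.pass / total) * 100) : 0;
  return { ...counts, total, passRate };
}
```

With the example from the text report (14 passed, 4 failed, 2 skipped), this yields a total of 20 and a 70% pass rate, matching the figures the review cites.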
The report h1 now uses {{TITLE_HTML}} which supports wrapping
PR references in <a> tags. Link styled in brand red with
underline on hover.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 3 previously only scanned ports. Now it also instructs the agent to:

- Check out the correct branch when testing a PR
- Install dependencies after branch switch
- Verify the server actually renders content (not just HTTP 200 with a build error)

Learned from a real failure: tested against a dev server running the wrong branch with a broken build, wasting the entire step budget on login attempts before discovering the app wasn't rendering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
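The render check can be sketched as a pure predicate over the fetched status and body. The marker and error patterns below are assumptions for illustration; the skill states the requirement in prose:

```javascript
// Sketch: an HTTP 200 alone is not proof the app rendered. Require some real
// markup and reject obvious build-error pages. Patterns are illustrative.
function looksRendered(status, body) {
  if (status !== 200) return false;
  const hasMarkup = /<(main|nav|h1|form|button)\b/i.test(body);
  const looksBroken = /(build error|failed to compile|cannot find module)/i.test(body);
  return hasMarkup && !looksBroken;
}
```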


Summary
Adds a `ui-test` skill that uses the `browse` CLI to run AI-powered adversarial UI tests in a real browser. The agent analyzes git diffs to test only what changed (or explores the full app), checking functional correctness, accessibility, responsive layout, and UX heuristics.

Key features:

- `git diff` → targeted tests for changed pages/components
- `STEP_PASS|id|evidence` / `STEP_FAIL|id|expected → actual` with screenshot evidence

Files
Test plan
🤖 Generated with Claude Code
Note
Low Risk
Low risk since this PR primarily adds new markdown/HTML assets for a standalone skill and does not modify existing application/runtime code paths.
Overview
Introduces a new `skills/ui-test` package defining a `ui-test` skill for agent-driven UI QA using the `browse` CLI, including diff-driven, exploratory, and parallel testing workflows with structured `STEP_PASS`/`STEP_FAIL` assertions and screenshot evidence.

Adds extensive supporting materials: worked examples (`EXAMPLES.md`), deterministic check recipes (axe-core, console errors, broken images, responsive/touch targets), adversarial test patterns, UX/design-consistency heuristics, and a reusable `references/report-template.html` for generating a standalone HTML test report.

Written by Cursor Bugbot for commit 9ffe8e4. This will update automatically on new commits.