
Add browsability skill — score how usable a site is for a browser agent #122

Open
shubh24 wants to merge 1 commit into main from shubh24/browsability-skill

shubh24 commented May 30, 2026

What

Adds the browsability skill: an operational rubric + tooling for scoring how well an AI browser agent can drive a website's UI.

It's the sibling of agent-experience (which audits docs/SDK onboarding DX). This one is about operability — perceiving and driving the live DOM — not discoverability (no SEO/AEO/llms.txt).

The opinion

Browsability = how little help an agent needs to succeed, and how much harder the site is for an agent than for a human. Step-count is scored as the delta over a human baseline, so a genuinely long workflow isn't penalized — only the agent-specific tax (e.g. a custom dropdown that costs two steps where a native <select> costs one).
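
The delta framing can be sketched roughly as follows. The PR states only that Axis C is worth 20 points and is scored on agent steps over the human baseline; the linear penalty and the `pointsPerExtraStep` value below are assumptions for illustration, not the rubric's actual formula.

```typescript
// Hypothetical sketch of the delta framing for Axis C (agent tax, 20 pts).
// The linear penalty of 2 points per extra step is an assumed value.
const AGENT_TAX_MAX = 20;

function agentTaxScore(
  agentSteps: number,
  humanBaselineSteps: number,
  pointsPerExtraStep = 2,
): number {
  // Only steps beyond the human baseline count against the site.
  const extraSteps = Math.max(0, agentSteps - humanBaselineSteps);
  return Math.max(0, AGENT_TAX_MAX - extraSteps * pointsPerExtraStep);
}

// A long workflow the agent completes in as many steps as a human keeps full marks.
console.log(agentTaxScore(12, 12)); // 20
// One extra step (e.g. a custom dropdown) costs points even on a short task.
console.log(agentTaxScore(4, 3)); // 18
```

Scoring the difference rather than the absolute count means workflow length alone never lowers the score.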

Score (0–100)

| Axis | Pts | What |
| --- | --- | --- |
| A · Access Resistance | 30 | lowest assistance rung (stealth / proxy / captcha ladder) a task needs to complete |
| B1 · Reachability | 25 | % of controls that survive the accessibility-tree prune |
| B3 · Structural traps | 15 | cross-origin iframes, shadow DOM, DOM depth/size |
| C · Agent tax | 20 | agent steps over the human baseline |
| D · Recoverability | 10 | self-heal / site errors / blocking overlays / step ceiling |

Agent-native affordance (an API/deep-link path) is noted as a ceiling badge, not scored — this rubric measures UI operability.
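
As a rough illustration of how the five axes add up to 100, the weights from the table can be combined like this. Treating each axis as a fraction in [0, 1] and scaling it linearly by its weight is an assumption made for this sketch; the rubric in references/rubric.md defines how each axis is actually derived.

```typescript
// Axis weights taken from the score table; they sum to 100.
const WEIGHTS = { A: 30, B1: 25, B3: 15, C: 20, D: 10 } as const;
type Axis = keyof typeof WEIGHTS;

// Each axis is passed as a fraction in [0, 1]. Linear scaling by weight is an
// assumption of this sketch, not necessarily what score.ts does.
function compositeScore(fractions: Record<Axis, number>): number {
  let total = 0;
  for (const axis of Object.keys(WEIGHTS) as Axis[]) {
    const clamped = Math.min(1, Math.max(0, fractions[axis]));
    total += clamped * WEIGHTS[axis];
  }
  return Math.round(total);
}

// Example: 80% of controls survive the accessibility-tree prune, all else perfect.
console.log(compositeScore({ A: 1, B1: 0.8, B3: 1, C: 1, D: 1 })); // 95
```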

Contents

  • SKILL.md — the workflow (probe → agent ladder → composite score → report).
  • references/rubric.md — the full code-grounded rubric, the assistance ladder, the delta framing, and a remediation table.
  • scripts/friction.ts — deterministic Drivability probe (B1 + B3) from a single page load via the browse CLI; no model needed.
  • scripts/score.ts — composite 0–100 scorer combining the probe with agent-run results.

Status

  • The Drivability slice (B1 + B3, 40 pts) runs today, deterministically.
  • The agent ladder (A/C/D, 60 pts) is scaffolded; it needs a model-driven reference agent (use the browser skill as the driver) to climb the rungs and record tasks.json. Without it, the scorer reports a Drivability-only score and marks A/C/D pending.
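
For orientation, here is a minimal sketch of how recorded ladder runs could be reduced to the "lowest assistance rung a task needs to complete". The `Run` and `Task` shapes follow the types in score.ts; the helper name `lowestSuccessfulRung` and the example task are hypothetical.

```typescript
// Shapes follow the Run and Task types in score.ts (trimmed to the fields used here).
type Run = { rung: number; success: boolean; steps?: number };
type Task = { name: string; humanBaselineSteps?: number; runs?: Run[] };

// Returns the lowest rung at which the task succeeded, or null if it never
// succeeded or was never run.
function lowestSuccessfulRung(task: Task): number | null {
  const rungs = (task.runs ?? []).filter((r) => r.success).map((r) => r.rung);
  return rungs.length > 0 ? Math.min(...rungs) : null;
}

const example: Task = {
  name: "checkout", // hypothetical task name
  runs: [
    { rung: 0, success: false },
    { rung: 1, success: false },
    { rung: 2, success: true, steps: 9 },
  ],
};
console.log(lowestSuccessfulRung(example)); // 2
```

Returning `null` rather than a default rung keeps "never succeeded" distinguishable from "succeeded without assistance".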

Notes

  • Grounded in what the open-source Stagehand framework treats as hard; uses only public Browserbase session settings.
  • solveCaptchas defaults to on, so an honest rung-0 baseline explicitly disables it (documented in the skill).
  • Verified: drivability probe + scorer run end-to-end on real sites.
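
To make the rung-0 point concrete, the sketch below builds a settings object per rung. The field names (`browserSettings.solveCaptchas`, `browserSettings.advancedStealth`, `proxies`) reflect my understanding of Browserbase's public session-creation settings and should be checked against its API reference. The mapping of rungs to settings is an assumption; the PR states only that rung 0 must disable `solveCaptchas` explicitly.

```typescript
// Assumed shape, limited to the fields used in this sketch.
type SessionSettings = {
  browserSettings: { solveCaptchas: boolean; advancedStealth?: boolean };
  proxies?: boolean;
};

// Hypothetical rung-to-settings mapping; only the rung-0 behavior is stated in the PR.
function settingsForRung(rung: number): SessionSettings {
  switch (rung) {
    case 0: // vanilla: solveCaptchas defaults to on, so turn it off explicitly
      return { browserSettings: { solveCaptchas: false } };
    case 1: // default-assist: leave captcha solving enabled
      return { browserSettings: { solveCaptchas: true } };
    case 2: // proxy (assumed)
      return { browserSettings: { solveCaptchas: true }, proxies: true };
    case 3: // advanced stealth (assumed)
      return { browserSettings: { solveCaptchas: true, advancedStealth: true }, proxies: true };
    default:
      throw new RangeError(`no settings sketched for rung ${rung}`);
  }
}

console.log(JSON.stringify(settingsForRung(0)));
// {"browserSettings":{"solveCaptchas":false}}
```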

🤖 Generated with Claude Code


Note

Low Risk
Documentation and local scoring scripts only; no changes to auth, production services, or existing skill runtime behavior.

Overview
Adds a new browsability skill that scores how well an AI browser agent can operate a site’s UI (not SEO/docs DX), and lists it in the root README skills table.

The skill documents a 0–100 rubric (Access Resistance ladder, reachability, structural traps, agent tax vs human baseline, recoverability) in SKILL.md and references/rubric.md, with a workflow: deterministic Drivability probe → optional agent ladder in tasks.json → composite report.

New Bun scripts drive the flow: friction.ts loads a URL via the browse CLI, evaluates DOM signals, and writes friction.json; score.ts merges that with optional task runs into browsability.json (full score or Drivability-only when ladder data is missing). Output goes under browsability-out/ (gitignored); MIT LICENSE is included.
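
The kind of structural signal the probe gathers can be illustrated on a simplified tree. This is a sketch only: `NodeSketch` is a hypothetical stand-in for a DOM node, and the actual friction.ts evaluates its signals inside the loaded page, with fields and thresholds not shown in this summary.

```typescript
// Simplified, hypothetical stand-in for a DOM node.
type NodeSketch = {
  tag: string;
  hasShadowRoot?: boolean;
  crossOriginFrame?: boolean;
  children?: NodeSketch[];
};

type StructuralSignals = {
  nodeCount: number;
  maxDepth: number;
  shadowRoots: number;
  crossOriginIframes: number;
};

function collectSignals(root: NodeSketch): StructuralSignals {
  const signals: StructuralSignals = { nodeCount: 0, maxDepth: 0, shadowRoots: 0, crossOriginIframes: 0 };
  // Iterative traversal avoids call-stack overflow on very deep trees.
  const stack: Array<{ node: NodeSketch; depth: number }> = [{ node: root, depth: 1 }];
  while (stack.length > 0) {
    const { node, depth } = stack.pop()!;
    signals.nodeCount += 1;
    signals.maxDepth = Math.max(signals.maxDepth, depth);
    if (node.hasShadowRoot) signals.shadowRoots += 1;
    if (node.tag === "iframe" && node.crossOriginFrame) signals.crossOriginIframes += 1;
    for (const child of node.children ?? []) stack.push({ node: child, depth: depth + 1 });
  }
  return signals;
}

const page: NodeSketch = {
  tag: "body",
  children: [
    { tag: "div", hasShadowRoot: true, children: [{ tag: "button" }] },
    { tag: "iframe", crossOriginFrame: true },
  ],
};
console.log(collectSignals(page));
// { nodeCount: 4, maxDepth: 3, shadowRoots: 1, crossOriginIframes: 1 }
```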

Reviewed by Cursor Bugbot for commit 9c30e80. Bugbot is set up for automated code reviews on this repo. Configure here.

Adds the `browsability` skill — an operational rubric for how well an AI
*browser* agent can drive a website's UI (the sibling of agent-experience,
which covers docs/SDK onboarding DX).

Scores 0–100 across:
  A  Access Resistance — lowest assistance rung (stealth/proxy/captcha ladder)
                          a task needs to complete
  B1 Reachability      — % of controls that survive the accessibility-tree prune
  B3 Structural traps  — cross-origin iframes, shadow DOM, DOM depth/size
  C  Agent tax         — agent steps OVER the human baseline (delta, not absolute)
  D  Recoverability    — self-heal / site errors / blocking overlays / step ceiling

The Drivability slice (B1+B3) runs deterministically from one page load via
scripts/friction.ts (no model). The full score adds an agent run across the
assistance ladder via scripts/score.ts. Rubric grounded in what the open-source
Stagehand framework treats as hard; uses only public Browserbase session settings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

type Run = { rung: number; success: boolean; steps?: number; model?: string; note?: string };
type Task = { name: string; type?: string; humanBaselineSteps?: number; runs?: Run[] };

let a = 0, c = taxProxy, d = 10, tasks: Task[] = [], haveRuns = false;

Recoverability defaults to full marks when unassessed

Medium Severity

When no tasks.json is provided (drivability-only mode), d is initialized to 10 — giving full recoverability marks for an axis that was never assessed. Meanwhile a (Access Resistance) correctly defaults to 0 when unknown. This asymmetry inflates the reported score by 10 points. The SKILL.md documentation explicitly states the drivability-only score is "B1 + B3, 40 max" with A/C/D pending, but the code can produce up to 70. d needs to default to 0 to match the pessimistic stance used for a and the documented behavior.
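
One possible shape for a fix, sketched under assumptions: unassessed axes contribute nothing and are listed as pending, so the Drivability-only maximum stays at 40. This goes slightly further than the suggestion above by also treating C as pending, matching the documented "A/C/D pending" behavior. The function name and return shape are hypothetical, not taken from score.ts.

```typescript
type LadderScores = { a: number; c: number; d: number };

// Hypothetical summary helper: b1 and b3 come from the deterministic probe,
// ladder scores are present only when tasks.json was provided.
function summarize(b1: number, b3: number, ladder?: LadderScores) {
  // Unassessed axes add 0 instead of receiving default points.
  const a = ladder?.a ?? 0;
  const c = ladder?.c ?? 0;
  const d = ladder?.d ?? 0;
  return {
    total: Math.round(a + b1 + b3 + c + d),
    max: ladder ? 100 : 40, // Drivability-only mode is capped at B1 + B3
    pending: ladder ? [] : ["A", "C", "D"],
  };
}

console.log(summarize(25, 15));
// { total: 40, max: 40, pending: [ 'A', 'C', 'D' ] }
console.log(summarize(25, 15, { a: 30, c: 20, d: 10 }).total); // 100
```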

const total = Math.round(a + b1 + b3 + c + d);
const grade = (t: number) => t >= 90 ? "A" : t >= 80 ? "B+" : t >= 70 ? "B" : t >= 60 ? "C+" : t >= 50 ? "C" : t >= 35 ? "D" : "F";
const rungName = ["L0 vanilla", "L1 default-assist", "L2 proxy+fingerprint", "L3 advanced-stealth", "L4 verified"];

Unused rungName variable in scorer

Low Severity

rungName is defined but never referenced anywhere in score.ts or the rest of the codebase. It's dead code — likely intended for the report output but never wired in.
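
If the array was meant for the report, wiring it in could look something like the sketch below. The `describeRung` helper is hypothetical; the PR does not show the scorer's actual report format.

```typescript
const rungName = ["L0 vanilla", "L1 default-assist", "L2 proxy+fingerprint", "L3 advanced-stealth", "L4 verified"];

// Hypothetical report helper: turns a numeric rung (or null for "never
// completed") into a readable label.
function describeRung(rung: number | null): string {
  if (rung === null) return "not completed at any rung";
  return rungName[rung] ?? `unknown rung ${rung}`;
}

console.log(describeRung(2)); // L2 proxy+fingerprint
console.log(describeRung(null)); // not completed at any rung
```

The alternative is to delete the array until the report needs it.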
