
feat(autobrowse): deterministic Playwright export + iterative co-evolution with the explorer #108

Open: aq17 wants to merge 2 commits into main from aq/autobrowse-iterative-playwright-loop


aq17 (Contributor) commented May 14, 2026

Headline

autobrowse can now emit a runnable, deterministic Playwright script from any passing trace, and iterate the explorer + emitter together until both halves converge on the same workflow.

Before this PR, autobrowse produced traces + strategy.md — durable artifacts, but the only way to re-run the task was to pay LLM inference per step. There was no path from a graduated task to a no-LLM-loop runnable script. This PR adds that path.

What's new

1. End-to-end Playwright export pipeline (entirely new)

The full mining → resolve → codegen → verify pipeline, none of which existed in autobrowse before:

  • scripts/export.mjs: CLI (--task, --target playwright, --workspace, --run, --no-verify)
  • scripts/lib/pick-run.mjs: newest-passing-run selection from traces/<task>/run-NNN/
  • scripts/lib/parse-task.mjs: task.md Output block → Zod schema for the emitted script
  • scripts/lib/command-mapping.mjs: browse trace → target-agnostic op stream
  • scripts/lib/selector-resolver.mjs: snapshot + session-scoped ARIA ref → ranked Playwright locator candidates (getByRole(name) → getByLabel → getByPlaceholder → getByText)
  • scripts/lib/codegen-playwright.mjs: ops → runnable TypeScript with helper functions baked in (see section 3 below)
  • scripts/lib/verify.mjs: npm install + npx tsx + JSON output parse
  • scripts/lib/distill-failure.mjs: Claude Haiku summary of Playwright failures into strategy.md

The emitted script connects to a fresh Browserbase session bound to BROWSERBASE_CONTEXT_ID (when set), so persistent-context auth survives between explorer training and Playwright replay.

2. scripts/loop.mjs — iterative co-evolution

Until now autobrowse converged on "the LLM can finish the task," then export would have been a one-shot translation at the end. Those are different objective functions: what unblocks the LLM agent doesn't always unblock a deterministic replay.

The loop bridges them:

For each iteration (max --max-iterations):
  1. evaluate.mjs                   → trace.json + summary.md
  2. If trace passed:
       export.mjs --target playwright --no-verify  → emits script
       npx tsx <task>.ts                            → deterministic replay
       If replay passed → record pass
       Else → distill failure into strategy.md
  3. Next iteration's evaluate reads the updated strategy.md and adapts
  4. Graduate when Playwright passes in 2 of the last 3 iterations
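
For illustration, the graduation rule in step 4 can be sketched as a sliding-window check. This is a minimal sketch, not the code in loop.mjs: the name `shouldGraduate`, the boolean-array input, and the choice to count iterations with no Playwright replay as failures are all assumptions.

```javascript
// Hypothetical helper: decide whether the loop can graduate.
// `replayResults` holds one boolean per iteration, oldest first:
// true if the Playwright replay passed, false otherwise.
// Assumption: an iteration whose replay was skipped is recorded as false.
function shouldGraduate(replayResults, { window = 3, required = 2 } = {}) {
  // Only the most recent `window` iterations count.
  const recent = replayResults.slice(-window);
  const passes = recent.filter((passed) => passed === true).length;
  return passes >= required;
}

console.log(shouldGraduate([false, true, true]));
console.log(shouldGraduate([true, true, false, false]));
```

Under these assumptions, two passing iterations in a row are already enough to graduate, even if the loop has run only twice.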

strategy.md becomes a shared intelligence layer between the LLM explorer (next iteration) and the codegen. Three sections (documented in SKILL.md):

  • Navigation Heuristics — LLM-facing prose
  • Codegen Hints — per-task overrides for the emitter
  • Recent Playwright Failures — auto-appended by the distiller
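
As a rough sketch of the auto-append step, the function below adds one bullet to the "Recent Playwright Failures" section of a strategy.md string. The function name `appendFailureNote`, the `## ` heading level, and the bullet format are assumptions made for illustration; the real distill-failure.mjs may structure the file differently.

```javascript
// Assumed heading text and level; the actual strategy.md format may differ.
const FAILURES_HEADING = "## Recent Playwright Failures";

// Hypothetical helper: return a new strategy.md body with `note` appended
// as a bullet at the end of the failures section.
function appendFailureNote(strategyMd, note) {
  const entry = `- ${note.trim()}`;
  const lines = strategyMd.split("\n");
  const start = lines.indexOf(FAILURES_HEADING);

  // Section missing: create it at the end of the file.
  if (start === -1) {
    return `${strategyMd.trimEnd()}\n\n${FAILURES_HEADING}\n\n${entry}\n`;
  }

  // The section ends at the next "## " heading or at end of file.
  let end = lines.length;
  for (let i = start + 1; i < lines.length; i++) {
    if (lines[i].startsWith("## ")) {
      end = i;
      break;
    }
  }

  // Insert before any trailing blank lines so bullets stay contiguous.
  let insertAt = end;
  while (insertAt > start + 1 && lines[insertAt - 1].trim() === "") insertAt--;
  lines.splice(insertAt, 0, entry);
  return lines.join("\n");
}
```

Keeping the function pure (string in, string out) leaves file I/O to the caller and makes the section handling easy to test.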

3. Codegen defaults that absorb the common state-portal pitfalls

Demoing the export pipeline end-to-end on bizfile.sos.ca.gov surfaced ~7 distinct classes of mismatch between what unblocks the LLM agent and what unblocks a deterministic replay. Each is now baked in as an auto-emitted helper or behavior — so the next task we point this at starts from a much smaller residual.

| Helper / Behavior | Replaces / fixes |
| --- | --- |
| forceCheck | page.locator('input[type=checkbox]').fill('true') (Playwright rejects) and overlay-intercepted .check() |
| forceClickRadio | Radio clicks blocked by styled-label overlays; applied automatically when selector matches [type=radio] OR when resolved snapshot node role is radio |
| selectWithFallback | .selectOption() with a JS-enable + React-native-setter fallback for transiently-disabled <select> |
| reactFill | Inputs where keystroke handlers (autosuggest, autocomplete) drop chars; uses HTMLInputElement.prototype.value setter + synthetic input/change events |
| clickButtonByText | Wizard "Next Step" buttons across SPA page transitions; avoids getByRole race |
| clickLinkWithFallback | SPA link clicks intercepted by tour/onboarding overlays; reads resolved .href property and prefers page.goto for absolute hrefs |
| .first() default for ambiguous click_sel | button[type=button] matching 3 elements (Help / Save Draft / Next Step) → strict-mode violation |
| exact: true for form-input getByRole | "Limited Liability Company Name" matching "Confirm Limited Liability Company Name" |
| Snapshot role "select" → ARIA "combobox" | Resolver was emitting getByRole("select", ...) which is invalid in Playwright |
| select_ref op routing | browse select [0-2005] CA resolves the ref via snapshot instead of leaking as invalid CSS |
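
To make the `.first()` default concrete, here is a minimal sketch of how an emitter could choose between `.click()` and `.first().click()`. The classifier follows the three selector shapes named in the second commit message (#id, [id=...], [data-testid=...]); the regular expressions and the `emitClick` wrapper are assumptions, not the code in codegen-playwright.mjs.

```javascript
// Sketch of a uniqueness classifier: only bare #id, [id=...], and
// [data-testid=...] selectors are treated as unique. The regexes are
// an assumption about how those shapes would be recognized.
function isUniqueSelector(selector) {
  return (
    /^#[\w-]+$/.test(selector) ||
    /^\[(id|data-testid)=[^\]]+\]$/.test(selector)
  );
}

// Hypothetical emitter: returns one line of Playwright source as a string.
function emitClick(selector) {
  const locator = `page.locator(${JSON.stringify(selector)})`;
  return isUniqueSelector(selector)
    ? `await ${locator}.click();`
    : `await ${locator}.first().click();`; // avoid strict-mode violations
}

console.log(emitClick("#submit"));
console.log(emitClick("button[type=button]"));
```

The trade-off is that `.first()` silences the strict-mode error rather than disambiguating the target, so it can click the wrong element when the intended one is not first in DOM order.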

4. scripts/evaluate.mjs — additive patches

  • Reads BROWSERBASE_CONTEXT_ID env var; if set with --env remote, pre-creates one BB session bound to that context, transparently injects --connect <session-id> into every browse command from the agent, and releases the session at exit. Lets persistent-context auth flow through every iteration without per-run login flailing.
  • --max-turns N CLI flag (previously hard-coded to 30). loop.mjs plumbs this through.
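
The `--connect` injection can be pictured as a small rewrite applied to each command string before it runs. This is only a sketch: the function name `injectConnect`, the decision to append the flag at the end of the command, and the session id shown are assumptions, and the real evaluate.mjs may place the flag elsewhere.

```javascript
// Hypothetical rewrite step: add `--connect <session-id>` to browse commands.
// Assumption: the flag can be appended at the end of the command line.
function injectConnect(command, sessionId) {
  const isBrowse = /^\s*browse(\s|$)/.test(command);
  const alreadyConnected = /\s--connect(\s|=|$)/.test(command);
  // Leave non-browse commands and already-connected commands untouched.
  if (!isBrowse || alreadyConnected) return command;
  return `${command.trimEnd()} --connect ${sessionId}`;
}

console.log(injectConnect("browse select [0-2005] CA", "sess_123"));
```

Doing the rewrite in one place means the agent never needs to know the session id.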

5. SKILL.md

New "Export to deterministic Playwright" and "Iterative Playwright loop" sections covering when to use loop.mjs vs evaluate.mjs, the sectioned strategy.md format, the codegen helper defaults, and pre-authed sessions via persistent context.

Validation (May 13–14, bizfile.sos.ca.gov LLC formation)

Phase 1 (May 13, customer_demos PR #33): ran the export pipeline by hand against run-004. The emitted script needed 15 hand-edits + an extract patch before it would replay cleanly. Those hand-edits became the source list for the codegen defaults above.

Phase 2 (May 14, this PR): ran the full loop.mjs from scratch.

| Run | Stage | Result |
| --- | --- | --- |
| Loop iter 1 | evaluate | ❌ max_turns at Step 7 (eval-flakiness on Confirm name field cost ~15 turns) |
| Loop iter 1 | Playwright | (skipped: no passing trace) |
| Loop iter 2 | evaluate | ✅ reached Review (Step 9 of 11, run-008) |
| Loop iter 2 | Playwright export | 88 ops, 18 cached, 25 ref_resolved, 8 ref_failed, LLM extract generated |
| Loop iter 2 | Playwright replay | ❌ failed on the issues since fixed below |
| Loop iter 2 | distill-failure | ✅ wrote LLM-summarized addendum to strategy.md |
| Post-loop regen (after this PR's codegen fixes) | Wizard navigation | ✅ all 9 steps, zero hand-edits |
| Post-loop regen | LLM-extract block | ❌ still brittle (tracked as follow-up #1) |

Net result: the wizard-navigation half went from 15 hand-edits → 0. The LLM-extract block is the remaining gap.

Known limitations / follow-ups

  1. LLM-generated extract block remains brittle. The Haiku-generated result-shaping code at the end of every emitted script uses structural locators (page.locator('text="X"').evaluate(...)) that often match multiple elements. The wizard navigation succeeds end-to-end, then the extract throws and success: false is returned. Right fix: harden the extract prompt to insist on per-field try/catch + prefer getByLabel({..., exact: true}). ~30 LOC follow-up.

  2. No feedback when evaluate itself maxes out. The loop currently only distills Playwright failures into strategy.md. When evaluate hits max_turns, there's no addendum and the next iteration repeats whatever caused the flailing. Right fix: a second distillation pathway that reads evaluate's decision log when status is max_turns, identifies the longest-spent step, and writes a Codegen Hint.

  3. strategy.md's "Codegen Hints" section is human-readable only. The codegen doesn't yet parse it for per-task overrides at export time. The new helpers are baked in as defaults that fire on selector/role heuristics. Right fix: structured Codegen Hints DSL the emitter consumes.

  4. Validated on n=1 task. All evidence so far comes from bizfile. State-portal patterns we haven't exercised: date pickers, file uploads, multi-tab flows, iframed forms, captchas mid-flow, Symantec VIP / SAML auth, steppers without a "Next" button. Each may surface a new codegen default. Recommend running this against 1–2 more diverse portals (CA EDD + a DMV-style stepper) before any "generalizes to all 50 agencies" claim.

Try it

cd <your-workspace>
export BROWSERBASE_CONTEXT_ID=<id-of-an-authed-context>
node ~/Desktop/skills/skills/autobrowse/scripts/loop.mjs \
  --task <task-name> \
  --env remote \
  --max-iterations 5 \
  --max-turns-per-iter 100

The loop graduates when Playwright passes in 2 of the last 3 iterations and writes a report to <workspace>/reports/loop-<task>-<timestamp>.md. Sister PR with the bizfile demo workspace + the emitted-then-hand-fixed script: browserbase/customer_demos#33.

🤖 Generated with Claude Code

…c verify converge together

Until now the explorer (evaluate.mjs) and the Playwright emitter (export.mjs)
were two disconnected stages: explorer converged on "the LLM can finish the
task," then export was a one-shot translation. The two objective functions
diverge — what unblocks the LLM agent doesn't always unblock a deterministic
replay. Demoing this against bizfile.sos.ca.gov surfaced 7+ classes of
mismatch (styled-label overlays, autocomplete keystroke interception,
transiently-disabled selects) that each cost a hand-fix in the emitted
script.

This PR unifies the loop:

  Each iteration of `scripts/loop.mjs`:
    1. evaluate.mjs  → produces trace.json + summary.md
    2. If trace passed, export.mjs --no-verify → emits Playwright script
    3. npx tsx <task>.ts → actual deterministic replay
    4. On Playwright fail, distill-failure.mjs summarizes the error via
       Claude Haiku into strategy.md's "Recent Playwright Failures" section
    5. Next iteration's evaluate reads the updated strategy.md and adapts
  Convergence: Playwright passes 2 of last 3 iterations → graduate.

`strategy.md` is the shared intelligence layer between the LLM explorer and
the codegen. Three sections (documented in SKILL.md):
  - Navigation Heuristics  (LLM-facing)
  - Codegen Hints         (emitter-facing, per-task overrides)
  - Recent Playwright Failures  (auto-appended by distill-failure)

Also lifts the lessons from the bizfile demo into codegen defaults so future
tasks don't repeat the same hand-fixes:

  - forceCheck       : .check({ force: true }) for checkbox fill_sel ops
  - forceClickRadio  : .first().click({ force: true }) for radio click ops
                       (detected by selector pattern OR resolved node role)
  - selectWithFallback: .selectOption() with a JS-enable + native-setter
                       fallback when the <select> is transiently disabled
  - reactFill        : helper for inputs where simulated keystrokes get
                       intercepted by autosuggest/autocomplete handlers
  - clickButtonByText: eval-find-by-text in page context, avoids the
                       cross-step getByRole race on SPA wizards

Plus: select_dropdown ops with ref-shaped selectors (e.g. `[0-2005]`) now
route through the snapshot resolver instead of leaking as invalid CSS.

Files in this PR:
  scripts/loop.mjs                  NEW — top-level orchestrator
  scripts/export.mjs                NEW — trace → Playwright codegen
  scripts/lib/pick-run.mjs          NEW — newest-passing-run selector
  scripts/lib/parse-task.mjs        NEW — task.md → Zod schema
  scripts/lib/command-mapping.mjs   NEW — browse trace → target-agnostic ops
  scripts/lib/selector-resolver.mjs NEW — snapshot+ref → Playwright locators
  scripts/lib/codegen-playwright.mjs NEW — ops → TS with helpers baked in
  scripts/lib/verify.mjs            NEW — npm install + tsx run + JSON parse
  scripts/lib/distill-failure.mjs   NEW — Playwright stderr → strategy.md addendum
  scripts/evaluate.mjs              MODIFIED — BROWSERBASE_CONTEXT_ID
                                    passthrough + --max-turns flag
  SKILL.md                          MODIFIED — documents export, loop,
                                    sectioned strategy.md, and the
                                    helper defaults baked into codegen

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
const result = spawnSync("node", args, {
stdio: ["ignore", "inherit", "inherit"],
env: process.env,
});

runExport stdout inheritance pollutes loop's JSON output

Medium Severity

runExport uses stdio: ["ignore", "inherit", "inherit"], which inherits stdout from the child export.mjs process. Since export.mjs with --no-verify writes a JSON report to stdout via console.log, each iteration's export output leaks onto loop.mjs's stdout. Then loop.mjs writes its own final structured JSON to stdout at the end. The combined stdout contains multiple JSON objects, breaking any consumer expecting a single parseable JSON result. Compare with runEvaluate, which correctly uses "pipe" for stdout to capture it.


Reviewed by Cursor Bugbot for commit 8626af8. Configure here.

}, text);
await page.waitForLoadState("load");
await page.waitForTimeout(waitAfterMs);
}

reactFill and clickButtonByText are never emitted by codegen

Low Severity

The reactFill and clickButtonByText helper functions are baked into every generated Playwright script via wrapScript, but emitOp never generates calls to either of them. Only forceCheck, forceClickRadio, and selectWithFallback are actually dispatched. These two functions are dead code in every emitted script, adding ~30 lines of unused TypeScript to each generated artifact.


Reviewed by Cursor Bugbot for commit 8626af8. Configure here.

…ed by loop validation

Loop validation today on bizfile (run-008 mined as the passing trace) reduced
the post-codegen hand-edits from yesterday's 15 down to 4 + 1 LLM-extract patch.
Each of the 4 navigation-level issues is now baked in as a codegen default, so
the next task we point loop.mjs at should start from a much smaller residual.

Fixes landed:

  1. clickLinkWithFallback helper (codegen-playwright.mjs)
     - For click_ref ops where the resolved node role is "link", emit
       clickLinkWithFallback(page, <locator>) instead of plain .click().
     - Helper reads the resolved .href property (not getAttribute, which
       returns relative URLs). If the link exposes an absolute http(s) href,
       prefer page.goto over .click — bypasses SPA tour overlays and
       onClick preventDefault gates that block deterministic replay.
     - Waits for networkidle after navigation (load fires too early on SPAs).

  2. .first() default for ambiguous click_sel selectors
     - Added isUniqueSelector() classifier: #id, [id=...], [data-testid=...].
     - For unique selectors, emit .click() as before. For ambiguous ones
       (e.g. `button[type=button]`), emit .first().click() to avoid
       Playwright strict-mode violations.

  3. exact: true for form-input getByRole emissions (selector-resolver.mjs)
     - Added EXACT_NAME_ROLES set: textbox, searchbox, combobox, spinbutton,
       listbox. nodeToLocators emits { name, exact: true } for these.
     - Prevents "Limited Liability Company Name" from matching
       "Confirm Limited Liability Company Name" (real bug from yesterday).

  4. snapshot role "select" → ARIA role "combobox" (selector-resolver.mjs)
     - Added SNAPSHOT_TO_ARIA_ROLE map and normalize at top of nodeToLocators.
     - Browse-snapshot reports <select> with role "select" but Playwright's
       ARIA role is "combobox". Without this mapping, the emitter produced
       getByRole("select", ...) which is invalid.
     - Also boost getByLabel above getByRole for select-likes (combobox/listbox)
       since label-based locators tend to be more reliable for form selects.
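
A compact sketch of fixes 3 and 4 together, for readers who want to see the ranking in one place. The two constant names come from this commit message; the node shape ({ role, name }), the candidate strings, and the getByText fallback for non-form roles are assumptions rather than the actual selector-resolver.mjs output.

```javascript
// Snapshot roles that differ from the ARIA role Playwright expects.
const SNAPSHOT_TO_ARIA_ROLE = new Map([["select", "combobox"]]);

// Form-input roles whose accessible name is matched exactly.
const EXACT_NAME_ROLES = new Set([
  "textbox", "searchbox", "combobox", "spinbutton", "listbox",
]);
const SELECT_LIKE_ROLES = new Set(["combobox", "listbox"]);

// Sketch: return ranked locator candidates (as source strings) for a node.
// The node shape { role, name } is assumed for illustration.
function nodeToLocators(node) {
  const role = SNAPSHOT_TO_ARIA_ROLE.get(node.role) ?? node.role;
  const name = JSON.stringify(node.name);
  const isFormInput = EXACT_NAME_ROLES.has(role);
  const options = isFormInput
    ? `{ name: ${name}, exact: true }`
    : `{ name: ${name} }`;
  const byRole = `page.getByRole(${JSON.stringify(role)}, ${options})`;

  if (!isFormInput) return [byRole, `page.getByText(${name})`];
  const byLabel = `page.getByLabel(${name})`;
  // Select-likes prefer the label-based locator; other inputs prefer role.
  return SELECT_LIKE_ROLES.has(role) ? [byLabel, byRole] : [byRole, byLabel];
}

console.log(nodeToLocators({ role: "select", name: "State" }));
```

Note that the getByLabel candidate in this sketch does not pass exact: true, so it would still match by substring; whether the real resolver does is not stated here.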

Validation:

  Re-exported bizfile-ca-llc from run-008 with these defaults. The emitted
  script navigates ALL 9 wizard steps without hand-edits (vs. yesterday's
  hand-fixed playwright-baseline/ which required 7 categories of patches).
  Only failure is in the LLM-generated extract block at the end (brittle
  structural locators in result-shaping) — separate concern, tracked as a
  follow-up. The architectural goal (loop + codegen produces a navigating
  Playwright script) is met.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aq17 (Contributor, Author) commented May 14, 2026

Update (May 14): validated end-to-end on bizfile from scratch + landed 4 codegen fixes for the residual hand-edits.

  • Loop converges through evaluate to a passing trace in 2 iterations (run-008 graduated)
  • Post-this-PR, the auto-emitted Playwright script navigates all 9 wizard steps with zero hand-edits
  • Only remaining failure is in the LLM-generated extract block at the end (brittle structural locators), tracked as follow-up #1 in the updated PR description

Commit c918d2d adds:

  • clickLinkWithFallback helper for SPA links that don't navigate via .click() (the bizfile dashboard tour-overlay case)
  • .first() default for ambiguous click_sel selectors (the button[type=button] strict-mode case)
  • exact: true on form-input getByRole emissions (the "Limited Liability Company Name" matching "Confirm..." case)
  • Snapshot role "select" → ARIA role "combobox" mapping in the resolver (the getByRole("select") invalid case)

Ready for review. Two named follow-ups in the PR description for when we pick this up after Friday's walkthrough.

aq17 changed the title from "feat(autobrowse): iterative Playwright loop + emitter co-evolved with explorer" to "feat(autobrowse): deterministic Playwright export + iterative co-evolution with the explorer" on May 14, 2026

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.

There are 5 total unresolved issues (including 2 from previous reviews).


Reviewed by Cursor Bugbot for commit c918d2d. Configure here.

let parsed = null;
try {
const lastBrace = stdout.lastIndexOf("{");
if (lastBrace >= 0) parsed = JSON.parse(stdout.slice(lastBrace));

JSON parsing fails for nested pretty-printed output

High Severity

stdout.lastIndexOf("{") finds the last opening brace in the output, but the generated Playwright script emits JSON.stringify(result, null, 2) (pretty-printed). For any Output schema with nested objects, lastIndexOf locates an inner { rather than the outermost one, causing JSON.parse to throw. Since parsed stays null, passed is always false even when the script succeeds — the loop can never graduate for tasks with non-flat output schemas.
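
One possible fix, sketched here under the assumption that the final JSON object is the last thing written to stdout: scan opening braces from the start and return the first suffix that parses. The function name `extractLastJsonObject` is hypothetical and is not the `extractFinalJson` helper in this PR.

```javascript
// Sketch: find the outermost JSON object that ends the output.
// Walking forward means an inner "{" of a pretty-printed object is never
// chosen first, because the outer "{" comes earlier and parses cleanly.
function extractLastJsonObject(stdout) {
  const text = stdout.trimEnd();
  for (let i = text.indexOf("{"); i !== -1; i = text.indexOf("{", i + 1)) {
    try {
      return JSON.parse(text.slice(i));
    } catch {
      // Not the start of the final object; try the next brace.
    }
  }
  return null; // no parseable trailing object
}

const nested = JSON.stringify({ success: true, data: { name: "Acme" } }, null, 2);
const output = `navigating {step 1}\n${nested}\n`;
console.log(extractLastJsonObject(output));
```

Each failed attempt re-parses a suffix, so the cost grows with the number of braces in the log; for script output of this size that is unlikely to matter. Log lines printed after the JSON would still defeat it.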


Reviewed by Cursor Bugbot for commit c918d2d. Configure here.

import { spawnSync } from "node:child_process";
import { fileURLToPath } from "node:url";
import { distillFailure, appendToStrategy } from "./lib/distill-failure.mjs";
import { pickRun } from "./lib/pick-run.mjs";

Unused pickRun import in loop orchestrator

Low Severity

pickRun is imported from ./lib/pick-run.mjs but never called anywhere in loop.mjs. Only extractFinalJson and readSummary (imported on the next line from the same module) are actually used.


Reviewed by Cursor Bugbot for commit c918d2d. Configure here.

// comments. Good enough to catch LLM truncation, not a parser.
function checkBalance(code) {
let depth = { "{": 0, "[": 0, "(": 0 };
const open = { "{": "}", "[": "]", "(": ")" };

Unused open variable in bracket balance checker

Low Severity

The open mapping in checkBalance is declared but never referenced. Only the depth object is used for tracking bracket counts.
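
If the mapping is kept rather than deleted, a stack-based variant would put it to use and also catch mismatched pairs, which plain counters miss. The following is an illustrative sketch, not the PR's implementation: it skips quoted strings but does not handle comments or `${...}` inside template literals.

```javascript
// Sketch of a stack-based balance check that uses an open -> close mapping.
// Like the original, this is a truncation heuristic, not a parser.
function checkBalance(code) {
  const closeFor = { "{": "}", "[": "]", "(": ")" };
  const closers = new Set(Object.values(closeFor));
  const stack = [];
  let quote = null; // current string delimiter, if inside a string

  for (let i = 0; i < code.length; i++) {
    const ch = code[i];
    if (quote) {
      if (ch === "\\") i++; // skip the escaped character
      else if (ch === quote) quote = null;
      continue;
    }
    if (ch === '"' || ch === "'" || ch === "`") {
      quote = ch;
    } else if (closeFor[ch]) {
      stack.push(closeFor[ch]); // remember which closer we expect
    } else if (closers.has(ch) && stack.pop() !== ch) {
      return false; // wrong or unexpected closer
    }
  }
  return stack.length === 0 && quote === null;
}

console.log(checkBalance("const a = { b: [1, 2] };"));
console.log(checkBalance("function f() { return [1, 2;"));
```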


Reviewed by Cursor Bugbot for commit c918d2d. Configure here.

aq17 requested review from rcbrowder, shubh24 and ziruihao May 14, 2026 23:46
shubh24 (Contributor) commented May 27, 2026

🏗️ Architecture feedback — toward goal-driven codegen

Nice work on this PR — the pipeline is clean and the bizfile validation is solid. But I want to flag a longer-term architecture concern before we harden around the current design.

The core tension

The current pipeline is a trace compiler: mine the LLM's trace → resolve ARIA refs to Playwright locators → emit a deterministic script. This works, but it means the trace is the source of truth for the generated script — and that's where things get complicated.

The LLM's trace includes a lot of incidental decisions. It clicked a button because it saw it first. It used a CSS selector because the snapshot was long. It took an extra step because it got confused. The export pipeline faithfully converts all of this into Playwright — the noise alongside the signal. A human writing Playwright wouldn't replay the journey; they'd look at the goal (fill step 3 of this form) and write the simplest path to it.

This has three downstream consequences:

  1. The codegen helpers are a hand-curated catalog of workarounds. `forceCheck`, `selectWithFallback`, `reactFill`, `clickLinkWithFallback` — each one exists because you ran bizfile, hit a specific failure class, and wrote a helper. The PR body acknowledges this: "each [new site] may surface a new codegen default." That pattern won't scale to 50 state agencies.

  2. The convergence criterion is mechanical, not intelligent. "Playwright passes in 2 of the last 3 iterations" doesn't understand why something passed. A test that passed because a longer timeout absorbed a race condition isn't the same as one that passed because the selectors are robust. An agent could make this judgment; a counter can't.

  3. The feedback loop is indirect. When Playwright fails, the failure gets distilled into strategy.md, the LLM explorer adapts on the next iteration, and the trace gets re-exported. But the codegen doesn't read strategy.md's "Codegen Hints" section (acknowledged in the PR). So the explorer is learning, but the compiler isn't — you're optimizing the input to the compiler rather than the compiler itself.

Proposed architecture: hybrid skeleton + agent codegen

Split the problem into what machines do well (structure) and what agents do well (judgment):

Phase 1 — Mechanical skeleton extraction (keep most of what you have)

Mine the trace into a workflow skeleton, not a Playwright script:

  • Page-level navigation sequence (goto URL A → fill form → click Next → goto URL B → ...)
  • Per-page: which fields need filling, which buttons need clicking, what values to use
  • Don't resolve selectors. Just record the intent: "fill the Company Name field with 'Acme Corp'"

The command-mapping and trace-walking code from this PR is great infrastructure for this. The change is: stop at the intent layer, don't go all the way to Playwright locators.
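
To make the intent layer concrete, here is a rough sketch of turning an op stream into a per-page skeleton with no selectors in it. The op shape ({ kind, url, label, value }) and the function name `toSkeleton` are invented for illustration; the real op stream from command-mapping.mjs will look different.

```javascript
// Sketch: group ops by page and keep only human-readable intents.
// Assumed op shapes:
//   { kind: "goto", url }
//   { kind: "fill", label, value }
//   { kind: "click", label }
function toSkeleton(ops) {
  const pages = [];
  let current = null;

  for (const op of ops) {
    if (op.kind === "goto") {
      current = { url: op.url, intents: [] };
      pages.push(current);
      continue;
    }
    // Ops seen before any navigation go on a page with an unknown URL.
    if (!current) {
      current = { url: null, intents: [] };
      pages.push(current);
    }
    if (op.kind === "fill") {
      current.intents.push(`fill the ${op.label} field with '${op.value}'`);
    } else if (op.kind === "click") {
      current.intents.push(`click ${op.label}`);
    }
  }
  return pages;
}

console.log(JSON.stringify(toSkeleton([
  { kind: "goto", url: "https://example.com/step-1" },
  { kind: "fill", label: "Company Name", value: "Acme Corp" },
  { kind: "click", label: "Next" },
]), null, 2));
```

The output deliberately carries no locators, so the agent in Phase 2 is free to pick its own against the live page.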

Phase 2 — Agent writes Playwright from the skeleton + a live session

Give Claude the skeleton + strategy.md + a live Browserbase session. Claude writes Playwright for each step:

  • It can see the live page, pick its own selectors using its judgment
  • It runs each step interactively, sees what works
  • When something fails, it fixes it in-place — no roundtrip through strategy.md
  • It decides when to use `force: true`, when to use `evaluate()` for React inputs, when to add waits — from the DOM context, not from baked-in helpers

This eliminates the ARIA ref resolution pipeline, the selector ranking heuristics, and the hand-curated helper catalog. Claude is the codegen. The domain knowledge ("state portals use React controlled forms with styled label overlays") lives in strategy.md as prose, and the agent decides when to apply it — rather than encoding it as named functions.

Phase 3 — Agent-driven verification

Instead of "2 of 3 passes," give Claude the script + the last N run results and ask: "Is this production-ready? What's still flaky?" The agent can identify that a timeout was 4900ms on a 5000ms limit (near-miss, not a real pass), or that a selector matched by coincidence. Graduation becomes a judgment call, not a counter.

What this means for this PR

Ship it as-is — the pipeline works, bizfile validates it, and the infrastructure (command-mapping, trace-walking, selectors.cache.json, distill-failure) is valuable regardless of architecture. But I'd treat the current export pipeline as a stepping stone, not the final architecture:

  • The trace → ops → skeleton extraction is durable infrastructure. Keep investing here.
  • The ops → Playwright codegen (selector resolution, helper functions, emitOp) is the part that should eventually be replaced by agent-driven codegen from the skeleton.
  • The loop + distillation machinery is good, but the convergence check should move toward agent judgment.

The end state: the trace trains strategy.md (the existing autobrowse loop), and the agent writes Playwright from the task description + strategy — not from the trace. The trace is training data, not source code. That's the architecture that scales to 50 agencies without a new helper per portal.
