feat(autobrowse): deterministic Playwright export + iterative co-evolution with the explorer #108
aq17 wants to merge 2 commits into
…c verify converge together
Until now the explorer (evaluate.mjs) and the Playwright emitter (export.mjs)
were two disconnected stages: explorer converged on "the LLM can finish the
task," then export was a one-shot translation. The two objective functions
diverge — what unblocks the LLM agent doesn't always unblock a deterministic
replay. Demoing this against bizfile.sos.ca.gov surfaced 7+ classes of
mismatch (styled-label overlays, autocomplete keystroke interception,
transiently-disabled selects) that each cost a hand-fix in the emitted
script.
This PR unifies the loop:
Each iteration of `scripts/loop.mjs`:
1. evaluate.mjs → produces trace.json + summary.md
2. If trace passed, export.mjs --no-verify → emits Playwright script
3. npx tsx <task>.ts → actual deterministic replay
4. On Playwright fail, distill-failure.mjs summarizes the error via
Claude Haiku into strategy.md's "Recent Playwright Failures" section
5. Next iteration's evaluate reads the updated strategy.md and adapts
Convergence: Playwright passes 2 of last 3 iterations → graduate.
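A minimal sketch of the graduation rule above, assuming the loop keeps a per-iteration pass/fail history. The name `shouldGraduate` and the array-of-booleans shape are illustrative and not taken from loop.mjs.

```javascript
// Hypothetical helper: `history` holds one boolean per loop iteration,
// true when the Playwright replay passed on that iteration.
function shouldGraduate(history, windowSize = 3, requiredPasses = 2) {
  // Only the most recent `windowSize` iterations count toward convergence.
  const recent = history.slice(-windowSize);
  const passes = recent.filter(Boolean).length;
  return passes >= requiredPasses;
}
```

Under this reading two passing iterations are enough to graduate even before a third has run; whether loop.mjs behaves that way is an assumption.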
`strategy.md` is the shared intelligence layer between the LLM explorer and
the codegen. Three sections (documented in SKILL.md):
- Navigation Heuristics (LLM-facing)
- Codegen Hints (emitter-facing, per-task overrides)
- Recent Playwright Failures (auto-appended by distill-failure)
Also lifts the lessons from the bizfile demo into codegen defaults so future
tasks don't repeat the same hand-fixes:
- forceCheck : .check({ force: true }) for checkbox fill_sel ops
- forceClickRadio : .first().click({ force: true }) for radio click ops
(detected by selector pattern OR resolved node role)
- selectWithFallback: .selectOption() with a JS-enable + native-setter
fallback when the <select> is transiently disabled
- reactFill : helper for inputs where simulated keystrokes get
intercepted by autosuggest/autocomplete handlers
- clickButtonByText: eval-find-by-text in page context, avoids the
cross-step getByRole race on SPA wizards
Plus: select_dropdown ops with ref-shaped selectors (e.g. `[0-2005]`) now
route through the snapshot resolver instead of leaking as invalid CSS.
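To illustrate the ref-shaped routing, here is a rough sketch that separates snapshot refs from real CSS. It assumes refs always look like `[<digits>-<digits>]` (as in `[0-2005]`); the regex and the name `classifySelector` are hypothetical, not the PR's implementation.

```javascript
// Assumption: a snapshot ref is exactly "[<digits>-<digits>]", e.g. "[0-2005]".
const REF_SHAPED = /^\[(\d+)-(\d+)\]$/;

// Decide whether a selector should go to the snapshot resolver or be
// passed through to the emitter as ordinary CSS.
function classifySelector(selector) {
  const match = REF_SHAPED.exec(selector.trim());
  if (match) {
    return { kind: "ref", ref: `${match[1]}-${match[2]}` };
  }
  return { kind: "css", selector };
}
```

An attribute selector such as `[type=radio]` does not match the pattern, so it still flows through as CSS.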
Files in this PR:
scripts/loop.mjs NEW — top-level orchestrator
scripts/export.mjs NEW — trace → Playwright codegen
scripts/lib/pick-run.mjs NEW — newest-passing-run selector
scripts/lib/parse-task.mjs NEW — task.md → Zod schema
scripts/lib/command-mapping.mjs NEW — browse trace → target-agnostic ops
scripts/lib/selector-resolver.mjs NEW — snapshot+ref → Playwright locators
scripts/lib/codegen-playwright.mjs NEW — ops → TS with helpers baked in
scripts/lib/verify.mjs NEW — npm install + tsx run + JSON parse
scripts/lib/distill-failure.mjs NEW — Playwright stderr → strategy.md addendum
scripts/evaluate.mjs MODIFIED — BROWSERBASE_CONTEXT_ID
passthrough + --max-turns flag
SKILL.md MODIFIED — documents export, loop,
sectioned strategy.md, and the
helper defaults baked into codegen
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
const result = spawnSync("node", args, {
  stdio: ["ignore", "inherit", "inherit"],
  env: process.env,
});
runExport stdout inheritance pollutes loop's JSON output
Medium Severity
runExport uses stdio: ["ignore", "inherit", "inherit"], which inherits stdout from the child export.mjs process. Since export.mjs with --no-verify writes a JSON report to stdout via console.log, each iteration's export output leaks onto loop.mjs's stdout. Then loop.mjs writes its own final structured JSON to stdout at the end. The combined stdout contains multiple JSON objects, breaking any consumer expecting a single parseable JSON result. Compare with runEvaluate, which correctly uses "pipe" for stdout to capture it.
Reviewed by Cursor Bugbot for commit 8626af8.
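One possible shape for the fix, sketched under assumptions: capture the child's stdout with "pipe" so nothing leaks onto the parent's stdout. `runChildCaptured` is a hypothetical name, not the PR's runExport, and the dynamic import is only there so the sketch loads as either ESM or CommonJS.

```javascript
// Sketch only: run a Node child process and capture its stdout instead of
// inheriting it, so the parent can emit a single JSON document of its own.
async function runChildCaptured(args) {
  const { spawnSync } = await import("node:child_process");
  const result = spawnSync(process.execPath, args, {
    // stdin ignored, stdout captured, stderr still streamed to the terminal.
    stdio: ["ignore", "pipe", "inherit"],
    env: process.env,
    encoding: "utf8",
  });
  return { status: result.status, stdout: result.stdout };
}
```

The caller can then parse or discard `stdout` per iteration and print only its final report.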
  }, text);
  await page.waitForLoadState("load");
  await page.waitForTimeout(waitAfterMs);
}
reactFill and clickButtonByText are never emitted by codegen
Low Severity
The reactFill and clickButtonByText helper functions are baked into every generated Playwright script via wrapScript, but emitOp never generates calls to either of them. Only forceCheck, forceClickRadio, and selectWithFallback are actually dispatched. These two functions are dead code in every emitted script, adding ~30 lines of unused TypeScript to each generated artifact.
Reviewed by Cursor Bugbot for commit 8626af8.
…ed by loop validation
Loop validation today on bizfile (run-008 mined as the passing trace) reduced
the post-codegen hand-edits from yesterday's 15 down to 4 + 1 LLM-extract patch.
Each of the 4 navigation-level issues is now baked in as a codegen default, so
the next task we point loop.mjs at should start from a much smaller residual.
Fixes landed:
1. clickLinkWithFallback helper (codegen-playwright.mjs)
- For click_ref ops where the resolved node role is "link", emit
clickLinkWithFallback(page, <locator>) instead of plain .click().
- Helper reads the resolved .href property (not getAttribute, which
returns relative URLs). If the link exposes an absolute http(s) href,
prefer page.goto over .click — bypasses SPA tour overlays and
onClick preventDefault gates that block deterministic replay.
- Waits for networkidle after navigation (load fires too early on SPAs).
2. .first() default for ambiguous click_sel selectors
- Added isUniqueSelector() classifier: #id, [id=...], [data-testid=...].
- For unique selectors, emit .click() as before. For ambiguous ones
(e.g. `button[type=button]`), emit .first().click() to avoid
Playwright strict-mode violations.
3. exact: true for form-input getByRole emissions (selector-resolver.mjs)
- Added EXACT_NAME_ROLES set: textbox, searchbox, combobox, spinbutton,
listbox. nodeToLocators emits { name, exact: true } for these.
- Prevents "Limited Liability Company Name" from matching
"Confirm Limited Liability Company Name" (real bug from yesterday).
4. snapshot role "select" → ARIA role "combobox" (selector-resolver.mjs)
- Added SNAPSHOT_TO_ARIA_ROLE map and normalize at top of nodeToLocators.
- Browse-snapshot reports <select> with role "select" but Playwright's
ARIA role is "combobox". Without this mapping, the emitter produced
getByRole("select", ...) which is invalid.
- Also boost getByLabel above getByRole for select-likes (combobox/listbox)
since label-based locators tend to be more reliable for form selects.
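The four defaults above can be sketched as small pure functions that return emitted code as strings. The constant names mirror the commit message, but the regexes, function names, and output format are assumptions rather than the actual codegen-playwright.mjs or selector-resolver.mjs code.

```javascript
// Illustrative only: values follow the commit message, logic is a guess.
const SNAPSHOT_TO_ARIA_ROLE = { select: "combobox" };
const EXACT_NAME_ROLES = new Set([
  "textbox", "searchbox", "combobox", "spinbutton", "listbox",
]);

// Fix 1: prefer page.goto only for absolute http(s) hrefs.
function prefersGoto(href) {
  try {
    const url = new URL(href);
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    return false; // relative or malformed href: fall back to .click()
  }
}

// Fix 2: treat #id, [id=...], and [data-testid=...] as unique selectors.
function isUniqueSelector(selector) {
  const s = selector.trim();
  return (
    /^#[\w-]+$/.test(s) ||
    /^\[id=(["']?)[^"'\]]+\1\]$/.test(s) ||
    /^\[data-testid=(["']?)[^"'\]]+\1\]$/.test(s)
  );
}

// Ambiguous selectors get .first() to avoid strict-mode violations.
function emitClick(selector) {
  const locator = `page.locator(${JSON.stringify(selector)})`;
  return isUniqueSelector(selector)
    ? `${locator}.click()`
    : `${locator}.first().click()`;
}

// Fixes 3 and 4: normalize the snapshot role, then add exact: true for
// form-input roles.
function emitRoleLocator(node) {
  const role = SNAPSHOT_TO_ARIA_ROLE[node.role] ?? node.role;
  const options = EXACT_NAME_ROLES.has(role)
    ? { name: node.name, exact: true }
    : { name: node.name };
  return `page.getByRole(${JSON.stringify(role)}, ${JSON.stringify(options)})`;
}
```

For example, a snapshot node `{ role: "select", name: "State" }` would be emitted as a `combobox` locator with `exact: true` instead of the invalid `getByRole("select", ...)`.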
Validation:
Re-exported bizfile-ca-llc from run-008 with these defaults. The emitted
script navigates ALL 9 wizard steps without hand-edits (vs. yesterday's
hand-fixed playwright-baseline/ which required 7 categories of patches).
Only failure is in the LLM-generated extract block at the end (brittle
structural locators in result-shaping) — separate concern, tracked as a
follow-up. The architectural goal (loop + codegen produces a navigating
Playwright script) is met.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update (May 14): validated end-to-end on bizfile from scratch + landed 4 codegen fixes for the residual hand-edits.
Commit
Ready for review. Two named follow-ups in the PR description for when we pick this up after Friday's walkthrough.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
There are 5 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit c918d2d.
let parsed = null;
try {
  const lastBrace = stdout.lastIndexOf("{");
  if (lastBrace >= 0) parsed = JSON.parse(stdout.slice(lastBrace));
JSON parsing fails for nested pretty-printed output
High Severity
stdout.lastIndexOf("{") finds the last opening brace in the output, but the generated Playwright script emits JSON.stringify(result, null, 2) (pretty-printed). For any Output schema with nested objects, lastIndexOf locates an inner { rather than the outermost one, causing JSON.parse to throw. Since parsed stays null, passed is always false even when the script succeeds — the loop can never graduate for tasks with non-flat output schemas.
Reviewed by Cursor Bugbot for commit c918d2d.
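One way to address this, sketched under the assumption that the JSON result is the last thing the script prints: walk backward over every `{` until a suffix parses. `parseTrailingJson` is a hypothetical name and is not the `extractFinalJson` helper in the PR.

```javascript
// Sketch: return the last complete JSON object at the end of `stdout`,
// or null if none parses. Inner braces of pretty-printed output fail to
// parse (they leave trailing closers), so the scan keeps moving left
// until it reaches the outermost object.
function parseTrailingJson(stdout) {
  const text = stdout.trimEnd();
  let start = text.lastIndexOf("{");
  while (start >= 0) {
    try {
      return JSON.parse(text.slice(start));
    } catch {
      // Not a complete object from here; try the previous opening brace.
      start = start === 0 ? -1 : text.lastIndexOf("{", start - 1);
    }
  }
  return null;
}
```

This is quadratic in the worst case, which should be acceptable for script-sized output; it does not handle log lines printed after the JSON.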
import { spawnSync } from "node:child_process";
import { fileURLToPath } from "node:url";
import { distillFailure, appendToStrategy } from "./lib/distill-failure.mjs";
import { pickRun } from "./lib/pick-run.mjs";
Unused pickRun import in loop orchestrator
Low Severity
pickRun is imported from ./lib/pick-run.mjs but never called anywhere in loop.mjs. Only extractFinalJson and readSummary (imported on the next line from the same module) are actually used.
Reviewed by Cursor Bugbot for commit c918d2d.
// comments. Good enough to catch LLM truncation, not a parser.
function checkBalance(code) {
  let depth = { "{": 0, "[": 0, "(": 0 };
  const open = { "{": "}", "[": "]", "(": ")" };
Unused open variable in bracket balance checker
Low Severity
The open mapping in checkBalance is declared but never referenced. Only the depth object is used for tracking bracket counts.
Reviewed by Cursor Bugbot for commit c918d2d.
🏗️ Architecture feedback: toward goal-driven codegen
Nice work on this PR. The pipeline is clean and the bizfile validation is solid. But I want to flag a longer-term architecture concern before we harden around the current design.
The core tension
The current pipeline is a trace compiler: mine the LLM's trace → resolve ARIA refs to Playwright locators → emit a deterministic script. This works, but it means the trace is the source of truth for the generated script, and that's where things get complicated.
The LLM's trace includes a lot of incidental decisions. It clicked a button because it saw it first. It used a CSS selector because the snapshot was long. It took an extra step because it got confused. The export pipeline faithfully converts all of this into Playwright, the noise alongside the signal. A human writing Playwright wouldn't replay the journey; they'd look at the goal (fill step 3 of this form) and write the simplest path to it.
This has three downstream consequences:
Proposed architecture: hybrid skeleton + agent codegen
Split the problem into what machines do well (structure) and what agents do well (judgment):
Phase 1: Mechanical skeleton extraction (keep most of what you have)
Mine the trace into a workflow skeleton, not a Playwright script:
The command-mapping and trace-walking code from this PR is great infrastructure for this. The change is: stop at the intent layer, don't go all the way to Playwright locators.
Phase 2: Agent writes Playwright from the skeleton + a live session
Give Claude the skeleton + strategy.md + a live Browserbase session. Claude writes Playwright for each step:
This eliminates the ARIA ref resolution pipeline, the selector ranking heuristics, and the hand-curated helper catalog. Claude is the codegen. The domain knowledge ("state portals use React controlled forms with styled label overlays") lives in strategy.md as prose, and the agent decides when to apply it, rather than encoding it as named functions.
Phase 3: Agent-driven verification
Instead of "2 of 3 passes," give Claude the script + the last N run results and ask: "Is this production-ready? What's still flaky?" The agent can identify that a timeout was 4900ms on a 5000ms limit (near-miss, not a real pass), or that a selector matched by coincidence. Graduation becomes a judgment call, not a counter.
What this means for this PR
Ship it as-is: the pipeline works, bizfile validates it, and the infrastructure (command-mapping, trace-walking, selectors.cache.json, distill-failure) is valuable regardless of architecture. But I'd treat the current export pipeline as a stepping stone, not the final architecture:
The end state: the trace trains strategy.md (the existing autobrowse loop), and the agent writes Playwright from the task description + strategy, not from the trace. The trace is training data, not source code. That's the architecture that scales to 50 agencies without a new helper per portal.


Headline
autobrowse can now emit a runnable, deterministic Playwright script from any passing trace, and iterate the explorer + emitter together until both halves converge on the same workflow.
Before this PR, autobrowse produced traces + strategy.md: durable artifacts, but the only way to re-run the task was to pay LLM inference per step. There was no path from a graduated task to a no-LLM-loop runnable script. This PR adds that path.
What's new
1. End-to-end Playwright export pipeline (entirely new)
The full mining → resolve → codegen → verify pipeline, none of which existed in autobrowse before:
- scripts/export.mjs: CLI (--task --target playwright --workspace --run --no-verify)
- scripts/lib/pick-run.mjs: newest-passing-run selection from traces/<task>/run-NNN/
- scripts/lib/parse-task.mjs: task.md Output block → Zod schema for the emitted script
- scripts/lib/command-mapping.mjs: browse trace → target-agnostic op stream
- scripts/lib/selector-resolver.mjs: snapshot + session-scoped ARIA ref → ranked Playwright locator candidates (getByRole(name) → getByLabel → getByPlaceholder → getByText)
- scripts/lib/codegen-playwright.mjs: ops → runnable TypeScript with helper functions baked in (see section 3 below)
- scripts/lib/verify.mjs: npm install + npx tsx + JSON output parse
- scripts/lib/distill-failure.mjs: Claude Haiku summary of Playwright failures into strategy.md
The emitted script connects to a fresh Browserbase session bound to BROWSERBASE_CONTEXT_ID (when set), so persistent-context auth survives between explorer training and Playwright replay.
2. scripts/loop.mjs: iterative co-evolution
Until now autobrowse converged on "the LLM can finish the task," then export would have been a one-shot translation at the end. Those are different objective functions: what unblocks the LLM agent doesn't always unblock a deterministic replay.
The loop bridges them:
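strategy.md becomes a shared intelligence layer between the LLM explorer (next iteration) and the codegen. Three sections (documented in SKILL.md):
- Navigation Heuristics (LLM-facing)
- Codegen Hints (emitter-facing, per-task overrides)
- Recent Playwright Failures (auto-appended by distill-failure)
3. Codegen defaults that absorb the common state-portal pitfalls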
Demoing the export pipeline end-to-end on bizfile.sos.ca.gov surfaced ~7 distinct classes of mismatch between what unblocks the LLM agent and what unblocks a deterministic replay. Each is now baked in as an auto-emitted helper or behavior — so the next task we point this at starts from a much smaller residual.
- forceCheck: page.locator('input[type=checkbox]').fill('true') (Playwright rejects) and overlay-intercepted .check()
- forceClickRadio: [type=radio] OR when resolved snapshot node role is radio
- selectWithFallback: .selectOption() with a JS-enable + React-native-setter fallback for transiently-disabled <select>
- reactFill: HTMLInputElement.prototype.value setter + synthetic input/change events
- clickButtonByText: avoids the cross-step getByRole race
- clickLinkWithFallback: reads the resolved .href property and prefers page.goto for absolute hrefs
- .first() default for ambiguous click_sel: button[type=button] matching 3 elements (Help / Save Draft / Next Step) → strict-mode violation
- exact: true for form-input getByRole: prevents "Limited Liability Company Name" from matching "Confirm Limited Liability Company Name"
- "select" → ARIA "combobox": getByRole("select", ...) which is invalid in Playwright
- select_ref op routing: browse select [0-2005] CA resolves the ref via snapshot instead of leaking as invalid CSS
4. scripts/evaluate.mjs: additive patches
- BROWSERBASE_CONTEXT_ID env var; if set with --env remote, pre-creates one BB session bound to that context, transparently injects --connect <session-id> into every browse command from the agent, and releases the session at exit. Lets persistent-context auth flow through every iteration without per-run login flailing.
- --max-turns N CLI flag (previously hard-coded to 30). loop.mjs plumbs this through.
5. SKILL.md
New "Export to deterministic Playwright" and "Iterative Playwright loop" sections covering when to use loop.mjs vs evaluate.mjs, the sectioned strategy.md format, the codegen helper defaults, and pre-authed sessions via persistent context.
Validation (May 13–14, bizfile.sos.ca.gov LLC formation)
Phase 1 (May 13, customer_demos PR #33): ran the export pipeline by hand against run-004. The emitted script needed 15 hand-edits + an extract patch before it would replay cleanly. Those hand-edits became the source list for the codegen defaults above.
Phase 2 (May 14, this PR): ran the full loop.mjs from scratch.
Net result: the wizard-navigation half went from 15 hand-edits → 0. The LLM-extract block is the remaining gap.
Known limitations / follow-ups
- LLM-generated extract block remains brittle. The Haiku-generated result-shaping code at the end of every emitted script uses structural locators (page.locator('text="X"').evaluate(...)) that often match multiple elements. The wizard navigation succeeds end-to-end, then the extract throws and success: false is returned. Right fix: harden the extract prompt to insist on per-field try/catch + prefer getByLabel({..., exact: true}). ~30 LOC follow-up.
- No feedback when evaluate itself maxes out. The loop currently only distills Playwright failures into strategy.md. When evaluate hits max_turns, there's no addendum and the next iteration repeats whatever caused the flailing. Right fix: a second distillation pathway that reads evaluate's decision log when status is max_turns, identifies the longest-spent step, and writes a Codegen Hint.
- strategy.md's "Codegen Hints" section is human-readable only. The codegen doesn't yet parse it for per-task overrides at export time. The new helpers are baked in as defaults that fire on selector/role heuristics. Right fix: structured Codegen Hints DSL the emitter consumes.
- Validated on n=1 task. All evidence so far comes from bizfile. State-portal patterns we haven't exercised: date pickers, file uploads, multi-tab flows, iframed forms, captchas mid-flow, Symantec VIP / SAML auth, steppers without a "Next" button. Each may surface a new codegen default. Recommend running this against 1–2 more diverse portals (CA EDD + a DMV-style stepper) before any "generalizes to all 50 agencies" claim.
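A rough sketch of the per-field try/catch idea from the first limitation, assuming the extract block can be expressed as a map of field name → async reader. `extractFields` and the result shape are hypothetical, not what the emitted scripts contain today.

```javascript
// Hypothetical helper: run each field's reader independently so one bad
// locator nulls a single field instead of failing the whole extract.
async function extractFields(extractors) {
  const data = {};
  const errors = {};
  for (const [field, read] of Object.entries(extractors)) {
    try {
      data[field] = await read();
    } catch (err) {
      data[field] = null;
      errors[field] = err instanceof Error ? err.message : String(err);
    }
  }
  return { data, errors, success: Object.keys(errors).length === 0 };
}
```

In an emitted script each reader might wrap a call such as `page.getByLabel("Entity Name", { exact: true }).inputValue()`; the label text there is made up for illustration.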
Try it
The loop graduates when Playwright passes in 2 of the last 3 iterations and writes a report to <workspace>/reports/loop-<task>-<timestamp>.md. Sister PR with the bizfile demo workspace + the emitted-then-hand-fixed script: browserbase/customer_demos#33.
🤖 Generated with Claude Code