Add opt-in playwright_execute tool to the CUA agent and CLI by dprevoznik · Pull Request #33 · kernel/cua

dprevoznik · 2026-06-19T21:52:01Z

Summary

Adds an opt-in playwright_execute tool so the model can run Playwright/TypeScript directly against the live browser session — for steps that are awkward as raw pointer/keyboard actions (precise DOM reads, form fills, data extraction, waiting on selectors). It sits alongside the existing computer-use tools rather than replacing them.

Execution runs server-side in the browser VM via the Kernel SDK (client.browsers.playwright.execute), which exposes page/context/browser and lets the code return a JSON-serializable value. No CDP wiring or local Playwright is needed.

It is modeled directly on the existing computer_use_extra navigation tool:

@onkernel/cua-ai — playwright_execute tool name, { code, timeout_sec? } schema, and createCuaPlaywrightToolDefinition().
@onkernel/cua-agent — InternalComputerTranslator.executePlaywright(), a playwright executor in tools.ts, and the playwright option threaded through CuaAgent/CuaAgentHarness. The tool name is added to keepToolNames() so provider payload hooks don't strip it.
@onkernel/cua-cli — --playwright flag and a TUI tool-call preview.
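As a rough illustration of the { code, timeout_sec? } schema listed above, the arguments of a tool call might look like the sketch below. The PlaywrightExecuteInput type and the isPlaywrightExecuteInput guard are local stand-ins written for this example, not exports of @onkernel/cua-ai, and the code string assumes the page object that the summary says is exposed to the script.

```typescript
// Local stand-in for the tool's argument shape; not imported from @onkernel/cua-ai.
interface PlaywrightExecuteInput {
  code: string;
  timeout_sec?: number;
}

// Hypothetical runtime guard for arguments that arrive as untyped JSON from a model.
function isPlaywrightExecuteInput(value: unknown): value is PlaywrightExecuteInput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v.code !== "string" || v.code.length === 0) return false;
  return v.timeout_sec === undefined || typeof v.timeout_sec === "number";
}

// Example arguments: a pure read that returns a JSON-serializable value.
const exampleInput: PlaywrightExecuteInput = {
  code: [
    "const h1 = await page.locator('h1').innerText();",
    "return { h1, title: await page.title() };",
  ].join("\n"),
  timeout_sec: 30,
};
```

The code field is only a string on the client side; nothing in it runs until the server executes it in the browser VM.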

Behavior, per the decisions on this:

Opt-in: off by default; enable with --playwright (CLI) or playwright: true (library).
Result shape: returns result (when present), plus stdout/stderr only when non-empty, and error on success: false. A reported failure comes back as tool content (not thrown) so the model can adapt; only a thrown SDK error surfaces as a tool error. Library consumers can also read the structured result/stdout/stderr/error off PlaywrightDetails without re-parsing tool content text.
Timeout: timeout_sec follows the documented server contract (default 60s, max 300s); values are clamped client-side so the model can't violate the cap.
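Screenshot: the initial revision appended a fresh screenshot after execution; that was dropped during review (f855cf1), so no screenshot is returned automatically and the model requests one on a follow-up turn.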
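A minimal sketch of the result-shape rule above, assuming an execution object with success, result, stdout, stderr, and error fields. The ExecutionLike type and the toModelText helper are hypothetical names for illustration, not the PR's formatPlaywrightResult. A thrown SDK error is deliberately not handled here: it would propagate to the caller and surface as a tool error.

```typescript
// Assumed shape of what the execute call reports; field names follow the PR text.
interface ExecutionLike {
  success: boolean;
  result?: unknown;
  stdout?: string;
  stderr?: string;
  error?: string;
}

// Build the model-facing text blocks: result when present, stdout/stderr only
// when non-empty, and error only when the run reported success: false.
function toModelText(execution: ExecutionLike): string[] {
  const blocks: string[] = [];
  if (execution.result !== undefined) {
    blocks.push(`result: ${JSON.stringify(execution.result)}`);
  }
  if (execution.stdout?.trim()) {
    blocks.push(`stdout:\n${execution.stdout.trimEnd()}`);
  }
  if (execution.stderr?.trim()) {
    blocks.push(`stderr:\n${execution.stderr.trimEnd()}`);
  }
  if (!execution.success && execution.error) {
    blocks.push(`error: ${execution.error}`);
  }
  return blocks;
}
```

Because a reported failure is returned as ordinary blocks rather than thrown, the model sees the stderr and error text on its next turn and can retry with corrected code.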

Naming note

The model-facing wire name is playwright_execute (snake_case, consistent with computer_use_extra / computer_batch), the CLI flag is --playwright, and the option is playwright.

Model support

The tool is advertised as a generic function tool, so any provider that supports function calling alongside its native computer-use API can call it. The playwright_execute name is added to keepToolNames() so provider payload hooks that filter unknown tools (tzafon/yutori) won't strip it. Verified e2e against:

Anthropic (claude-opus-4-7)
Tzafon (tzafon.northstar-cua-fast-1.6)
Yutori (n1.5-latest)

OpenAI (gpt-5.5) and Google (gemini-3-flash-preview) are unit-tested but not yet e2e-verified against a live browser.
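The interaction between keepToolNames() and a payload hook that filters unknown tools can be sketched as follows. Everything here is a simplified stand-in: the ToolDef type, the keepToolNames signature, the filterTools hook, and the "computer" native tool name are assumptions for illustration, not the packages' actual internals.

```typescript
interface ToolDef {
  name: string;
}

// Hypothetical version of keepToolNames(): opt-in tools are listed only when enabled.
function keepToolNames(options: { computerUseExtra?: boolean; playwright?: boolean }): Set<string> {
  const names = new Set<string>();
  if (options.computerUseExtra) names.add("computer_use_extra");
  if (options.playwright) names.add("playwright_execute");
  return names;
}

// Hypothetical payload hook: keep provider-native tools plus anything on the keep list.
function filterTools(tools: ToolDef[], nativeNames: Set<string>, keep: Set<string>): ToolDef[] {
  return tools.filter((tool) => nativeNames.has(tool.name) || keep.has(tool.name));
}

// "computer" is a placeholder for whatever the provider's native tool is called.
const advertised: ToolDef[] = [
  { name: "computer" },
  { name: "playwright_execute" },
  { name: "some_unknown_tool" },
];
const native = new Set(["computer"]);

const withPlaywright = filterTools(advertised, native, keepToolNames({ playwright: true }));
const withoutPlaywright = filterTools(advertised, native, keepToolNames({}));
```

With the option enabled the tool survives the filter; with it disabled the hook strips it along with any other unknown tool.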

Docs

packages/agent/README.md, packages/ai/README.md, and packages/cli/README.md updated alongside the code.

Test plan

npm run typecheck (workspace) passes
@onkernel/cua-agent suite passes, incl. 3 new tests (tool synthesized when enabled; execution formats result/stdout + appends screenshot; failure surfaces as content without throwing)
@onkernel/cua-ai (88) and @onkernel/cua-cli (37) suites pass
Manual smoke against a live Kernel browser (cua --playwright) on three providers:
- Anthropic (claude-opus-4-7) — happy path returned result: {"h1":"Example Domain","title":"Example Domain"} in one turn; details carried the structured result object.
- Tzafon (tzafon.northstar-cua-fast-1.6) — same one-turn happy path. Confirms keepToolNames() correctly preserves the tool through tzafon's payload hook.
- Yutori (n1.5-latest) — recovered from a TypeError (page.querySelector is not a function) and a ReferenceError (document not defined) by reading the failure-as-content stderr/error blocks, then arrived at the correct page.evaluate(...) pattern. Confirms the failure-as-content design closes the iteration loop.
Failure path verified during the Yutori smoke: success: false with the Playwright stderr/error came back as tool content (not thrown), screenshot still appended, model read it and adapted.

🤖 Generated with Claude Code

Note

Medium Risk
Introduces server-side arbitrary Playwright execution against live browser sessions when enabled; mitigated by opt-in default, timeout caps, and soft failure handling for model recovery.

Overview
Adds an opt-in playwright_execute tool so models can run Playwright/TypeScript against the live Kernel browser session (DOM reads, selectors, form fills) alongside existing computer-use tools.

@onkernel/cua-ai defines the tool (CuaPlaywrightSchema, createCuaPlaywrightToolDefinition(), CUA_PLAYWRIGHT_TOOL_NAME).

@onkernel/cua-agent wires it through a new playwright?: boolean on CuaAgent / CuaAgentHarness, InternalComputerTranslator.executePlaywright() (Kernel browsers.playwright.execute, optional timeout_sec clamped to 300s), and executor logic that returns structured PlaywrightDetails plus model-facing text for result / stdout / stderr / error. Reported Playwright failures are tool content (not thrown); only SDK errors throw. playwright_execute is included in keepToolNames() so Yutori/Tzafon payload hooks do not strip it. Unlike navigation/batch tools, this path does not auto-append a screenshot.

@onkernel/cua-cli adds --playwright, passes it into the harness, and shows truncated code in the TUI tool-call preview. README updates and unit tests cover synthesis, success/failure shapes, and no image on success.
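For the truncated code preview mentioned above, one plausible approach is to collapse whitespace and cap the length. The previewCode helper, its 80-character default, and the trailing ellipsis are assumptions for this sketch; the PR text does not say how the TUI truncates.

```typescript
// Hypothetical preview helper: collapse whitespace to one line, then cap the length.
function previewCode(code: string, maxChars = 80): string {
  const oneLine = code.replace(/\s+/g, " ").trim();
  if (oneLine.length <= maxChars) return oneLine;
  // Reserve one character for the ellipsis so the preview stays within maxChars.
  return oneLine.slice(0, Math.max(maxChars - 1, 0)) + "…";
}
```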

Reviewed by Cursor Bugbot for commit f855cf1. Bugbot is set up for automated code reviews on this repo.

Exposes a tool that runs Playwright/TypeScript directly against the browser session (via the Kernel SDK browsers.playwright.execute) for steps that are awkward as raw pointer/keyboard actions. Modeled on the existing computer_use_extra navigation tool: defined in cua-ai, executed through the translator, gated by a `playwright` option, and added to keepToolNames so providers retain it in the payload. Enable with the `--playwright` CLI flag. Returns result/stdout/stderr and appends a fresh screenshot so the screenshot loop stays coherent. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Drop misleading "Defaults to 60" from timeout_sec description; the actual default lives in the Kernel SDK, not here.
- Expose result/stdout/stderr/error on PlaywrightDetails so library consumers can branch on the structured execution result without re-parsing tool content text.
- Guard formatPlaywrightResult against non-JSON-serializable returns (e.g. BigInt, circular refs) so a successful Playwright run never becomes a tool-level error.
- Sync package-lock.json to match the cua-cli 0.1.1 bump in a7cdc07.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Locals don't persist across calls but the browser session does. Without this, a model could write code in call N assuming variables from call N-1 are still in scope. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Earlier review feedback dropped "Defaults to 60" out of a worry that the default lived in the SDK and could drift. The kernel.sh docs put both the default (60s) and the cap (300s) on the server, so the description is the authoritative place to surface them — the model can't choose a sensible timeout without that anchor. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Schema description tells the model "max 300" but nothing enforced it. A model that ignored the bound would have hit a confusing SDK-level failure depending on server behavior; this clamp keeps the client honest to the documented contract. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- packages/agent: list playwright option alongside computerUseExtra and add a paragraph explaining the tool's behavior and tested-models scope.
- packages/ai: list the new tool-definition factory, schema, constants, and CuaPlaywrightInput type in the API surface index.
- packages/cli: document --playwright with a short explainer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

formatPlaywrightResult's JSON.stringify try/catch guarded against non-serializable values, but execution.result came from the SDK after a JSON round trip through the wire — anything that survived that is already JSON-safe, so the catch arm is unreachable. The executePlaywright timeout chain checked typeof === "number" (dead, the parameter is TS-typed number | undefined) and Number.isFinite (redundant — timeoutSec > 0 already rejects NaN, and Math.min handles Infinity). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Empirical results show CUA-specialized providers (Tzafon, Yutori) do emit playwright_execute calls — earlier docs were overly cautious. Yutori in particular demonstrates the failure-as-content design well: it iterated through two wrong-API attempts (page.querySelector, bare document) before reading the stderr/error blocks and landing on page.evaluate(), which throwing would have prevented. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

firetiger-agent · 2026-06-20T19:29:55Z

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

PRs in the kernel, infra, hypeman, and hypeship repos. kernel is a ~mono repo with many logical services underneath, ensure to focus on the implicated service for the PR

Reason: PR is in the kernel repo but affects the CUA (computer-use agent) service; unclear if this qualifies as a monitored service within the kernel mono repo—please confirm or add the kernel:cua label to opt in.

To monitor this PR anyway, reply with @firetiger monitor this.

Matches executeBatchTool's shape: the trailing translator.screenshot() lives inside the same try/catch as the underlying work, so any failure in the pipeline produces a single wrapped tool error rather than diverging based on which step failed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes using high effort and found 4 potential issues.

Reviewed by Cursor Bugbot for commit 565fe01.

- executePlaywright: timeout_sec values below 1s previously truncated to 0 and were forwarded to the SDK, which differs from omitting the field. Floor the truncated value at 1s; anything sub-second falls back to "use server default".
- Document PlaywrightDetails fields so library consumers know what each one means without reading the executor source.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
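The timeout handling described across these commits (cap at 300s, drop non-positive or NaN values, and treat sub-second values as "use server default") could look roughly like the sketch below. The clampTimeoutSec name and the MAX_TIMEOUT_SEC constant are hypothetical, and returning undefined to mean "omit the field" is one reading of the commit message, not a copy of the translator code.

```typescript
// Cap taken from the PR text (max 300s); the constant name is hypothetical.
const MAX_TIMEOUT_SEC = 300;

// Returns the value to forward to the SDK, or undefined to omit the field
// so that the server default (documented as 60s) applies.
function clampTimeoutSec(timeoutSec?: number): number | undefined {
  // `> 0` is false for undefined-coerced NaN, NaN itself, zero, and negatives.
  if (timeoutSec === undefined || !(timeoutSec > 0)) return undefined;
  // Math.min handles Infinity; Math.trunc drops any fractional seconds.
  const truncated = Math.trunc(Math.min(timeoutSec, MAX_TIMEOUT_SEC));
  // A sub-second request truncates to 0, which is not the same as omitting it.
  return truncated >= 1 ? truncated : undefined;
}
```

Clamping on the client keeps the request within the documented contract even if the model ignores the "max 300" hint in the schema description.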

rgarcia

reviewed end to end — solid work, and a faithful mirror of the computer_use_extra pattern. the failure-as-content design and the keepToolNames() wiring so tzafon/yutori don't strip the tool are the right calls. one change i'd want before merge, one nit, and two follow-ups.

change request

packages/agent/src/tools.ts:222 — drop the auto-appended screenshot from playwright_execute. the native action / computer_batch tools leave screenshots to the model (batch only appends one as a fallback when nothing was read); playwright instead copied navigation's always-append. but playwright is the one tool that's frequently a pure read (return await page.title()), where forcing a screenshot every call is wasted image tokens + latency. better to let the model pull a screenshot on a follow-up turn. the content.length === 0 → statusText fallback at tools.ts:220 already keeps content non-empty for side-effect-only calls, so the function itself needs nothing else. things to keep in sync:

packages/agent/README.md:135 — "a fresh screenshot is appended after every call…" becomes false; reword toward "request a screenshot on a follow-up turn to see the page."
packages/ai/src/providers/common.ts:346 — already says "capture page state with a follow-up screenshot action," which stays correct; worth tightening to make explicit that none is returned automatically.
packages/agent/test/tool-exhaustiveness.test.ts:116,140 — both assert content.at(-1) is an image; that becomes the trailing text block. the captureScreenshot mocks at :102,127 go unused for these two. while in here, add a side-effect-only case (no result/stdout/stderr, success → statusText), since that's now the primary content shape.
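The content.length === 0 → statusText fallback referred to above can be sketched as a small pure function. The ContentBlock union and the finalizeContent name are assumptions for illustration, and the status string used in the example is made up; only the fallback rule itself comes from the review.

```typescript
// Assumed content block shapes; the real types live in the agent package.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType: string };

// With no auto-appended screenshot, a side-effect-only call would produce no
// blocks at all, so fall back to a single status line to keep content non-empty.
function finalizeContent(blocks: ContentBlock[], statusText: string): ContentBlock[] {
  return blocks.length === 0 ? [{ type: "text", text: statusText }] : blocks;
}
```

A call such as a click with no return value and no console output would then yield exactly one text block and no image.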

nit

packages/agent/src/tools.ts:209 vs :227 — stdout/stderr guards diverge. model-facing content gates on execution.stdout?.trim() and stores the trimEnd()-ed value, while details.stdout gates on plain truthiness and stores the raw value. so a whitespace-only stdout lands in details but not in the model content. probably intentional (details = faithful capture), just flagging the inconsistency against the PlaywrightDetails doc ("present only when the daemon captured output"). same for stderr.
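The divergence described in this nit can be reproduced with two small helpers. The function names are hypothetical; the guards mirror the review's description (trim()-gated and trimEnd()-stored for content, truthiness-gated and raw for details) rather than the exact source at the cited lines.

```typescript
// Model-facing view: gate on trimmed non-emptiness, store with trailing whitespace removed.
function stdoutForContent(stdout?: string): string | undefined {
  return stdout?.trim() ? stdout.trimEnd() : undefined;
}

// Details view: gate on plain truthiness, store the raw captured value.
function stdoutForDetails(stdout?: string): string | undefined {
  return stdout ? stdout : undefined;
}

// A whitespace-only capture shows the difference between the two views.
const whitespaceOnly = "  \n";
const contentView = stdoutForContent(whitespaceOnly); // undefined
const detailsView = stdoutForDetails(whitespaceOnly); // "  \n"
```

The same pair of guards would apply to stderr.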

follow-ups (separate PRs, not blocking)

same screenshot removal for computer_use_extra (tools.ts:175, executeNavigationTool) — it also unconditionally appends.
a /playwright [on|off] interactive slash command to toggle the tool mid-session, mirroring /model's setModel → state.tools refresh (agent.ts:306-310). the live this.options.playwright reads in tools()/keepToolNames() make this cheap; needs a setPlaywright mutator on runtime + agent + harness, a slash-commands.ts entry/parse case, and an applyPlaywrightCommand handler in main.ts.
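For the proposed /playwright [on|off] follow-up, the parse case might be sketched like this. Nothing here exists in the repo yet: the PlaywrightCommand type, the parsePlaywrightCommand name, and the choice to treat a bare /playwright as a usage error are all assumptions, since the review does not specify them.

```typescript
type PlaywrightCommand =
  | { kind: "set"; enabled: boolean }
  | { kind: "invalid"; reason: string };

const USAGE = "usage: /playwright [on|off]";

// Returns null when the input is not this command at all, so other
// slash-command parsers can take a turn.
function parsePlaywrightCommand(input: string): PlaywrightCommand | null {
  const parts = input.trim().split(/\s+/);
  if (parts[0] !== "/playwright") return null;
  // Assumption: exactly one argument is required; a bare command reports usage.
  if (parts.length !== 2) return { kind: "invalid", reason: USAGE };
  if (parts[1] === "on") return { kind: "set", enabled: true };
  if (parts[1] === "off") return { kind: "set", enabled: false };
  return { kind: "invalid", reason: USAGE };
}
```

A handler would then pass the parsed enabled flag to whatever mutator the follow-up PR introduces (the review suggests setPlaywright).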

…/stderr details

playwright_execute is frequently a pure read where forcing a screenshot wastes image tokens and latency. Let the model request one on a follow-up turn. The existing content.length === 0 → statusText fallback keeps content non-empty for side-effect-only calls.

Also tighten the PlaywrightDetails TSDoc for stdout/stderr to reflect that details captures raw daemon output (potentially whitespace-only), while the model-facing content blocks only surface trimmed non-empty output.

- packages/agent/src/tools.ts: drop screenshot append in executePlaywrightTool; update PlaywrightDetails TSDoc for stdout/stderr.
- packages/agent/README.md and packages/ai/src/providers/common.ts: reword to make explicit no screenshot is returned automatically.
- packages/agent/test/tool-exhaustiveness.test.ts: flip the trailing-image assertions to assert no image is appended; drop the unused captureScreenshot mocks; add a side-effect-only case that hits the statusText fallback.

dprevoznik · 2026-06-23T20:40:27Z

@rgarcia the requested change and the nit are addressed in f855cf1, and I reverified that all tests still pass. @masnwilliams lmk if you have any thoughts before I merge. If they are optimizations that can be added later, happy to address them as part of Raf's suggested follow-up items too.

rgarcia

re-reviewed at f855cf1 — the screenshot-removal change request is fully addressed, and cleanly.

tools.ts — auto-screenshot dropped from playwright_execute; the content.length === 0 → statusText fallback now carries side-effect-only calls, so content is never empty. ✅
packages/agent/README.md / packages/ai/src/providers/common.ts — both reworded to "no screenshot is returned automatically; request one on a follow-up turn." accurate now. ✅
tests — happy-path renamed and the two exec tests now assert no image block, the dead captureScreenshot mocks are gone, and you added the side-effect-only case (asserts content is exactly the statusText). ✅
stdout/stderr nit — resolved by aligning the PlaywrightDetails doc to the actual behavior ("raw daemon output … may be whitespace-only") rather than changing the guards. good call keeping details as the faithful raw capture and content as the trimmed view.

verified locally: @onkernel/cua-agent suite green (25 passed, incl. the new side-effect-only test) and tsc -b typecheck clean.

the two follow-ups — same screenshot removal for computer_use_extra (navigation), and a /playwright [on|off] mid-session toggle — remain tracked as separate PRs, not blocking. nothing else outstanding from my end.

rgarcia

approving — screenshot-removal change request fully addressed at f855cf1, stdout/stderr doc nit resolved, and i verified the agent suite (25 passing) + tsc -b typecheck locally. the two follow-ups (navigation screenshot removal, /playwright mid-session toggle) are tracked as separate non-blocking PRs.

dprevoznik and others added 8 commits June 19, 2026 21:51

dprevoznik marked this pull request as ready for review June 20, 2026 19:29

cursor Bot reviewed Jun 20, 2026

Comment thread packages/agent/src/tools.ts Outdated

Comment thread packages/agent/src/tools.ts

cursor Bot reviewed Jun 20, 2026

Comment thread packages/cli/src/tui/message-list.ts

Comment thread packages/agent/src/translator/translator.ts

Comment thread packages/agent/src/tools.ts

Comment thread packages/agent/src/tools.ts

dprevoznik requested review from masnwilliams and rgarcia June 20, 2026 20:49

rgarcia reviewed Jun 23, 2026

rgarcia approved these changes Jun 23, 2026

dprevoznik merged commit 6c4740c into main Jun 23, 2026
6 checks passed

This was referenced Jun 23, 2026

Drop auto-appended screenshot from computer_use_extra #35

Draft

Add /playwright [on|off] slash command to toggle the tool mid-session #36

Draft

rgarcia mentioned this pull request Jun 23, 2026

Release: bump cua-ai 0.3.1, cua-agent 0.3.4, cua-cli 0.1.3 (playwright_execute) #37

Merged

3 tasks

dprevoznik merged 11 commits into main from hypeship/cua-playwright-execute-tool
