From 28258149b8509f14c53e7287d2928acba5a159e6 Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Sun, 21 Jun 2026 14:33:50 +0000 Subject: [PATCH 1/4] add cua skill plugin New `cua` plugin documenting both the `cua` CLI (`@onkernel/cua-cli`) and the `@onkernel/cua-agent` TS library (`CuaAgent` / `CuaAgentHarness`). Single skill covers one-shot subcommands, named sessions, transcripts, model selection across providers, library quick start, live-view handoff for manual login, and the Playwright escape hatch for deterministic actions against the underlying Kernel browser. Co-Authored-By: Claude Opus 4.7 --- README.md | 12 + plugins/cua/.claude-plugin/plugin.json | 11 + plugins/cua/skills/cua/SKILL.md | 390 +++++++++++++++++++++++++ 3 files changed, 413 insertions(+) create mode 100644 plugins/cua/.claude-plugin/plugin.json create mode 100644 plugins/cua/skills/cua/SKILL.md diff --git a/README.md b/README.md index 31232f1..9860269 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,9 @@ Official AI agent skills from the Kernel for installing useful skills for our CL # Install the video generation skill /plugin install generate-video + +# Install the cua skill (CLI + library for computer-use on Kernel) +/plugin install cua ``` ### Cursor @@ -42,6 +45,7 @@ git clone https://github.com/kernel/skills.git cp -r skills/plugins/kernel-cli ~/.claude/skills/ cp -r skills/plugins/kernel-sdks ~/.claude/skills/ cp -r skills/plugins/generate-video ~/.claude/skills/ +cp -r skills/plugins/cua ~/.claude/skills/ ``` ## Prerequisites @@ -83,6 +87,14 @@ SDK skills for building browser automation with TypeScript and Python. | **typescript-sdk** | Build automation with Kernel's Typescript SDK | | **python-sdk** | Build automation with kernel's Python SDK | +### cua + +Computer-use loop for Kernel cloud browsers — CLI for shell-driven automation and the `@onkernel/cua-agent` TS library for embedding in your own agents. 
+ +| Skill | Description | +|-------|-------------| +| **cua** | Drive Kernel cua via the `cua` CLI (one-shot subcommands, named sessions, TUI) or the `@onkernel/cua-agent` library (`CuaAgent` / `CuaAgentHarness`); covers model selection, profile persistence, transcripts, live-view handoff, and Playwright escape hatches | + ### generate-video Render smooth, deterministic MP4s from web scenes. No Kernel account required — just Chromium, Node, and ffmpeg. diff --git a/plugins/cua/.claude-plugin/plugin.json b/plugins/cua/.claude-plugin/plugin.json new file mode 100644 index 0000000..273978e --- /dev/null +++ b/plugins/cua/.claude-plugin/plugin.json @@ -0,0 +1,11 @@ +{ + "name": "cua", + "version": "1.0.0", + "description": "Drive Kernel cua: the `cua` CLI for shell-driven computer-use automation, and the @onkernel/cua-agent TS library for building your own computer-use agents on Kernel browsers", + "author": { + "name": "Kernel", + "url": "www.kernel.sh" + }, + "repository": "https://github.com/kernel/skills", + "license": "MIT" +} diff --git a/plugins/cua/skills/cua/SKILL.md b/plugins/cua/skills/cua/SKILL.md new file mode 100644 index 0000000..3b3d4cb --- /dev/null +++ b/plugins/cua/skills/cua/SKILL.md @@ -0,0 +1,390 @@ +--- +name: cua +description: Drive Kernel cua — the `cua` CLI for shell automation, or the @onkernel/cua-agent TypeScript library for building your own computer-use agents. Use when opening URLs, clicking/typing/observing in a real cloud browser via cua, chaining multi-step browser tasks across shell calls, or wiring up `CuaAgent` / `CuaAgentHarness` against a Kernel browser. Covers model selection (gpt-5.5, claude-opus-4-7, gemini-3-flash-preview, n1.5-latest), named sessions, profile persistence, transcripts, live-view handoff, and Playwright escape hatches. +--- + +# cua + +`cua` is a computer-use loop for Kernel cloud browsers. 
There are two surfaces, both backed by the same execution layer: + +- **`cua` CLI** (`@onkernel/cua-cli`) — single binary that drives a real Chrome session running in Kernel. Each subcommand returns a one-line result on stdout and a deterministic exit code, so shell agents can chain calls. +- **`@onkernel/cua-agent` library** — `CuaAgent` / `CuaAgentHarness` TypeScript classes that run the same prompt → screenshot → tool-call loop against a Kernel browser, callable from your own code. + +Both translate per-provider computer-use tool calls (OpenAI's `computer`, Anthropic's `computer_20251124`, Gemini's normalized-coordinate functions, Yutori Navigator's browser actions) into Kernel SDK `browsers.computer.*` calls and feed a fresh screenshot back to the model on every turn. + +## When to use this skill + +- **Use the CLI** when you need shell-callable computer-use steps (`cua open`, `cua click`, `cua do …`) or an interactive TUI. Best for ad-hoc agent tasks, shell pipelines, and one-shot prompts. +- **Use the library** when you need to embed cua inside a larger TS app, run a custom session repo, add your own pi tools alongside computer use, or react to per-event streams programmatically. +- **Reach for `kernel-agent-browser` instead** when you need deterministic browser scripting (semantic selectors, `find role`, `wait --text`, snapshots/refs). cua drives by screenshots; agent-browser drives by accessibility tree. +- **Reach for `kernel-typescript-sdk` instead** for raw Playwright/CDP control over a Kernel browser without an LLM in the loop. + +## Prerequisites + +- A Kernel account and API key (`KERNEL_API_KEY`). See the [`kernel-cli`](https://www.kernel.sh/docs) skill for install + auth. +- At least one model-provider API key, matched to the model you pick (table in "Model selection" below). +- Node 20+ for both the CLI install and the library. 
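The key matching described in the prerequisites can be checked before any browser is provisioned. The following is a minimal sketch, not part of cua: `missingKeysFor` and `PROVIDER_ENV` are hypothetical names, and the variable names are copied from this skill's environment-variable table. It only tests for presence and does not validate the keys.

```ts
// Provider prefix -> env vars that can satisfy it (copied from the skill's env table).
const PROVIDER_ENV: Record<string, readonly string[]> = {
  openai: ["OPENAI_API_KEY"],
  anthropic: ["ANTHROPIC_API_KEY", "ANTHROPIC_OAUTH_TOKEN"],
  google: ["GOOGLE_API_KEY", "GEMINI_API_KEY"],
  yutori: ["YUTORI_API_KEY"],
  tzafon: ["TZAFON_API_KEY"],
};

type Env = Record<string, string | undefined>;

// Hypothetical helper (an assumption, not a cua API): lists the credentials
// that are missing for a model ref such as "openai:gpt-5.5".
function missingKeysFor(modelRef: string, env: Env): string[] {
  const provider = modelRef.split(":")[0] ?? "";
  const missing: string[] = [];

  // The Kernel key is always required, whatever the provider.
  if (!env["KERNEL_API_KEY"]) missing.push("KERNEL_API_KEY");

  const candidates = PROVIDER_ENV[provider];
  if (candidates === undefined) {
    missing.push(`unknown provider "${provider}"`);
  } else if (!candidates.some((name) => Boolean(env[name]))) {
    // Any one of the listed variables is enough for that provider.
    missing.push(candidates.join(" or "));
  }
  return missing;
}
```

In a Node script this could be called as `missingKeysFor("openai:gpt-5.5", process.env)`, failing fast when the returned list is non-empty.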
+ +## Install + +### CLI + +```bash +# Global install — gives you the `cua` binary on $PATH +npm i -g @onkernel/cua-cli + +# Or zero-install one-shot +npx -y -p @onkernel/cua-cli cua --help +``` + +### Library + +```bash +npm i @onkernel/cua-agent @onkernel/cua-ai @onkernel/sdk +``` + +## Environment variables + +| Env | Used for | +| --- | --- | +| `KERNEL_API_KEY` | Kernel API key (always required) | +| `OPENAI_API_KEY` | OpenAI models (`-m openai:…`) | +| `ANTHROPIC_API_KEY` | Anthropic models (`-m anthropic:…`); `ANTHROPIC_OAUTH_TOKEN` also works | +| `GOOGLE_API_KEY` / `GEMINI_API_KEY` | Google / Gemini models (`-m google:…`) | +| `YUTORI_API_KEY` | Yutori Navigator (`-m yutori:…`) | +| `TZAFON_API_KEY` | Tzafon (`-m tzafon:…`) | +| `KERNEL_BASE_URL` | Override Kernel base URL | +| `XDG_DATA_HOME` | CLI sessions/transcripts dir (defaults to `~/.local/share`) | +| `CUA_IMAGE_PROTOCOL` | Force inline image protocol (`kitty` / `iterm2` / `none` / `auto`) | + +The library auto-loads these via `getCuaEnvApiKey` if you don't pass explicit auth callbacks. + +## CLI: one-shot subcommands + +Each call provisions a fresh Kernel browser by default, runs the action, prints a one-line result, and tears the browser down. Chain via `-s ` (next section) to keep state. + +| Subcommand | What it does | Stdout | Exit code | +| --- | --- | --- | --- | +| `cua open ` | Navigate to a URL. | `ok` | 0 ok, 2 error | +| `cua click ""` | Find element matching natural-language description and click it. | `ok clicked (x, y)` or `not_found ` | 0 ok, 1 not_found, 2 error | +| `cua type "" ""` | Focus a field by description and type. | `ok typed` or `not_found ` | 0 ok, 1 not_found, 2 error | +| `cua press [...]` | Send a key combo (`cua press ctrl l`, `cua press Return`). | `ok pressed` | 0 ok, 2 error | +| `cua url` | Print the current URL. | the URL | 0 ok, 2 error | +| `cua observe [""]` | Describe the page; optionally answer a question. 
| the description | 0 ok, 2 error | +| `cua screenshot --out ` | Save a PNG. `--out -` writes bytes to stdout. | the path or `(stdout)` | 0 ok, 2 error | +| `cua do ""` | Open-ended; agent plans and acts. Bound by `--max-steps` (default 3). | the assistant's final text | 0 ok, 2 error | + +Useful flags: + +- `-m ` — pick the LLM (default `openai:gpt-5.5`). `cua models` to list. +- `--max-steps ` — bound the loop on `cua do`. +- `--profile ` — load a Kernel browser profile for persisted cookies / storage. Existing ids or names are reused; a non-id name is created if missing. Pass `--profile-no-save-changes` for read-only. +- `-v` — verbose progress on stderr (provisioning, tool calls, transcript path). + +`click` and `type` match **semantically**, not by selector — use natural-language descriptions of what's visible on screen. + +## CLI: named sessions + +Without `-s`, each subcommand provisions a brand-new browser. To keep state (cookies, URL, scroll position) across calls, allocate a named session first: + +```bash +cua --profile github session start login # provisions a Kernel browser, prints `name=login` +cua -s login open https://github.com/login +cua -s login type "email field" "$EMAIL" +cua -s login type "password field" "$PASSWORD" +cua -s login click "Sign in" +cua -s login url # prints post-login URL +cua session stop login # tears down the Kernel browser +``` + +Inspect: + +```bash +cua session list # NAME / KERNEL_ID / AGE / LIVE_URL +cua session show login # full JSON metadata +``` + +Pass `--profile` when starting the named session; later `cua -s …` calls attach to the same browser, so they don't need the profile flag. + +**Liveness**: Kernel browsers time out from inactivity. If you see `error session "" is no longer alive on Kernel …`, run `cua session stop && cua --profile session start ` to re-provision with the same persisted profile. + +Named-session metadata lives in `$XDG_DATA_HOME/cua/named-sessions/.json`. 
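The exit codes and one-line results listed in the one-shot subcommand table lend themselves to a small parser when cua is driven from a TypeScript script. The sketch below is an illustration under assumptions: `interpretCuaResult` and `clickedAt` are hypothetical helpers, the click coordinates are assumed to be integers, and whatever follows `not_found` is treated as free text.

```ts
type CuaOutcome = {
  status: "ok" | "not_found" | "error";
  detail: string;
};

// Maps the documented exit codes (0 ok, 1 not_found, 2 error) to a status.
// Any code other than 0 or 1 is treated as an error here (an assumption).
function interpretCuaResult(exitCode: number, stdout: string): CuaOutcome {
  const line = stdout.trim();
  if (exitCode === 0) return { status: "ok", detail: line };
  if (exitCode === 1) {
    // Drop the leading "not_found" token and keep the rest as the detail.
    return { status: "not_found", detail: line.replace(/^not_found\s*/, "") };
  }
  return { status: "error", detail: line };
}

// Extracts coordinates from a result shaped like "ok clicked (x, y)".
// Returns null for any other line, e.g. "ok typed".
function clickedAt(stdout: string): { x: number; y: number } | null {
  const match = /^ok clicked \((\d+),\s*(\d+)\)/.exec(stdout.trim());
  return match ? { x: Number(match[1]), y: Number(match[2]) } : null;
}
```

A caller would pass the exit status and captured stdout of a `cua click` or `cua type` invocation, then branch on `status` rather than string-matching in several places.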
+ +## CLI: free-form mode + +```bash +cua --print "open hn and tell me the top story" # one-shot, streams text +cua --print -o jsonl "..." # one-shot, streams JSONL events +cua "..." # interactive TUI (real terminal) +``` + +`--print` exits when the agent finishes; the TUI runs until Ctrl+C. Add `--jsonl-include-deltas` for token deltas, `--jsonl-include-images` for base64 screenshots in `tool_result` events. + +## Library: quick start with `CuaAgentHarness` + +The harness is the recommended entry point. It owns the session, persists every turn, handles steering / follow-up, and can swap providers mid-conversation. + +```ts +import Kernel from "@onkernel/sdk"; +import { + CuaAgentHarness, + InMemorySessionRepo, + NodeExecutionEnv, +} from "@onkernel/cua-agent"; +import type { AssistantMessage } from "@onkernel/cua-ai"; + +const client = new Kernel({ apiKey: process.env.KERNEL_API_KEY! }); +const browser = await client.browsers.create({ stealth: true }); + +const repo = new InMemorySessionRepo(); +const session = await repo.create({ id: "research" }); + +const harness = new CuaAgentHarness({ + browser, + client, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + model: "openai:gpt-5.5", + session, +}); + +const textOf = (m: AssistantMessage) => + m.content.flatMap((b) => (b.type === "text" ? [b.text] : [])).join("").trim(); + +const first = await harness.prompt("Open example.com and describe what you see."); +console.log(textOf(first)); + +// Swap providers mid-session — CUA tools and the default prompt refresh. +await harness.setModel("anthropic:claude-opus-4-7"); +const second = await harness.prompt("Open the most relevant link from what you found."); +console.log(textOf(second)); + +await client.browsers.deleteByID(browser.session_id); +``` + +While a turn is running: `steer()` injects course corrections, `followUp()` queues the next instruction, `subscribe()` streams underlying agent events, and `compact()` collapses long transcripts. 
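The `textOf` helper in the quick start keeps only the text blocks of an assistant message and skips everything else. The sketch below shows that behavior in isolation. `AssistantMessageLike` and its block types are stand-ins declared locally for illustration; the real `AssistantMessage` type comes from `@onkernel/cua-ai` and may carry more fields and block kinds.

```ts
// Stand-in types for illustration only (assumptions, not the library's types).
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; name: string };
type AssistantMessageLike = { content: Array<TextBlock | ToolUseBlock> };

// Same logic as the quick start's textOf: collect text blocks, join, trim.
const textOf = (m: AssistantMessageLike): string =>
  m.content.flatMap((b) => (b.type === "text" ? [b.text] : [])).join("").trim();

// A made-up message mixing text with a tool call.
const sample: AssistantMessageLike = {
  content: [
    { type: "text", text: "The page shows " },
    { type: "tool_use", name: "computer" },
    { type: "text", text: "an Example Domain heading. " },
  ],
};

// The tool_use block is dropped and surrounding whitespace is trimmed.
const summary = textOf(sample);
console.log(summary); // prints "The page shows an Example Domain heading."
```

Returning an empty array for non-text blocks lets `flatMap` filter and map in a single pass, which is why the helper needs no separate `filter` step.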
+ +### When to use `CuaAgent` instead + +Reach for `CuaAgent` (extends pi `Agent`) when you want raw control — direct `state.messages` access, custom streaming, explicit prompt/continue/queue, no session repo. The shape is the same except you assign `agent.state.model = …` instead of calling `setModel()`. + +```ts +import { CuaAgent } from "@onkernel/cua-agent"; + +const agent = new CuaAgent({ + browser, + client, + initialState: { + model: "openai:gpt-5.5", + systemPrompt: "You are a careful browser automation agent.", + }, +}); + +agent.subscribe((event) => { /* … */ }); +await agent.prompt("Open news.ycombinator.com and summarize the top story."); +``` + +### CLI vs library vs raw `CuaAgent` + +| You want to … | Use | +| --- | --- | +| Drive cua from shell scripts | CLI | +| Open-ended TUI session | CLI (`cua` no args) | +| Embed cua inside a TS app with session-backed turns | `CuaAgentHarness` | +| Add your own pi tools alongside computer use | `CuaAgentHarness` (`extraTools`) or `CuaAgent` | +| Raw pi `Agent` semantics: own message state, lifecycle events | `CuaAgent` | + +## Model selection + +Run `cua models` (or `listCuaModels()` from `@onkernel/cua-ai`) for the current catalog. As of writing, the four supported providers and their built-in computer-use vocabularies: + +| Model ref | Provider | Notes | +| --- | --- | --- | +| `openai:gpt-5.5` | OpenAI | Built-in `computer` tool; default in CLI. | +| `anthropic:claude-opus-4-7` | Anthropic | Built-in `computer_20251124` tool. Supports `--thinking` levels. | +| `google:gemini-3-flash-preview` | Google | Predefined computer-use functions with 0–1000 normalized coords. | +| `yutori:n1.5-latest` | Yutori | OpenAI-compatible chat with browser action tool calls. | + +Switching models mid-turn: + +- CLI: re-run with `-m `, or attach a `-s` named session with a different `-m` per call. +- Library (harness): `await harness.setModel("anthropic:claude-opus-4-7")` — CUA tools and the default system prompt refresh. 
+- Library (agent): assign `agent.state.model = "anthropic:claude-opus-4-7"`. + +Not every provider's native vocabulary includes navigation. Pass `computerUseExtra: true` to add the provider-neutral `computer_use_extra` tool (`goto`, `back`, `forward`, `url`) when you need it on a model that lacks built-in navigation. + +## Browser config + +Provision the underlying Kernel browser to match the task before handing it to cua: + +```ts +const browser = await client.browsers.create({ + stealth: true, // bypass most fingerprinting; default off + headless: false, // headful => live view URL; smaller image when headless + timeout: 1800, // seconds before the Kernel browser auto-times-out + profile: { name: "github", save_changes: true }, // load + save persisted state + // proxy: { ... }, // optional outbound proxy +}); +``` + +The CLI exposes equivalents via `--profile`, `--profile-no-save-changes`, and the underlying Kernel CLI flags (the cua CLI itself doesn't surface a `--stealth` flag yet — when stealth matters, use the library or pre-create the browser via `kernel browsers create` and reuse the session). + +## Adding your own tools + +```ts +import { CuaAgentHarness } from "@onkernel/cua-agent"; +import { tool } from "@earendil-works/pi-agent-core"; + +const lookupOrder = tool({ + name: "lookup_order", + description: "Look up an order by id in our DB.", + schema: { /* … */ }, + handler: async ({ orderId }) => { + return await db.orders.findOne(orderId); + }, +}); + +const harness = new CuaAgentHarness({ + browser, client, + model: "openai:gpt-5.5", + session, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + extraTools: [lookupOrder], + computerUseExtra: true, +}); +``` + +Use `createCuaComputerTools()` directly if you want to compose the tool list yourself (e.g. 
wrap computer-use tools in a permission gate): + +```ts +import { resolveCuaRuntimeSpec } from "@onkernel/cua-ai"; +import { createCuaComputerTools } from "@onkernel/cua-agent"; + +const runtime = resolveCuaRuntimeSpec("openai:gpt-5.5"); +const tools = [ + ...createCuaComputerTools({ browser, client, toolExecutors: runtime.toolExecutors }), + lookupOrder, +]; +``` + +## Live view URL and manual login fallback + +cua's `--profile` (CLI) and `profile` (library) handle most login persistence, but stealth doesn't always beat bot detection. When automation gets stuck on a login, hand off to a human via the live view URL. + +### CLI + +```bash +cua --profile mysite session start login +cua session show login | jq -r .live_url # share this URL with the user +# user logs in manually in their browser via the live view +cua -s login url # confirm the post-login URL +cua session stop login # profile state saves on teardown +``` + +### Library + +Every Kernel browser response carries the live view URL on creation: + +```ts +const browser = await client.browsers.create({ stealth: true, headless: false }); +console.log("live view:", browser.browser_live_view_url); +// share that URL, wait for the user to finish manual login, then prompt the agent +``` + +If you only have a session id, fetch it: + +```bash +kernel browsers view +``` + +## Cross-origin iframes / Playwright escape hatch + +cua drives by clicking pixels, so cross-origin iframes (payment forms, embedded vendor widgets) work in the screenshot flow without special handling — the model just clicks them. When you need a deterministic Playwright action against the underlying browser (e.g. 
to fill a card form via a fixed selector), break out to Kernel's exec endpoint with the session id: + +```bash +# CLI: find the session id +cua session show login | jq -r .kernel_session_id + +# Run a Playwright snippet against the same browser +kernel browsers exec --code " + const frame = page.frameLocator('#payment-iframe'); + await frame.locator('#card-number').fill('4111111111111111'); + await frame.locator('#submit').click(); +" +``` + +From the library, you already have `browser.session_id` and the Kernel client, so call into the SDK directly. + +## Debugging + +- **CLI verbose**: `cua -v --print "…"` writes provisioning info, tool calls, and the transcript path to stderr. +- **Live event stream**: `cua --print -o jsonl "…"` emits one event per line (`tool_call`, `tool_result`, `assistant_text_done`, etc.). Add `--jsonl-include-images` to inline screenshots in `tool_result`. +- **Persisted transcript**: every `--print`, TUI, and `-s ` invocation appends to `$XDG_DATA_HOME/cua/sessions//.jsonl`. Exact path: + ```bash + cua -v --print "..." # stderr includes: [cua] session= + cua session show login | jq -r .transcript_path + ``` + Roles: `user`, `assistant`, `toolResult`. There's also a custom `cua-browser` entry written once per session with `kernel_session_id` / `live_url` / `profile_id`. +- **Library event subscription**: + ```ts + harness.subscribe((event) => { + // event.type === "tool_call" | "tool_result" | "assistant_text_done" | ... + }); + ``` +- **Screenshots**: `cua screenshot --out shot.png` (CLI) or inspect the `image` blocks in `toolResult` transcript entries. +- **Page URL**: `cua url` to confirm post-action navigation. `agent.state.messages` (library) holds the full message history. + +A couple of `jq` starters against a transcript path: + +```bash +# Every tool call the agent made, in order +jq -c 'select(.role == "assistant") | .content[]? 
+ | select(.type == "tool_use") | {name, input}' "$TRANSCRIPT" + +# Final assistant text (the answer) +jq -r 'select(.role == "assistant") | .content[]? + | select(.type == "text") | .text' "$TRANSCRIPT" | tail -1 +``` + +## Gotchas + +- **Element descriptions are semantic, not selectors.** `cua click "Sign in button"` looks at the screenshot — describe what the user sees, not a CSS selector. +- **Viewport defaults to 1920x1080.** Resize via `client.browsers.create({ ... })` flags if you need something else. +- **Keyboard navigation > mouse-wheel scroll.** `cua press Page_Down` / `Home` / arrow keys is more reliable than scroll wheel via the LLM. +- **Multi-step state requires `-s` (CLI) or a session-backed harness (library).** A second one-shot subcommand can't see what the first one did. +- **Profile saves on close, not continuously.** Tear down cleanly (`cua session stop`, `client.browsers.deleteByID`) or you'll lose recent state. +- **Provider tool vocab gaps.** If a model can click and type but can't navigate, set `computerUseExtra: true` (library) or pick a different model. +- **`--max-steps` defaults to 3 on `cua do`.** Bump it for non-trivial tasks. + +## Quick reference + +```bash +# CLI quickstart — one-shot, fresh browser +cua --print "open hn and tell me the top story" + +# CLI — named session for multi-step +cua --profile mysite session start work +cua -s work open https://example.com +cua -s work click "Log in" +cua -s work type "email field" "$EMAIL" +cua -s work click "Submit" +cua -s work url +cua session stop work + +# CLI — list models, switch model per call +cua models +cua --print -m anthropic:claude-opus-4-7 "..." 
+ +# Get the live view URL +cua session show work | jq -r .live_url +kernel browsers view # alternative + +# Library — minimal harness +import { CuaAgentHarness, InMemorySessionRepo, NodeExecutionEnv } from "@onkernel/cua-agent"; +const session = await new InMemorySessionRepo().create({ id: "main" }); +const harness = new CuaAgentHarness({ + browser, client, session, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + model: "openai:gpt-5.5", +}); +const result = await harness.prompt("Open example.com and click the first link."); +``` From bb8b3b3da0132e4e5dd97e8f83cf44ed4f442f57 Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Sun, 21 Jun 2026 14:58:48 +0000 Subject: [PATCH 2/4] cua skill: fix self-review findings - Browser config: CLI hardcodes stealth-on; library is the opt-out path. - Adding your own tools: drop unverified `tool()` helper, point at pi-agent-core's AgentTool shape instead. - Cross-origin section: tighten library escape-hatch sentence. - Quick reference: split the trailing TS example into its own ts fence. - Named-session relaunch tip: clarify "same profile as before". Co-Authored-By: Claude Opus 4.7 --- plugins/cua/skills/cua/SKILL.md | 34 ++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/plugins/cua/skills/cua/SKILL.md b/plugins/cua/skills/cua/SKILL.md index 3b3d4cb..f67481e 100644 --- a/plugins/cua/skills/cua/SKILL.md +++ b/plugins/cua/skills/cua/SKILL.md @@ -106,7 +106,7 @@ cua session show login # full JSON metadata Pass `--profile` when starting the named session; later `cua -s …` calls attach to the same browser, so they don't need the profile flag. -**Liveness**: Kernel browsers time out from inactivity. If you see `error session "" is no longer alive on Kernel …`, run `cua session stop && cua --profile session start ` to re-provision with the same persisted profile. +**Liveness**: Kernel browsers time out from inactivity. 
If you see `error session "" is no longer alive on Kernel …`, run `cua session stop && cua --profile session start ` to re-provision with the same persisted profile. Named-session metadata lives in `$XDG_DATA_HOME/cua/named-sessions/.json`. @@ -214,34 +214,31 @@ Not every provider's native vocabulary includes navigation. Pass `computerUseExt ## Browser config -Provision the underlying Kernel browser to match the task before handing it to cua: +The CLI always provisions stealth-on browsers and exposes profile persistence via `--profile` / `--profile-no-save-changes`. For any other browser knob — non-stealth, custom viewport, proxy, custom timeout — use the library and provision the browser yourself: ```ts const browser = await client.browsers.create({ - stealth: true, // bypass most fingerprinting; default off - headless: false, // headful => live view URL; smaller image when headless + stealth: true, // CLI hardcodes this on; flip to false only via the library + headless: false, // headful => live view URL; headless => no live view, smaller image timeout: 1800, // seconds before the Kernel browser auto-times-out profile: { name: "github", save_changes: true }, // load + save persisted state // proxy: { ... }, // optional outbound proxy }); ``` -The CLI exposes equivalents via `--profile`, `--profile-no-save-changes`, and the underlying Kernel CLI flags (the cua CLI itself doesn't surface a `--stealth` flag yet — when stealth matters, use the library or pre-create the browser via `kernel browsers create` and reuse the session). +If you need a custom-provisioned browser from the CLI, pre-create it with `kernel browsers create` and attach via `cua session …` — see the kernel-cli skill for the create flag reference. ## Adding your own tools +Pass any pi `AgentTool` (see [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the tool shape) via `extraTools`. The CUA defaults stay installed; your tools run alongside them. 
+ ```ts +import type { AgentTool } from "@onkernel/cua-agent"; import { CuaAgentHarness } from "@onkernel/cua-agent"; -import { tool } from "@earendil-works/pi-agent-core"; - -const lookupOrder = tool({ - name: "lookup_order", - description: "Look up an order by id in our DB.", - schema: { /* … */ }, - handler: async ({ orderId }) => { - return await db.orders.findOne(orderId); - }, -}); + +const lookupOrder: AgentTool = { + // shape per pi-agent-core docs: name, description, schema, run, ... +}; const harness = new CuaAgentHarness({ browser, client, @@ -312,7 +309,7 @@ kernel browsers exec --code " " ``` -From the library, you already have `browser.session_id` and the Kernel client, so call into the SDK directly. +From the library, you already have `browser.session_id` and the Kernel client — call the same exec endpoint via the SDK. ## Debugging @@ -377,9 +374,12 @@ cua --print -m anthropic:claude-opus-4-7 "..." # Get the live view URL cua session show work | jq -r .live_url kernel browsers view # alternative +``` -# Library — minimal harness +```ts +// Library — minimal harness import { CuaAgentHarness, InMemorySessionRepo, NodeExecutionEnv } from "@onkernel/cua-agent"; + const session = await new InMemorySessionRepo().create({ id: "main" }); const harness = new CuaAgentHarness({ browser, client, session, From b4f25bc0355376618d7ff324fa13c35c7db9ed9b Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Sun, 21 Jun 2026 15:48:39 +0000 Subject: [PATCH 3/4] split cua plugin into cua-cli + cua-agent skills Skills are loaded into coding-agent context, and the two audiences are distinct: agents driving the cua binary from shell vs humans asking Claude to write TS apps on @onkernel/cua-agent. Mirrors the repo's existing CLI-vs-SDK split (kernel-agent-browser vs kernel-typescript-sdk). - cua-cli: shell-callable subcommands, named sessions, profile persistence, live-view handoff, Playwright escape hatch, debugging. 
- cua-agent: CuaAgent / CuaAgentHarness quick start, browser provisioning, extraTools, setModel switching, SDK escape hatch, subscribe-based debugging.

Plugin manifest description and the kernel/skills README list both.

Co-Authored-By: Claude Opus 4.7
---
 README.md                              |   5 +-
 plugins/cua/.claude-plugin/plugin.json |   2 +-
 plugins/cua/skills/cua-agent/SKILL.md  | 293 +++++++++++++++++++
 plugins/cua/skills/cua-cli/SKILL.md    | 217 ++++++++++++++
 plugins/cua/skills/cua/SKILL.md        | 390 -------------------------
 5 files changed, 514 insertions(+), 393 deletions(-)
 create mode 100644 plugins/cua/skills/cua-agent/SKILL.md
 create mode 100644 plugins/cua/skills/cua-cli/SKILL.md
 delete mode 100644 plugins/cua/skills/cua/SKILL.md

diff --git a/README.md b/README.md
index 9860269..d2c203d 100644
--- a/README.md
+++ b/README.md
@@ -89,11 +89,12 @@ SDK skills for building browser automation with TypeScript and Python.
 
 ### cua
 
-Computer-use loop for Kernel cloud browsers — CLI for shell-driven automation and the `@onkernel/cua-agent` TS library for embedding in your own agents.
+Computer-use loop for Kernel cloud browsers — CLI for shell-driven automation and the `@onkernel/cua-agent` TS library for embedding in your own agents. One plugin, two skills (load whichever matches the task).
| Skill | Description | |-------|-------------| -| **cua** | Drive Kernel cua via the `cua` CLI (one-shot subcommands, named sessions, TUI) or the `@onkernel/cua-agent` library (`CuaAgent` / `CuaAgentHarness`); covers model selection, profile persistence, transcripts, live-view handoff, and Playwright escape hatches | +| **cua-cli** | Drive a Kernel browser from shell via the `cua` binary: one-shot subcommands, named sessions, TUI, profile persistence, transcripts, live-view handoff | +| **cua-agent** | Build TypeScript apps that embed Kernel cua's loop with `CuaAgent` / `CuaAgentHarness`: provider switching, custom tools, session repos, event-stream debugging | ### generate-video diff --git a/plugins/cua/.claude-plugin/plugin.json b/plugins/cua/.claude-plugin/plugin.json index 273978e..436bab5 100644 --- a/plugins/cua/.claude-plugin/plugin.json +++ b/plugins/cua/.claude-plugin/plugin.json @@ -1,7 +1,7 @@ { "name": "cua", "version": "1.0.0", - "description": "Drive Kernel cua: the `cua` CLI for shell-driven computer-use automation, and the @onkernel/cua-agent TS library for building your own computer-use agents on Kernel browsers", + "description": "Kernel cua skills: `cua-cli` for shell-driven computer-use automation via the `cua` binary, and `cua-agent` for building TypeScript apps on the @onkernel/cua-agent library (CuaAgent / CuaAgentHarness)", "author": { "name": "Kernel", "url": "www.kernel.sh" diff --git a/plugins/cua/skills/cua-agent/SKILL.md b/plugins/cua/skills/cua-agent/SKILL.md new file mode 100644 index 0000000..322f908 --- /dev/null +++ b/plugins/cua/skills/cua-agent/SKILL.md @@ -0,0 +1,293 @@ +--- +name: cua-agent +description: Build TypeScript apps that embed Kernel's computer-use loop with `@onkernel/cua-agent` — `CuaAgent` and `CuaAgentHarness` classes drive a Kernel cloud browser via prompt → screenshot → tool-call loops across OpenAI, Anthropic, Google, and Yutori provider tools. 
Use when writing TS code that needs computer-use against a Kernel browser, swapping providers mid-session, adding your own pi tools alongside computer use, or hooking into the agent event stream. For shell-callable cua, see `cua-cli`. +--- + +# cua-agent + +`@onkernel/cua-agent` ships two TS classes for running a computer-use loop against a Kernel cloud browser: + +- **`CuaAgentHarness`** — recommended entry point. Session-backed turns, `setModel` mid-conversation, steering / follow-up, `subscribe()` event stream. Extends pi-agent-core's `AgentHarness`. +- **`CuaAgent`** — lower-level. Direct `state.messages` access, custom streaming, explicit prompt/continue/queue. Extends pi-agent-core's `Agent`. + +Both translate per-provider computer-use tool calls (OpenAI's `computer`, Anthropic's `computer_20251124`, Gemini's normalized-coordinate functions, Yutori Navigator's browser actions) into Kernel SDK `browsers.computer.*` calls and feed a fresh screenshot back to the model on every turn. + +## When to use this skill + +- **Use this skill** when writing TS code that embeds cua inside a larger app, needs a custom session repo, runs its own pi tools alongside computer use, or reacts to per-event streams programmatically. +- **Reach for [`cua-cli`](../cua-cli/SKILL.md)** when shell-callable computer-use is enough (`cua open`, `cua click`, `cua do`). +- **Reach for `kernel-typescript-sdk`** for raw Playwright / CDP control over a Kernel browser without an LLM in the loop. + +## Prerequisites + +- A Kernel account and `KERNEL_API_KEY`. +- At least one model-provider API key, matched to the model you pick (table below). +- Node 20+, TypeScript app or `tsx` runner. + +## Install + +```bash +npm i @onkernel/cua-agent @onkernel/cua-ai @onkernel/sdk +``` + +The three packages divide responsibility: + +- `@onkernel/cua-agent` — `CuaAgent` / `CuaAgentHarness` execution loop. 
+- `@onkernel/cua-ai` — model catalog (`getCuaModel` / `listCuaModels`), canonical CUA tool schemas, per-provider adapters. +- `@onkernel/sdk` — Kernel SDK client used to provision the browser. + +Both classes re-export the full pi-agent-core surface from `@onkernel/cua-agent`, including `NodeExecutionEnv` (via the `/node` subpath under the hood) and `InMemorySessionRepo`. Import them from `@onkernel/cua-agent` directly. + +## Environment variables + +If you don't pass explicit auth callbacks, both classes resolve provider keys via `@onkernel/cua-ai`'s `getCuaEnvApiKey`: + +| Env | Used for | +| --- | --- | +| `KERNEL_API_KEY` | Kernel API key (always required) | +| `OPENAI_API_KEY` | `openai:…` models | +| `ANTHROPIC_API_KEY` or `ANTHROPIC_OAUTH_TOKEN` | `anthropic:…` models | +| `GOOGLE_API_KEY` or `GEMINI_API_KEY` | `google:…` models | +| `YUTORI_API_KEY` | `yutori:…` models | +| `TZAFON_API_KEY` | `tzafon:…` models | + +## Quick start — `CuaAgentHarness` + +```ts +import Kernel from "@onkernel/sdk"; +import { + CuaAgentHarness, + InMemorySessionRepo, + NodeExecutionEnv, +} from "@onkernel/cua-agent"; +import type { AssistantMessage } from "@onkernel/cua-ai"; + +const client = new Kernel({ apiKey: process.env.KERNEL_API_KEY! }); +const browser = await client.browsers.create({ stealth: true }); + +const repo = new InMemorySessionRepo(); +const session = await repo.create({ id: "research" }); + +const harness = new CuaAgentHarness({ + browser, + client, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + model: "openai:gpt-5.5", + session, +}); + +const textOf = (m: AssistantMessage) => + m.content.flatMap((b) => (b.type === "text" ? [b.text] : [])).join("").trim(); + +const first = await harness.prompt("Open example.com and describe what you see."); +console.log(textOf(first)); + +// Swap providers mid-session — CUA tools and the default prompt refresh. 
+await harness.setModel("anthropic:claude-opus-4-7"); +const second = await harness.prompt("Open the most relevant link from what you found."); +console.log(textOf(second)); + +await client.browsers.deleteByID(browser.session_id); +``` + +While a turn is running: `steer()` injects course corrections, `followUp()` queues the next instruction, `subscribe()` streams underlying agent events, and `compact()` collapses long transcripts. See [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the full harness lifecycle. + +## `CuaAgent` for raw pi `Agent` semantics + +Reach for `CuaAgent` when you want direct control — `state.messages` access, custom streaming, explicit prompt/continue/queue, no session repo. The constructor shape is the same, except that you assign `agent.state.model = …` instead of calling `setModel()`. + +```ts +import { CuaAgent } from "@onkernel/cua-agent"; + +const agent = new CuaAgent({ + browser, + client, + initialState: { + model: "openai:gpt-5.5", + systemPrompt: "You are a careful browser automation agent.", + }, +}); + +agent.subscribe((event) => { /* … */ }); +await agent.prompt("Open news.ycombinator.com and summarize the top story."); +``` + +### Harness vs Agent + +| You want to … | Use | +| --- | --- | +| Session-backed turns persisted to a repo | `CuaAgentHarness` | +| Steering, follow-up queue, compaction, branching | `CuaAgentHarness` | +| `await setModel()` mid-conversation | `CuaAgentHarness` | +| Direct `state.messages` access, no session machinery | `CuaAgent` | +| Custom streaming + explicit `prompt`/`continue`/`queue` control | `CuaAgent` | + +## Model selection and switching + +Run `listCuaModels()` from `@onkernel/cua-ai` for the current catalog. Pass either a CUA model ref (e.g. `"openai:gpt-5.5"`) or a concrete pi `Model`; the same `model` field accepts either form.
+ +| Model ref | Provider | Notes | +| --- | --- | --- | +| `openai:gpt-5.5` | OpenAI | Built-in `computer` tool | +| `anthropic:claude-opus-4-7` | Anthropic | Built-in `computer_20251124` tool | +| `google:gemini-3-flash-preview` | Google | Predefined CU functions, 0–1000 normalized coords | +| `yutori:n1.5-latest` | Yutori | OpenAI-compatible chat with browser action tool calls | + +Switching: + +```ts +// Harness — async, updates via pi snapshot machinery +await harness.setModel("anthropic:claude-opus-4-7"); + +// Agent — direct assignment +agent.state.model = "anthropic:claude-opus-4-7"; +``` + +In both cases CUA-owned tools and the default system prompt refresh for the next provider request. + +Not every provider's native vocabulary includes navigation (`goto`, `back`, `forward`, `url`). Pass `computerUseExtra: true` to add the provider-neutral `computer_use_extra` tool when the model can click/type but can't navigate. + +## Browser provisioning + +You own the Kernel browser lifecycle — provision before constructing the agent, tear down after: + +```ts +const browser = await client.browsers.create({ + stealth: true, // bypass most fingerprinting; default off + headless: false, // headful => live view URL; headless => no live view, smaller image + timeout: 1800, // seconds before Kernel auto-times-out the browser + profile: { name: "github", save_changes: true }, + // proxy: { ... }, +}); + +try { + // ... use browser with harness/agent ... +} finally { + await client.browsers.deleteByID(browser.session_id); +} +``` + +The `browser.browser_live_view_url` field on the create response is the URL to share when you need a human to take over (manual login on a stealth-blocked site, captcha, etc.). + +## Adding your own tools + +Pass any pi `AgentTool` (see [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the tool shape) via `extraTools`. The CUA defaults stay installed; your tools run alongside them. 
+ +```ts +import type { AgentTool } from "@onkernel/cua-agent"; +import { CuaAgentHarness } from "@onkernel/cua-agent"; + +const lookupOrder: AgentTool = { + // shape per pi-agent-core docs: name, description, schema, run, ... +}; + +const harness = new CuaAgentHarness({ + browser, client, + model: "openai:gpt-5.5", + session, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + extraTools: [lookupOrder], + computerUseExtra: true, +}); +``` + +If you want to compose the tool list yourself (e.g. wrap computer-use tools in a permission gate), reach for `createCuaComputerTools()`: + +```ts +import { resolveCuaRuntimeSpec } from "@onkernel/cua-ai"; +import { createCuaComputerTools } from "@onkernel/cua-agent"; + +const runtime = resolveCuaRuntimeSpec("openai:gpt-5.5"); +const tools = [ + ...createCuaComputerTools({ browser, client, toolExecutors: runtime.toolExecutors }), + lookupOrder, +]; +``` + +## Manual login handoff via live view URL + +Every Kernel browser response carries the live view URL on creation. When stealth doesn't beat bot detection, share that URL and wait for the human: + +```ts +const browser = await client.browsers.create({ + stealth: true, + headless: false, + profile: { name: "mysite", save_changes: true }, +}); +console.log("share with user:", browser.browser_live_view_url); + +// wait for user signal — e.g. a button, stdin, an HTTP callback — +// THEN start prompting the agent against the logged-in browser +await harness.prompt("Now click 'Settings' and read me the current value of X."); +``` + +Profile saves on browser teardown, so future runs with the same profile name skip the manual login. + +## Cross-origin iframes / Playwright escape hatch + +cua drives by clicking pixels, so cross-origin iframes work in the screenshot flow without special handling. When you need a deterministic Playwright action against the underlying browser (e.g. 
fill a card form via a fixed selector), drop to the Kernel SDK's exec endpoint with the session id you already have: + +```ts +await client.browsers.exec(browser.session_id, { + code: ` + const frame = page.frameLocator('#payment-iframe'); + await frame.locator('#card-number').fill('4111111111111111'); + await frame.locator('#submit').click(); + `, +}); +``` + +## Debugging + +- **`subscribe()`** — the harness and agent both stream pi-agent-core events. Use it to log tool calls, screenshot sizes, tokens: + ```ts + harness.subscribe((event) => { + if (event.type === "tool_call") console.log("tool:", event.toolName); + if (event.type === "assistant_text_done") console.log("text:", event.text); + }); + ``` +- **`agent.state.messages`** — full message history including image blocks (for `CuaAgent`). Inspect after a turn finishes. +- **Live view URL** — `browser.browser_live_view_url` lets you watch the agent work in real time (headful browsers only; headless browsers have no live view). +- **Custom session repo** — implement pi-agent-core's `SessionRepo` interface to persist transcripts wherever you want (JSONL on disk, S3, a DB). + +## Gotchas + +- **You own the browser lifecycle.** Always tear down with `client.browsers.deleteByID(browser.session_id)` in a `finally` block. Kernel timeouts will eventually reclaim the browser, but profile state saves on close, not continuously. +- **`setModel` is async.** It propagates through pi's snapshot machinery, so `await` it before the next `prompt()`. +- **Provider tool vocab gaps.** If a model can click and type but can't navigate, set `computerUseExtra: true` to add provider-neutral `goto` / `back` / `forward` / `url`. +- **`InMemorySessionRepo` is in-process only.** Reach for a persistent `SessionRepo` implementation if you need transcripts to survive restarts. +- **`extraTools` runs alongside CUA tools, not in place of them.** To replace the defaults, build the tool list with `createCuaComputerTools()` yourself.
+- **Stealth, headless, viewport, proxy** are all `browsers.create` flags — set them when provisioning, not on the harness. + +## Quick reference + +```ts +import Kernel from "@onkernel/sdk"; +import { + CuaAgentHarness, + InMemorySessionRepo, + NodeExecutionEnv, +} from "@onkernel/cua-agent"; + +const client = new Kernel({ apiKey: process.env.KERNEL_API_KEY! }); +const browser = await client.browsers.create({ stealth: true }); + +const session = await new InMemorySessionRepo().create({ id: "main" }); + +const harness = new CuaAgentHarness({ + browser, client, session, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + model: "openai:gpt-5.5", + computerUseExtra: true, +}); + +harness.subscribe((event) => { /* ... */ }); + +try { + const first = await harness.prompt("Open example.com and click the first link."); + await harness.setModel("anthropic:claude-opus-4-7"); + const second = await harness.prompt("Now extract the page title."); +} finally { + await client.browsers.deleteByID(browser.session_id); +} +``` diff --git a/plugins/cua/skills/cua-cli/SKILL.md b/plugins/cua/skills/cua-cli/SKILL.md new file mode 100644 index 0000000..527b826 --- /dev/null +++ b/plugins/cua/skills/cua-cli/SKILL.md @@ -0,0 +1,217 @@ +--- +name: cua-cli +description: Drive a Kernel cloud browser from the shell using the `cua` CLI. Use this skill when you need to open URLs, click elements, type into fields, take screenshots, or chain multi-step browser tasks across shell calls. Supports named sessions for stateful workflows, profile persistence for logins, transcript-based debugging, and live-view handoff when stealth fails. For building your own TS agent on top of cua, see `cua-agent`. +--- + +# cua-cli + +`cua` is a single-binary CLI that drives a real Chrome session running in Kernel. It's designed for agentic use: each subcommand returns a one-line result on stdout and a deterministic exit code, so you can chain calls together and parse the output. 
An LLM picks targets semantically from screenshots — there are no CSS selectors. + +## When to use this skill + +- **Use this skill** when you need shell-callable computer-use steps (`cua open`, `cua click`, `cua do …`), an interactive TUI, or want to chain browser actions in a shell pipeline. +- **Reach for [`cua-agent`](../cua-agent/SKILL.md)** when you're writing a TypeScript app that needs to embed cua's prompt → screenshot → tool-call loop programmatically. +- **Reach for `kernel-agent-browser`** when you need deterministic browser scripting (semantic selectors, `find role`, `wait --text`, accessibility-tree snapshots). +- **Reach for `kernel-cli`** for raw Kernel browser management (`kernel browsers create`, `kernel browsers exec`, profile / proxy CRUD). + +## Prerequisites + +- A Kernel account and `KERNEL_API_KEY`. See `kernel-cli` for install + auth. +- At least one model-provider API key, matched to the model you pick (table below). +- Node 20+ for the npm install. + +## Install + +```bash +# Global install — puts `cua` on $PATH +npm i -g @onkernel/cua-cli + +# Or zero-install one-shot +npx -y -p @onkernel/cua-cli cua --help +``` + +## Environment variables + +| Env | Used for | +| --- | --- | +| `KERNEL_API_KEY` | Kernel API key (always required) | +| `OPENAI_API_KEY` | OpenAI models (`-m openai:…`) | +| `ANTHROPIC_API_KEY` | Anthropic models (`-m anthropic:…`) | +| `GOOGLE_API_KEY` / `GEMINI_API_KEY` | Google / Gemini models (`-m google:…`) | +| `YUTORI_API_KEY` | Yutori Navigator (`-m yutori:…`) | +| `TZAFON_API_KEY` | Tzafon (`-m tzafon:…`) | +| `KERNEL_BASE_URL` | Override Kernel base URL | +| `XDG_DATA_HOME` | Sessions / transcripts dir (defaults to `~/.local/share`) | +| `CUA_IMAGE_PROTOCOL` | Force inline image protocol (`kitty` / `iterm2` / `none` / `auto`) | + +## One-shot subcommands + +Each call provisions a fresh Kernel browser by default, runs the action, prints a one-line result, and tears the browser down. 
Chain via `-s ` (next section) to keep state. + +| Subcommand | What it does | Stdout | Exit code | +| --- | --- | --- | --- | +| `cua open ` | Navigate to a URL. | `ok` | 0 ok, 2 error | +| `cua click ""` | Find element matching natural-language description and click it. | `ok clicked (x, y)` or `not_found ` | 0 ok, 1 not_found, 2 error | +| `cua type "" ""` | Focus a field by description and type. | `ok typed` or `not_found ` | 0 ok, 1 not_found, 2 error | +| `cua press [...]` | Send a key combo (`cua press ctrl l`, `cua press Return`). | `ok pressed` | 0 ok, 2 error | +| `cua url` | Print the current URL. | the URL | 0 ok, 2 error | +| `cua observe [""]` | Describe the page; optionally answer a question. | the description | 0 ok, 2 error | +| `cua screenshot --out ` | Save a PNG. `--out -` writes bytes to stdout. | the path or `(stdout)` | 0 ok, 2 error | +| `cua do ""` | Open-ended; agent plans and acts. Bound by `--max-steps` (default 3). | the assistant's final text | 0 ok, 2 error | + +Useful flags: + +- `-m ` — pick the LLM (default `openai:gpt-5.5`). `cua models` to list. +- `--max-steps ` — bound the loop on `cua do`. +- `--profile ` — load a Kernel browser profile for persisted cookies / storage. Existing ids or names are reused; a non-id name is created if missing. Pass `--profile-no-save-changes` for read-only. +- `-v` — verbose progress on stderr (provisioning, tool calls, transcript path). + +`click` and `type` match **semantically**, not by selector — use natural-language descriptions of what's visible on screen. + +The cua CLI always provisions **stealth-on** browsers. If you need non-stealth or a custom viewport / proxy, pre-create the browser via `kernel browsers create` and attach the cua session to it. + +## Named sessions + +Without `-s`, each subcommand provisions a brand-new browser. 
To keep state across calls, allocate a named session first: + +```bash +cua --profile github session start login # provisions a Kernel browser, prints `name=login` +cua -s login open https://github.com/login +cua -s login type "email field" "$EMAIL" +cua -s login type "password field" "$PASSWORD" +cua -s login click "Sign in" +cua -s login url # prints post-login URL +cua session stop login # tears down the Kernel browser +``` + +Inspect: + +```bash +cua session list # NAME / KERNEL_ID / AGE / LIVE_URL +cua session show login # full JSON metadata +``` + +Pass `--profile` when starting the named session; later `cua -s …` calls attach to the same browser, so they don't need the profile flag again. + +**Liveness**: Kernel browsers time out from inactivity. If you see `error session "" is no longer alive on Kernel …`, run `cua session stop && cua --profile session start ` to re-provision with the same persisted profile. + +Named-session metadata lives in `$XDG_DATA_HOME/cua/named-sessions/.json`. + +## Free-form mode + +```bash +cua --print "open hn and tell me the top story" # one-shot, streams text +cua --print -o jsonl "..." # one-shot, streams JSONL events +cua "..." # interactive TUI (real terminal) +``` + +`--print` exits when the agent finishes; the TUI runs until Ctrl+C. Add `--jsonl-include-deltas` for token deltas, `--jsonl-include-images` for base64 screenshots in `tool_result` events. + +## Model selection + +Run `cua models` for the current catalog. Pick with `-m ` (default `openai:gpt-5.5`). Switch per call or per named session. + +| Model ref | Provider | +| --- | --- | +| `openai:gpt-5.5` | OpenAI (default) | +| `anthropic:claude-opus-4-7` | Anthropic (supports `--thinking off\|minimal\|low\|medium\|high\|xhigh`) | +| `google:gemini-3-flash-preview` | Google / Gemini | +| `yutori:n1.5-latest` | Yutori Navigator | + +Not every provider's native vocabulary includes navigation. 
If a model can click and type but can't navigate (`goto`, `back`, `forward`, `url`), pick a different model. + +## Live view URL and manual login fallback + +Stealth-on doesn't always beat bot detection. When automation gets stuck on a login, hand off to a human via the live view URL. + +```bash +cua --profile mysite session start login +cua session show login | jq -r .live_url # share this URL with the user +# user logs in manually in the live view +cua -s login url # confirm post-login URL +cua session stop login # profile state saves on teardown +``` + +If you only have a session id (e.g. from `cua session list`), the `kernel` CLI also surfaces it: + +```bash +kernel browsers view +``` + +## Cross-origin iframes / Playwright escape hatch + +cua drives by clicking pixels, so cross-origin iframes (payment forms, embedded vendor widgets) work in the screenshot flow without special handling — the model just clicks them. When you need a deterministic Playwright action against the underlying browser (e.g. fill a card form via a fixed selector), break out to Kernel's exec endpoint with the session id: + +```bash +# Find the session id +cua session show login | jq -r .kernel_session_id + +# Run a Playwright snippet against the same browser +kernel browsers exec --code " + const frame = page.frameLocator('#payment-iframe'); + await frame.locator('#card-number').fill('4111111111111111'); + await frame.locator('#submit').click(); +" +``` + +## Debugging + +- **Verbose stderr**: `cua -v --print "…"` writes provisioning info, tool calls, and the transcript path to stderr. +- **Live event stream**: `cua --print -o jsonl "…"` emits one event per line (`tool_call`, `tool_result`, `assistant_text_done`, etc.). Add `--jsonl-include-images` to inline screenshots in `tool_result`. +- **Persisted transcript**: every `--print`, TUI, and `-s ` invocation appends to `$XDG_DATA_HOME/cua/sessions//.jsonl`. Find the exact path: + ```bash + cua -v --print "..." 
# stderr includes: [cua] session= + cua session show login | jq -r .transcript_path + ``` + Roles: `user`, `assistant`, `toolResult`. There's also a custom `cua-browser` entry written once per session with `kernel_session_id` / `live_url` / `profile_id`. +- **Screenshots**: `cua screenshot --out shot.png` or inspect `image` blocks in `toolResult` transcript entries. +- **Page URL**: `cua url` to confirm post-action navigation. + +A few `jq` starters against a transcript path: + +```bash +# Every tool call the agent made, in order +jq -c 'select(.role == "assistant") | .content[]? + | select(.type == "tool_use") | {name, input}' "$TRANSCRIPT" + +# Final assistant text (the answer) +jq -r 'select(.role == "assistant") | .content[]? + | select(.type == "text") | .text' "$TRANSCRIPT" | tail -1 +``` + +## Gotchas + +- **Element descriptions are semantic, not selectors.** `cua click "Sign in button"` looks at the screenshot — describe what the user sees, not a CSS selector. +- **Viewport defaults to 1920x1080.** Pre-create the browser with `kernel browsers create` if you need something else. +- **Keyboard navigation > mouse-wheel scroll.** `cua press Page_Down` / `Home` / arrow keys is more reliable than scroll wheel via the LLM. +- **Multi-step state requires `-s `.** A second one-shot subcommand can't see what the first one did. +- **Profile saves on close, not continuously.** Tear down cleanly with `cua session stop` or you'll lose recent state. +- **`--max-steps` defaults to 3 on `cua do`.** Bump it for non-trivial tasks. 
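Because `cua click` and `cua type` distinguish "not found" (exit 1) from a hard error (exit 2), a wrapper script can branch on the exit code rather than parse stdout. The following is a minimal sketch, not part of the cua CLI: `click_or_report` is a hypothetical helper name, and it assumes only the `cua -s <name> click "<description>"` form and the exit codes listed in the one-shot subcommand table.

```bash
# Hypothetical helper (not shipped with cua): click an element inside a named
# session and branch on the documented exit codes (0 ok, 1 not_found, 2 error).
click_or_report() {
  local session="$1" description="$2" out status
  # Capture stdout and the exit code without tripping `set -e`.
  if out="$(cua -s "$session" click "$description")"; then
    status=0
  else
    status=$?
  fi
  case "$status" in
    0) echo "clicked: $out" ;;                          # e.g. "ok clicked (x, y)"
    1) echo "not found: $description" >&2 ;;            # element not on screen
    *) echo "cua error (exit $status)" >&2 ;;           # provisioning/model error
  esac
  return "$status"
}
```

One way to use it is to capture a screenshot when the click fails, for example `click_or_report work "Log in" || cua -s work screenshot --out debug.png`, so the failing page state is available for inspection.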
+ +## Quick reference + +```bash +# One-shot, fresh browser +cua --print "open hn and tell me the top story" + +# Named session for multi-step +cua --profile mysite session start work +cua -s work open https://example.com +cua -s work click "Log in" +cua -s work type "email field" "$EMAIL" +cua -s work click "Submit" +cua -s work url +cua session stop work + +# List models, switch model per call +cua models +cua --print -m anthropic:claude-opus-4-7 "..." + +# Get the live view URL +cua session show work | jq -r .live_url +kernel browsers view # alternative + +# Drop to Playwright for deterministic actions +cua session show work | jq -r .kernel_session_id +kernel browsers exec --code "..." +``` diff --git a/plugins/cua/skills/cua/SKILL.md b/plugins/cua/skills/cua/SKILL.md deleted file mode 100644 index f67481e..0000000 --- a/plugins/cua/skills/cua/SKILL.md +++ /dev/null @@ -1,390 +0,0 @@ ---- -name: cua -description: Drive Kernel cua — the `cua` CLI for shell automation, or the @onkernel/cua-agent TypeScript library for building your own computer-use agents. Use when opening URLs, clicking/typing/observing in a real cloud browser via cua, chaining multi-step browser tasks across shell calls, or wiring up `CuaAgent` / `CuaAgentHarness` against a Kernel browser. Covers model selection (gpt-5.5, claude-opus-4-7, gemini-3-flash-preview, n1.5-latest), named sessions, profile persistence, transcripts, live-view handoff, and Playwright escape hatches. ---- - -# cua - -`cua` is a computer-use loop for Kernel cloud browsers. There are two surfaces, both backed by the same execution layer: - -- **`cua` CLI** (`@onkernel/cua-cli`) — single binary that drives a real Chrome session running in Kernel. Each subcommand returns a one-line result on stdout and a deterministic exit code, so shell agents can chain calls. 
-- **`@onkernel/cua-agent` library** — `CuaAgent` / `CuaAgentHarness` TypeScript classes that run the same prompt → screenshot → tool-call loop against a Kernel browser, callable from your own code. - -Both translate per-provider computer-use tool calls (OpenAI's `computer`, Anthropic's `computer_20251124`, Gemini's normalized-coordinate functions, Yutori Navigator's browser actions) into Kernel SDK `browsers.computer.*` calls and feed a fresh screenshot back to the model on every turn. - -## When to use this skill - -- **Use the CLI** when you need shell-callable computer-use steps (`cua open`, `cua click`, `cua do …`) or an interactive TUI. Best for ad-hoc agent tasks, shell pipelines, and one-shot prompts. -- **Use the library** when you need to embed cua inside a larger TS app, run a custom session repo, add your own pi tools alongside computer use, or react to per-event streams programmatically. -- **Reach for `kernel-agent-browser` instead** when you need deterministic browser scripting (semantic selectors, `find role`, `wait --text`, snapshots/refs). cua drives by screenshots; agent-browser drives by accessibility tree. -- **Reach for `kernel-typescript-sdk` instead** for raw Playwright/CDP control over a Kernel browser without an LLM in the loop. - -## Prerequisites - -- A Kernel account and API key (`KERNEL_API_KEY`). See the [`kernel-cli`](https://www.kernel.sh/docs) skill for install + auth. -- At least one model-provider API key, matched to the model you pick (table in "Model selection" below). -- Node 20+ for both the CLI install and the library. 
- -## Install - -### CLI - -```bash -# Global install — gives you the `cua` binary on $PATH -npm i -g @onkernel/cua-cli - -# Or zero-install one-shot -npx -y -p @onkernel/cua-cli cua --help -``` - -### Library - -```bash -npm i @onkernel/cua-agent @onkernel/cua-ai @onkernel/sdk -``` - -## Environment variables - -| Env | Used for | -| --- | --- | -| `KERNEL_API_KEY` | Kernel API key (always required) | -| `OPENAI_API_KEY` | OpenAI models (`-m openai:…`) | -| `ANTHROPIC_API_KEY` | Anthropic models (`-m anthropic:…`); `ANTHROPIC_OAUTH_TOKEN` also works | -| `GOOGLE_API_KEY` / `GEMINI_API_KEY` | Google / Gemini models (`-m google:…`) | -| `YUTORI_API_KEY` | Yutori Navigator (`-m yutori:…`) | -| `TZAFON_API_KEY` | Tzafon (`-m tzafon:…`) | -| `KERNEL_BASE_URL` | Override Kernel base URL | -| `XDG_DATA_HOME` | CLI sessions/transcripts dir (defaults to `~/.local/share`) | -| `CUA_IMAGE_PROTOCOL` | Force inline image protocol (`kitty` / `iterm2` / `none` / `auto`) | - -The library auto-loads these via `getCuaEnvApiKey` if you don't pass explicit auth callbacks. - -## CLI: one-shot subcommands - -Each call provisions a fresh Kernel browser by default, runs the action, prints a one-line result, and tears the browser down. Chain via `-s ` (next section) to keep state. - -| Subcommand | What it does | Stdout | Exit code | -| --- | --- | --- | --- | -| `cua open ` | Navigate to a URL. | `ok` | 0 ok, 2 error | -| `cua click ""` | Find element matching natural-language description and click it. | `ok clicked (x, y)` or `not_found ` | 0 ok, 1 not_found, 2 error | -| `cua type "" ""` | Focus a field by description and type. | `ok typed` or `not_found ` | 0 ok, 1 not_found, 2 error | -| `cua press [...]` | Send a key combo (`cua press ctrl l`, `cua press Return`). | `ok pressed` | 0 ok, 2 error | -| `cua url` | Print the current URL. | the URL | 0 ok, 2 error | -| `cua observe [""]` | Describe the page; optionally answer a question. 
| the description | 0 ok, 2 error | -| `cua screenshot --out ` | Save a PNG. `--out -` writes bytes to stdout. | the path or `(stdout)` | 0 ok, 2 error | -| `cua do ""` | Open-ended; agent plans and acts. Bound by `--max-steps` (default 3). | the assistant's final text | 0 ok, 2 error | - -Useful flags: - -- `-m ` — pick the LLM (default `openai:gpt-5.5`). `cua models` to list. -- `--max-steps ` — bound the loop on `cua do`. -- `--profile ` — load a Kernel browser profile for persisted cookies / storage. Existing ids or names are reused; a non-id name is created if missing. Pass `--profile-no-save-changes` for read-only. -- `-v` — verbose progress on stderr (provisioning, tool calls, transcript path). - -`click` and `type` match **semantically**, not by selector — use natural-language descriptions of what's visible on screen. - -## CLI: named sessions - -Without `-s`, each subcommand provisions a brand-new browser. To keep state (cookies, URL, scroll position) across calls, allocate a named session first: - -```bash -cua --profile github session start login # provisions a Kernel browser, prints `name=login` -cua -s login open https://github.com/login -cua -s login type "email field" "$EMAIL" -cua -s login type "password field" "$PASSWORD" -cua -s login click "Sign in" -cua -s login url # prints post-login URL -cua session stop login # tears down the Kernel browser -``` - -Inspect: - -```bash -cua session list # NAME / KERNEL_ID / AGE / LIVE_URL -cua session show login # full JSON metadata -``` - -Pass `--profile` when starting the named session; later `cua -s …` calls attach to the same browser, so they don't need the profile flag. - -**Liveness**: Kernel browsers time out from inactivity. If you see `error session "" is no longer alive on Kernel …`, run `cua session stop && cua --profile session start ` to re-provision with the same persisted profile. - -Named-session metadata lives in `$XDG_DATA_HOME/cua/named-sessions/.json`. 
- -## CLI: free-form mode - -```bash -cua --print "open hn and tell me the top story" # one-shot, streams text -cua --print -o jsonl "..." # one-shot, streams JSONL events -cua "..." # interactive TUI (real terminal) -``` - -`--print` exits when the agent finishes; the TUI runs until Ctrl+C. Add `--jsonl-include-deltas` for token deltas, `--jsonl-include-images` for base64 screenshots in `tool_result` events. - -## Library: quick start with `CuaAgentHarness` - -The harness is the recommended entry point. It owns the session, persists every turn, handles steering / follow-up, and can swap providers mid-conversation. - -```ts -import Kernel from "@onkernel/sdk"; -import { - CuaAgentHarness, - InMemorySessionRepo, - NodeExecutionEnv, -} from "@onkernel/cua-agent"; -import type { AssistantMessage } from "@onkernel/cua-ai"; - -const client = new Kernel({ apiKey: process.env.KERNEL_API_KEY! }); -const browser = await client.browsers.create({ stealth: true }); - -const repo = new InMemorySessionRepo(); -const session = await repo.create({ id: "research" }); - -const harness = new CuaAgentHarness({ - browser, - client, - env: new NodeExecutionEnv({ cwd: process.cwd() }), - model: "openai:gpt-5.5", - session, -}); - -const textOf = (m: AssistantMessage) => - m.content.flatMap((b) => (b.type === "text" ? [b.text] : [])).join("").trim(); - -const first = await harness.prompt("Open example.com and describe what you see."); -console.log(textOf(first)); - -// Swap providers mid-session — CUA tools and the default prompt refresh. -await harness.setModel("anthropic:claude-opus-4-7"); -const second = await harness.prompt("Open the most relevant link from what you found."); -console.log(textOf(second)); - -await client.browsers.deleteByID(browser.session_id); -``` - -While a turn is running: `steer()` injects course corrections, `followUp()` queues the next instruction, `subscribe()` streams underlying agent events, and `compact()` collapses long transcripts. 
- -### When to use `CuaAgent` instead - -Reach for `CuaAgent` (extends pi `Agent`) when you want raw control — direct `state.messages` access, custom streaming, explicit prompt/continue/queue, no session repo. The shape is the same except you assign `agent.state.model = …` instead of calling `setModel()`. - -```ts -import { CuaAgent } from "@onkernel/cua-agent"; - -const agent = new CuaAgent({ - browser, - client, - initialState: { - model: "openai:gpt-5.5", - systemPrompt: "You are a careful browser automation agent.", - }, -}); - -agent.subscribe((event) => { /* … */ }); -await agent.prompt("Open news.ycombinator.com and summarize the top story."); -``` - -### CLI vs library vs raw `CuaAgent` - -| You want to … | Use | -| --- | --- | -| Drive cua from shell scripts | CLI | -| Open-ended TUI session | CLI (`cua` no args) | -| Embed cua inside a TS app with session-backed turns | `CuaAgentHarness` | -| Add your own pi tools alongside computer use | `CuaAgentHarness` (`extraTools`) or `CuaAgent` | -| Raw pi `Agent` semantics: own message state, lifecycle events | `CuaAgent` | - -## Model selection - -Run `cua models` (or `listCuaModels()` from `@onkernel/cua-ai`) for the current catalog. As of writing, the four supported providers and their built-in computer-use vocabularies: - -| Model ref | Provider | Notes | -| --- | --- | --- | -| `openai:gpt-5.5` | OpenAI | Built-in `computer` tool; default in CLI. | -| `anthropic:claude-opus-4-7` | Anthropic | Built-in `computer_20251124` tool. Supports `--thinking` levels. | -| `google:gemini-3-flash-preview` | Google | Predefined computer-use functions with 0–1000 normalized coords. | -| `yutori:n1.5-latest` | Yutori | OpenAI-compatible chat with browser action tool calls. | - -Switching models mid-turn: - -- CLI: re-run with `-m `, or attach a `-s` named session with a different `-m` per call. -- Library (harness): `await harness.setModel("anthropic:claude-opus-4-7")` — CUA tools and the default system prompt refresh. 
-- Library (agent): assign `agent.state.model = "anthropic:claude-opus-4-7"`. - -Not every provider's native vocabulary includes navigation. Pass `computerUseExtra: true` to add the provider-neutral `computer_use_extra` tool (`goto`, `back`, `forward`, `url`) when you need it on a model that lacks built-in navigation. - -## Browser config - -The CLI always provisions stealth-on browsers and exposes profile persistence via `--profile` / `--profile-no-save-changes`. For any other browser knob — non-stealth, custom viewport, proxy, custom timeout — use the library and provision the browser yourself: - -```ts -const browser = await client.browsers.create({ - stealth: true, // CLI hardcodes this on; flip to false only via the library - headless: false, // headful => live view URL; headless => no live view, smaller image - timeout: 1800, // seconds before the Kernel browser auto-times-out - profile: { name: "github", save_changes: true }, // load + save persisted state - // proxy: { ... }, // optional outbound proxy -}); -``` - -If you need a custom-provisioned browser from the CLI, pre-create it with `kernel browsers create` and attach via `cua session …` — see the kernel-cli skill for the create flag reference. - -## Adding your own tools - -Pass any pi `AgentTool` (see [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the tool shape) via `extraTools`. The CUA defaults stay installed; your tools run alongside them. - -```ts -import type { AgentTool } from "@onkernel/cua-agent"; -import { CuaAgentHarness } from "@onkernel/cua-agent"; - -const lookupOrder: AgentTool = { - // shape per pi-agent-core docs: name, description, schema, run, ... 
-}; - -const harness = new CuaAgentHarness({ - browser, client, - model: "openai:gpt-5.5", - session, - env: new NodeExecutionEnv({ cwd: process.cwd() }), - extraTools: [lookupOrder], - computerUseExtra: true, -}); -``` - -Use `createCuaComputerTools()` directly if you want to compose the tool list yourself (e.g. wrap computer-use tools in a permission gate): - -```ts -import { resolveCuaRuntimeSpec } from "@onkernel/cua-ai"; -import { createCuaComputerTools } from "@onkernel/cua-agent"; - -const runtime = resolveCuaRuntimeSpec("openai:gpt-5.5"); -const tools = [ - ...createCuaComputerTools({ browser, client, toolExecutors: runtime.toolExecutors }), - lookupOrder, -]; -``` - -## Live view URL and manual login fallback - -cua's `--profile` (CLI) and `profile` (library) handle most login persistence, but stealth doesn't always beat bot detection. When automation gets stuck on a login, hand off to a human via the live view URL. - -### CLI - -```bash -cua --profile mysite session start login -cua session show login | jq -r .live_url # share this URL with the user -# user logs in manually in their browser via the live view -cua -s login url # confirm the post-login URL -cua session stop login # profile state saves on teardown -``` - -### Library - -Every Kernel browser response carries the live view URL on creation: - -```ts -const browser = await client.browsers.create({ stealth: true, headless: false }); -console.log("live view:", browser.browser_live_view_url); -// share that URL, wait for the user to finish manual login, then prompt the agent -``` - -If you only have a session id, fetch it: - -```bash -kernel browsers view -``` - -## Cross-origin iframes / Playwright escape hatch - -cua drives by clicking pixels, so cross-origin iframes (payment forms, embedded vendor widgets) work in the screenshot flow without special handling — the model just clicks them. When you need a deterministic Playwright action against the underlying browser (e.g. 
to fill a card form via a fixed selector), break out to Kernel's exec endpoint with the session id:
-
-```bash
-# CLI: find the session id
-cua session show login | jq -r .kernel_session_id
-
-# Run a Playwright snippet against the same browser
-kernel browsers exec --code "
-  const frame = page.frameLocator('#payment-iframe');
-  await frame.locator('#card-number').fill('4111111111111111');
-  await frame.locator('#submit').click();
-"
-```
-
-From the library, you already have `browser.session_id` and the Kernel client — call the same exec endpoint via the SDK.
-
-## Debugging
-
-- **CLI verbose**: `cua -v --print "…"` writes provisioning info, tool calls, and the transcript path to stderr.
-- **Live event stream**: `cua --print -o jsonl "…"` emits one event per line (`tool_call`, `tool_result`, `assistant_text_done`, etc.). Add `--jsonl-include-images` to inline screenshots in `tool_result`.
-- **Persisted transcript**: every `--print`, TUI, and `-s ` invocation appends to `$XDG_DATA_HOME/cua/sessions//.jsonl`. Exact path:
-  ```bash
-  cua -v --print "..." # stderr includes: [cua] session=
-  cua session show login | jq -r .transcript_path
-  ```
-  Roles: `user`, `assistant`, `toolResult`. There's also a custom `cua-browser` entry written once per session with `kernel_session_id` / `live_url` / `profile_id`.
-- **Library event subscription**:
-  ```ts
-  harness.subscribe((event) => {
-    // event.type === "tool_call" | "tool_result" | "assistant_text_done" | ...
-  });
-  ```
-- **Screenshots**: `cua screenshot --out shot.png` (CLI) or inspect the `image` blocks in `toolResult` transcript entries.
-- **Page URL**: `cua url` to confirm post-action navigation. `agent.state.messages` (library) holds the full message history.
-
-A couple of `jq` starters against a transcript path:
-
-```bash
-# Every tool call the agent made, in order
-jq -c 'select(.role == "assistant") | .content[]?
-  | select(.type == "tool_use") | {name, input}' "$TRANSCRIPT"
-
-# Final assistant text (the answer)
-jq -r 'select(.role == "assistant") | .content[]?
-  | select(.type == "text") | .text' "$TRANSCRIPT" | tail -1
-```
-
-## Gotchas
-
-- **Element descriptions are semantic, not selectors.** `cua click "Sign in button"` looks at the screenshot — describe what the user sees, not a CSS selector.
-- **Viewport defaults to 1920x1080.** Resize via `client.browsers.create({ ... })` flags if you need something else.
-- **Keyboard navigation > mouse-wheel scroll.** `cua press Page_Down` / `Home` / arrow keys is more reliable than scroll wheel via the LLM.
-- **Multi-step state requires `-s` (CLI) or a session-backed harness (library).** A second one-shot subcommand can't see what the first one did.
-- **Profile saves on close, not continuously.** Tear down cleanly (`cua session stop`, `client.browsers.deleteByID`) or you'll lose recent state.
-- **Provider tool vocab gaps.** If a model can click and type but can't navigate, set `computerUseExtra: true` (library) or pick a different model.
-- **`--max-steps` defaults to 3 on `cua do`.** Bump it for non-trivial tasks.
-
-## Quick reference
-
-```bash
-# CLI quickstart — one-shot, fresh browser
-cua --print "open hn and tell me the top story"
-
-# CLI — named session for multi-step
-cua --profile mysite session start work
-cua -s work open https://example.com
-cua -s work click "Log in"
-cua -s work type "email field" "$EMAIL"
-cua -s work click "Submit"
-cua -s work url
-cua session stop work
-
-# CLI — list models, switch model per call
-cua models
-cua --print -m anthropic:claude-opus-4-7 "..."
-
-# Get the live view URL
-cua session show work | jq -r .live_url
-kernel browsers view # alternative
-```
-
-```ts
-// Library — minimal harness
-import { CuaAgentHarness, InMemorySessionRepo, NodeExecutionEnv } from "@onkernel/cua-agent";
-
-const session = await new InMemorySessionRepo().create({ id: "main" });
-const harness = new CuaAgentHarness({
-  browser, client, session,
-  env: new NodeExecutionEnv({ cwd: process.cwd() }),
-  model: "openai:gpt-5.5",
-});
-const result = await harness.prompt("Open example.com and click the first link.");
-```

From 83093fe63b2897ffec14c8a37c17a2e024c86758 Mon Sep 17 00:00:00 2001
From: dprevoznik <58714078+dprevoznik@users.noreply.github.com>
Date: Sun, 21 Jun 2026 15:57:28 +0000
Subject: [PATCH 4/4] move cua-cli into kernel-cli, cua-agent into kernel-sdks

Drop the standalone `cua` plugin. The repo organizes skills by audience (shell-driving vs SDK-authoring), not by product, so cua-cli sits alongside kernel-agent-browser in kernel-cli, and cua-agent sits alongside kernel-typescript-sdk in kernel-sdks. Users who already install those plugins get the cua skills for free.
- plugins/kernel-cli/skills/cua-cli/SKILL.md (moved)
- plugins/kernel-sdks/skills/cua-agent/SKILL.md (moved)
- plugins/cua/ deleted
- README install snippets and skill tables updated
- cross-skill links in both SKILL.md files updated to reference the new plugin locations

Co-Authored-By: Claude Opus 4.7
---
 README.md                                           | 13 +------------
 plugins/cua/.claude-plugin/plugin.json              | 11 -----------
 plugins/{cua => kernel-cli}/skills/cua-cli/SKILL.md |  2 +-
 .../{cua => kernel-sdks}/skills/cua-agent/SKILL.md  |  2 +-
 4 files changed, 3 insertions(+), 25 deletions(-)
 delete mode 100644 plugins/cua/.claude-plugin/plugin.json
 rename plugins/{cua => kernel-cli}/skills/cua-cli/SKILL.md (98%)
 rename plugins/{cua => kernel-sdks}/skills/cua-agent/SKILL.md (98%)

diff --git a/README.md b/README.md
index d2c203d..3057be6 100644
--- a/README.md
+++ b/README.md
@@ -18,9 +18,6 @@ Official AI agent skills from the Kernel for installing useful skills for our CL
 
 # Install the video generation skill
 /plugin install generate-video
-
-# Install the cua skill (CLI + library for computer-use on Kernel)
-/plugin install cua
 ```
 
 ### Cursor
@@ -45,7 +42,6 @@ git clone https://github.com/kernel/skills.git
 cp -r skills/plugins/kernel-cli ~/.claude/skills/
 cp -r skills/plugins/kernel-sdks ~/.claude/skills/
 cp -r skills/plugins/generate-video ~/.claude/skills/
-cp -r skills/plugins/cua ~/.claude/skills/
 ```
 
 ## Prerequisites
@@ -76,6 +72,7 @@ Command-line interface skills for using Kernel CLI commands.
 | **kernel-cli** | Complete guide to Kernel CLI - cloud browser platform with automation, deployment, and management |
 | **kernel-agent-browser** | Best practices for `agent-browser -p kernel` automation, bot detection handling, iframes, login persistence |
 | **kernel-auth** | Setup and manage Kernel authentication connections for any website with safety checks and reauthentication support |
+| **cua-cli** | Drive a Kernel browser from shell via the `cua` binary: one-shot subcommands, named sessions, TUI, profile persistence, transcripts, live-view handoff |
 | **profile-website-bot-detection** | Profile a website for bot detection vendors using stealth vs non-stealth Kernel browsers; compare effectiveness and identify vendor products |
 
 ### kernel-sdks
@@ -86,14 +83,6 @@ SDK skills for building browser automation with TypeScript and Python.
 |-------|-------------|
 | **typescript-sdk** | Build automation with Kernel's Typescript SDK |
 | **python-sdk** | Build automation with kernel's Python SDK |
-
-### cua
-
-Computer-use loop for Kernel cloud browsers — CLI for shell-driven automation and the `@onkernel/cua-agent` TS library for embedding in your own agents. One plugin, two skills (load whichever matches the task).
-
-| Skill | Description |
-|-------|-------------|
-| **cua-cli** | Drive a Kernel browser from shell via the `cua` binary: one-shot subcommands, named sessions, TUI, profile persistence, transcripts, live-view handoff |
 | **cua-agent** | Build TypeScript apps that embed Kernel cua's loop with `CuaAgent` / `CuaAgentHarness`: provider switching, custom tools, session repos, event-stream debugging |
 
 ### generate-video
diff --git a/plugins/cua/.claude-plugin/plugin.json b/plugins/cua/.claude-plugin/plugin.json
deleted file mode 100644
index 436bab5..0000000
--- a/plugins/cua/.claude-plugin/plugin.json
+++ /dev/null
@@ -1,11 +0,0 @@
-{
-  "name": "cua",
-  "version": "1.0.0",
-  "description": "Kernel cua skills: `cua-cli` for shell-driven computer-use automation via the `cua` binary, and `cua-agent` for building TypeScript apps on the @onkernel/cua-agent library (CuaAgent / CuaAgentHarness)",
-  "author": {
-    "name": "Kernel",
-    "url": "www.kernel.sh"
-  },
-  "repository": "https://github.com/kernel/skills",
-  "license": "MIT"
-}
diff --git a/plugins/cua/skills/cua-cli/SKILL.md b/plugins/kernel-cli/skills/cua-cli/SKILL.md
similarity index 98%
rename from plugins/cua/skills/cua-cli/SKILL.md
rename to plugins/kernel-cli/skills/cua-cli/SKILL.md
index 527b826..29a0e6a 100644
--- a/plugins/cua/skills/cua-cli/SKILL.md
+++ b/plugins/kernel-cli/skills/cua-cli/SKILL.md
@@ -10,7 +10,7 @@ description: Drive a Kernel cloud browser from the shell using the `cua` CLI. Us
 ## When to use this skill
 
 - **Use this skill** when you need shell-callable computer-use steps (`cua open`, `cua click`, `cua do …`), an interactive TUI, or want to chain browser actions in a shell pipeline.
-- **Reach for [`cua-agent`](../cua-agent/SKILL.md)** when you're writing a TypeScript app that needs to embed cua's prompt → screenshot → tool-call loop programmatically.
+- **Reach for the `cua-agent` skill** (in the `kernel-sdks` plugin) when you're writing a TypeScript app that needs to embed cua's prompt → screenshot → tool-call loop programmatically.
 - **Reach for `kernel-agent-browser`** when you need deterministic browser scripting (semantic selectors, `find role`, `wait --text`, accessibility-tree snapshots).
 - **Reach for `kernel-cli`** for raw Kernel browser management (`kernel browsers create`, `kernel browsers exec`, profile / proxy CRUD).
 
diff --git a/plugins/cua/skills/cua-agent/SKILL.md b/plugins/kernel-sdks/skills/cua-agent/SKILL.md
similarity index 98%
rename from plugins/cua/skills/cua-agent/SKILL.md
rename to plugins/kernel-sdks/skills/cua-agent/SKILL.md
index 322f908..46696a3 100644
--- a/plugins/cua/skills/cua-agent/SKILL.md
+++ b/plugins/kernel-sdks/skills/cua-agent/SKILL.md
@@ -15,7 +15,7 @@ Both translate per-provider computer-use tool calls (OpenAI's `computer`, Anthro
 ## When to use this skill
 
 - **Use this skill** when writing TS code that embeds cua inside a larger app, needs a custom session repo, runs its own pi tools alongside computer use, or reacts to per-event streams programmatically.
-- **Reach for [`cua-cli`](../cua-cli/SKILL.md)** when shell-callable computer-use is enough (`cua open`, `cua click`, `cua do`).
+- **Reach for the `cua-cli` skill** (in the `kernel-cli` plugin) when shell-callable computer-use is enough (`cua open`, `cua click`, `cua do`).
 - **Reach for `kernel-typescript-sdk`** for raw Playwright / CDP control over a Kernel browser without an LLM in the loop.
 
 ## Prerequisites