Add packages/bench: single-task CUA model runner by jarugupj · Pull Request #39 · kernel/cua

jarugupj · 2026-06-25T15:09:14Z

Summary

Adds a private @onkernel/cua-bench workspace — the first slice of a benchmark that runs CUA models on Kernel browsers and reports accuracy / cost / speed.
runTask(modelRef, task) provisions a fresh Kernel browser, builds a CuaAgentHarness on the given model, runs the prompt, and tears the browser down. It captures wall-clock, turn count, and token totals (summed across every model call via harness events).
spike.ts is a runnable one-shot: one HN task on anthropic:claude-opus-4-6.
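The runTask flow described above (provision, run, tear down, sum usage across model calls) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: HarnessEvent, Session, addEvent, and runTaskSketch are hypothetical names, and the real CuaAgentHarness event shapes may differ.

```typescript
// Hypothetical event and result shapes for illustration only; they are NOT
// the real @onkernel/cua-agent types.
type HarnessEvent =
  | { type: "model_call"; inputTokens: number; outputTokens: number }
  | { type: "turn_end" };

interface Totals {
  turns: number;
  inputTokens: number;
  outputTokens: number;
}

interface TaskResult extends Totals {
  taskId: string;
  model: string;
  wallClockMs: number;
  success: boolean | null; // null until an accuracy judge exists
  costUsd: number | null; // null unless a cost is known
}

// Assumed stand-in for "a fresh browser plus a harness bound to it".
interface Session {
  run(prompt: string, onEvent: (e: HarnessEvent) => void): Promise<void>;
  close(): Promise<void>;
}

// Fold one event into the running totals without mutating the input.
function addEvent(totals: Totals, e: HarnessEvent): Totals {
  if (e.type === "turn_end") {
    return { ...totals, turns: totals.turns + 1 };
  }
  return {
    ...totals,
    inputTokens: totals.inputTokens + e.inputTokens,
    outputTokens: totals.outputTokens + e.outputTokens,
  };
}

async function runTaskSketch(
  model: string,
  task: { id: string; prompt: string },
  openSession: () => Promise<Session>,
): Promise<TaskResult> {
  const startedAt = Date.now();
  let totals: Totals = { turns: 0, inputTokens: 0, outputTokens: 0 };
  const session = await openSession();
  try {
    await session.run(task.prompt, (e) => {
      totals = addEvent(totals, e);
    });
  } finally {
    // Tear the session down even if the run throws.
    await session.close();
  }
  return {
    taskId: task.id,
    model,
    wallClockMs: Date.now() - startedAt,
    ...totals,
    success: null,
    costUsd: null,
  };
}
```

Putting the teardown in a finally block is one way to make sure a failed run does not leave a browser behind; whether the PR does exactly this is not stated in the description.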

Scope / what's deliberately left for follow-up

success returns null — the accuracy judge (adopting an existing benchmark's scorer, e.g. Online-Mind2Web's) is the next step.
costUsd is populated only when the provider reports a cost; the token×price conversion comes with the model price table.
One task, one model, hardcoded — the task-set loader and multi-model fan-out are next.
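For the deferred token×price conversion, one possible shape is sketched below. Everything here is an assumption made for illustration: the TokenPrice and resolveCostUsd names, the "example:model-a" key, and the prices, which are placeholder numbers and not real provider pricing.

```typescript
// USD per one million tokens. Field names are hypothetical.
interface TokenPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Placeholder values for illustration only, not real provider pricing.
const PRICE_TABLE: Record<string, TokenPrice> = {
  "example:model-a": { inputPerMTok: 2, outputPerMTok: 10 },
};

interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Prefer a provider-reported cost; otherwise fall back to tokens x price.
// Returns null when neither source is available.
function resolveCostUsd(
  model: string,
  usage: Usage,
  reportedCostUsd?: number,
): number | null {
  if (reportedCostUsd !== undefined) {
    return reportedCostUsd;
  }
  const price: TokenPrice | undefined = PRICE_TABLE[model];
  if (price === undefined) {
    return null;
  }
  const micros =
    usage.inputTokens * price.inputPerMTok +
    usage.outputTokens * price.outputPerMTok;
  return micros / 1_000_000;
}
```

Returning null for an unknown model keeps "no cost known" distinct from a cost of zero, which matches how costUsd is described above.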

How to run locally

npm install
export KERNEL_API_KEY=...
export ANTHROPIC_API_KEY=...
npm run spike --workspace @onkernel/cua-bench

Prints a TaskResult JSON with timing + token totals.

Test plan

tsc -b typechecks clean
spike resolves all imports and fails fast without KERNEL_API_KEY
real run against claude-opus-4-6 produces wall-clock + token numbers
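The fail-fast behaviour in the test plan could be implemented with a check along these lines. This is a sketch under assumptions: missingEnv and assertEnv are hypothetical helpers, and the description does not say how spike.ts actually performs the check.

```typescript
// Return the names of required variables that are unset or empty.
function missingEnv(
  required: string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((name) => !env[name]);
}

// Throw before any browser is provisioned if configuration is incomplete.
function assertEnv(
  required: string[],
  env: Record<string, string | undefined>,
): void {
  const missing = missingEnv(required, env);
  if (missing.length > 0) {
    throw new Error("Missing environment variables: " + missing.join(", "));
  }
}

// Intended usage at the top of an entrypoint (Node.js):
//   assertEnv(["KERNEL_API_KEY", "ANTHROPIC_API_KEY"], process.env);
```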

Introduce a private @onkernel/cua-bench workspace that runs one task on one model against a fresh Kernel browser via CuaAgentHarness, capturing wall-clock time, turn count, and token totals. Accuracy scoring and token-to-cost conversion are left for follow-up work. Includes a spike entrypoint for a manual run.

Set the @onkernel/cua-agent and @onkernel/cua-ai dependency ranges to "*" so the private bench package keeps resolving the workspace siblings across version bumps, and regenerate package-lock.json against the current tree.

jarugupj added 4 commits June 25, 2026 15:08

Regenerate package-lock.json so npm ci resolves the bench workspace

33cb859

Merge remote-tracking branch 'origin/main' into phani/cua-bench-runner

6f92ae0

Make bench workspace deps version-agnostic and resync lock

98bf4e4


jarugupj wants to merge 4 commits into main from phani/cua-bench-runner
