Add packages/bench: single-task CUA model runner #39

Draft
jarugupj wants to merge 4 commits into main from phani/cua-bench-runner

Summary

  • Adds a private @onkernel/cua-bench workspace — the first slice of a benchmark that runs CUA models on Kernel browsers and reports accuracy / cost / speed.
  • runTask(modelRef, task) provisions a fresh Kernel browser, builds a CuaAgentHarness on the given model, runs the prompt, and tears the browser down. It captures wall-clock time, turn count, and token totals (summed across every model call via harness events).
  • spike.ts is a runnable one-shot: one HN task on anthropic:claude-opus-4-6.
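
The measurement flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `measureTask`, `HarnessEvent`, `BrowserSession`, and every `TaskResult` field other than `success` and `costUsd` are hypothetical names, and the real `CuaAgentHarness` event shapes may differ. It shows the two ideas that matter: totals are accumulated from events rather than read from the final response, and teardown sits in a `finally` so the browser is released even when the run throws.

```typescript
// Hypothetical event shapes; the real harness events may look different.
interface ModelCallEvent {
  type: "model_call";
  inputTokens: number;
  outputTokens: number;
}
interface TurnEvent {
  type: "turn";
}
type HarnessEvent = ModelCallEvent | TurnEvent;

// Only `success` and `costUsd` are named in the PR; the rest are assumed.
interface TaskResult {
  success: boolean | null;
  costUsd: number | null;
  wallClockMs: number;
  turns: number;
  inputTokens: number;
  outputTokens: number;
}

// Stand-in for a provisioned browser session.
interface BrowserSession {
  close(): Promise<void>;
}

async function measureTask(
  openBrowser: () => Promise<BrowserSession>,
  runAgent: (
    browser: BrowserSession,
    onEvent: (event: HarnessEvent) => void,
  ) => Promise<void>,
): Promise<TaskResult> {
  const startedAt = Date.now();
  let turns = 0;
  let inputTokens = 0;
  let outputTokens = 0;

  const browser = await openBrowser();
  try {
    await runAgent(browser, (event) => {
      if (event.type === "turn") {
        turns += 1;
      } else {
        // Sum usage across every model call, not just the last one.
        inputTokens += event.inputTokens;
        outputTokens += event.outputTokens;
      }
    });
  } finally {
    // Tear the browser down whether the run succeeded or threw.
    await browser.close();
  }

  return {
    success: null, // no accuracy judge yet
    costUsd: null, // filled in only when a cost is available
    wallClockMs: Date.now() - startedAt,
    turns,
    inputTokens,
    outputTokens,
  };
}
```

Passing `openBrowser` and `runAgent` in as functions keeps the sketch free of any real SDK calls and makes the accounting testable with fakes.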

Scope / what's deliberately left for follow-up

  • success returns null — the accuracy judge (adopting an existing benchmark's scorer, e.g. Online-Mind2Web's) is the next step.
  • costUsd is populated only when the provider reports a cost; the token×price conversion comes with the model price table.
  • One task, one model, hardcoded — the task-set loader and multi-model fan-out are next.
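
For reference, the planned token×price conversion could look something like the sketch below. Everything here is an assumption for illustration: `resolveCostUsd`, `ModelPrice`, and the per-million-token price fields are hypothetical, and no real price table exists in this PR yet. The sketch prefers a provider-reported cost when one is present, falls back to the price table, and returns `null` when neither is available.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Hypothetical price entry, in USD per million tokens.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

function resolveCostUsd(
  reportedCostUsd: number | undefined,
  usage: TokenUsage,
  priceTable: Record<string, ModelPrice>,
  modelRef: string,
): number | null {
  // A provider-reported cost wins, including a reported cost of 0.
  if (reportedCostUsd !== undefined) {
    return reportedCostUsd;
  }
  const price = priceTable[modelRef];
  if (!price) {
    // No reported cost and no known price: leave the cost unset.
    return null;
  }
  return (
    (usage.inputTokens * price.inputPerMTok +
      usage.outputTokens * price.outputPerMTok) /
    1_000_000
  );
}
```

Returning `null` rather than `0` for an unknown model keeps "cost not known" distinguishable from "run was free" in the reported results.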

How to run locally

npm install
export KERNEL_API_KEY=...
export ANTHROPIC_API_KEY=...
npm run spike --workspace @onkernel/cua-bench

Prints a TaskResult JSON with timing + token totals.

Test plan

  • tsc -b typechecks clean
  • spike resolves all imports and fails fast without KERNEL_API_KEY
  • real run against claude-opus-4-6 produces wall-clock + token numbers
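
The fail-fast behaviour in the second item can be approximated with a check like the one below. This is a sketch, not the check in spike.ts: `requireEnv` is a hypothetical helper, and the actual entrypoint may validate its environment differently. It reports every missing variable in one error before any browser is provisioned.

```typescript
// Hypothetical helper: validate required variables up front and
// return them with non-optional types.
function requireEnv(
  env: Record<string, string | undefined>,
  names: readonly string[],
): Record<string, string> {
  // Treat unset and empty-string values the same way.
  const missing = names.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`,
    );
  }
  const resolved: Record<string, string> = {};
  for (const name of names) {
    resolved[name] = env[name] as string;
  }
  return resolved;
}

// Usage at the top of an entrypoint (assumes a Node.js runtime):
//   const env = requireEnv(process.env, ["KERNEL_API_KEY", "ANTHROPIC_API_KEY"]);
```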

jarugupj added 4 commits June 25, 2026 15:08
Introduce a private @onkernel/cua-bench workspace that runs one task on one
model against a fresh Kernel browser via CuaAgentHarness, capturing wall-clock
time, turn count, and token totals. Accuracy scoring and cost conversion are
left for follow-up work. Includes a spike entrypoint for a manual run.

Pin @onkernel/cua-agent and @onkernel/cua-ai to "*" so the private bench
package keeps resolving the workspace siblings across version bumps, and
regenerate package-lock.json against the current tree.