evals(cua): add a deterministic CUA agent regression task by yawbtng · Pull Request #2292 · browserbase/stagehand

yawbtng · 2026-06-30T17:06:08Z

why

mode: "cua" is exercised today only inside the heavyweight WebVoyager / OnlineMind2Web suites, which are too slow for quick CI signal and too non-deterministic to attribute a failure to a specific provider. As a result, provider-specific function-response regressions (#2046, fixed by #2159; #2035) slip through, because they only break outside the PNG / happy-path that the act-tier tests cover.

This adds the small, deterministic regression check proposed in #2188: a fixture-backed agent task that answers "does the CUA function-response → browser-action loop still work end to end?" without LLM-judge noise or page drift.

what changed

One new bench task: packages/evals/tasks/bench/agent/cua_amazon_checkout.ts.

Fixture: the pinned static mirror browserbase.github.io/stagehand-eval-sites/sites/amazon/ — already used by act/amazon_add_to_cart, so the page is known-stable.
Flow: agent.execute(...) to add the product to the cart and proceed to checkout.
Pass criterion: page.url() === .../amazon/sign-in.html — the exact deterministic check the existing act task uses. No rubric, no screenshot/coordinate comparison.
Pattern: mirrors the existing deterministic-URL agent task agent/sign_in.ts; adds no new framework surface and no dependencies.
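
The flow and pass criterion above can be sketched roughly as follows. This is a minimal illustration, not the task file from this PR: `PageLike`, `AgentLike`, `CheckoutResult`, and `runCheckoutCheck` are hypothetical names, and the real Stagehand page and agent types expose far more than the three methods assumed here. The start and expected URLs are passed in rather than hard-coded.

```typescript
// Hypothetical minimal shapes (assumptions for illustration only).
interface PageLike {
  goto(url: string): Promise<void>;
  url(): string;
}

interface AgentLike {
  execute(instruction: string): Promise<unknown>;
}

interface CheckoutResult {
  _success: boolean;
  currentUrl: string;
}

// Navigate to the fixture, let the agent act, then compare the final URL
// against the expected one with strict equality. No rubric, no screenshots.
async function runCheckoutCheck(
  page: PageLike,
  agent: AgentLike,
  startUrl: string,
  expectedUrl: string,
  instruction: string,
): Promise<CheckoutResult> {
  await page.goto(startUrl);
  await agent.execute(instruction);
  const currentUrl = page.url();
  return { _success: currentUrl === expectedUrl, currentUrl };
}
```

Because the check is a plain string comparison, a failure means the agent did not end on the expected page, independent of any judge model.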

Mode-agnostic — point it at a CUA model to exercise the CUA path:

evals run agent/cua_amazon_checkout --agent-mode cua \
  --model google/gemini-2.5-computer-use-preview-10-2025

attribution context

Per review discussion on #2188 (keeping failures narrowly attributable), the result records the model/agent-mode path that ran and whether the agent ever left the start page — i.e. whether a failure happened before or after the first browser action. Finer-grained path attribution (function-response vs browser-execution) is preserved in the per-step trajectory logged from agent.execute.
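
As a rough sketch of the attribution fields described above: the field names `modelName`, `agentMode`, `isCUA`, and `leftStartPage` come from this PR's summary, but the helper `buildAttribution` and the way each field is derived here are assumptions, not the task's actual code.

```typescript
interface AttributionContext {
  modelName: string;
  agentMode: string;
  isCUA: boolean;
  leftStartPage: boolean;
}

// Hypothetical helper: derives the attribution fields from values the task
// already has on hand once agent.execute has returned.
function buildAttribution(
  modelName: string,
  agentMode: string,
  startUrl: string,
  currentUrl: string,
): AttributionContext {
  return {
    modelName,
    agentMode,
    // Assumption: the CUA path is identified by the agent mode string alone.
    isCUA: agentMode === "cua",
    // If the URL never changed, the failure happened before any navigation.
    leftStartPage: currentUrl !== startUrl,
  };
}
```

A result with `leftStartPage: false` points at a failure before the first navigating browser action; `true` with a URL mismatch points at a failure later in the flow.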

testing

pnpm --filter @browserbasehq/stagehand-evals run lint (prettier + eslint + typecheck) passes, and the task is picked up by directory auto-discovery. I haven't run it against a live CUA model (needs provider keys + budget) — happy to share a run if there's a preferred model. Open to a different fixture, instruction, or pass-criterion shape.

Closes #2188.

Summary by cubic

Adds a small, deterministic CUA agent regression task to verify the end-to-end function-response → browser-action loop on a pinned Amazon mirror. Provides fast, stable CI signal and isolates provider/plumbing regressions (addresses #2188).

New Features
- Adds agent/cua_amazon_checkout in packages/evals/tasks/bench/agent/cua_amazon_checkout.ts.
- Uses the static fixture at https://browserbase.github.io/stagehand-eval-sites/sites/amazon/.
- Passes only when page.url() matches the known sign-in URL; no rubric or screenshots.
- Captures modelName, agentMode, isCUA, and whether the agent left the start page, plus logs and debug/session URLs.
- Mirrors the existing deterministic-URL agent pattern; no new dependencies.

Written for commit 18c0f03. Summary will update on new commits.

Adds agent/cua_amazon_checkout, a fixture-backed agent eval that passes only when the agent reaches an exact known URL on the pinned stagehand-eval-sites Amazon mirror. Unlike the rubric-graded agent benchmarks, the deterministic URL criterion makes a failure attributable to a real provider/plumbing regression rather than page drift or LLM-judge noise, and exercises the full computer-use loop (function-response decoding -> browser action) end to end — the path that broke in browserbase#2046 and browserbase#2035. Mirrors the existing agent/sign_in deterministic-URL pattern and reuses the act/amazon_add_to_cart fixture and expected sign-in URL. Records the model/agent-mode path and whether the agent left the start page so failures are easy to attribute. Closes a coverage gap tracked in browserbase#2188.

changeset-bot · 2026-06-30T17:06:13Z

⚠️ No Changeset found

Latest commit: 18c0f03

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


github-actions · 2026-06-30T17:06:19Z

This PR is from an external contributor and must be approved by a stagehand team member with write access before CI can run.
Approving the latest commit mirrors it into an internal PR owned by the approver.
If new commits are pushed later, the internal PR stays open but is marked stale until someone approves the latest external commit and refreshes it.

cubic-dev-ai

No issues found across 1 file

Confidence score: 5/5

Automated review surfaced no issues in the provided summaries.
No files require special attention.

Architecture diagram

sequenceDiagram
    participant Runner as Eval Runner
    participant Task as cua_amazon_checkout
    participant Agent as CUA Agent Model
    participant Browser as Fixture Page
    participant Logger

    Runner->>Task: NEW: execute task (agent/cua_amazon_checkout)
    Note over Task: Pinned fixture URL, expected sign-in URL

    Task->>Browser: NEW: page.goto(startUrl)
    Browser-->>Task: page loaded

    Task->>Agent: NEW: agent.execute(instruction)
    Note over Agent: CUA function-response → browser-action loop<br/>(up to AGENT_EVAL_MAX_STEPS)

    alt Agent performs actions
        Agent->>Browser: Perform browser actions (click, type, etc.)
        Browser-->>Agent: New page state
    end

    Agent-->>Task: agentResult (trajectory logs)

    Task->>Logger: NEW: logger.log(agentResult)
    Logger-->>Task: logged

    Task->>Browser: NEW: page.url()
    Browser-->>Task: currentUrl

    alt Success: currentUrl === expectedUrl
        Task->>Task: _success = true
    else Failure: URL mismatch or error
        Task->>Task: _success = false
    end

    Note over Task: Returns attribution context:<br/>modelName, agentMode, isCUA,<br/>leftStartPage, debugUrl, sessionUrl

    Task-->>Runner: NEW: result object
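
The failure branch in the diagram ("URL mismatch or error") can be illustrated with a small wrapper that folds a thrown error into the same `_success = false` outcome as a mismatch. This is a sketch under assumptions: `Outcome` and `settle` are hypothetical names, and the real task may report errors differently.

```typescript
interface Outcome {
  _success: boolean;
  currentUrl?: string;
  error?: string;
}

// Runs a step that resolves to the final URL and classifies the result.
// A thrown error and a URL mismatch both yield _success: false, but the
// error message is kept so the two cases stay distinguishable.
async function settle(
  run: () => Promise<string>,
  expectedUrl: string,
): Promise<Outcome> {
  try {
    const currentUrl = await run();
    return { _success: currentUrl === expectedUrl, currentUrl };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { _success: false, error: message };
  }
}
```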


github-actions Bot added two labels on Jun 30, 2026:
- external-contributor: Tracks PRs mirrored from external contributor forks.
- external-contributor:awaiting-approval: Waiting for a stagehand team member to approve the latest external commit.

yawbtng mentioned this pull request Jun 30, 2026

evals(cua): add a deterministic CUA agent regression task #2188

Open

cubic-dev-ai Bot reviewed Jun 30, 2026

yawbtng wants to merge 1 commit into browserbase:main from yawbtng:evals-cua-amazon-checkout
