evals(cua): add a deterministic CUA agent regression task #2292

Open
yawbtng wants to merge 1 commit into browserbase:main from yawbtng:evals-cua-amazon-checkout

yawbtng commented Jun 30, 2026

Contributor

why

mode: "cua" is exercised today only inside the heavyweight WebVoyager / OnlineMind2Web suites, which are too slow for quick CI signal and too non-deterministic to attribute a failure to a specific provider. As a result, provider-specific function-response regressions (#2046, fixed by #2159; #2035) slip through: they break only outside the PNG / happy path that the act-tier tests cover.

This adds the small, deterministic regression check proposed in #2188: a fixture-backed agent task that answers "does the CUA function-response → browser-action loop still work end to end?" without LLM-judge noise or page drift.

what changed

One new bench task: packages/evals/tasks/bench/agent/cua_amazon_checkout.ts.

  • Fixture: the pinned static mirror browserbase.github.io/stagehand-eval-sites/sites/amazon/ — already used by act/amazon_add_to_cart, so the page is known-stable.
  • Flow: agent.execute(...) to add the product to the cart and proceed to checkout.
  • Pass criterion: page.url() === .../amazon/sign-in.html — the exact deterministic check the existing act task uses. No rubric, no screenshot/coordinate comparison.
  • Pattern: mirrors the existing deterministic-URL agent task agent/sign_in.ts; adds no new framework surface and no dependencies.
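
The flow and pass criterion above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of cua_amazon_checkout.ts: the PageLike and AgentLike interfaces, the runCheckoutTask function, and the FakePage class are hypothetical stand-ins so the snippet runs without Stagehand or a browser, and the instruction and URLs are passed in as parameters instead of being hard-coded.

```typescript
// Hypothetical minimal shapes; the real task uses Stagehand's page and agent objects.
interface PageLike {
  goto(url: string): Promise<void>;
  url(): string;
}

interface AgentLike {
  execute(instruction: string): Promise<unknown>;
}

interface TaskResult {
  _success: boolean;
  currentUrl: string;
}

// Navigate to the fixture, let the agent act, then pass only on an exact URL match.
// No rubric and no screenshot comparison: the final URL is the whole criterion.
async function runCheckoutTask(
  page: PageLike,
  agent: AgentLike,
  instruction: string,
  startUrl: string,
  expectedUrl: string,
): Promise<TaskResult> {
  await page.goto(startUrl);
  await agent.execute(instruction);
  const currentUrl = page.url();
  return { _success: currentUrl === expectedUrl, currentUrl };
}

// In-memory fake page (assumption, for illustration only): it just remembers
// the last URL it was sent to.
class FakePage implements PageLike {
  private current = "about:blank";

  async goto(url: string): Promise<void> {
    this.current = url;
  }

  url(): string {
    return this.current;
  }
}
```

Because the check is a strict string comparison, an agent that never acts, or that stops on an intermediate page, fails the same way on every run, which is what makes the signal deterministic.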

Mode-agnostic — point it at a CUA model to exercise the CUA path:

evals run agent/cua_amazon_checkout --agent-mode cua \
  --model google/gemini-2.5-computer-use-preview-10-2025

attribution context

Per review discussion on #2188 (keeping failures narrowly attributable), the result records the model/agent-mode path that ran and whether the agent ever left the start page — i.e. whether a failure happened before or after the first browser action. Finer-grained path attribution (function-response vs browser-execution) is preserved in the per-step trajectory logged from agent.execute.
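
One way the attribution record described above could be assembled is sketched below. The field names (modelName, agentMode, isCUA, leftStartPage) follow those listed for this PR; the buildAttribution and failureStage helpers, the derivation of isCUA from the mode string, and the use of a URL inequality for leftStartPage are assumptions made for illustration and may differ from the actual task.

```typescript
// Field names follow the PR; everything else here is a hypothetical sketch.
interface AttributionContext {
  modelName: string;
  agentMode: string;
  isCUA: boolean;
  leftStartPage: boolean;
}

// Assumption: the agent "left the start page" if the final URL differs from the start URL.
function buildAttribution(
  modelName: string,
  agentMode: string,
  startUrl: string,
  currentUrl: string,
): AttributionContext {
  return {
    modelName,
    agentMode,
    isCUA: agentMode === "cua", // assumption: CUA is identified by the mode string
    leftStartPage: currentUrl !== startUrl,
  };
}

// Coarse reading of a failed run: did it fail before or after the first
// browser action that changed the page?
function failureStage(
  ctx: AttributionContext,
): "before-first-action" | "after-first-action" {
  return ctx.leftStartPage ? "after-first-action" : "before-first-action";
}
```

A failed run with leftStartPage set to false points at the function-response side of the loop (no action ever landed), while a failed run that did leave the start page points further downstream; the per-step trajectory is still needed for anything finer.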

testing

pnpm --filter @browserbasehq/stagehand-evals run lint (prettier + eslint + typecheck) passes, and the task is picked up by directory auto-discovery. I haven't run it against a live CUA model (needs provider keys + budget) — happy to share a run if there's a preferred model. Open to a different fixture, instruction, or pass-criterion shape.

Closes #2188.


Summary by cubic

Adds a small, deterministic CUA agent regression task to verify the end-to-end function-response → browser-action loop on a pinned Amazon mirror. Provides fast, stable CI signal and isolates provider/plumbing regressions (addresses #2188).

  • New Features
    • Adds agent/cua_amazon_checkout in packages/evals/tasks/bench/agent/cua_amazon_checkout.ts.
    • Uses the static fixture at https://browserbase.github.io/stagehand-eval-sites/sites/amazon/.
    • Passes only when page.url() matches the known sign-in URL; no rubric or screenshots.
    • Captures modelName, agentMode, isCUA, and whether the agent left the start page, plus logs and debug/session URLs.
    • Mirrors the existing deterministic-URL agent pattern; no new dependencies.

Written for commit 18c0f03. Summary will update on new commits.

Adds agent/cua_amazon_checkout, a fixture-backed agent eval that passes
only when the agent reaches an exact known URL on the pinned
stagehand-eval-sites Amazon mirror. Unlike the rubric-graded agent
benchmarks, the deterministic URL criterion makes a failure attributable
to a real provider/plumbing regression rather than page drift or
LLM-judge noise, and exercises the full computer-use loop
(function-response decoding -> browser action) end to end — the path
that broke in browserbase#2046 and browserbase#2035.

Mirrors the existing agent/sign_in deterministic-URL pattern and reuses
the act/amazon_add_to_cart fixture and expected sign-in URL. Records the
model/agent-mode path and whether the agent left the start page so
failures are easy to attribute. Closes a coverage gap tracked in browserbase#2188.

changeset-bot Bot commented Jun 30, 2026

⚠️ No Changeset found

Latest commit: 18c0f03

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@github-actions

Contributor

This PR is from an external contributor and must be approved by a stagehand team member with write access before CI can run.
Approving the latest commit mirrors it into an internal PR owned by the approver.
If new commits are pushed later, the internal PR stays open but is marked stale until someone approves the latest external commit and refreshes it.

github-actions Bot added the external-contributor (Tracks PRs mirrored from external contributor forks.) and external-contributor:awaiting-approval (Waiting for a stagehand team member to approve the latest external commit.) labels Jun 30, 2026

cubic-dev-ai Bot left a comment

Contributor

No issues found across 1 file

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Runner as Eval Runner
    participant Task as cua_amazon_checkout
    participant Agent as CUA Agent Model
    participant Browser as Fixture Page
    participant Logger

    Runner->>Task: NEW: execute task (agent/cua_amazon_checkout)
    Note over Task: Pinned fixture URL, expected sign-in URL

    Task->>Browser: NEW: page.goto(startUrl)
    Browser-->>Task: page loaded

    Task->>Agent: NEW: agent.execute(instruction)
    Note over Agent: CUA function-response → browser-action loop<br/>(up to AGENT_EVAL_MAX_STEPS)

    alt Agent performs actions
        Agent->>Browser: Perform browser actions (click, type, etc.)
        Browser-->>Agent: New page state
    end

    Agent-->>Task: agentResult (trajectory logs)

    Task->>Logger: NEW: logger.log(agentResult)
    Logger-->>Task: logged

    Task->>Browser: NEW: page.url()
    Browser-->>Task: currentUrl

    alt Success: currentUrl === expectedUrl
        Task->>Task: _success = true
    else Failure: URL mismatch or error
        Task->>Task: _success = false
    end

    Note over Task: Returns attribution context:<br/>modelName, agentMode, isCUA,<br/>leftStartPage, debugUrl, sessionUrl

    Task-->>Runner: NEW: result object
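
The failure branch in the diagram (URL mismatch or error both yield _success = false) could be handled along these lines. This is a hedged sketch only: the Outcome shape, the settle helper, and the idea of passing the run as a callback that resolves to the final URL are hypothetical and not taken from the task file.

```typescript
// Hypothetical result shape: a thrown error is reported, not propagated.
interface Outcome {
  _success: boolean;
  error?: string;
}

// `run` stands in for "goto + agent.execute + page.url()" and resolves to the final URL.
async function settle(
  run: () => Promise<string>,
  expectedUrl: string,
): Promise<Outcome> {
  try {
    const currentUrl = await run();
    // URL mismatch: a plain failure with no error attached.
    return { _success: currentUrl === expectedUrl };
  } catch (err) {
    // Any exception from the agent loop is folded into a failed result.
    const message = err instanceof Error ? err.message : String(err);
    return { _success: false, error: message };
  }
}
```

Folding exceptions into the result keeps a provider-side crash visible as a failed eval instead of aborting the run, while the attached message preserves the reason for later attribution.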
