evals(cua): add a deterministic CUA agent regression task #2292
Adds agent/cua_amazon_checkout, a fixture-backed agent eval that passes only when the agent reaches an exact known URL on the pinned stagehand-eval-sites Amazon mirror. Unlike the rubric-graded agent benchmarks, the deterministic URL criterion makes a failure attributable to a real provider/plumbing regression rather than page drift or LLM-judge noise, and exercises the full computer-use loop (function-response decoding -> browser action) end to end — the path that broke in browserbase#2046 and browserbase#2035. Mirrors the existing agent/sign_in deterministic-URL pattern and reuses the act/amazon_add_to_cart fixture and expected sign-in URL. Records the model/agent-mode path and whether the agent left the start page so failures are easy to attribute. Closes a coverage gap tracked in browserbase#2188.
This PR is from an external contributor and must be approved by a stagehand team member with write access before CI can run.
No issues found across 1 file
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant Runner as Eval Runner
participant Task as cua_amazon_checkout
participant Agent as CUA Agent Model
participant Browser as Fixture Page
participant Logger
Runner->>Task: NEW: execute task (agent/cua_amazon_checkout)
Note over Task: Pinned fixture URL, expected sign-in URL
Task->>Browser: NEW: page.goto(startUrl)
Browser-->>Task: page loaded
Task->>Agent: NEW: agent.execute(instruction)
Note over Agent: CUA function-response → browser-action loop<br/>(up to AGENT_EVAL_MAX_STEPS)
alt Agent performs actions
Agent->>Browser: Perform browser actions (click, type, etc.)
Browser-->>Agent: New page state
end
Agent-->>Task: agentResult (trajectory logs)
Task->>Logger: NEW: logger.log(agentResult)
Logger-->>Task: logged
Task->>Browser: NEW: page.url()
Browser-->>Task: currentUrl
alt Success: currentUrl === expectedUrl
Task->>Task: _success = true
else Failure: URL mismatch or error
Task->>Task: _success = false
end
Note over Task: Returns attribution context:<br/>modelName, agentMode, isCUA,<br/>leftStartPage, debugUrl, sessionUrl
Task-->>Runner: NEW: result object
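The flow in the diagram can be sketched in TypeScript roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `PageLike`, `AgentLike`, `FakePage`, `evaluateOutcome`, and `runTask` are hypothetical names, the real Stagehand page and agent types are richer, and the attribution fields `modelName`, `agentMode`, `isCUA`, `debugUrl`, and `sessionUrl` are omitted for brevity. Only the fixture URL, the sign-in URL, and the exact-equality criterion come from the description.

```typescript
// Hypothetical minimal shapes standing in for the real page and agent objects.
interface PageLike {
  goto(url: string): Promise<void>;
  url(): string;
}

interface AgentLike {
  execute(instruction: string): Promise<unknown>;
}

interface TaskResult {
  _success: boolean;
  currentUrl: string;
  expectedUrl: string;
  leftStartPage: boolean;
  error?: string;
}

// Pinned fixture and expected end state, as named in the PR description.
const START_URL = "https://browserbase.github.io/stagehand-eval-sites/sites/amazon/";
const EXPECTED_URL = START_URL + "sign-in.html";

// Pure pass/fail decision: exact URL equality, no rubric and no LLM judge.
function evaluateOutcome(startUrl: string, expectedUrl: string, currentUrl: string): TaskResult {
  return {
    _success: currentUrl === expectedUrl,
    currentUrl,
    expectedUrl,
    // Distinguishes "failed before the first navigation" from "failed after it".
    leftStartPage: currentUrl !== startUrl,
  };
}

async function runTask(page: PageLike, agent: AgentLike, instruction: string): Promise<TaskResult> {
  try {
    await page.goto(START_URL);
    const agentResult = await agent.execute(instruction);
    console.log(JSON.stringify(agentResult)); // stand-in for logger.log(agentResult)
    return evaluateOutcome(START_URL, EXPECTED_URL, page.url());
  } catch (err) {
    // A thrown error is a failure, but the attribution context is still returned.
    return {
      ...evaluateOutcome(START_URL, EXPECTED_URL, page.url()),
      _success: false,
      error: String(err),
    };
  }
}

// In-memory page used only to make the sketch runnable without a browser.
class FakePage implements PageLike {
  private current = "about:blank";
  async goto(url: string): Promise<void> {
    this.current = url;
  }
  url(): string {
    return this.current;
  }
}

// Usage: a fake agent that "reaches checkout" by navigating to the sign-in page.
const demoPage = new FakePage();
const demoAgent: AgentLike = {
  execute: async () => {
    await demoPage.goto(EXPECTED_URL);
    return { completed: true };
  },
};
runTask(demoPage, demoAgent, "Add the product to the cart and proceed to checkout").then((result) => {
  console.log(result._success, result.leftStartPage);
});
```

Keeping the pass/fail decision in a pure function is a design choice of this sketch: it lets the criterion be unit-tested without a browser or a model, while the async wrapper only handles navigation, logging, and error capture.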
why
mode: "cua" is only exercised today inside the heavyweight WebVoyager / OnlineMind2Web suites, which are too slow for quick CI signal and too non-deterministic to attribute a failure to a specific provider. As a result, provider-specific function-response regressions (#2046, fixed by #2159; #2035) slip through, because they only break outside the PNG / happy-path that the act-tier tests cover.

This adds the small, deterministic regression check proposed in #2188: a fixture-backed agent task that answers "does the CUA function-response → browser-action loop still work end to end?" without LLM-judge noise or page drift.
what changed
One new bench task:
packages/evals/tasks/bench/agent/cua_amazon_checkout.ts.

- Starts on the pinned fixture browserbase.github.io/stagehand-eval-sites/sites/amazon/, which is already used by act/amazon_add_to_cart, so the page is known-stable.
- Calls agent.execute(...) to add the product to the cart and proceed to checkout.
- Passes when page.url() === .../amazon/sign-in.html, the exact deterministic check the existing act task uses. No rubric, no screenshot/coordinate comparison.
- Mirrors agent/sign_in.ts; adds no new framework surface and no dependencies.

The task is mode-agnostic; point it at a CUA model to exercise the CUA path.
attribution context
Per review discussion on #2188 (keeping failures narrowly attributable), the result records the model/agent-mode path that ran and whether the agent ever left the start page, i.e. whether a failure happened before or after the first browser action. Finer-grained path attribution (function-response vs browser-execution) is preserved in the per-step trajectory logged from agent.execute.
testing
pnpm --filter @browserbasehq/stagehand-evals run lint (prettier + eslint + typecheck) passes, and the task is picked up by directory auto-discovery. I haven't run it against a live CUA model (needs provider keys + budget); happy to share a run if there's a preferred model. Open to a different fixture, instruction, or pass-criterion shape.

Closes #2188.
Summary by cubic
Adds a small, deterministic CUA agent regression task to verify the end-to-end function-response → browser-action loop on a pinned Amazon mirror. Provides fast, stable CI signal and isolates provider/plumbing regressions (addresses #2188).
- Adds agent/cua_amazon_checkout in packages/evals/tasks/bench/agent/cua_amazon_checkout.ts.
- Uses the pinned fixture https://browserbase.github.io/stagehand-eval-sites/sites/amazon/.
- Passes when page.url() matches the known sign-in URL; no rubric or screenshots.
- Records modelName, agentMode, isCUA, and whether the agent left the start page, plus logs and debug/session URLs.

Written for commit 18c0f03. Summary will update on new commits.