feat(evals): wire WebTailBench through verifier #2135
Conversation
2 issues found across 4 files
Confidence score: 3/5
- There is moderate merge risk because both findings are medium severity (6/10) with solid confidence, and each can affect correctness of eval outcomes or debugging behavior.
- In packages/evals/framework/verifierAdapter.ts, a rejection from recorder.finish() inside the catch path can mask the original agent failure, making root-cause diagnosis harder and potentially changing observed failure behavior.
- Pay close attention to packages/evals/framework/verifierAdapter.ts and packages/evals/tasks/bench/agent/webtailbench.ts: preserve original thrown errors during persistence failure handling, and validate EVAL_SUCCESS_MODE so invalid values cannot produce an undefined success mapping.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/framework/verifierAdapter.ts">
<violation number="1" location="packages/evals/framework/verifierAdapter.ts:110">
P2: If `recorder.finish()` rejects inside the catch block, the original agent error is lost. Wrap the persistence call in its own try/catch so the original error is always rethrown.</violation>
</file>
<file name="packages/evals/tasks/bench/agent/webtailbench.ts">
<violation number="1" location="packages/evals/tasks/bench/agent/webtailbench.ts:77">
P2: Unvalidated env var cast can silently corrupt eval results. If `EVAL_SUCCESS_MODE` is set to a typo (e.g., `"outcomes"`), `verdictToSuccess` falls through its switch without matching, returns `undefined`, and all evals report `_success: false`. Validate or constrain the value before passing it.</violation>
</file>
Architecture diagram
sequenceDiagram
participant Bench as Bench Task<br/>(webtailbench.ts)
participant Adapter as runWithVerifier<br/>(verifierAdapter.ts)
participant Cache as RubricCache<br/>(rubricCache.ts)
participant Recorder as TrajectoryRecorder
participant Evaluator as V3Evaluator<br/>(verifier backend)
participant Agent as Agent<br/>(agent.execute())
Note over Bench,Agent: WebTailBench scoring pipeline (Wave 1 MVP)
Bench->>Adapter: runWithVerifier({v3, agent, taskSpec, dataset:"webtailbench"})
Note over Adapter,Cache: Resolve Rubric
Adapter->>Adapter: Check precomputedRubric on taskSpec
alt precomputed rubric provided (e.g., upstream WebTailBench)
Adapter->>Adapter: Use precomputed rubric directly
else VERIFIER_DISABLE_RUBRIC_CACHE=1
Adapter->>Evaluator: generateRubric(taskSpec)
Evaluator-->>Adapter: Generated rubric
else cache available
Adapter->>Cache: getOrGenerate(taskSpec, evaluator)
Cache-->>Adapter: Cached or generated rubric
end
Adapter->>Adapter: Build hydratedTaskSpec with resolved rubric
Note over Adapter,Agent: Record Trajectory
Adapter->>Recorder: start()
Recorder->>Recorder: Subscribe to bus events
Adapter->>Agent: execute({instruction, maxSteps})
Agent-->>Adapter: AgentResult
alt Agent throws error
Agent-->>Adapter: Error
Adapter->>Recorder: finish({status:"error"})
Recorder-->>Adapter: Partial trajectory
Adapter->>Adapter: Attach trajectoryDir to error
Adapter-->>Bench: Re-throw error
end
Adapter->>Recorder: finish({status:"complete", finalAnswer, usage})
Recorder-->>Adapter: Recorded Trajectory
Note over Adapter,Evaluator: Verify & Persist
Adapter->>Evaluator: verify(trajectory, hydratedTaskSpec)
Evaluator-->>Adapter: Verdict (outcome + process scores)
Adapter->>Recorder: persistVerdict(verdict)
Adapter-->>Bench: {trajectory, verdict, rubric, trajectoryDir}
Note over Bench: Convert Verdict to Success Flag
Bench->>Bench: verdictToSuccess(verdict, EVAL_SUCCESS_MODE)
Note right of Bench: mode: outcome | process | both<br/>default = outcome
Bench-->>Bench: Return eval result with _success flag
Note over Bench: Cleanup on Error Path
alt Error in Adapter
Bench-->>Bench: Catch error with trajectoryDir
Bench-->>Bench: Return {_success: false, trajectoryDir}
end
Note over Bench,Evaluator: Persistence (env-gated)
opt VERIFIER_PERSIST_TRAJECTORIES set
Recorder->>Recorder: Write trajectory JSON to disk
Recorder->>Recorder: Write verdict JSON to disk
end
opt rubric caching active
Cache->>Cache: Cache rubric to .rubric-cache/webtailbench/<task-id>.json
end
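The verdict-to-success conversion noted in the diagram (mode: outcome | process | both, default outcome) can be sketched as follows. This is illustrative only: the Verdict field names (outcomeSuccess, processSuccess) are assumptions, not the real verifier schema.

```typescript
// Hypothetical Verdict shape; the real type lives in the verifier backend.
type Verdict = { outcomeSuccess: boolean; processSuccess: boolean };

type SuccessMode = "outcome" | "process" | "both";

// Maps a verifier verdict to the bench task's boolean _success flag.
function verdictToSuccess(verdict: Verdict, mode: SuccessMode = "outcome"): boolean {
  switch (mode) {
    case "outcome":
      return verdict.outcomeSuccess;
    case "process":
      return verdict.processSuccess;
    case "both":
      // Both the final outcome and the process score must pass.
      return verdict.outcomeSuccess && verdict.processSuccess;
  }
}
```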
  });
} catch (e) {
  recorderStatus = "error";
  const trajectory = await recorder.finish({ status: recorderStatus });
P2: If recorder.finish() rejects inside the catch block, the original agent error is lost. Wrap the persistence call in its own try/catch so the original error is always rethrown.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/framework/verifierAdapter.ts, line 110:
<comment>If `recorder.finish()` rejects inside the catch block, the original agent error is lost. Wrap the persistence call in its own try/catch so the original error is always rethrown.</comment>
<file context>
@@ -0,0 +1,160 @@
+ });
+ } catch (e) {
+ recorderStatus = "error";
+ const trajectory = await recorder.finish({ status: recorderStatus });
+ // Re-throw after persisting so the bench task can decide how to report.
+ const wrapped = e instanceof Error ? e : new Error(String(e));
</file context>
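A minimal sketch of the suggested fix: wrap the persistence call in its own try/catch so the original agent error always propagates. runAgentStep and the parameter shapes below are hypothetical stand-ins for the real adapter code, assumed for illustration only.

```typescript
// Hypothetical helper mirroring the catch path in the diff above.
async function runAgentStep(
  agent: { execute: () => Promise<unknown> },
  recorder: { finish: (opts: { status: string }) => Promise<unknown> },
): Promise<unknown> {
  try {
    return await agent.execute();
  } catch (e) {
    // Capture the original failure first so nothing below can replace it.
    const wrapped = e instanceof Error ? e : new Error(String(e));
    try {
      // Best-effort persistence: a rejection here must not mask the agent error.
      await recorder.finish({ status: "error" });
    } catch (persistErr) {
      console.warn("recorder.finish() failed during error handling:", persistErr);
    }
    throw wrapped; // the original agent failure always propagates
  }
}
```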
  }),
);
}
const successMode =
P2: Unvalidated env var cast can silently corrupt eval results. If EVAL_SUCCESS_MODE is set to a typo (e.g., "outcomes"), verdictToSuccess falls through its switch without matching, returns undefined, and all evals report _success: false. Validate or constrain the value before passing it.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/tasks/bench/agent/webtailbench.ts, line 77:
<comment>Unvalidated env var cast can silently corrupt eval results. If `EVAL_SUCCESS_MODE` is set to a typo (e.g., `"outcomes"`), `verdictToSuccess` falls through its switch without matching, returns `undefined`, and all evals report `_success: false`. Validate or constrain the value before passing it.</comment>
<file context>
@@ -41,70 +56,62 @@ export default defineBenchTask(
- }),
- );
- }
+ const successMode =
+ (process.env.EVAL_SUCCESS_MODE as "outcome" | "process" | "both") ||
+ "outcome";
</file context>
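A minimal sketch of the suggested validation, replacing the unchecked cast above. resolveSuccessMode is a hypothetical helper name; the mode list and default come from the snippet, but the fail-fast behavior is an assumed design choice.

```typescript
type SuccessMode = "outcome" | "process" | "both";
const SUCCESS_MODES: readonly SuccessMode[] = ["outcome", "process", "both"];

// Validates EVAL_SUCCESS_MODE instead of casting it blindly.
function resolveSuccessMode(raw: string | undefined): SuccessMode {
  if (raw === undefined || raw === "") return "outcome"; // documented default
  if ((SUCCESS_MODES as readonly string[]).includes(raw)) {
    return raw as SuccessMode;
  }
  // Fail fast rather than letting a typo silently map every eval to _success: false.
  throw new Error(
    `Invalid EVAL_SUCCESS_MODE "${raw}"; expected one of: ${SUCCESS_MODES.join(", ")}`,
  );
}
```

Callers would then use `resolveSuccessMode(process.env.EVAL_SUCCESS_MODE)` in place of the raw cast.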
Why
The eval harness needs one reusable path for running an agent, recording its trajectory, and scoring it with the verifier. This PR applies that path to WebTailBench first as the narrowest dataset migration.
What Changed
- Added runWithVerifier for rubric resolution, trajectory recording, verifier scoring, and result persistence.
- Success mode defaults to outcome.
- Eval results are tagged backend: "verifier".
Tests
- pnpm --filter @browserbasehq/stagehand run typecheck
- pnpm --filter @browserbasehq/stagehand-evals run typecheck
- pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts
- git diff --check