feat(evals): add verifier benchmark instrumentation by miguelg719 · Pull Request #2138 · browserbase/stagehand

miguelg719 · 2026-05-15T20:58:55Z

Why

The verifier should not become the default until it has been benchmarked against the legacy evaluator and Approaches A and B have been compared on the same saved trajectories. This PR adds the instrumentation and scripts needed for that evidence gate.

What Changed

Added Braintrust spans around rubric resolution, agent execution, and verifier judgment.
Added cross-verification scripts for rescoring saved trajectories with Approach A and Approach B.
Added benchmark matrix documentation.
Kept the default evaluator backend as legacy until benchmark evidence supports a flip.

Tests

pnpm -w exec prettier --check packages/core/lib/v3/index.ts packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/core/tests/unit/verifier-trajectory.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent
pnpm --filter @browserbasehq/stagehand run typecheck
pnpm --filter @browserbasehq/stagehand run build:esm
pnpm --filter @browserbasehq/stagehand run test:core -- packages/core/dist/esm/tests/unit/verifier-failure-step-parser.test.js packages/core/dist/esm/tests/unit/verifier-trajectory.test.js
pnpm --filter @browserbasehq/stagehand-evals run typecheck
pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts

changeset-bot · 2026-05-15T20:58:59Z

⚠️ No Changeset found

Latest commit: 81ee924

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

cubic-dev-ai

3 issues found across 5 files

Confidence score: 3/5

There is some merge risk because scripts/cross-verify-parallel.sh has concrete shell-safety issues (null-delimiting for xargs, and quoting redirect targets) that can fail on valid paths containing spaces, quotes, or glob characters.
The most severe issue is the xargs input handling in scripts/cross-verify-parallel.sh (6/10, confidence 7/10): without printf '%s\0' + xargs -0, path parsing can break and cause incorrect task execution.
scripts/cross-verify.sh also has a quoting concern around $TRAJECTORY_GLOB; unquoted expansion can mis-handle special characters, so this is user-facing in environments with non-trivial file names.
Pay close attention to scripts/cross-verify-parallel.sh and scripts/cross-verify.sh - path quoting and argument delimiting need hardening to avoid shell parsing errors.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="scripts/cross-verify-parallel.sh">

<violation number="1" location="scripts/cross-verify-parallel.sh:40">
P2: Unquoted redirect target will cause "ambiguous redirect" if `$task` contains spaces or glob characters. Quote the path to make the redirect robust.</violation>

<violation number="2" location="scripts/cross-verify-parallel.sh:58">
P2: Use null-delimited input (`printf '%s\0'` + `xargs -0`) to prevent `xargs` from misinterpreting directory paths that contain spaces, quotes, or backslashes.</violation>
</file>

<file name="scripts/cross-verify.sh">

<violation number="1" location="scripts/cross-verify.sh:22">
P2: Quote `$TRAJECTORY_GLOB` or use an array to safely expand the glob. Unquoted variable expansion combined with globbing can break when paths contain spaces or special characters.</violation>
</file>
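
The redirect issue flagged as violation 1 for scripts/cross-verify-parallel.sh can be reproduced in isolation. The following is a minimal sketch, not the script's actual code: `task` and `out_dir` are hypothetical stand-ins, and the log file name is invented for illustration. It shows that in bash an unquoted redirect target that expands to several words is rejected, while the quoted form is treated as a single path.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins; the real script's variable names and paths may differ.
out_dir="$(mktemp -d)"
task="my task [v1]"   # contains spaces and glob characters

# Unquoted target: bash word-splits the expansion and refuses the redirect
# ("ambiguous redirect"). Run in a subshell so the failure does not abort us.
if ( echo "verdict" > $out_dir/$task.log ) 2>/dev/null; then
  unquoted_status="ok"
else
  unquoted_status="failed"
fi

# Quoted target: the whole expansion stays one word, so the file is created.
echo "verdict for ${task}" > "${out_dir}/${task}.log"
redirect_result="$(cat "${out_dir}/${task}.log")"

echo "unquoted redirect: ${unquoted_status}"
echo "quoted redirect wrote: ${redirect_result}"
```

Quoting the target is a one-character-pair change per redirect, so it is usually cheaper than restricting which task names are allowed.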

Architecture diagram

sequenceDiagram
    participant CLI as Script / CLI
    participant Eval as Evaluator (runWithVerifier)
    participant BT as Braintrust Span
    participant Cache as RubricCache
    participant Agent as Agent Executor
    participant Recorder as TrajectoryRecorder
    participant Verifier as Verifier (V3Evaluator)

    Note over CLI,Verifier: Verifier Benchmark Flow

    CLI->>Eval: runWithVerifier(taskSpec, dataset)
    
    Eval->>BT: tracedSpan("verifier.rubric")
    alt Precomputed rubric
        BT->>BT: Use taskSpec.precomputedRubric
    else Cache disabled (env)
        BT->>Eval: evaluator.generateRubric(taskSpec)
    else Cache enabled
        BT->>Cache: cache.read(taskSpec)
        alt Cache hit
            Cache-->>BT: Return cached rubric
        else Cache miss
            BT->>Eval: evaluator.generateRubric(taskSpec)
            Eval->>Cache: cache.write(taskSpec, rubric)
        end
    end
    BT->>BT: span.log({output, metadata})
    BT-->>Eval: Return resolved rubric

    Eval->>Eval: Hydrate TaskSpec with rubric

    Eval->>BT: tracedSpan("agent.execute")
    BT->>Agent: agent.execute(instruction)
    Agent-->>BT: AgentResult (including usage)
    BT->>BT: span.log({output, metrics})
    BT-->>Eval: Return AgentResult

    Eval->>Recorder: recorder.finish(agentResult)
    Recorder-->>Eval: Trajectory

    Eval->>BT: tracedSpan("verifier.verify")
    BT->>Verifier: evaluator.verify(trajectory, hydratedTaskSpec)
    Verifier-->>BT: Verdict
    BT->>BT: span.log({output, scores, metadata})
    BT-->>Eval: Return Verdict

    Eval->>Recorder: recorder.persistVerdict(verdict)

    opt Cross-Verification (scripts/cross-verify-parallel.sh)
        CLI->>CLI: Enumerate trajectory dirs
        loop Each trajectory
            par Approach A (VERIFIER_APPROACH=a)
                CLI->>Eval: verify() with approach=a
                Eval->>BT: tracedSpan("verifier.rubric") & tracedSpan("verifier.verify")
                Note over Eval,Verifier: Same agent trajectory, different verifier logic
                Eval-->>CLI: Score output to scores/mmrubric_cross-a.json
            and Approach B (VERIFIER_APPROACH=b)
                CLI->>Eval: verify() with approach=b
                Eval->>BT: tracedSpan("verifier.rubric") & tracedSpan("verifier.verify")
                Eval-->>CLI: Score output to scores/mmrubric_cross-b.json
            end
        end
    end

    Note over CLI,Verifier: Default backend remains "legacy" until benchmark shows verifier improvement

cubic-dev-ai · 2026-05-15T21:05:32Z

+  JOBS+=("$d|a")
+done
+
+printf '%s\n' "${JOBS[@]}" | xargs -I {} -n 1 -P "$PARALLEL" bash -c '


P2: Use null-delimited input (printf '%s\0' + xargs -0) to prevent xargs from misinterpreting directory paths that contain spaces, quotes, or backslashes.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At scripts/cross-verify-parallel.sh, line 58:

<comment>Use null-delimited input (`printf '%s\0'` + `xargs -0`) to prevent `xargs` from misinterpreting directory paths that contain spaces, quotes, or backslashes.</comment>

<file context>
@@ -0,0 +1,63 @@
+  JOBS+=("$d|a")
+done
+
+printf '%s\n' "${JOBS[@]}" | xargs -I {} -n 1 -P "$PARALLEL" bash -c '
+  IFS="|" read -r dir approach <<< "$1"
+  run_one "$dir" "$approach"
</file context>
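
A minimal sketch of the null-delimited variant suggested above, under stated assumptions: the directory names are invented, the inline `printf` stands in for the script's `run_one` function (whose body is not shown in this review), and `-I {}` is dropped in favor of passing each record as `$1` after a `_` placeholder for `$0`. It is meant to illustrate the technique, not to be the fix the author should apply verbatim.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical trajectory directories with awkward names.
root="$(mktemp -d)"
mkdir -p "${root}/run one" "${root}/it's run two"

# Build "dir|approach" records, mirroring the JOBS array in the snippet above.
JOBS=()
for d in "${root}"/*/; do
  JOBS+=("${d%/}|a")
  JOBS+=("${d%/}|b")
done

PARALLEL=2

# Emit NUL-terminated records and let xargs split on NUL only, so spaces,
# quotes, and backslashes inside a record are passed through untouched.
xargs_output="$(
  printf '%s\0' "${JOBS[@]}" |
    xargs -0 -n 1 -P "$PARALLEL" bash -c '
      IFS="|" read -r dir approach <<< "$1"
      # Stand-in for run_one "$dir" "$approach".
      printf "%s -> %s\n" "${dir##*/}" "$approach"
    ' _ | sort
)"

echo "$xargs_output"
```

The `dir|approach` encoding still assumes that no directory path contains a `|` character; if that cannot be guaranteed, passing the two fields as separate NUL-terminated arguments with `-n 2` is one alternative.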

cubic-dev-ai · 2026-05-15T21:05:32Z

+DIRS=()
+while IFS= read -r d; do
+  DIRS+=("$d")
+done < <(find $TRAJECTORY_GLOB -mindepth 1 -maxdepth 1 -type d | sort)


P2: Quote $TRAJECTORY_GLOB or use an array to safely expand the glob. Unquoted variable expansion combined with globbing can break when paths contain spaces or special characters.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At scripts/cross-verify.sh, line 22:

<comment>Quote `$TRAJECTORY_GLOB` or use an array to safely expand the glob. Unquoted variable expansion combined with globbing can break when paths contain spaces or special characters.</comment>

<file context>
@@ -0,0 +1,44 @@
+DIRS=()
+while IFS= read -r d; do
+  DIRS+=("$d")
+done < <(find $TRAJECTORY_GLOB -mindepth 1 -maxdepth 1 -type d | sort)
+
+echo "Found ${#DIRS[@]} trajectory dirs"
</file context>
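
One caveat when applying this suggestion: simply double-quoting `$TRAJECTORY_GLOB` in the `find` call would also suppress pathname expansion, so `find` would receive the literal pattern. The sketch below shows one possible way to take the "use an array" route instead. The directory layout, the glob value, and the `expand_glob` helper are all hypothetical; only `TRAJECTORY_GLOB` and `DIRS` come from the snippet above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical layout: a run directory whose name contains a space.
base="$(mktemp -d)"
mkdir -p "${base}/runs 2026/task one" "${base}/runs 2026/task two"
TRAJECTORY_GLOB="${base}/runs *"   # illustrative value only

# Expand the glob into an array with word splitting disabled: an empty IFS
# turns off field splitting, while pathname expansion still takes place.
expand_glob() {
  local IFS=
  ROOTS=( $1 )
}
expand_glob "$TRAJECTORY_GLOB"

# Without nullglob an unmatched pattern stays literal, so check it exists.
if [ ! -e "${ROOTS[0]}" ]; then
  echo "no directories match: ${TRAJECTORY_GLOB}" >&2
  exit 1
fi

# Enumerate child directories NUL-delimited so odd names survive the pipe.
DIRS=()
while IFS= read -r -d '' d; do
  DIRS+=("$d")
done < <(find "${ROOTS[@]}" -mindepth 1 -maxdepth 1 -type d -print0 | sort -z)

echo "Found ${#DIRS[@]} trajectory dirs"
```

This relies on `sort -z`, which is available in GNU coreutils and in the BSD `sort` shipped with recent macOS; on other platforms the sort step may need a different approach.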

cubic-dev-ai Bot reviewed May 15, 2026

miguelg719 force-pushed the miguelgonzalez/verifier-10-benchmark-instrumentation branch from 57873b1 to eede85a Compare May 15, 2026 21:23

miguelg719 mentioned this pull request May 15, 2026

docs(verifier): add README for the new rubric verifier #2139

Draft

3 tasks

miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from 09ba439 to 6774e3d Compare May 15, 2026 21:45

miguelg719 force-pushed the miguelgonzalez/verifier-10-benchmark-instrumentation branch 2 times, most recently from 3d202fa to e4fbe53 Compare May 15, 2026 22:33

miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from 1ae8368 to 573e4e0 Compare May 15, 2026 23:27

miguelg719 force-pushed the miguelgonzalez/verifier-10-benchmark-instrumentation branch from e4fbe53 to 248b0ea Compare May 15, 2026 23:27

miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from 573e4e0 to ed49853 Compare May 16, 2026 04:40

miguelg719 force-pushed the miguelgonzalez/verifier-10-benchmark-instrumentation branch from 248b0ea to 9326826 Compare May 16, 2026 04:40

miguelg719 added 4 commits May 15, 2026 22:49

feat(evals): add verifier benchmark instrumentation

10a8f06

docs(evals): clarify verifier env naming

7aea577

docs(evals): include outcome-only verifier matrix

3d15cec

fix(evals): use result filenames in cross verification

81ee924

miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from ed49853 to ec15b9b Compare May 16, 2026 05:50

miguelg719 force-pushed the miguelgonzalez/verifier-10-benchmark-instrumentation branch from 9326826 to 81ee924 Compare May 16, 2026 05:50

feat(evals): add verifier benchmark instrumentation #2138
miguelg719 wants to merge 4 commits into miguelgonzalez/verifier-09-harness-adapters from miguelgonzalez/verifier-10-benchmark-instrumentation