Skip to content

feat(evals): add verifier benchmark instrumentation#2138

Open
miguelg719 wants to merge 4 commits into
miguelgonzalez/verifier-09-harness-adaptersfrom
miguelgonzalez/verifier-10-benchmark-instrumentation
Open

feat(evals): add verifier benchmark instrumentation#2138
miguelg719 wants to merge 4 commits into
miguelgonzalez/verifier-09-harness-adaptersfrom
miguelgonzalez/verifier-10-benchmark-instrumentation

Conversation

@miguelg719
Copy link
Copy Markdown
Collaborator

@miguelg719 miguelg719 commented May 15, 2026

Why

The verifier should not become the default until it is benchmarked against legacy and Approach A/B are compared on the same saved trajectories. This PR adds the instrumentation and scripts needed for that evidence gate.

What Changed

  • Added Braintrust spans around rubric resolution, agent execution, and verifier judgment.
  • Added cross-verification scripts for rescoring saved trajectories with Approach A and Approach B.
  • Added benchmark matrix documentation.
  • Kept the default evaluator backend as legacy until benchmark evidence supports a flip.

Tests

  • pnpm -w exec prettier --check packages/core/lib/v3/index.ts packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/core/tests/unit/verifier-trajectory.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent
  • pnpm --filter @browserbasehq/stagehand run typecheck
  • pnpm --filter @browserbasehq/stagehand run build:esm
  • pnpm --filter @browserbasehq/stagehand run test:core -- packages/core/dist/esm/tests/unit/verifier-failure-step-parser.test.js packages/core/dist/esm/tests/unit/verifier-trajectory.test.js
  • pnpm --filter @browserbasehq/stagehand-evals run typecheck
  • pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 15, 2026

⚠️ No Changeset found

Latest commit: 81ee924

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 5 files

Confidence score: 3/5

  • There is some merge risk because scripts/cross-verify-parallel.sh has concrete shell-safety issues (null-delimiting for xargs, and quoting redirect targets) that can fail on valid paths containing spaces, quotes, or glob characters.
  • The most severe issue is the xargs input handling in scripts/cross-verify-parallel.sh (6/10, confidence 7/10): without printf '%s\0' + xargs -0, path parsing can break and cause incorrect task execution.
  • scripts/cross-verify.sh also has a quoting concern around $TRAJECTORY_GLOB; unquoted expansion can mis-handle special characters, so this is user-facing in environments with non-trivial file names.
  • Pay close attention to scripts/cross-verify-parallel.sh and scripts/cross-verify.sh - path quoting and argument delimiting need hardening to avoid shell parsing errors.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="scripts/cross-verify-parallel.sh">

<violation number="1" location="scripts/cross-verify-parallel.sh:40">
P2: Unquoted redirect target will cause "ambiguous redirect" if `$task` contains spaces or glob characters. Quote the path to make the redirect robust.</violation>

<violation number="2" location="scripts/cross-verify-parallel.sh:58">
P2: Use null-delimited input (`printf '%s\0'` + `xargs -0`) to prevent `xargs` from misinterpreting directory paths that contain spaces, quotes, or backslashes.</violation>
</file>

<file name="scripts/cross-verify.sh">

<violation number="1" location="scripts/cross-verify.sh:22">
P2: Quote `$TRAJECTORY_GLOB` or use an array to safely expand the glob. Unquoted variable expansion combined with globbing can break when paths contain spaces or special characters.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant CLI as Script / CLI
    participant Eval as Evaluator (runWithVerifier)
    participant BT as Braintrust Span
    participant Cache as RubricCache
    participant Agent as Agent Executor
    participant Recorder as TrajectoryRecorder
    participant Verifier as Verifier (V3Evaluator)

    Note over CLI,Verifier: Verifier Benchmark Flow

    CLI->>Eval: runWithVerifier(taskSpec, dataset)
    
    Eval->>BT: tracedSpan("verifier.rubric")
    alt Precomputed rubric
        BT->>BT: Use taskSpec.precomputedRubric
    else Cache disabled (env)
        BT->>Eval: evaluator.generateRubric(taskSpec)
    else Cache enabled
        BT->>Cache: cache.read(taskSpec)
        alt Cache hit
            Cache-->>BT: Return cached rubric
        else Cache miss
            BT->>Eval: evaluator.generateRubric(taskSpec)
            Eval->>Cache: cache.write(taskSpec, rubric)
        end
    end
    BT->>BT: span.log({output, metadata})
    BT-->>Eval: Return resolved rubric

    Eval->>Eval: Hydrate TaskSpec with rubric

    Eval->>BT: tracedSpan("agent.execute")
    BT->>Agent: agent.execute(instruction)
    Agent-->>BT: AgentResult (including usage)
    BT->>BT: span.log({output, metrics})
    BT-->>Eval: Return AgentResult

    Eval->>Recorder: recorder.finish(agentResult)
    Recorder-->>Eval: Trajectory

    Eval->>BT: tracedSpan("verifier.verify")
    BT->>Verifier: evaluator.verify(trajectory, hydratedTaskSpec)
    Verifier-->>BT: Verdict
    BT->>BT: span.log({output, scores, metadata})
    BT-->>Eval: Return Verdict

    Eval->>Recorder: recorder.persistVerdict(verdict)

    opt Cross-Verification (scripts/cross-verify-parallel.sh)
        CLI->>CLI: Enumerate trajectory dirs
        loop Each trajectory
            par Approach A (VERIFIER_APPROACH=a)
                CLI->>Eval: verify() with approach=a
                Eval->>BT: tracedSpan("verifier.rubric") & tracedSpan("verifier.verify")
                Note over Eval,Verifier: Same agent trajectory, different verifier logic
                Eval-->>CLI: Score output to scores/mmrubric_cross-a.json
            and Approach B (VERIFIER_APPROACH=b)
                CLI->>Eval: verify() with approach=b
                Eval->>BT: tracedSpan("verifier.rubric") & tracedSpan("verifier.verify")
                Eval-->>CLI: Score output to scores/mmrubric_cross-b.json
            end
        end
    end

    Note over CLI,Verifier: Default backend remains "legacy" until benchmark shows verifier improvement
Loading

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
Re-trigger cubic

JOBS+=("$d|a")
done

printf '%s\n' "${JOBS[@]}" | xargs -I {} -n 1 -P "$PARALLEL" bash -c '
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Use null-delimited input (printf '%s\0' + xargs -0) to prevent xargs from misinterpreting directory paths that contain spaces, quotes, or backslashes.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/cross-verify-parallel.sh, line 58:

<comment>Use null-delimited input (`printf '%s\0'` + `xargs -0`) to prevent `xargs` from misinterpreting directory paths that contain spaces, quotes, or backslashes.</comment>

<file context>
@@ -0,0 +1,63 @@
+  JOBS+=("$d|a")
+done
+
+printf '%s\n' "${JOBS[@]}" | xargs -I {} -n 1 -P "$PARALLEL" bash -c '
+  IFS="|" read -r dir approach <<< "$1"
+  run_one "$dir" "$approach"
</file context>
Fix with Cubic

Comment thread scripts/cross-verify-parallel.sh
Comment thread scripts/cross-verify.sh
DIRS=()
while IFS= read -r d; do
DIRS+=("$d")
done < <(find $TRAJECTORY_GLOB -mindepth 1 -maxdepth 1 -type d | sort)
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Quote $TRAJECTORY_GLOB or use an array to safely expand the glob. Unquoted variable expansion combined with globbing can break when paths contain spaces or special characters.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/cross-verify.sh, line 22:

<comment>Quote `$TRAJECTORY_GLOB` or use an array to safely expand the glob. Unquoted variable expansion combined with globbing can break when paths contain spaces or special characters.</comment>

<file context>
@@ -0,0 +1,44 @@
+DIRS=()
+while IFS= read -r d; do
+  DIRS+=("$d")
+done < <(find $TRAJECTORY_GLOB -mindepth 1 -maxdepth 1 -type d | sort)
+
+echo "Found ${#DIRS[@]} trajectory dirs"
</file context>
Fix with Cubic

@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-10-benchmark-instrumentation branch from 57873b1 to eede85a Compare May 15, 2026 21:23
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from 09ba439 to 6774e3d Compare May 15, 2026 21:45
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-10-benchmark-instrumentation branch 2 times, most recently from 3d202fa to e4fbe53 Compare May 15, 2026 22:33
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from 1ae8368 to 573e4e0 Compare May 15, 2026 23:27
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-10-benchmark-instrumentation branch from e4fbe53 to 248b0ea Compare May 15, 2026 23:27
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from 573e4e0 to ed49853 Compare May 16, 2026 04:40
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-10-benchmark-instrumentation branch from 248b0ea to 9326826 Compare May 16, 2026 04:40
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from ed49853 to ec15b9b Compare May 16, 2026 05:50
@miguelg719 miguelg719 force-pushed the miguelgonzalez/verifier-10-benchmark-instrumentation branch from 9326826 to 81ee924 Compare May 16, 2026 05:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant