feat(evals): add verifier harness adapters by miguelg719 · Pull Request #2137 · browserbase/stagehand

miguelg719 · 2026-05-15T20:58:53Z

Why

External harnesses such as Claude Code and Codex should produce verifier-compatible trajectories without requiring Stagehand agent execution. This makes saved-trajectory verification usable across harnesses.

What Changed

Added Claude Code trajectory adapter.
Added Codex trajectory adapter.
Added external harness trajectory persistence helper.
Added adapter smoke script covering trajectory conversion and persistence.
Updated Codex and Claude runner plumbing for verifier-compatible artifacts.
Ensured both external harness runners instantiate V3Evaluator with backend: "verifier".
Removed repeated unsafe EVAL_SUCCESS_MODE casts from external harness runners.
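
The last item, replacing unsafe `EVAL_SUCCESS_MODE` casts, can be illustrated with a small parse-and-validate helper. This is only a sketch and not the code from this PR: the mode names (`"outcome"`, `"process"`), the default, and the helper names `isSuccessMode` and `parseSuccessMode` are hypothetical placeholders.

```typescript
// Hypothetical set of allowed modes; the real values live in the evals package.
const SUCCESS_MODES = ["outcome", "process"] as const;
type SuccessMode = (typeof SUCCESS_MODES)[number];

// Type guard: narrows an arbitrary string to SuccessMode without a cast at the call site.
function isSuccessMode(value: string): value is SuccessMode {
  return (SUCCESS_MODES as readonly string[]).includes(value);
}

// Parse a raw env value once. Unset or blank input yields the fallback;
// unknown values throw instead of being silently cast.
function parseSuccessMode(
  raw: string | undefined,
  fallback: SuccessMode = "outcome",
): SuccessMode {
  if (raw === undefined || raw.trim() === "") return fallback;
  const normalized = raw.trim().toLowerCase();
  if (!isSuccessMode(normalized)) {
    throw new Error(
      `Invalid EVAL_SUCCESS_MODE "${raw}"; expected one of: ${SUCCESS_MODES.join(", ")}`,
    );
  }
  return normalized;
}
```

A runner would then call something like `parseSuccessMode(process.env.EVAL_SUCCESS_MODE)` once at startup, so a typo in the environment fails fast rather than flowing through as an unchecked string.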

Tests

pnpm --filter @browserbasehq/stagehand-evals run typecheck
pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts
node --import tsx packages/evals/scripts/verify-harness-adapters.ts
git diff --check

changeset-bot · 2026-05-15T20:59:11Z

⚠️ No Changeset found

Latest commit: ec15b9b

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


cubic-dev-ai

1 issue found across 7 files

Confidence score: 3/5

There is a concrete regression risk in packages/evals/framework/claudeCodeRunner.ts: the V3Evaluator call is missing { backend: "verifier" }, while other call sites include it.
Given the 6/10 severity and high confidence (8/10), this could route verification to the wrong backend and affect evaluation behavior in user-visible ways.
Pay close attention to packages/evals/framework/claudeCodeRunner.ts - ensure the evaluator is initialized with the correct backend option to avoid misrouted verification.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/framework/claudeCodeRunner.ts">

<violation number="1" location="packages/evals/framework/claudeCodeRunner.ts:294">
P2: Missing `{ backend: "verifier" }` option in V3Evaluator constructor. Every other usage in the codebase passes this option — omitting it here may route verification through the wrong backend.</violation>
</file>
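
One way to keep a call site from omitting the backend option is to route every external harness runner through a single construction helper. The sketch below uses a stand-in class, not the real `V3Evaluator`: `StubEvaluator`, `EvaluatorOptions`, the `"legacy"` backend name, and `createHarnessEvaluator` are all hypothetical and only illustrate the pattern.

```typescript
// Stand-in types for illustration only; this is NOT the real V3Evaluator API.
type EvaluatorBackend = "verifier" | "legacy";

interface EvaluatorOptions {
  // Required (not optional), so omitting it is a compile-time error.
  backend: EvaluatorBackend;
}

class StubEvaluator {
  readonly client: unknown;
  readonly backend: EvaluatorBackend;

  constructor(client: unknown, options: EvaluatorOptions) {
    this.client = client;
    this.backend = options.backend;
  }
}

// Single construction point shared by all external harness runners,
// so the backend choice is made in exactly one place.
function createHarnessEvaluator(client: unknown): StubEvaluator {
  return new StubEvaluator(client, { backend: "verifier" });
}
```

With this shape, both the Claude Code and Codex runners would call `createHarnessEvaluator(...)` instead of constructing the evaluator inline.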

Architecture diagram

sequenceDiagram
    participant CC as Claude Code Agent (SDK)
    participant CX as Codex Agent (SDK)
    participant CR as ClaudeCode Runner
    participant XR as Codex Runner
    participant CA as ClaudeCodeAdapter
    participant XA as CodexAdapter
    participant TA as TrajectoryAdapter (base)
    participant PT as persistAdapterTrajectory
    participant VE as V3Evaluator
    participant RC as RubricCache
    participant FS as File System (.trajectories/)
    participant LOG as EvalLogger

    Note over CC,FS: NEW: External Harness Verifier Flow

    par Claude Code Flow
        CC->>CR: SDK message stream (tool_use, tool_result, result)
        CR->>CR: accumulate messages[], extract finalAnswer, usage, timing
        alt verifier config present
            CR->>CA: fromHarnessResult(messages, taskSpec)
            CA->>CA: parse tool_use blocks → tool calls
            CA->>CA: match tool_use_id → tool_result
            CA->>CA: extract reasoning from text blocks
            CA->>TA: buildTrajectory(toolCalls, taskSpec)
            TA-->>CA: Trajectory with empty probeEvidence
            CA-->>CR: Trajectory (2 steps, reasoning folded)
            CR->>VE: import V3Evaluator(v3)
            alt precomputedRubric in taskSpec
                VE->>CR: use existing rubric
            else VERIFIER_DISABLE_RUBRIC_CACHE=1
                VE->>VE: generateRubric(taskSpec)
            else rubric cache enabled
                VE->>RC: getOrGenerate(taskSpec, evaluator)
                RC-->>VE: cached/generated rubric
            end
            VE->>VE: verify(trajectory, hydratedSpec)
            VE-->>CR: Verdict (outcomeSuccess, processScore)
            CR->>PT: persistAdapterTrajectory(trajectory, taskSpec, verdict)
            alt persistence enabled
                PT->>FS: mkdir <outputRoot>/<runId>/<task.id>/
                PT->>FS: write trajectory.json
                PT->>FS: write task_data.json
                PT->>FS: write times.json
                PT->>FS: write scores/mmrubric_v1.json
                PT->>FS: write core.log
                FS-->>PT: files written
            else persistence disabled
                PT-->>CR: { persisted: false }
            end
            PT-->>CR: { directory, persisted }
            CR->>LOG: log verdict details
            CR-->>CC: TaskResult with verifier fields
        else no verifier config
            CR-->>CC: TaskResult (legacy mode)
        end
    end

    par Codex Flow
        CX->>XR: ThreadEvent stream (item.completed, turn.completed)
        XR->>XR: accumulate events[], extract usage, timing
        alt verifier config present
            XR->>XA: fromHarnessResult(events, taskSpec)
            XA->>XA: parse item.completed events
            XA->>XA: command_execution → tool call (bash/browse)
            XA->>XA: mcp_tool_call → tool call (server.tool)
            XA->>XA: reasoning items → fold into next tool call
            XA->>XA: agent_message → finalAnswer
            XA->>TA: buildTrajectory(toolCalls, taskSpec)
            TA-->>XA: Trajectory with empty probeEvidence
            XA-->>XR: Trajectory (n steps, structured_content→json)
            XR->>VE: import V3Evaluator(v3)
            alt precomputedRubric in taskSpec
                VE->>XR: use existing rubric
            else VERIFIER_DISABLE_RUBRIC_CACHE=1
                VE->>VE: generateRubric(taskSpec)
            else rubric cache enabled
                VE->>RC: getOrGenerate(taskSpec, evaluator)
                RC-->>VE: cached/generated rubric
            end
            VE->>VE: verify(trajectory, hydratedSpec)
            VE-->>XR: Verdict (outcomeSuccess, processScore)
            XR->>PT: persistAdapterTrajectory(trajectory, taskSpec, verdict)
            alt persistence enabled
                PT->>FS: mkdir <outputRoot>/<runId>/<task.id>/
                PT->>FS: write trajectory.json
                PT->>FS: write task_data.json
                PT->>FS: write times.json
                PT->>FS: write scores/mmrubric_v1.json
                PT->>FS: write core.log
                FS-->>PT: files written
            else persistence disabled
                PT-->>XR: { persisted: false }
            end
            PT-->>XR: { directory, persisted }
            XR->>LOG: log verdict details
            XR-->>CX: TaskResult with verifier fields
        else no verifier config
            XR-->>CX: TaskResult (legacy mode)
        end
    end

    Note over CA,XA: NEW: Harness Adapters (pure functions, no browser)
    Note over TA: NEW: NormalizedToolCall + Trajectory base
    Note over PT: NEW: Disk layout matching TrajectoryRecorder
    Note over VE: NEW: Verifier integration in both runners
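
The Claude Code adapter steps in the diagram (parse `tool_use` blocks, match each `tool_use_id` to its `tool_result`, fold text into reasoning) can be sketched as a pure function. This is a minimal sketch under assumptions: the message and content-block shapes are simplified stand-ins, and the fields of `NormalizedToolCall` as well as the function name `toToolCalls` are hypothetical rather than the types added in this PR.

```typescript
// Simplified stand-ins for harness message content blocks (assumed shapes).
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: unknown; is_error?: boolean };

interface HarnessMessage {
  role: "assistant" | "user";
  content: ContentBlock[];
}

// Hypothetical normalized shape; field names are illustrative.
interface NormalizedToolCall {
  id: string;
  name: string;
  input: unknown;
  reasoning?: string;
  output?: unknown;
  isError: boolean;
}

function toToolCalls(messages: HarnessMessage[]): NormalizedToolCall[] {
  const calls: NormalizedToolCall[] = [];
  const byId = new Map<string, NormalizedToolCall>();
  let pendingReasoning: string[] = [];

  for (const message of messages) {
    for (const block of message.content) {
      if (block.type === "text") {
        // Buffer assistant text so it can be folded into the next tool call.
        if (message.role === "assistant") pendingReasoning.push(block.text);
      } else if (block.type === "tool_use") {
        const call: NormalizedToolCall = {
          id: block.id,
          name: block.name,
          input: block.input,
          isError: false,
        };
        if (pendingReasoning.length > 0) call.reasoning = pendingReasoning.join("\n");
        pendingReasoning = [];
        calls.push(call);
        byId.set(block.id, call);
      } else {
        // tool_result: attach the output to the call with the matching id, if any.
        const call = byId.get(block.tool_use_id);
        if (call) {
          call.output = block.content;
          call.isError = block.is_error ?? false;
        }
      }
    }
  }
  return calls;
}
```

Results whose `tool_use_id` matches no recorded call are dropped here, and assistant text after the last tool call is left unfolded; in the diagram the runner extracts the final answer separately.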

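
The rubric selection shown in both flows of the diagram (precomputed rubric first, then fresh generation when `VERIFIER_DISABLE_RUBRIC_CACHE=1`, otherwise the cache) can be sketched as a three-branch resolver. This is a sketch under assumptions: the `Rubric`, `TaskSpec`, and `RubricSources` shapes and the function `resolveRubric` are hypothetical, and the generator and cache are injected so the example stays self-contained.

```typescript
// Hypothetical minimal shapes; the real TaskSpec and rubric types are richer.
interface Rubric {
  criteria: string[];
}

interface TaskSpec {
  id: string;
  instruction: string;
  precomputedRubric?: Rubric;
}

type Generate = (spec: TaskSpec) => Promise<Rubric>;

interface RubricSources {
  generate: Generate;
  cache: { getOrGenerate: (spec: TaskSpec, generate: Generate) => Promise<Rubric> };
}

async function resolveRubric(
  spec: TaskSpec,
  sources: RubricSources,
  env: Record<string, string | undefined>,
): Promise<Rubric> {
  // 1. A rubric shipped with the task is used as-is.
  if (spec.precomputedRubric) return spec.precomputedRubric;
  // 2. Cache explicitly disabled: always generate a fresh rubric.
  if (env["VERIFIER_DISABLE_RUBRIC_CACHE"] === "1") return sources.generate(spec);
  // 3. Otherwise go through the cache, which generates on a miss.
  return sources.cache.getOrGenerate(spec, sources.generate);
}
```

Passing `env` as a parameter instead of reading the process environment inside the function keeps the precedence order easy to test.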

cubic-dev-ai Bot reviewed May 15, 2026

Comment thread packages/evals/framework/claudeCodeRunner.ts Outdated

miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from f7d8f2d to 9ffa4e9 on May 15, 2026 21:23

miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch 2 times, most recently from 09ba439 to 6774e3d on May 15, 2026 21:45

miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch 2 times, most recently from 2a24116 to 71ed229 on May 15, 2026 22:33

miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from 6774e3d to 1ae8368 on May 15, 2026 22:33

miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 71ed229 to 2f80063 on May 15, 2026 23:27

miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch 2 times, most recently from 573e4e0 to ed49853 on May 16, 2026 04:40

miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 2f80063 to 1b60889 on May 16, 2026 04:40

miguelg719 added 5 commits May 15, 2026 22:49

feat(evals): add verifier harness adapters

6568ed4

fix(evals): route harnesses through verifier

7d98da5

fix(evals): validate external harness success mode

442fac4

test(evals): cover persisted trajectory images

d434747

fix(evals): align harness verifier result API

ec15b9b

miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 1b60889 to bf49158 on May 16, 2026 05:50

miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from ed49853 to ec15b9b on May 16, 2026 05:50

miguelg719 wants to merge 5 commits into miguelgonzalez/verifier-08-dataset-migrations from miguelgonzalez/verifier-09-harness-adapters