feat(evals): add verifier harness adapters#2137

Open
miguelg719 wants to merge 5 commits into miguelgonzalez/verifier-08-dataset-migrations from miguelgonzalez/verifier-09-harness-adapters

Conversation

@miguelg719
Collaborator

@miguelg719 miguelg719 commented May 15, 2026

Why

External harnesses such as Claude Code and Codex should produce verifier-compatible trajectories without requiring Stagehand agent execution. This makes saved-trajectory verification usable across harnesses.

What Changed

  • Added Claude Code trajectory adapter.
  • Added Codex trajectory adapter.
  • Added external harness trajectory persistence helper.
  • Added adapter smoke script covering trajectory conversion and persistence.
  • Updated Codex and Claude runner plumbing for verifier-compatible artifacts.
  • Ensured both external harness runners instantiate V3Evaluator with backend: "verifier".
  • Removed repeated unsafe EVAL_SUCCESS_MODE casts from external harness runners.
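
As a rough illustration of what trajectory conversion involves for Claude Code, the sketch below pairs each `tool_use` block with its `tool_result` by id and folds any preceding assistant text into the call as reasoning. The message shape, the `NormalizedToolCall` fields, and the `toToolCalls` name are simplified assumptions for illustration, not the adapter's actual types or API.

```typescript
// Simplified content blocks, loosely modeled on Anthropic-style messages.
// These local types are assumptions for illustration, not the SDK's types.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: unknown };

interface HarnessMessage {
  role: "assistant" | "user";
  content: Block[];
}

// Hypothetical normalized shape; the PR's NormalizedToolCall may differ.
interface NormalizedToolCall {
  name: string;
  args: unknown;
  result?: unknown;
  reasoning?: string;
}

function toToolCalls(messages: HarnessMessage[]): NormalizedToolCall[] {
  const calls: NormalizedToolCall[] = [];
  const byId = new Map<string, NormalizedToolCall>();
  let pendingReasoning: string[] = [];

  for (const message of messages) {
    for (const block of message.content) {
      if (block.type === "text") {
        // Assistant text seen before a tool call is treated as its reasoning.
        if (message.role === "assistant") pendingReasoning.push(block.text);
      } else if (block.type === "tool_use") {
        const call: NormalizedToolCall = { name: block.name, args: block.input };
        if (pendingReasoning.length > 0) {
          call.reasoning = pendingReasoning.join("\n");
          pendingReasoning = [];
        }
        calls.push(call);
        byId.set(block.id, call);
      } else {
        // tool_result: attach the output to the call with the matching id.
        const call = byId.get(block.tool_use_id);
        if (call) call.result = block.content;
      }
    }
  }
  return calls;
}
```

Because the function only reads the saved message list, it needs no browser or live agent session, which is the property that makes saved-trajectory verification possible.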

Tests

  • pnpm --filter @browserbasehq/stagehand-evals run typecheck
  • pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts
  • node --import tsx packages/evals/scripts/verify-harness-adapters.ts
  • git diff --check

@changeset-bot

changeset-bot Bot commented May 15, 2026

⚠️ No Changeset found

Latest commit: ec15b9b

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 7 files

Confidence score: 3/5

  • There is a concrete regression risk in packages/evals/framework/claudeCodeRunner.ts: the V3Evaluator call is missing { backend: "verifier" }, while other call sites include it.
  • Given the 6/10 severity and high confidence (8/10), this could route verification to the wrong backend and affect evaluation behavior in user-visible ways.
  • Pay close attention to packages/evals/framework/claudeCodeRunner.ts - ensure the evaluator is initialized with the correct backend option to avoid misrouted verification.
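
One way to make this kind of omission harder to repeat is to construct the evaluator in a single shared helper instead of in each runner. The sketch below is only an illustration: `StubEvaluator` stands in for `V3Evaluator`, whose real constructor signature is not shown in this thread, and `createVerifierEvaluator` and the `"default"` fallback are hypothetical.

```typescript
interface EvaluatorOptions {
  backend?: string;
}

// Stand-in for the real V3Evaluator; the constructor signature is assumed.
class StubEvaluator {
  readonly client: unknown;
  readonly backend: string;

  constructor(client: unknown, options: EvaluatorOptions = {}) {
    this.client = client;
    // Assumed fallback: what an omitted option resolves to is not stated in the PR.
    this.backend = options.backend ?? "default";
  }
}

// Single construction site that every external harness runner would call,
// so the backend option cannot be left out at an individual call site.
function createVerifierEvaluator(client: unknown): StubEvaluator {
  return new StubEvaluator(client, { backend: "verifier" });
}
```

With a helper like this, a runner that previously called the constructor directly would call `createVerifierEvaluator(v3)` instead.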
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/framework/claudeCodeRunner.ts">

<violation number="1" location="packages/evals/framework/claudeCodeRunner.ts:294">
P2: Missing `{ backend: "verifier" }` option in V3Evaluator constructor. Every other usage in the codebase passes this option — omitting it here may route verification through the wrong backend.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant CC as Claude Code Agent (SDK)
    participant CX as Codex Agent (SDK)
    participant CR as ClaudeCode Runner
    participant XR as Codex Runner
    participant CA as ClaudeCodeAdapter
    participant XA as CodexAdapter
    participant TA as TrajectoryAdapter (base)
    participant PT as persistAdapterTrajectory
    participant VE as V3Evaluator
    participant RC as RubricCache
    participant FS as File System (.trajectories/)
    participant LOG as EvalLogger

    Note over CC,FS: NEW: External Harness Verifier Flow

    par Claude Code Flow
        CC->>CR: SDK message stream (tool_use, tool_result, result)
        CR->>CR: accumulate messages[], extract finalAnswer, usage, timing
        alt verifier config present
            CR->>CA: fromHarnessResult(messages, taskSpec)
            CA->>CA: parse tool_use blocks → tool calls
            CA->>CA: match tool_use_id → tool_result
            CA->>CA: extract reasoning from text blocks
            CA->>TA: buildTrajectory(toolCalls, taskSpec)
            TA-->>CA: Trajectory with empty probeEvidence
            CA-->>CR: Trajectory (2 steps, reasoning folded)
            CR->>VE: import V3Evaluator(v3)
            alt precomputedRubric in taskSpec
                VE->>CR: use existing rubric
            else VERIFIER_DISABLE_RUBRIC_CACHE=1
                VE->>VE: generateRubric(taskSpec)
            else rubric cache enabled
                VE->>RC: getOrGenerate(taskSpec, evaluator)
                RC-->>VE: cached/generated rubric
            end
            VE->>VE: verify(trajectory, hydratedSpec)
            VE-->>CR: Verdict (outcomeSuccess, processScore)
            CR->>PT: persistAdapterTrajectory(trajectory, taskSpec, verdict)
            alt persistence enabled
                PT->>FS: mkdir <outputRoot>/<runId>/<task.id>/
                PT->>FS: write trajectory.json
                PT->>FS: write task_data.json
                PT->>FS: write times.json
                PT->>FS: write scores/mmrubric_v1.json
                PT->>FS: write core.log
                FS-->>PT: files written
            else persistence disabled
                PT-->>CR: { persisted: false }
            end
            PT-->>CR: { directory, persisted }
            CR->>LOG: log verdict details
            CR-->>CC: TaskResult with verifier fields
        else no verifier config
            CR-->>CC: TaskResult (legacy mode)
        end
    end

    par Codex Flow
        CX->>XR: ThreadEvent stream (item.completed, turn.completed)
        XR->>XR: accumulate events[], extract usage, timing
        alt verifier config present
            XR->>XA: fromHarnessResult(events, taskSpec)
            XA->>XA: parse item.completed events
            XA->>XA: command_execution → tool call (bash/browse)
            XA->>XA: mcp_tool_call → tool call (server.tool)
            XA->>XA: reasoning items → fold into next tool call
            XA->>XA: agent_message → finalAnswer
            XA->>TA: buildTrajectory(toolCalls, taskSpec)
            TA-->>XA: Trajectory with empty probeEvidence
            XA-->>XR: Trajectory (n steps, structured_content→json)
            XR->>VE: import V3Evaluator(v3)
            alt precomputedRubric in taskSpec
                VE->>XR: use existing rubric
            else VERIFIER_DISABLE_RUBRIC_CACHE=1
                VE->>VE: generateRubric(taskSpec)
            else rubric cache enabled
                VE->>RC: getOrGenerate(taskSpec, evaluator)
                RC-->>VE: cached/generated rubric
            end
            VE->>VE: verify(trajectory, hydratedSpec)
            VE-->>XR: Verdict (outcomeSuccess, processScore)
            XR->>PT: persistAdapterTrajectory(trajectory, taskSpec, verdict)
            alt persistence enabled
                PT->>FS: mkdir <outputRoot>/<runId>/<task.id>/
                PT->>FS: write trajectory.json
                PT->>FS: write task_data.json
                PT->>FS: write times.json
                PT->>FS: write scores/mmrubric_v1.json
                PT->>FS: write core.log
                FS-->>PT: files written
            else persistence disabled
                PT-->>XR: { persisted: false }
            end
            PT-->>XR: { directory, persisted }
            XR->>LOG: log verdict details
            XR-->>CX: TaskResult with verifier fields
        else no verifier config
            XR-->>CX: TaskResult (legacy mode)
        end
    end

    Note over CA,XA: NEW: Harness Adapters (pure functions, no browser)
    Note over TA: NEW: NormalizedToolCall + Trajectory base
    Note over PT: NEW: Disk layout matching TrajectoryRecorder
    Note over VE: NEW: Verifier integration in both runners
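
The persistence branch of the diagram implies a fixed on-disk layout per task. A minimal sketch of that layout as a pure path-planning function follows; `planArtifactLayout` and its input fields are hypothetical names, and paths are joined with `/` only to keep the example self-contained (a real helper would likely use `node:path`).

```typescript
// Hypothetical inputs; the real helper's parameters are not shown in the PR.
interface LayoutInput {
  outputRoot: string;
  runId: string;
  taskId: string;
}

interface ArtifactLayout {
  directory: string;
  files: string[];
}

// File names taken from the diagram's persistence steps.
const ARTIFACT_FILES = [
  "trajectory.json",
  "task_data.json",
  "times.json",
  "scores/mmrubric_v1.json",
  "core.log",
];

function planArtifactLayout({ outputRoot, runId, taskId }: LayoutInput): ArtifactLayout {
  // <outputRoot>/<runId>/<task.id>/
  const directory = [outputRoot, runId, taskId].join("/");
  return {
    directory,
    files: ARTIFACT_FILES.map((name) => `${directory}/${name}`),
  };
}
```

Keeping path planning separate from the writes makes the layout easy to check against what `TrajectoryRecorder` produces without touching the file system.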


Comment thread packages/evals/framework/claudeCodeRunner.ts Outdated
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from f7d8f2d to 9ffa4e9 on May 15, 2026 21:23
miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch 2 times, most recently from 09ba439 to 6774e3d on May 15, 2026 21:45
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch 2 times, most recently from 2a24116 to 71ed229 on May 15, 2026 22:33
miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from 6774e3d to 1ae8368 on May 15, 2026 22:33
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 71ed229 to 2f80063 on May 15, 2026 23:27
miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch 2 times, most recently from 573e4e0 to ed49853 on May 16, 2026 04:40
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 2f80063 to 1b60889 on May 16, 2026 04:40
miguelg719 force-pushed the miguelgonzalez/verifier-08-dataset-migrations branch from 1b60889 to bf49158 on May 16, 2026 05:50
miguelg719 force-pushed the miguelgonzalez/verifier-09-harness-adapters branch from ed49853 to ec15b9b on May 16, 2026 05:50