feat(evals): add verifier harness adapters #2137
Open
miguelg719 wants to merge 5 commits into
1 issue found across 7 files
Confidence score: 3/5
- There is a concrete regression risk in `packages/evals/framework/claudeCodeRunner.ts`: the `V3Evaluator` call is missing `{ backend: "verifier" }`, while other call sites include it.
- Given the 6/10 severity and high confidence (8/10), this could route verification to the wrong backend and affect evaluation behavior in user-visible ways.
- Pay close attention to `packages/evals/framework/claudeCodeRunner.ts`: ensure the evaluator is initialized with the correct backend option to avoid misrouted verification.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/framework/claudeCodeRunner.ts">
<violation number="1" location="packages/evals/framework/claudeCodeRunner.ts:294">
P2: Missing `{ backend: "verifier" }` option in V3Evaluator constructor. Every other usage in the codebase passes this option — omitting it here may route verification through the wrong backend.</violation>
</file>
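To make the finding above concrete, here is a minimal sketch of how an omitted options object can silently change routing. `EvaluatorSketch` is a hypothetical stand-in, not the real `V3Evaluator`: the two-argument constructor shape and the `"legacy"` default backend are assumptions made only for illustration.

```typescript
// Hypothetical stand-in for V3Evaluator. The real constructor signature and
// default backend are not shown on this page and are assumed here.
type Backend = "legacy" | "verifier";

interface EvaluatorOptions {
  backend?: Backend;
}

class EvaluatorSketch {
  readonly backend: Backend;

  constructor(_v3: unknown, options: EvaluatorOptions = {}) {
    // When the option is omitted, the evaluator falls back to its default.
    this.backend = options.backend ?? "legacy";
  }
}

const v3 = {}; // placeholder for the instance the runner passes in

// Call site shaped like the one flagged by the review: no options object.
const flagged = new EvaluatorSketch(v3);

// Call site shaped like the other usages the review describes.
const expected = new EvaluatorSketch(v3, { backend: "verifier" });

console.log(flagged.backend, expected.backend);
```

If the real default differs from `"verifier"`, the flagged call site would compile and run without error, which is why the omission is easy to miss.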
Architecture diagram
sequenceDiagram
participant CC as Claude Code Agent (SDK)
participant CX as Codex Agent (SDK)
participant CR as ClaudeCode Runner
participant XR as Codex Runner
participant CA as ClaudeCodeAdapter
participant XA as CodexAdapter
participant TA as TrajectoryAdapter (base)
participant PT as persistAdapterTrajectory
participant VE as V3Evaluator
participant RC as RubricCache
participant FS as File System (.trajectories/)
participant LOG as EvalLogger
Note over CC,FS: NEW: External Harness Verifier Flow
par Claude Code Flow
CC->>CR: SDK message stream (tool_use, tool_result, result)
CR->>CR: accumulate messages[], extract finalAnswer, usage, timing
alt verifier config present
CR->>CA: fromHarnessResult(messages, taskSpec)
CA->>CA: parse tool_use blocks → tool calls
CA->>CA: match tool_use_id → tool_result
CA->>CA: extract reasoning from text blocks
CA->>TA: buildTrajectory(toolCalls, taskSpec)
TA-->>CA: Trajectory with empty probeEvidence
CA-->>CR: Trajectory (2 steps, reasoning folded)
CR->>VE: import V3Evaluator(v3)
alt precomputedRubric in taskSpec
VE->>CR: use existing rubric
else VERIFIER_DISABLE_RUBRIC_CACHE=1
VE->>VE: generateRubric(taskSpec)
else rubric cache enabled
VE->>RC: getOrGenerate(taskSpec, evaluator)
RC-->>VE: cached/generated rubric
end
VE->>VE: verify(trajectory, hydratedSpec)
VE-->>CR: Verdict (outcomeSuccess, processScore)
CR->>PT: persistAdapterTrajectory(trajectory, taskSpec, verdict)
alt persistence enabled
PT->>FS: mkdir <outputRoot>/<runId>/<task.id>/
PT->>FS: write trajectory.json
PT->>FS: write task_data.json
PT->>FS: write times.json
PT->>FS: write scores/mmrubric_v1.json
PT->>FS: write core.log
FS-->>PT: files written
else persistence disabled
PT-->>CR: { persisted: false }
end
PT-->>CR: { directory, persisted }
CR->>LOG: log verdict details
CR-->>CC: TaskResult with verifier fields
else no verifier config
CR-->>CC: TaskResult (legacy mode)
end
end
par Codex Flow
CX->>XR: ThreadEvent stream (item.completed, turn.completed)
XR->>XR: accumulate events[], extract usage, timing
alt verifier config present
XR->>XA: fromHarnessResult(events, taskSpec)
XA->>XA: parse item.completed events
XA->>XA: command_execution → tool call (bash/browse)
XA->>XA: mcp_tool_call → tool call (server.tool)
XA->>XA: reasoning items → fold into next tool call
XA->>XA: agent_message → finalAnswer
XA->>TA: buildTrajectory(toolCalls, taskSpec)
TA-->>XA: Trajectory with empty probeEvidence
XA-->>XR: Trajectory (n steps, structured_content→json)
XR->>VE: import V3Evaluator(v3)
alt precomputedRubric in taskSpec
VE->>XR: use existing rubric
else VERIFIER_DISABLE_RUBRIC_CACHE=1
VE->>VE: generateRubric(taskSpec)
else rubric cache enabled
VE->>RC: getOrGenerate(taskSpec, evaluator)
RC-->>VE: cached/generated rubric
end
VE->>VE: verify(trajectory, hydratedSpec)
VE-->>XR: Verdict (outcomeSuccess, processScore)
XR->>PT: persistAdapterTrajectory(trajectory, taskSpec, verdict)
alt persistence enabled
PT->>FS: mkdir <outputRoot>/<runId>/<task.id>/
PT->>FS: write trajectory.json
PT->>FS: write task_data.json
PT->>FS: write times.json
PT->>FS: write scores/mmrubric_v1.json
PT->>FS: write core.log
FS-->>PT: files written
else persistence disabled
PT-->>XR: { persisted: false }
end
PT-->>XR: { directory, persisted }
XR->>LOG: log verdict details
XR-->>CX: TaskResult with verifier fields
else no verifier config
XR-->>CX: TaskResult (legacy mode)
end
end
Note over CA,XA: NEW: Harness Adapters (pure functions, no browser)
Note over TA: NEW: NormalizedToolCall + Trajectory base
Note over PT: NEW: Disk layout matching TrajectoryRecorder
Note over VE: NEW: Verifier integration in both runners
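The Claude Code adapter steps in the diagram (parse `tool_use` blocks, match `tool_use_id` to `tool_result`, fold text into reasoning) can be sketched roughly as follows. The block types are simplified, and `ToolCallSketch` and `pairToolCalls` are hypothetical names; the PR's actual `NormalizedToolCall` shape may differ.

```typescript
// Simplified content blocks; the real SDK message types carry more fields.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: unknown };

// Hypothetical normalized shape, assumed for this sketch.
interface ToolCallSketch {
  name: string;
  args: unknown;
  result?: unknown;
  reasoning?: string;
}

function pairToolCalls(blocks: ContentBlock[]): ToolCallSketch[] {
  const calls: ToolCallSketch[] = [];
  const byId = new Map<string, ToolCallSketch>();
  let pendingText: string[] = [];

  for (const block of blocks) {
    if (block.type === "text") {
      // Hold text until the next tool_use so it can be folded in as reasoning.
      pendingText.push(block.text);
    } else if (block.type === "tool_use") {
      const call: ToolCallSketch = { name: block.name, args: block.input };
      if (pendingText.length > 0) call.reasoning = pendingText.join("\n");
      pendingText = [];
      byId.set(block.id, call);
      calls.push(call);
    } else {
      // tool_result: attach to the call with the matching id, if any.
      const call = byId.get(block.tool_use_id);
      if (call) call.result = block.content;
    }
  }
  // Trailing text is ignored here; in the diagram the runner extracts
  // finalAnswer separately.
  return calls;
}

const sample = pairToolCalls([
  { type: "text", text: "List the files first." },
  { type: "tool_use", id: "t1", name: "Bash", input: { command: "ls" } },
  { type: "tool_result", tool_use_id: "t1", content: "a.txt" },
  { type: "tool_use", id: "t2", name: "Bash", input: { command: "cat a.txt" } },
  { type: "tool_result", tool_use_id: "t2", content: "hello" },
]);

console.log(sample.length, sample.map((c) => c.name));
```

Keying pending calls by id rather than by position keeps the pairing correct even when results arrive out of order or a result is missing.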
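The Codex side of the diagram maps `item.completed` events by item type. A rough sketch of that mapping is below, under the assumption that items look like the simplified shapes declared here; `adaptCodexEvents` and `CodexCallSketch` are hypothetical names, and the bash/browse distinction from the diagram is collapsed to `"bash"` for brevity.

```typescript
// Simplified item shapes named after the item types in the diagram. The real
// Codex SDK event types carry more fields; these fields are assumptions.
type CodexItem =
  | { type: "reasoning"; text: string }
  | { type: "command_execution"; command: string; aggregated_output: string }
  | { type: "mcp_tool_call"; server: string; tool: string }
  | { type: "agent_message"; text: string };

type CodexEvent =
  | { type: "item.completed"; item: CodexItem }
  | { type: "turn.completed" };

// Hypothetical normalized shape, assumed for this sketch.
interface CodexCallSketch {
  name: string;
  args: Record<string, unknown>;
  result: string | undefined;
  reasoning: string | undefined;
}

function adaptCodexEvents(events: CodexEvent[]): {
  toolCalls: CodexCallSketch[];
  finalAnswer: string | undefined;
} {
  const toolCalls: CodexCallSketch[] = [];
  let pending: string[] = [];
  let finalAnswer: string | undefined;

  // Returns accumulated reasoning (if any) and clears the buffer.
  const takeReasoning = (): string | undefined => {
    const text = pending.length > 0 ? pending.join("\n") : undefined;
    pending = [];
    return text;
  };

  for (const event of events) {
    if (event.type !== "item.completed") continue;
    const item = event.item;
    switch (item.type) {
      case "reasoning":
        pending.push(item.text); // folded into the next tool call
        break;
      case "command_execution":
        toolCalls.push({
          name: "bash",
          args: { command: item.command },
          result: item.aggregated_output,
          reasoning: takeReasoning(),
        });
        break;
      case "mcp_tool_call":
        toolCalls.push({
          name: `${item.server}.${item.tool}`,
          args: {},
          result: undefined,
          reasoning: takeReasoning(),
        });
        break;
      case "agent_message":
        finalAnswer = item.text; // the last agent message wins
        break;
    }
  }
  return { toolCalls, finalAnswer };
}

const adapted = adaptCodexEvents([
  { type: "item.completed", item: { type: "reasoning", text: "List files." } },
  { type: "item.completed", item: { type: "command_execution", command: "ls", aggregated_output: "a.txt" } },
  { type: "item.completed", item: { type: "mcp_tool_call", server: "files", tool: "read" } },
  { type: "item.completed", item: { type: "agent_message", text: "Done." } },
  { type: "turn.completed" },
]);

console.log(adapted.toolCalls.map((c) => c.name), adapted.finalAnswer);
```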
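The rubric branch in the diagram has a fixed precedence: a precomputed rubric wins, then `VERIFIER_DISABLE_RUBRIC_CACHE=1` forces fresh generation, and otherwise the cache is consulted. A minimal sketch of that ordering follows. All type and function names here are hypothetical, and the calls are synchronous for brevity, whereas the real ones are presumably asynchronous.

```typescript
// Hypothetical shapes; the PR's TaskSpec, rubric, and RubricCache types are
// not shown on this page.
interface Rubric {
  criteria: string[];
}
interface TaskSpecSketch {
  id: string;
  precomputedRubric?: Rubric;
}
interface RubricGenerator {
  generateRubric(spec: TaskSpecSketch): Rubric;
}
interface RubricCacheSketch {
  getOrGenerate(spec: TaskSpecSketch, evaluator: RubricGenerator): Rubric;
}

function resolveRubric(
  spec: TaskSpecSketch,
  evaluator: RubricGenerator,
  cache: RubricCacheSketch,
  env: Record<string, string | undefined>,
): Rubric {
  // 1. A rubric supplied with the task is used as-is.
  if (spec.precomputedRubric) return spec.precomputedRubric;
  // 2. With the cache disabled, always generate a fresh rubric.
  if (env["VERIFIER_DISABLE_RUBRIC_CACHE"] === "1") return evaluator.generateRubric(spec);
  // 3. Otherwise defer to the cache, which generates on a miss.
  return cache.getOrGenerate(spec, evaluator);
}

// In-memory stand-ins to exercise the three branches.
let generated = 0;
const evaluator: RubricGenerator = {
  generateRubric(spec) {
    generated += 1;
    return { criteria: [`complete ${spec.id}`] };
  },
};
const store = new Map<string, Rubric>();
const cache: RubricCacheSketch = {
  getOrGenerate(spec, ev) {
    let rubric = store.get(spec.id);
    if (!rubric) {
      rubric = ev.generateRubric(spec);
      store.set(spec.id, rubric);
    }
    return rubric;
  },
};

const spec: TaskSpecSketch = { id: "task-1" };
resolveRubric(spec, evaluator, cache, {}); // cache miss: generates
resolveRubric(spec, evaluator, cache, {}); // cache hit: no generation
resolveRubric(spec, evaluator, cache, { VERIFIER_DISABLE_RUBRIC_CACHE: "1" }); // bypass: generates
const pre = resolveRubric(
  { id: "task-2", precomputedRubric: { criteria: ["given"] } },
  evaluator,
  cache,
  {},
);

console.log(generated, pre.criteria);
```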
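For reference, the on-disk layout that `persistAdapterTrajectory` writes in the diagram can be expressed as a small pure function. The file names come from the diagram; the function name `trajectoryLayout`, the `/` separator, and the example identifiers are assumptions for this sketch.

```typescript
// File names are taken from the diagram. Joining with "/" and the function
// name are assumptions made for this sketch.
function trajectoryLayout(
  outputRoot: string,
  runId: string,
  taskId: string,
): { directory: string; files: string[] } {
  const directory = [outputRoot, runId, taskId].join("/");
  const names = [
    "trajectory.json",
    "task_data.json",
    "times.json",
    "scores/mmrubric_v1.json",
    "core.log",
  ];
  return { directory, files: names.map((name) => `${directory}/${name}`) };
}

// Example identifiers are placeholders, not values from the PR.
const layout = trajectoryLayout(".trajectories", "run-001", "task-1");
console.log(layout.files.join("\n"));
```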
Why
External harnesses such as Claude Code and Codex should produce verifier-compatible trajectories without requiring Stagehand agent execution. This makes saved-trajectory verification usable across harnesses.
What Changed
- … `V3Evaluator` with `backend: "verifier"`.
- … `EVAL_SUCCESS_MODE` casts from external harness runners.
Tests
- `pnpm --filter @browserbasehq/stagehand-evals run typecheck`
- `pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts`
- `node --import tsx packages/evals/scripts/verify-harness-adapters.ts`
- `git diff --check`