feat(evals): wire WebTailBench through verifier #2135
Conversation
2 issues found across 4 files
Confidence score: 3/5
- There is moderate merge risk because both findings are medium severity (6/10) with solid confidence, and each can affect correctness of eval outcomes or debugging behavior.
- In packages/evals/framework/verifierAdapter.ts, a rejection from recorder.finish() inside the catch path can mask the original agent failure, making root-cause diagnosis harder and potentially changing observed failure behavior.
- Pay close attention to packages/evals/framework/verifierAdapter.ts and packages/evals/tasks/bench/agent/webtailbench.ts: preserve original thrown errors during persistence failure handling, and validate EVAL_SUCCESS_MODE so invalid values cannot produce an undefined success mapping.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/framework/verifierAdapter.ts">
<violation number="1" location="packages/evals/framework/verifierAdapter.ts:110">
P2: If `recorder.finish()` rejects inside the catch block, the original agent error is lost. Wrap the persistence call in its own try/catch so the original error is always rethrown.</violation>
</file>
<file name="packages/evals/tasks/bench/agent/webtailbench.ts">
<violation number="1" location="packages/evals/tasks/bench/agent/webtailbench.ts:77">
P2: Unvalidated env var cast can silently corrupt eval results. If `EVAL_SUCCESS_MODE` is set to a typo (e.g., `"outcomes"`), `verdictToSuccess` falls through its switch without matching, returns `undefined`, and all evals report `_success: false`. Validate or constrain the value before passing it.</violation>
</file>
Architecture diagram
sequenceDiagram
participant Bench as Bench Task<br/>(webtailbench.ts)
participant Adapter as runWithVerifier<br/>(verifierAdapter.ts)
participant Cache as RubricCache<br/>(rubricCache.ts)
participant Recorder as TrajectoryRecorder
participant Evaluator as V3Evaluator<br/>(verifier backend)
participant Agent as Agent<br/>(agent.execute())
Note over Bench,Agent: WebTailBench scoring pipeline (Wave 1 MVP)
Bench->>Adapter: runWithVerifier({v3, agent, taskSpec, dataset:"webtailbench"})
Note over Adapter,Cache: Resolve Rubric
Adapter->>Adapter: Check precomputedRubric on taskSpec
alt precomputed rubric provided (e.g., upstream WebTailBench)
Adapter->>Adapter: Use precomputed rubric directly
else VERIFIER_DISABLE_RUBRIC_CACHE=1
Adapter->>Evaluator: generateRubric(taskSpec)
Evaluator-->>Adapter: Generated rubric
else cache available
Adapter->>Cache: getOrGenerate(taskSpec, evaluator)
Cache-->>Adapter: Cached or generated rubric
end
Adapter->>Adapter: Build hydratedTaskSpec with resolved rubric
Note over Adapter,Agent: Record Trajectory
Adapter->>Recorder: start()
Recorder->>Recorder: Subscribe to bus events
Adapter->>Agent: execute({instruction, maxSteps})
Agent-->>Adapter: AgentResult
alt Agent throws error
Agent-->>Adapter: Error
Adapter->>Recorder: finish({status:"error"})
Recorder-->>Adapter: Partial trajectory
Adapter->>Adapter: Attach trajectoryDir to error
Adapter-->>Bench: Re-throw error
end
Adapter->>Recorder: finish({status:"complete", finalAnswer, usage})
Recorder-->>Adapter: Recorded Trajectory
Note over Adapter,Evaluator: Verify & Persist
Adapter->>Evaluator: verify(trajectory, hydratedTaskSpec)
Evaluator-->>Adapter: Verdict (outcome + process scores)
Adapter->>Recorder: persistVerdict(verdict)
Adapter-->>Bench: {trajectory, verdict, rubric, trajectoryDir}
Note over Bench: Convert Verdict to Success Flag
Bench->>Bench: verdictToSuccess(verdict, EVAL_SUCCESS_MODE)
Note right of Bench: mode: outcome | process | both<br/>default = outcome
Bench-->>Bench: Return eval result with _success flag
Note over Bench: Cleanup on Error Path
alt Error in Adapter
Bench-->>Bench: Catch error with trajectoryDir
Bench-->>Bench: Return {_success: false, trajectoryDir}
end
Note over Bench,Evaluator: Persistence (env-gated)
opt VERIFIER_PERSIST_TRAJECTORIES set
Recorder->>Recorder: Write trajectory JSON to disk
Recorder->>Recorder: Write verdict JSON to disk
end
opt rubric caching active
Cache->>Cache: Cache rubric to .rubric-cache/webtailbench/<task-id>.json
end
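The verdict-to-success conversion noted in the diagram (mode: outcome | process | both, default outcome) can be sketched as follows. This is illustrative only: the Verdict field names (outcomeSuccess, processSuccess) are assumptions, not the real verifier schema.

```typescript
// Hypothetical Verdict shape; the real type lives in the verifier backend.
type Verdict = { outcomeSuccess: boolean; processSuccess: boolean };

type SuccessMode = "outcome" | "process" | "both";

// Maps a verifier verdict to the bench task's boolean _success flag.
function verdictToSuccess(verdict: Verdict, mode: SuccessMode = "outcome"): boolean {
  switch (mode) {
    case "outcome":
      return verdict.outcomeSuccess;
    case "process":
      return verdict.processSuccess;
    case "both":
      // Both the final outcome and the process score must pass.
      return verdict.outcomeSuccess && verdict.processSuccess;
  }
}
```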
  });
} catch (e) {
  recorderStatus = "error";
  const trajectory = await recorder.finish({ status: recorderStatus });
P2: If recorder.finish() rejects inside the catch block, the original agent error is lost. Wrap the persistence call in its own try/catch so the original error is always rethrown.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/framework/verifierAdapter.ts, line 110:
<comment>If `recorder.finish()` rejects inside the catch block, the original agent error is lost. Wrap the persistence call in its own try/catch so the original error is always rethrown.</comment>
<file context>
@@ -0,0 +1,160 @@
+ });
+ } catch (e) {
+ recorderStatus = "error";
+ const trajectory = await recorder.finish({ status: recorderStatus });
+ // Re-throw after persisting so the bench task can decide how to report.
+ const wrapped = e instanceof Error ? e : new Error(String(e));
</file context>
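A minimal sketch of the suggested fix: wrap the persistence call in its own try/catch so the original agent error always propagates. runAgentStep and the parameter shapes below are hypothetical stand-ins for the real adapter code, assumed for illustration only.

```typescript
// Hypothetical helper mirroring the catch path in the diff above.
async function runAgentStep(
  agent: { execute: () => Promise<unknown> },
  recorder: { finish: (opts: { status: string }) => Promise<unknown> },
): Promise<unknown> {
  try {
    return await agent.execute();
  } catch (e) {
    // Capture the original failure first so nothing below can replace it.
    const wrapped = e instanceof Error ? e : new Error(String(e));
    try {
      // Best-effort persistence: a rejection here must not mask the agent error.
      await recorder.finish({ status: "error" });
    } catch (persistErr) {
      console.warn("recorder.finish() failed during error handling:", persistErr);
    }
    throw wrapped; // the original agent failure always propagates
  }
}
```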
  }),
);
}
const successMode =
P2: Unvalidated env var cast can silently corrupt eval results. If EVAL_SUCCESS_MODE is set to a typo (e.g., "outcomes"), verdictToSuccess falls through its switch without matching, returns undefined, and all evals report _success: false. Validate or constrain the value before passing it.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/tasks/bench/agent/webtailbench.ts, line 77:
<comment>Unvalidated env var cast can silently corrupt eval results. If `EVAL_SUCCESS_MODE` is set to a typo (e.g., `"outcomes"`), `verdictToSuccess` falls through its switch without matching, returns `undefined`, and all evals report `_success: false`. Validate or constrain the value before passing it.</comment>
<file context>
@@ -41,70 +56,62 @@ export default defineBenchTask(
- }),
- );
- }
+ const successMode =
+ (process.env.EVAL_SUCCESS_MODE as "outcome" | "process" | "both") ||
+ "outcome";
</file context>
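A minimal sketch of the suggested validation, replacing the unchecked cast above. resolveSuccessMode is a hypothetical helper name; the mode list and default come from the snippet, but the fail-fast behavior is an assumed design choice.

```typescript
type SuccessMode = "outcome" | "process" | "both";
const SUCCESS_MODES: readonly SuccessMode[] = ["outcome", "process", "both"];

// Validates EVAL_SUCCESS_MODE instead of casting it blindly.
function resolveSuccessMode(raw: string | undefined): SuccessMode {
  if (raw === undefined || raw === "") return "outcome"; // documented default
  if ((SUCCESS_MODES as readonly string[]).includes(raw)) {
    return raw as SuccessMode;
  }
  // Fail fast rather than letting a typo silently map every eval to _success: false.
  throw new Error(
    `Invalid EVAL_SUCCESS_MODE "${raw}"; expected one of: ${SUCCESS_MODES.join(", ")}`,
  );
}
```

Callers would then use `resolveSuccessMode(process.env.EVAL_SUCCESS_MODE)` in place of the raw cast.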
Why
The eval harness needs one reusable path for running an agent, recording its trajectory, and scoring it with the verifier. This PR applies that path to WebTailBench first as the narrowest dataset migration.
What Changed
- Added runWithVerifier for rubric resolution, trajectory recording, verifier scoring, and result persistence.
- Success mode defaults to outcome.
- Eval results are tagged backend: "verifier".
Tests
- pnpm --filter @browserbasehq/stagehand run typecheck
- pnpm --filter @browserbasehq/stagehand-evals run typecheck
- pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts
- git diff --check