feat(evals): add offline verifier CLI #2134
Conversation
3 issues found across 9 files
Confidence score: 3/5
- There is some concrete regression risk here: `packages/evals/framework/rubricCache.ts` `read` does not confirm `parsed.taskId === taskSpec.id`, so sanitized ID collisions can return the wrong cached rubric data when hashes align.
- Two CLI/runtime behaviors are likely to confuse users but are straightforward to fix: `packages/evals/tui/commands/verify.ts` silently ignores a trailing `--model`/`--label` without a value, and `packages/evals/scripts/verify-live-trajectory.ts` passes `timeoutMs` to `page.goto()` (ignored), causing fallback to Playwright's default timeout.
- Given one medium-severity correctness issue plus two medium input/timeout handling issues, this looks mergeable with caution after targeted fixes rather than a hard block.
- Pay close attention to `packages/evals/framework/rubricCache.ts`, `packages/evals/tui/commands/verify.ts`, and `packages/evals/scripts/verify-live-trajectory.ts`: cache key/task ID validation, missing flag-value errors, and ignored navigation timeout options need verification.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/tui/commands/verify.ts">
<violation number="1" location="packages/evals/tui/commands/verify.ts:89">
P2: Missing validation when `--model` or `--label` is passed without a following value. If either is the last argument, `args[++i]` is `undefined` and the flag is silently ignored rather than producing an error.</violation>
</file>
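One way to address this is to reject a missing or flag-like value at parse time instead of silently storing `undefined`. This is a minimal sketch, not the PR's code: the `ParsedArgs` shape and the `parseVerifyArgs` name are assumed from the review excerpt.

```typescript
// Sketch of a stricter flag loop; field names follow the review's diff context.
interface ParsedArgs {
  json?: boolean;
  dryRun?: boolean;
  model?: string;
  label?: string;
}

function parseVerifyArgs(args: string[]): ParsedArgs {
  const parsed: ParsedArgs = {};
  for (let i = 0; i < args.length; i++) {
    const a = args[i];
    if (a === "--json") {
      parsed.json = true;
    } else if (a === "--dry-run") {
      parsed.dryRun = true;
    } else if (a === "--model" || a === "--label") {
      const value = args[++i];
      // Fail loudly when the flag is last, or when the "value" is another flag.
      if (value === undefined || value.startsWith("--")) {
        throw new Error(`${a} requires a value`);
      }
      if (a === "--model") parsed.model = value;
      else parsed.label = value;
    }
  }
  return parsed;
}
```

A built-in alternative with the same behavior is Node's `util.parseArgs` with string-typed options, which also errors on a missing value.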
<file name="packages/evals/framework/rubricCache.ts">
<violation number="1" location="packages/evals/framework/rubricCache.ts:95">
P2: The `read` method does not verify `parsed.taskId` matches `taskSpec.id`. Since `entryPath` sanitizes characters (`:`, `/`, etc.) to `_`, distinct task IDs can map to the same file. When instruction hashes also happen to match, a stale/wrong rubric is served silently. Add a `taskId` equality check alongside the `instructionHash` check.</violation>
</file>
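The suggested guard can be sketched as below. This is illustrative only, assuming a cache entry that stores `taskId` and `instructionHash` as the review describes; the surrounding types and the `readCached`/`hashInstruction` names are hypothetical.

```typescript
// Hypothetical cache-entry shape; the real rubricCache.ts types may differ.
interface CacheEntry {
  taskId: string;
  instructionHash: string;
  rubric: unknown;
}

interface TaskSpec {
  id: string;
  instruction: string;
}

function readCached(
  entry: CacheEntry | undefined,
  taskSpec: TaskSpec,
  hashInstruction: (s: string) => string,
): unknown | undefined {
  if (!entry) return undefined;
  // Guard against sanitized-ID collisions: task IDs differing only in
  // characters like ":" or "/" can map to the same entry file, so the
  // stored taskId must match exactly, not just the instruction hash.
  if (entry.taskId !== taskSpec.id) return undefined;
  if (entry.instructionHash !== hashInstruction(taskSpec.instruction)) {
    return undefined;
  }
  return entry.rubric;
}
```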
<file name="packages/evals/scripts/verify-live-trajectory.ts">
<violation number="1" location="packages/evals/scripts/verify-live-trajectory.ts:38">
P2: Playwright's `page.goto()` accepts `timeout`, not `timeoutMs`. This option is silently ignored, so the navigation falls back to the default 30s timeout instead of the intended 60s.</violation>
</file>
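The fix is a one-word option rename. The sketch below uses a minimal structural stand-in for Playwright's `Page` so it stays self-contained; the option names mirror Playwright's documented `page.goto(url, { timeout, waitUntil })` signature, where `timeout` is in milliseconds.

```typescript
// Structural stand-in for the relevant slice of Playwright's Page.goto.
interface GotoOptions {
  timeout?: number; // milliseconds; unknown keys like `timeoutMs` are ignored
  waitUntil?: "load" | "domcontentloaded" | "networkidle" | "commit";
}

interface PageLike {
  goto(url: string, options?: GotoOptions): Promise<void>;
}

async function navigateWithTimeout(page: PageLike, url: string): Promise<void> {
  // Before: page.goto(url, { timeoutMs: 60_000 }) — the unknown key was
  // silently dropped, so navigation fell back to the 30s default.
  await page.goto(url, { timeout: 60_000 });
}
```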
Architecture diagram
sequenceDiagram
participant CLI as CLI / terminal
participant CmdRouter as Command Router (cli.ts)
participant VerifyCmd as verify Command
participant Trajectory as Trajectory Dir (disk)
participant RubricCache as RubricCache
participant V3Eval as V3Evaluator (verifier backend)
participant V3 as V3 instance (headless)
participant TrajectoryRecorder as TrajectoryRecorder (live)
Note over CLI,V3: NEW: Offline verify path (red arrow) vs existing live run (blue arrows)
CLI->>CmdRouter: evals verify <trajectory-dir> [options]
CmdRouter->>VerifyCmd: handleVerify(args)
VerifyCmd->>Trajectory: read trajectory.json + task_data.json
Trajectory-->>VerifyCmd: Trajectory + TaskSpec
VerifyCmd->>V3Eval: new V3Evaluator(v3, {backend:"verifier"})
Note over VerifyCmd,V3Eval: No browser launched — V3 constructed without init()
VerifyCmd->>V3Eval: verify(trajectory, taskSpec)
V3Eval->>RubricCache: getOrGenerate(taskSpec, evaluator)
alt Cache hit (same instruction hash)
RubricCache-->>V3Eval: cached Rubric
else Cache miss or hash drift
RubricCache->>RubricCache: hashInstruction(taskSpec.instruction)
V3Eval->>V3Eval: generateRubric(taskSpec) — Step 0a
V3Eval->>RubricCache: write(taskSpec, rubric)
RubricCache-->>V3Eval: freshly generated Rubric
end
V3Eval->>V3Eval: score trajectory against rubric — Step 8
V3Eval-->>VerifyCmd: Verdict (outcomeSuccess, processScore, perCriterion)
alt --json flag
VerifyCmd->>CLI: JSON stringified Verdict to stdout
else default (human summary)
VerifyCmd->>CLI: colored summary (score, criteria, findings)
alt --dry-run not set
VerifyCmd->>Trajectory: write scores/mmrubric_<label>.json
end
end
Note over CLI,Trajectory: Live run path (unchanged, shown for context)
CLI->>CmdRouter: evals run <target>
CmdRouter->>V3: agent.execute(instruction)
V3->>TrajectoryRecorder: start() — subscribe to bus events
V3->>V3: perform browser automation steps
V3->>TrajectoryRecorder: capture step events (screenshots, URLs, evidence)
V3-->>CmdRouter: agent result
TrajectoryRecorder->>Trajectory: persist() — write trajectory.json, screenshots, task_data.json, times.json
CmdRouter->>CLI: run summary
Note over CmdRouter,V3: Success mode plumbing (--success flag)
CmdRouter->>CmdRouter: resolve successMode from --success / EVAL_SUCCESS_MODE / "outcome"
CmdRouter->>V3: envOverrides.EVAL_SUCCESS_MODE = successMode
Inline review comment on packages/evals/tui/commands/verify.ts, line 89:

P2: Missing validation when `--model` or `--label` is passed without a following value. If either is the last argument, `args[++i]` is `undefined` and the flag is silently ignored rather than producing an error.

<file context>
@@ -0,0 +1,238 @@
+ parsed.json = true;
+ } else if (a === "--dry-run") {
+ parsed.dryRun = true;
+ } else if (a === "--model") {
+ parsed.model = args[++i];
+ } else if (a === "--label") {
</file context>
Why
Verifier iteration should not require rerunning browser automation. This PR adds offline saved-trajectory rescoring so prompts, approaches, and scoring behavior can be compared against the same trajectory artifacts without changing the existing CLI architecture.
What Changed
- `evals verify <trajectory-dir>` adds offline trajectory rescoring through the existing command-tree dispatch.
- Rubric cache reads now validate `taskId` and the instruction hash before returning cached data.

Tests

- `pnpm --filter @browserbasehq/stagehand run typecheck`
- `pnpm --filter @browserbasehq/stagehand-evals run typecheck`
- `pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts`
- `pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent`
- `git diff --check`