feat(evals): add offline verifier CLI #2134
Conversation
3 issues found across 9 files
Confidence score: 3/5
- There is some concrete regression risk here: `packages/evals/framework/rubricCache.ts` `read` does not confirm `parsed.taskId === taskSpec.id`, so sanitized ID collisions can return the wrong cached rubric data when hashes align.
- Two CLI/runtime behaviors are likely to confuse users but are straightforward to fix: `packages/evals/tui/commands/verify.ts` silently ignores a trailing `--model`/`--label` without a value, and `packages/evals/scripts/verify-live-trajectory.ts` passes `timeoutMs` to `page.goto()` (ignored), causing fallback to Playwright's default timeout.
- Given one medium-severity correctness issue plus two medium input/timeout handling issues, this looks mergeable with caution after targeted fixes rather than a hard block.
- Pay close attention to `packages/evals/framework/rubricCache.ts`, `packages/evals/tui/commands/verify.ts`, and `packages/evals/scripts/verify-live-trajectory.ts`: cache key/task ID validation, missing flag-value errors, and ignored navigation timeout options need verification.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/tui/commands/verify.ts">
<violation number="1" location="packages/evals/tui/commands/verify.ts:89">
P2: Missing validation when `--model` or `--label` is passed without a following value. If either is the last argument, `args[++i]` is `undefined` and the flag is silently ignored rather than producing an error.</violation>
</file>
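One way to address this is to reject a missing or flag-like value at parse time instead of silently storing `undefined`. This is a minimal sketch, not the PR's code: the `ParsedArgs` shape and the `parseVerifyArgs` name are assumed from the review excerpt.

```typescript
// Sketch of a stricter flag loop; field names follow the review's diff context.
interface ParsedArgs {
  json?: boolean;
  dryRun?: boolean;
  model?: string;
  label?: string;
}

function parseVerifyArgs(args: string[]): ParsedArgs {
  const parsed: ParsedArgs = {};
  for (let i = 0; i < args.length; i++) {
    const a = args[i];
    if (a === "--json") {
      parsed.json = true;
    } else if (a === "--dry-run") {
      parsed.dryRun = true;
    } else if (a === "--model" || a === "--label") {
      const value = args[++i];
      // Fail loudly when the flag is last, or when the "value" is another flag.
      if (value === undefined || value.startsWith("--")) {
        throw new Error(`${a} requires a value`);
      }
      if (a === "--model") parsed.model = value;
      else parsed.label = value;
    }
  }
  return parsed;
}
```

A built-in alternative with the same behavior is Node's `util.parseArgs` with string-typed options, which also errors on a missing value.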
<file name="packages/evals/framework/rubricCache.ts">
<violation number="1" location="packages/evals/framework/rubricCache.ts:95">
P2: The `read` method does not verify `parsed.taskId` matches `taskSpec.id`. Since `entryPath` sanitizes characters (`:`, `/`, etc.) to `_`, distinct task IDs can map to the same file. When instruction hashes also happen to match, a stale/wrong rubric is served silently. Add a `taskId` equality check alongside the `instructionHash` check.</violation>
</file>
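The suggested guard can be sketched as below. This is illustrative only, assuming a cache entry that stores `taskId` and `instructionHash` as the review describes; the surrounding types and the `readCached`/`hashInstruction` names are hypothetical.

```typescript
// Hypothetical cache-entry shape; the real rubricCache.ts types may differ.
interface CacheEntry {
  taskId: string;
  instructionHash: string;
  rubric: unknown;
}

interface TaskSpec {
  id: string;
  instruction: string;
}

function readCached(
  entry: CacheEntry | undefined,
  taskSpec: TaskSpec,
  hashInstruction: (s: string) => string,
): unknown | undefined {
  if (!entry) return undefined;
  // Guard against sanitized-ID collisions: task IDs differing only in
  // characters like ":" or "/" can map to the same entry file, so the
  // stored taskId must match exactly, not just the instruction hash.
  if (entry.taskId !== taskSpec.id) return undefined;
  if (entry.instructionHash !== hashInstruction(taskSpec.instruction)) {
    return undefined;
  }
  return entry.rubric;
}
```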
<file name="packages/evals/scripts/verify-live-trajectory.ts">
<violation number="1" location="packages/evals/scripts/verify-live-trajectory.ts:38">
P2: Playwright's `page.goto()` accepts `timeout`, not `timeoutMs`. This option is silently ignored, so the navigation falls back to the default 30s timeout instead of the intended 60s.</violation>
</file>
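The fix is a one-word option rename. The sketch below uses a minimal structural stand-in for Playwright's `Page` so it stays self-contained; the option names mirror Playwright's documented `page.goto(url, { timeout, waitUntil })` signature, where `timeout` is in milliseconds.

```typescript
// Structural stand-in for the relevant slice of Playwright's Page.goto.
interface GotoOptions {
  timeout?: number; // milliseconds; unknown keys like `timeoutMs` are ignored
  waitUntil?: "load" | "domcontentloaded" | "networkidle" | "commit";
}

interface PageLike {
  goto(url: string, options?: GotoOptions): Promise<void>;
}

async function navigateWithTimeout(page: PageLike, url: string): Promise<void> {
  // Before: page.goto(url, { timeoutMs: 60_000 }) — the unknown key was
  // silently dropped, so navigation fell back to the 30s default.
  await page.goto(url, { timeout: 60_000 });
}
```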
Architecture diagram
sequenceDiagram
participant CLI as CLI / terminal
participant CmdRouter as Command Router (cli.ts)
participant VerifyCmd as verify Command
participant Trajectory as Trajectory Dir (disk)
participant RubricCache as RubricCache
participant V3Eval as V3Evaluator (verifier backend)
participant V3 as V3 instance (headless)
participant TrajectoryRecorder as TrajectoryRecorder (live)
Note over CLI,V3: NEW: Offline verify path (red arrow) vs existing live run (blue arrows)
CLI->>CmdRouter: evals verify <trajectory-dir> [options]
CmdRouter->>VerifyCmd: handleVerify(args)
VerifyCmd->>Trajectory: read trajectory.json + task_data.json
Trajectory-->>VerifyCmd: Trajectory + TaskSpec
VerifyCmd->>V3Eval: new V3Evaluator(v3, {backend:"verifier"})
Note over VerifyCmd,V3Eval: No browser launched — V3 constructed without init()
VerifyCmd->>V3Eval: verify(trajectory, taskSpec)
V3Eval->>RubricCache: getOrGenerate(taskSpec, evaluator)
alt Cache hit (same instruction hash)
RubricCache-->>V3Eval: cached Rubric
else Cache miss or hash drift
RubricCache->>RubricCache: hashInstruction(taskSpec.instruction)
V3Eval->>V3Eval: generateRubric(taskSpec) — Step 0a
V3Eval->>RubricCache: write(taskSpec, rubric)
RubricCache-->>V3Eval: freshly generated Rubric
end
V3Eval->>V3Eval: score trajectory against rubric — Step 8
V3Eval-->>VerifyCmd: Verdict (outcomeSuccess, processScore, perCriterion)
alt --json flag
VerifyCmd->>CLI: JSON stringified Verdict to stdout
else default (human summary)
VerifyCmd->>CLI: colored summary (score, criteria, findings)
alt --dry-run not set
VerifyCmd->>Trajectory: write scores/mmrubric_<label>.json
end
end
Note over CLI,Trajectory: Live run path (unchanged, shown for context)
CLI->>CmdRouter: evals run <target>
CmdRouter->>V3: agent.execute(instruction)
V3->>TrajectoryRecorder: start() — subscribe to bus events
V3->>V3: perform browser automation steps
V3->>TrajectoryRecorder: capture step events (screenshots, URLs, evidence)
V3-->>CmdRouter: agent result
TrajectoryRecorder->>Trajectory: persist() — write trajectory.json, screenshots, task_data.json, times.json
CmdRouter->>CLI: run summary
Note over CmdRouter,V3: Success mode plumbing (--success flag)
CmdRouter->>CmdRouter: resolve successMode from --success / EVAL_SUCCESS_MODE / "outcome"
CmdRouter->>V3: envOverrides.EVAL_SUCCESS_MODE = successMode
Inline review comment on packages/evals/tui/commands/verify.ts, line 89:

P2: Missing validation when `--model` or `--label` is passed without a following value. If either is the last argument, `args[++i]` is `undefined` and the flag is silently ignored rather than producing an error.

<file context>
@@ -0,0 +1,238 @@
+ parsed.json = true;
+ } else if (a === "--dry-run") {
+ parsed.dryRun = true;
+ } else if (a === "--model") {
+ parsed.model = args[++i];
+ } else if (a === "--label") {
</file context>
Why
Verifier iteration should not require rerunning browser automation. This PR adds offline saved-trajectory rescoring so prompts, approaches, and scoring behavior can be compared against the same trajectory artifacts without changing the existing CLI architecture.
What Changed
- `evals verify <trajectory-dir>` adds offline trajectory rescoring through the existing command-tree dispatch.
- Rubric cache reads now validate `taskId` and the instruction hash before returning cached data.

Tests

- `pnpm --filter @browserbasehq/stagehand run typecheck`
- `pnpm --filter @browserbasehq/stagehand-evals run typecheck`
- `pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts`
- `pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent`
- `git diff --check`