
feat(evals): wire WebTailBench through verifier #2135

Open

miguelg719 wants to merge 5 commits into miguelgonzalez/verifier-06-offline-cli from miguelgonzalez/verifier-07-evals-adapter

Conversation

miguelg719 (Collaborator) commented May 15, 2026

Why

The eval harness needs one reusable path for running an agent, recording its trajectory, and scoring it with the verifier. This PR applies that path to WebTailBench first as the narrowest dataset migration.

What Changed

  • Added runWithVerifier for rubric resolution, trajectory recording, verifier scoring, and result persistence.
  • Migrated the WebTailBench task path to verifier-backed scoring.
  • Added validated success-mode conversion from verifier verdicts; invalid or missing values fall back to outcome.
  • Added focused tests for success-mode validation.
  • Added a focused WebTailBench verifier smoke script.
  • Kept verifier usage explicit with backend: "verifier".
  • Removed upstream verifier references from comments.

Tests

  • pnpm --filter @browserbasehq/stagehand run typecheck
  • pnpm --filter @browserbasehq/stagehand-evals run typecheck
  • pnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts
  • git diff --check


changeset-bot Bot commented May 15, 2026

⚠️ No Changeset found

Latest commit: 47dc1d5

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

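For reference, a changeset is a small markdown file under `.changeset/` whose frontmatter names the packages to bump. A minimal one for this PR might look like the following (the file name, bump level, and summary text are illustrative; the package name is taken from the test commands above):

```md
---
"@browserbasehq/stagehand-evals": patch
---

Wire WebTailBench through the verifier-backed scoring path.
```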


@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 4 files

Confidence score: 3/5

  • There is moderate merge risk because both findings are medium severity (6/10) with solid confidence, and each can affect correctness of eval outcomes or debugging behavior.
  • In packages/evals/framework/verifierAdapter.ts, a rejection from recorder.finish() inside the catch path can mask the original agent failure, making root-cause diagnosis harder and potentially changing observed failure behavior.
  • Pay close attention to packages/evals/framework/verifierAdapter.ts and packages/evals/tasks/bench/agent/webtailbench.ts: preserve original thrown errors during persistence-failure handling, and validate EVAL_SUCCESS_MODE so invalid values cannot produce an undefined success mapping.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/framework/verifierAdapter.ts">

<violation number="1" location="packages/evals/framework/verifierAdapter.ts:110">
P2: If `recorder.finish()` rejects inside the catch block, the original agent error is lost. Wrap the persistence call in its own try/catch so the original error is always rethrown.</violation>
</file>

<file name="packages/evals/tasks/bench/agent/webtailbench.ts">

<violation number="1" location="packages/evals/tasks/bench/agent/webtailbench.ts:77">
P2: Unvalidated env var cast can silently corrupt eval results. If `EVAL_SUCCESS_MODE` is set to a typo (e.g., `"outcomes"`), `verdictToSuccess` falls through its switch without matching, returns `undefined`, and all evals report `_success: false`. Validate or constrain the value before passing it.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant Bench as Bench Task<br/>(webtailbench.ts)
    participant Adapter as runWithVerifier<br/>(verifierAdapter.ts)
    participant Cache as RubricCache<br/>(rubricCache.ts)
    participant Recorder as TrajectoryRecorder
    participant Evaluator as V3Evaluator<br/>(verifier backend)
    participant Agent as Agent<br/>(agent.execute())

    Note over Bench,Agent: WebTailBench scoring pipeline (Wave 1 MVP)

    Bench->>Adapter: runWithVerifier({v3, agent, taskSpec, dataset:"webtailbench"})

    Note over Adapter,Cache: Resolve Rubric

    Adapter->>Adapter: Check precomputedRubric on taskSpec
    alt precomputed rubric provided (e.g., upstream WebTailBench)
        Adapter->>Adapter: Use precomputed rubric directly
    else VERIFIER_DISABLE_RUBRIC_CACHE=1
        Adapter->>Evaluator: generateRubric(taskSpec)
        Evaluator-->>Adapter: Generated rubric
    else cache available
        Adapter->>Cache: getOrGenerate(taskSpec, evaluator)
        Cache-->>Adapter: Cached or generated rubric
    end
    Adapter->>Adapter: Build hydratedTaskSpec with resolved rubric

    Note over Adapter,Agent: Record Trajectory

    Adapter->>Recorder: start()
    Recorder->>Recorder: Subscribe to bus events
    Adapter->>Agent: execute({instruction, maxSteps})
    Agent-->>Adapter: AgentResult
    alt Agent throws error
        Agent-->>Adapter: Error
        Adapter->>Recorder: finish({status:"error"})
        Recorder-->>Adapter: Partial trajectory
        Adapter->>Adapter: Attach trajectoryDir to error
        Adapter-->>Bench: Re-throw error
    end
    Adapter->>Recorder: finish({status:"complete", finalAnswer, usage})
    Recorder-->>Adapter: Recorded Trajectory

    Note over Adapter,Evaluator: Verify & Persist

    Adapter->>Evaluator: verify(trajectory, hydratedTaskSpec)
    Evaluator-->>Adapter: Verdict (outcome + process scores)
    Adapter->>Recorder: persistVerdict(verdict)
    Adapter-->>Bench: {trajectory, verdict, rubric, trajectoryDir}

    Note over Bench: Convert Verdict to Success Flag

    Bench->>Bench: verdictToSuccess(verdict, EVAL_SUCCESS_MODE)
    Note right of Bench: mode: outcome | process | both<br/>default = outcome
    Bench-->>Bench: Return eval result with _success flag

    Note over Bench: Cleanup on Error Path

    alt Error in Adapter
        Bench-->>Bench: Catch error with trajectoryDir
        Bench-->>Bench: Return {_success: false, trajectoryDir}
    end

    Note over Bench,Evaluator: Persistence (env-gated)
    opt VERIFIER_PERSIST_TRAJECTORIES set
        Recorder->>Recorder: Write trajectory JSON to disk
        Recorder->>Recorder: Write verdict JSON to disk
    end
    opt rubric caching active
        Cache->>Cache: Cache rubric to .rubric-cache/webtailbench/<task-id>.json
    end
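The pipeline in the diagram can be sketched as a single adapter function. Every interface and name below is a hypothetical stand-in for the PR's actual types (Rubric, Trajectory, Verdict, and the evaluator/recorder shapes are assumptions), shown only to make the resolve → record → verify flow concrete:

```typescript
// Illustrative sketch of the runWithVerifier flow shown in the diagram.
// All interfaces here are hypothetical stand-ins, not the PR's exact types.
interface Rubric {
  criteria: string[];
}
interface Trajectory {
  steps: number;
}
interface Verdict {
  outcomeSuccess: boolean;
}

interface Evaluator {
  generateRubric(task: string): Promise<Rubric>;
  verify(trajectory: Trajectory, rubric: Rubric): Promise<Verdict>;
}

interface Recorder {
  start(): void;
  finish(opts: { status: "complete" | "error" }): Promise<Trajectory>;
}

async function runWithVerifier(opts: {
  task: string;
  evaluator: Evaluator;
  recorder: Recorder;
  execute: () => Promise<void>;
  precomputedRubric?: Rubric;
}): Promise<{ trajectory: Trajectory; verdict: Verdict; rubric: Rubric }> {
  // 1. Resolve the rubric: prefer a precomputed one, else generate it.
  const rubric =
    opts.precomputedRubric ?? (await opts.evaluator.generateRubric(opts.task));

  // 2. Record the trajectory around agent execution.
  opts.recorder.start();
  try {
    await opts.execute();
  } catch (e) {
    // Persist a partial trajectory, but never let a persistence failure
    // mask the original agent error.
    await opts.recorder.finish({ status: "error" }).catch(() => undefined);
    throw e;
  }
  const trajectory = await opts.recorder.finish({ status: "complete" });

  // 3. Score the recorded trajectory against the resolved rubric.
  const verdict = await opts.evaluator.verify(trajectory, rubric);
  return { trajectory, verdict, rubric };
}
```

The rubric-cache and env-gated persistence branches from the diagram are omitted here to keep the control flow visible.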


  });
} catch (e) {
  recorderStatus = "error";
  const trajectory = await recorder.finish({ status: recorderStatus });

@cubic-dev-ai cubic-dev-ai Bot May 15, 2026


P2: If recorder.finish() rejects inside the catch block, the original agent error is lost. Wrap the persistence call in its own try/catch so the original error is always rethrown.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/framework/verifierAdapter.ts, line 110:

<comment>If `recorder.finish()` rejects inside the catch block, the original agent error is lost. Wrap the persistence call in its own try/catch so the original error is always rethrown.</comment>

<file context>
@@ -0,0 +1,160 @@
+    });
+  } catch (e) {
+    recorderStatus = "error";
+    const trajectory = await recorder.finish({ status: recorderStatus });
+    // Re-throw after persisting so the bench task can decide how to report.
+    const wrapped = e instanceof Error ? e : new Error(String(e));
</file context>
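A minimal sketch of the fix the reviewer suggests, assuming the variable names from the snippet above (`recorder`, the error-status finish call); the helper name and Recorder shape are illustrative, not the PR's final code:

```typescript
// Hypothetical fix sketch: persist the partial trajectory best-effort,
// but always rethrow the original agent error so root-cause diagnosis
// is not masked by a persistence failure.
interface Recorder {
  finish(opts: { status: string }): Promise<{ dir?: string }>;
}

async function finishSafely(
  recorder: Recorder,
  originalError: unknown,
): Promise<never> {
  try {
    // Best-effort persistence of the partial trajectory.
    await recorder.finish({ status: "error" });
  } catch (persistError) {
    // Log (or swallow) the persistence failure; it must not replace
    // the original error.
    console.warn("trajectory persistence failed:", persistError);
  }
  // Always rethrow the original error.
  throw originalError instanceof Error
    ? originalError
    : new Error(String(originalError));
}
```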

    }),
  );
}
const successMode =

@cubic-dev-ai cubic-dev-ai Bot May 15, 2026


P2: Unvalidated env var cast can silently corrupt eval results. If EVAL_SUCCESS_MODE is set to a typo (e.g., "outcomes"), verdictToSuccess falls through its switch without matching, returns undefined, and all evals report _success: false. Validate or constrain the value before passing it.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/tasks/bench/agent/webtailbench.ts, line 77:

<comment>Unvalidated env var cast can silently corrupt eval results. If `EVAL_SUCCESS_MODE` is set to a typo (e.g., `"outcomes"`), `verdictToSuccess` falls through its switch without matching, returns `undefined`, and all evals report `_success: false`. Validate or constrain the value before passing it.</comment>

<file context>
@@ -41,70 +56,62 @@ export default defineBenchTask(
-          }),
-        );
-      }
+      const successMode =
+        (process.env.EVAL_SUCCESS_MODE as "outcome" | "process" | "both") ||
+        "outcome";
</file context>
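A hedged sketch of the validation the reviewer asks for; the helper name is illustrative, and the fall-back-with-warning policy is chosen to match the PR description ("invalid or missing values fall back to outcome") rather than the reviewer's stricter option of rejecting outright:

```typescript
// Illustrative EVAL_SUCCESS_MODE validation sketch; not the PR's exact code.
const SUCCESS_MODES = ["outcome", "process", "both"] as const;
type SuccessMode = (typeof SUCCESS_MODES)[number];

function parseSuccessMode(raw: string | undefined): SuccessMode {
  if (raw && (SUCCESS_MODES as readonly string[]).includes(raw)) {
    return raw as SuccessMode;
  }
  if (raw) {
    // A typo like "outcomes" is surfaced instead of silently producing an
    // undefined success mapping (which scored every eval as _success: false).
    console.warn(`Invalid EVAL_SUCCESS_MODE "${raw}"; falling back to "outcome"`);
  }
  return "outcome";
}
```

The unvalidated cast in the diff (`process.env.EVAL_SUCCESS_MODE as "outcome" | "process" | "both"`) would then be replaced by `parseSuccessMode(process.env.EVAL_SUCCESS_MODE)`.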

miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from d4c4260 to 4923ce6 on May 15, 2026 21:23
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch 2 times, most recently from 105b13c to f7df0cf on May 15, 2026 21:45
miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch 2 times, most recently from dcc5bfc to d736522 on May 15, 2026 22:33
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch from f7df0cf to a476ada on May 15, 2026 22:33
miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from d736522 to cd1f8f4 on May 15, 2026 23:27
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch 2 times, most recently from 8ec81fd to cce3a9b on May 16, 2026 04:40
miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from cd1f8f4 to 95ada04 on May 16, 2026 04:40
miguelg719 force-pushed the miguelgonzalez/verifier-06-offline-cli branch from 95ada04 to 4f141e7 on May 16, 2026 05:50
miguelg719 force-pushed the miguelgonzalez/verifier-07-evals-adapter branch from cce3a9b to 47dc1d5 on May 16, 2026 05:50