[STG-2457] fix(evals): repair claude_code browse harness + rubric-score via V3Evaluator by shrey150 · Pull Request #2299 · browserbase/stagehand

shrey150 · 2026-07-01T22:21:31Z

Summary

Two fixes to the claude_code / browse eval harness in packages/evals.

1. Contract fix — browse CLI v0.9.1

The harness drove browse with a stale contract (browse --json … env local). browse CLI v0.9.1 dropped the env subcommand and the global --json flag, so every browse invocation in the harness was broken. Fixed to:

- Per-command mode selection: --local / --remote instead of the removed env subcommand.
- --session <name> on every command to pin the eval's daemon.
- Rely on the CLI's JSON-by-default output (no --json).

The mode flag is only appended to the driver commands that accept it (skipped for stop / status) and is explicit so a set BROWSERBASE_API_KEY cannot silently auto-select remote when we asked for local.

Files: packages/evals/framework/claudeCodeToolAdapter.ts, packages/evals/core/tools/browse_cli.ts.
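For reviewers who want the new contract in one place, here is a minimal sketch of the flag assembly described above. It is not the code in claudeCodeToolAdapter.ts: `buildBrowseArgs`, `MODELESS_COMMANDS`, and the option names are hypothetical, and the flag placement before the subcommand is an assumption.

```typescript
type BrowseMode = "local" | "remote";

// Commands that do not accept a mode flag, per the description above.
const MODELESS_COMMANDS = new Set(["stop", "status"]);

// Hypothetical helper: assembles the argv for a single browse invocation.
function buildBrowseArgs(
  command: string,
  commandArgs: string[],
  options: { mode: BrowseMode; session: string },
): string[] {
  const args: string[] = [];
  // The mode is always explicit for driver commands, so a set
  // BROWSERBASE_API_KEY cannot auto-select remote behind our back.
  if (!MODELESS_COMMANDS.has(command)) {
    args.push(`--${options.mode}`);
  }
  // Every command is pinned to the eval's session.
  args.push("--session", options.session);
  // No --json: the CLI emits JSON by default.
  args.push(command, ...commandArgs);
  return args;
}
```

Under these assumptions, `buildBrowseArgs("stop", [], { mode: "local", session: "s" })` carries only the session flag, while driver commands such as `goto` also carry `--local` or `--remote`.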

2. Verifier wiring — V3Evaluator rubric scoring

The claude_code path scored solely off the agent's self-reported EVAL_RESULT line — the agent grading its own homework. The V3Evaluator rubric verifier already existed inside claudeCodeRunner, but no caller ever constructed or passed a ClaudeCodeVerifierConfig (unfinished migration from #2137).

Now benchHarness builds that config and threads it through:

- A browser-free V3 (disableAPI: true) used purely as the LLM-client carrier for V3Evaluator. This mirrors how evals verify constructs its carrier; it is never init()-ed.
- Judge model defaults to google/gemini-2.5-flash (V3Evaluator's own tuned default), overridable via EVAL_CLAUDE_CODE_VERIFIER_MODEL.
- Rubric is taken from the row's precomputed_rubric when present, otherwise generated + cached per dataset.
- Default ON; disable with EVAL_CLAUDE_CODE_VERIFIER=0/false/off to fall back to the legacy self-report path.
- externalHarnessPlan threads precomputed_rubric / expectedAnswer into the TaskSpec.
- claudeCodeRunner gains judge-model + judge-key plumbing. This fixes a latent bug where a non-default (e.g. Anthropic) judge received the Gemini key as its credential (invalid x-api-key).

Judge-model caveat: small models (e.g. anthropic/claude-haiku-4-5) intermittently fail the fused structured-output judgment ("response did not match schema"), which the verifier surfaces as evidenceInsufficient → a spurious outcome=false. That is why the default judge is gemini-2.5-flash, not the agent model.
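The two environment switches above can be sketched as follows. This is an illustration rather than the benchHarness code: `isVerifierEnabled`, `resolveJudgeModel`, and the `Env` type are hypothetical names, and the trimming and case-insensitive matching are assumptions.

```typescript
type Env = Record<string, string | undefined>;

const DEFAULT_JUDGE_MODEL = "google/gemini-2.5-flash";
const DISABLE_VALUES = new Set(["0", "false", "off"]);

// Default ON: only an explicit 0/false/off falls back to legacy self-report.
function isVerifierEnabled(env: Env): boolean {
  const raw = env.EVAL_CLAUDE_CODE_VERIFIER;
  if (raw === undefined) return true;
  // Assumption: values are compared after trimming and lower-casing.
  return !DISABLE_VALUES.has(raw.trim().toLowerCase());
}

// The judge model comes from the override when set, else the default.
function resolveJudgeModel(env: Env): string {
  const override = env.EVAL_CLAUDE_CODE_VERIFIER_MODEL?.trim();
  return override ? override : DEFAULT_JUDGE_MODEL;
}
```

Passing the environment in as a plain record, instead of reading process.env directly, keeps both functions easy to unit test.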

No changeset: @browserbasehq/stagehand-evals is private: true (never published), so it does not trigger a release.

E2E Test Matrix

| Command / flow | Observed output | Confidence / sufficiency |
| --- | --- | --- |
| `pnpm --dir packages/evals build` + `typecheck` | build `Done` (esm + cli), `tsc --noEmit` exit 0, no errors | Proves the 5 changed files compile and typecheck against current core. |
| `pnpm --dir packages/evals test:unit` (`vitest run`) | 345 passed across 46 test files, 0 failed | Full unit suite green on this branch. |
| TRUE-POSITIVE: claude_code harness, `--env local`, agent `anthropic/claude-haiku-4-5`, default judge (`gemini-2.5-flash`), `EVAL_WEBTAILBENCH_IDS=selfdoc_deadlink` (mock harness, achievable task) | verifier `result: outcome=true process=1.00`, `evidenceInsufficient: []` → run passed (pass rate 100%) | Proves the wired verifier scores an achievable task as success against ground-truth rubric, not the agent's self-report. |
| TRUE-NEGATIVE: same harness/agent/judge, gate reset, `EVAL_WEBTAILBENCH_IDS=selfdoc_gated EVAL_CLAUDE_CODE_MAX_TURNS=3` (task cannot complete in the turn budget) | verifier `result: outcome=false process=0.25`, `evidenceInsufficient: []` → run failed (pass rate 0%) | Proves the verifier discriminates: a task the agent cannot finish scores low process + `outcome=false`, even though the legacy self-report path would be trivially gameable. |

Discrimination evidence was read from the persisted scores/result.json (via VERIFIER_PERSIST_TRAJECTORIES=1) rather than TUI stdout, so the values above are the exact EvaluationResult fields. The mock harness + selfdoc_* rows were throwaway (not committed); the committed diff is exactly the 5 infra files.

Broad browserbase-corpus validation (WebTailBench / WebVoyager at scale) follows in the A/B run.

Linear

STG-2457 — https://linear.app/browserbase/issue/STG-2457/repair-claude-code-browse-eval-harness-rubric-score-via-v3evaluator

Summary by cubic

Repairs the claude_code browse eval harness by updating to the new browse CLI and enabling rubric-based scoring via V3Evaluator, addressing STG-2457. Restores local/remote runs and provides reliable pass/fail judgments independent of agent self-report, including correct credentials for gateway/ judges.

Bug Fixes
- Update browse CLI integration to v0.9.1: per-command --local/--remote + --session, JSON-by-default, and skip mode for stop/status; require explicit mode to prevent accidental remote when BROWSERBASE_API_KEY is set.
- Judge credentials: route non-default judges to their own key and resolve gateway/* via AI_GATEWAY_API_KEY; fail fast when an override judge needs a key and it’s missing; exempt keyless providers (e.g., ollama, bedrock); build config inside try/finally so the tool adapter always cleans up.
- Sanitize fallback verifier TaskSpec.id to avoid path issues in persisted trajectories.
- Add verifier-config tests covering keyless overrides, gateway/* resolution, missing-key fail-fast, and cleanup on errors.
New Features
- Enable rubric-based scoring for claude_code via V3Evaluator using a browser-free V3 carrier from @browserbasehq/stagehand.
- Default judge is google/gemini-2.5-flash; override with EVAL_CLAUDE_CODE_VERIFIER_MODEL. Disable verifier with EVAL_CLAUDE_CODE_VERIFIER=0/false/off.
- Thread dataset precomputed_rubric and expectedAnswer into the TaskSpec; otherwise generate and cache rubrics per dataset.

Written for commit 821d497. Summary will update on new commits.

…aluator

Two fixes to the claude_code/browse eval harness:

Contract fix: the browse harness drove the CLI with a stale contract (`browse --json ... env local`). browse CLI v0.9.1 dropped the `env` subcommand and the global `--json` flag. Switch to per-command `--local`/`--remote` mode selection plus `--session`, and rely on the CLI's JSON-by-default output. The mode flag is only passed to the driver commands that accept it (skipped for `stop`/`status`) and is explicit so a set BROWSERBASE_API_KEY cannot silently auto-select remote. (claudeCodeToolAdapter.ts, browse_cli.ts)

Verifier wiring: the claude_code path scored solely off the agent's self-reported EVAL_RESULT line. The V3Evaluator rubric verifier already existed in claudeCodeRunner but no caller ever constructed or passed a ClaudeCodeVerifierConfig (unfinished migration from #2137). benchHarness now builds that config -- a browser-free V3 (disableAPI) as the LLM-client carrier for V3Evaluator, judge model defaulting to google/gemini-2.5-flash, rubric taken from the row's precomputed_rubric or generated + cached -- and threads it into runClaudeCodeAgent. Default ON; disable with EVAL_CLAUDE_CODE_VERIFIER=0/false/off to fall back to self-report. externalHarnessPlan threads precomputed_rubric/expectedAnswer into the TaskSpec, and claudeCodeRunner gains judge-model + judge-key plumbing so a non-default (e.g. Anthropic) judge receives its own provider credential instead of the Gemini key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

changeset-bot · 2026-07-01T22:21:39Z

⚠️ No Changeset found

Latest commit: 821d497

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


cubic-dev-ai

3 issues found across 5 files

Confidence score: 3/5

In packages/evals/framework/claudeCodeRunner.ts, V3Evaluator can silently fall back to legacy self-report when judgeModel is overridden without matching judgeClientOptions, which can invalidate verifier-based results while appearing successful — require explicit client-option/model validation (or fail fast) before merging.
In packages/evals/framework/benchHarness.ts, deriving verifier task IDs from raw instruction text can create unstable or unsafe trajectory paths when instructions include path-like characters, risking misplaced outputs or run failures — sanitize and normalize the fallback ID segment before assigning taskSpec.id.
In packages/evals/framework/claudeCodeToolAdapter.ts, the new browse wrapper has untested command-specific --local/--remote flag assembly, so a regression here could silently break harness startup/dispatch for affected subcommands — add targeted tests for each browse subcommand/flag combination before merging.

Architecture diagram

sequenceDiagram
    participant H as benchHarness
    participant P as externalHarnessPlan
    participant TA as claudeCodeToolAdapter
    participant BCLI as browse CLI v0.9.1
    participant CR as claudeCodeRunner
    participant VE as V3Evaluator
    participant LLM as Judge Model (Gemini)
    participant RC as RubricCache

    Note over H,RC: Claude Code Eval Flow with Verifier

    H->>P: buildExternalHarnessTaskPlan(input)
    P-->>H: plan (includes precomputedRubric, expectedAnswer)

    H->>H: buildClaudeCodeVerifierConfig(plan, logger)
    Note over H: Check EVAL_CLAUDE_CODE_VERIFIER env
    alt Verifier enabled (default)
        H->>H: Resolve judge model & API key
        H->>H: Create V3 carrier (disableAPI: true)
        H->>H: Build taskSpec with rubric/answer
        H-->>CR: verifier config (v3, taskSpec, judgeModel, judgeClientOptions)
    else Verifier disabled
        H-->>CR: verifier undefined
    end

    H->>CR: runClaudeCodeAgent(plan, model, verifier, ...)

    CR->>TA: prepareBrowseCliHarnessAdapter(input)
    Note over TA: Build wrapper script with new CLI contract
    TA->>TA: Write wrapper bash script
    Note over TA: Uses --local/--remote, --session, no --json
    TA-->>CR: tool surface, adapter cleanup

    CR->>CR: Spawn agent process (Claude Code)
    Note over CR: Agent uses browse via wrapper script

    CR->>BCLI: browse (via wrapper) with new args
    Note over BCLI: e.g., browse --local --session eval-sess-123 goto url

    alt Verifier config present
        CR->>VE: V3Evaluator(v3, { backend:"verifier", modelName, modelClientOptions })
        VE->>VE: Determine rubric
        alt precomputedRubric provided
            VE->>VE: Use taskSpec.precomputedRubric
        else generate rubric
            VE->>RC: getOrGenerateRubric(dataset, instruction, judgeModel)
            RC->>LLM: Generate rubric
            LLM-->>RC: rubric
            RC-->>VE: rubric
        end
        VE->>LLM: Judge trajectory using rubric
        LLM-->>VE: EvaluationResult (outcome, process, evidence)
        VE-->>CR: result
        CR-->>H: TaskResult with verifier score
    else No verifier (legacy)
        CR->>CR: Parse agent self-reported EVAL_RESULT
        CR-->>H: TaskResult with self-report score
    end


…ved judge key

Address cubic review on #2299:

- Sanitize the instruction-derived fallback segment of the verifier TaskSpec id (replace non [A-Za-z0-9_-] with _) so it can't inject `/` or `..` into the persisted trajectory directory path.
- Move the judge-key check ahead of the try/catch and throw a clear config error when EVAL_CLAUDE_CODE_VERIFIER_MODEL is set but its provider key can't be resolved, instead of silently downgrading the run to legacy self-report. The built-in gemini default stays graceful.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
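The sanitization rule in this commit message is small enough to show directly. A minimal sketch, assuming a hypothetical helper name `sanitizeIdSegment`; only the character class comes from the commit message.

```typescript
// Replace every character outside [A-Za-z0-9_-] with "_", so the segment
// cannot contribute "/" or ".." to a persisted trajectory directory path.
function sanitizeIdSegment(segment: string): string {
  return segment.replace(/[^A-Za-z0-9_-]/g, "_");
}
```

Because dots and slashes are both outside the allowed class, a path-like instruction collapses to underscores and stays inside one directory name.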

cubic-dev-ai

2 issues found across 1 file (changes from recent commits).


…ier throw

Address two Cubic findings on the claude_code verifier judge-key guard:

1. Exempt API-keyless providers. loadApiKeyFromEnv returns undefined for keyless providers (ollama and bedrock, which are absent from providerEnvVarMap) by design, but the fail-fast guard treated that as a config error and rejected them. Only throw when the judge provider genuinely requires a key (present in providerEnvVarMap) and it is missing; keyless judges now proceed with no explicit apiKey. Key-requiring providers with a missing key still fail fast, keeping the silent-Gemini-key bug fixed.
2. No tool-adapter leak on verifier throw. buildClaudeCodeVerifierConfig was called before the try/finally that owns the prepared tool adapter, so a fail-fast throw skipped toolAdapter.cleanup(). Moved the call inside the try so cleanup runs on a verifier-config throw. Fail-fast behavior is preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
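The second finding is an ordering issue, which the sketch below tries to make concrete. It is an assumption-laden illustration, not the harness code: `runWithAdapter` and the `ToolAdapter` interface are hypothetical, and everything is synchronous for brevity where the real harness is presumably async.

```typescript
interface ToolAdapter {
  cleanup(): void;
}

// Hypothetical wrapper: the verifier config is built inside the try block,
// so a fail-fast throw still reaches the finally that owns the adapter.
function runWithAdapter<T>(
  prepareAdapter: () => ToolAdapter,
  buildVerifierConfig: () => unknown,
  runAgent: (adapter: ToolAdapter, verifier: unknown) => T,
): T {
  const adapter = prepareAdapter();
  try {
    const verifier = buildVerifierConfig(); // may throw a config error
    return runAgent(adapter, verifier);
  } finally {
    adapter.cleanup(); // runs on success, agent failure, and config throw
  }
}
```

Building the config before the `try` would reintroduce the leak: the adapter would already exist, but nothing would own its cleanup when the config throws.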

cubic-dev-ai

2 issues found across 1 file (changes from recent commits).


Address cubic review nits on the claude_code rubric verifier config.

FIX 1 (gateway judge credential): the keyless-provider exemption treated any provider absent from the SDK's providerEnvVarMap as keyless. But `gateway/...` (Vercel AI Gateway) is not in the map yet needs AI_GATEWAY_API_KEY, so a `gateway/` judge override would silently proceed without its credential and downgrade the verifier to self-report. Add resolveJudgeApiKey (maps `gateway` → AI_GATEWAY_API_KEY, else loadApiKeyFromEnv) and judgeProviderRequiresKey (true for providerEnvVarMap entries plus `gateway`) so a gateway judge resolves its key and still fail-fasts when it is missing; ollama/bedrock and the default gemini judge stay exempt.

FIX 2 (regression tests): add packages/evals/tests/framework/verifierConfig.test.ts covering (a) a keyless override (ollama) builds a config without an apiKey, (b) an anthropic override with the key unset throws the config error while toolAdapter.cleanup() still runs (fail-fast inside try/finally, via claudeCodeHarness.execute), and (c) a gateway/ override resolves AI_GATEWAY_API_KEY (plus a missing-gateway-key fail-fast case). Exported buildClaudeCodeVerifierConfig for direct unit testing.

Unit tests: 349 pass (was 345, +4 new). Build + typecheck + lint clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
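The key-resolution rules from FIX 1 can be sketched as below for override judges. The function names `resolveJudgeApiKey` and `judgeProviderRequiresKey` come from the commit message, but the bodies are assumptions: the map here is an illustrative stand-in for the SDK's providerEnvVarMap (its entries are guesses), a plain map lookup stands in for loadApiKeyFromEnv, and `requireJudgeApiKey` and `providerOf` are hypothetical helpers.

```typescript
type EnvVars = Record<string, string | undefined>;

// Illustrative stand-in for the SDK's providerEnvVarMap (entries assumed).
const providerEnvVarMap = new Map<string, string>([
  ["anthropic", "ANTHROPIC_API_KEY"],
  ["openai", "OPENAI_API_KEY"],
]);

// "anthropic/claude-haiku-4-5" -> "anthropic"
function providerOf(model: string): string {
  const slash = model.indexOf("/");
  return slash === -1 ? model : model.slice(0, slash);
}

// True for mapped providers plus gateway; ollama/bedrock stay keyless.
function judgeProviderRequiresKey(model: string): boolean {
  const provider = providerOf(model);
  return provider === "gateway" || providerEnvVarMap.has(provider);
}

// gateway resolves AI_GATEWAY_API_KEY; other providers use the map.
function resolveJudgeApiKey(model: string, env: EnvVars): string | undefined {
  const provider = providerOf(model);
  const envVar =
    provider === "gateway" ? "AI_GATEWAY_API_KEY" : providerEnvVarMap.get(provider);
  return envVar ? env[envVar] : undefined;
}

// Hypothetical fail-fast guard for an override judge model.
function requireJudgeApiKey(model: string, env: EnvVars): string | undefined {
  const key = resolveJudgeApiKey(model, env);
  if (!key && judgeProviderRequiresKey(model)) {
    throw new Error(`Judge model "${model}" needs an API key, but none is set`);
  }
  return key; // undefined is acceptable for keyless providers
}
```

The sketch does not model the graceful path for the built-in gemini default, which the commit messages describe as exempt from the fail-fast guard.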

shrey150 · 2026-07-02T00:07:17Z

No changeset needed: this change is scoped entirely to @browserbasehq/stagehand-evals, which is a private package ("private": true) and not published, so there is no release/version bump to trigger. Both cubic nits addressed in 821d497 (gateway judge credential resolution + focused verifier-config regression tests; unit suite 345 → 349).


shrey150 wants to merge 4 commits into main from shrey/evals-claude-code-verifier
