
[STG-2457] fix(evals): repair claude_code browse harness + rubric-score via V3Evaluator#2299

Open
shrey150 wants to merge 4 commits into main from shrey/evals-claude-code-verifier
@shrey150 shrey150 commented Jul 1, 2026

Summary

Two fixes to the claude_code / browse eval harness in packages/evals.

1. Contract fix — browse CLI v0.9.1

The harness drove browse with a stale contract (browse --json … env local). browse CLI v0.9.1 dropped the env subcommand and the global --json flag, so every browse invocation in the harness was broken. Fixed to:

  • Per-command mode selection: --local / --remote instead of the removed env subcommand.
  • --session <name> on every command to pin the eval's daemon.
  • Rely on the CLI's JSON-by-default output (no --json).
  • The mode flag is only appended to the driver commands that accept it (skipped for stop / status) and is explicit so a set BROWSERBASE_API_KEY cannot silently auto-select remote when we asked for local.

Files: packages/evals/framework/claudeCodeToolAdapter.ts, packages/evals/core/tools/browse_cli.ts.
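
The new contract can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `buildBrowseArgs`, `BrowseMode`, and `MODELESS_COMMANDS` are hypothetical names, and placing the flags before the subcommand is an assumption based on the example invocation `browse --local --session eval-sess-123 goto url`. Only the flags themselves and the `stop` / `status` exemption come from the description above.

```typescript
// Hypothetical helper; names and flag ordering are assumptions for illustration.
type BrowseMode = "local" | "remote";

// Commands that, per the PR description, do not accept a mode flag.
const MODELESS_COMMANDS = new Set(["stop", "status"]);

function buildBrowseArgs(
  command: string,
  commandArgs: string[],
  opts: { mode: BrowseMode; session: string },
): string[] {
  const args: string[] = [];
  // Pass the mode explicitly so a set BROWSERBASE_API_KEY cannot
  // silently auto-select remote when local was requested.
  if (!MODELESS_COMMANDS.has(command)) {
    args.push(opts.mode === "local" ? "--local" : "--remote");
  }
  // Pin every command to the eval's daemon session.
  args.push("--session", opts.session);
  // No --json flag: the CLI is described as emitting JSON by default.
  args.push(command, ...commandArgs);
  return args;
}
```

Under these assumptions, `buildBrowseArgs("stop", [], { mode: "local", session: "s" })` yields only the session flag and the subcommand, with no mode flag.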

2. Verifier wiring — V3Evaluator rubric scoring

The claude_code path scored solely off the agent's self-reported EVAL_RESULT line — the agent grading its own homework. The V3Evaluator rubric verifier already existed inside claudeCodeRunner, but no caller ever constructed or passed a ClaudeCodeVerifierConfig (unfinished migration from #2137).

Now benchHarness builds that config and threads it through:

  • A browser-free V3 (disableAPI: true) used purely as the LLM-client carrier for V3Evaluator — mirrors how evals verify constructs its carrier; it is never init()-ed.
  • Judge model defaults to google/gemini-2.5-flash (V3Evaluator's own tuned default), overridable via EVAL_CLAUDE_CODE_VERIFIER_MODEL.
  • Rubric is taken from the row's precomputed_rubric when present, otherwise generated + cached per dataset.
  • Default ON; disable with EVAL_CLAUDE_CODE_VERIFIER=0/false/off to fall back to the legacy self-report path.
  • externalHarnessPlan threads precomputed_rubric / expectedAnswer into the TaskSpec.
  • claudeCodeRunner gains judge-model + judge-key plumbing — fixes a latent bug where a non-default (e.g. Anthropic) judge received the Gemini key as its credential (invalid x-api-key).
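
For readers skimming the toggles above, here is a small sketch of how the two environment variables might be interpreted. The function names and the `Env` type are hypothetical, and the trimming and case-insensitive matching are assumptions; only the variable names, the `0/false/off` values, and the `google/gemini-2.5-flash` default come from this description.

```typescript
// Hypothetical helpers; the env is passed in explicitly so the sketch stays pure.
type Env = Record<string, string | undefined>;

const DEFAULT_JUDGE_MODEL = "google/gemini-2.5-flash";

// Verifier is on by default; only 0 / false / off disable it.
// (Case-insensitive matching is an assumption.)
function isVerifierEnabled(env: Env): boolean {
  const raw = (env.EVAL_CLAUDE_CODE_VERIFIER ?? "").trim().toLowerCase();
  return !["0", "false", "off"].includes(raw);
}

// Use the override when it is set and non-empty, otherwise the default judge.
function resolveJudgeModel(env: Env): string {
  const override = env.EVAL_CLAUDE_CODE_VERIFIER_MODEL?.trim();
  return override ? override : DEFAULT_JUDGE_MODEL;
}
```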

Judge-model caveat: small models (e.g. anthropic/claude-haiku-4-5) intermittently fail the fused structured-output judgment ("response did not match schema"), which the verifier surfaces as evidenceInsufficient → a spurious outcome=false. That is why the default judge is gemini-2.5-flash, not the agent model.

No changeset: @browserbasehq/stagehand-evals is private: true (never published), so it does not trigger a release.

E2E Test Matrix

| Command / flow | Observed output | Confidence / sufficiency |
| --- | --- | --- |
| pnpm --dir packages/evals build + typecheck | build Done (esm + cli), tsc --noEmit exit 0, no errors | Proves the 5 changed files compile and typecheck against current core. |
| pnpm --dir packages/evals test:unit (vitest run) | 345 passed across 46 test files, 0 failed | Full unit suite green on this branch. |
| TRUE-POSITIVE: claude_code harness, --env local, agent anthropic/claude-haiku-4-5, default judge (gemini-2.5-flash), EVAL_WEBTAILBENCH_IDS=selfdoc_deadlink (mock harness, achievable task) | verifier result: outcome=true process=1.00, evidenceInsufficient: [] → run passed (pass rate 100%) | Proves the wired verifier scores an achievable task as success against ground-truth rubric, not the agent's self-report. |
| TRUE-NEGATIVE: same harness/agent/judge, gate reset, EVAL_WEBTAILBENCH_IDS=selfdoc_gated EVAL_CLAUDE_CODE_MAX_TURNS=3 (task cannot complete in the turn budget) | verifier result: outcome=false process=0.25, evidenceInsufficient: [] → run failed (pass rate 0%) | Proves the verifier discriminates: a task the agent cannot finish scores low process + outcome=false, even though the legacy self-report path would be trivially gameable. |

Discrimination evidence was read from the persisted scores/result.json (via VERIFIER_PERSIST_TRAJECTORIES=1) rather than TUI stdout, so the values above are the exact EvaluationResult fields. The mock harness + selfdoc_* rows were throwaway (not committed); the committed diff is exactly the 5 infra files.

Broad browserbase-corpus validation (WebTailBench / WebVoyager at scale) follows in the A/B run.

Linear

STG-2457 — https://linear.app/browserbase/issue/STG-2457/repair-claude-code-browse-eval-harness-rubric-score-via-v3evaluator


Summary by cubic

Repairs the claude_code browse eval harness by updating to the new browse CLI and enabling rubric-based scoring via V3Evaluator, addressing STG-2457. Restores local/remote runs and provides reliable pass/fail judgments independent of agent self-report, including correct credentials for gateway/ judges.

  • Bug Fixes

    • Update browse CLI integration to v0.9.1: per-command --local/--remote + --session, JSON-by-default, and skip mode for stop/status; require explicit mode to prevent accidental remote when BROWSERBASE_API_KEY is set.
    • Judge credentials: route non-default judges to their own key and resolve gateway/* via AI_GATEWAY_API_KEY; fail fast when an override judge needs a key and it’s missing; exempt keyless providers (e.g., ollama, bedrock); build config inside try/finally so the tool adapter always cleans up.
    • Sanitize fallback verifier TaskSpec.id to avoid path issues in persisted trajectories.
    • Add verifier-config tests covering keyless overrides, gateway/* resolution, missing-key fail-fast, and cleanup on errors.
  • New Features

    • Enable rubric-based scoring for claude_code via V3Evaluator using a browser-free V3 carrier from @browserbasehq/stagehand.
    • Default judge is google/gemini-2.5-flash; override with EVAL_CLAUDE_CODE_VERIFIER_MODEL. Disable verifier with EVAL_CLAUDE_CODE_VERIFIER=0/false/off.
    • Thread dataset precomputed_rubric and expectedAnswer into the TaskSpec; otherwise generate and cache rubrics per dataset.

Written for commit 821d497. Summary will update on new commits.


…aluator

Two fixes to the claude_code/browse eval harness:

Contract fix: the browse harness drove the CLI with a stale contract
(`browse --json ... env local`). browse CLI v0.9.1 dropped the `env`
subcommand and the global `--json` flag. Switch to per-command
`--local`/`--remote` mode selection plus `--session`, and rely on the
CLI's JSON-by-default output. The mode flag is only passed to the driver
commands that accept it (skipped for `stop`/`status`) and is explicit so a
set BROWSERBASE_API_KEY cannot silently auto-select remote.
(claudeCodeToolAdapter.ts, browse_cli.ts)

Verifier wiring: the claude_code path scored solely off the agent's
self-reported EVAL_RESULT line. The V3Evaluator rubric verifier already
existed in claudeCodeRunner but no caller ever constructed or passed a
ClaudeCodeVerifierConfig (unfinished migration from #2137). benchHarness
now builds that config -- a browser-free V3 (disableAPI) as the LLM-client
carrier for V3Evaluator, judge model defaulting to google/gemini-2.5-flash,
rubric taken from the row's precomputed_rubric or generated + cached -- and
threads it into runClaudeCodeAgent. Default ON; disable with
EVAL_CLAUDE_CODE_VERIFIER=0/false/off to fall back to self-report.
externalHarnessPlan threads precomputed_rubric/expectedAnswer into the
TaskSpec, and claudeCodeRunner gains judge-model + judge-key plumbing so a
non-default (e.g. Anthropic) judge receives its own provider credential
instead of the Gemini key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

changeset-bot Bot commented Jul 1, 2026

⚠️ No Changeset found

Latest commit: 821d497

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@cubic-dev-ai cubic-dev-ai Bot left a comment

3 issues found across 5 files

Confidence score: 3/5

  • In packages/evals/framework/claudeCodeRunner.ts, V3Evaluator can silently fall back to legacy self-report when judgeModel is overridden without matching judgeClientOptions, which can invalidate verifier-based results while appearing successful — require explicit client-option/model validation (or fail fast) before merging.
  • In packages/evals/framework/benchHarness.ts, deriving verifier task IDs from raw instruction text can create unstable or unsafe trajectory paths when instructions include path-like characters, risking misplaced outputs or run failures — sanitize and normalize the fallback ID segment before assigning taskSpec.id.
  • In packages/evals/framework/claudeCodeToolAdapter.ts, the new browse wrapper has untested command-specific --local/--remote flag assembly, so a regression here could silently break harness startup/dispatch for affected subcommands — add targeted tests for each browse subcommand/flag combination before merging.
Architecture diagram
sequenceDiagram
    participant H as benchHarness
    participant P as externalHarnessPlan
    participant TA as claudeCodeToolAdapter
    participant BCLI as browse CLI v0.9.1
    participant CR as claudeCodeRunner
    participant VE as V3Evaluator
    participant LLM as Judge Model (Gemini)
    participant RC as RubricCache

    Note over H,RC: Claude Code Eval Flow with Verifier

    H->>P: buildExternalHarnessTaskPlan(input)
    P-->>H: plan (includes precomputedRubric, expectedAnswer)

    H->>H: buildClaudeCodeVerifierConfig(plan, logger)
    Note over H: Check EVAL_CLAUDE_CODE_VERIFIER env
    alt Verifier enabled (default)
        H->>H: Resolve judge model & API key
        H->>H: Create V3 carrier (disableAPI: true)
        H->>H: Build taskSpec with rubric/answer
        H-->>CR: verifier config (v3, taskSpec, judgeModel, judgeClientOptions)
    else Verifier disabled
        H-->>CR: verifier undefined
    end

    H->>CR: runClaudeCodeAgent(plan, model, verifier, ...)

    CR->>TA: prepareBrowseCliHarnessAdapter(input)
    Note over TA: Build wrapper script with new CLI contract
    TA->>TA: Write wrapper bash script
    Note over TA: Uses --local/--remote, --session, no --json
    TA-->>CR: tool surface, adapter cleanup

    CR->>CR: Spawn agent process (Claude Code)
    Note over CR: Agent uses browse via wrapper script

    CR->>BCLI: browse (via wrapper) with new args
    Note over BCLI: e.g., browse --local --session eval-sess-123 goto url

    alt Verifier config present
        CR->>VE: V3Evaluator(v3, { backend:"verifier", modelName, modelClientOptions })
        VE->>VE: Determine rubric
        alt precomputedRubric provided
            VE->>VE: Use taskSpec.precomputedRubric
        else generate rubric
            VE->>RC: getOrGenerateRubric(dataset, instruction, judgeModel)
            RC->>LLM: Generate rubric
            LLM-->>RC: rubric
            RC-->>VE: rubric
        end
        VE->>LLM: Judge trajectory using rubric
        LLM-->>VE: EvaluationResult (outcome, process, evidence)
        VE-->>CR: result
        CR-->>H: TaskResult with verifier score
    else No verifier (legacy)
        CR->>CR: Parse agent self-reported EVAL_RESULT
        CR-->>H: TaskResult with self-report score
    end
Comment thread packages/evals/framework/claudeCodeToolAdapter.ts
Comment thread packages/evals/framework/claudeCodeRunner.ts
Comment thread packages/evals/framework/benchHarness.ts Outdated
…ved judge key

Address cubic review on #2299:
- Sanitize the instruction-derived fallback segment of the verifier TaskSpec id
  (replace non [A-Za-z0-9_-] with _) so it can't inject `/` or `..` into the
  persisted trajectory directory path.
- Move the judge-key check ahead of the try/catch and throw a clear config
  error when EVAL_CLAUDE_CODE_VERIFIER_MODEL is set but its provider key can't
  be resolved, instead of silently downgrading the run to legacy self-report.
  The built-in gemini default stays graceful.
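
A minimal sketch of the sanitization rule described in the first bullet. `sanitizeIdSegment` is a hypothetical name; the character class is the one stated in this commit message.

```typescript
// Hypothetical helper: replace every character outside [A-Za-z0-9_-] with "_"
// so an instruction-derived segment cannot carry "/" or ".." into a path.
function sanitizeIdSegment(segment: string): string {
  return segment.replace(/[^A-Za-z0-9_-]/g, "_");
}
```

For instance, `sanitizeIdSegment("../etc/passwd")` returns `"___etc_passwd"`: both dots and both slashes are replaced, so the result is a single flat path segment.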

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 1 file (changes from recent commits).

Comment thread packages/evals/framework/benchHarness.ts Outdated
Comment thread packages/evals/framework/benchHarness.ts
…ier throw

Address two Cubic findings on the claude_code verifier judge-key guard:

1. Exempt API-keyless providers. loadApiKeyFromEnv returns undefined for
   keyless providers (ollama, bedrock — absent from providerEnvVarMap) by
   design, but the fail-fast guard treated that as a config error and
   rejected them. Only throw when the judge provider genuinely requires a
   key (present in providerEnvVarMap) and it is missing; keyless judges now
   proceed with no explicit apiKey. Key-requiring providers with a missing
   key still fail fast, keeping the silent-Gemini-key bug fixed.

2. No tool-adapter leak on verifier throw. buildClaudeCodeVerifierConfig
   was called before the try/finally that owns the prepared tool adapter, so
   a fail-fast throw skipped toolAdapter.cleanup(). Moved the call inside the
   try so cleanup runs on a verifier-config throw. Fail-fast behavior is
   preserved.
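
The ordering change in item 2 can be illustrated with a simplified sketch. `executeWithCleanup` and the `ToolAdapter` interface here are hypothetical, and the sketch is synchronous for brevity (the real adapter and runner may well be async); the point is only that building the verifier config inside the `try` lets `finally` run cleanup when that build throws.

```typescript
// Hypothetical, simplified stand-in for the prepared tool adapter.
interface ToolAdapter {
  cleanup(): void;
}

function executeWithCleanup<T>(
  adapter: ToolAdapter,
  buildVerifierConfig: () => unknown,
  runAgent: (verifier: unknown) => T,
): T {
  try {
    // Built inside the try: a fail-fast throw here still reaches finally.
    const verifier = buildVerifierConfig();
    return runAgent(verifier);
  } finally {
    // Runs on success and on any throw above.
    adapter.cleanup();
  }
}
```

The error itself is not swallowed: it still propagates to the caller after cleanup, which matches the stated goal of preserving fail-fast behavior.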

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 1 file (changes from recent commits).

Comment thread packages/evals/framework/benchHarness.ts Outdated
Comment thread packages/evals/framework/benchHarness.ts
Address cubic review nits on the claude_code rubric verifier config.

FIX 1 (gateway judge credential): the keyless-provider exemption treated any
provider absent from the SDK's providerEnvVarMap as keyless. But `gateway/...`
(Vercel AI Gateway) is not in the map yet needs AI_GATEWAY_API_KEY, so a
`gateway/` judge override would silently proceed without its credential and
downgrade the verifier to self-report. Add resolveJudgeApiKey (maps `gateway` →
AI_GATEWAY_API_KEY, else loadApiKeyFromEnv) and judgeProviderRequiresKey (true
for providerEnvVarMap entries plus `gateway`) so a gateway judge resolves its
key and still fail-fasts when it is missing; ollama/bedrock and the default
gemini judge stay exempt.
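
As a rough sketch of the resolution logic FIX 1 describes for an override judge: the function names `resolveJudgeApiKey` and `judgeProviderRequiresKey` come from this commit message, but their signatures and bodies below are assumptions. `PROVIDER_ENV_VARS` is an illustrative stand-in for the SDK's providerEnvVarMap with only two assumed entries, and `providerOf` and `resolveOverrideJudgeKey` are hypothetical helpers.

```typescript
type Env = Record<string, string | undefined>;

// Illustrative stand-in for the SDK's providerEnvVarMap (entries are assumptions).
const PROVIDER_ENV_VARS = new Map<string, string>([
  ["anthropic", "ANTHROPIC_API_KEY"],
  ["openai", "OPENAI_API_KEY"],
]);

// "gateway/some-model" -> "gateway"
function providerOf(model: string): string {
  const [provider = ""] = model.split("/");
  return provider;
}

// True for mapped providers plus gateway; keyless providers (e.g. ollama) are false.
function judgeProviderRequiresKey(provider: string): boolean {
  return provider === "gateway" || PROVIDER_ENV_VARS.has(provider);
}

// gateway maps to AI_GATEWAY_API_KEY; everything else goes through the map.
function resolveJudgeApiKey(provider: string, env: Env): string | undefined {
  const envVar =
    provider === "gateway" ? "AI_GATEWAY_API_KEY" : PROVIDER_ENV_VARS.get(provider);
  return envVar ? env[envVar] || undefined : undefined;
}

// Fail fast only when an override judge's provider needs a key and none is set.
function resolveOverrideJudgeKey(model: string, env: Env): string | undefined {
  const provider = providerOf(model);
  const key = resolveJudgeApiKey(provider, env);
  if (!key && judgeProviderRequiresKey(provider)) {
    throw new Error(`Judge model "${model}" requires an API key, but none was found`);
  }
  return key;
}
```

This sketch covers only the override path; the graceful handling of the default gemini judge mentioned above is not modeled here.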

FIX 2 (regression tests): add packages/evals/tests/framework/verifierConfig.test.ts
covering (a) a keyless override (ollama) builds a config without an apiKey,
(b) an anthropic override with the key unset throws the config error while
toolAdapter.cleanup() still runs (fail-fast inside try/finally, via
claudeCodeHarness.execute), and (c) a gateway/ override resolves
AI_GATEWAY_API_KEY (plus a missing-gateway-key fail-fast case). Exported
buildClaudeCodeVerifierConfig for direct unit testing.

Unit tests: 349 pass (was 345, +4 new). Build + typecheck + lint clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shrey150 commented Jul 2, 2026

No changeset needed: this change is scoped entirely to @browserbasehq/stagehand-evals, which is a private package ("private": true) and not published, so there is no release/version bump to trigger. Both cubic nits addressed in 821d497 (gateway judge credential resolution + focused verifier-config regression tests; unit suite 345 → 349).
