[STG-2457] fix(evals): repair claude_code browse harness + rubric-score via V3Evaluator #2299
shrey150 wants to merge 4 commits into
…aluator

Two fixes to the claude_code/browse eval harness:

Contract fix: the browse harness drove the CLI with a stale contract (`browse --json ... env local`). browse CLI v0.9.1 dropped the `env` subcommand and the global `--json` flag. Switch to per-command `--local`/`--remote` mode selection plus `--session`, and rely on the CLI's JSON-by-default output. The mode flag is only passed to the driver commands that accept it (skipped for `stop`/`status`) and is explicit so that a set BROWSERBASE_API_KEY cannot silently auto-select remote. (claudeCodeToolAdapter.ts, browse_cli.ts)

Verifier wiring: the claude_code path scored solely off the agent's self-reported EVAL_RESULT line. The V3Evaluator rubric verifier already existed in claudeCodeRunner, but no caller ever constructed or passed a ClaudeCodeVerifierConfig (unfinished migration from #2137). benchHarness now builds that config -- a browser-free V3 (disableAPI) as the LLM-client carrier for V3Evaluator, judge model defaulting to google/gemini-2.5-flash, rubric taken from the row's precomputed_rubric or generated + cached -- and threads it into runClaudeCodeAgent. Default ON; disable with EVAL_CLAUDE_CODE_VERIFIER=0/false/off to fall back to self-report. externalHarnessPlan threads precomputed_rubric/expectedAnswer into the TaskSpec, and claudeCodeRunner gains judge-model + judge-key plumbing so that a non-default (e.g. Anthropic) judge receives its own provider credential instead of the Gemini key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
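As a rough illustration of the "Default ON; disable with EVAL_CLAUDE_CODE_VERIFIER=0/false/off" behavior, the toggle could be parsed along these lines. This is a minimal sketch, not the code in benchHarness: the helper name `isVerifierEnabled`, the `Env` type, and the case-insensitive, whitespace-trimming match are all assumptions made for the example. Only the variable name and its three off-values come from the commit message.

```typescript
// Hypothetical helper: only the env var name and its off-values are from the PR.
type Env = Record<string, string | undefined>;

const VERIFIER_OFF_VALUES = new Set(["0", "false", "off"]);

function isVerifierEnabled(env: Env): boolean {
  const raw = env["EVAL_CLAUDE_CODE_VERIFIER"];
  // Unset means enabled, because the verifier is default ON.
  if (raw === undefined) return true;
  // Assumption: matching ignores case and surrounding whitespace.
  return !VERIFIER_OFF_VALUES.has(raw.trim().toLowerCase());
}
```

In this sketch any value other than the three off-values keeps the verifier on, so a typo such as `EVAL_CLAUDE_CODE_VERIFIER=no` would not fall back to self-report.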
3 issues found across 5 files
Confidence score: 3/5
- In `packages/evals/framework/claudeCodeRunner.ts`, `V3Evaluator` can silently fall back to legacy self-report when `judgeModel` is overridden without matching `judgeClientOptions`, which can invalidate verifier-based results while appearing successful. Require explicit client-option/model validation (or fail fast) before merging.
- In `packages/evals/framework/benchHarness.ts`, deriving verifier task IDs from raw instruction text can create unstable or unsafe trajectory paths when instructions include path-like characters, risking misplaced outputs or run failures. Sanitize and normalize the fallback ID segment before assigning `taskSpec.id`.
- In `packages/evals/framework/claudeCodeToolAdapter.ts`, the new browse wrapper has untested command-specific `--local`/`--remote` flag assembly, so a regression here could silently break harness startup/dispatch for affected subcommands. Add targeted tests for each browse subcommand/flag combination before merging.
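To make the third finding concrete, the command-specific flag assembly can be modeled as a small pure function that is easy to test per subcommand. This is only a sketch: `buildBrowseArgs` and `MODELESS_COMMANDS` are hypothetical names, and the argument order is assumed from the example invocation `browse --local --session eval-sess-123 goto url` in the architecture diagram. The real wrapper is a generated bash script and may assemble arguments differently.

```typescript
type BrowseMode = "local" | "remote";

// Per the PR description, `stop` and `status` do not take a mode flag.
const MODELESS_COMMANDS = new Set(["stop", "status"]);

// Hypothetical helper: returns the argv that would follow `browse`.
function buildBrowseArgs(
  mode: BrowseMode,
  session: string,
  command: string,
  rest: string[] = [],
): string[] {
  // The mode flag is explicit for driver commands and omitted otherwise.
  const modeFlag = MODELESS_COMMANDS.has(command) ? [] : [`--${mode}`];
  return [...modeFlag, "--session", session, command, ...rest];
}
```

A targeted test would then enumerate each subcommand and mode and compare the resulting array, which is the kind of coverage the review asks for.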
Architecture diagram
sequenceDiagram
participant H as benchHarness
participant P as externalHarnessPlan
participant TA as claudeCodeToolAdapter
participant BCLI as browse CLI v0.9.1
participant CR as claudeCodeRunner
participant VE as V3Evaluator
participant LLM as Judge Model (Gemini)
participant RC as RubricCache
Note over H,RC: Claude Code Eval Flow with Verifier
H->>P: buildExternalHarnessTaskPlan(input)
P-->>H: plan (includes precomputedRubric, expectedAnswer)
H->>H: buildClaudeCodeVerifierConfig(plan, logger)
Note over H: Check EVAL_CLAUDE_CODE_VERIFIER env
alt Verifier enabled (default)
H->>H: Resolve judge model & API key
H->>H: Create V3 carrier (disableAPI: true)
H->>H: Build taskSpec with rubric/answer
H-->>CR: verifier config (v3, taskSpec, judgeModel, judgeClientOptions)
else Verifier disabled
H-->>CR: verifier undefined
end
H->>CR: runClaudeCodeAgent(plan, model, verifier, ...)
CR->>TA: prepareBrowseCliHarnessAdapter(input)
Note over TA: Build wrapper script with new CLI contract
TA->>TA: Write wrapper bash script
Note over TA: Uses --local/--remote, --session, no --json
TA-->>CR: tool surface, adapter cleanup
CR->>CR: Spawn agent process (Claude Code)
Note over CR: Agent uses browse via wrapper script
CR->>BCLI: browse (via wrapper) with new args
Note over BCLI: e.g., browse --local --session eval-sess-123 goto url
alt Verifier config present
CR->>VE: V3Evaluator(v3, { backend:"verifier", modelName, modelClientOptions })
VE->>VE: Determine rubric
alt precomputedRubric provided
VE->>VE: Use taskSpec.precomputedRubric
else generate rubric
VE->>RC: getOrGenerateRubric(dataset, instruction, judgeModel)
RC->>LLM: Generate rubric
LLM-->>RC: rubric
RC-->>VE: rubric
end
VE->>LLM: Judge trajectory using rubric
LLM-->>VE: EvaluationResult (outcome, process, evidence)
VE-->>CR: result
CR-->>H: TaskResult with verifier score
else No verifier (legacy)
CR->>CR: Parse agent self-reported EVAL_RESULT
CR-->>H: TaskResult with self-report score
end
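The rubric branch of the diagram (use the precomputed rubric if present, otherwise get-or-generate through the cache) can be sketched as follows. Everything here is illustrative: `InMemoryRubricCache`, `resolveRubric`, `TaskSpecLike`, and the cache key built from dataset, instruction, and judge model are assumptions modeled on the `getOrGenerateRubric(dataset, instruction, judgeModel)` call shown above, not the actual RubricCache implementation.

```typescript
type Rubric = string;

interface TaskSpecLike {
  dataset: string;
  instruction: string;
  precomputedRubric?: Rubric;
}

type GenerateRubric = (instruction: string, judgeModel: string) => Promise<Rubric>;

// In-memory stand-in for the RubricCache participant (hypothetical).
class InMemoryRubricCache {
  private readonly entries = new Map<string, Promise<Rubric>>();

  getOrGenerate(key: string, generate: () => Promise<Rubric>): Promise<Rubric> {
    let entry = this.entries.get(key);
    if (entry === undefined) {
      // Store the promise so concurrent callers share one generation.
      entry = generate();
      this.entries.set(key, entry);
    }
    return entry;
  }
}

async function resolveRubric(
  task: TaskSpecLike,
  judgeModel: string,
  cache: InMemoryRubricCache,
  generate: GenerateRubric,
): Promise<Rubric> {
  // A precomputed rubric always wins and never touches the judge model.
  if (task.precomputedRubric !== undefined) return task.precomputedRubric;
  // Assumed key shape: one entry per (dataset, instruction, judge model).
  const key = JSON.stringify([task.dataset, task.instruction, judgeModel]);
  return cache.getOrGenerate(key, () => generate(task.instruction, judgeModel));
}
```

Caching the promise rather than the resolved value avoids duplicate judge calls for identical tasks started in parallel. A production cache would likely also evict entries whose generation rejected.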
…ved judge key

Address cubic review on #2299:

- Sanitize the instruction-derived fallback segment of the verifier TaskSpec id (replace non [A-Za-z0-9_-] with _) so it can't inject `/` or `..` into the persisted trajectory directory path.
- Move the judge-key check ahead of the try/catch and throw a clear config error when EVAL_CLAUDE_CODE_VERIFIER_MODEL is set but its provider key can't be resolved, instead of silently downgrading the run to legacy self-report. The built-in gemini default stays graceful.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
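The sanitization rule in the first bullet amounts to a single character-class replacement. A minimal sketch, assuming a hypothetical helper name `sanitizeIdSegment` (the commit message gives the rule, not the function):

```typescript
// Replace every character outside [A-Za-z0-9_-] with "_", so the segment
// cannot contain "/" or "." and therefore cannot form ".." path components.
function sanitizeIdSegment(raw: string): string {
  return raw.replace(/[^A-Za-z0-9_-]/g, "_");
}
```

For example, an instruction fragment such as `../etc/passwd` becomes `___etc_passwd`, which stays inside the intended trajectory directory.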
2 issues found across 1 file (changes from recent commits).
…ier throw

Address two Cubic findings on the claude_code verifier judge-key guard:

1. Exempt API-keyless providers. loadApiKeyFromEnv returns undefined for keyless providers (ollama, bedrock, which are absent from providerEnvVarMap) by design, but the fail-fast guard treated that as a config error and rejected them. Only throw when the judge provider genuinely requires a key (present in providerEnvVarMap) and it is missing; keyless judges now proceed with no explicit apiKey. Key-requiring providers with a missing key still fail fast, keeping the silent-Gemini-key bug fixed.
2. No tool-adapter leak on verifier throw. buildClaudeCodeVerifierConfig was called before the try/finally that owns the prepared tool adapter, so a fail-fast throw skipped toolAdapter.cleanup(). Moved the call inside the try so cleanup runs on a verifier-config throw. Fail-fast behavior is preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
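The second fix is an ordering change: the config builder moves inside the `try` whose `finally` owns the adapter. The shape of that pattern is sketched below under stated assumptions. `executeWithCleanup` and the `ToolAdapter` interface are hypothetical stand-ins, and the real harness passes more arguments than shown here.

```typescript
interface ToolAdapter {
  cleanup(): Promise<void>;
}

// Hypothetical reduction of the harness flow to the part that matters here.
async function executeWithCleanup<V>(
  prepareAdapter: () => Promise<ToolAdapter>,
  buildVerifierConfig: () => V | undefined, // may throw a config error
  runAgent: (adapter: ToolAdapter, verifier: V | undefined) => Promise<void>,
): Promise<void> {
  const adapter = await prepareAdapter();
  try {
    // Built inside the try, so a fail-fast throw still reaches finally.
    const verifier = buildVerifierConfig();
    await runAgent(adapter, verifier);
  } finally {
    await adapter.cleanup();
  }
}
```

The error still propagates to the caller, so fail-fast behavior is unchanged; only the adapter leak goes away.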
2 issues found across 1 file (changes from recent commits).
Address cubic review nits on the claude_code rubric verifier config.

FIX 1 (gateway judge credential): the keyless-provider exemption treated any provider absent from the SDK's providerEnvVarMap as keyless. But `gateway/...` (Vercel AI Gateway) is not in the map yet needs AI_GATEWAY_API_KEY, so a `gateway/` judge override would silently proceed without its credential and downgrade the verifier to self-report. Add resolveJudgeApiKey (maps `gateway` → AI_GATEWAY_API_KEY, else loadApiKeyFromEnv) and judgeProviderRequiresKey (true for providerEnvVarMap entries plus `gateway`) so that a gateway judge resolves its key and still fails fast when it is missing; ollama/bedrock and the default gemini judge stay exempt.

FIX 2 (regression tests): add packages/evals/tests/framework/verifierConfig.test.ts covering (a) a keyless override (ollama) builds a config without an apiKey, (b) an anthropic override with the key unset throws the config error while toolAdapter.cleanup() still runs (fail-fast inside try/finally, via claudeCodeHarness.execute), and (c) a gateway/ override resolves AI_GATEWAY_API_KEY (plus a missing-gateway-key fail-fast case). Exported buildClaudeCodeVerifierConfig for direct unit testing.

Unit tests: 349 pass (was 345, +4 new). Build + typecheck + lint clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
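The decision table in FIX 1 can be sketched as below. The function names `resolveJudgeApiKey` and `judgeProviderRequiresKey` are taken from the commit message, but their bodies here are guesses. `PROVIDER_ENV_VARS` is an illustrative stand-in for the SDK's providerEnvVarMap (its real contents are not shown in this PR), `providerOf` and `judgeApiKeyForOverride` are hypothetical, and the sketch models only the override path, not the graceful default-gemini path.

```typescript
type Env = Record<string, string | undefined>;

// Illustrative subset only; NOT the SDK's actual providerEnvVarMap.
const PROVIDER_ENV_VARS = new Map<string, string>([
  ["anthropic", "ANTHROPIC_API_KEY"],
  ["openai", "OPENAI_API_KEY"],
]);

// Hypothetical: provider is the text before the first "/" in the model id.
function providerOf(model: string): string {
  const slash = model.indexOf("/");
  return slash === -1 ? model : model.slice(0, slash);
}

function judgeProviderRequiresKey(provider: string): boolean {
  return provider === "gateway" || PROVIDER_ENV_VARS.has(provider);
}

function resolveJudgeApiKey(provider: string, env: Env): string | undefined {
  if (provider === "gateway") return env["AI_GATEWAY_API_KEY"];
  const envVar = PROVIDER_ENV_VARS.get(provider);
  return envVar === undefined ? undefined : env[envVar];
}

// Fail fast only when the provider needs a key and none is set.
function judgeApiKeyForOverride(model: string, env: Env): string | undefined {
  const provider = providerOf(model);
  const key = resolveJudgeApiKey(provider, env);
  if (!key && judgeProviderRequiresKey(provider)) {
    throw new Error(`Judge model "${model}" requires an API key, but none is set`);
  }
  return key;
}
```

Keyless providers such as ollama fall through with `undefined`, while a `gateway/` override without `AI_GATEWAY_API_KEY` throws instead of proceeding without a credential.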
No changeset needed: this change is scoped entirely to |
Summary

Two fixes to the `claude_code` / browse eval harness in `packages/evals`.

1. Contract fix: browse CLI v0.9.1

The harness drove browse with a stale contract (`browse --json … env local`). browse CLI v0.9.1 dropped the `env` subcommand and the global `--json` flag, so every browse invocation in the harness was broken. Fixed to:

- Select the mode per command with `--local`/`--remote` instead of the removed `env` subcommand.
- Pass `--session <name>` on every command to pin the eval's daemon.
- Rely on the CLI's JSON-by-default output (no global `--json`).
- Pass the mode flag only to the driver commands that accept it (skipped for `stop`/`status`), and pass it explicitly so that a set `BROWSERBASE_API_KEY` cannot silently auto-select remote when we asked for local.

Files: `packages/evals/framework/claudeCodeToolAdapter.ts`, `packages/evals/core/tools/browse_cli.ts`.

2. Verifier wiring: V3Evaluator rubric scoring
The `claude_code` path scored solely off the agent's self-reported `EVAL_RESULT` line, i.e. the agent grading its own homework. The `V3Evaluator` rubric verifier already existed inside `claudeCodeRunner`, but no caller ever constructed or passed a `ClaudeCodeVerifierConfig` (unfinished migration from #2137).

Now `benchHarness` builds that config and threads it through:

- Carrier: a browser-free `V3` (`disableAPI: true`) used purely as the LLM-client carrier for `V3Evaluator`. This mirrors how `evals verify` constructs its carrier; it is never `init()`-ed.
- Judge model: defaults to `google/gemini-2.5-flash` (V3Evaluator's own tuned default), overridable via `EVAL_CLAUDE_CODE_VERIFIER_MODEL`.
- Rubric: taken from the row's `precomputed_rubric` when present, otherwise generated + cached per dataset.
- Default ON: set `EVAL_CLAUDE_CODE_VERIFIER=0/false/off` to fall back to the legacy self-report path.
- `externalHarnessPlan` threads `precomputed_rubric`/`expectedAnswer` into the `TaskSpec`.
- `claudeCodeRunner` gains judge-model + judge-key plumbing, fixing a latent bug where a non-default (e.g. Anthropic) judge received the Gemini key as its credential (`invalid x-api-key`).

Judge-model caveat: small models (e.g. `anthropic/claude-haiku-4-5`) intermittently fail the fused structured-output judgment ("response did not match schema"), which the verifier surfaces as `evidenceInsufficient` → a spurious `outcome=false`. That is why the default judge is `gemini-2.5-flash`, not the agent model.

No changeset: `@browserbasehq/stagehand-evals` is `private: true` (never published), so it does not trigger a release.

E2E Test Matrix
- `pnpm --dir packages/evals build` + `typecheck`: `Done` (esm + cli), `tsc --noEmit` exit 0, no errors.
- `pnpm --dir packages/evals test:unit` (`vitest run`)
- `--env local`, agent `anthropic/claude-haiku-4-5`, default judge (`gemini-2.5-flash`), `EVAL_WEBTAILBENCH_IDS=selfdoc_deadlink` (mock harness, achievable task): `result: outcome=true process=1.00`, `evidenceInsufficient: []` → run passed (pass rate 100%).
- `EVAL_WEBTAILBENCH_IDS=selfdoc_gated EVAL_CLAUDE_CODE_MAX_TURNS=3` (task cannot complete in the turn budget): `result: outcome=false process=0.25`, `evidenceInsufficient: []` → run failed (pass rate 0%). … `outcome=false`, even though the legacy self-report path would be trivially gameable.

Discrimination evidence was read from the persisted `scores/result.json` (via `VERIFIER_PERSIST_TRAJECTORIES=1`) rather than TUI stdout, so the values above are the exact `EvaluationResult` fields. The mock harness + `selfdoc_*` rows were throwaway (not committed); the committed diff is exactly the 5 infra files.

Broad browserbase-corpus validation (WebTailBench / WebVoyager at scale) follows in the A/B run.
Linear
STG-2457 — https://linear.app/browserbase/issue/STG-2457/repair-claude-code-browse-eval-harness-rubric-score-via-v3evaluator
Summary by cubic

Repairs the `claude_code` browse eval harness by updating to the new `browse` CLI and enabling rubric-based scoring via `V3Evaluator`, addressing STG-2457. Restores local/remote runs and provides reliable pass/fail judgments independent of agent self-report, including correct credentials for `gateway/` judges.

Bug Fixes

- Updated `browse` CLI integration to v0.9.1: per-command `--local`/`--remote` + `--session`, JSON-by-default, and skip mode for `stop`/`status`; require explicit mode to prevent accidental remote when `BROWSERBASE_API_KEY` is set.
- Resolve the judge key for `gateway/*` via `AI_GATEWAY_API_KEY`; fail fast when an override judge needs a key and it's missing; exempt keyless providers (e.g., `ollama`, `bedrock`); build config inside try/finally so the tool adapter always cleans up.
- Sanitize the instruction-derived `TaskSpec.id` to avoid path issues in persisted trajectories.
- Added tests covering keyless overrides, `gateway/*` resolution, missing-key fail-fast, and cleanup on errors.

New Features

- Rubric-based scoring for `claude_code` via `V3Evaluator` using a browser-free `V3` carrier from `@browserbasehq/stagehand`.
- Judge model defaults to `google/gemini-2.5-flash`; override with `EVAL_CLAUDE_CODE_VERIFIER_MODEL`. Disable verifier with `EVAL_CLAUDE_CODE_VERIFIER=0/false/off`.
- Thread `precomputed_rubric` and `expectedAnswer` into the `TaskSpec` when present; otherwise generate and cache rubrics per dataset.

Written for commit 821d497. Summary will update on new commits.