Skip to content

[Reproducibility] Unable to reproduce Opus 4.7 LeaderBoard score (~64.5%) with official code and SDK — getting ~58.78% #9

@XueqingLin

Description

@XueqingLin

Summary

I attempted to reproduce the Claude Opus 4.7 + Claude Code score on the LeaderBoard (~64.5%) using the official Workspace-Bench code from the main branch with the exact same SDK version (@anthropic-ai/claude-agent-sdk v0.2.107). Despite matching the code and SDK, I consistently get ~58.78% on Workspace-Bench-Lite (100 tasks), a gap of approximately 6 points.

Reproduction Steps

  1. Cloned the official repository (main branch)
  2. Installed @anthropic-ai/claude-agent-sdk v0.2.107 (with underlying @anthropic-ai/sdk v0.81.0)
  3. Ran evaluation on Workspace-Bench-Lite (100 tasks)
  4. Used Agent-as-a-Judge for scoring

Environment Details

Component Configuration
Code OpenDataBox/Workspace-Bench main branch
Agent SDK @anthropic-ai/claude-agent-sdk v0.2.107
Underlying SDK @anthropic-ai/sdk v0.81.0
Model claude-opus-4-7 (via API gateway)
Dataset Workspace-Bench-Lite (100 tasks)
Judge Agent-as-a-Judge
Concurrency 100 tasks in parallel

Results

Metric Value
Rubric Pass Rate 58.78% (1091/1856)
Mean (per-task) 60.5%
Median 59.5%
Perfect score tasks (100%) 14/100
Zero score tasks (0%) 1/100
API failures 0
Agent success rate 100%
Judge completion rate 100%

Known Adaptation

Since I accessed claude-opus-4-7 through a compatible API gateway rather than the Anthropic official API directly, I needed to adapt the thinking parameter:

  • The SDK's default thinking.type: "enabled" was changed to "adaptive" (gateway requirement)
  • Sampling parameters (temperature, top_p) were removed (gateway restriction)
  • output_config.effort: "high" was injected

These adaptations were necessary because the gateway enforces stricter parameter constraints than the Anthropic official API.

Questions

To help the community reproduce the LeaderBoard results, could you clarify:

  1. API Endpoint: Was the LeaderBoard score obtained using the Anthropic official API (api.anthropic.com) directly?
  2. Thinking Configuration: What thinking parameters were used (e.g., type: "enabled" vs "adaptive", budget_tokens value)?
  3. Model Parameters: Were any specific max_tokens, temperature, or other parameters configured?
  4. SDK Version: Was @anthropic-ai/claude-agent-sdk v0.2.107 the exact version used for the LeaderBoard evaluation?
  5. Judge Model: Which model was used as the Judge for the LeaderBoard scores?

Analysis

I've ruled out infrastructure issues (0 API failures, 100% task completion). The ~6 point gap likely stems from differences in API endpoint behavior or calling parameters. Any configuration details would greatly help the community reproduce the benchmark results accurately.

Thank you!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions