Summary
I attempted to reproduce the Claude Opus 4.7 + Claude Code score on the LeaderBoard (~64.5%) using the official Workspace-Bench code from the main branch with the exact same SDK version (@anthropic-ai/claude-agent-sdk v0.2.107). Despite matching the code and SDK, I consistently get ~58.78% on Workspace-Bench-Lite (100 tasks), a gap of approximately 6 points.
Reproduction Steps
- Cloned the official repository (
main branch)
- Installed
@anthropic-ai/claude-agent-sdk v0.2.107 (with underlying @anthropic-ai/sdk v0.81.0)
- Ran evaluation on Workspace-Bench-Lite (100 tasks)
- Used Agent-as-a-Judge for scoring
Environment Details
| Component |
Configuration |
| Code |
OpenDataBox/Workspace-Bench main branch |
| Agent SDK |
@anthropic-ai/claude-agent-sdk v0.2.107 |
| Underlying SDK |
@anthropic-ai/sdk v0.81.0 |
| Model |
claude-opus-4-7 (via API gateway) |
| Dataset |
Workspace-Bench-Lite (100 tasks) |
| Judge |
Agent-as-a-Judge |
| Concurrency |
100 tasks in parallel |
Results
| Metric |
Value |
| Rubric Pass Rate |
58.78% (1091/1856) |
| Mean (per-task) |
60.5% |
| Median |
59.5% |
| Perfect score tasks (100%) |
14/100 |
| Zero score tasks (0%) |
1/100 |
| API failures |
0 |
| Agent success rate |
100% |
| Judge completion rate |
100% |
Known Adaptation
Since I accessed claude-opus-4-7 through a compatible API gateway rather than the Anthropic official API directly, I needed to adapt the thinking parameter:
- The SDK's default
thinking.type: "enabled" was changed to "adaptive" (gateway requirement)
- Sampling parameters (
temperature, top_p) were removed (gateway restriction)
output_config.effort: "high" was injected
These adaptations were necessary because the gateway enforces stricter parameter constraints than the Anthropic official API.
Questions
To help the community reproduce the LeaderBoard results, could you clarify:
- API Endpoint: Was the LeaderBoard score obtained using the Anthropic official API (
api.anthropic.com) directly?
- Thinking Configuration: What
thinking parameters were used (e.g., type: "enabled" vs "adaptive", budget_tokens value)?
- Model Parameters: Were any specific
max_tokens, temperature, or other parameters configured?
- SDK Version: Was
@anthropic-ai/claude-agent-sdk v0.2.107 the exact version used for the LeaderBoard evaluation?
- Judge Model: Which model was used as the Judge for the LeaderBoard scores?
Analysis
I've ruled out infrastructure issues (0 API failures, 100% task completion). The ~6 point gap likely stems from differences in API endpoint behavior or calling parameters. Any configuration details would greatly help the community reproduce the benchmark results accurately.
Thank you!
Summary
I attempted to reproduce the Claude Opus 4.7 + Claude Code score on the LeaderBoard (~64.5%) using the official Workspace-Bench code from the
mainbranch with the exact same SDK version (@anthropic-ai/claude-agent-sdkv0.2.107). Despite matching the code and SDK, I consistently get ~58.78% on Workspace-Bench-Lite (100 tasks), a gap of approximately 6 points.Reproduction Steps
mainbranch)@anthropic-ai/claude-agent-sdkv0.2.107 (with underlying@anthropic-ai/sdkv0.81.0)Environment Details
OpenDataBox/Workspace-Benchmainbranch@anthropic-ai/claude-agent-sdkv0.2.107@anthropic-ai/sdkv0.81.0claude-opus-4-7(via API gateway)Results
Known Adaptation
Since I accessed
claude-opus-4-7through a compatible API gateway rather than the Anthropic official API directly, I needed to adapt thethinkingparameter:thinking.type: "enabled"was changed to"adaptive"(gateway requirement)temperature,top_p) were removed (gateway restriction)output_config.effort: "high"was injectedThese adaptations were necessary because the gateway enforces stricter parameter constraints than the Anthropic official API.
Questions
To help the community reproduce the LeaderBoard results, could you clarify:
api.anthropic.com) directly?thinkingparameters were used (e.g.,type: "enabled"vs"adaptive",budget_tokensvalue)?max_tokens,temperature, or other parameters configured?@anthropic-ai/claude-agent-sdkv0.2.107 the exact version used for the LeaderBoard evaluation?Analysis
I've ruled out infrastructure issues (0 API failures, 100% task completion). The ~6 point gap likely stems from differences in API endpoint behavior or calling parameters. Any configuration details would greatly help the community reproduce the benchmark results accurately.
Thank you!