[Reproducibility] Unable to reproduce Opus 4.7 LeaderBoard score (~64.5%) with official code and SDK — getting ~58.78%

### Summary

I attempted to reproduce the Claude Opus 4.7 + Claude Code score on the [LeaderBoard](https://workspace-bench.github.io/leaderboard.html) (~64.5%) using the official Workspace-Bench code from the `main` branch and the same SDK version (`@anthropic-ai/claude-agent-sdk` v0.2.107). Despite matching the code and SDK, I consistently get **58.78%** on Workspace-Bench-Lite (100 tasks), a gap of roughly **5.7 percentage points**.

### Reproduction Steps

1. Cloned the official repository (`main` branch)
2. Installed `@anthropic-ai/claude-agent-sdk` v0.2.107 (with underlying `@anthropic-ai/sdk` v0.81.0)
3. Ran evaluation on Workspace-Bench-Lite (100 tasks)
4. Used Agent-as-a-Judge for scoring

### Environment Details

| Component | Configuration |
|-----------|--------------|
| Code | `OpenDataBox/Workspace-Bench` `main` branch |
| Agent SDK | `@anthropic-ai/claude-agent-sdk` v0.2.107 |
| Underlying SDK | `@anthropic-ai/sdk` v0.81.0 |
| Model | `claude-opus-4-7` (via API gateway) |
| Dataset | Workspace-Bench-Lite (100 tasks) |
| Judge | Agent-as-a-Judge |
| Concurrency | 100 tasks in parallel |

### Results

| Metric | Value |
|--------|-------|
| Rubric Pass Rate | **58.78%** (1091/1856) |
| Mean (per-task) | 60.5% |
| Median | 59.5% |
| Perfect score tasks (100%) | 14/100 |
| Zero score tasks (0%) | 1/100 |
| API failures | **0** |
| Agent success rate | **100%** |
| Judge completion rate | **100%** |
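
For reference, the arithmetic behind these figures can be sketched in a few lines of Python. The first part uses only the aggregates from the table. The second part uses made-up task counts (the `toy_tasks` list and the helper names `pooled` and `per_task_mean` are illustrative, not from the benchmark code) to show why the pooled rubric pass rate and the per-task mean can differ. Which of the two the LeaderBoard reports is an assumption here; the gap is computed against the pooled rate.

```python
# Aggregates reported in the Results table.
passed_items, total_items = 1091, 1856
leaderboard_score = 64.5  # approximate LeaderBoard figure, in percent

pooled_rate = 100 * passed_items / total_items
gap = leaderboard_score - pooled_rate
print(f"pooled rubric pass rate: {pooled_rate:.2f}%")
print(f"gap to LeaderBoard: {gap:.2f} points")


def pooled(tasks):
    """Pass rate over all rubric items, so larger tasks weigh more."""
    return 100 * sum(p for p, _ in tasks) / sum(n for _, n in tasks)


def per_task_mean(tasks):
    """Average of each task's own pass rate, so every task weighs the same."""
    return sum(100 * p / n for p, n in tasks) / len(tasks)


# Made-up (passed, total) counts for two tasks, for illustration only.
toy_tasks = [(3, 4), (1, 10)]
print(pooled(toy_tasks), per_task_mean(toy_tasks))
```

Because tasks carry different numbers of rubric items, a few large tasks with low pass rates pull the pooled rate below the per-task mean, which is consistent with 58.78% versus 60.5% above.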

### Known Adaptation

Since I accessed `claude-opus-4-7` through a compatible API gateway rather than the official Anthropic API directly, I needed to adapt the request parameters:

- The SDK's default `thinking.type: "enabled"` was changed to `"adaptive"` (gateway requirement)
- Sampling parameters (`temperature`, `top_p`) were removed (gateway restriction)
- `output_config.effort: "high"` was injected

These adaptations were necessary because the gateway enforces stricter parameter constraints than the official Anthropic API.
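
To make the three changes concrete, here is a minimal Python sketch of the rewrite applied to a request body represented as a plain dict. The function name `adapt_for_gateway` and the example request are hypothetical; the sketch does not show where in the SDK or gateway the rewrite actually happens. It also leaves any other keys under `thinking` untouched, since how those were handled is not covered above.

```python
from copy import deepcopy


def adapt_for_gateway(request: dict) -> dict:
    """Return a copy of the request with the three gateway adaptations applied."""
    adapted = deepcopy(request)  # do not mutate the caller's dict

    # 1. thinking.type "enabled" -> "adaptive" (gateway requirement).
    thinking = adapted.get("thinking")
    if isinstance(thinking, dict) and thinking.get("type") == "enabled":
        thinking["type"] = "adaptive"

    # 2. Drop sampling parameters the gateway rejects.
    for key in ("temperature", "top_p"):
        adapted.pop(key, None)

    # 3. Inject output_config.effort = "high", keeping other output_config keys.
    adapted.setdefault("output_config", {})["effort"] = "high"
    return adapted


# Hypothetical request body, for illustration only.
original = {
    "model": "claude-opus-4-7",
    "thinking": {"type": "enabled"},
    "temperature": 1.0,
    "top_p": 0.95,
}
adapted = adapt_for_gateway(original)
print(adapted)
```

Comparing `original` and `adapted` shows which request fields differ from a run that sends the SDK defaults to the official API.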

### Questions

To help the community reproduce the LeaderBoard results, could you clarify:

1. **API Endpoint**: Was the LeaderBoard score obtained using the official Anthropic API (`api.anthropic.com`) directly?
2. **Thinking Configuration**: What `thinking` parameters were used (e.g., `type: "enabled"` vs `"adaptive"`, `budget_tokens` value)?
3. **Model Parameters**: Were any specific `max_tokens`, `temperature`, or other parameters configured?
4. **SDK Version**: Was `@anthropic-ai/claude-agent-sdk` v0.2.107 the exact version used for the LeaderBoard evaluation?
5. **Judge Model**: Which model was used as the Judge for the LeaderBoard scores?

### Analysis

Infrastructure failures do not explain the gap: there were 0 API failures, and every agent run and judge run completed. The remaining candidates are differences in API endpoint behavior or request parameters, including the gateway adaptations listed under Known Adaptation. Any configuration details would greatly help the community reproduce the benchmark results accurately.
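
As a rough check on whether sampling noise alone could account for the gap, the sketch below computes a naive 95% binomial interval around the pooled pass rate. It assumes the 1856 rubric items are independent, which they are not (items are grouped within 100 tasks), and it ignores judge variance, so the true uncertainty is wider than shown. The helper name `binomial_interval` is illustrative.

```python
import math


def binomial_interval(passed: int, total: int, z: float = 1.96):
    """Normal-approximation interval for a pass rate, assuming independent items."""
    p = passed / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of the proportion
    return p - z * se, p + z * se


low, high = binomial_interval(1091, 1856)
print(f"naive 95% interval: {100 * low:.1f}% to {100 * high:.1f}%")
print("LeaderBoard score inside interval:", low <= 0.645 <= high)
```

Under these simplifying assumptions the ~64.5% LeaderBoard figure falls outside the interval, which suggests a configuration difference, though it does not establish one.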

Thank you!

