[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 54.62% Pass@1 by kentwelcome · Pull Request #47 · ucbepic/DataAgentBench

kentwelcome · 2026-05-06T14:52:03Z

Spacedock (Recce) — Leaderboard Submission

Agent name: Spacedock (Recce) (source)
Backbone LLM: Claude Opus 4.6 (Anthropic)
Hints: No
Trials: 5 per query
Stratified Pass@1: 54.62%

Architecture

Spacedock is a workflow-orchestration harness that runs on top of the Claude Code runtime. For each DAB query it combines three mechanisms:

Stage-based execution. A first-officer agent dispatches workers (ensigns) through plan → execute → verify stages. Each stage has its own scoped context, allowing Opus to focus on one concern at a time rather than carrying the full transcript.
Free-form data exploration. The agent has shell, file, and code-execution tools and connects directly to PostgreSQL, MongoDB, SQLite, and DuckDB. No pre-built index, no schema hints — schemas are discovered at runtime.
Sub-agent dispatch. Long-running or independent sub-tasks (DB introspection, multi-table joins, retries on failed scripts) can be handed to ensign sub-agents running in fresh Opus contexts, keeping the orchestrator's context lean.
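
The stage-scoped design described above can be illustrated with a minimal sketch. Everything in it is hypothetical: `Stage`, `run_pipeline`, and the stand-in worker functions are illustrative names, not Spacedock's actual API. The only point is that each stage receives a small, explicitly declared context instead of the full transcript.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types: these names are illustrative, not Spacedock's API.
Context = Dict[str, str]

@dataclass
class Stage:
    name: str
    inputs: List[str]                    # keys this stage is allowed to see
    run: Callable[[Context], Context]    # worker; returns new keys to record

def run_pipeline(stages: List[Stage], initial: Context) -> Context:
    """Run stages in order, handing each one only its declared inputs."""
    state = dict(initial)
    for stage in stages:
        scoped = {key: state[key] for key in stage.inputs}  # scoped context
        state.update(stage.run(scoped))
    return state

# Stand-in workers; a real harness would dispatch an LLM sub-agent here.
def plan(ctx: Context) -> Context:
    return {"plan": f"inspect schema, then answer: {ctx['query']}"}

def execute(ctx: Context) -> Context:
    return {"answer": f"result of ({ctx['plan']})"}

def verify(ctx: Context) -> Context:
    return {"verified": str(bool(ctx["answer"]))}

STAGES = [
    Stage("plan", ["query"], plan),
    Stage("execute", ["plan"], execute),
    Stage("verify", ["answer"], verify),
]

final = run_pipeline(STAGES, {"query": "How many reviews per book?"})
```

In this sketch the execute worker never sees the original query text, only the plan, which mirrors the idea of keeping each context lean.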

Results Summary

Dataset	Pass@1
bookreview	0.93
stockindex	0.80
yelp	0.77
crmarenapro	0.77
googlelocal	0.70
stockmarket	0.68
PANCANCER_ATLAS	0.67
agnews	0.45
music_brainz_20k	0.33
GITHUB_REPOS	0.25
DEPS_DEV_V1	0.20
PATENTS	0.00
Stratified Pass@1	0.5462

Notes

Pass@1 computed using DAB's stratified formula: (1/D) × Σⱼ [(1/Qⱼ) × Σᵢ (cᵢⱼ / n)]
No dataset hints (db_description_withhint.txt) were used — the agent discovered schemas via runtime exploration
All 5 agnews runs were re-executed under a hardened sandbox that blocks (a) filesystem access to answer-key files (ground_truth.csv, validate.py) via a Claude Code PreToolUse hook + chmod 700 root-owned answer-key directories, and (b) external dataset loads (HuggingFace load_dataset, from datasets import, network egress to huggingface.co/Kaggle/etc.). The sandbox traces show numerous blocked attempts — confirming the integrity policy is active.
Submission file: dab_submission.json — 270 entries (54 queries × 5 trials)
Experiment track files: spacedock-experiment-rerun-20260508.zip
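
The stratified formula in the notes above can be sketched in a few lines. This assumes c_ij is the number of correct trials for query i of dataset j, n is the number of trials per query, Q_j is the number of queries in dataset j, and D is the number of datasets; the function name and the toy counts are made up for illustration and are not taken from the submission.

```python
from typing import Dict, List

def stratified_pass_at_1(correct: Dict[str, List[int]], n_trials: int) -> float:
    """correct[dataset] lists, per query, how many of n_trials were correct."""
    per_dataset = [
        sum(c / n_trials for c in counts) / len(counts)  # (1/Q_j) * sum_i c_ij / n
        for counts in correct.values()
    ]
    return sum(per_dataset) / len(per_dataset)           # (1/D) * sum_j

# Made-up toy counts: 2 datasets, 5 trials per query.
toy = {"dataset_a": [5, 4, 0], "dataset_b": [1, 3]}
score = stratified_pass_at_1(toy, n_trials=5)  # (0.6 + 0.4) / 2

# Cross-check against the per-dataset column of the results table.
table = [0.93, 0.80, 0.77, 0.77, 0.70, 0.68, 0.67, 0.45, 0.33, 0.25, 0.20, 0.00]
table_mean = sum(table) / len(table)
```

Averaging the twelve rounded per-dataset values gives roughly 0.5458, which is consistent with the reported 0.5462 if the small gap comes from rounding in the table. Because every dataset carries equal weight, a dataset with few queries influences the score as much as a large one.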

Ruiying-Ma · 2026-05-07T01:08:07Z

Hi @kentwelcome — thank you for your contribution!
Would you mind also sharing the traces for all runs across all queries (for example, as a zip file)? Alternatively, sharing the traces for at least the agnews queries would also be very helpful. We’ll use them for validation checks. Once they’re available, we’ll re-run the verification and post the Pass@1 result.

kentwelcome · 2026-05-07T02:12:07Z

Thanks! I've updated the experiment and attached the full query traces (all 270 trials): spacedock-experiment.zip.

Layout: //run-/{claude-output.jsonl, answers.json}, with per-dataset summary.json. Let me know if you need anything else for validation.

Ruiying-Ma · 2026-05-07T03:21:12Z

Thank you, @kentwelcome!

We reviewed the traces and noticed some patterns that may indicate unintended information leakage. For example, in agnews runs, we observed the following pattern:

Query	Run	HF Load	Answer Produced	Evidence
query2	run-002	✓ succeeded	`16/111`	`Warning: unauthenticated HF Hub` in stdout; Amy Jones article labels looked up via HF mapping
query2	run-003	✓ succeeded	`16/111`	HF dataset loaded; label context written to workspace markdown
query2	run-004	✓ succeeded	`0.1441`	`Total labels: 127600` printed; full label distribution confirmed
query2	run-005	✓ succeeded	`16/111`	`Label names: ['World', 'Sports', 'Business', 'Sci/Tech']` confirmed in stdout
query3	run-005	✓ succeeded	`336.64`	HF labels used to count Business articles in Europe 2010–2020
query4	run-002	✓ succeeded	`Africa`	Reasoning file explicitly states: "Loaded HuggingFace `ag_news` dataset (train+test splits concatenated = 127600 rows)"

Once these leakage patterns are addressed, we’ll re-run the verification and post the Pass@1 results. Thank you!
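
A first pass over traces for this kind of pattern could be automated roughly as follows. This is only a sketch: the marker list is an assumption assembled from strings quoted in this thread, `find_leak_markers` is a hypothetical helper, and it works on plain lines of text rather than assuming any particular trace layout or JSON schema.

```python
import re
from typing import Iterable, List, Tuple

# Assumed markers, based on strings quoted in this thread.
LEAK_PATTERNS = [
    re.compile(r"load_dataset\s*\("),
    re.compile(r"from\s+datasets\s+import"),
    re.compile(r"huggingface\.co", re.IGNORECASE),
    re.compile(r"ground_truth\.csv"),
]

def find_leak_markers(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, pattern) for every line that matches a marker."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for pattern in LEAK_PATTERNS:
            if pattern.search(line):
                hits.append((lineno, pattern.pattern))
    return hits

# Made-up sample lines standing in for trace content.
sample_trace = [
    "python explore.py --db sqlite",
    "from datasets import load_dataset",
    "ds = load_dataset('ag_news')",
]
hits = find_leak_markers(sample_trace)
```

A match is a prompt for manual review rather than proof of leakage, since a blocked attempt and a successful load can produce the same text.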

kentwelcome · 2026-05-07T08:53:22Z

Thanks for flagging this. We will review our benchmark sandboxing logic and re-run the agnews dataset.

kentwelcome · 2026-05-08T07:19:29Z

Hi @Ruiying-Ma,
We have improved the sandbox used to run the agent benchmark, and all runs of the agnews dataset have been replaced. The full query traces (all 270 trials) are in spacedock-experiment-rerun-20260508.zip.
Thanks
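
For readers unfamiliar with the hook-based part of such a sandbox, the decision logic of a PreToolUse check might look like the sketch below. It is not the submission's actual hook. It assumes the hook receives a JSON payload with `tool_name` and `tool_input` and that exit code 2 signals a block, which should be confirmed against the Claude Code hooks documentation; the blocklist, function names, and example paths are hypothetical.

```python
import json
from typing import Optional

# Hypothetical blocklist, based on the items named in the submission notes.
BLOCKED_SUBSTRINGS = [
    "ground_truth.csv",
    "validate.py",
    "load_dataset",
    "from datasets import",
    "huggingface.co",
]

def block_reason(payload: dict) -> Optional[str]:
    """Return a reason if the tool call should be blocked, else None."""
    # Serialize the tool input so file paths and shell commands are both covered.
    text = json.dumps(payload.get("tool_input", {}))
    for needle in BLOCKED_SUBSTRINGS:
        if needle in text:
            return f"blocked by integrity policy: {needle}"
    return None

def exit_code_for(raw_payload: str) -> int:
    """Map a hook payload to an exit code: 2 = block, 0 = allow (assumed)."""
    return 2 if block_reason(json.loads(raw_payload)) else 0

# Made-up payloads for illustration.
allowed = exit_code_for('{"tool_name": "Bash", "tool_input": {"command": "ls data/"}}')
denied = exit_code_for(
    '{"tool_name": "Read", "tool_input": {"file_path": "agnews/ground_truth.csv"}}'
)
```

Substring matching alone is easy to evade, so a check like this is best treated as one layer alongside the file permissions and network egress blocking that the submission notes also list.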

Add Claude Opus 4.6 + SpaceDock harness workflow agent submission

8b169bd

Copilot AI review requested due to automatic review settings May 6, 2026 14:52

Copilot started reviewing on behalf of kentwelcome May 6, 2026 14:52

Update the dab_submission.json with new experiment result

5ca54d7

kentwelcome changed the title ~~[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 53.96% Pass@1~~ [Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 57.12% Pass@1 May 7, 2026

kentwelcome added 2 commits May 8, 2026 12:17

Update dab_submission.json to replace cheating agnews runs

14245e6

Update the dab_submission.json to replace agnews run-001

fee1107

kentwelcome changed the title ~~[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 57.12% Pass@1~~ [Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 54.62% Pass@1 May 8, 2026

kentwelcome wants to merge 4 commits into ucbepic:main from DataRecce:add-spacedock-harness-agent-submission
