[Leaderboard] Spacedock (Recce) — Claude Opus 4.6 — 54.62% Pass@1 #47
kentwelcome wants to merge 4 commits into ucbepic:main
Hi @kentwelcome, thank you for your contribution!
Thanks! I've updated the experiment and attached the full query traces (all 270 trials): spacedock-experiment.zip. Layout: //run-/{claude-output.jsonl, answers.json}, with per-dataset summary.json. Let me know if you need anything else for validation.
Thank you @kentwelcome! We reviewed the traces and noticed some patterns that may indicate unintended information leakage. For example, in agnews runs, we observed the following pattern:
Once these leakage patterns are addressed, we’ll re-run the verification and post the Pass@1 results. Thank you!
Thanks for the reminder. We will go back and review our benchmark sandboxing logic and re-run the agnews dataset.
Hi @Ruiying-Ma,
Spacedock (Recce) — Leaderboard Submission
Agent name: Spacedock (Recce) (source)
Backbone LLM: Claude Opus 4.6 (Anthropic)
Hints: No
Trials: 5 per query
Stratified Pass@1: 54.62%
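The stratified score can be read as a macro-average: per-query pass rates are averaged within each dataset, and the dataset means are then averaged. The sketch below follows the formula given in the Notes, (1/D) × Σⱼ [(1/Qⱼ) × Σᵢ (cᵢⱼ / n)], under the assumption that D is the number of datasets, Qⱼ the number of queries in dataset j, cᵢⱼ the number of correct trials for query i in dataset j, and n the number of trials per query. The function name and the tuple layout are hypothetical and are not the format of dab_submission.json.

```python
from collections import defaultdict


def stratified_pass_at_1(trials):
    """Macro-averaged Pass@1.

    trials: iterable of (dataset, query_id, is_correct) tuples, one per trial.
    This tuple layout is an illustrative assumption.
    """
    # Group trial outcomes by dataset, then by query.
    per_dataset = defaultdict(lambda: defaultdict(list))
    for dataset, query_id, is_correct in trials:
        per_dataset[dataset][query_id].append(bool(is_correct))
    if not per_dataset:
        raise ValueError("no trials given")

    dataset_scores = []
    for queries in per_dataset.values():
        # c_ij / n for each query, then the mean over the Q_j queries.
        query_rates = [sum(outcomes) / len(outcomes) for outcomes in queries.values()]
        dataset_scores.append(sum(query_rates) / len(query_rates))

    # Mean over the D datasets, so every dataset carries equal weight.
    return sum(dataset_scores) / len(dataset_scores)


# Toy data: dataset "a" has one query, dataset "b" has two.
toy_trials = [
    ("a", "q1", True), ("a", "q1", True),
    ("b", "q1", True), ("b", "q1", False),
    ("b", "q2", False), ("b", "q2", False),
]
toy_score = stratified_pass_at_1(toy_trials)  # 0.625
```

On the toy data a plain average over all six trials would give 0.5, while the stratified score is 0.625, because the small dataset counts as much as the larger one.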
Architecture
Spacedock is a workflow-orchestration harness that runs on top of the Claude Code runtime. For each DAB query it:
Results Summary
Notes
Stratified Pass@1 = (1/D) × Σⱼ [(1/Qⱼ) × Σᵢ (cᵢⱼ / n)]
No hint files (db_description_withhint.txt) were used; the agent discovered schemas via runtime exploration.
The sandbox blocks (a) access to answer-key files (ground_truth.csv, validate.py) via a Claude Code PreToolUse hook plus chmod 700 root-owned answer-key directories, and (b) external dataset loads (HuggingFace load_dataset, from datasets import, network egress to huggingface.co, Kaggle, etc.). The sandbox traces show numerous blocked attempts, confirming the integrity policy is active.
dab_submission.json: 270 entries (54 queries × 5 trials)
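For reviewers who want a concrete picture of the hook side of the guard, here is a minimal sketch. It is not the Spacedock implementation. It assumes the hook receives the pending tool call as JSON with a `tool_input` field and that returning exit status 2 rejects the call; the exact contract should be checked against the Claude Code hooks documentation. The names `BLOCKED_PATTERNS`, `find_violation`, and `run_hook` are hypothetical, and the patterns are only the ones named in the Notes.

```python
import json
import re

# Patterns taken from the Notes above; a real policy may cover more cases.
BLOCKED_PATTERNS = {
    "answer key": re.compile(r"ground_truth\.csv|validate\.py"),
    "external dataset load": re.compile(r"load_dataset|from\s+datasets\s+import"),
    "network egress": re.compile(r"huggingface\.co|kaggle", re.IGNORECASE),
}


def find_violation(tool_input):
    """Return the first blocked category matched in tool_input, else None."""
    # Serialize the whole input so file paths, shell commands, and code are all scanned.
    text = json.dumps(tool_input)
    for label, pattern in BLOCKED_PATTERNS.items():
        if pattern.search(text):
            return label
    return None


def run_hook(stdin_text):
    """Return an exit code: 0 to allow the call, 2 to block it (assumed convention)."""
    payload = json.loads(stdin_text)
    violation = find_violation(payload.get("tool_input", {}))
    return 0 if violation is None else 2
```

A hook script could call `sys.exit(run_hook(sys.stdin.read()))`. String matching of this kind is easy to evade (for example by building a path at runtime), so the filesystem permissions described in the Notes remain the stronger control.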