Commit de54616

EXP-bench in pubs. (#367)

1 parent: 47e52b8
File tree

2 files changed: +15 −1 lines changed

source/_data/SymbioticLab.bib

Lines changed: 15 additions & 1 deletion
@@ -2333,4 +2333,18 @@ @InProceedings{mordal:iclr26
   Our evaluation shows that Mordal can find the best VLM for a given problem using $8.9\times$--$11.6\times$ lower GPU hours than grid search.
   We have also discovered that Mordal achieves about 69\% higher weighted Kendall's $\tau$ on average than the state-of-the-art model selection method across diverse tasks.
   }
-}
+}
+
+@InProceedings{expbench:iclr26,
+  author           = {Patrick Tser Jern Kon and Qiuyi Ding and Jiachen Liu and Xinyi Zhu and Jingjia Peng and Jiarong Xing and Yibo Huang and Yiming Qiu and Jayanth Srinivasa and Myungjin Lee and Mosharaf Chowdhury and Matei Zaharia and Ang Chen},
+  booktitle        = {ICLR},
+  title            = {{EXP-BENCH}: Can {AI} Conduct {AI} Research Experiments?},
+  year             = {2026},
+  month            = {April},
+  publist_confkey  = {ICLR'26},
+  publist_link     = {paper || expbench-iclr26.pdf},
+  publist_topic    = {Systems + AI},
+  publist_abstract = {
+  Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With this pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgents, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20--35\%, the success rate for complete, executable experiments was a mere 0.5\%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve targeted research components and agent planning ability.
+  }
+}
Binary file (8.32 MB) not shown.

0 commit comments