FOREAGENT is our autonomous machine learning agent, evolved from AIDE and integrated with MLE-bench for standardized launching and evaluation. To bridge the implementation gap, it re-engineers the improvement stage into a Predict-then-Verify loop that drastically reduces execution costs. The workflow consists of three distinct phases:
- 🚀 **High-Volume Generation**: Proposing $m=10$ solution candidates in parallel to maximize search breadth without immediate execution costs.
- ⚖️ **Confidence-Gated Pairwise Selection**: Employing a confidence gate ($c=0.7$) to filter candidates, ensuring only high-certainty solutions proceed.
- ✅ **Verification Execution**: Physically verifying only the top-$k$ ($k=1$) candidate to anchor the trajectory with real execution feedback.
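One round of this loop can be sketched as follows. This is a minimal sketch, not the actual implementation: the candidate structure, the random confidence signal, and all helper functions are hypothetical stand-ins for the agent's LLM-driven components.

```python
import random

M_CANDIDATES = 10      # m: candidates proposed per round
CONFIDENCE_GATE = 0.7  # c: minimum selection confidence to pass the gate
TOP_K = 1              # k: candidates that are actually executed

def propose_candidates(m):
    # Stand-in for parallel LLM generation: each candidate carries a
    # predicted score instead of real execution feedback.
    return [{"id": i, "predicted_score": random.random()} for i in range(m)]

def confidence_gated_select(candidates, gate):
    # Stand-in for confidence-gated pairwise selection: keep only candidates
    # whose (simulated) selection confidence clears the gate, best first.
    survivors = [c for c in candidates if c["predicted_score"] >= gate]
    return sorted(survivors, key=lambda c: c["predicted_score"], reverse=True)

def execute(candidate):
    # Stand-in for physically running the candidate's solution code.
    return {"id": candidate["id"], "real_score": candidate["predicted_score"]}

def predict_then_verify_round():
    gated = confidence_gated_select(propose_candidates(M_CANDIDATES),
                                    CONFIDENCE_GATE)
    # Only the top-k gated candidates incur real execution cost; their
    # results anchor the search trajectory.
    return [execute(c) for c in gated[:TOP_K]]
```

The point of the structure is that at most $k$ of the $m$ proposals are ever executed; the rest are assessed purely by inference, which is where the execution-cost savings come from.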
👉 For more theoretical details, please refer to the ForeAgent Detailed Design.
We evaluate FOREAGENT on five diverse AI4Science tasks from MLE-bench:
| Competition ID | Domain | Dataset Size |
|---|---|---|
| stanford-covid-vaccine | 🧬 Biology | 14MB |
| ventilator-pressure-prediction | 🔭 Physics | 291MB |
| statoil-iceberg-classifier-challenge | 🌍 Geoscience | 205MB |
| aerial-cactus-identification | 🌿 Ecology | 25.4MB |
| histopathologic-cancer-detection | ⚕️ Medicine | 7.7GB |
- Benchmark: MLE-bench (12-hour limit).
- Baseline: AIDE.
- Models:
  - Coding: DeepSeek-V3.2.
  - Implicit World Modeling: DeepSeek-V3.2-Thinking.
- Reliability: 3 independent runs per task; we report the average Beat Ratio.
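For concreteness, the reliability protocol aggregates as a simple mean. The per-run values below are hypothetical, and reading Beat Ratio as the fraction of leaderboard entries the agent beats is our assumption, not a definition from this document.

```python
def average_beat_ratio(run_ratios):
    # Assumed aggregation: mean Beat Ratio across the independent runs
    # of a single task.
    return sum(run_ratios) / len(run_ratios)

# Hypothetical Beat Ratios from 3 independent runs on one task:
task_average = average_beat_ratio([0.62, 0.58, 0.66])
```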
By substituting costly execution with rapid inference, FOREAGENT achieves:
- ⚡ 6× Speedup in average execution time.
- 🔍 3.2× More Nodes explored within 1/6th of the time budget.
- 🏆 +6% Beat Ratio improvement over baselines.
We run FOREAGENT on top of MLE-bench. Please create a dedicated environment and install the package:
```bash
# Create a fresh environment (recommended)
conda create -n mlebench python=3.10
conda activate mlebench

# Install in editable mode
pip install -e .
```

Use the provided script to build the Docker image and launch agent runs. The `run_agent.py` script accepts the following key arguments:
- `--agent-id`: The agent image name.
- `--competition-set`: Path to a text file containing competition IDs.
- `--data-dir`: Path to the prepared Kaggle data directory.
- `--gpu-device`: GPU device IDs to use.
Example Command:
```bash
# See scripts/run.sh for the complete template
bash scripts/run.sh
```
⚠️ Note: Please verify and adjust paths (e.g., `--data-dir`) in `scripts/run.sh` to match your local environment before running.
Submissions must be .csv files following the competition's format.
Batch Grading: Provide a JSONL file where each line contains:

- `competition_id`: The competition ID.
- `submission_path`: Path to the submission CSV.
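For instance, such a JSONL file could be assembled as below. The record values and the output path `results.jsonl` are illustrative; only the two field names come from the grading format above.

```python
import json

# Illustrative records; the field names follow the batch-grading format.
records = [
    {"competition_id": "stanford-covid-vaccine",
     "submission_path": "runs/run1/submission.csv"},
    {"competition_id": "aerial-cactus-identification",
     "submission_path": "runs/run2/submission.csv"},
]

# Write one JSON object per line, as a .jsonl file expects.
with open("results.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```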
```bash
mlebench grade --submission-file results.jsonl
```

Single Sample Grading:
```bash
mlebench grade-sample \
  --competition-id stanford-covid-vaccine \
  --submission-file submission.csv
```