A comprehensive leaderboard for evaluating research AI systems across multiple dimensions of quality and accuracy.
DeepScholar-Bench is a live benchmark for evaluating generative research synthesis systems. It draws queries from recent ArXiv papers and focuses on generating related work sections by retrieving, synthesizing, and citing prior research. The benchmark provides holistic automated evaluation across three key dimensions with metrics that show strong agreement with expert human judgments.
The benchmark evaluates systems with seven metrics spanning three key dimensions:

Knowledge Synthesis
- Organization (Org.): Assesses the organization and coherence of the system's answer
- Nugget Coverage (Nugget Cov.): Assesses the answer's coverage of essential facts

Retrieval Quality
- Relevance Rate (Rel. Rate): Measures the average relevance of all referenced sources
- Document Importance (Doc. Imp.): Measures how notable the referenced sources are, using citation counts
- Reference Coverage (Ref. Cov.): Assesses the referenced set's coverage of key, important references

Verifiability
- Citation Precision (Cite-P): Measures the percentage of cited sources that support their accompanying claims
- Claim Coverage (Claim Cov.): Measures the percentage of claims that are fully supported by their cited sources
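The two verifiability metrics above are simple ratios over per-claim support judgments. As a minimal sketch (the input schema here is hypothetical, not the benchmark's actual evaluation format):

```python
from typing import Dict, List


def citation_metrics(claims: List[Dict]) -> Dict[str, float]:
    """Compute Cite-P and Claim Cov. from per-claim support judgments.

    Each claim is assumed to look like (illustrative schema only):
      {"citations": [{"supports": True}, {"supports": False}],
       "fully_supported": False}
    """
    # Cite-P: fraction of all cited sources that support their claim.
    cited = [c for claim in claims for c in claim["citations"]]
    cite_p = sum(c["supports"] for c in cited) / len(cited) if cited else 0.0

    # Claim Cov.: fraction of claims judged fully supported by their citations.
    claim_cov = (
        sum(claim["fully_supported"] for claim in claims) / len(claims)
        if claims
        else 0.0
    )
    return {"cite_p": cite_p, "claim_cov": claim_cov}
```

In the actual benchmark these judgments are produced by automated evaluators; the arithmetic over them is what the sketch shows.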
The live leaderboard is hosted on GitHub Pages and can be accessed at: https://guestrin-lab.github.io/deepscholar-leaderboard/
deepscholar-bench/
├── deepscholar_bench_leaderboard.html # Main leaderboard HTML file
├── leaderboard_data.csv # CSV data for the leaderboard
├── create_leaderboard.py # Python script to generate the leaderboard
└── README.md # This file
To update the leaderboard with new data:

1. Run the Python script:

   `python create_leaderboard.py`

2. The script will:
   - Fetch the latest data from Google Sheets
   - Process and clean the data
   - Generate an updated HTML leaderboard
   - Save both HTML and CSV versions

3. Commit and push the changes to trigger a GitHub Pages update
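The fetch-and-render steps above can be sketched with the standard library alone. This is not the implementation in `create_leaderboard.py`; the sheet URL is a placeholder and the rendering is deliberately minimal:

```python
import csv
import io
import urllib.request

# Placeholder: a published Google Sheet can be exported as CSV via a URL of
# this form. Substitute the real sheet ID used by the project.
SHEET_CSV_URL = "https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv"


def fetch_rows(url: str = SHEET_CSV_URL) -> list[dict]:
    """Download the sheet as CSV and parse it into a list of row dicts."""
    with urllib.request.urlopen(url) as resp:
        return list(csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8")))


def render_html(rows: list[dict]) -> str:
    """Render the rows as a bare HTML table (no styling)."""
    head = "".join(f"<th>{k}</th>" for k in rows[0])
    body = "".join(
        "<tr>" + "".join(f"<td>{v}</td>" for v in r.values()) + "</tr>"
        for r in rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"
```

Writing `render_html(fetch_rows())` to `deepscholar_bench_leaderboard.html` and committing the result is all GitHub Pages needs to serve the update.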
This repository is configured to host the leaderboard on GitHub Pages. The main HTML file (deepscholar_bench_leaderboard.html) will be automatically served at the repository's GitHub Pages URL.
The leaderboard currently shows the top research AI systems ranked by their average performance across all metrics. Systems are categorized as either "Open" (open-source) or "Closed" (proprietary) and include information about the underlying language models used.
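The ranking described above reduces to averaging each system's metric scores and sorting. A minimal sketch, assuming all metrics are already normalized to a common scale (the leaderboard's exact aggregation may differ):

```python
def rank_systems(results: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Rank systems by the mean of their metric scores, best first.

    `results` maps system name -> {metric name: score}; scores are assumed
    to share a common scale (e.g. [0, 1]).
    """
    averages = {name: sum(m.values()) / len(m) for name, m in results.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
```

An unweighted mean treats every metric as equally important, which keeps the ranking easy to interpret.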
To submit your solution to the DeepScholar-Bench leaderboard:
- Use our Google Form to submit your results
- Provide details about your system architecture and methodology
- Submit your results for evaluation
For questions and inquiries, please contact: lianapat@stanford.edu
- GitHub Repository: https://github.com/guestrin-lab/deepscholar-bench
- Research Paper: [Link to be added]
This project is open source and available under the MIT License.
Last updated: 2025-01-26