Evaluation infrastructure for MCP Atlas - a benchmark for evaluating AI models' tool-use capabilities across 36 MCP servers with 500 tasks.
cd lhaw/experiments/mcpatlas
# Checkout our branch with user simulator changes for MCP-Atlas support
git clone git@github.com:scaleapi/mcp-atlas.git mcp-atlas
cd mcp-atlas
git checkout lhaw/ask-user-tool
git submodule update --init --recursive
Copy the environment template and add your API keys:
# in lhaw/experiments/mcpatlas/mcp-atlas
cp env.template .env
# Edit .env with your API keys
Important: Values in .env must NOT be quoted. Use:
AIRTABLE_API_KEY=pat2ue5q...
Not:
AIRTABLE_API_KEY="pat2ue5q..."  # Wrong! Quotes break JSON config
Required API keys depend on which tasks you want to run:
- No setup needed (20 servers): calculator, wikipedia, filesystem, git, fetch, arxiv, etc.
- API keys only (11 servers): GitHub, Google Maps, Brave Search, etc.
- API keys + data imports (5 servers): Airtable, Notion, MongoDB, Slack, Google Calendar
# Inside experiments/mcpatlas/mcp-atlas
# One-time: build the Docker image
make build
# Terminal 1: Start MCP servers
make run-docker
# Terminal 2: Start completion service
make run-mcp-completion
Two-phase service workflow: The ask_user tool is registered at service startup
based on the USER_TOOL_ENABLED environment variable. This means:
- Baselines + underspec without ask_user — start the service normally (above)
- Underspec with ask_user + persona ablations — stop the service (Ctrl-C), then
restart with:
USER_TOOL_ENABLED=True make run-mcp-completion
Setting USER_TOOL_ENABLED=True only on the orchestrator command is not sufficient
— the service must be restarted with it for the tool to be available.
# Terminal 3: Run evaluation
python task_completion_mcpatlas.py \
--format_name full_baseline \
--backend_model opus_4_5 \
--num_trials 3 \
--max_k 3 \
--concurrency 10
The dataset (MCP-Atlas.csv) is automatically downloaded from HuggingFace on first run.
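If you prefer to fetch the CSV ahead of time (e.g. on a machine without outbound access during runs), huggingface_hub can download it directly; the repo id below is a placeholder, check task_completion_mcpatlas.py for the actual dataset id:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id: substitute the dataset id referenced in task_completion_mcpatlas.py.
path = hf_hub_download(
    repo_id="<org>/MCP-Atlas",
    filename="MCP-Atlas.csv",
    repo_type="dataset",
)
print(path)  # copy or symlink this file to experiments/mcpatlas/MCP-Atlas.csv
```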
| Argument | Description |
|---|---|
| --format_name | Experiment name (required) |
| --backend_model | LLM model identifier (required) |
| --input_csv | Input CSV (default: experiments/mcpatlas/MCP-Atlas.csv) |
| --task_ids | Comma-separated task IDs to run |
| --limit | Limit number of tasks |
| --start_index | Start index (0-based) |
| --end_index | End index (exclusive) |
| --num_trials | Number of runs for pass@k (default: 3) |
| --max_k | Max k for pass@k metrics |
| --concurrency | Parallel completions (default: 10) |
| --skip_scoring | Skip LLM-as-judge scoring step |
| --evaluator_model | LLM model for evaluation scoring |
Enable Redis-based caching for MCP tool calls to reduce API costs and improve latency:
# Optional, Scale-internal only: install sgpml-cache manually after normal setup (refresh code artifact + export UV_INDEX)
cd services/mcp_eval && uv pip install sgpml-cache && cd ../..
# Run with caching enabled
TOOL_CACHE_ENABLED=True python task_completion_mcpatlas.py \
--format_name baseline_v1 \
--backend_model sonnet_4_5
sgpml-cache is intentionally not listed in the public mcp-atlas pyproject.toml, because that would break normal public uv workflows when Scale's private package index is unavailable.
| Variable | Description | Default |
|---|---|---|
| TOOL_CACHE_ENABLED | Enable tool response caching | False |
| TOOL_CACHE_REDIS_URL | Redis connection URL | (must be set) |
| TOOL_CACHE_NAMESPACE | Redis key namespace for isolation | mcp_eval |
| TOOL_CACHE_TTL_DAYS | Cache TTL in days | 365 |
Cache keys are deterministic (args serialized with sort_keys=True). Error responses are NOT cached. ask_user calls are never cached.
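A minimal sketch of how such a deterministic key can be derived (assumption: the key hashes the namespace, server, tool name, and the sort_keys=True serialization of the arguments; the repo's exact key format may differ):

```python
import hashlib
import json

def tool_cache_key(namespace: str, server: str, tool: str, args: dict) -> str:
    """Deterministic key: identical args serialize identically thanks to sort_keys=True."""
    payload = json.dumps({"server": server, "tool": tool, "args": args}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"

# Same args, different dict ordering -> same key.
k1 = tool_cache_key("mcp_eval", "github", "search_issues", {"repo": "a/b", "query": "bug"})
k2 = tool_cache_key("mcp_eval", "github", "search_issues", {"query": "bug", "repo": "a/b"})
assert k1 == k2
```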
experiments/mcpatlas/runs/
└── run_{format}_{model}_{timestamp}/
├── completions_*_{exp}_{ts}.csv # Raw completions per experiment
├── scores_*_{exp}_{ts}/scored_*.csv # Scores per experiment
├── results_{exp}.html # HTML visualization per experiment
├── evaluation_{format}_{model}.json # Aggregated evaluation results
├── pass_k_per_task.csv # pass@k / pass^k per task + AGGREGATE
└── final_results/
└── <task_id>/result.json # Per-task results in our format
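For reference, pass@k and pass^k are presumably the usual estimators over n trials with c passes: the unbiased pass@k estimator, and for pass^k the probability that all k trials pass. A sketch under that assumption (the script's exact formulas may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of P(at least one of k sampled trials passes)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """Plug-in estimate of P(all k trials pass)."""
    return (c / n) ** k

# Example: 3 trials, 2 passed, k = 3.
print(pass_at_k(3, 2, 3))   # 1.0
print(pass_pow_k(3, 2, 3))  # ~0.30
```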
This orchestrator wraps the official MCP Atlas scripts:
- Docker Setup: Starts the agent-environment container (port 1984) for MCP servers and the mcp_eval service (port 3000) for completions
- Completions: Calls mcp_completion_script.py, which runs agentic loops with tool calls
- Scoring: Calls mcp_evals_scores.py, which uses LLM-as-judge to evaluate completions against ground truth claims
- Conversion: Converts MCP Atlas CSV output to our pass@k-compatible format
No setup needed: calculator, wikipedia, filesystem, git, weather, fetch, whois, ddg-search, arxiv, pubmed, open-library, mcp-code-executor, and more.
API keys only: exa, brave-search, oxylabs, github, google-maps, lara-translate, stripe, shopify, sendgrid, and more.
API keys + data imports: These servers require accounts on external services, data imports, and API keys. The data exports are included in the cloned repo at mcp-atlas/data_exports/.
| Server | Setup Steps |
|---|---|
| Airtable | Copy base from shared link in airtable_database_online_link.txt, set AIRTABLE_API_KEY |
| Google Calendar | Unzip calendar_mcp_eval_export.zip, import .ics file, set GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, GOOGLE_REFRESH_TOKEN |
| Notion | Import mcp-atlas-notion-data.zip via Settings > Import, set NOTION_TOKEN |
| MongoDB | Unzip mongo_dump_video_game_store-UNZIP-FIRST.zip, restore with mongorestore, set MONGODB_CONNECTION_STRING |
| Slack | Unzip slack_mcp_eval_export.zip, import to workspace, set SLACK_MCP_XOXC_TOKEN, SLACK_MCP_XOXD_TOKEN |
See MCP Atlas data_exports README for detailed instructions.
Note: Most tasks (~400/500) don't require these data imports. Use the default server filtering to run only tasks with available servers.
Generate underspecified prompts from MCP Atlas tasks for strategic clarification testing.
cd lhaw
source .venv/bin/activate && source .env
# Generate for tasks that passed baselines (JSON list of task IDs)
python scripts/generate_mcpatlas_underspec.py \
passed_task_ids.json \
--output-dir experiments/mcpatlas/underspec_output \
--severity DELETE --max-level 2
| Argument | Description |
|---|---|
| --severity | DELETE (default), VAGUIFY, or GENERICIZE |
| --top-k | Limit segments extracted (default: all) |
| --variant-top-k | Variants per task in CSV (default: 1) |
| --max-level | 1=single segment, 2=pairs |
underspec_output/
├── underspec_prompts.csv # PROMPT = underspecified, + original_prompt, removed_value
├── original_prompts.csv # PROMPT = original (for reference)
├── summary.json # Run metadata
└── json/ # Full segment/variant details per task
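To spot-check what was removed from each prompt, the CSV can be loaded directly; a quick sketch using the columns named in the tree above (other columns may also exist):

```python
import pandas as pd

df = pd.read_csv("experiments/mcpatlas/underspec_output/underspec_prompts.csv")
for _, row in df.head(3).iterrows():
    print("removed   :", row["removed_value"])
    print("underspec :", str(row["PROMPT"])[:120])
    print()
```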
The underspec generation can also be driven programmatically:
from synthetic.pipeline import SyntheticPipeline, Severity
pipeline = SyntheticPipeline()
result = pipeline.process_mcpatlas_task("689f4d693e212e8ef3390720", severity=Severity.DELETE)
print(f"Segments: {len(result.segments)}")
for v in result.variants:
    print(v.underspecified_prompt)
End-to-end pipeline for running underspecification experiments with the ask_user tool.
Run baseline experiments on the full dataset:
python task_completion_mcpatlas.py \
--format_name full_baseline \
--backend_model anthropic/claude-opus-4-5-20251101 \
--num_trials 3 \
--max_k 3 \
--concurrency 10
Run the underspecification pipeline to create modified prompts:
python scripts/prepare_underspec_dataset.py configs/intersection_delete_2segments.yaml
Re-run on Gemini to filter which underspec tasks are actually ambiguous:
python task_completion_mcpatlas.py \
--format_name full_underspec_2_seg \
--backend_model gemini_3_pro \
--num_trials 3 \
--max_k 3 \
--concurrency 20 \
--input_csv <underspec_output_dir>/underspec_prompts.csv
Categorize results into benign underspec, new task, or outcome-critical underspec:
uv run python scripts/process_underspec_run.py \
--dataset-dir experiments/mcpatlas/underspec_datasets/<dataset_dir> \
--run-dir <run_dir>
Create a sub-sampled CSV for focused evaluation:
python scripts/sample_from_filtered.py \
--input <filtered_underspec_with_baseline.csv> \
--output <sampled_filtered_underspec_with_baseline.csv>
Re-run pass@3 for benchmarked models with and without the ask_user tool:
# Without ask_user (run with normal completion service)
python task_completion_mcpatlas.py \
--format_name sampled_underspecified \
--backend_model anthropic/claude-opus-4-5-20251101 \
--num_trials 3 \
--max_k 3 \
--concurrency 3 \
--input_csv <sampled_filtered_underspec_with_baseline.csv>
Stop & restart the completion service before running with ask_user: In Terminal 2, press Ctrl-C, then restart with
USER_TOOL_ENABLED=True make run-mcp-completion
# With ask_user
USER_TOOL_ENABLED=True USE_SYSTEM_PROMPT_IN_COMPLETION=true \
python task_completion_mcpatlas.py \
--format_name sampled_underspecified_with_user \
--backend_model anthropic/claude-opus-4-5-20251101 \
--num_trials 3 \
--max_k 3 \
--concurrency 3 \
--input_csv <sampled_filtered_underspec_with_baseline.csv>
Generate pass@3 comparison plots for base benchmarked models:
python scripts/plot_pass3_comparison.py \
--dataset-dir experiments/mcpatlas/underspec_datasets/<dataset_dir> \
--config scripts/runs_config.yaml
Generate comparison plots for additional models:
python scripts/plot_pass3_from_runs.py \
--baseline-run <baseline_run_dir> \
--no-user-run <underspec_run_dir> \
--with-user-run <underspec_with_user_run_dir> \
--output-dir <output_dir> \
--model-name "Model Name" \
--mapping-csv <sampled_filtered_underspec_with_baseline.csv>
Evaluate the quality of ask_user questions using LLM judging:
python scripts/analyze_ask_user.py \
--run-dir <underspec_with_user_run_dir> \
--output ask_user_results.csv \
--judge-model gpt-4o
Generate analysis plots for ask_user question quality:
python scripts/plot_ask_user.py \
--input <ask_user_results.csv> \
--output-dir <output_dir> \
--filter-passed false  # or true for passed-only
Instructions for reproducing results and ablation studies from the LHAW paper.
The main MCP-Atlas results are produced by the full workflow (Steps 1-11 above) or run_mcpatlas_example.sh. The table below maps each paper table to the script that generates it:
| Paper Table | Content | Script / Step |
|---|---|---|
| Table 2 | Pass@3 by number of segments removed | underspec_pass3_by_segments.py (Step 11) |
| Table 3 | Overall task success, user behavior stats, Gain/Q | compute_mcpatlas_summary.py (Step 12) |
| Table 6 | Pass@3 by information dimension | plot_pass3_from_runs.py --mapping-csv (uses dimension column from underspec CSV) |
| Table 7 | Avg checkpoint progress by dimension | Same as Table 6 (checkpoint data from scored CSVs) |
| Table 8 | Pass@3 by ambiguity class (MCP-Atlas) | plot_pass3_from_runs.py --mapping-csv → pass3_by_ambiguity.png (Step 9) |
Producing Table 2 (segments removed breakdown):
python experiments/mcpatlas/scripts/underspec_pass3_by_segments.py \
<underspec_run_dir_1> <underspec_run_dir_2> ... \
--base-path <underspec_output_dir> \
--output pass3_by_segments.csv
Producing Table 8 (ambiguity class breakdown):
python experiments/mcpatlas/scripts/plot_pass3_from_runs.py \
--baseline-run <baseline_run> \
--no-user-run <underspec_run> \
--with-user-run <underspec_ask_run> \
--output-dir reports/ \
--model-name "opus_4_5" \
--mapping-csv <underspec_prompts.csv>
This produces both pass3_comparison.png (Table 3 bars) and pass3_by_ambiguity.png (Table 8 bars).
Producing Tables 6-7 (dimension breakdown):
The dimension information is in the underspec CSV (information_dimension column). Use underspec_stats.py to inspect the distribution:
python experiments/mcpatlas/scripts/underspec_stats.py \
--input <underspec_output_dir>/filtered_underspec.csv
Tests how user persona (perceived cost of asking) affects clarification behavior. Three personas append different system prompt suffixes:
| Persona | USER_TYPE value | Behavior |
|---|---|---|
| Supervisor | supervisor | User is sitting with you, waiting to help. Do not guess. |
| Standard Assistant | standard_assistant | User is available but working on other tasks. |
| Busy Executive | busy_executive | User is very busy. Only interrupt for factual failures. |
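The suffix strings below are paraphrases of the behaviors in the table, not the repo's exact prompt text; the sketch only illustrates how USER_TYPE might select a system-prompt suffix inside the completion service.

```python
import os

# Illustrative only: paraphrased suffixes; the actual text lives in the mcp-atlas completion service.
PERSONA_SUFFIXES = {
    "supervisor": "The user is sitting with you and waiting to help. Do not guess.",
    "standard_assistant": "The user is available but is working on other tasks.",
    "busy_executive": "The user is very busy. Only interrupt for factual failures.",
}

def build_system_prompt(base_prompt: str) -> str:
    # USER_TYPE picks the persona; the default here is an assumption for this sketch.
    persona = os.environ.get("USER_TYPE", "standard_assistant")
    return f"{base_prompt}\n\n{PERSONA_SUFFIXES[persona]}"
```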
Run the same underspec tasks with each persona:
for persona in supervisor standard_assistant busy_executive; do
USER_TOOL_ENABLED=True \
USE_SYSTEM_PROMPT_IN_COMPLETION=true \
USER_TYPE=$persona \
python task_completion_mcpatlas.py \
--format_name "underspec_${persona}" \
--backend_model sonnet_4_5 \
--num_trials 3 --max_k 3 \
--input_csv <sampled_underspec.csv> \
--concurrency 10
done
Prerequisites:
- The completion service must be running with USER_TOOL_ENABLED=True make run-mcp-completion (stop and restart it if it was started without this)
Required env vars on the orchestrator command (all three):
- USER_TOOL_ENABLED=True: enables the ask_user tool
- USE_SYSTEM_PROMPT_IN_COMPLETION=true: injects the system prompt (which includes the persona text)
- USER_TYPE: selects the persona variant
Then compare Gain%/Q across personas using the same analysis pipeline (Steps 9-10).
Uses an LLM judge to classify ask_user questions into 7 failure categories with 3 sub-labels each (see paper Appendix I.1 for the full taxonomy).
# Run judging on a with-user experiment
python scripts/analyze_ask_user.py \
--run-dir <underspec_with_user_run_dir> \
--output ask_user_analysis.csv \
--judge-model gpt-4o
# Visualize failure mode breakdown (Figure 5)
python scripts/plot_ask_user.py \
--input ask_user_analysis.csv \
--output-dir plots/
# Filter to only failed trials (Table 17)
python scripts/plot_ask_user.py \
--input ask_user_analysis.csv \
--output-dir plots/ \
--filter-passed false
Failure categories: Question Quality, Question Targeting, Information Integration, Over-Clarification, Under-Clarification, Timing & Strategy, Response Misinterpretation.
Additional scripts in experiments/mcpatlas/scripts/ for ad-hoc exploration and reporting:
| Script | Purpose |
|---|---|
| underspec_viewer.py | Interactive HTML viewer for exploring underspec variants (prompts, segments, trial traces) |
| underspec_stats.py | Print statistics (dimension, severity, segments) grouped by ambiguity class |
| sample_from_filtered.py | Sample from filtered variants with diversity constraints (class balance, task spread) |
| generate_categorization_report.py | HTML report showing tasks grouped by category with judge reasoning |
| aggregate_passed_tasks.py | Aggregate passed task IDs across model runs (union/intersection) |
| categorize_underspec_tasks.py | LLM-based categorization (overspecified/new_task/underspecified); alternative to process_underspec_run.py |
| compute_mcpatlas_summary.py | Consolidated Table 3 summary: baseline/underspec/+user pass@3, Ask%, Gain/Q |
| plot_pass3_comparison.py | Config-file-driven pass@3 charts; alternative to plot_pass3_from_runs.py |
| prepare_underspec_dataset.py | Config-driven dataset preparation pipeline (uses configs/*.yaml) |
| run_ask_user_analysis.sh | Batch wrapper for analyze_ask_user.py + plot_ask_user.py across multiple runs |
Check Docker is running and has sufficient memory (8GB+ recommended):
docker info
Check container logs:
docker logs mcpatlas-agent-env
MCP-Atlas uses two local ports:
- 1984 for the Dockerized agent-environment MCP servers
- 3000 by default for the local mcp_eval completion service (PORT in mcp-atlas/.env may override this, e.g. 3001)
If make run-docker fails with Bind for 0.0.0.0:1984 failed: port is already allocated,
find and stop the stale Docker container:
docker ps --format '{{.ID}}\t{{.Names}}\t{{.Ports}}' | rg '1984'
docker stop <container_name_or_id>
# If needed:
docker rm -f <container_name_or_id>
If make run-mcp-completion fails because the completion-service port is already in use,
first check PORT= in mcp-atlas/.env, then stop the old completion service in its
original terminal with Ctrl-C. If you no longer have that terminal, find the listening
process and kill it:
# Replace 3000 with the value from PORT= in mcp-atlas/.env
sudo ss -ltnp '( sport = :3000 )'
kill <pid>
# If needed:
kill -9 <pid>
To check both ports at once:
docker ps --format '{{.ID}}\t{{.Names}}\t{{.Ports}}' | rg '1984|3000'
ss -ltnp '( sport = :1984 or sport = :3000 or sport = :3001 )'
Verify your .env file has the required API keys for the servers your tasks use.