A comprehensive Minecraft-based benchmark evaluation agent for evaluating AI agents' planning, reasoning, and multi-step execution capabilities. Green Agent evaluates Purple Agent in complex, open-world minecraft environments using the A2A protocol.
This Green Agent provides a rigorous evaluation for various tasks in minecraft environment that goes beyond simple task completion metrics. It evaluates:
- Spatial Reasoning: Navigation, building, and exploration in 3D space
- Tool Use: Strategic use of items and environmental interactions
- Resource Management: Efficient use of materials and time under constraints
- Adaptability: Handling diverse task types across various distinct categories
- Multi-step Planning: Complex tasks requiring sequential action planning (e.g., crafting requires gathering materials β using crafting table β combining items)
Diverse Tasks across 12 categories
Reward-based Evaluation with simulation tracking and video assessment
Reproducible & Consistent results with deterministic task initialization
A2A Protocol Compatible - works with any compliant agent
Automated Scoring using MCU benchmark reward system
Configurable - select categories and customize step limits
- Python 3.10+
- Docker (for containerized deployment)
- OpenAI API key (for video evaluation)
First, clone repository and install dependencies:
# Clone repository
git clone https://github.com/KWSMooBang/MCU-AgentBeats.git
cd MCU-AgentBeats
# Install dependencies with uv
uv syncThen, Create a .env file in the project root:
# Required for video evaluation
OPENAI_API_KEY=your-openai-api-key-hereuv python src/server.py --host 0.0.0.0 --port 9009
# With custom agent card URL
python src/server.py --host 0.0.0.0 --port 9009 --card-url "http://localhost:9009"docker build -t mcu-green-agent .
docker run \
-p 9009:9009 \
--env-file .env \
mcu-green-agent \
--host 0.0.0.0 \
--port 9009Server Health Check:
curl http://localhost:9009/.well-known/agent-card.jsonThe repository includes a sample Purple Agent using STEVE-1 model for testing:
# In a separate terminal, start the test Purple Agent
uv run python test_purple_agent.py --host 0.0.0.0 --port 9019
# When you run green agent server with Docker
uv run python test_purple_agent.py --host 0.0.0.0 --port 9019 --card-url "http://host.docker.internal:9019"Edit test_scenario.toml to configure your evaluation:
[green_agent]
endpoint = "http://127.0.0.1:9009"
[[participants]]
role = "agent"
endpoint = http://127.0.0.1:9019"
or
endpoint = "http://host.docker.internal:9019" (with Docker)
[config]
task_category = "crafting" # Change to desired category
max_steps = 900 # Optional: customize step limitAvailable categories: building, combat, crafting, decoration, ender_dragon, explore, find, mine_diamond_from_scratch, mining_and_collecting, motion, tool_use, trapping
# Run evaluation with scenario file
uv run python test_evaluation.py test_scenario.toml
# Save results to JSON file
uv run python test_evaluation.py test_scenario.toml output/results.jsonExpected Output:
[Status: submitted]
Starting MCU evaluation with 8 tasks from crafting
[Status: working]
Running task: craft_furnace
[Status: working]
Running task: craft_ladder
[Status: completed]
MCU Evaluation Result
Category: crafting
Number of Tasks: 8
Total Score: 65.4
Task Results:
Task 'craft_furnace': 8.7
Task 'craft_ladder': 7.9
Task 'craft_enchanting_table': 8.2
...
- participants.agent (str): URL of the Purple Agent to evaluate
- Example:
"http://green-agent:9009"or"http://purple-agent:9019" - Must be A2A protocol compatible
- Example:
-
task_category (str): Task category to evaluate
- Examples:
"combat","crafting","building" - Available categories:
building,combat,crafting,decoration,ender_dragon,explore,find,mine_diamond_from_scratch,mining_and_collecting,motion,tool_use,trapping
- Examples:
-
max_steps (int): Maximum steps per task
- Default for standard tasks: 1200 steps
- Default for long-term tasks: 12000 steps
- Long-term tasks categories:
kill_ender_dragon,mine_diamond_from_scratch - These tasks require extensive resource gathering, crafting chains, and exploration
- Long-term tasks categories:
- Custom values: Can be set in config, but will be capped at 1200 for non-long-term tasks
- Recommended ranges:
- Quick tasks (motion, explore): 500-900 steps
- Standard tasks (crafting, combat): 900-1500 steps
- Long-term tasks: 10000-12000 steps
MCU-AgentBeats/
βββ src/
β βββ server.py # A2A server entry point
β βββ agent.py # Main evaluation logic
β βββ messenger.py # Purple agent communication
β βββ executor.py # Task execution management
β βββ model.py # Pydantic data models
β βββ util.py # Helper functions (task loading, video processing)
β
βββ MCU_benchmark/
β βββ task_configs/
β β βββ tasks/ # Task definitions by category
β β βββ building/
β β βββ combat/
β β βββ crafting/
β β βββ decoration/
β β βββ ender_dragon/
β β βββ explore/
β β βββ find/
β β βββ mine_diamond_from_scratch/
β β βββ mining_and_collecting/
β β βββ motion/
β β βββ tool_use/
β β βββ trapping/
β β
β βββ auto_eval/
β βββ prompt/ # Video evaluation prompts
β βββ criteria_files/ # Task-specific evaluation criteria
β
βββ output/ # Evaluation results and recordings
βββ logs/ # Server logs
βββ pyproject.toml # Project dependencies
Minecraft provides a rich, open-world environment that closely mirrors real-world agent challenges:
- Complex State Space: Thousands of block types, items, and environmental states
- Emergent Complexity: Simple actions combine into complex behaviors
- Grounded Multimodal Input: Visual observation (RGB images) with spatial reasoning
- Diverse Skill Requirements: Navigation, crafting, combat, exploration, construction
- Long-horizon Tasks: Require over 12000 steps with intermediate goals
Our benchmark includes 12 distinct categories, each testing different capabilities:
| Category | Skills Tested | Example Tasks |
|---|---|---|
| building | Spatial reasoning, multi-step execution | Build house, tower, maze |
| combat | Real-time decision making, positioning | Combat zombies, skeletons, enderman |
| crafting | Sequential planning, recipe knowledge | Craft enchanting table, furnace, ladder |
| decoration | Creative placement, aesthetic judgment | Decorate wall/ground, lay carpet |
| explore | Pathfinding, environment interaction | Explore chest, boat, climb |
| find | Visual search, navigation | Find diamond, village, bedrock |
| mining_and_collecting | Efficient resource gathering | Collect wood/wool/dirt, mine ores |
| motion | Basic movement and interaction | Drop item, look at sky, stacking |
| tool_use | Tool understanding, context-aware actions | Use bow, shield, brew potion |
| trapping | Strategic planning, entity manipulation | Trap mobs |
| Category | Skills Tested | Example Tasks |
|---|---|---|
| ender_dragon | Advanced combat, long-term planning, resource pipeline | Kill ender dragon |
| mine_diamond_from_scratch | Complete resource gathering β crafting β mining pipeline | Mine diamond from scratch |
These tasks require extensive multi-stage planning: gathering initial resources β crafting tools β exploring β achieving final goal.
The evaluation system uses different scoring approaches based on task configuration:
For tasks with reward configurations (most standard tasks):
-
Primary Score: Simulation reward tracking during execution
- MineStudio simulator monitors task events (e.g., craft_item, mine_block)
- Rewards are accumulated based on achieved milestones
sim_score = total_rewards_achieved
-
Video Enhancement:
- Video evaluation is performed only when
sim_score < max_score - If sim_score already equals max_score (perfect completion), video evaluation is skipped
- GPT-4 Vision analyzes the gameplay video and scores "Task Progress" (0-10)
final_score = (sim_score + task_progress_score) / 2- This helps capture partial progress not reflected in discrete reward events
- Video evaluation is performed only when
Example:
# craft_furnace.yaml
reward_cfg:
- event: craft_item
objects:
- furnace
reward: 10.0
max_reward_times: 1For complex, multi-stage, long horizon tasks (kill_ender_dragon, mine_diamond_from_scratch):
- Milestone Tracking: Task broken into sequential milestones
- Continuous Scoring:
continuous_score = completed_milestones + current_progresscompleted_milestones: Number of fully achieved milestonescurrent_progress: Progress on current incomplete milestone (0-1), evaluated by GPT-4 Vision
- Final Score:
(continuous_score / total_milestones) * 100
This provides fine-grained progress measurement for tasks requiring 12,000+ steps.
For tasks without reward_cfg or milestone_reward_cfg:
- Video-Only Evaluation: GPT-4 Vision analyzes recorded gameplay
- Score: Based solely on "Task Progress" dimension (0-10)
- Use case: Tasks where success is subjective or difficult to define programmatically
When video evaluation is used, GPT-4 Vision assesses gameplay across 6 dimensions:
| Criterion | Description | Score Range |
|---|---|---|
| Task Progress | Goal achievement level | 0-10 |
| Action Control | Precision and appropriateness of actions | 0-10 |
| Error Recognition | Detection and correction of mistakes | 0-10 |
| Creative Attempts | Novel problem-solving approaches | 0-10 |
| Task Efficiency | Speed and resource optimization | 0-10 |
| Material Selection | Correctness of tool/item usage | 0-10 |
Note: Video evaluation requires OpenAI API key and is used strategically:
- For long-term tasks: evaluating progress on incomplete milestones
- For standard tasks: enhancing simulation scores when partial completion is detected
- For non-reward tasks: primary scoring method
| Task Type | Primary Metric | Video Evaluation Role | Max Score |
|---|---|---|---|
Standard with reward_cfg |
Simulation rewards | Enhancement when score < max | 10 |
Long-term with milestone_reward_cfg |
Milestone completion + progress | Current milestone progress | 100 |
| No reward config | Video "Task Progress" | Primary scoring method | 10 |
Results are saved in output/{timestamp}/result.json:
{
"participants": {
"agent": "019bc2e2-b44b-71f3-9fa5-e89901920e31"
},
"results": [
{
"task_category": "crafting",
"num_tasks": 10,
"total_max_score": 100.0,
"total_score": 21.5,
"avg_action_control": 3.4,
"avg_error_recognition_and_correction": 0.9,
"avg_creative_attempts": 0.2,
"avg_task_completion_efficiency": 1.9,
"avg_material_selection_and_usage": 4.3,
"task_metrics": {
"craft_oak_planks": {
"max_score": 10.0,
"sim_score": 0.0,
"score": 4.0,
"action_control": 9.0,
"error_recognition_and_correction": 8.0,
"creative_attempts": 2.0,
"task_completion_efficiency": 8.0,
"material_selection_and_usage": 9.0
},
"craft_the_crafting_table": {
"max_score": 10.0,
"sim_score": 0.0,
"score": 1.5,
"action_control": 8.0,
"error_recognition_and_correction": 0.0,
"creative_attempts": 0.0,
"task_completion_efficiency": 0.0,
"material_selection_and_usage": 4.0
},
"craft_diorite": {
"max_score": 10.0,
"sim_score": 0.0,
"score": 2.0,
"action_control": 2.0,
"error_recognition_and_correction": 0.0,
"creative_attempts": 0.0,
"task_completion_efficiency": 1.0,
"material_selection_and_usage": 3.0
},
...
}
}
]
}Your Purple Agent must implement the following A2A message protocol. For detailed action space documentation, refer to MineStudio Action Space.
Request (InitPayload):
{
"type": "init",
"prompt": "You are an AI agent that can play Minecraft...",
"text": "craft furnace from cobblestone"
}- prompt: Basic instruction or context for the agent's role and behavior. Example: "You are an AI agent that can play Minecraft...". This defines the agent's overall style, goals, and constraints, action space, and serves as a decision-making guideline during action inference.
- text: The specific task or command to be performed. Example: "craft furnace from cobblestone". This clearly specifies the goal to be achieved and is a key input for determining the agent's purpose during action inference.
Response (AckPayload):
{
"type": "ack",
"success": true,
"message": "Agent initialized and ready"
}Request (ObservationPayload):
{
"type": "obs",
"step": 42,
"obs": "<base64_encoded_128x128_RGB_image>"
}- obs: The current observation of the environment, typically a base64-encoded RGB image of Minecraft game screen. This allows the agent to perceive the environment and infer the most appropriate action for the current situation.
Response (ActionPayload):
The agent supports three action formats:
{
"type": "action",
"action_type": "agent",
"buttons": [123],
"camera": [60]
}buttons: Single integer (0-8191) encoding all button states as a bitmaskcamera: Single integer (0-120) for discretized camera movement
{
"type": "action",
"action_type": "agent",
"buttons": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
"camera": [0.0, 90.0]
}buttons: 20-element binary array for individual button statescamera: [yaw, pitch] in degrees
{
"type": "action",
"action_type": "env",
"action": {
"forward": 1,
"back": 0,
"left": 0,
"right": 0,
"jump": 0,
"sneak": 0,
"sprint": 0,
"attack": 0,
"use": 0,
"drop": 0,
"inventory": 0,
"hotbar.1": 0,
"hotbar.2": 0,
"hotbar.3": 0,
"hotbar.4": 0,
"hotbar.5": 0,
"hotbar.6": 0,
"hotbar.7": 0,
"hotbar.8": 0,
"hotbar.9": 0,
"camera": [0.0, 0.0]
}
}For complete action space specification, see the MineStudio documentation.
Button Actions:
- Movement:
forward,back,left,right,jump,sneak,sprint - Interaction:
attack,use,drop,inventory - Hotbar:
hotbar.1throughhotbar.9
Camera Control:
- Format:
[yaw, pitch]or single discretized value - Yaw range: -180Β° to 180Β° (horizontal rotation)
- Pitch range: -90Β° to 90Β° (vertical rotation)
Note: All three formats are automatically parsed and converted by the Green Agent. Choose the format that best suits your agent's architecture.
- Create YAML file in appropriate category folder
- Define init commands, task text, and reward config
- Test with
run_task.py
Main files to modify:
src/agent.py: Core evaluation loopsrc/util.py: Task loading and video processingMCU_benchmark/auto_eval/: Video evaluation prompts
MIT License
For questions or issues, please create an issue on GitHub or contact the maintainers.