diff --git a/SKILL.md b/SKILL.md new file mode 100644 index 0000000000..007be33e52 --- /dev/null +++ b/SKILL.md @@ -0,0 +1,484 @@ +--- +name: openevolve +description: > + Use OpenEvolve to evolve and optimize code using LLMs. Triggers: user wants to + automatically improve an algorithm, optimize a function, discover better solutions + through evolutionary search, or run an AlphaEvolve-style coding agent. Covers the + full workflow: scaffolding experiments, writing evaluators, configuring evolution, + running jobs, inspecting checkpoints, and retrieving the best evolved programs. +--- + +# OpenEvolve Agent Skill + +OpenEvolve is an evolutionary coding agent that uses LLMs to iteratively improve code. +Given an initial program and an evaluation function, it maintains diverse populations +of program variants (MAP-Elites) and evolves them over generations to discover better +algorithms, optimizations, or solutions. + +## When to use this skill + +- User asks to "evolve", "optimize", or "improve" code automatically +- User wants to discover novel algorithms for a problem +- User mentions AlphaEvolve, evolutionary coding, or MAP-Elites +- User wants to benchmark multiple approaches to a problem and find the best one +- User asks to optimize GPU kernels, sorting algorithms, mathematical functions, etc. 
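The MAP-Elites scheme mentioned above can be illustrated with a minimal sketch (conceptual only — OpenEvolve's actual database adds islands, migration, and artifact storage on top of this idea): each program is binned into a grid cell by its feature values, and a cell keeps only its best-scoring occupant, so diverse-but-weaker programs survive alongside the global best.

```python
def to_cell(features, bins=10):
    """Map continuous feature values in [0, 1) to a discrete grid cell."""
    return tuple(min(int(f * bins), bins - 1) for f in features)

def map_elites_insert(grid, program, score, features):
    """Keep the program only if its cell is empty or it beats the occupant."""
    cell = to_cell(features)
    if cell not in grid or grid[cell][1] < score:
        grid[cell] = (program, score)
    return grid

grid = {}
map_elites_insert(grid, "prog_a", 0.5, (0.12, 0.80))
map_elites_insert(grid, "prog_b", 0.9, (0.15, 0.81))  # same cell, higher score: replaces prog_a
map_elites_insert(grid, "prog_c", 0.3, (0.95, 0.10))  # new cell: kept despite the lower score
```

This is why evolution rarely stalls on a single local optimum: lower-scoring programs that occupy unexplored cells remain available as parents.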
+ +## Prerequisites + +```bash +pip install openevolve +export OPENAI_API_KEY="your-api-key" # works for OpenAI, Gemini, or any compatible provider +``` + +Verify: `openevolve-run --help` + +--- + +## Core Concepts + +Every OpenEvolve experiment requires **2 files** (with an optional third): + +| File | Required | Purpose | +|------|----------|---------| +| `initial_program.py` | Yes | Starting code with `EVOLVE-BLOCK` markers around the section to evolve | +| `evaluator.py` | Yes | Defines `evaluate(program_path) -> dict` that scores each variant | +| `config.yaml` | No | LLM provider, iterations, population size, system message (sensible defaults are used when omitted) | + +--- + +## Workflow + +### Step 1: Scaffold the experiment + +Create a project directory with the required files. + +```bash +mkdir -p my_experiment +``` + +#### initial_program.py + +Wrap the code to evolve in `EVOLVE-BLOCK` markers. These markers serve as guidance for the LLM, signaling which sections should be modified. Note: the markers are not enforced by the tool — the diff engine can technically change any part of the file — but they strongly steer the LLM's attention. 
+ +```python +import math + +# Helper functions (not evolved) +def distance(a, b): + return math.sqrt((a[0]-b[0])**2 + (a[1]-b[1])**2) + +# EVOLVE-BLOCK-START +def solve(data): + """This function will be evolved by OpenEvolve.""" + # Start with a simple/naive implementation + result = sorted(data) + return result +# EVOLVE-BLOCK-END + +if __name__ == "__main__": + print(solve([3, 1, 4, 1, 5])) +``` + +**Guidelines for EVOLVE-BLOCK markers:** +- Both `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` are required as a pair +- Multiple blocks are supported for evolving different sections +- If omitted, the LLM may modify any part of the file (less control) +- These markers are conventions that guide the LLM prompt — they are not enforced at the tool level +- Keep the code between markers self-contained — imports and helpers go outside + +#### evaluator.py + +Must define `evaluate(program_path)` returning a dict of numeric metrics. +Fitness is taken from the `combined_score` key when present; otherwise it is computed as the average of all numeric metric values excluding any MAP-Elites feature dimensions (non-numeric and boolean values are ignored). + +```python +import importlib.util +import time + +def evaluate(program_path: str) -> dict: + """Score an evolved program. Return dict with numeric metrics.""" + # Load the evolved module + spec = importlib.util.spec_from_file_location("evolved", program_path) + module = importlib.util.module_from_spec(spec) + try: + spec.loader.exec_module(module) + except Exception: + # On load failure, return a numeric-only metrics dict. + # In a real evaluator, log the exception or store it in artifacts, not metrics. 
+ return {"combined_score": 0.0} + + # Define test cases + test_cases = [ + ([3, 1, 2], [1, 2, 3]), + ([5, 2, 8, 1], [1, 2, 5, 8]), + ([1], [1]), + ] + + correct = 0 + start = time.time() + for inp, expected in test_cases: + try: + result = module.solve(inp.copy()) + if result == expected: + correct += 1 + except Exception: + pass + elapsed = time.time() - start + + accuracy = correct / len(test_cases) + return { + "combined_score": accuracy, + "accuracy": accuracy, + "runtime": elapsed, + } +``` + +**Evaluator best practices:** +- Always wrap `exec_module` in try/except — evolved code may crash +- Copy mutable inputs (lists, dicts) before passing to the evolved function +- Return `combined_score: 0.0` on any failure so evolution can continue +- Add multiple metrics (accuracy, runtime, memory) for richer MAP-Elites grids +- For custom MAP-Elites dimensions, return raw continuous values — OpenEvolve handles binning + +**Advanced: using EvaluationResult for artifact feedback** + +```python +from openevolve.evaluation_result import EvaluationResult + +def evaluate(program_path): + # ... run tests ... + return EvaluationResult( + metrics={"combined_score": 0.85, "accuracy": 0.9, "runtime": 0.3}, + artifacts={ + "stderr": captured_stderr, + "test_failures": failure_details, + } + ) +``` + +Artifacts are fed back to the LLM in the next generation's prompt, creating a feedback loop. + +#### config.yaml + +```yaml +max_iterations: 200 +random_seed: 42 + +llm: + api_base: "https://generativelanguage.googleapis.com/v1beta/openai/" # Gemini + primary_model: "gemini-2.5-flash" + temperature: 0.7 + +prompt: + system_message: | + You are an expert programmer. Your task is to improve the sorting algorithm + for better performance and correctness. Focus on algorithmic improvements. 
+ num_top_programs: 3 + num_diverse_programs: 2 + include_artifacts: true + +database: + population_size: 500 + num_islands: 5 + feature_dimensions: ["complexity", "diversity"] + +evaluator: + enable_artifacts: true + cascade_evaluation: true +``` + +**Key config parameters:** + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `max_iterations` | 10000 | Total evolution iterations | +| `random_seed` | 42 | For reproducibility | +| `llm.api_base` | OpenAI URL | LLM API endpoint | +| `llm.primary_model` | — | Model name (required) | +| `llm.temperature` | 0.7 | Higher = more exploration | +| `prompt.system_message` | generic | Domain-specific guidance for the LLM | +| `prompt.num_top_programs` | 3 | Best programs shown as inspiration | +| `prompt.num_diverse_programs` | 2 | Diverse programs for exploration | +| `database.population_size` | 1000 | MAP-Elites grid capacity | +| `database.num_islands` | 5 | Parallel evolving populations | +| `database.migration_interval` | 50 | Generations between island migration | +| `evaluator.cascade_evaluation` | true | Multi-stage filtering of bad programs | +| `evaluator.enable_artifacts` | true | Capture and store evaluation artifacts | +| `prompt.include_artifacts` | true | Include artifacts in LLM prompts (feedback loop) | + +**LLM provider examples:** + +```yaml +# OpenAI +llm: + api_base: "https://api.openai.com/v1" + primary_model: "gpt-4o" + +# Google Gemini +llm: + api_base: "https://generativelanguage.googleapis.com/v1beta/openai/" + primary_model: "gemini-2.5-pro" + +# Local (Ollama) +llm: + api_base: "http://localhost:11434/v1" + primary_model: "codellama:7b" +``` + +### Step 2: Validate the setup + +Before running, verify files are correct: + +```bash +# Check initial program has EVOLVE-BLOCK markers +grep -n "EVOLVE-BLOCK" my_experiment/initial_program.py + +# Check evaluator has evaluate() function +grep -n "def evaluate" my_experiment/evaluator.py + +# Dry-run the evaluator on the 
initial program +python -c " +import sys; sys.path.insert(0, 'my_experiment') +from evaluator import evaluate +print(evaluate('my_experiment/initial_program.py')) +" +``` + +The evaluator should return a dict with numeric scores. If it crashes or returns 0, fix it before running evolution. + +### Step 3: Run evolution + +**CLI (recommended):** + +```bash +openevolve-run my_experiment/initial_program.py \ + my_experiment/evaluator.py \ + --config my_experiment/config.yaml \ + --iterations 200 \ + --output my_experiment/output +``` + +**All CLI flags:** + +``` +openevolve-run [options] + +Required: + initial_program Path to initial program file + evaluation_file Path to evaluator file + +Options: + --config, -c CONFIG Config YAML file + --output, -o DIR Output directory (default: openevolve_output) + --iterations, -i N Max iterations (overrides config) + --target-score, -t SCORE Stop early at this score + --log-level, -l LEVEL DEBUG|INFO|WARNING|ERROR + --checkpoint PATH Resume from checkpoint directory + --api-base URL Override LLM API base URL + --primary-model NAME Override primary LLM model + --secondary-model NAME Override secondary LLM model +``` + +**Python API (for programmatic use):** + +```python +from openevolve import run_evolution, evolve_function + +# File-based +result = run_evolution( + initial_program="my_experiment/initial_program.py", + evaluator="my_experiment/evaluator.py", + config="my_experiment/config.yaml", + iterations=200, +) +print(f"Best score: {result.best_score}") +print(result.best_code) + +# Inline — evolve a function directly +def bubble_sort(arr): + for i in range(len(arr)): + for j in range(len(arr)-1): + if arr[j] > arr[j+1]: + arr[j], arr[j+1] = arr[j+1], arr[j] + return arr + +result = evolve_function( + bubble_sort, + test_cases=[([3,1,2], [1,2,3]), ([5,2,8], [2,5,8])], + iterations=50, +) +``` + +### Step 4: Inspect results + +**Output directory structure:** + +``` +my_experiment/output/ +├── checkpoints/ +│ ├── 
checkpoint_100/ +│ │ ├── metadata.json # database state and iteration info +│ │ ├── best_program.py # best evolved code (extension matches language) +│ │ ├── best_program_info.json # best program metadata (id, metrics, generation) +│ │ ├── programs/ # all population programs as JSON +│ │ │ ├── .json +│ │ │ └── ... +│ │ └── artifacts/ # evaluation artifacts (if any) +│ └── checkpoint_200/ +└── evolution_trace.jsonl # per-iteration log (if enabled) +``` + +**Get the best program from the latest checkpoint:** + +```bash +# Find the latest checkpoint +ls -d my_experiment/output/checkpoints/checkpoint_* | sort -t_ -k2 -n | tail -1 + +# View the best evolved code directly +cat my_experiment/output/checkpoints/checkpoint_200/best_program.py + +# Read best program metadata (id, metrics, generation, etc.) +cat my_experiment/output/checkpoints/checkpoint_200/best_program_info.json | python -m json.tool +``` + +**Check score progression (if evolution_trace is enabled):** + +```bash +# Extract scores from JSONL trace +python -c " +import json +with open('my_experiment/output/evolution_trace.jsonl') as f: + for line in f: + entry = json.loads(line) + if 'metrics' in entry: + print(f\"Iter {entry.get('iteration')}: {entry['metrics'].get('combined_score', 'N/A')}\") +" +``` + +### Step 5: Resume from checkpoint + +```bash +openevolve-run my_experiment/initial_program.py \ + my_experiment/evaluator.py \ + --config my_experiment/config.yaml \ + --checkpoint my_experiment/output/checkpoints/checkpoint_200 \ + --iterations 100 +``` + +This loads the MAP-Elites population from the checkpoint and runs 100 more iterations. + +### Step 6: Visualize (optional) + +```bash +pip install flask +python scripts/visualizer.py --path my_experiment/output/checkpoints/checkpoint_200/ +``` + +Opens a web UI with evolution tree, score progression, code diffs, and MAP-Elites grid. 
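A hypothetical helper (not part of OpenEvolve) that ties the Step 4 pieces together: find the highest-numbered checkpoint, read the stored metrics, and re-run your own evaluator on `best_program.py` to confirm the score reproduces outside the evolution loop. The `my_experiment/` paths follow the layout used throughout this guide.

```python
import json
import pathlib
import sys

def latest_checkpoint(output_dir):
    """Return the checkpoint directory with the highest iteration number.
    Numeric sort avoids the lexicographic trap (checkpoint_90 > checkpoint_200)."""
    checkpoints = sorted(
        pathlib.Path(output_dir).glob("checkpoints/checkpoint_*"),
        key=lambda p: int(p.name.split("_")[-1]),
    )
    return checkpoints[-1]

def rescore_best(output_dir, experiment_dir):
    """Compare metrics recorded at checkpoint time against a fresh evaluation."""
    latest = latest_checkpoint(output_dir)
    info = json.loads((latest / "best_program_info.json").read_text())
    sys.path.insert(0, experiment_dir)
    from evaluator import evaluate  # the same evaluator used during evolution
    fresh = evaluate(str(latest / "best_program.py"))
    return info.get("metrics"), fresh

# Usage:
#   stored, fresh = rescore_best("my_experiment/output", "my_experiment")
```

If the stored and fresh results diverge significantly, suspect nondeterminism in the evaluator (timing-based metrics, random test data) before suspecting the checkpoint.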
+ +--- + +## Writing Effective System Messages + +The system message in `config.yaml` is the single most important lever for evolution quality. + +**Structure:** Role → Task → Domain knowledge → Allowed changes → Constraints + +**Example (GPU kernel optimization):** + +```yaml +prompt: + system_message: | + You are an expert Metal GPU programmer for Apple Silicon. + + GOAL: Optimize the attention kernel for 5-15% speedup. + + OPTIMIZATION OPPORTUNITIES: + 1. Memory coalescing and vectorized SIMD loading + 2. Fuse max-finding with score computation + 3. Pre-compute frequently used indices + + MUST NOT CHANGE: + - Kernel function signature + - Algorithm correctness + - Template parameter types + + ALLOWED: + - Memory access patterns + - Computation order + - Apple Silicon-specific intrinsics +``` + +**Iterative refinement process:** +1. Start with a basic message, run 20-50 iterations +2. Check where evolution gets stuck (read artifacts/errors) +3. Add specific guidance for observed failure modes +4. Repeat — the system message should evolve alongside the code + +--- + +## Common Patterns + +**Multi-language evolution (Rust, R, etc.):** Set `language` and `file_suffix` in config: + +```yaml +language: rust +file_suffix: ".rs" +``` + +The evaluator can compile and run any language — it just receives a file path. + +**Prompt evolution (evolve prompts, not code):** Use the initial program as a prompt template +and the evaluator to test LLM output quality. See `examples/llm_prompt_optimization/`. + +**Multi-objective optimization:** Return multiple metrics from the evaluator and add them +as `feature_dimensions` in the database config. OpenEvolve maintains Pareto-optimal solutions. 
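A common multi-objective pattern is to blend the objectives into `combined_score` for selection while also returning each raw metric so it can double as a feature dimension. A minimal sketch — the 0.8/0.2 weights and the one-second runtime budget are illustrative choices, not OpenEvolve defaults:

```python
def score(accuracy: float, runtime: float, budget: float = 1.0) -> dict:
    """Blend correctness and speed into a single fitness value.
    `speed` is 1.0 for an instant run and decays to 0.0 at the budget."""
    speed = max(0.0, 1.0 - runtime / budget)
    return {
        "combined_score": 0.8 * accuracy + 0.2 * speed,  # fitness used for selection
        "accuracy": accuracy,  # raw value: usable as a feature dimension
        "runtime": runtime,    # raw value: OpenEvolve handles the binning
    }
```

With `feature_dimensions: ["accuracy", "runtime"]` in the database config, a fast-but-sloppy program and a slow-but-exact one land in different cells, so both survive as parents.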
+ +**Cost management:** +- Start with `gemini-2.5-flash` (~$0.01-0.05/iter) for exploration +- Switch to `gemini-2.5-pro` or `gpt-4o` for refinement +- Use `cascade_evaluation: true` to filter bad programs cheaply +- Start with 100-200 iterations, increase only if scores are still improving + +--- + +## Troubleshooting + +| Problem | Solution | +|---------|----------| +| `OPENAI_API_KEY not set` | `export OPENAI_API_KEY="your-key"` (used for all providers including Gemini) | +| Evaluator always returns 0 | Test evaluator manually: `python -c "from evaluator import evaluate; print(evaluate('initial_program.py'))"` | +| Evolution stuck at same score | Increase `temperature`, add more `num_diverse_programs`, improve system message | +| Out of memory | Reduce `population_size`, enable `cascade_evaluation` | +| LLM rate limits | Add `retry_delay: 10` in llm config, or use OptiLLM proxy | +| Bad evolved code | Enable `evaluator.enable_artifacts: true` and `prompt.include_artifacts: true` so errors feed back to the LLM | + +--- + +## Built-in Examples + +Clone the repo for ready-to-run examples: + +```bash +git clone https://github.com/algorithmicsuperintelligence/openevolve.git +cd openevolve +``` + +| Example | Directory | Description | +|---------|-----------|-------------| +| Function Minimization | `examples/function_minimization/` | Random search → simulated annealing | +| Circle Packing | `examples/circle_packing/` | State-of-the-art n=26 packing | +| GPU Kernels | `examples/mlx_metal_kernel_opt/` | 2.8x Metal kernel speedup | +| Sorting | `examples/rust_adaptive_sort/` | Adaptive sort in Rust | +| Prompt Optimization | `examples/llm_prompt_optimization/` | +23% accuracy on HotpotQA | +| Symbolic Regression | `examples/symbolic_regression/` | Automated equation discovery | +| Signal Processing | `examples/signal_processing/` | Filter design | + +Run any example: + +```bash +openevolve-run examples/function_minimization/initial_program.py \ + 
examples/function_minimization/evaluator.py \ + --config examples/function_minimization/config.yaml \ + --iterations 50 +```