Turn your general AI agents into state-of-the-art specialists using just a benchmark and EvoSkill, a toolkit for automatically creating and improving AI skills, compatible with Claude Code, Codex CLI, OpenCode, OpenHands, Goose, and more.

EvoSkill significantly extends the feedback-driven idea of GEPA from single-file optimization to complete agent evolution. Where GEPA only revises one prompt in place, EvoSkill proposes multiple skill and prompt mutations jointly, evaluates the new variants on held-out data, and produces an entirely new agent program at each iteration.

Install in seconds, then run `evoskill init` and `evoskill run` to automatically supercharge any coding agent with AI-created skills and prompts. Depending on the agent, you are free to use any model provider of your choice (OpenRouter, Anthropic, OpenAI, Fireworks, and more) and any model you want (Claude, GLM, MiniMax, Kimi, GPT, Gemini, Qwen, and others).
Also join us on Discord to discuss your experience, share suggestions, or show off your work!

| Agent | Support | Notes |
|---|---|---|
| Claude Code | ✅ | |
| OpenCode | ✅ | CLI v1.4.0+ required (structured output support) |
| OpenHands | ✅ | No native structured output; uses fallback JSON extraction |
| Goose | ✅ | CLI v1.25.0+ required (skill discovery via summon extension) |
| Codex CLI | ✅ | Skill discovery via `.agents/skills/` symlink |

| Capability | Status | Explanation |
|---|---|---|
| Evolution with a benchmark | ✅ | Skills can be effectively improved against your own or academic benchmarks. |
| Cross-agent transferability | ✅ | Skills are packaged as reusable folders with instructions, metadata, and helper scripts, compatible with many coding agents. |
| Cross-model transferability | ✅ | As demonstrated in the EvoSkill paper, skills evolved with one fixed LLM can transfer their performance gains to other LLMs. |
| Cross-task transferability | ✅ | Generated skills can be generic enough to transfer across tasks, e.g. a SealQA skill improving BrowseComp performance (as shown in the EvoSkill paper). |
| Evolution without a benchmark | 🛠️ | An open research direction where benchmarks are generated on the fly (e.g. Hermes-Agent self-evolution). |
| Continuous evolution | 🛠️ | Integrating the ability to improve skills from regular usage. |

- Installation
- Quickstart
- CLI Reference
- Configuration Reference
- How It Works
- Git Branches
- When the Loop Gets Stuck
- Python API
- Citation
- License
## Installation

Requirements:

- Python 3.12+
- `uv` (recommended) or `pip`
```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install -e .
```

Agent CLI (install whichever harness you plan to use):
```bash
brew install --cask claude-code   # Claude Code
brew install opencode             # OpenCode (v1.4.0+)
brew install --cask codex         # Codex CLI
brew install block-goose-cli      # Goose (v1.25.0+)
```

Common auth setup:
```bash
# Anthropic (Claude Code harness)
export ANTHROPIC_API_KEY=your-key-here

# OpenAI (Codex harness)
export OPENAI_API_KEY=your-key-here

# OpenRouter (OpenCode / Goose / OpenHands harnesses)
export OPENROUTER_API_KEY=your-key-here
```

OpenRouter-backed evolution runs also accept `LLM_API_KEY`, but `OPENROUTER_API_KEY` is the preferred environment variable.
## Quickstart

Run `evoskill init` inside any git repository:
```text
$ evoskill init

EvoSkill — Project Setup

Which harness? › claude
Evolution mode? › skill_only — agent learns new skills (recommended)
Dataset path? › /absolute/path/to/questions.csv
Question column name? › question
Ground truth column name? › answer
Category column name? (leave blank if none) ›
Additional folders the agent can interact with? › /absolute/path/to/data_dir
```

This creates `.evoskill/config.toml` and `.evoskill/task.md`.
- **Dataset path**: absolute path to your CSV file containing questions and ground-truth answers.
- **Data dirs**: absolute paths to any additional directories the agent needs access to during runs (e.g. reference documents, databases). Comma-separated if multiple.
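For reference, here is a minimal sketch of what such a dataset CSV might look like. The file path and column names (`question`, `answer`) are just illustrations matching the init prompts above, not requirements:

```python
import csv

# Hypothetical example dataset; column names match the answers
# given during `evoskill init` above (question / answer).
rows = [
    {"question": "What was revenue in Q3?", "answer": "$4.2B"},
    {"question": "What was gross margin in Q3?", "answer": "61%"},
]

with open("questions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "answer"])
    writer.writeheader()
    writer.writerows(rows)
```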
Edit `.evoskill/task.md` to describe what the agent should do:
```markdown
# Task

Answer questions about quarterly financial reports.
Return only the numeric answer with units.

## Examples

- "What was revenue in Q3?" → "$4.2B"

---

# Constraints

- Always include units in the answer
- Do not explain your reasoning, just return the answer
```

```bash
evoskill run
```

EvoSkill will run the evolutionary loop and print a live progress table:
```text
Iter   Accuracy   Δ       Skills   Frontier   Status
1      42.0%      —       0        [1]        baseline
2      51.3%      +9.3%   1        [1, 2]     ★ new best
3      49.7%      -1.6%   1        [1, 2]     discarded
...
```

```bash
evoskill eval     # score the best program on the validation set
evoskill skills   # list all discovered skills
evoskill diff     # see what changed vs baseline
evoskill logs     # view past run history
```

After the loop finishes, the best program lives on a git branch:
```bash
git branch | grep program/           # list all program branches
git checkout program/iter-skill-3    # switch to the best one
```

From there you can inspect what the loop discovered:
```bash
cat .claude/program.yaml   # system prompt, tools, score
ls .claude/skills/         # all learned skills
```

Copy `.claude/program.yaml` and `.claude/skills/` into your deployment to use the evolved agent configuration.
## CLI Reference

| Command | Description |
|---|---|
| `evoskill init` | Initialize a new project (creates `.evoskill/`) |
| `evoskill run` | Run the self-improvement loop |
| `evoskill eval` | Evaluate the best program on the validation set |
| `evoskill skills` | List all skills discovered so far |
| `evoskill diff` | Diff baseline vs best, or between two iterations |
| `evoskill logs` | Show recent run history |
| `evoskill reset` | Delete all program branches and start fresh |
```bash
evoskill run [--continue] [--verbose] [--quiet]
```

| Flag | Description |
|---|---|
| `--continue` | Resume from the existing frontier instead of starting fresh. Preserves all `program/*` branches, `frontier/*` tags, feedback history, and the sampling checkpoint so the loop picks up exactly where it left off. |
| `--verbose` | Show per-sample pass/fail results |
| `--quiet` | Show the progress table only, suppress proposer output |
```bash
evoskill diff       # baseline → current best
evoskill diff 3 7   # iteration 3 vs iteration 7
```

The diff is scoped to the `.claude/` directory: it shows changes to skills and the system prompt, not your source code.
```bash
evoskill logs             # last 5 runs (default)
evoskill logs --last 10   # last 10 runs
```

```bash
evoskill reset   # prompts for confirmation
```

Deletes all `program/*` branches, `frontier/*` tags, the loop checkpoint, and feedback history. Your source code, `config.toml`, `task.md`, and any skills in `.claude/skills/` are left untouched.
## Configuration Reference

`evoskill init` creates `.evoskill/config.toml`. All fields are optional; defaults are shown below.
```toml
[harness]
name = "claude"    # "claude", "opencode", "codex", "goose", or "openhands"
model = "sonnet"   # Claude alias, Codex model name, or provider/model for OpenCode/Goose/OpenHands
data_dirs = ["/absolute/path/to/data_dir"]   # extra directories the agent can read

[evolution]
mode = "skill_only"   # "skill_only" or "prompt_only"
iterations = 20
frontier_size = 3
concurrency = 4
no_improvement_limit = 5

[dataset]
path = "/absolute/path/to/questions.csv"   # absolute path to the dataset CSV
question_column = "question"
ground_truth_column = "ground_truth"
category_column = ""   # optional, for stratified sampling
train_ratio = 0.18
val_ratio = 0.12

[scorer]
type = "multi_tolerance"   # see scorer types below
```
Common evolution model setups:

Anthropic:
```toml
[harness]
name = "claude"
model = "claude-sonnet-4-6"
```

OpenAI:
```toml
[harness]
name = "codex"
model = "gpt-5"
```

OpenRouter:
```toml
[harness]
name = "opencode"
model = "openrouter/openai/gpt-5-mini"
```

Notes:

- `claude` is Anthropic-only.
- `codex` uses bare OpenAI model names such as `gpt-5`, `o3`, or `codex-mini-latest`.
- `opencode`, `goose`, and `openhands` are multi-provider harnesses and can also use Claude and OpenAI models.
- `opencode`, `goose`, and `openhands` accept `provider/model` strings such as `anthropic/claude-sonnet-4-6`, `openai/gpt-5`, or `openrouter/openai/gpt-5-mini`.
| Type | Description |
|---|---|
| `multi_tolerance` | Flexible string matching: exact, numeric tolerance, list overlap (default) |
| `exact` | Case-insensitive exact string match |
| `llm` | LLM-as-judge grading with a custom rubric |
| `script` | Shell script scorer; receives `{predicted}` and `{expected}` as variables |
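To make the `multi_tolerance` behavior concrete, here is an illustrative sketch of that kind of matching logic. This is not EvoSkill's actual implementation; the tolerance value and parsing rules are assumptions:

```python
def multi_tolerance_match(predicted: str, expected: str,
                          rel_tol: float = 0.05) -> bool:
    """Illustrative only: exact match, then numeric tolerance, then list overlap."""
    p, e = predicted.strip().lower(), expected.strip().lower()
    if p == e:                       # 1. case-insensitive exact match
        return True
    try:                             # 2. numeric match within a relative tolerance
        pn = float(p.strip("$%").replace(",", ""))
        en = float(e.strip("$%").replace(",", ""))
        return abs(pn - en) <= rel_tol * max(abs(en), 1e-9)
    except ValueError:
        pass
    p_items = {s.strip() for s in p.split(",")}   # 3. list overlap
    e_items = {s.strip() for s in e.split(",")}
    return bool(p_items & e_items)
```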
LLM scorer options:
```toml
[scorer]
type = "llm"
rubric = "Award 1.0 if the answer is numerically correct within 5%, 0.0 otherwise."
model = "claude-sonnet-4-6"   # defaults to claude-sonnet-4-6
provider = "anthropic"        # "anthropic", "openai", "google", or "openrouter"
```

For OpenRouter-backed scoring, set `provider = "openrouter"` and use an OpenRouter model ID such as `openai/gpt-5-mini` or `google/gemini-2.5-flash`. Authentication uses `OPENROUTER_API_KEY` and falls back to `LLM_API_KEY` if needed.
Script scorer options:
```toml
[scorer]
type = "script"
command = "python score.py --predicted {predicted} --expected {expected}"
```
## How It Works

The self-improvement loop follows five stages:

- **Base Agent**: attempts benchmark questions using the current best program (system prompt + skills).
- **Proposer**: analyzes failure cases and proposes targeted skill or prompt changes to address them.
- **Generator**: creates the proposed changes, writing new skill files or rewriting the system prompt.
- **Evaluator**: scores the new program variant on a held-out validation set to measure improvement.
- **Frontier**: tracks the top-N performing programs as git branches; the best survive to the next iteration.
This cycle repeats for a configurable number of iterations, automatically converging on stronger agent configurations.
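In pseudocode, the cycle looks roughly like this. It is a sketch of the five stages above, not EvoSkill's actual API; every name here is hypothetical, and the stage functions are passed in as parameters to keep the sketch self-contained:

```python
def evolve(program, train_set, val_set, *, run_agent, propose_changes,
           apply_changes, evaluate, iterations=20, frontier_size=3):
    """Sketch of the five-stage loop: attempt, propose, generate, evaluate, keep the best."""
    frontier = [(evaluate(program, val_set), program)]   # seed with the baseline
    for _ in range(iterations):
        best = max(frontier, key=lambda sp: sp[0])[1]    # current best program
        failures = run_agent(best, train_set)            # 1. Base Agent
        proposal = propose_changes(failures)             # 2. Proposer
        candidate = apply_changes(best, proposal)        # 3. Generator
        score = evaluate(candidate, val_set)             # 4. Evaluator
        frontier.append((score, candidate))              # 5. Frontier update
        frontier.sort(key=lambda sp: sp[0], reverse=True)
        frontier = frontier[:frontier_size]
    return max(frontier, key=lambda sp: sp[0])[1]
```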
## Git Branches

EvoSkill uses your repo's git history to version every program it creates. During a run it automatically creates and switches between branches; you don't need to do anything. After a run your branch layout will look like:
```text
main                   ← your code, untouched
program/base           ← initial baseline agent
program/iter-skill-1   ← after iteration 1
program/iter-skill-2   ← after iteration 2
...
```
Frontier members are marked with `frontier/*` tags. EvoSkill only ever writes to branches prefixed `program/`, so there is no risk of it touching your working branch.
## When the Loop Gets Stuck

If accuracy stops improving, try the following:

- **Check the feedback log**: `.claude/feedback_history.md` records what the proposer tried each iteration and why it succeeded or failed.
- **Resume instead of restarting**: `evoskill run --continue` picks up from the last frontier rather than discarding progress.
- **Reset and start fresh**: `evoskill reset` clears all branches and lets you start over with a revised `task.md`.
## Python API

For programmatic usage, EvoSkill exposes a high-level Python API.
```python
from src.api import EvoSkill

evo = EvoSkill(
    task="sealqa",
    model="sonnet",
    mode="skill_only",
    max_iterations=20,
    frontier_size=3,
    concurrency=4,
    train_ratio=0.18,
    val_ratio=0.12,
    continue_mode=False,
)

result = await evo.run()

# Synchronous usage
result = EvoSkill(task="base").run_sync()
```
```python
from src.api import EvalRunner

summary = await EvalRunner(
    task="sealqa",
    model="sonnet",
    max_concurrent=8,
).run()
```

## Citation

If you use EvoSkill in your research, please cite the original paper:
```bibtex
@misc{alzubi2026evoskillautomatedskilldiscovery,
  title={EvoSkill: Automated Skill Discovery for Multi-Agent Systems},
  author={Salaheddin Alzubi and Noah Provenzano and Jaydon Bingham and Weiyuan Chen and Tu Vu},
  year={2026},
  eprint={2603.02766},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.02766},
}
```

## License

This project is licensed under the Apache 2.0 License; see the LICENSE file for details.