tracking: Anthropic skill-creator eval framework alignment #569

@christso

Description

Summary

Make AgentV compatible with the skill evaluation lifecycle established by Anthropic's skill-creator framework. The core approach is to combine AgentV's currently fragmented eval skills (eval-orchestrator + optimizer) into a single lifecycle skill that matches skill-creator's unified pattern, while enhancing eval agents with skill-creator's best prompt engineering techniques.

Research Basis

Positioning

AgentV's primary use case is agent/workspace evaluation (EVAL.yaml with repos, code judges, multi-provider comparison). Skill-creator compatibility is the secondary migration path: users start evaluating skills with skill-creator's simple loop (evals.json + claude -p). When they need workspace isolation, code judges, multi-provider comparison, tool trajectory scoring, or multi-turn evaluation, they migrate to AgentV — same evals.json input, same artifact output, zero rewrite.

Interoperability Model: AgentV ↔ Skill-Creator

The two systems are complementary, not competing. The issues in this tracker must preserve this interop story:

Shared formats (the integration layer)

| Format | Skill-Creator | AgentV | Direction |
| --- | --- | --- | --- |
| evals.json | Writes (test cases for a skill) | Reads + runs (via `agentv eval run evals.json`) | skill-creator → AgentV |
| grading.json | Reads (eval-viewer, analyzer) | Writes (as companion artifact, #565) | AgentV → skill-creator |
| timing.json | Reads (eval-viewer) | Writes (#565) | AgentV → skill-creator |
| benchmark.json | Reads (analyzer, comparison) | Writes (#565) | AgentV → skill-creator |
| feedback.json | Writes (eval-viewer review) | Reads (review checkpoint, #568) | Bidirectional |

Skill-creator → AgentV (AgentV as evaluation engine)

A user creates evals.json with skill-creator, then runs them through AgentV instead of claude -p:

  • AgentV adds: workspace isolation, code judges, tool trajectory scoring, multi-provider comparison, multi-turn evaluation
  • AgentV outputs grading/benchmark artifacts that skill-creator's eval-viewer/generate_review.py can read
  • This is the upgrade path: simple skill eval → full environment eval

AgentV → Skill-creator (skill-creator as trigger optimizer)

When AgentV eventually needs trigger-quality evaluation (currently deferred), skill-creator's trigger-eval tooling (run_loop.py, improve_description.py, train/test splits) could provide it. AgentV would not need to rebuild this from scratch.

What each system owns

| Concern | Owner | Why |
| --- | --- | --- |
| Skill authoring (SKILL.md creation) | Skill-creator | Claude-specific, not AgentV's scope |
| Trigger optimization | Skill-creator | Requires Claude-specific mechanisms (`claude -p`) |
| `.skill` packaging | Skill-creator | Distribution format, not evaluation |
| Execution-quality evaluation | AgentV | Richer environment: workspaces, multi-provider, code judges |
| Artifact format definition | Shared | AgentV produces supersets; skill-creator defines the baseline schema |
| Eval-viewer / review UI | Skill-creator (for now) | AgentV's #562/#563 dashboards will eventually supersede it |

Implementation constraint

Artifact schemas (#565) MUST be supersets of skill-creator's schemas — AgentV adds fields but never removes or renames skill-creator fields. This ensures skill-creator tooling can always read AgentV output.
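The superset rule can be sketched in TypeScript. Only the `expectations[].text/passed/evidence` shape comes from this tracker; every other field name below is an illustrative assumption, not the finalized #565 schema:

```typescript
// Baseline shape skill-creator's tooling expects in grading/<test-id>.json.
interface SkillCreatorExpectation {
  text: string;     // the assertion being graded
  passed: boolean;  // whether the assertion held
  evidence: string; // supporting evidence for the verdict
}

interface SkillCreatorGrading {
  expectations: SkillCreatorExpectation[];
}

// AgentV's record is a strict superset: it extends the baseline and only
// adds fields (the extras here are hypothetical), never renames or removes
// a skill-creator field.
interface AgentVGrading extends SkillCreatorGrading {
  evaluators?: { name: string; score: number }[]; // AgentV-only addition
  workspace_changes?: string[];                   // AgentV-only addition
}

const grading: AgentVGrading = {
  expectations: [
    { text: "Output cites the modified file", passed: true, evidence: "reply references src/app.ts" },
  ],
  evaluators: [{ name: "code-judge", score: 1.0 }],
};

// skill-creator tooling can read the same object by ignoring the extras.
const baselineView: SkillCreatorGrading = grading;
console.log(baselineView.expectations[0].passed);
```

Because the extension is additive, any reader written against the baseline schema keeps working on AgentV output unchanged.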

Sub-Issues

Orchestration

| # | Title | Boundary | Priority |
| --- | --- | --- | --- |
| #573 | feat: unified skill-eval lifecycle skill (combine eval-orchestrator + optimizer) | external-first | Must |
| #572 | fix: disambiguate eval skill triggers from skill-creator | external-first | Must |

Agent/Capability Enhancements (feed into #573's phases)

| # | Title | Phase in #573 | Priority |
| --- | --- | --- | --- |
| #570 | feat: eval-judge with claims extraction, eval critique, evidence format | Phase 3: Grade | Must |
| #571 | feat: blind A/B comparison with dynamic rubrics | Phase 4: Compare | Should |
| #567 | feat: eval analyzer (deterministic-upgrade suggestions) | Phase 5: Analyze | Should |
| #568 | feat: human review checkpoint + feedback artifact | Phase 6: Review | Should |

Output & Documentation

| # | Title | Boundary | Priority |
| --- | --- | --- | --- |
| #565 | feat: skill-eval companion artifacts (grading, timing, benchmark) | external-first | Must |
| #564 | docs: canonical skill-improvement workflow guide | docs-examples | Must |
| #566 | docs: separate execution quality from trigger quality | docs-examples | Must |

Pre-existing related issues

| # | Title | Relationship |
| --- | --- | --- |
| #562 | feat: self-contained HTML dashboard | Review surface for Phase 6; consumer of #565 artifacts |
| #563 | feat: self-hosted dashboard with history repo | Historical trends; consumer of #565 artifacts |
| #335 | feat: iteration tracking, cross-run regression | Complementary to #565 |

Architecture Boundary Summary

  • All implementation is external-first or docs-only. No core runtime changes. Per CLAUDE.md's Lightweight Core principle, the JSONL output already contains all data needed for companion artifacts; the eval-judge and comparison enhancements are agent prompt changes; the lifecycle skill is orchestration-level work in plugins/agentv-dev/.
  • No Python scripts. Skill-creator uses Python scripts for trigger-eval tooling. AgentV is TypeScript/Bun. Per AI-First Design, evaluation patterns belong in agents and skills, not scripts.
  • History storage uses existing config. .agentv/config.yaml already supports configurable history.repo.
  • One lifecycle skill, not four. Combining eval-orchestrator + optimizer into one skill matches skill-creator's unified approach. agentv-eval-builder (test case creation) and agentv-trace-analyst (ad-hoc analysis) stay separate.
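For the history-storage point above, a minimal `.agentv/config.yaml` fragment might look like the following. Only the `history.repo` key is confirmed by this tracker; the value is a placeholder:

```yaml
# .agentv/config.yaml — sketch only; history.repo is the documented key,
# the repository URL shown is a placeholder.
history:
  repo: git@github.com:your-org/agentv-history.git
```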

Skill-Creator Prompt Engineering Adoption

Adopt from skill-creator

| Pattern | Source | Target |
| --- | --- | --- |
| Claims extraction and verification | grader.md | #570 → Phase 3 |
| Eval self-critique | grader.md | #570 → Phase 3 |
| Surface vs substance guards | grader.md | #570 → Phase 3 |
| Per-assertion structured evidence | grader.md | #570 → Phase 3 |
| User notes integration | grader.md | #570 → Phase 3 |
| Blind A/B comparison | comparator.md | #571 → Phase 4 |
| Dynamic rubric generation | comparator.md | #571 → Phase 4 |
| Post-comparison analysis | analyzer.md | #571 → Phase 4 |
| Benchmark pattern analysis | analyzer.md | #567 → Phase 5 |
| Deterministic-upgrade suggestions | analyzer.md | #567 → Phase 5 |

Keep from AgentV

| Pattern | Location | Why |
| --- | --- | --- |
| SIMBA (self-introspective failure analysis) | optimizer-reflector | More structured than skill-creator's analyzer |
| GEPA (trace reflection) | optimizer-reflector | More formal diagnosis categories |
| Integrity checks (task ∉ evaluator configs) | optimizer-discovery | Unique to AgentV |
| Stagnation detection | optimizer-reflector | Unique to AgentV |
| Failure triage | optimizer-discovery | More granular |

Dependency Graph

#572 (trigger disambiguation) ─────────────────────────┐
#566 (exec vs trigger docs) ────────────────────────────┤
#564 (workflow guide) ──────────────────────────────────┤
#565 (artifact format) ─────────────────────────────────┤
                                                        │
#570 (eval-judge enhancement) ──┐                       │
#571 (blind comparison) ────────┤                       ├─ all complete = tracking done
#567 (analyzer) ────────────────┼─► #573 (combined skill)
#568 (review checkpoint) ───────┘                       │
                                                        │
#573 (unified lifecycle skill) ─────────────────────────┘

Parallel Waves

Wave 1 (independent — run in parallel)

Wave 2 (benefits from Wave 1)

Wave 3 (orchestration)

Merge Order

  1. fix: disambiguate agentv eval skill triggers from skill-creator #572 (frontmatter-only)
  2. docs: separate execution quality from trigger quality in eval guidance #566 (docs-only)
  3. docs: canonical skill-improvement workflow guide #564 (docs-only)
  4. feat: human review checkpoint and feedback artifact for skill iteration #568 (docs + schema)
  5. feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570 (agent prompt — eval-judge)
  6. feat: skill-eval companion artifacts (grading, timing, benchmark) #565 (output format)
  7. feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571 (new agents — comparator + analyzer)
  8. feat: eval analyzer pass for weak assertions and flaky scenarios #567 (enhanced optimizer-reflector)
  9. feat: unified agent-evaluation lifecycle skill (combine eval-orchestrator + optimizer) #573 (unified SKILL.md — must merge after all agents are in place)

Subagent Operating Contract

End-to-End Validation

After all issues are implemented, the following end-to-end scenario must work:

E2E Test: Full Skill Evaluation Lifecycle

Setup:

  1. A workspace with an existing skill (SKILL.md) and eval file (evals.json or EVAL.yaml)
  2. Both agentv-dev plugin and anthropics/skills (skill-creator) loaded in the same session
  3. AgentV CLI installed and configured with at least one target

Scenario:

User: "Evaluate and improve my skill against evals/skill-quality.yaml"

Expected behavior:

  1. Claude triggers the AgentV unified lifecycle skill (not skill-creator) based on disambiguated triggers (fix: disambiguate agentv eval skill triggers from skill-creator #572)
  2. Phase 1 (Discovery): optimizer-discovery analyzes the eval, challenges assumptions, triages failures
  3. Phase 2 (Run): Runs baseline and candidate evaluations using agentv eval run
  4. Phase 3 (Grade): Enhanced eval-judge grades with per-assertion evidence, extracts claims, critiques eval quality, guards against surface compliance (feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570). Outputs grading/<test-id>.json (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)
  5. Phase 4 (Compare): Blind comparator evaluates A/B without labels, generates task-specific rubrics (feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571). Comparison analyzer unblinds and explains why winner won
  6. Phase 5 (Analyze): Enhanced reflector identifies deterministic upgrade opportunities, flags weak assertions, detects flaky patterns (feat: eval analyzer pass for weak assertions and flaky scenarios #567). Outputs benchmark.json (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)
  7. Phase 6 (Review): Presents results to human. Human writes feedback.json (feat: human review checkpoint and feedback artifact for skill iteration #568)
  8. Phase 7 (Optimize): Curator applies surgical edits, polish generalizes into principles
  9. Phase 8 (Re-run): Loops back to Phase 2 with modified skill. Compares against previous iteration
  10. Skill stabilizes. User is satisfied. timing.json written (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)

Verification checklist:

  • Correct skill triggered (AgentV, not skill-creator)
  • All 8 phases execute in order
  • grading.json matches skill-creator schema (expectations[].text/passed/evidence)
  • timing.json matches skill-creator schema
  • benchmark.json matches skill-creator schema (run_summary with mean/stddev)
  • Blind comparison produces winner/reasoning/rubric without revealing labels
  • Analyzer suggests at least one deterministic assertion upgrade
  • Human review checkpoint pauses for feedback
  • Iteration loop improves pass rate
  • feedback.json persists across iterations
  • Execution quality and trigger quality are NOT conflated
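The `run_summary` statistics named in the checklist can be sketched as follows. The mean/stddev pairing comes from the checklist; the input shape, the helper name, and the choice of population stddev are assumptions:

```typescript
// Hypothetical pass-rate samples from repeated runs of the same eval.
const passRates = [0.8, 0.9, 0.7, 0.8];

// Compute the mean/stddev pair that benchmark.json's run_summary carries.
// Population stddev is assumed here; the real schema may use sample stddev.
function runSummary(samples: number[]): { mean: number; stddev: number } {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((acc, x) => acc + (x - mean) ** 2, 0) / samples.length;
  return { mean, stddev: Math.sqrt(variance) };
}

const summary = runSummary(passRates); // ≈ { mean: 0.8, stddev: 0.0707 }
console.log(summary);
```

A nonzero stddev across identical runs is exactly the signal the Phase 5 analyzer uses to flag flaky scenarios.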

E2E Test: Artifact Compatibility

Scenario: Generate AgentV artifacts and verify skill-creator's tooling can read them

  • grading/<test-id>.json can be parsed by skill-creator's eval-viewer/generate_review.py
  • benchmark.json structure matches skill-creator's run_summary format
  • timing.json structure matches skill-creator's timing format

E2E Test: Workspace-Based Agent Evaluation

Scenario: Evaluate an agent that modifies files in a cloned repository

Setup:

  1. An EVAL.yaml with workspace config (repo URL, setup script, teardown)
  2. Test cases that require the agent to read, modify, and create files
  3. Evaluators: code-judge (Python script checking file output), tool-trajectory, llm-judge
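A minimal sketch of such an EVAL.yaml follows. Only the concepts — a workspace with repo/setup/teardown and the three evaluator types — come from the setup list above; every key name is an illustrative assumption, not AgentV's finalized schema:

```yaml
# EVAL.yaml sketch — key names are assumed, values are placeholders.
workspace:
  repo: https://github.com/example/fixture-repo.git
  setup: ./scripts/setup.sh       # runs after clone, before the agent
  teardown: ./scripts/teardown.sh # runs after grading
evaluators:
  - type: code-judge
    script: checks/verify_output.py # checks workspace file output
  - type: tool-trajectory           # verifies Read → analyze → Write
  - type: llm-judge
```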

Expected behavior:

  1. Workspace cloned and setup script executed
  2. Agent runs in isolated workspace
  3. Code judge evaluates workspace file changes (not just text output)
  4. Tool trajectory verifies correct tool usage (Read → analyze → Write)
  5. Workspace teardown executed
  6. grading.json includes workspace_changes and evaluators[] with code-judge details
  7. benchmark.json includes per-evaluator breakdown (code-judge, tool-trajectory, llm-judge)

Verification:

  • Workspace isolation works (agent runs in cloned repo, not user workspace)
  • Code judge results appear in grading.json with structured details
  • Tool trajectory scores appear in grading.json
  • Workspace file changes tracked in grading.json
  • benchmark.json has per-evaluator summary (not just overall pass rate)

E2E Test: Multi-Provider Comparison

Scenario: Compare 3 providers (Claude, GPT, Gemini) on the same eval

Expected behavior:

  1. Same eval runs against all 3 targets
  2. Blind comparison randomizes labels (A, B, C) across all pairwise or round-robin matchups
  3. benchmark.json has per-target breakdown with N-way statistics
  4. Dynamic rubrics generated per task, applied consistently across all providers

Verification:

  • N-way blind comparison works (not just binary A/B)
  • benchmark.json has entries for all 3 targets
  • Comparison results include per-pair winner + overall ranking
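The per-pair winner plus overall ranking from the checklist can be sketched as a data shape. All field names here are illustrative assumptions; only the winner/reasoning/ranking concepts come from this tracker:

```typescript
// Illustrative N-way blind comparison result. Labels A/B/C are randomized
// aliases for the three targets; the mapping back to providers is only
// revealed after grading.
interface PairResult {
  pair: [string, string]; // blinded labels, e.g. ["A", "B"]
  winner: string;
  reasoning: string;
}

const comparison = {
  pairs: [
    { pair: ["A", "B"], winner: "A", reasoning: "More complete file edits" },
    { pair: ["A", "C"], winner: "A", reasoning: "Fewer unnecessary tool calls" },
    { pair: ["B", "C"], winner: "C", reasoning: "Correct final output" },
  ] as PairResult[],
  // Overall ranking derived from per-pair wins (A: 2, C: 1, B: 0).
  ranking: ["A", "C", "B"],
};

// Count wins per label to sanity-check that the ranking follows the pairs.
const wins = new Map<string, number>();
for (const p of comparison.pairs) {
  wins.set(p.winner, (wins.get(p.winner) ?? 0) + 1);
}
```

Deriving the ranking from per-pair wins keeps the N-way case a straight generalization of binary A/B rather than a separate code path.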

E2E Test: Trigger Disambiguation

Scenario: Both skill-creator and agentv-dev loaded. Test these prompts:

| Prompt | Should trigger |
| --- | --- |
| "Run evals on my skill using AgentV" | AgentV lifecycle skill |
| "Create a new skill from scratch" | skill-creator |
| "Optimize my skill's trigger description" | skill-creator |
| "Evaluate my agent against evals/quality.yaml" | AgentV lifecycle skill |
| "Benchmark skill performance with variance analysis" | Depends on context — should not be ambiguous |
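Since #572 is frontmatter-only, the disambiguation boils down to the skill's trigger description. A hypothetical sketch (the skill name and wording are illustrative, not the actual #572 change):

```yaml
# SKILL.md frontmatter — hypothetical wording for disambiguated triggers.
---
name: agentv-skill-eval
description: >
  Evaluate and improve agents or skills with the AgentV CLI (EVAL.yaml or
  evals.json, workspace isolation, code judges, multi-provider comparison).
  Not for authoring new skills or tuning trigger descriptions; skill-creator
  owns those.
---
```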

Completion Criteria

All of the following are true:

  • A unified lifecycle skill handles run → grade → compare → analyze → review → optimize → re-run
  • eval-orchestrator is deprecated/merged into the lifecycle skill
  • eval-judge extracts claims, critiques evals inline, guards against surface compliance
  • Blind A/B comparison with dynamic rubrics is available
  • Standalone analyzer suggests deterministic assertion upgrades
  • Human review checkpoint with feedback artifact is documented and usable
  • Skill-creator-compatible companion artifacts produced at relevant phases
  • User-facing workflow guide documents the lifecycle
  • Execution and trigger quality are documented as distinct concerns
  • AgentV skills trigger correctly when skill-creator is also loaded
  • E2E validation scenarios pass

Future Work (not in scope)

Issue actions