tracking: Anthropic skill-creator eval framework alignment #569

@christso

Description

Summary

Make AgentV compatible with the skill evaluation lifecycle established by Anthropic's skill-creator framework. The core approach is to combine AgentV's currently fragmented eval skills (eval-orchestrator + optimizer) into a single lifecycle skill that matches skill-creator's unified pattern, while enhancing eval agents with skill-creator's best prompt engineering techniques.

Research Basis

Positioning

AgentV's primary use case is agent/workspace evaluation (EVAL.yaml with repos, code judges, multi-provider comparison). Skill-creator compatibility is the secondary migration path: users start evaluating skills with skill-creator's simple loop (evals.json + claude -p). When they need workspace isolation, code judges, multi-provider comparison, tool trajectory scoring, or multi-turn evaluation, they migrate to AgentV — same evals.json input, same artifact output, zero rewrite.

Interoperability Model: AgentV ↔ Skill-Creator

The two systems are complementary, not competing. The issues in this tracker must preserve this interop story:

Shared formats (the integration layer)

| Format | Skill-Creator | AgentV | Direction |
| --- | --- | --- | --- |
| evals.json | Writes (test cases for a skill) | Reads + runs (via `agentv eval run evals.json`) | skill-creator → AgentV |
| grading.json | Reads (eval-viewer, analyzer) | Writes (as companion artifact, #565) | AgentV → skill-creator |
| timing.json | Reads (eval-viewer) | Writes (#565) | AgentV → skill-creator |
| benchmark.json | Reads (analyzer, comparison) | Writes (#565) | AgentV → skill-creator |
| feedback.json | Writes (eval-viewer review) | Reads (review checkpoint, #568) | Bidirectional |

Skill-creator → AgentV (AgentV as evaluation engine)

A user creates evals.json with skill-creator, then runs them through AgentV instead of claude -p:

  • AgentV adds: workspace isolation, code judges, tool trajectory scoring, multi-provider comparison, multi-turn evaluation
  • AgentV outputs grading/benchmark artifacts that skill-creator's eval-viewer/generate_review.py can read
  • This is the upgrade path: simple skill eval → full environment eval

AgentV → Skill-creator (skill-creator as trigger optimizer)

When AgentV eventually needs trigger-quality evaluation (currently deferred), skill-creator's trigger-eval tooling (run_loop.py, improve_description.py, train/test splits) could provide it. AgentV would not need to rebuild this from scratch.

What each system owns

| Concern | Owner | Why |
| --- | --- | --- |
| Skill authoring (SKILL.md creation) | Skill-creator | Claude-specific, not AgentV's scope |
| Trigger optimization | Skill-creator | Requires Claude-specific mechanisms (`claude -p`) |
| `.skill` packaging | Skill-creator | Distribution format, not evaluation |
| Execution-quality evaluation | AgentV | Richer environment: workspaces, multi-provider, code judges |
| Artifact format definition | Shared | AgentV produces supersets; skill-creator defines the baseline schema |
| Eval-viewer / review UI | Skill-creator (for now) | AgentV's #562/#563 dashboards will eventually supersede it |

Implementation constraint

Artifact schemas (#565) MUST be supersets of skill-creator's schemas — AgentV adds fields but never removes or renames skill-creator fields. This ensures skill-creator tooling can always read AgentV output.
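The superset rule can be sketched in TypeScript. Only the `expectations[].text/passed/evidence` shape comes from this tracker; every other field name below is an illustrative assumption, not the finalized #565 schema:

```typescript
// Baseline shape skill-creator's tooling expects in grading/<test-id>.json.
interface SkillCreatorExpectation {
  text: string;     // the assertion being graded
  passed: boolean;  // whether the assertion held
  evidence: string; // supporting evidence for the verdict
}

interface SkillCreatorGrading {
  expectations: SkillCreatorExpectation[];
}

// AgentV's record is a strict superset: it extends the baseline and only
// adds fields (the extras here are hypothetical), never renames or removes
// a skill-creator field.
interface AgentVGrading extends SkillCreatorGrading {
  evaluators?: { name: string; score: number }[]; // AgentV-only addition
  workspace_changes?: string[];                   // AgentV-only addition
}

const grading: AgentVGrading = {
  expectations: [
    { text: "Output cites the modified file", passed: true, evidence: "reply references src/app.ts" },
  ],
  evaluators: [{ name: "code-judge", score: 1.0 }],
};

// skill-creator tooling can read the same object by ignoring the extras.
const baselineView: SkillCreatorGrading = grading;
console.log(baselineView.expectations[0].passed);
```

Because the extension is additive, any reader written against the baseline schema keeps working on AgentV output unchanged.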

Sub-Issues

Orchestration

| # | Title | Boundary | Priority |
| --- | --- | --- | --- |
| #573 | feat: unified skill-eval lifecycle skill (combine eval-orchestrator + optimizer) | external-first | Must |
| #572 | fix: disambiguate eval skill triggers from skill-creator | external-first | Must |

Agent/Capability Enhancements (feed into #573's phases)

| # | Title | Phase in #573 | Priority |
| --- | --- | --- | --- |
| #570 | feat: eval-judge with claims extraction, eval critique, evidence format | Phase 3: Grade | Must |
| #571 | feat: blind A/B comparison with dynamic rubrics | Phase 4: Compare | Should |
| #567 | feat: eval analyzer (deterministic-upgrade suggestions) | Phase 5: Analyze | Should |
| #568 | feat: human review checkpoint + feedback artifact | Phase 6: Review | Should |

Output & Documentation

| # | Title | Boundary | Priority |
| --- | --- | --- | --- |
| #565 | feat: skill-eval companion artifacts (grading, timing, benchmark) | external-first | Must |
| #564 | docs: canonical skill-improvement workflow guide | docs-examples | Must |
| #566 | docs: separate execution quality from trigger quality | docs-examples | Must |

Pre-existing related issues

| # | Title | Relationship |
| --- | --- | --- |
| #562 | feat: self-contained HTML dashboard | Review surface for Phase 6; consumer of #565 artifacts |
| #563 | feat: self-hosted dashboard with history repo | Historical trends; consumer of #565 artifacts |
| #335 | feat: iteration tracking, cross-run regression | Complementary to #565 |

Architecture Boundary Summary

  • All implementation is external-first or docs-only. No core runtime changes. Per CLAUDE.md's Lightweight Core principle, the JSONL output already contains all data needed for companion artifacts; the eval-judge and comparison enhancements are agent prompt changes; the lifecycle skill is orchestration-level work in plugins/agentv-dev/.
  • No Python scripts. Skill-creator uses Python scripts for trigger-eval tooling. AgentV is TypeScript/Bun. Per AI-First Design, evaluation patterns belong in agents and skills, not scripts.
  • History storage uses existing config. .agentv/config.yaml already supports configurable history.repo.
  • One lifecycle skill, not four. Combining eval-orchestrator + optimizer into one skill matches skill-creator's unified approach. agentv-eval-builder (test case creation) and agentv-trace-analyst (ad-hoc analysis) stay separate.
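For the history-storage point above, a minimal `.agentv/config.yaml` fragment might look like the following. Only the `history.repo` key is confirmed by this tracker; the value is a placeholder:

```yaml
# .agentv/config.yaml — sketch only; history.repo is the documented key,
# the repository URL shown is a placeholder.
history:
  repo: git@github.com:your-org/agentv-history.git
```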

Skill-Creator Prompt Engineering Adoption

Adopt from skill-creator

| Pattern | Source | Target |
| --- | --- | --- |
| Claims extraction and verification | grader.md | #570 → Phase 3 |
| Eval self-critique | grader.md | #570 → Phase 3 |
| Surface vs substance guards | grader.md | #570 → Phase 3 |
| Per-assertion structured evidence | grader.md | #570 → Phase 3 |
| User notes integration | grader.md | #570 → Phase 3 |
| Blind A/B comparison | comparator.md | #571 → Phase 4 |
| Dynamic rubric generation | comparator.md | #571 → Phase 4 |
| Post-comparison analysis | analyzer.md | #571 → Phase 4 |
| Benchmark pattern analysis | analyzer.md | #567 → Phase 5 |
| Deterministic-upgrade suggestions | analyzer.md | #567 → Phase 5 |

Keep from AgentV

| Pattern | Location | Why |
| --- | --- | --- |
| SIMBA (self-introspective failure analysis) | optimizer-reflector | More structured than skill-creator's analyzer |
| GEPA (trace reflection) | optimizer-reflector | More formal diagnosis categories |
| Integrity checks (task ∉ evaluator configs) | optimizer-discovery | Unique to AgentV |
| Stagnation detection | optimizer-reflector | Unique to AgentV |
| Failure triage | optimizer-discovery | More granular |

Dependency Graph

#572 (trigger disambiguation) ─────────────────────────┐
#566 (exec vs trigger docs) ────────────────────────────┤
#564 (workflow guide) ──────────────────────────────────┤
#565 (artifact format) ─────────────────────────────────┤
                                                        │
#570 (eval-judge enhancement) ──┐                       │
#571 (blind comparison) ────────┤                       ├─ all complete = tracking done
#567 (analyzer) ────────────────┼─► #573 (combined skill)
#568 (review checkpoint) ───────┘                       │
                                                        │
#573 (unified lifecycle skill) ─────────────────────────┘

Parallel Waves

Wave 1 (independent — run in parallel)

Wave 2 (benefits from Wave 1)

Wave 3 (orchestration)

Merge Order

  1. fix: disambiguate agentv eval skill triggers from skill-creator #572 (frontmatter-only)
  2. docs: separate execution quality from trigger quality in eval guidance #566 (docs-only)
  3. docs: canonical skill-improvement workflow guide #564 (docs-only)
  4. feat: human review checkpoint and feedback artifact for skill iteration #568 (docs + schema)
  5. feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570 (agent prompt — eval-judge)
  6. feat: skill-eval companion artifacts (grading, timing, benchmark) #565 (output format)
  7. feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571 (new agents — comparator + analyzer)
  8. feat: eval analyzer pass for weak assertions and flaky scenarios #567 (enhanced optimizer-reflector)
  9. feat: unified agent-evaluation lifecycle skill (combine eval-orchestrator + optimizer) #573 (unified SKILL.md — must merge after all agents are in place)

Subagent Operating Contract

End-to-End Validation

After all issues are implemented, the following end-to-end scenario must work:

E2E Test: Full Skill Evaluation Lifecycle

Setup:

  1. A workspace with an existing skill (SKILL.md) and eval file (evals.json or EVAL.yaml)
  2. Both agentv-dev plugin and anthropics/skills (skill-creator) loaded in the same session
  3. AgentV CLI installed and configured with at least one target

Scenario:

User: "Evaluate and improve my skill against evals/skill-quality.yaml"

Expected behavior:

  1. Claude triggers the AgentV unified lifecycle skill (not skill-creator) based on disambiguated triggers (fix: disambiguate agentv eval skill triggers from skill-creator #572)
  2. Phase 1 (Discovery): optimizer-discovery analyzes the eval, challenges assumptions, triages failures
  3. Phase 2 (Run): Runs baseline and candidate evaluations using agentv eval run
  4. Phase 3 (Grade): Enhanced eval-judge grades with per-assertion evidence, extracts claims, critiques eval quality, guards against surface compliance (feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570). Outputs grading/<test-id>.json (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)
  5. Phase 4 (Compare): Blind comparator evaluates A/B without labels, generates task-specific rubrics (feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571). Comparison analyzer unblinds and explains why winner won
  6. Phase 5 (Analyze): Enhanced reflector identifies deterministic upgrade opportunities, flags weak assertions, detects flaky patterns (feat: eval analyzer pass for weak assertions and flaky scenarios #567). Outputs benchmark.json (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)
  7. Phase 6 (Review): Presents results to human. Human writes feedback.json (feat: human review checkpoint and feedback artifact for skill iteration #568)
  8. Phase 7 (Optimize): Curator applies surgical edits, polish generalizes into principles
  9. Phase 8 (Re-run): Loops back to Phase 2 with modified skill. Compares against previous iteration
  10. Skill stabilizes. User is satisfied. timing.json written (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)

Verification checklist:

  • Correct skill triggered (AgentV, not skill-creator)
  • All 8 phases execute in order
  • grading.json matches skill-creator schema (expectations[].text/passed/evidence)
  • timing.json matches skill-creator schema
  • benchmark.json matches skill-creator schema (run_summary with mean/stddev)
  • Blind comparison produces winner/reasoning/rubric without revealing labels
  • Analyzer suggests at least one deterministic assertion upgrade
  • Human review checkpoint pauses for feedback
  • Iteration loop improves pass rate
  • feedback.json persists across iterations
  • Execution quality and trigger quality are NOT conflated
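The `run_summary` statistics named in the checklist can be sketched as follows. The mean/stddev pairing comes from the checklist; the input shape, the helper name, and the choice of population stddev are assumptions:

```typescript
// Hypothetical pass-rate samples from repeated runs of the same eval.
const passRates = [0.8, 0.9, 0.7, 0.8];

// Compute the mean/stddev pair that benchmark.json's run_summary carries.
// Population stddev is assumed here; the real schema may use sample stddev.
function runSummary(samples: number[]): { mean: number; stddev: number } {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((acc, x) => acc + (x - mean) ** 2, 0) / samples.length;
  return { mean, stddev: Math.sqrt(variance) };
}

const summary = runSummary(passRates); // ≈ { mean: 0.8, stddev: 0.0707 }
console.log(summary);
```

A nonzero stddev across identical runs is exactly the signal the Phase 5 analyzer uses to flag flaky scenarios.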

E2E Test: Artifact Compatibility

Scenario: Generate AgentV artifacts and verify skill-creator's tooling can read them

  • grading/<test-id>.json can be parsed by skill-creator's eval-viewer/generate_review.py
  • benchmark.json structure matches skill-creator's run_summary format
  • timing.json structure matches skill-creator's timing format

E2E Test: Workspace-Based Agent Evaluation

Scenario: Evaluate an agent that modifies files in a cloned repository

Setup:

  1. An EVAL.yaml with workspace config (repo URL, setup script, teardown)
  2. Test cases that require the agent to read, modify, and create files
  3. Evaluators: code-judge (Python script checking file output), tool-trajectory, llm-judge
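A minimal sketch of such an EVAL.yaml follows. Only the concepts — a workspace with repo/setup/teardown and the three evaluator types — come from the setup list above; every key name is an illustrative assumption, not AgentV's finalized schema:

```yaml
# EVAL.yaml sketch — key names are assumed, values are placeholders.
workspace:
  repo: https://github.com/example/fixture-repo.git
  setup: ./scripts/setup.sh       # runs after clone, before the agent
  teardown: ./scripts/teardown.sh # runs after grading
evaluators:
  - type: code-judge
    script: checks/verify_output.py # checks workspace file output
  - type: tool-trajectory           # verifies Read → analyze → Write
  - type: llm-judge
```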

Expected behavior:

  1. Workspace cloned and setup script executed
  2. Agent runs in isolated workspace
  3. Code judge evaluates workspace file changes (not just text output)
  4. Tool trajectory verifies correct tool usage (Read → analyze → Write)
  5. Workspace teardown executed
  6. grading.json includes workspace_changes and evaluators[] with code-judge details
  7. benchmark.json includes per-evaluator breakdown (code-judge, tool-trajectory, llm-judge)

Verification:

  • Workspace isolation works (agent runs in cloned repo, not user workspace)
  • Code judge results appear in grading.json with structured details
  • Tool trajectory scores appear in grading.json
  • Workspace file changes tracked in grading.json
  • benchmark.json has per-evaluator summary (not just overall pass rate)

E2E Test: Multi-Provider Comparison

Scenario: Compare 3 providers (Claude, GPT, Gemini) on the same eval

Expected behavior:

  1. Same eval runs against all 3 targets
  2. Blind comparison randomizes labels (A, B, C) across all pairwise or round-robin matchups
  3. benchmark.json has per-target breakdown with N-way statistics
  4. Dynamic rubrics generated per task, applied consistently across all providers

Verification:

  • N-way blind comparison works (not just binary A/B)
  • benchmark.json has entries for all 3 targets
  • Comparison results include per-pair winner + overall ranking
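The per-pair winner plus overall ranking from the checklist can be sketched as a data shape. All field names here are illustrative assumptions; only the winner/reasoning/ranking concepts come from this tracker:

```typescript
// Illustrative N-way blind comparison result. Labels A/B/C are randomized
// aliases for the three targets; the mapping back to providers is only
// revealed after grading.
interface PairResult {
  pair: [string, string]; // blinded labels, e.g. ["A", "B"]
  winner: string;
  reasoning: string;
}

const comparison = {
  pairs: [
    { pair: ["A", "B"], winner: "A", reasoning: "More complete file edits" },
    { pair: ["A", "C"], winner: "A", reasoning: "Fewer unnecessary tool calls" },
    { pair: ["B", "C"], winner: "C", reasoning: "Correct final output" },
  ] as PairResult[],
  // Overall ranking derived from per-pair wins (A: 2, C: 1, B: 0).
  ranking: ["A", "C", "B"],
};

// Count wins per label to sanity-check that the ranking follows the pairs.
const wins = new Map<string, number>();
for (const p of comparison.pairs) {
  wins.set(p.winner, (wins.get(p.winner) ?? 0) + 1);
}
```

Deriving the ranking from per-pair wins keeps the N-way case a straight generalization of binary A/B rather than a separate code path.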

E2E Test: Trigger Disambiguation

Scenario: Both skill-creator and agentv-dev loaded. Test these prompts:

| Prompt | Should trigger |
| --- | --- |
| "Run evals on my skill using AgentV" | AgentV lifecycle skill |
| "Create a new skill from scratch" | skill-creator |
| "Optimize my skill's trigger description" | skill-creator |
| "Evaluate my agent against evals/quality.yaml" | AgentV lifecycle skill |
| "Benchmark skill performance with variance analysis" | Depends on context — should not be ambiguous |
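Since #572 is frontmatter-only, the disambiguation boils down to the skill's trigger description. A hypothetical sketch (the skill name and wording are illustrative, not the actual #572 change):

```yaml
# SKILL.md frontmatter — hypothetical wording for disambiguated triggers.
---
name: agentv-skill-eval
description: >
  Evaluate and improve agents or skills with the AgentV CLI (EVAL.yaml or
  evals.json, workspace isolation, code judges, multi-provider comparison).
  Not for authoring new skills or tuning trigger descriptions; skill-creator
  owns those.
---
```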

Completion Criteria

All of the following are true:

  • A unified lifecycle skill handles run → grade → compare → analyze → review → optimize → re-run
  • eval-orchestrator is deprecated/merged into the lifecycle skill
  • eval-judge extracts claims, critiques evals inline, guards against surface compliance
  • Blind A/B comparison with dynamic rubrics is available
  • Standalone analyzer suggests deterministic assertion upgrades
  • Human review checkpoint with feedback artifact is documented and usable
  • Skill-creator-compatible companion artifacts produced at relevant phases
  • User-facing workflow guide documents the lifecycle
  • Execution and trigger quality are documented as distinct concerns
  • AgentV skills trigger correctly when skill-creator is also loaded
  • E2E validation scenarios pass

Future Work (not in scope)

Issue actions