Summary
Make AgentV compatible with the skill evaluation lifecycle established by Anthropic's skill-creator framework. The core approach is to combine AgentV's currently fragmented eval skills (eval-orchestrator + optimizer) into a single lifecycle skill that matches skill-creator's unified pattern, while enhancing eval agents with skill-creator's best prompt engineering techniques.
Research Basis
- Skill Lifecycle Alignment Memo
- Anthropic Skill Creator Findings
- Tessl + HBOon Findings
- agentevals-research PR #33
Positioning
AgentV's primary use case is agent/workspace evaluation (EVAL.yaml with repos, code judges, multi-provider comparison). Skill-creator compatibility is the secondary migration path: users start evaluating skills with skill-creator's simple loop (evals.json + claude -p). When they need workspace isolation, code judges, multi-provider comparison, tool trajectory scoring, or multi-turn evaluation, they migrate to AgentV — same evals.json input, same artifact output, zero rewrite.
Interoperability Model: AgentV ↔ Skill-Creator
The two systems are complementary, not competing. The issues in this tracker must preserve this interop story:
Shared formats (the integration layer)
| Format | Skill-Creator | AgentV | Direction |
|---|---|---|---|
| evals.json | Writes (test cases for a skill) | Reads + runs (via `agentv eval run evals.json`) | skill-creator → AgentV |
| grading.json | Reads (eval-viewer, analyzer) | Writes (as companion artifact, #565) | AgentV → skill-creator |
| timing.json | Reads (eval-viewer) | Writes (#565) | AgentV → skill-creator |
| benchmark.json | Reads (analyzer, comparison) | Writes (#565) | AgentV → skill-creator |
| feedback.json | Writes (eval-viewer review) | Reads (review checkpoint, #568) | Bidirectional |
Skill-creator → AgentV (AgentV as evaluation engine)
A user creates evals.json with skill-creator, then runs them through AgentV instead of claude -p:
- AgentV adds: workspace isolation, code judges, tool trajectory scoring, multi-provider comparison, multi-turn evaluation
- AgentV outputs grading/benchmark artifacts that skill-creator's `eval-viewer/generate_review.py` can read
- This is the upgrade path: simple skill eval → full environment eval
AgentV → Skill-creator (skill-creator as trigger optimizer)
When AgentV eventually needs trigger-quality evaluation (currently deferred), skill-creator's trigger-eval tooling (run_loop.py, improve_description.py, train/test splits) could provide it. AgentV would not need to rebuild this from scratch.
What each system owns
| Concern | Owner | Why |
|---|---|---|
| Skill authoring (SKILL.md creation) | Skill-creator | Claude-specific, not AgentV's scope |
| Trigger optimization | Skill-creator | Requires Claude-specific mechanisms (claude -p) |
| .skill packaging | Skill-creator | Distribution format, not evaluation |
| Execution-quality evaluation | AgentV | Richer environment: workspaces, multi-provider, code judges |
| Artifact format definition | Shared | AgentV produces supersets; skill-creator defines the baseline schema |
| Eval-viewer / review UI | Skill-creator (for now) | AgentV's #562/#563 dashboards will eventually supersede it |
Implementation constraint
Artifact schemas (#565) MUST be supersets of skill-creator's schemas — AgentV adds fields but never removes or renames skill-creator fields. This ensures skill-creator tooling can always read AgentV output.
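The superset constraint can be sketched in TypeScript. This is a minimal illustration, not the final schema: the baseline `expectations[].text/passed/evidence` shape and the `workspace_changes`/`evaluators` fields are taken from the E2E checklists in this issue, but their exact types here are assumptions.

```typescript
// Baseline grading schema as skill-creator tooling expects it
// (expectations[].text/passed/evidence, per this issue's checklist).
interface SkillCreatorExpectation {
  text: string;      // the assertion being graded
  passed: boolean;
  evidence: string;  // per-assertion structured evidence
}

interface SkillCreatorGrading {
  expectations: SkillCreatorExpectation[];
}

// AgentV's grading.json extends the baseline: fields are only ever
// added, never removed or renamed, so skill-creator tooling keeps parsing it.
interface AgentVGrading extends SkillCreatorGrading {
  workspace_changes?: string[];                    // files touched in the isolated workspace
  evaluators?: { name: string; score: number }[];  // per-evaluator breakdown
}

// A superset document is still valid baseline input: projecting away the
// AgentV-only fields recovers a baseline-shaped object.
function toBaseline(g: AgentVGrading): SkillCreatorGrading {
  return { expectations: g.expectations };
}
```

The `extends` relationship is the whole contract: any reader written against `SkillCreatorGrading` accepts an `AgentVGrading` document unchanged.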
Sub-Issues
Orchestration
| # | Title | Boundary | Priority |
|---|---|---|---|
| #573 | feat: unified skill-eval lifecycle skill (combine eval-orchestrator + optimizer) | external-first | Must |
| #572 | fix: disambiguate eval skill triggers from skill-creator | external-first | Must |
Agent/Capability Enhancements (feed into #573's phases)
| # | Title | Phase in #573 | Priority |
|---|---|---|---|
| #570 | feat: eval-judge with claims extraction, eval critique, evidence format | Phase 3: Grade | Must |
| #571 | feat: blind A/B comparison with dynamic rubrics | Phase 4: Compare | Should |
| #567 | feat: eval analyzer (deterministic-upgrade suggestions) | Phase 5: Analyze | Should |
| #568 | feat: human review checkpoint + feedback artifact | Phase 6: Review | Should |
Output & Documentation
| # | Title | Boundary | Priority |
|---|---|---|---|
| #565 | feat: skill-eval companion artifacts (grading, timing, benchmark) | external-first | Must |
| #564 | docs: canonical skill-improvement workflow guide | docs-examples | Must |
| #566 | docs: separate execution quality from trigger quality | docs-examples | Must |
Pre-existing related issues
| # | Title | Relationship |
|---|---|---|
| #562 | feat: self-contained HTML dashboard | Review surface for Phase 6; consumer of #565 artifacts |
| #563 | feat: self-hosted dashboard with history repo | Historical trends; consumer of #565 artifacts |
| #335 | feat: iteration tracking, cross-run regression | Complementary to #565 |
Architecture Boundary Summary
- All implementation is external-first or docs-only. No core runtime changes. Per CLAUDE.md's Lightweight Core principle, the JSONL output already contains all data needed for companion artifacts; the eval-judge and comparison enhancements are agent prompt changes; the lifecycle skill is orchestration-level work in `plugins/agentv-dev/`.
- No Python scripts. Skill-creator uses Python scripts for trigger-eval tooling; AgentV is TypeScript/Bun. Per AI-First Design, evaluation patterns belong in agents and skills, not scripts.
- History storage uses existing config. `.agentv/config.yaml` already supports a configurable `history.repo`.
- One lifecycle skill, not four. Combining eval-orchestrator + optimizer into one skill matches skill-creator's unified approach. `agentv-eval-builder` (test case creation) and `agentv-trace-analyst` (ad-hoc analysis) stay separate.
Skill-Creator Prompt Engineering Adoption
Adopt from skill-creator
| Pattern | Source | Target |
|---|---|---|
| Claims extraction and verification | grader.md | #570 → Phase 3 |
| Eval self-critique | grader.md | #570 → Phase 3 |
| Surface vs substance guards | grader.md | #570 → Phase 3 |
| Per-assertion structured evidence | grader.md | #570 → Phase 3 |
| User notes integration | grader.md | #570 → Phase 3 |
| Blind A/B comparison | comparator.md | #571 → Phase 4 |
| Dynamic rubric generation | comparator.md | #571 → Phase 4 |
| Post-comparison analysis | analyzer.md | #571 → Phase 4 |
| Benchmark pattern analysis | analyzer.md | #567 → Phase 5 |
| Deterministic-upgrade suggestions | analyzer.md | #567 → Phase 5 |
Keep from AgentV
| Pattern | Location | Why |
|---|---|---|
| SIMBA (self-introspective failure analysis) | optimizer-reflector | More structured than analyzer |
| GEPA (trace reflection) | optimizer-reflector | More formal diagnosis categories |
| Integrity checks (task ∉ evaluator configs) | optimizer-discovery | Unique to AgentV |
| Stagnation detection | optimizer-reflector | Unique to AgentV |
| Failure triage | optimizer-discovery | More granular |
Dependency Graph
#572 (trigger disambiguation) ─────────────────────────┐
#566 (exec vs trigger docs) ────────────────────────────┤
#564 (workflow guide) ──────────────────────────────────┤
#565 (artifact format) ─────────────────────────────────┤
│
#570 (eval-judge enhancement) ──┐ │
#571 (blind comparison) ────────┤ ├─ all complete = tracking done
#567 (analyzer) ────────────────┼─► #573 (combined skill)
#568 (review checkpoint) ───────┘ │
│
#573 (unified lifecycle skill) ─────────────────────────┘
- Wave 1 (fix: disambiguate agentv eval skill triggers from skill-creator #572, docs: separate execution quality from trigger quality in eval guidance #566, docs: canonical skill-improvement workflow guide #564, feat: skill-eval companion artifacts (grading, timing, benchmark) #565, feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570) — all independent
- Wave 2 (feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571, feat: eval analyzer pass for weak assertions and flaky scenarios #567, feat: human review checkpoint and feedback artifact for skill iteration #568) — benefit from Wave 1 but can start in parallel
- Wave 3 (feat: unified agent-evaluation lifecycle skill (combine eval-orchestrator + optimizer) #573) — orchestrates all agent enhancements into the unified SKILL.md. Can start in parallel with agent work but finalizes after agents are enhanced.
Parallel Waves
Wave 1 (independent — run in parallel)
- fix: disambiguate agentv eval skill triggers from skill-creator #572 — Disambiguate eval skill triggers (frontmatter-only)
- docs: separate execution quality from trigger quality in eval guidance #566 — Execution vs trigger quality docs (docs-only)
- docs: canonical skill-improvement workflow guide #564 — Canonical skill-improvement workflow guide (docs-only)
- feat: skill-eval companion artifacts (grading, timing, benchmark) #565 — Skill-eval companion artifacts (output format)
- feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570 — Eval-judge enhancement with skill-creator grading patterns (agent prompt)
Wave 2 (benefits from Wave 1)
- feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571 — Blind A/B comparison with dynamic rubrics (new agents)
- feat: eval analyzer pass for weak assertions and flaky scenarios #567 — Eval analyzer (enhanced optimizer-reflector)
- feat: human review checkpoint and feedback artifact for skill iteration #568 — Human review checkpoint + feedback artifact (docs + schema)
Wave 3 (orchestration)
- feat: unified agent-evaluation lifecycle skill (combine eval-orchestrator + optimizer) #573 — Unified lifecycle skill SKILL.md (combines eval-orchestrator + optimizer, references all enhanced agents)
Merge Order
- fix: disambiguate agentv eval skill triggers from skill-creator #572 (frontmatter-only)
- docs: separate execution quality from trigger quality in eval guidance #566 (docs-only)
- docs: canonical skill-improvement workflow guide #564 (docs-only)
- feat: human review checkpoint and feedback artifact for skill iteration #568 (docs + schema)
- feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570 (agent prompt — eval-judge)
- feat: skill-eval companion artifacts (grading, timing, benchmark) #565 (output format)
- feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571 (new agents — comparator + analyzer)
- feat: eval analyzer pass for weak assertions and flaky scenarios #567 (enhanced optimizer-reflector)
- feat: unified agent-evaluation lifecycle skill (combine eval-orchestrator + optimizer) #573 (unified SKILL.md — must merge after all agents are in place)
Subagent Operating Contract
- Read AgentV's CLAUDE.md and contribution guidelines before starting
- One issue per PR
- Follow existing agent structure in
plugins/agentv-dev/agents/ - Follow existing skill structure in
plugins/agentv-dev/skills/ - Follow existing
OutputWriterpattern for feat: skill-eval companion artifacts (grading, timing, benchmark) #565 (seeapps/cli/src/commands/eval/jsonl-writer.ts) - Follow existing Astro docs patterns for docs: canonical skill-improvement workflow guide #564 and docs: separate execution quality from trigger quality in eval guidance #566
- For agent changes (feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570), read the actual skill-creator source prompt for reference (linked in each issue)
- Tests included for output format changes (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)
- Docs changes should build without errors (
bun run docs:build) - Docs issues (docs: canonical skill-improvement workflow guide #564, docs: separate execution quality from trigger quality in eval guidance #566) should also update the
agentv-eval-builderskill reference card - Artifact storage should reference existing
.agentv/config.yamlhistory configuration
End-to-End Validation
After all issues are implemented, the following end-to-end scenario must work:
E2E Test: Full Skill Evaluation Lifecycle
Setup:
- A workspace with an existing skill (SKILL.md) and eval file (evals.json or EVAL.yaml)
- Both the `agentv-dev` plugin and `anthropics/skills` (skill-creator) loaded in the same session
- AgentV CLI installed and configured with at least one target
Scenario:
User: "Evaluate and improve my skill against evals/skill-quality.yaml"
Expected behavior:
- Claude triggers the AgentV unified lifecycle skill (not skill-creator) based on disambiguated triggers (fix: disambiguate agentv eval skill triggers from skill-creator #572)
- Phase 1 (Discovery): optimizer-discovery analyzes the eval, challenges assumptions, triages failures
- Phase 2 (Run): Runs baseline and candidate evaluations using `agentv eval run`
- Phase 3 (Grade): Enhanced eval-judge grades with per-assertion evidence, extracts claims, critiques eval quality, guards against surface compliance (feat: adopt skill-creator grading patterns in eval-judge (claims extraction, eval critique, evidence format) #570). Outputs `grading/<test-id>.json` (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)
- Phase 4 (Compare): Blind comparator evaluates A/B without labels, generates task-specific rubrics (feat: blind A/B comparison with dynamic rubrics and post-comparison analysis #571). Comparison analyzer unblinds and explains why the winner won
- Phase 5 (Analyze): Enhanced reflector identifies deterministic upgrade opportunities, flags weak assertions, detects flaky patterns (feat: eval analyzer pass for weak assertions and flaky scenarios #567). Outputs `benchmark.json` (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)
- Phase 6 (Review): Presents results to a human, who writes feedback.json (feat: human review checkpoint and feedback artifact for skill iteration #568)
- Phase 7 (Optimize): Curator applies surgical edits; polish generalizes them into principles
- Phase 8 (Re-run): Loops back to Phase 2 with the modified skill. Compares against the previous iteration
- Skill stabilizes. User is satisfied. `timing.json` written (feat: skill-eval companion artifacts (grading, timing, benchmark) #565)
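The Phase 8 re-run loop needs a stopping condition; AgentV's optimizer already names one (stagnation detection). As a sketch only, under the assumption that each iteration reduces to a single pass rate, a stagnation check might look like:

```typescript
// Sketch of stagnation detection for the re-run loop (Phase 8).
// All names are illustrative placeholders, not AgentV APIs.
// `history` holds one overall pass rate per completed iteration.
function shouldStop(history: number[], patience = 2): boolean {
  // Not enough iterations yet to judge stagnation.
  if (history.length <= patience) return false;
  // Best pass rate seen before the last `patience` iterations.
  const best = Math.max(...history.slice(0, history.length - patience));
  // Stagnant if none of the recent iterations beat that best.
  return history.slice(-patience).every(rate => rate <= best);
}
```

With the default `patience` of 2, the loop stops once two consecutive iterations fail to improve on an earlier best, which matches the "skill stabilizes" end state above.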
Verification checklist:
- Correct skill triggered (AgentV, not skill-creator)
- All 8 phases execute in order
- grading.json matches skill-creator schema (expectations[].text/passed/evidence)
- timing.json matches skill-creator schema
- benchmark.json matches skill-creator schema (run_summary with mean/stddev)
- Blind comparison produces winner/reasoning/rubric without revealing labels
- Analyzer suggests at least one deterministic assertion upgrade
- Human review checkpoint pauses for feedback
- Iteration loop improves pass rate
- feedback.json persists across iterations
- Execution quality and trigger quality are NOT conflated
E2E Test: Artifact Compatibility
Scenario: Generate AgentV artifacts and verify skill-creator's tooling can read them
- `grading/<test-id>.json` can be parsed by skill-creator's `eval-viewer/generate_review.py`
- `benchmark.json` structure matches skill-creator's `run_summary` format
- `timing.json` structure matches skill-creator's timing format
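The only `run_summary` fields this issue pins down are mean and stddev. A hedged sketch of how AgentV might derive that block from per-run pass rates (the `runs` field and function names are illustrative additions, consistent with the superset rule):

```typescript
// run_summary shape: mean/stddev come from this issue's checklist;
// `runs` is an illustrative AgentV-added field (add, never rename).
interface RunSummary {
  mean: number;
  stddev: number;
  runs: number;
}

// Compute run_summary statistics from per-run pass rates
// (population standard deviation over the observed runs).
function summarize(passRates: number[]): RunSummary {
  const n = passRates.length;
  const mean = passRates.reduce((a, b) => a + b, 0) / n;
  const variance = passRates.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  return { mean, stddev: Math.sqrt(variance), runs: n };
}
```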
E2E Test: Workspace-Based Agent Evaluation
Scenario: Evaluate an agent that modifies files in a cloned repository
Setup:
- An EVAL.yaml with workspace config (repo URL, setup script, teardown)
- Test cases that require the agent to read, modify, and create files
- Evaluators: code-judge (Python script checking file output), tool-trajectory, llm-judge
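A setup like the above might be expressed roughly as follows. Every key name in this sketch is an assumption for illustration, not AgentV's confirmed EVAL.yaml schema; only the concepts (workspace repo/setup/teardown, the three evaluator types) come from this issue.

```yaml
# Hypothetical EVAL.yaml sketch -- key names are illustrative assumptions.
workspace:
  repo: https://github.com/example/fixture-repo
  setup: ./scripts/setup.sh
  teardown: ./scripts/teardown.sh
cases:
  - id: modify-config
    prompt: "Read config.json, fix the invalid port, and write it back"
    evaluators:
      - type: code-judge          # Python script checking file output
        script: ./judges/check_output.py
      - type: tool-trajectory     # verifies Read -> analyze -> Write
      - type: llm-judge
```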
Expected behavior:
- Workspace cloned and setup script executed
- Agent runs in isolated workspace
- Code judge evaluates workspace file changes (not just text output)
- Tool trajectory verifies correct tool usage (Read → analyze → Write)
- Workspace teardown executed
- grading.json includes `workspace_changes` and `evaluators[]` with code-judge details
- benchmark.json includes per-evaluator breakdown (code-judge, tool-trajectory, llm-judge)
Verification:
- Workspace isolation works (agent runs in cloned repo, not user workspace)
- Code judge results appear in grading.json with structured details
- Tool trajectory scores appear in grading.json
- Workspace file changes tracked in grading.json
- benchmark.json has per-evaluator summary (not just overall pass rate)
E2E Test: Multi-Provider Comparison
Scenario: Compare 3 providers (Claude, GPT, Gemini) on the same eval
Expected behavior:
- Same eval runs against all 3 targets
- Blind comparison randomizes labels (A, B, C) for all pairs or round-robin
- benchmark.json has per-target breakdown with N-way statistics
- Dynamic rubrics generated per task, applied consistently across all providers
Verification:
- N-way blind comparison works (not just binary A/B)
- benchmark.json has entries for all 3 targets
- Comparison results include per-pair winner + overall ranking
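N-way blind comparison reduces to pairwise judging with anonymized, randomized labels. As a sketch only (names like `blindPairs` are illustrative, not AgentV APIs):

```typescript
// One blind pairwise comparison: the judge sees labels "A"/"B";
// the mapping back to real targets is kept for unblinding afterwards.
interface Pair {
  labelA: string;
  labelB: string;
  mapping: Record<string, string>;
}

// Enumerate all target pairs and randomize which side gets which label,
// so the judge never sees provider names and position bias averages out.
function blindPairs(targets: string[], rng: () => number = Math.random): Pair[] {
  const pairs: Pair[] = [];
  for (let i = 0; i < targets.length; i++) {
    for (let j = i + 1; j < targets.length; j++) {
      const swap = rng() < 0.5;
      const [a, b] = swap ? [targets[j], targets[i]] : [targets[i], targets[j]];
      pairs.push({ labelA: "A", labelB: "B", mapping: { A: a, B: b } });
    }
  }
  return pairs;
}
```

For 3 targets this yields 3 blind pairs; per-pair winners are then unblinded via `mapping` to produce the overall ranking.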
E2E Test: Trigger Disambiguation
Scenario: Both skill-creator and agentv-dev loaded. Test these prompts:
| Prompt | Should trigger |
|---|---|
| "Run evals on my skill using AgentV" | AgentV lifecycle skill |
| "Create a new skill from scratch" | skill-creator |
| "Optimize my skill's trigger description" | skill-creator |
| "Evaluate my agent against evals/quality.yaml" | AgentV lifecycle skill |
| "Benchmark skill performance with variance analysis" | Depends on context — should not be ambiguous |
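Since #572 is a frontmatter-only fix, disambiguation reduces to sharpening the skill's `description` trigger text. A hypothetical sketch of what the AgentV lifecycle skill's frontmatter might say (the wording below is an assumption, not the actual fix):

```yaml
# Hypothetical SKILL.md frontmatter for the AgentV lifecycle skill.
name: agentv-skill-eval-lifecycle
description: >
  Run, grade, compare, analyze, and optimize evaluations with the AgentV CLI
  (EVAL.yaml or evals.json): agentv eval run, workspace isolation, code judges,
  multi-provider comparison. Do NOT use for authoring a new SKILL.md or for
  optimizing a skill's trigger description -- that is skill-creator's job.
```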
Completion Criteria
All of the following are true:
- A unified lifecycle skill handles run → grade → compare → analyze → review → optimize → re-run
- eval-orchestrator is deprecated/merged into the lifecycle skill
- eval-judge extracts claims, critiques evals inline, guards against surface compliance
- Blind A/B comparison with dynamic rubrics is available
- Standalone analyzer suggests deterministic assertion upgrades
- Human review checkpoint with feedback artifact is documented and usable
- Skill-creator-compatible companion artifacts produced at relevant phases
- User-facing workflow guide documents the lifecycle
- Execution and trigger quality are documented as distinct concerns
- AgentV skills trigger correctly when skill-creator is also loaded
- E2E validation scenarios pass
Future Work (not in scope)
- Skill-trigger evaluation — now tracked in feat: skill-trigger evaluation + claude-cli provider (subprocess-based) #593 (promoted from future work)
- Trigger-description optimizer with repeated trials and held-out selection
- Skill marketplace or discovery system
- Packaging/distribution tooling for `.skill` bundles
- Multi-run variance detection (requires history repo — relates to feat(eval): iteration tracking, termination taxonomy, and cross-run regression detection #335 / feat: self-hosted dashboard — historical trends, dataset management, YAML editor #563)
- Python scripts or non-TypeScript tooling (per architecture principles)