Closed
Objective
Rename all user-facing "judge" terminology to "grader" across the agentv codebase, adopting the three-layer evaluation taxonomy:
| Layer | Term | Role | Example |
|---|---|---|---|
| Config | Assertion | What to check (YAML declaration) | type: llm-grader, type: contains |
| Engine | Evaluator | How to dispatch (runtime interface) | CodeEvaluator, LlmEvaluator |
| Scoring | Grader | Who scores (script or LLM) | .agentv/graders/format-check.ts |
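As a rough illustration of how the three layers relate, here is a minimal TypeScript sketch. Every name in it (`Assertion`, `Grader`, `SketchEvaluator`, the `script` field) is hypothetical and chosen for this example; it is not agentv's actual type surface.

```typescript
// Hypothetical shapes for illustration only; not agentv's real types.

// Config layer: an assertion declares WHAT to check.
type Assertion =
  | { type: "contains"; value: string }
  | { type: "code-grader"; script: string };

// Scoring layer: a grader is WHO scores (a script or an LLM).
// Here it is modeled as a plain function returning 0 or 1.
type Grader = (output: string) => number;

// Engine layer: an evaluator decides HOW to dispatch an assertion.
interface Evaluator {
  evaluate(assertion: Assertion, output: string): number;
}

class SketchEvaluator implements Evaluator {
  private graders: Map<string, Grader>;

  constructor(graders: Map<string, Grader>) {
    this.graders = graders;
  }

  evaluate(assertion: Assertion, output: string): number {
    switch (assertion.type) {
      case "contains":
        // Deterministic assertion: no grader is involved.
        return output.includes(assertion.value) ? 1 : 0;
      case "code-grader": {
        // Scoring is delegated to the grader registered under this name.
        const grader = this.graders.get(assertion.script);
        if (!grader) throw new Error(`unknown grader: ${assertion.script}`);
        return grader(output);
      }
    }
  }
}

// Usage: register one grader and dispatch two assertions through the engine.
const graders = new Map<string, Grader>([
  ["format-check.ts", (out) => (out.trim().startsWith("{") ? 1 : 0)],
]);
const evaluator = new SketchEvaluator(graders);
```

The point of the split is that `contains` never reaches the scoring layer, while `code-grader` always does, which is why only the latter takes the "grader" name.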
Why
- Pipeline alignment: AgentV transpiles EVAL.yaml to agentskills evals.json. Both agentskills and skill-creator use "grader." The downstream consumer already chose the term.
- No framework uses "judge" as primary term: across 14 eval frameworks studied, the primary terms were Metric (4), Scorer (4), Grader (3), Evaluator (2), Eval (1). "Judge" is always secondary or informal.
- Internal inconsistency: agentv's own codebase uses `Evaluator` as the canonical interface (all classes are `*Evaluator`), but user-facing config says `code-judge`/`llm-judge`.
- Semantic mismatch: the "LLM-as-a-Judge" paper scoped "judge" to LLMs specifically, making `code-judge` semantically wrong for a TypeScript script.
Scope
User-facing renames
| Before | After |
|---|---|
| `type: code-judge` | `type: code-grader` |
| `type: llm-judge` | `type: llm-grader` |
| `.agentv/judges/` | `.agentv/graders/` |
| `judge_target` (targets.yaml) | `grader_target` |
| `--judge-target` (CLI flag) | `--grader-target` |
| `defineCodeJudge()` (SDK) | `defineCodeGrader()` |
Internal renames
| Before | After |
|---|---|
| `LlmJudgeEvaluator` | `LlmGraderEvaluator` → `LlmEvaluator` |
| `LlmJudgeEvaluatorConfig` | `LlmGraderEvaluatorConfig` → `LlmEvaluatorConfig` |
| `judge-discovery.ts` | `grader-discovery.ts` |
| `llm-judge.ts` | `llm-grader.ts` |
| `discoverJudges()` | `discoverGraders()` |
Backward compatibility
Accept old names with deprecation warnings (same pattern as assert: → assertions: in #604):
- `type: code-judge` → accepted, warns "use `code-grader`"
- `type: llm-judge` → accepted, warns "use `llm-grader`"
- `.agentv/judges/` → still discovered alongside `.agentv/graders/`, warns once
- `judge_target` → accepted in targets.yaml, normalized to `grader_target`
- `defineCodeJudge()` → re-exported as deprecated alias
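One possible shape for the name normalization is sketched below. The helper names (`normalizeAssertionType`, `normalizeTarget`, `warnOnce`) and the warning text are assumptions made for this example, not existing agentv functions; letting an explicit `grader_target` win over a legacy `judge_target` is also an assumption.

```typescript
// Hypothetical helpers; names and warning wording are placeholders.

type Log = (message: string) => void;

// Deprecated assertion type -> replacement, per the rename table.
const DEPRECATED_TYPES = new Map<string, string>([
  ["code-judge", "code-grader"],
  ["llm-judge", "llm-grader"],
]);

const warned = new Set<string>();

// Emit each deprecation warning at most once per process.
function warnOnce(key: string, message: string, log: Log): void {
  if (warned.has(key)) return;
  warned.add(key);
  log(message);
}

// Accept an old type name, warn, and return the new name.
function normalizeAssertionType(type: string, log: Log = console.warn): string {
  const replacement = DEPRECATED_TYPES.get(type);
  if (replacement === undefined) return type;
  warnOnce(type, `"${type}" is deprecated; use "${replacement}"`, log);
  return replacement;
}

// Accept `judge_target` in a targets.yaml entry and normalize the key.
function normalizeTarget(
  raw: Record<string, unknown>,
  log: Log = console.warn,
): Record<string, unknown> {
  if (!("judge_target" in raw)) return raw;
  const { judge_target, ...rest } = raw;
  warnOnce("judge_target", `"judge_target" is deprecated; use "grader_target"`, log);
  // Assumption: if both keys are present, the explicit new key in `rest` wins.
  return { grader_target: judge_target, ...rest };
}
```

Running normalization once at load time would keep the rest of the pipeline free of the old names, so only the loaders need to know about the deprecation.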
Unchanged
- `assertions:` YAML key (already correct)
- `agentv eval assert` CLI command (config-layer verb, not scoring-layer)
- `Evaluator` interface and `EvaluatorRegistry` (already correct)
- Deterministic assertion types (`contains`, `regex`, `equals`; these are not graders)
Implementation plan
16 tasks in dependency order: types → evaluators → schemas → discovery → registry → orchestrator → targets → loaders → SDK → CLI → tests → examples → docs → deprecation → validation.
Full plan: agentevals-research/docs/plans/2026-03-15-eval-taxonomy-plan.md
Acceptance signals
- All tests pass with new `grader` terminology
- Old `judge` names accepted with deprecation warnings
- `agentv eval assert` works with `.agentv/graders/` directory
- CLI `--grader-target` replaces `--judge-target`
- Docs updated across all MDX pages
- Examples updated across all EVAL.yaml files
Non-goals
- Renaming `EvaluatorResult` or other internal result types
- Changing the `agentv eval assert` CLI command name
- Renaming deterministic assertion types
- Cross-repo changes (agentskills already uses "grader")
Design latitude
- Exact deprecation warning wording is flexible
- File rename order can be adjusted if it helps avoid intermediate breakage
- Whether `.agentv/assertions/` merges into `.agentv/graders/` can be deferred
Research
Based on cross-framework taxonomy research of 14 eval frameworks:
| Framework | Primary Term |
|---|---|
| Promptfoo | Grader |
| agentskills | Grader |
| skill-creator | Grader |
| DeepEval | Metric |
| RAGAS | Metric |
| TruLens | Metric |
| lm-eval-harness | Metric |
| Braintrust | Scorer |
| Mastra | Scorer |
| inspect-ai | Scorer |
| convex-evals | Scorer |
| LangWatch | Evaluator |
| Arize Phoenix | Evaluator |
| OpenAI Evals | Eval |
Related
- refactor: rename assert: to assertions: in EVAL.yaml schema (#603) #604: rename `assert:` → `assertions:` (same deprecation pattern)
- tracking: Anthropic skill-creator eval framework alignment #569: skill-creator eval framework alignment (tracking)
- feat: unify llm-judge and agent-judge, add agentv provider #617: unify llm-judge and agent-judge