Commit 57cda0c (parent f5150c8)

feat: structured data evaluators, execution metrics, and eval baselines

97 files changed: 6,103 additions & 369 deletions
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
---
"@agentv/core": minor
"agentv": minor
---

Add `field_accuracy`, `latency`, and `cost` evaluators

- `field_accuracy`: Compare structured data fields with exact, numeric_tolerance, or date matching
- `latency`: Check execution duration against threshold (uses traceSummary.durationMs)
- `cost`: Check execution cost against budget (uses traceSummary.costUsd)

See `examples/features/document-extraction/README.md` for usage examples.
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
---
"@agentv/core": minor
"agentv": minor
---

Add structured data and execution-metrics evaluators, normalize code-judge payloads, and ship refreshed eval examples with CI baselines.
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
---
"@agentv/core": minor
"agentv": minor
---

Add `token_usage` evaluator to gate on provider-reported token budgets.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
---
"@agentv/core": patch
---

Fix composite evaluators to pass through trace and output message context so trace-dependent evaluators (e.g. latency/cost/tool_trajectory) work when nested.
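To illustrate what "nested" means here, a minimal composite wrapping a trace-dependent check (a sketch; the evaluator names are made up):

```yaml
execution:
  evaluators:
    - name: gate
      type: composite
      evaluators:
        # relies on traceSummary.durationMs, which composites now forward
        - name: speed
          type: latency
          threshold: 2000
```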

.claude/skills/agentv-eval-builder/SKILL.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ description: Create and maintain AgentV YAML evaluation files for testing AI agents
 - Rubrics: `references/rubric-evaluator.md` - Structured criteria-based evaluation
 - Composite Evaluators: `references/composite-evaluator.md` - Combine multiple evaluators
 - Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
+- Structured Data + Metrics: `references/structured-data-evaluators.md` - `field_accuracy`, `latency`, `cost`
 - Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
 - Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
 - Compare: `references/compare-command.md` - Compare evaluation results between runs

.claude/skills/agentv-eval-builder/references/eval-schema.json

Lines changed: 10 additions & 6 deletions
@@ -4,11 +4,6 @@
   "description": "Schema for YAML evaluation files with conversation flows, multiple evaluators, and execution configuration",
   "type": "object",
   "properties": {
-    "$schema": {
-      "type": "string",
-      "description": "Schema identifier",
-      "enum": ["agentv-eval-v2"]
-    },
     "description": {
       "type": "string",
       "description": "Description of what this eval suite covers"
@@ -37,7 +32,16 @@
     },
     "type": {
       "type": "string",
-      "enum": ["code", "llm_judge"],
+      "enum": [
+        "code",
+        "llm_judge",
+        "composite",
+        "tool_trajectory",
+        "field_accuracy",
+        "latency",
+        "cost",
+        "token_usage"
+      ],
       "description": "Evaluator type: 'code' for scripts/regex/keywords, 'llm_judge' for LLM-based evaluation"
     },
     "script": {
Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# Structured Data + Metrics Evaluators

This reference covers the built-in evaluators used for grading structured outputs and gating on execution metrics:

- `field_accuracy`
- `latency`
- `cost`
- `token_usage`

## Ground Truth (`expected_messages`)

Put the expected structured output in the evalcase `expected_messages` (typically as the last `assistant` message with `content` as an object). Evaluators read expected values from there.

```yaml
evalcases:
  - id: invoice-001
    expected_messages:
      - role: assistant
        content:
          invoice_number: "INV-2025-001234"
          net_total: 1889
```
## `field_accuracy`

Use `field_accuracy` to compare fields in the candidate JSON against the ground-truth object in `expected_messages`.

```yaml
execution:
  evaluators:
    - name: invoice_fields
      type: field_accuracy
      aggregation: weighted_average
      fields:
        - path: invoice_number
          match: exact
          required: true
          weight: 2.0
        - path: invoice_date
          match: date
          formats: ["DD-MMM-YYYY", "YYYY-MM-DD"]
        - path: net_total
          match: numeric_tolerance
          tolerance: 1.0
```
### Match types

- `exact`: strict equality
- `date`: compares dates after parsing; optionally provide `formats`
- `numeric_tolerance`: numeric compare within `tolerance` (set `relative: true` for relative tolerance)

For fuzzy string matching, use a `code_judge` evaluator (e.g. Levenshtein) instead of adding a fuzzy mode to `field_accuracy`.
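The `code_judge` route suggested above can be sketched as follows. The scoring arithmetic is standard Levenshtein distance; how a judge script actually receives the expected and candidate strings depends on the code-judge payload contract and is left out here.

```typescript
// Edit distance between two strings (single-row dynamic programming).
function levenshtein(a: string, b: string): number {
  // dp[j] holds the distance between a.slice(0, i) and b.slice(0, j).
  const dp = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // dp[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1, // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Normalized similarity: 1.0 = identical, 0.0 = completely different.
function fuzzyScore(expected: string, actual: string): number {
  const maxLen = Math.max(expected.length, actual.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(expected, actual) / maxLen;
}
```

A judge script would emit `fuzzyScore(expected, actual)` (or a pass/fail against a threshold) as its score.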
### Aggregation

- `weighted_average` (default): weighted mean of field scores
- `all_or_nothing`: score 1.0 only if all graded fields pass
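The arithmetic behind the two modes, as a sketch (the field-result shape is illustrative, not the internal type, and it assumes an unspecified `weight` defaults to 1.0):

```typescript
interface FieldResult {
  score: number; // per-field score in [0, 1]
  weight: number; // assumed default 1.0 when omitted in YAML
}

// weighted_average: sum(score * weight) / sum(weight)
function weightedAverage(fields: FieldResult[]): number {
  const totalWeight = fields.reduce((s, f) => s + f.weight, 0);
  if (totalWeight === 0) return 0;
  return fields.reduce((s, f) => s + f.score * f.weight, 0) / totalWeight;
}

// all_or_nothing: 1.0 only if every graded field passed
function allOrNothing(fields: FieldResult[]): number {
  return fields.every((f) => f.score === 1) ? 1 : 0;
}
```

For instance, if in the `field_accuracy` example above only `invoice_number` (weight 2.0) passes and the other two fields fail, `weighted_average` gives 2.0 / 4.0 = 0.5 while `all_or_nothing` gives 0.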
## `latency` and `cost`

These evaluators gate on execution metrics reported by the provider (via `traceSummary`).

```yaml
execution:
  evaluators:
    - name: performance
      type: latency
      threshold: 2000
    - name: budget
      type: cost
      budget: 0.10
```
## `token_usage`

Gate on provider-reported token usage (useful when cost is unavailable or model pricing differs).

```yaml
execution:
  evaluators:
    - name: token-budget
      type: token_usage
      max_total: 10000
      # or:
      # max_input: 8000
      # max_output: 2000
```
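The gating logic is a straightforward comparison of reported usage against whichever limits are configured; a sketch follows, where the budget keys mirror the YAML above and the usage shape is an assumption about what providers report:

```typescript
interface TokenUsage {
  input: number; // prompt tokens, as reported by the provider
  output: number; // completion tokens
}

interface TokenBudget {
  max_total?: number;
  max_input?: number;
  max_output?: number;
}

// Pass only if every configured limit is respected.
function withinTokenBudget(usage: TokenUsage, budget: TokenBudget): boolean {
  const total = usage.input + usage.output;
  if (budget.max_total !== undefined && total > budget.max_total) return false;
  if (budget.max_input !== undefined && usage.input > budget.max_input) return false;
  if (budget.max_output !== undefined && usage.output > budget.max_output) return false;
  return true;
}
```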
## Common pattern: combine correctness + gates

Use a `composite` evaluator if you want a single “release gate” score/verdict from multiple checks:

```yaml
execution:
  evaluators:
    - name: release_gate
      type: composite
      evaluators:
        - name: correctness
          type: field_accuracy
          fields:
            - path: invoice_number
              match: exact
        - name: latency
          type: latency
          threshold: 2000
        - name: cost
          type: cost
          budget: 0.10
        - name: tokens
          type: token_usage
          max_total: 10000
      aggregator:
        type: weighted_average
        weights:
          correctness: 0.8
          latency: 0.1
          cost: 0.05
          tokens: 0.05
```

apps/cli/src/commands/compare/index.ts

Lines changed: 1 addition & 4 deletions
@@ -52,10 +52,7 @@ export function loadJsonlResults(filePath: string): EvalResult[] {
     .filter((line) => line.trim());

   return lines.map((line) => {
-    const record = JSON.parse(line) as {
-      eval_id?: string;
-      score?: number;
-    };
+    const record = JSON.parse(line) as { eval_id?: string; score?: number };
     if (typeof record.eval_id !== 'string') {
       throw new Error(`Missing eval_id in result: ${line}`);
     }
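For context, the JSONL consumed here is one result object per line with at least `eval_id` and `score`. A minimal sketch of that parsing contract (sample values invented):

```typescript
// Two result lines in the shape loadJsonlResults expects.
const sample = [
  '{"eval_id":"invoice-001","score":0.92}',
  '{"eval_id":"invoice-002","score":0.5}',
].join('\n');

// Mirrors the function above: skip blank lines, narrow to the fields it checks.
const results = sample
  .split('\n')
  .filter((line) => line.trim())
  .map((line) => {
    const record = JSON.parse(line) as { eval_id?: string; score?: number };
    if (typeof record.eval_id !== 'string') {
      throw new Error(`Missing eval_id in result: ${line}`);
    }
    return record;
  });
```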

apps/cli/src/templates/.claude/skills/agentv-eval-builder/SKILL.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ description: Create and maintain AgentV YAML evaluation files for testing AI agents
 - Rubrics: `references/rubric-evaluator.md` - Structured criteria-based evaluation
 - Composite Evaluators: `references/composite-evaluator.md` - Combine multiple evaluators
 - Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
+- Structured Data + Metrics: `references/structured-data-evaluators.md` - `field_accuracy`, `latency`, `cost`
 - Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
 - Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
 - Compare: `references/compare-command.md` - Compare evaluation results between runs

apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json

Lines changed: 10 additions & 6 deletions
@@ -4,11 +4,6 @@
   "description": "Schema for YAML evaluation files with conversation flows, multiple evaluators, and execution configuration",
   "type": "object",
   "properties": {
-    "$schema": {
-      "type": "string",
-      "description": "Schema identifier",
-      "enum": ["agentv-eval-v2"]
-    },
     "description": {
       "type": "string",
       "description": "Description of what this eval suite covers"
@@ -37,7 +32,16 @@
     },
     "type": {
       "type": "string",
-      "enum": ["code", "llm_judge"],
+      "enum": [
+        "code",
+        "llm_judge",
+        "composite",
+        "tool_trajectory",
+        "field_accuracy",
+        "latency",
+        "cost",
+        "token_usage"
+      ],
       "description": "Evaluator type: 'code' for scripts/regex/keywords, 'llm_judge' for LLM-based evaluation"
     },
     "script": {
