Summary
Self-hosted local dashboard (agentv serve) for historical trend analysis, progressive live visualization, and dataset browsing. Complements the static HTML report (#562) by adding persistence, aggregation across time, and a live view that updates as results stream in.
Schema alignment: Internal snapshot structure aligns with Anthropic's skill-creator conventions (grading.json, timing.json, benchmark.json, history.json) while storing results in a separate history repo instead of inline with the project.
Problem
The static HTML report (#562) covers single-run visualization with meta-refresh. But teams need to:
- Spot trends — Is our agent getting better or worse over time? Which evaluators are regressing?
- Look up historical data — What was the pass rate last week? When did this test start failing?
- Aggregate across runs — Compare Tuesday's run against Friday's across multiple targets
- Watch results progressively — See the dashboard populate in real-time as each test completes, not just on page reload
Prerequisite: Result snapshot structure
Storage location: separate history repo
Results stored in a separate git repository — not the project repo (avoids git history bloat) and not inline like skill-creator (eval results are larger and more frequent than skill iteration artifacts).
Configured in .agentv/config.yaml (extends the existing config schema):
```yaml
$schema: agentv-config-v2
guideline_patterns:
  - "**/*.instructions.md"
execution:
  verbose: false
  trace_file: .agentv/results/trace-{timestamp}.jsonl
# History repo for eval result snapshots
history:
  repo: ../myproject-eval-results  # Path to the history repo (required for dashboard)
  auto_save: true                  # Automatically snapshot every run (default: true)
  retention:
    max_runs: 500                  # Max snapshots to keep (optional, default: unlimited)
```
The `history.repo` path is relative to the project root. The history repo is a plain git repo — users create it themselves (`git init ../myproject-eval-results`).
Proposed snapshot layout (aligned with skill-creator)
```
myproject-eval-results/              # Separate git repo
  history.json                       # Campaign/iteration tracking (skill-creator aligned)
  index.json                         # Fast lookup index for dashboard
  runs/
    2026-03-14T10-32-00_claude/      # Timestamped run folder
      meta.json                      # Run metadata
      results.jsonl                  # Eval results (existing format)
      trace.jsonl                    # Execution trace: LLM calls, tool calls, latency
      otel.json                      # OpenTelemetry spans (optional)
      eval-snapshot.yaml             # Frozen copy of EVAL.yaml used for this run
      grading/                       # Per-test grading detail (skill-creator aligned)
        test-feature-alpha.json      # grading.json per test case
        test-retrieval-basic.json
      timing.json                    # Wall clock timing (skill-creator aligned)
      benchmark.json                 # Aggregate stats (skill-creator aligned)
    2026-03-14T11-45-00_gpt4o/
      meta.json
      results.jsonl
      trace.jsonl
      ...
```
Schema alignment with skill-creator
AgentV adopts skill-creator's JSON schemas where they overlap, extending them for eval-specific needs:
history.json — Campaign/iteration tracking
Adapted from skill-creator's history.json. Tracks related runs as iterations in a campaign (e.g., "improving retrieval quality").
```json
{
  "campaigns": [
    {
      "name": "optimize-retrieval",
      "started_at": "2026-03-14T10:30:00Z",
      "current_best": "2026-03-15T09-00-00_claude",
      "iterations": [
        {
          "run_id": "2026-03-14T10-32-00_claude",
          "parent": null,
          "pass_rate": 0.65,
          "result": "baseline",
          "is_current_best": false
        },
        {
          "run_id": "2026-03-15T09-00-00_claude",
          "parent": "2026-03-14T10-32-00_claude",
          "pass_rate": 0.85,
          "result": "won",
          "is_current_best": true
        }
      ]
    }
  ]
}
```
Skill-creator alignment: Same `iterations[]` structure with `parent`, `pass_rate`, `is_current_best`. Extended with a `campaigns[]` wrapper since AgentV tracks multiple improvement efforts, not just one skill.
timing.json — Per-run timing
Matches skill-creator's schema:
```json
{
  "total_tokens": 84852,
  "duration_ms": 45200,
  "total_duration_seconds": 45.2,
  "token_usage": {
    "input": 52000,
    "output": 18000
  }
}
```
grading/<test-id>.json — Per-test grading detail
Matches skill-creator's grading.json schema:
```json
{
  "expectations": [
    {
      "text": "Response correctly identifies the root cause",
      "passed": true,
      "evidence": "Output states 'The root cause is the missing null check in line 42'"
    },
    {
      "text": "Response suggests a fix with code example",
      "passed": false,
      "evidence": "Output identifies the issue but provides no code fix"
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 1,
    "total": 2,
    "pass_rate": 0.50
  },
  "execution_metrics": {
    "tool_calls": { "Read": 3, "Bash": 2, "Write": 1 },
    "total_tool_calls": 6,
    "errors_encountered": 0
  }
}
```
Skill-creator alignment: Same `expectations[].text/passed/evidence` fields, same `summary` structure, same `execution_metrics` shape. This means skill-creator's viewer and aggregation scripts can read AgentV's grading output directly.
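To make the shared schema concrete, here is a minimal sketch of the kind of aggregation an external script could do over this format. The helper name is hypothetical; the field names follow the grading schema above.

```javascript
// Sketch: recompute a grading.json "summary" block from its expectations[].
// summarizeGrading is an illustrative helper, not part of AgentV or skill-creator.
function summarizeGrading(grading) {
  const passed = grading.expectations.filter((e) => e.passed).length;
  const total = grading.expectations.length;
  return {
    passed,
    failed: total - passed,
    total,
    pass_rate: total === 0 ? 0 : passed / total,
  };
}

const grading = {
  expectations: [
    { text: "identifies root cause", passed: true, evidence: "..." },
    { text: "suggests a fix", passed: false, evidence: "..." },
  ],
};
console.log(summarizeGrading(grading));
// { passed: 1, failed: 1, total: 2, pass_rate: 0.5 }
```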
benchmark.json — Run aggregate
Matches skill-creator's schema:
```json
{
  "metadata": {
    "eval_file": "evals/eval.yaml",
    "timestamp": "2026-03-14T10:32:00Z",
    "targets": ["claude-3-5-sonnet"],
    "tests_run": ["test-feature-alpha", "test-retrieval-basic"]
  },
  "run_summary": {
    "claude-3-5-sonnet": {
      "pass_rate": { "mean": 0.85, "stddev": 0.05 },
      "time_seconds": { "mean": 45.0, "stddev": 12.0 },
      "tokens": { "mean": 3800, "stddev": 400 }
    }
  },
  "notes": []
}
```
Skill-creator alignment: Same `run_summary` structure with `mean`/`stddev` per metric. Uses target names instead of `with_skill`/`without_skill` since AgentV compares across targets, not skill presence.
meta.json — Run metadata
AgentV-specific (no skill-creator equivalent):
```json
{
  "runId": "2026-03-14T10-32-00_claude",
  "timestamp": "2026-03-14T10:32:00Z",
  "status": "completed",
  "evalFile": "evals/eval.yaml",
  "targets": ["claude-3-5-sonnet"],
  "campaign": "optimize-retrieval",
  "git": {
    "branch": "main",
    "commit": "abc1234",
    "dirty": false
  },
  "artifacts": {
    "results": "results.jsonl",
    "trace": "trace.jsonl",
    "otel": "otel.json",
    "eval_snapshot": "eval-snapshot.yaml",
    "timing": "timing.json",
    "benchmark": "benchmark.json"
  },
  "tags": ["nightly", "v2.1"]
}
```
eval-snapshot.yaml — Frozen eval config
A copy of the EVAL.yaml that was used for this run. Enables the dashboard to show exactly what was evaluated, even if the eval config has since changed. No skill-creator equivalent, but follows the same principle as skill-creator's skill-snapshot/ directory.
index.json — Fast lookup
Rebuilt on startup or incrementally updated. Contains one entry per run with summary stats from meta.json + benchmark.json. Enables the dashboard to render runs list and trend charts without parsing every JSONL.
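The issue does not pin down the index entry shape. A plausible sketch, assuming one entry per run with fields lifted from meta.json and benchmark.json (all field names here are illustrative, not a committed schema):

```json
{
  "version": 1,
  "runs": [
    {
      "run_id": "2026-03-14T10-32-00_claude",
      "timestamp": "2026-03-14T10:32:00Z",
      "targets": ["claude-3-5-sonnet"],
      "campaign": "optimize-retrieval",
      "pass_rate": 0.85,
      "duration_seconds": 45.2,
      "tags": ["nightly", "v2.1"],
      "git_commit": "abc1234"
    }
  ]
}
```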
Workflow
```bash
# One-time setup
git init ../myproject-eval-results

# Configure in .agentv/config.yaml
#   history:
#     repo: ../myproject-eval-results

# Every eval run auto-saves all artifacts to the history repo
agentv eval run -f eval.yaml
# → writes results.jsonl, trace.jsonl, timing.json, grading/, benchmark.json, meta.json

# Group runs into a campaign for iteration tracking
agentv eval run -f eval.yaml --campaign optimize-retrieval

# View dashboard
agentv serve   # Reads history.repo from config.yaml

# Manage history
agentv results reindex                  # Rebuild index.json
agentv results prune --older-than 30d   # Clean up old runs

# Share with team
cd ../myproject-eval-results && git add . && git commit -m "eval run 2026-03-14" && git push
```
Progressive visualization
The core differentiator over the static HTML report. The dashboard should feel alive during a run.
How it works
- `agentv eval run` writes results and traces as each test completes — also writes to the history repo if configured
- `agentv serve` watches the active JSONL files via `node:fs.watch`
- Server pushes new results to the browser via Server-Sent Events (SSE)
- Dashboard updates incrementally — no full page reload
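One fiddly detail of tailing an actively written results.jsonl is that file-watch events can fire mid-line. A minimal sketch of the buffering this implies — `createJsonlTail` is an assumed helper name, not AgentV's actual API:

```javascript
// Sketch: incremental JSONL reader for a file that is being appended to.
// Buffers a trailing partial line until a later write completes it, so only
// whole JSON records are ever parsed and pushed to SSE clients.
function createJsonlTail() {
  let partial = "";
  return function push(chunk) {
    const text = partial + chunk;
    const lines = text.split("\n");
    partial = lines.pop(); // last element is "" or an incomplete line
    return lines.filter((l) => l.trim() !== "").map((l) => JSON.parse(l));
  };
}

const tail = createJsonlTail();
tail('{"test":"a","passed":true}\n{"test":"b",'); // second record incomplete, buffered
const records = tail('"passed":false}\n');
console.log(records); // [ { test: 'b', passed: false } ]
```

In the real server, `push` would be fed the bytes read since the last `fs.watch` event for the active run's results.jsonl.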
What "progressive" looks like
| Phase | Dashboard state |
|---|---|
| Run starts (0 results) | Stat cards show 0/N, empty table, "Running..." indicator |
| First result arrives | First row appears in table, stat cards update, first data point on charts |
| Results streaming in | Table grows, pass rate bar animates, score distribution builds up |
| Run completes | "Running..." → "Complete", final stats, timing.json + benchmark.json written |
Progressive elements per view
- Stat cards: Counters animate as results arrive (passed: 0 → 1 → 2...)
- Test cases table: New rows append at top with subtle highlight animation
- Score distribution histogram: Bars grow as more data points arrive
- Pass rate bar: Width animates from 0% toward current rate
- Category breakdown: Categories appear as their first test completes
- Trace tree: Nodes appear as their corresponding tests finish
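The stat-card behavior above amounts to folding each incoming result into a running summary. A sketch, assuming a `status` field with `passed`/`failed`/error values (an assumption for illustration, not AgentV's confirmed result schema):

```javascript
// Sketch: fold one streamed result into the running stats the stat cards show.
// Pure function so each SSE event maps old state to new state.
function applyResult(stats, result) {
  const next = { ...stats, completed: stats.completed + 1 };
  if (result.status === "passed") next.passed += 1;
  else if (result.status === "failed") next.failed += 1;
  else next.errors += 1;
  next.passRate = next.passed / next.completed;
  return next;
}

let stats = { completed: 0, passed: 0, failed: 0, errors: 0, passRate: 0 };
for (const r of [{ status: "passed" }, { status: "failed" }, { status: "passed" }]) {
  stats = applyResult(stats, r); // counters animate: passed 0 → 1 → 1 → 2
}
console.log(stats.passRate.toFixed(2)); // "0.67"
```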
Dashboard features
agentv serve
```bash
agentv serve                                  # Reads history.repo from .agentv/config.yaml
agentv serve --port 8080                      # Custom port (default: 4747)
agentv serve --results-dir ./custom-results   # Override config
```
Minimal `node:http` — zero external dependencies.
View: Historical Trends
- Line chart: Pass rate over time (one line per target)
- Score trend: Average evaluator score over time
- Regression detection: Highlight runs where pass rate dropped significantly
- Campaign view: Group runs by campaign, show iteration progression (from history.json)
- Filter by: target, evaluator, tag, date range, git branch, campaign
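The issue does not define what a "significant" drop means. One simple interpretation, sketched here with illustrative names and an arbitrary 5-point threshold, flags any run whose pass rate fell by more than the threshold versus the previous run:

```javascript
// Sketch: flag runs whose pass rate dropped by more than `threshold` relative
// to the chronologically previous run. Threshold choice is an assumption.
function findRegressions(runs, threshold = 0.05) {
  const sorted = [...runs].sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  const regressions = [];
  for (let i = 1; i < sorted.length; i++) {
    const delta = sorted[i].pass_rate - sorted[i - 1].pass_rate;
    if (delta < -threshold) regressions.push({ run_id: sorted[i].run_id, delta });
  }
  return regressions;
}

const runs = [
  { run_id: "r1", timestamp: "2026-03-14T10:00:00Z", pass_rate: 0.9 },
  { run_id: "r2", timestamp: "2026-03-15T10:00:00Z", pass_rate: 0.7 },
  { run_id: "r3", timestamp: "2026-03-16T10:00:00Z", pass_rate: 0.72 },
];
console.log(findRegressions(runs).map((r) => r.run_id)); // [ 'r2' ]
```

A production version would likely compare per target and use `stddev` from benchmark.json rather than a fixed threshold.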
View: Run History
- Table of all historical runs: timestamp, target, campaign, pass rate bar, passed/failed/error, duration, tags, git commit
- Sort by any column
- Click row → detail views (same as #562: test cases, traces, comparison)
- Multi-select runs for side-by-side comparison
View: Live Run (progressive)
- Same layout as the run detail view from #562
- Updates in real-time via SSE as results stream in
- Visual indicator: pulsing dot or progress bar showing N/total completed
- Auto-scrolls to show newest results (with option to pin scroll position)
View: Trace Explorer
Powered by trace.jsonl from the snapshot:
- Collapsible tree view of test execution → LLM calls → tool calls
- Per-node details: prompts, responses, token counts, latency
- Score overlay on trace nodes
- Color coding: green ≥90%, yellow ≥50%, red <50%
- Filter traces by test ID, target, or score range
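The color-coding rule above is small enough to state directly in code. A sketch, assuming scores are normalized to 0..1 (the function name is illustrative):

```javascript
// Sketch of the trace-node thresholds: green >= 90%, yellow >= 50%, red below.
function scoreColor(score) {
  if (score >= 0.9) return "green";
  if (score >= 0.5) return "yellow";
  return "red";
}

console.log([0.95, 0.6, 0.2].map(scoreColor)); // [ 'green', 'yellow', 'red' ]
```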
View: Dataset Browser
- Read-only view of test cases from EVAL.yaml and referenced dataset files
- Table with: test ID, input preview, expected output preview, evaluators, tags
- Expand row to see full input/output
- Filter by tag, evaluator
- Import: upload CSV/JSON/JSONL (writes to dataset file on disk)
- No inline editing — users edit YAML directly or via AI assistant
View: Cross-Run Comparison
- Select any 2+ historical runs (or iterations within a campaign)
- Regression table: tests that flipped pass→fail (red) or fail→pass (green)
- Score delta with ↑/↓ arrows
- Summary: improvements / regressions / unchanged
- Campaign iteration comparison: show progression across iterations
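The regression-table logic above can be sketched as a diff over two runs' per-test outcomes. The `{ testId: boolean }` input shape and function name are assumptions for illustration:

```javascript
// Sketch: classify tests that flipped between two runs.
// pass → fail = regressed (red), fail → pass = improved (green).
function compareRuns(before, after) {
  const diff = { regressed: [], improved: [], unchanged: [] };
  for (const testId of Object.keys(after)) {
    if (!(testId in before)) continue; // test added since the earlier run; skip
    if (before[testId] && !after[testId]) diff.regressed.push(testId);
    else if (!before[testId] && after[testId]) diff.improved.push(testId);
    else diff.unchanged.push(testId);
  }
  return diff;
}

const before = { "test-a": true, "test-b": false, "test-c": true };
const after = { "test-a": false, "test-b": true, "test-c": true };
console.log(compareRuns(before, after));
// { regressed: [ 'test-a' ], improved: [ 'test-b' ], unchanged: [ 'test-c' ] }
```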
Architecture
Server
```
agentv serve
├─ GET  /                            → SPA (bundled HTML/CSS/JS)
├─ GET  /api/runs                    → index.json (run list)
├─ GET  /api/runs/:runId             → meta.json + benchmark.json
├─ GET  /api/runs/:runId/results     → Streamed results.jsonl parsing
├─ GET  /api/runs/:runId/traces      → Streamed trace.jsonl parsing
├─ GET  /api/runs/:runId/grading/:test → grading/<test>.json
├─ GET  /api/campaigns               → history.json (campaign list)
├─ GET  /api/trends                  → Aggregated time-series data
├─ GET  /api/live                    → SSE stream (new results + traces)
├─ GET  /api/datasets                → Test cases from EVAL.yaml / dataset files
└─ POST /api/datasets/import         → Import CSV/JSON/JSONL as test cases
```
- `node:http` + `node:fs` + `node:fs.watch` — zero dependencies
- SSE endpoint watches active JSONL files and pushes `data: {result JSON}\n\n` on each new line
- SPA bundled into the npm package (no separate install)
- Reads `history.repo` from `.agentv/config.yaml` to locate results
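The wire format for the live endpoint follows the standard SSE framing: each event is a `data:` line terminated by a blank line. A minimal sketch of the formatter (the helper name is illustrative):

```javascript
// Sketch: serialize one eval result as a Server-Sent Events frame.
// Per the SSE format, "data: <payload>" followed by a blank line ends the event;
// JSON.stringify guarantees the payload itself contains no raw newlines.
function sseFrame(result) {
  return `data: ${JSON.stringify(result)}\n\n`;
}

process.stdout.write(sseFrame({ test: "test-a", passed: true }));
```

On the server, each frame would be written to every open `/api/live` response as new JSONL lines arrive; the browser's `EventSource` parses the framing automatically.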
Frontend
- Vanilla JS or minimal vendored lib (must bundle into npm package)
- Reuses visualization components from #562
- SSE client: `new EventSource('/api/live')` — native browser API, no library needed
Acceptance signals
- `history.repo` config in `.agentv/config.yaml` points to a separate results repo
- `agentv eval run` auto-saves all artifacts to the history repo when configured
- Snapshot includes: results.jsonl, trace.jsonl, otel.json, timing.json, grading/*.json, benchmark.json, eval-snapshot.yaml, meta.json
- `grading.json` schema matches skill-creator's `expectations[].text/passed/evidence` format
- `timing.json` schema matches skill-creator's format
- `benchmark.json` schema matches skill-creator's `run_summary` with `mean`/`stddev` format
- `history.json` tracks campaigns with iteration progression (aligned with skill-creator's `history.json`)
- `agentv serve` starts a local HTTP server with zero additional dependencies
- Progressive visualization: dashboard updates in real-time via SSE
- Historical trend charts with campaign grouping
- Trace explorer with collapsible tree from trace.jsonl
- Cross-run and cross-campaign comparison
- Dataset browser with filtering and import
- `index.json` enables fast aggregation without parsing all JSONL files
- `agentv results reindex` rebuilds the index
- Works with existing JSONL results (backwards compatible)
Non-goals
- Cloud hosting or multi-user auth
- YAML editing UI (users edit YAML directly or via AI assistant)
- Eval config form builder
- Real-time collaboration
- Database backend (SQLite, Convex, etc.) — filesystem only
- Replacing the CLI workflow — dashboard is complementary
- Blind A/B comparison (skill-creator's comparator.md — future enhancement)
Related
- feat: self-contained HTML dashboard with meta-refresh (runs, benchmarks, traces) #562 — Static HTML report (single-run, self-contained, meta-refresh)
- Anthropic skill-creator — Schema alignment source (grading.json, timing.json, benchmark.json, history.json)
- DeepEval Confident AI — cloud dashboard with trend tracking
- Convex Evals visualizer — React + Convex backend dashboard
- pi-autoresearch — session persistence patterns