Summary
Self-hosted local dashboard (agentv serve) for historical trend analysis, progressive live visualization, and dataset browsing. Complements the static HTML report (#562) by adding persistence, aggregation across time, and a live view that updates as results stream in.
Schema alignment: Internal snapshot structure aligns with Anthropic's skill-creator conventions (grading.json, timing.json, benchmark.json, history.json) while storing results in a separate history repo instead of inline with the project.
Problem
The static HTML report (#562) covers single-run visualization with meta-refresh. But teams need to:
- Spot trends — Is our agent getting better or worse over time? Which evaluators are regressing?
- Look up historical data — What was the pass rate last week? When did this test start failing?
- Aggregate across runs — Compare Tuesday's run against Friday's across multiple targets
- Watch results progressively — See the dashboard populate in real-time as each test completes, not just on page reload
Prerequisite: Result snapshot structure
Storage location: separate history repo
Results stored in a separate git repository — not the project repo (avoids git history bloat) and not inline like skill-creator (eval results are larger and more frequent than skill iteration artifacts).
Configured in .agentv/config.yaml (extends the existing config schema):
```yaml
$schema: agentv-config-v2
guideline_patterns:
  - "**/*.instructions.md"
execution:
  verbose: false
  trace_file: .agentv/results/trace-{timestamp}.jsonl
# History repo for eval result snapshots
history:
  repo: ../myproject-eval-results  # Path to the history repo (required for dashboard)
  auto_save: true                  # Automatically snapshot every run (default: true)
  retention:
    max_runs: 500                  # Max snapshots to keep (optional, default: unlimited)
```
The `history.repo` path is relative to the project root. The history repo is a plain git repo — users create it themselves (`git init ../myproject-eval-results`).
Proposed snapshot layout (aligned with skill-creator)
```
myproject-eval-results/              # Separate git repo
  history.json                       # Campaign/iteration tracking (skill-creator aligned)
  index.json                         # Fast lookup index for dashboard
  runs/
    2026-03-14T10-32-00_claude/      # Timestamped run folder
      meta.json                      # Run metadata
      results.jsonl                  # Eval results (existing format)
      trace.jsonl                    # Execution trace: LLM calls, tool calls, latency
      otel.json                      # OpenTelemetry spans (optional)
      eval-snapshot.yaml             # Frozen copy of EVAL.yaml used for this run
      grading/                       # Per-test grading detail (skill-creator aligned)
        test-feature-alpha.json      # grading.json per test case
        test-retrieval-basic.json
      timing.json                    # Wall clock timing (skill-creator aligned)
      benchmark.json                 # Aggregate stats (skill-creator aligned)
    2026-03-14T11-45-00_gpt4o/
      meta.json
      results.jsonl
      trace.jsonl
      ...
```
Schema alignment with skill-creator
AgentV adopts skill-creator's JSON schemas where they overlap, extending them for eval-specific needs:
history.json — Campaign/iteration tracking
Adapted from skill-creator's history.json. Tracks related runs as iterations in a campaign (e.g., "improving retrieval quality").
```json
{
  "campaigns": [
    {
      "name": "optimize-retrieval",
      "started_at": "2026-03-14T10:30:00Z",
      "current_best": "2026-03-15T09-00-00_claude",
      "iterations": [
        {
          "run_id": "2026-03-14T10-32-00_claude",
          "parent": null,
          "pass_rate": 0.65,
          "result": "baseline",
          "is_current_best": false
        },
        {
          "run_id": "2026-03-15T09-00-00_claude",
          "parent": "2026-03-14T10-32-00_claude",
          "pass_rate": 0.85,
          "result": "won",
          "is_current_best": true
        }
      ]
    }
  ]
}
```
Skill-creator alignment: Same `iterations[]` structure with `parent`, `pass_rate`, `is_current_best`. Extended with a `campaigns[]` wrapper since AgentV tracks multiple improvement efforts, not just one skill.
timing.json — Per-run timing
Matches skill-creator's schema:
```json
{
  "total_tokens": 84852,
  "duration_ms": 45200,
  "total_duration_seconds": 45.2,
  "token_usage": {
    "input": 52000,
    "output": 18000
  }
}
```
grading/<test-id>.json — Per-test grading detail
Matches skill-creator's grading.json schema:
```json
{
  "expectations": [
    {
      "text": "Response correctly identifies the root cause",
      "passed": true,
      "evidence": "Output states 'The root cause is the missing null check in line 42'"
    },
    {
      "text": "Response suggests a fix with code example",
      "passed": false,
      "evidence": "Output identifies the issue but provides no code fix"
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 1,
    "total": 2,
    "pass_rate": 0.50
  },
  "execution_metrics": {
    "tool_calls": { "Read": 3, "Bash": 2, "Write": 1 },
    "total_tool_calls": 6,
    "errors_encountered": 0
  }
}
```
Skill-creator alignment: Same `expectations[].text/passed/evidence` fields, same `summary` structure, same `execution_metrics` shape. This means skill-creator's viewer and aggregation scripts can read AgentV's grading output directly.
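To make the shared schema concrete, here is a minimal sketch of the kind of aggregation an external script could do over this format. The helper name is hypothetical; the field names follow the grading schema above.

```javascript
// Sketch: recompute a grading.json "summary" block from its expectations[].
// summarizeGrading is an illustrative helper, not part of AgentV or skill-creator.
function summarizeGrading(grading) {
  const passed = grading.expectations.filter((e) => e.passed).length;
  const total = grading.expectations.length;
  return {
    passed,
    failed: total - passed,
    total,
    pass_rate: total === 0 ? 0 : passed / total,
  };
}

const grading = {
  expectations: [
    { text: "identifies root cause", passed: true, evidence: "..." },
    { text: "suggests a fix", passed: false, evidence: "..." },
  ],
};
console.log(summarizeGrading(grading));
// { passed: 1, failed: 1, total: 2, pass_rate: 0.5 }
```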
benchmark.json — Run aggregate
Matches skill-creator's schema:
```json
{
  "metadata": {
    "eval_file": "evals/eval.yaml",
    "timestamp": "2026-03-14T10:32:00Z",
    "targets": ["claude-3-5-sonnet"],
    "tests_run": ["test-feature-alpha", "test-retrieval-basic"]
  },
  "run_summary": {
    "claude-3-5-sonnet": {
      "pass_rate": { "mean": 0.85, "stddev": 0.05 },
      "time_seconds": { "mean": 45.0, "stddev": 12.0 },
      "tokens": { "mean": 3800, "stddev": 400 }
    }
  },
  "notes": []
}
```
Skill-creator alignment: Same `run_summary` structure with `mean`/`stddev` per metric. Uses target names instead of `with_skill`/`without_skill` since AgentV compares across targets, not skill presence.
meta.json — Run metadata
AgentV-specific (no skill-creator equivalent):
```json
{
  "runId": "2026-03-14T10-32-00_claude",
  "timestamp": "2026-03-14T10:32:00Z",
  "status": "completed",
  "evalFile": "evals/eval.yaml",
  "targets": ["claude-3-5-sonnet"],
  "campaign": "optimize-retrieval",
  "git": {
    "branch": "main",
    "commit": "abc1234",
    "dirty": false
  },
  "artifacts": {
    "results": "results.jsonl",
    "trace": "trace.jsonl",
    "otel": "otel.json",
    "eval_snapshot": "eval-snapshot.yaml",
    "timing": "timing.json",
    "benchmark": "benchmark.json"
  },
  "tags": ["nightly", "v2.1"]
}
```
eval-snapshot.yaml — Frozen eval config
A copy of the EVAL.yaml that was used for this run. Enables the dashboard to show exactly what was evaluated, even if the eval config has since changed. No skill-creator equivalent, but follows the same principle as skill-creator's skill-snapshot/ directory.
index.json — Fast lookup
Rebuilt on startup or incrementally updated. Contains one entry per run with summary stats from meta.json + benchmark.json. Enables the dashboard to render runs list and trend charts without parsing every JSONL.
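The issue does not pin down the index entry shape. A plausible sketch, assuming one entry per run with fields lifted from meta.json and benchmark.json (all field names here are illustrative, not a committed schema):

```json
{
  "version": 1,
  "runs": [
    {
      "run_id": "2026-03-14T10-32-00_claude",
      "timestamp": "2026-03-14T10:32:00Z",
      "targets": ["claude-3-5-sonnet"],
      "campaign": "optimize-retrieval",
      "pass_rate": 0.85,
      "duration_seconds": 45.2,
      "tags": ["nightly", "v2.1"],
      "git_commit": "abc1234"
    }
  ]
}
```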
Workflow
```bash
# One-time setup
git init ../myproject-eval-results

# Configure in .agentv/config.yaml
#   history:
#     repo: ../myproject-eval-results

# Every eval run auto-saves all artifacts to the history repo
agentv eval run -f eval.yaml
# → writes results.jsonl, trace.jsonl, timing.json, grading/, benchmark.json, meta.json

# Group runs into a campaign for iteration tracking
agentv eval run -f eval.yaml --campaign optimize-retrieval

# View dashboard
agentv serve   # Reads history.repo from config.yaml

# Manage history
agentv results reindex                  # Rebuild index.json
agentv results prune --older-than 30d   # Clean up old runs

# Share with team
cd ../myproject-eval-results && git add . && git commit -m "eval run 2026-03-14" && git push
```
Progressive visualization
The core differentiator over the static HTML report. The dashboard should feel alive during a run.
How it works
- `agentv eval run` writes results and traces as each test completes — also writes to the history repo if configured
- `agentv serve` watches the active JSONL files via `node:fs.watch`
- Server pushes new results to the browser via Server-Sent Events (SSE)
- Dashboard updates incrementally — no full page reload
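One fiddly detail of tailing an actively written results.jsonl is that file-watch events can fire mid-line. A minimal sketch of the buffering this implies — `createJsonlTail` is an assumed helper name, not AgentV's actual API:

```javascript
// Sketch: incremental JSONL reader for a file that is being appended to.
// Buffers a trailing partial line until a later write completes it, so only
// whole JSON records are ever parsed and pushed to SSE clients.
function createJsonlTail() {
  let partial = "";
  return function push(chunk) {
    const text = partial + chunk;
    const lines = text.split("\n");
    partial = lines.pop(); // last element is "" or an incomplete line
    return lines.filter((l) => l.trim() !== "").map((l) => JSON.parse(l));
  };
}

const tail = createJsonlTail();
tail('{"test":"a","passed":true}\n{"test":"b",'); // second record incomplete, buffered
const records = tail('"passed":false}\n');
console.log(records); // [ { test: 'b', passed: false } ]
```

In the real server, `push` would be fed the bytes read since the last `fs.watch` event for the active run's results.jsonl.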
What "progressive" looks like
| Phase | Dashboard state |
|---|---|
| Run starts (0 results) | Stat cards show 0/N, empty table, "Running..." indicator |
| First result arrives | First row appears in table, stat cards update, first data point on charts |
| Results streaming in | Table grows, pass rate bar animates, score distribution builds up |
| Run completes | "Running..." → "Complete", final stats, timing.json + benchmark.json written |
Progressive elements per view
- Stat cards: Counters animate as results arrive (passed: 0 → 1 → 2...)
- Test cases table: New rows append at top with subtle highlight animation
- Score distribution histogram: Bars grow as more data points arrive
- Pass rate bar: Width animates from 0% toward current rate
- Category breakdown: Categories appear as their first test completes
- Trace tree: Nodes appear as their corresponding tests finish
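The stat-card behavior above amounts to folding each incoming result into a running summary. A sketch, assuming a `status` field with `passed`/`failed`/error values (an assumption for illustration, not AgentV's confirmed result schema):

```javascript
// Sketch: fold one streamed result into the running stats the stat cards show.
// Pure function so each SSE event maps old state to new state.
function applyResult(stats, result) {
  const next = { ...stats, completed: stats.completed + 1 };
  if (result.status === "passed") next.passed += 1;
  else if (result.status === "failed") next.failed += 1;
  else next.errors += 1;
  next.passRate = next.passed / next.completed;
  return next;
}

let stats = { completed: 0, passed: 0, failed: 0, errors: 0, passRate: 0 };
for (const r of [{ status: "passed" }, { status: "failed" }, { status: "passed" }]) {
  stats = applyResult(stats, r); // counters animate: passed 0 → 1 → 1 → 2
}
console.log(stats.passRate.toFixed(2)); // "0.67"
```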
Dashboard features
agentv serve
```bash
agentv serve                                  # Reads history.repo from .agentv/config.yaml
agentv serve --port 8080                      # Custom port (default: 4747)
agentv serve --results-dir ./custom-results   # Override config
```
Minimal `node:http` — zero external dependencies.
View: Historical Trends
- Line chart: Pass rate over time (one line per target)
- Score trend: Average evaluator score over time
- Regression detection: Highlight runs where pass rate dropped significantly
- Campaign view: Group runs by campaign, show iteration progression (from history.json)
- Filter by: target, evaluator, tag, date range, git branch, campaign
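The issue does not define what a "significant" drop means. One simple interpretation, sketched here with illustrative names and an arbitrary 5-point threshold, flags any run whose pass rate fell by more than the threshold versus the previous run:

```javascript
// Sketch: flag runs whose pass rate dropped by more than `threshold` relative
// to the chronologically previous run. Threshold choice is an assumption.
function findRegressions(runs, threshold = 0.05) {
  const sorted = [...runs].sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  const regressions = [];
  for (let i = 1; i < sorted.length; i++) {
    const delta = sorted[i].pass_rate - sorted[i - 1].pass_rate;
    if (delta < -threshold) regressions.push({ run_id: sorted[i].run_id, delta });
  }
  return regressions;
}

const runs = [
  { run_id: "r1", timestamp: "2026-03-14T10:00:00Z", pass_rate: 0.9 },
  { run_id: "r2", timestamp: "2026-03-15T10:00:00Z", pass_rate: 0.7 },
  { run_id: "r3", timestamp: "2026-03-16T10:00:00Z", pass_rate: 0.72 },
];
console.log(findRegressions(runs).map((r) => r.run_id)); // [ 'r2' ]
```

A production version would likely compare per target and use `stddev` from benchmark.json rather than a fixed threshold.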
View: Run History
- Table of all historical runs: timestamp, target, campaign, pass rate bar, passed/failed/error, duration, tags, git commit
- Sort by any column
- Click row → detail views (same as #562: test cases, traces, comparison)
- Multi-select runs for side-by-side comparison
View: Live Run (progressive)
- Same layout as the run detail view from #562
- Updates in real-time via SSE as results stream in
- Visual indicator: pulsing dot or progress bar showing N/total completed
- Auto-scrolls to show newest results (with option to pin scroll position)
View: Trace Explorer
Powered by trace.jsonl from the snapshot:
- Collapsible tree view of test execution → LLM calls → tool calls
- Per-node details: prompts, responses, token counts, latency
- Score overlay on trace nodes
- Color coding: green ≥90%, yellow ≥50%, red <50%
- Filter traces by test ID, target, or score range
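The color-coding rule above is small enough to state directly in code. A sketch, assuming scores are normalized to 0..1 (the function name is illustrative):

```javascript
// Sketch of the trace-node thresholds: green >= 90%, yellow >= 50%, red below.
function scoreColor(score) {
  if (score >= 0.9) return "green";
  if (score >= 0.5) return "yellow";
  return "red";
}

console.log([0.95, 0.6, 0.2].map(scoreColor)); // [ 'green', 'yellow', 'red' ]
```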
View: Dataset Browser
- Read-only view of test cases from EVAL.yaml and referenced dataset files
- Table with: test ID, input preview, expected output preview, evaluators, tags
- Expand row to see full input/output
- Filter by tag, evaluator
- Import: upload CSV/JSON/JSONL (writes to dataset file on disk)
- No inline editing — users edit YAML directly or via AI assistant
View: Cross-Run Comparison
- Select any 2+ historical runs (or iterations within a campaign)
- Regression table: tests that flipped pass→fail (red) or fail→pass (green)
- Score delta with ↑/↓ arrows
- Summary: improvements / regressions / unchanged
- Campaign iteration comparison: show progression across iterations
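The regression-table logic above can be sketched as a diff over two runs' per-test outcomes. The `{ testId: boolean }` input shape and function name are assumptions for illustration:

```javascript
// Sketch: classify tests that flipped between two runs.
// pass → fail = regressed (red), fail → pass = improved (green).
function compareRuns(before, after) {
  const diff = { regressed: [], improved: [], unchanged: [] };
  for (const testId of Object.keys(after)) {
    if (!(testId in before)) continue; // test added since the earlier run; skip
    if (before[testId] && !after[testId]) diff.regressed.push(testId);
    else if (!before[testId] && after[testId]) diff.improved.push(testId);
    else diff.unchanged.push(testId);
  }
  return diff;
}

const before = { "test-a": true, "test-b": false, "test-c": true };
const after = { "test-a": false, "test-b": true, "test-c": true };
console.log(compareRuns(before, after));
// { regressed: [ 'test-a' ], improved: [ 'test-b' ], unchanged: [ 'test-c' ] }
```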
Architecture
Server
```
agentv serve
├─ GET  /                            → SPA (bundled HTML/CSS/JS)
├─ GET  /api/runs                    → index.json (run list)
├─ GET  /api/runs/:runId             → meta.json + benchmark.json
├─ GET  /api/runs/:runId/results     → Streamed results.jsonl parsing
├─ GET  /api/runs/:runId/traces      → Streamed trace.jsonl parsing
├─ GET  /api/runs/:runId/grading/:test → grading/<test>.json
├─ GET  /api/campaigns               → history.json (campaign list)
├─ GET  /api/trends                  → Aggregated time-series data
├─ GET  /api/live                    → SSE stream (new results + traces)
├─ GET  /api/datasets                → Test cases from EVAL.yaml / dataset files
└─ POST /api/datasets/import         → Import CSV/JSON/JSONL as test cases
```
- `node:http` + `node:fs` + `node:fs.watch` — zero dependencies
- SSE endpoint watches active JSONL files and pushes `data: {result JSON}\n\n` on each new line
- SPA bundled into the npm package (no separate install)
- Reads `history.repo` from `.agentv/config.yaml` to locate results
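The wire format for the live endpoint follows the standard SSE framing: each event is a `data:` line terminated by a blank line. A minimal sketch of the formatter (the helper name is illustrative):

```javascript
// Sketch: serialize one eval result as a Server-Sent Events frame.
// Per the SSE format, "data: <payload>" followed by a blank line ends the event;
// JSON.stringify guarantees the payload itself contains no raw newlines.
function sseFrame(result) {
  return `data: ${JSON.stringify(result)}\n\n`;
}

process.stdout.write(sseFrame({ test: "test-a", passed: true }));
```

On the server, each frame would be written to every open `/api/live` response as new JSONL lines arrive; the browser's `EventSource` parses the framing automatically.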
Frontend
- Vanilla JS or minimal vendored lib (must bundle into npm package)
- Reuses visualization components from #562
- SSE client: `new EventSource('/api/live')` — native browser API, no library needed
Acceptance signals
- `history.repo` config in `.agentv/config.yaml` points to a separate results repo
- `agentv eval run` auto-saves all artifacts to the history repo when configured
- Snapshot includes: results.jsonl, trace.jsonl, otel.json, timing.json, grading/*.json, benchmark.json, eval-snapshot.yaml, meta.json
- `grading.json` schema matches skill-creator's `expectations[].text/passed/evidence` format
- `timing.json` schema matches skill-creator's format
- `benchmark.json` schema matches skill-creator's `run_summary` with `mean`/`stddev` format
- `history.json` tracks campaigns with iteration progression (aligned with skill-creator's `history.json`)
- `agentv serve` starts a local HTTP server with zero additional dependencies
- Progressive visualization: dashboard updates in real-time via SSE
- Historical trend charts with campaign grouping
- Trace explorer with collapsible tree from trace.jsonl
- Cross-run and cross-campaign comparison
- Dataset browser with filtering and import
- `index.json` enables fast aggregation without parsing all JSONL files
- `agentv results reindex` rebuilds the index
- Works with existing JSONL results (backwards compatible)
Non-goals
- Cloud hosting or multi-user auth
- YAML editing UI (users edit YAML directly or via AI assistant)
- Eval config form builder
- Real-time collaboration
- Database backend (SQLite, Convex, etc.) — filesystem only
- Replacing the CLI workflow — dashboard is complementary
- Blind A/B comparison (skill-creator's comparator.md — future enhancement)
Related
- feat: self-contained HTML dashboard with meta-refresh (runs, benchmarks, traces) #562 — Static HTML report (single-run, self-contained, meta-refresh)
- Anthropic skill-creator — Schema alignment source (grading.json, timing.json, benchmark.json, history.json)
- DeepEval Confident AI — cloud dashboard with trend tracking
- Convex Evals visualizer — React + Convex backend dashboard
- pi-autoresearch — session persistence patterns