
feat: self-hosted dashboard — historical trends, dataset management, YAML editor #563

@christso

Description


Summary

A self-hosted local dashboard (agentv serve) for historical trend analysis, progressive live visualization, and dataset browsing. It complements the static HTML report (#562) by adding persistence, aggregation across time, and a live view that updates as results stream in.

Schema alignment: Internal snapshot structure aligns with Anthropic's skill-creator conventions (grading.json, timing.json, benchmark.json, history.json) while storing results in a separate history repo instead of inline with the project.

Problem

The static HTML report (#562) covers single-run visualization with meta-refresh. But teams need to:

  1. Spot trends — Is our agent getting better or worse over time? Which evaluators are regressing?
  2. Look up historical data — What was the pass rate last week? When did this test start failing?
  3. Aggregate across runs — Compare Tuesday's run against Friday's across multiple targets
  4. Watch results progressively — See the dashboard populate in real-time as each test completes, not just on page reload

Prerequisite: Result snapshot structure

Storage location: separate history repo

Results are stored in a separate git repository, not the project repo (avoiding git history bloat) and not inline like skill-creator (eval results are larger and more frequent than skill iteration artifacts).

Configured in .agentv/config.yaml (extends the existing config schema):

$schema: agentv-config-v2

guideline_patterns:
  - "**/*.instructions.md"

execution:
  verbose: false
  trace_file: .agentv/results/trace-{timestamp}.jsonl

# History repo for eval result snapshots
history:
  repo: ../myproject-eval-results    # Path to the history repo (required for dashboard)
  auto_save: true                    # Automatically snapshot every run (default: true)
  retention:
    max_runs: 500                    # Max snapshots to keep (optional, default: unlimited)

The history.repo path is relative to the project root. The history repo is a plain git repo — users create it themselves (git init ../myproject-eval-results).

Proposed snapshot layout (aligned with skill-creator)

myproject-eval-results/                     # Separate git repo
  history.json                              # Campaign/iteration tracking (skill-creator aligned)
  index.json                                # Fast lookup index for dashboard

  runs/
    2026-03-14T10-32-00_claude/             # Timestamped run folder
      meta.json                             # Run metadata
      results.jsonl                         # Eval results (existing format)
      trace.jsonl                           # Execution trace: LLM calls, tool calls, latency
      otel.json                             # OpenTelemetry spans (optional)
      eval-snapshot.yaml                    # Frozen copy of EVAL.yaml used for this run
      grading/                              # Per-test grading detail (skill-creator aligned)
        test-feature-alpha.json             # grading.json per test case
        test-retrieval-basic.json
      timing.json                           # Wall clock timing (skill-creator aligned)
      benchmark.json                        # Aggregate stats (skill-creator aligned)

    2026-03-14T11-45-00_gpt4o/
      meta.json
      results.jsonl
      trace.jsonl
      ...
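The run folder names above can be derived from an ISO timestamp and target name; this sketch infers the naming scheme from the layout (the helper name is illustrative):

```javascript
// Build a filesystem-safe run folder name like "2026-03-14T10-32-00_claude".
// Drops the timezone suffix and replaces ":" (illegal on some filesystems).
function runFolderName(isoTimestamp, target) {
  const stamp = isoTimestamp.slice(0, 19).replace(/:/g, "-");
  return `${stamp}_${target}`;
}
```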

Schema alignment with skill-creator

AgentV adopts skill-creator's JSON schemas where they overlap, extending them for eval-specific needs:

history.json — Campaign/iteration tracking

Adapted from skill-creator's history.json. Tracks related runs as iterations in a campaign (e.g., "improving retrieval quality").

{
  "campaigns": [
    {
      "name": "optimize-retrieval",
      "started_at": "2026-03-14T10:30:00Z",
      "current_best": "2026-03-15T09-00-00_claude",
      "iterations": [
        {
          "run_id": "2026-03-14T10-32-00_claude",
          "parent": null,
          "pass_rate": 0.65,
          "result": "baseline",
          "is_current_best": false
        },
        {
          "run_id": "2026-03-15T09-00-00_claude",
          "parent": "2026-03-14T10-32-00_claude",
          "pass_rate": 0.85,
          "result": "won",
          "is_current_best": true
        }
      ]
    }
  ]
}

Skill-creator alignment: Same iterations[] structure with parent, pass_rate, is_current_best. Extended with campaigns[] wrapper since AgentV tracks multiple improvement efforts, not just one skill.
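Given the history.json shape above, the current-best iteration of a campaign can be selected by pass_rate; a sketch with an illustrative helper name:

```javascript
// Return the iteration with the highest pass_rate in a campaign,
// using the history.json structure shown above.
function currentBest(campaign) {
  return campaign.iterations.reduce(
    (best, it) => (!best || it.pass_rate > best.pass_rate ? it : best),
    null
  );
}
```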

timing.json — Per-run timing

Matches skill-creator's schema:

{
  "total_tokens": 84852,
  "duration_ms": 45200,
  "total_duration_seconds": 45.2,
  "token_usage": {
    "input": 52000,
    "output": 18000
  }
}

grading/<test-id>.json — Per-test grading detail

Matches skill-creator's grading.json schema:

{
  "expectations": [
    {
      "text": "Response correctly identifies the root cause",
      "passed": true,
      "evidence": "Output states 'The root cause is the missing null check in line 42'"
    },
    {
      "text": "Response suggests a fix with code example",
      "passed": false,
      "evidence": "Output identifies the issue but provides no code fix"
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 1,
    "total": 2,
    "pass_rate": 0.50
  },
  "execution_metrics": {
    "tool_calls": { "Read": 3, "Bash": 2, "Write": 1 },
    "total_tool_calls": 6,
    "errors_encountered": 0
  }
}

Skill-creator alignment: Same expectations[].text/passed/evidence fields, same summary structure, same execution_metrics shape. This means skill-creator's viewer and aggregation scripts can read AgentV's grading output directly.
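The summary block above is derivable from expectations[]; a sketch (helper name is illustrative, not part of either tool's API):

```javascript
// Derive grading.json's summary from its expectations[] array.
function summarize(expectations) {
  const passed = expectations.filter((e) => e.passed).length;
  const total = expectations.length;
  return {
    passed,
    failed: total - passed,
    total,
    pass_rate: total ? passed / total : 0,
  };
}
```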

benchmark.json — Run aggregate

Matches skill-creator's schema:

{
  "metadata": {
    "eval_file": "evals/eval.yaml",
    "timestamp": "2026-03-14T10:32:00Z",
    "targets": ["claude-3-5-sonnet"],
    "tests_run": ["test-feature-alpha", "test-retrieval-basic"]
  },
  "run_summary": {
    "claude-3-5-sonnet": {
      "pass_rate": { "mean": 0.85, "stddev": 0.05 },
      "time_seconds": { "mean": 45.0, "stddev": 12.0 },
      "tokens": { "mean": 3800, "stddev": 400 }
    }
  },
  "notes": []
}

Skill-creator alignment: Same run_summary structure with mean/stddev per metric. Uses target names instead of with_skill/without_skill since AgentV compares across targets, not skill presence.
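The mean/stddev pairs in run_summary can be computed per metric from per-test samples; a sketch (assumes population standard deviation, which may differ from skill-creator's exact choice):

```javascript
// Compute { mean, stddev } for one benchmark.json metric from raw samples.
function stats(samples) {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((acc, x) => acc + (x - mean) ** 2, 0) / samples.length;
  return { mean, stddev: Math.sqrt(variance) };
}
```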

meta.json — Run metadata

AgentV-specific (no skill-creator equivalent):

{
  "runId": "2026-03-14T10-32-00_claude",
  "timestamp": "2026-03-14T10:32:00Z",
  "status": "completed",
  "evalFile": "evals/eval.yaml",
  "targets": ["claude-3-5-sonnet"],
  "campaign": "optimize-retrieval",
  "git": {
    "branch": "main",
    "commit": "abc1234",
    "dirty": false
  },
  "artifacts": {
    "results": "results.jsonl",
    "trace": "trace.jsonl",
    "otel": "otel.json",
    "eval_snapshot": "eval-snapshot.yaml",
    "timing": "timing.json",
    "benchmark": "benchmark.json"
  },
  "tags": ["nightly", "v2.1"]
}

eval-snapshot.yaml — Frozen eval config

A copy of the EVAL.yaml that was used for this run. Enables the dashboard to show exactly what was evaluated, even if the eval config has since changed. No skill-creator equivalent, but follows the same principle as skill-creator's skill-snapshot/ directory.

index.json — Fast lookup

Rebuilt on startup or incrementally updated. Contains one entry per run with summary stats from meta.json + benchmark.json. Enables the dashboard to render runs list and trend charts without parsing every JSONL.
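A sketch of one index entry built by merging meta.json and benchmark.json; the exact field selection is an assumption:

```javascript
// Build one index.json entry per run from its meta.json and benchmark.json,
// so the dashboard can list runs without parsing any JSONL.
function indexEntry(meta, benchmark) {
  return {
    runId: meta.runId,
    timestamp: meta.timestamp,
    status: meta.status,
    targets: meta.targets,
    campaign: meta.campaign ?? null,
    tags: meta.tags ?? [],
    summary: benchmark.run_summary,
  };
}
```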

Workflow

# One-time setup
git init ../myproject-eval-results

# Configure in .agentv/config.yaml
# history:
#   repo: ../myproject-eval-results

# Every eval run auto-saves all artifacts to the history repo
agentv eval run -f eval.yaml
# → writes results.jsonl, trace.jsonl, timing.json, grading/, benchmark.json, meta.json

# Group runs into a campaign for iteration tracking
agentv eval run -f eval.yaml --campaign optimize-retrieval

# View dashboard
agentv serve    # Reads history.repo from config.yaml

# Manage history
agentv results reindex                      # Rebuild index.json
agentv results prune --older-than 30d       # Cleanup old runs

# Share with team
cd ../myproject-eval-results && git add . && git commit -m "eval run 2026-03-14" && git push
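The retention settings (history.retention.max_runs, prune --older-than) reduce to selecting which run folders to drop. A sketch of that selection logic only, with illustrative names; actual deletion from the history repo is omitted:

```javascript
// Select run ids to prune. Run folder names sort chronologically as strings
// (timestamp prefix), so lexicographic order is timestamp order.
function selectPrunable(runIds, { maxRuns, cutoffIso }) {
  const sorted = [...runIds].sort();
  const tooOld = cutoffIso ? sorted.filter((id) => id < cutoffIso) : [];
  const overCap =
    maxRuns && sorted.length > maxRuns
      ? sorted.slice(0, sorted.length - maxRuns) // drop oldest beyond the cap
      : [];
  return [...new Set([...tooOld, ...overCap])];
}
```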

Progressive visualization

The core differentiator over the static HTML report. The dashboard should feel alive during a run.

How it works

  1. agentv eval run writes results and traces as each test completes — also writes to history repo if configured
  2. agentv serve watches the active JSONL files via node:fs.watch
  3. Server pushes new results to the browser via Server-Sent Events (SSE)
  4. Dashboard updates incrementally — no full page reload
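Step 2 amounts to tailing a JSONL file: fs.watch fires on appends, the server reads from the last offset, and only complete lines are parsed. A sketch of the buffering helper (the fs.watch wiring itself is omitted; names are illustrative):

```javascript
// Consume a newly read chunk of an append-only JSONL file, carrying any
// incomplete trailing line over to the next read.
function consumeChunk(state, chunk) {
  const buffered = state.partial + chunk;
  const lines = buffered.split("\n");
  state.partial = lines.pop(); // incomplete last line waits for more bytes
  return lines.filter(Boolean).map((line) => JSON.parse(line));
}
```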

What "progressive" looks like

  • Run starts (0 results): stat cards show 0/N, empty table, "Running..." indicator
  • First result arrives: first row appears in table, stat cards update, first data point on charts
  • Results streaming in: table grows, pass rate bar animates, score distribution builds up
  • Run completes: "Running..." → "Complete", final stats, timing.json + benchmark.json written

Progressive elements per view

  • Stat cards: Counters animate as results arrive (passed: 0 → 1 → 2...)
  • Test cases table: New rows append at top with subtle highlight animation
  • Score distribution histogram: Bars grow as more data points arrive
  • Pass rate bar: Width animates from 0% toward current rate
  • Category breakdown: Categories appear as their first test completes
  • Trace tree: Nodes appear as their corresponding tests finish

Dashboard features

agentv serve

agentv serve                    # Reads history.repo from .agentv/config.yaml
agentv serve --port 8080        # Custom port (default: 4747)
agentv serve --results-dir ./custom-results  # Override config

Minimal node:http — zero external dependencies.

View: Historical Trends

  • Line chart: Pass rate over time (one line per target)
  • Score trend: Average evaluator score over time
  • Regression detection: Highlight runs where pass rate dropped significantly
  • Campaign view: Group runs by campaign, show iteration progression (from history.json)
  • Filter by: target, evaluator, tag, date range, git branch, campaign

View: Run History

View: Live Run (progressive)

View: Trace Explorer

Powered by trace.jsonl from the snapshot:

  • Collapsible tree view of test execution → LLM calls → tool calls
  • Per-node details: prompts, responses, token counts, latency
  • Score overlay on trace nodes
  • Color coding: green ≥90%, yellow ≥50%, red <50%
  • Filter traces by test ID, target, or score range
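The color-coding thresholds above as a tiny helper (scores as fractions; the function name is illustrative):

```javascript
// Map a score in [0, 1] to the trace-node color: green ≥ 90%,
// yellow ≥ 50%, red below 50%.
function scoreColor(score) {
  if (score >= 0.9) return "green";
  if (score >= 0.5) return "yellow";
  return "red";
}
```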

View: Dataset Browser

  • Read-only view of test cases from EVAL.yaml and referenced dataset files
  • Table with: test ID, input preview, expected output preview, evaluators, tags
  • Expand row to see full input/output
  • Filter by tag, evaluator
  • Import: upload CSV/JSON/JSONL (writes to dataset file on disk)
  • No inline editing — users edit YAML directly or via AI assistant

View: Cross-Run Comparison

  • Select any 2+ historical runs (or iterations within a campaign)
  • Regression table: tests that flipped pass→fail (red) or fail→pass (green)
  • Score delta with ↑/↓ arrows
  • Summary: improvements / regressions / unchanged
  • Campaign iteration comparison: show progression across iterations
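The regression table reduces to classifying per-test flips between two runs; a sketch assuming pass/fail is available as testId → boolean maps (helper name is illustrative):

```javascript
// Classify tests that flipped between two runs. Tests missing from the
// later run are skipped rather than counted.
function diffRuns(before, after) {
  const out = { regressions: [], improvements: [], unchanged: [] };
  for (const [testId, was] of Object.entries(before)) {
    const now = after[testId];
    if (now === undefined) continue;
    if (was && !now) out.regressions.push(testId);       // pass → fail (red)
    else if (!was && now) out.improvements.push(testId); // fail → pass (green)
    else out.unchanged.push(testId);
  }
  return out;
}
```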

Architecture

Server

agentv serve
  ├─ GET /                              → SPA (bundled HTML/CSS/JS)
  ├─ GET /api/runs                      → index.json (run list)
  ├─ GET /api/runs/:runId               → meta.json + benchmark.json
  ├─ GET /api/runs/:runId/results       → Streamed results.jsonl parsing
  ├─ GET /api/runs/:runId/traces        → Streamed trace.jsonl parsing
  ├─ GET /api/runs/:runId/grading/:test → grading/<test>.json
  ├─ GET /api/campaigns                 → history.json (campaign list)
  ├─ GET /api/trends                    → Aggregated time-series data
  ├─ GET /api/live                      → SSE stream (new results + traces)
  ├─ GET /api/datasets                  → Test cases from EVAL.yaml / dataset files
  └─ POST /api/datasets/import          → Import CSV/JSON/JSONL as test cases
  • node:http + node:fs + node:fs.watch — zero dependencies
  • SSE endpoint watches active JSONL files and pushes data: {result JSON}\n\n on each new line
  • SPA bundled into the npm package (no separate install)
  • Reads history.repo from .agentv/config.yaml to locate results
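A sketch of the SSE push format described above. sseFrame renders one result as a `data:` frame per the event-stream wire format; the node:http handler wiring is shown only as comments, and names are illustrative:

```javascript
// Format one result object as a Server-Sent Events frame:
// "data: {...}\n\n" terminates the event.
function sseFrame(resultJson) {
  return `data: ${JSON.stringify(resultJson)}\n\n`;
}

// In the GET /api/live handler (sketch):
//   res.writeHead(200, {
//     "Content-Type": "text/event-stream",
//     "Cache-Control": "no-cache",
//     "Connection": "keep-alive",
//   });
//   watcher.on("line", (obj) => res.write(sseFrame(obj)));
```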

Frontend


Acceptance signals

  • history.repo config in .agentv/config.yaml points to a separate results repo
  • agentv eval run auto-saves all artifacts to history repo when configured
  • Snapshot includes: results.jsonl, trace.jsonl, otel.json, timing.json, grading/*.json, benchmark.json, eval-snapshot.yaml, meta.json
  • grading.json schema matches skill-creator's expectations[].text/passed/evidence format
  • timing.json schema matches skill-creator's format
  • benchmark.json schema matches skill-creator's run_summary with mean/stddev format
  • history.json tracks campaigns with iteration progression (aligned with skill-creator's history.json)
  • agentv serve starts a local HTTP server with zero additional dependencies
  • Progressive visualization: dashboard updates in real-time via SSE
  • Historical trend charts with campaign grouping
  • Trace explorer with collapsible tree from trace.jsonl
  • Cross-run and cross-campaign comparison
  • Dataset browser with filtering and import
  • index.json enables fast aggregation without parsing all JSONL files
  • agentv results reindex rebuilds index
  • Works with existing JSONL results (backwards compatible)

Non-goals

  • Cloud hosting or multi-user auth
  • YAML editing UI (users edit YAML directly or via AI assistant)
  • Eval config form builder
  • Real-time collaboration
  • Database backend (SQLite, Convex, etc.) — filesystem only
  • Replacing the CLI workflow — dashboard is complementary
  • Blind A/B comparison (skill-creator's comparator.md — future enhancement)
