
feat(studio): expose eval resumability — API + Resume action on run detail#1220

Open
christso wants to merge 2 commits into main from feat/1219-studio-resume

Conversation

Collaborator

@christso christso commented May 6, 2026

Closes #1219

Summary

Surfaces the existing CLI resume mechanics (--resume, --rerun-failed, --retry-errors, --output) in Studio so users can finish an interrupted or partially-errored run from the web UI instead of dropping to a terminal.

Changes

Server (apps/cli/src/commands/results/)

  • RunEvalRequest accepts resume / rerun_failed / retry_errors / output (snake_case wire format).
  • buildCliArgs translates these into --resume, --rerun-failed, --retry-errors <path>, --output <dir>.
  • New validateResumeOptions returns 400 with a usable message when more than one of the three modes is requested.
  • Read-only guard now also covers /api/benchmarks/:id/eval/run (was missing before).
  • handleRunDetail reads benchmark.json for the run and exposes run_dir (relative to cwd) and suite_filter (from metadata.eval_file) so the UI can target the same workspace. Local runs only — remote runs in the results-repo cache cannot be resumed in place.
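The server-side pieces above can be sketched as follows. The field and function names come from this PR, but the exact signatures, error message, and return conventions are assumptions, not the real implementation in apps/cli:

```typescript
// Sketch only — wire-format field names match the PR; everything else is assumed.
interface RunEvalRequest {
  resume?: boolean;
  rerun_failed?: boolean;
  retry_errors?: string; // path to an errors file
  output?: string;       // run directory to resume into
}

// Returns an error message when more than one resume mode is requested,
// mirroring the 400 described above. retry_errors is trimmed before it
// counts as a mode, so whitespace-only strings neither validate nor emit a flag.
function validateResumeOptions(req: RunEvalRequest): string | undefined {
  const modes = [
    req.resume ? "--resume" : undefined,
    req.rerun_failed ? "--rerun-failed" : undefined,
    req.retry_errors?.trim() ? "--retry-errors" : undefined,
  ].filter((m): m is string => m !== undefined);
  if (modes.length > 1) {
    return `Choose one of ${modes.join(", ")} — they are mutually exclusive.`;
  }
  return undefined;
}

// Translates the request into the corresponding CLI flags.
function buildCliArgs(req: RunEvalRequest): string[] {
  const args: string[] = [];
  if (req.resume) args.push("--resume");
  if (req.rerun_failed) args.push("--rerun-failed");
  const retryErrors = req.retry_errors?.trim();
  if (retryErrors) args.push("--retry-errors", retryErrors);
  if (req.output) args.push("--output", req.output);
  return args;
}
```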

UI (apps/studio/src/)

  • New ResumeRunActions component renders "↻ Resume run" + "Rerun failed cases" buttons on /runs/:runId and /benchmarks/:id/runs/:runId when at least one row has executionStatus === 'execution_error'.
  • Hidden in read-only mode; disabled with an explanatory tooltip when run_dir or suite_filter cannot be resolved.
  • After POSTing to /api/eval/run, navigates to /jobs/:runId for live progress.
  • Pure helpers (shouldShowResumeActions, buildResumeRequestBody) are unit-tested without rendering React.
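A minimal sketch of the two pure helpers named above, assuming hypothetical row and run-detail shapes (the real types and request body in apps/studio may differ):

```typescript
// Assumed shapes for illustration only.
interface RunRow {
  executionStatus: string;
}

interface RunDetail {
  readOnly: boolean;
  rows: RunRow[];
  run_dir?: string;
  suite_filter?: string;
}

// Buttons render only when writes are allowed and at least one row errored.
function shouldShowResumeActions(detail: RunDetail): boolean {
  if (detail.readOnly) return false;
  return detail.rows.some((r) => r.executionStatus === "execution_error");
}

// Body POSTed to /api/eval/run; snake_case matches the wire format above.
// Missing targets are omitted rather than sent as undefined.
function buildResumeRequestBody(
  detail: RunDetail,
  mode: "resume" | "rerun_failed",
): Record<string, unknown> {
  const body: Record<string, unknown> = { [mode]: true };
  if (detail.suite_filter) body.suite = detail.suite_filter;
  if (detail.run_dir) body.output = detail.run_dir;
  return body;
}
```

Keeping these as pure functions is what lets the tests below exercise visibility and body shaping without rendering React.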

Tests added

  • apps/cli/test/commands/results/serve.test.ts — request/preview shaping for resume / rerun_failed / retry_errors, mutual-exclusivity 400s, read-only 403 (unscoped + benchmark-scoped), run_dir + suite_filter exposure on /api/runs/:filename. 53 tests in this file.
  • apps/studio/src/components/resume-run-helpers.test.ts — visibility logic and request-body shape for both modes (incl. read-only hides, missing target omitted). 7 tests.

Test plan

  • bun run test — 2337 tests pass
  • bun run typecheck — clean
  • bun run lint — clean
  • Manual red/green UAT (synthetic fixture)
  • Live e2e against Azure OpenAI (real provider, real execution_error, click-through)

Red / Green UAT — synthetic fixture

A hand-crafted run workspace with one execution_status: execution_error row plus a benchmark.json whose metadata.eval_file points at a known eval YAML. Confirms the wire contract and UI surface across main, this branch, and --read-only.
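A sketch of how such a fixture can be assembled; the file names follow the PR, but the paths and field values here are illustrative (a temp dir rather than the real .agentv/results/runs layout):

```typescript
import { mkdirSync, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Temp base keeps the sketch side-effect free; the real fixture lived
// under .agentv/results/runs/.
const base = mkdtempSync(join(tmpdir(), "agentv-fixture-"));
const runDir = join(base, "2026-05-06T00-00-00-000Z");
mkdirSync(runDir, { recursive: true });

// One passing row and one execution_error row is enough to trigger
// the Resume buttons.
const rows = [
  { test_id: "ok-case", execution_status: "ok", score: 1 },
  { test_id: "bad-case", execution_status: "execution_error", score: 0 },
];
writeFileSync(
  join(runDir, "index.jsonl"),
  rows.map((r) => JSON.stringify(r)).join("\n") + "\n",
);

// metadata.eval_file is what the server surfaces as suite_filter.
writeFileSync(
  join(runDir, "benchmark.json"),
  JSON.stringify({
    metadata: { eval_file: "examples/features/basic/evals/dataset.eval.yaml" },
  }),
);
```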

Red — main

GET /api/runs/:filename does not include run_dir / suite_filter. Run detail page exposes only "▶ Re-run with Filters":

- heading "gpt-4o" [level=1]
- button "▶ Re-run with Filters"

Green — this branch

GET /api/runs/:filename returns:

"run_dir":".agentv/results/runs/2026-05-06T00-00-00-000Z",
"suite_filter":"examples/features/basic/evals/dataset.eval.yaml"

UI renders the new actions:

- button "↻ Resume run"
- button "Rerun failed cases"
- button "▶ Re-run with Filters"

/api/eval/preview produces the expected CLI invocations and validation rejects mode combos with 400. Read-only mode hides both buttons and the API still returns 403.
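From the browser side, exercising the preview endpoint might look like the sketch below; the request body and the cli_args response field are assumptions about the wire format, not the actual Studio client code:

```typescript
// Hypothetical client call against /api/eval/preview; a 400 is surfaced
// when mutually exclusive resume modes are combined.
async function previewRerunFailed(
  suite: string,
  runDir: string,
): Promise<string[]> {
  const res = await fetch("/api/eval/preview", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ suite, output: runDir, rerun_failed: true }),
  });
  if (res.status === 400) {
    throw new Error(`invalid resume options: ${await res.text()}`);
  }
  // Assumed response shape: the CLI argv Studio would spawn.
  const { cli_args } = (await res.json()) as { cli_args: string[] };
  return cli_args;
}
```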

Live e2e UAT — real Azure OpenAI run

Built a 2-test eval, ran it with --budget-usd 0.000001 --workers 1 to deliberately trigger one execution_error (budget_exceeded on the second test). Then opened the run in Studio and clicked Rerun failed cases.

Eval definition (tiny.eval.yaml):

tests:
  - id: cheap-greet
    criteria: Assistant says hello.
    input: "Say hello in one short sentence."
    expected_output: "Hello!"
  - id: also-cheap-greet
    criteria: Assistant says goodbye.
    input: "Say goodbye in one short sentence."
    expected_output: "Goodbye!"

Initial state (index.jsonl, before button click):

{"test_id":"cheap-greet",      "execution_status":"ok",              "score":1, "timestamp":"05:36:23.544Z"}
{"test_id":"also-cheap-greet", "execution_status":"execution_error", "score":0, "timestamp":"05:36:23.553Z"}

UI snapshot at /runs/2026-05-06T05-36-19-075Z — both new buttons visible, header shows 50% pass rate:

- heading "azure" [level=1]
- button "↻ Resume run"
- button "Rerun failed cases"
- button "▶ Re-run with Filters"
- cell "✓"   - cell "cheap-greet"      - cell "100%"  (1.5s)
- cell "!"   - cell "also-cheap-greet" - cell "ERR"   (0.0s)

Click "Rerun failed cases" → browser navigated to /jobs/studio-20260506-074017-xatb. The Studio job tracker showed status running then finished with exit code 0. Spawned CLI command (returned in the launch response):

agentv eval /tmp/agentv-e2e-oVjxrS/tiny.eval.yaml --target azure --output .agentv/results/runs/default/2026-05-06T05-36-19-075Z --rerun-failed

Final state (index.jsonl, after rerun finished):

{"test_id":"cheap-greet",      "execution_status":"ok", "score":1, "timestamp":"05:36:23.544Z"}  ← unchanged
{"test_id":"also-cheap-greet", "execution_status":"execution_error", "score":0, "timestamp":"05:36:23.553Z"}  ← original error row preserved
{"test_id":"also-cheap-greet", "execution_status":"ok", "score":1, "timestamp":"05:40:22.003Z"}  ← re-run, now passing

Acceptance: previously-passing test was skipped (timestamp on cheap-greet unchanged); errored test was re-run and now passes; pass rate updated from 50% → 67% in the Studio header.

Pre-existing bug discovered

The live e2e surfaced an unrelated bug in resolveCliPath (off-by-one in the currentDir fallback path) which prevents Studio from spawning the CLI when run from source against a foreign cwd. Filed as #1221 and worked around in this UAT with an agentv PATH shim. Not in scope for this PR — the global-install path used by end users is unaffected.

🤖 Generated with Claude Code

feat(studio): expose eval resumability — API + Resume action on run detail

Surfaces the existing CLI resume mechanics (--resume, --rerun-failed,
--retry-errors, --output) in Studio so users can finish an interrupted
run from the web UI instead of dropping to a terminal.

Server:
- RunEvalRequest accepts resume / rerun_failed / retry_errors / output.
- buildCliArgs translates them to the corresponding CLI flags.
- Mutual-exclusivity validation returns 400 with a usable error.
- Read-only guard now also covers /api/benchmarks/:id/eval/run.
- handleRunDetail returns run_dir + suite_filter (from benchmark.json's
  metadata.eval_file) for local runs so the UI can target the same
  workspace.

UI:
- New ResumeRunActions component renders "Resume run" + "Rerun failed
  cases" buttons on /runs/:runId (and the benchmark-scoped variant)
  when at least one row has executionStatus === 'execution_error'.
- Hidden in read-only mode; disabled with an explanatory tooltip when
  run_dir or suite_filter cannot be resolved (e.g. remote runs).
- After launch, navigates to /jobs/:runId for live progress.
- Pure helpers (shouldShowResumeActions, buildResumeRequestBody)
  are unit-tested without rendering React.

Closes #1219

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cloudflare-workers-and-pages Bot commented May 6, 2026

Deploying agentv with Cloudflare Pages

Latest commit: e686bef
Status: ✅  Deploy successful!
Preview URL: https://763dfa33.agentv.pages.dev
Branch Preview URL: https://feat-1219-studio-resume.agentv.pages.dev


- validateResumeOptions: trim retry_errors before counting it as a mode
  (matches buildCliArgs trim, so whitespace-only strings can no longer
  pass validation but emit no flag)
- deriveResumeMeta: explicitly handle '' from path.relative (runDir ===
  cwd) by falling through to the absolute path; previous truthiness
  check would have done the same but was less obvious

Both nits surfaced in code review of #1220.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Collaborator Author

christso commented May 6, 2026

UAT screenshots

⚠️ Hosted on litterbox.catbox.moe — links expire in 72 hours (gist refused the binaries; imgur/0x0.st require API keys or are down). I can re-host on a permanent side branch if needed.

1. Red — main branch (no Resume button)

Run detail page on main against the fixture: only "▶ Re-run with Filters" is shown, even with one row in execution_error state.

main branch — no resume button

2. Green — feature branch (synthetic fixture)

Both "↻ Resume run" and "Rerun failed cases" buttons render alongside the existing "▶ Re-run with Filters".

feature branch — both buttons visible

3. Read-only mode

agentv studio --read-only: both buttons are hidden client-side (server still 403s the launch endpoint as defence-in-depth).

read-only — buttons hidden

4. Live e2e — /jobs/<id> after clicking Rerun

After triggering "Rerun failed cases" on a real run with one Azure-OpenAI execution_error (forced via --budget-usd 0.000001), the browser navigated to the jobs page. Status is Finished, exit code 0, and the spawned command is the expected agentv eval … --output … --rerun-failed.

/jobs/<id> page after click

Collaborator Author

christso commented May 6, 2026

UAT screenshots (re-hosted)

The earlier litterbox.catbox.moe links rendered as broken images — GitHub's camo proxy refuses that host. Re-hosted on the throwaway screenshots/pr-1220 branch (never merged into main).

1. Red — main branch (no Resume button)

Run detail page on main against the synthetic fixture: only "▶ Re-run with Filters" is shown, even with one row in execution_error.

main branch — no resume button

2. Green — feature branch (synthetic fixture)

Both "↻ Resume run" and "Rerun failed cases" buttons render alongside the existing "▶ Re-run with Filters".

feature branch — both buttons visible

3. Read-only mode

agentv studio --read-only: both buttons are hidden client-side (server still 403s the launch endpoint as defence-in-depth).

read-only — buttons hidden

4. Live e2e — /jobs/<id> after clicking Rerun

After triggering "Rerun failed cases" on a real Azure-OpenAI run with one execution_error (forced via --budget-usd 0.000001), the browser navigated to the jobs page. Status Finished, exit code 0, spawned command is the expected agentv eval … --output … --rerun-failed.

/jobs/<id> page after click
