Decision
Do not build agentv watch.
The useful part of the original request is not a new live-observability runner. The useful primitive is: materialize the same AgentV eval workspace and task prompt so a human or external agent can attempt the case manually, then let AgentV grade the resulting workspace and optional trace later.
Live tracing should be configured in the harness or agent itself, for example through Opik/Phoenix/OTel env vars, target hooks, or provider-native config. AgentV can import or grade those traces after the fact through the normalized trajectory/import work. AgentV should not become a parallel live observability product surface.
Revised scope: agentv prepare
Add a CLI command that prepares one eval case for a manual or external-agent attempt:
agentv prepare evals/foo.eval.yaml --test-id case-1 --target codex --out /tmp/agentv-case-1
Expected output directory:
/tmp/agentv-case-1/
  workspace/            # materialized repo/template/hooks applied
  prompt.md             # task prompt to give the human or external agent
  agentv_prepare.json   # snake_case manifest with eval/test/target/workspace metadata
agentv prepare should:
- Resolve the eval file, selected test case, and target.
- Materialize the workspace using existing AgentV primitives: template copy, workspace.repos, Docker/static/pooled/temp behavior where applicable, and lifecycle hooks.
- Run setup only: workspace before_all, target before_all, workspace before_each, target before_each.
- Generate a task prompt suitable for the human or external agent.
- Write an auditable manifest with repo pins, selected test, target, hook order, prompt path, workspace path, and created timestamp.
- Print the workspace path and next-step command hints.
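As a rough illustration of the manifest described in the list above, here is a minimal Python sketch. Only the snake_case convention and the listed contents (repo pins, selected test, target, hook order, prompt path, workspace path, created timestamp) come from this proposal. The field names, the dotted hook labels, and the example repo pin are assumptions made for illustration, not a defined schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Setup-only hook order from the list above. The dotted labels are
# an illustrative notation, not an existing AgentV identifier format.
SETUP_HOOK_ORDER = [
    "workspace.before_all",
    "target.before_all",
    "workspace.before_each",
    "target.before_each",
]


@dataclass
class PrepareManifest:
    # All field names are hypothetical; the proposal only fixes
    # snake_case keys and the kinds of metadata to record.
    eval_path: str
    test_id: str
    target: str
    workspace_path: str
    prompt_path: str
    repo_pins: dict[str, str] = field(default_factory=dict)
    hook_order: list[str] = field(default_factory=lambda: list(SETUP_HOOK_ORDER))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # sort_keys gives a stable, diff-friendly file for auditing.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


manifest = PrepareManifest(
    eval_path="evals/foo.eval.yaml",
    test_id="case-1",
    target="codex",
    workspace_path="/tmp/agentv-case-1/workspace",
    prompt_path="/tmp/agentv-case-1/prompt.md",
    repo_pins={"example/repo": "0123abc"},  # illustrative pin, not real data
)
print(manifest.to_json())
```

Recording the hook order explicitly lets a later grading step confirm that the prepared state matches what a normal eval run would have set up.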
agentv prepare must not:
- Launch the agent.
- Run graders.
- Claim an eval run completed.
- Emit Opik/Phoenix/OTel traces directly.
- Expose hidden tests, expected output, grader internals, or rubrics in prompt.md unless explicitly allowed by eval config.
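The leakage rule above could be enforced with a deny-by-default filter when rendering prompt.md. The sketch below assumes a test case is a plain dictionary. The key names (`input`, `expected_output`, `hidden_tests`, `graders`, `rubrics`) and the `allow_in_prompt` parameter are hypothetical stand-ins for whatever the eval schema and config actually expose.

```python
# Keys that must never reach prompt.md by default. These names are
# hypothetical; the real eval schema may use different ones.
HIDDEN_KEYS = {"expected_output", "hidden_tests", "graders", "rubrics"}


def render_prompt(test_case: dict, allow_in_prompt: frozenset = frozenset()) -> str:
    """Build prompt.md text from the task input plus explicitly allowed fields."""
    lines = ["# Task", "", test_case["input"].strip(), ""]
    # Only hidden keys that the eval config explicitly allows are rendered.
    for key in sorted(HIDDEN_KEYS & set(allow_in_prompt)):
        if key in test_case:
            lines += [f"## {key}", "", str(test_case[key]), ""]
    return "\n".join(lines)


case = {
    "id": "case-1",
    "input": "Fix the failing unit test in src/parser.",
    "expected_output": "all tests pass",
    "rubrics": ["does not delete tests"],
}
prompt = render_prompt(case)
print(prompt)
```

Because the function copies only the task input and an allow-list of fields, a new grader-only key added to the eval file stays out of the prompt unless someone opts it in.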
Follow-up: grade a prepared attempt
After a human or external agent works in the prepared workspace, AgentV should be able to grade the final state without re-running the target:
agentv grade evals/foo.eval.yaml --test-id case-1 --prepared /tmp/agentv-case-1
If trajectory matters and an external trace exists, grade with an imported trace:
agentv grade evals/foo.eval.yaml --test-id case-1 \
--prepared /tmp/agentv-case-1 \
--trace opik:<project>/<trace-id>
The trace path should route through AgentV's normalized trajectory import/projection work. Provider-native session logs, Opik traces, and OTLP files should be adapters into AgentV's trajectory model; graders should not consume vendor-specific trace shapes directly.
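The adapter boundary described above might look roughly like the following sketch. The `TrajectoryStep` shape, the adapter names, and the input record layouts are assumptions for illustration. In particular, the span and session records here are simplified dictionaries, not the real OTLP JSON or any provider's session-log format, and no Opik fetch is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrajectoryStep:
    # Hypothetical normalized step; the real trajectory model may differ.
    kind: str
    name: str
    payload: dict


def from_otlp_spans(spans: list[dict]) -> list[TrajectoryStep]:
    # Simplified span records: {"name": ..., "attributes": {...}}.
    return [TrajectoryStep("span", s["name"], s.get("attributes", {})) for s in spans]


def from_session_log(events: list[dict]) -> list[TrajectoryStep]:
    # Simplified session events: {"type": ..., "tool": ...}.
    return [TrajectoryStep(e["type"], e.get("tool", "message"), e) for e in events]


# Registry of adapters keyed by the scheme part of a --trace value.
ADAPTERS = {"otlp": from_otlp_spans, "session": from_session_log}


def parse_trace_ref(ref: str) -> tuple[str, str]:
    """Split a '<scheme>:<locator>' reference such as 'opik:<project>/<trace-id>'."""
    scheme, sep, locator = ref.partition(":")
    if not sep or not scheme or not locator:
        raise ValueError(f"expected '<scheme>:<locator>', got {ref!r}")
    return scheme, locator


def normalize(scheme: str, raw: list[dict]) -> list[TrajectoryStep]:
    if scheme not in ADAPTERS:
        raise ValueError(f"no trajectory adapter registered for {scheme!r}")
    return ADAPTERS[scheme](raw)


def tool_names(steps: list[TrajectoryStep]) -> list[str]:
    # A grader-side helper that only ever sees normalized steps.
    return [s.name for s in steps if s.kind == "tool_call"]


steps = normalize("session", [{"type": "tool_call", "tool": "bash"}, {"type": "message"}])
print(tool_names(steps))
```

Keeping vendor shapes behind the adapter registry means a grader like `tool_names` does not change when a new trace source is added.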
Product boundary
This replaces the old agentv watch proposal:
- Workspace preparation is an AgentV primitive.
- Live observability belongs in Opik/Phoenix/OTel or provider-native instrumentation.
- AgentV remains the source of truth for eval definitions, workspace setup, grading, result bundles, and CI gates.
- Opik/Phoenix traces are optional input artifacts for later grading, not a reason to add a live non-eval runner.
Implementation plan
- Add agentv prepare for a single eval/test/target.
- Reuse existing workspace materialization and hook execution instead of duplicating setup logic.
- Define and write agentv_prepare.json with snake_case wire keys.
- Generate prompt.md from the eval input and safe user-facing task context.
- Add agentv grade --prepared or an equivalent prepared-attempt grading command that reuses existing graders against the final workspace state without running the target.
- Wire optional trace input through normalized trajectory import work; start with local trace/session files if Opik direct fetch depends on existing export/import infrastructure.
- Document the workflow: prepare workspace → human/external agent works → grade prepared attempt → optionally import/grade traces.
Acceptance criteria
- agentv prepare <eval> --test-id <id> --target <target> --out <dir> creates a prepared workspace with workspace/, prompt.md, and agentv_prepare.json.
- The prepared workspace matches the setup state an AgentV eval run would give the selected target before target execution.
- Setup uses existing workspace/repo/hook primitives and respects target hooks.
- prompt.md contains the task input and safe execution instructions, but not hidden expected outputs or grader internals.
- The manifest uses snake_case and is sufficient to audit and later grade the attempt.
- A prepared attempt can be graded without re-running the agent target.
- Optional trace/session input can enrich trajectory-based graders through the normalized trace import path.
- Docs explain how to configure Opik/Phoenix/OTel in the harness or target hooks instead of using agentv watch.
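Some of the criteria above are mechanical enough to check in a test. The sketch below verifies only the output layout (workspace/, prompt.md, agentv_prepare.json) and that top-level manifest keys are snake_case. The function name and the manifest contents used in the demo are hypothetical, and nested keys are not checked.

```python
import json
import re
import tempfile
from pathlib import Path

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_prepared_dir(out_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the layout checks pass."""
    problems = []
    if not (out_dir / "workspace").is_dir():
        problems.append("missing workspace/")
    if not (out_dir / "prompt.md").is_file():
        problems.append("missing prompt.md")
    manifest_path = out_dir / "agentv_prepare.json"
    if not manifest_path.is_file():
        problems.append("missing agentv_prepare.json")
        return problems
    manifest = json.loads(manifest_path.read_text())
    # Only top-level keys are checked in this sketch.
    for key in manifest:
        if not SNAKE_CASE.match(key):
            problems.append(f"non-snake_case key: {key}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "workspace").mkdir()
    (out / "prompt.md").write_text("# Task\n")
    # Hypothetical manifest contents, for the demo only.
    (out / "agentv_prepare.json").write_text(json.dumps({"test_id": "case-1"}))
    result = check_prepared_dir(out)
    print(result)
```

The criterion that the prepared workspace matches a real eval run's pre-execution state needs a comparison against an actual run and is not covered here.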
Non-goals
- No agentv watch command.
- No AgentV-hosted live observability product.
- No built-in Opik/Phoenix tracing runner inside prepare.
- No hidden-test or oracle leakage into manual prompts.
- No new benchmark-family-specific workspace schema.
References
- Research wiki: concepts/benchmark-provenance-workspace-patterns.md
- Research wiki: concepts/production-llm-evaluation-observability-loop.md
- Research wiki: concepts/harness-quality-evaluation.md
- Existing AgentV workspace docs: apps/web/src/content/docs/docs/guides/workspace-architecture.mdx