feat(cli): prepare AgentV workspaces for manual or external-agent attempts #1263

## Decision

Do **not** build `agentv watch`.

The useful part of the original request is not a new live-observability runner but a smaller primitive: **materialize the same AgentV eval workspace and task prompt so a human or external agent can attempt the case manually, then let AgentV grade the resulting workspace and optional trace later.**

Live tracing should be configured in the harness or the agent itself, for example through Opik/Phoenix/OTel environment variables, target hooks, or provider-native config. AgentV can import and grade those traces after the fact through its normalized trajectory import work. AgentV should not become a parallel live-observability product surface.
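As a rough illustration of "configure tracing in the harness", a harness could set the standard OpenTelemetry exporter variables before launching the agent. `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_SERVICE_NAME` are standard OTel SDK variables; the `tracing_env` helper, the endpoint value, and the service name are assumptions made for this sketch, not part of AgentV.

```python
import os


def tracing_env(base_env, endpoint, service_name):
    """Return a copy of base_env with standard OTel exporter variables set."""
    env = dict(base_env)
    env["OTEL_EXPORTER_OTLP_ENDPOINT"] = endpoint
    env["OTEL_SERVICE_NAME"] = service_name
    return env


# Assumed values: a local OTLP/HTTP collector and an arbitrary service name.
env = tracing_env(os.environ, "http://localhost:4318", "external-agent")
# A harness would then pass `env` to the agent process it launches,
# for example subprocess.run([...], env=env). AgentV is not involved.
```

The point of the sketch is that tracing setup lives entirely outside `agentv prepare`.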

## Revised scope: `agentv prepare`

Add a CLI command that prepares one eval case for a manual or external-agent attempt:

```bash
agentv prepare evals/foo.eval.yaml --test-id case-1 --target codex --out /tmp/agentv-case-1
```

Expected output directory:

```text
/tmp/agentv-case-1/
  workspace/              # materialized workspace: repos, template, and hooks applied
  prompt.md               # task prompt to give the human or external agent
  agentv_prepare.json     # snake_case manifest with eval/test/target/workspace metadata
```

`agentv prepare` should:

- Resolve the eval file, selected test case, and target.
- Materialize the workspace using existing AgentV primitives: template copy, `workspace.repos`, Docker/static/pooled/temp behavior where applicable, and lifecycle hooks.
- Run setup only: workspace `before_all`, target `before_all`, workspace `before_each`, target `before_each`.
- Generate a task prompt suitable for the human or external agent.
- Write an auditable manifest with repo pins, selected test, target, hook order, prompt path, workspace path, and created timestamp.
- Print the workspace path and next-step command hints.
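To make the manifest requirement concrete, here is a minimal sketch of what writing `agentv_prepare.json` might look like. Every key name, the hook labels, and both helper functions are hypothetical; the actual schema still has to be defined as part of this issue. Only the file name, the snake_case convention, and the setup hook order come from the list above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Setup-only hook order from the list above (the labels are hypothetical).
SETUP_HOOK_ORDER = [
    "workspace.before_all",
    "target.before_all",
    "workspace.before_each",
    "target.before_each",
]


def build_manifest(eval_path, test_id, target, out_dir, repo_pins):
    """Assemble a snake_case manifest dict; all key names are assumptions."""
    out = Path(out_dir)
    return {
        "eval_path": str(eval_path),
        "test_id": test_id,
        "target": target,
        "workspace_path": str(out / "workspace"),
        "prompt_path": str(out / "prompt.md"),
        "hook_order": list(SETUP_HOOK_ORDER),
        "repo_pins": repo_pins,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def write_manifest(manifest, out_dir):
    """Write the manifest next to workspace/ and prompt.md; return its path."""
    path = Path(out_dir) / "agentv_prepare.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    return path
```

Recording `hook_order` and `repo_pins` explicitly is what lets a later grading step audit how the workspace was produced without re-running setup.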

`agentv prepare` must not:

- Launch the agent.
- Run graders.
- Claim an eval run completed.
- Emit Opik/Phoenix/OTel traces directly.
- Expose hidden tests, expected output, grader internals, or rubrics in `prompt.md` unless explicitly allowed by eval config.
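One way to enforce the no-leakage rule is to build `prompt.md` from an allowlist of fields rather than filtering out known-hidden ones, so a newly added grader field cannot leak by default. The sketch below assumes a parsed eval case is available as a dict; the field names (`input`, `instructions`, `expected_output`) and the `build_prompt` helper are hypothetical and may not match the real eval schema.

```python
# Fields assumed safe to show to the human or external agent (hypothetical names).
SAFE_FIELDS = ("input", "instructions")


def build_prompt(case, explicitly_allowed=()):
    """Render prompt text from allowlisted fields only.

    Anything not in SAFE_FIELDS is omitted unless the eval config has
    explicitly allowed it, which the caller passes in `explicitly_allowed`.
    """
    sections = []
    for field in (*SAFE_FIELDS, *explicitly_allowed):
        value = case.get(field)
        if value:
            sections.append(f"## {field}\n\n{str(value).strip()}\n")
    return "\n".join(sections)
```

With this shape, `build_prompt(case)` never includes expected output, while `build_prompt(case, explicitly_allowed=("expected_output",))` models the "unless explicitly allowed by eval config" escape hatch.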

## Follow-up: grade a prepared attempt

After a human or external agent works in the prepared workspace, AgentV should be able to grade the final state without re-running the target:

```bash
agentv grade evals/foo.eval.yaml --test-id case-1 --prepared /tmp/agentv-case-1
```

If trajectory matters and an external trace exists, grade with an imported trace:

```bash
agentv grade evals/foo.eval.yaml --test-id case-1 \
  --prepared /tmp/agentv-case-1 \
  --trace opik:<project>/<trace-id>
```

The trace path should route through AgentV's normalized trajectory import/projection work. Provider-native session logs, Opik traces, and OTLP files should be adapters into AgentV's trajectory model; graders should not consume vendor-specific trace shapes directly.
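The adapter boundary described above can be sketched as follows. The `TrajectoryStep` type and both input shapes are invented for illustration; they are not AgentV's trajectory model, Opik's trace format, or the OTLP schema. What the sketch shows is only the layering: each source gets an adapter, and grader-side code sees the normalized type alone.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TrajectoryStep:
    """Hypothetical normalized step; the real trajectory model may differ."""
    kind: str      # e.g. "tool_call" or "message"
    name: str
    payload: dict


def from_session_jsonl(text):
    """Adapter for an assumed session log: one JSON object per line."""
    steps = []
    for line in text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        steps.append(
            TrajectoryStep(
                kind=event["type"],
                name=event.get("name", ""),
                payload=event.get("data", {}),
            )
        )
    return steps


def from_span_list(spans):
    """Adapter for an assumed list of span dicts, ordered by start time."""
    ordered = sorted(spans, key=lambda s: s["start_time"])
    return [
        TrajectoryStep(
            kind="tool_call" if s.get("tool") else "message",
            name=s["name"],
            payload=s.get("attributes", {}),
        )
        for s in ordered
    ]


def tool_call_names(steps):
    """Grader-side helper: it only ever receives TrajectoryStep objects."""
    return [s.name for s in steps if s.kind == "tool_call"]
```

Because `tool_call_names` depends only on `TrajectoryStep`, adding a new trace source means writing one more adapter rather than touching any grader.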

## Product boundary

This replaces the old `agentv watch` proposal:

- Workspace preparation is an AgentV primitive.
- Live observability belongs in Opik/Phoenix/OTel or provider-native instrumentation.
- AgentV remains the source of truth for eval definitions, workspace setup, grading, result bundles, and CI gates.
- Opik/Phoenix traces are optional input artifacts for later grading, not a reason to add a live non-eval runner.

## Implementation plan

1. Add `agentv prepare` for a single eval/test/target.
2. Reuse existing workspace materialization and hook execution instead of duplicating setup logic.
3. Define and write `agentv_prepare.json` with snake_case wire keys.
4. Generate `prompt.md` from the eval input and safe user-facing task context.
5. Add `agentv grade --prepared` or an equivalent prepared-attempt grading command that reuses existing graders against the final workspace state without running the target.
6. Wire optional trace input through normalized trajectory import work; start with local trace/session files if Opik direct fetch depends on existing export/import infrastructure.
7. Document the workflow: prepare workspace → human/external agent works → grade prepared attempt → optionally import/grade traces.

## Acceptance criteria

- `agentv prepare <eval> --test-id <id> --target <target> --out <dir>` creates a prepared workspace with `workspace/`, `prompt.md`, and `agentv_prepare.json`.
- The prepared workspace matches the setup state an AgentV eval run would give the selected target before target execution.
- Setup uses existing workspace/repo/hook primitives and respects target hooks.
- `prompt.md` contains the task input and safe execution instructions, but not hidden expected outputs or grader internals.
- The manifest uses `snake_case` and is sufficient to audit and later grade the attempt.
- A prepared attempt can be graded without re-running the agent target.
- Optional trace/session input can enrich trajectory-based graders through the normalized trace import path.
- Docs explain how to configure Opik/Phoenix/OTel in the harness or target hooks instead of using `agentv watch`.
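The first and fifth criteria are mechanical enough to check in a test. The sketch below is one possible shape for such a check, assuming only the output layout named in this issue; `check_prepared_dir` and `non_snake_keys` are hypothetical helpers, and the snake_case rule is applied to every key, so a manifest that uses data values (such as repo names) as keys would need an exemption.

```python
import json
import re
from pathlib import Path

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def non_snake_keys(value):
    """Recursively collect dict keys that are not snake_case."""
    bad = []
    if isinstance(value, dict):
        for key, child in value.items():
            if not SNAKE_CASE.match(key):
                bad.append(key)
            bad.extend(non_snake_keys(child))
    elif isinstance(value, list):
        for child in value:
            bad.extend(non_snake_keys(child))
    return bad


def check_prepared_dir(out_dir):
    """Return a list of problems; an empty list means the layout checks pass."""
    out = Path(out_dir)
    problems = []
    if not (out / "workspace").is_dir():
        problems.append("missing workspace/ directory")
    if not (out / "prompt.md").is_file():
        problems.append("missing prompt.md")
    manifest_path = out / "agentv_prepare.json"
    if not manifest_path.is_file():
        problems.append("missing agentv_prepare.json")
    else:
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        problems += [f"non-snake_case key: {k}" for k in non_snake_keys(manifest)]
    return problems
```

The remaining criteria, such as parity with the setup state of a real eval run, need integration tests against the existing workspace primitives and are not covered by this check.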

## Non-goals

- No `agentv watch` command.
- No AgentV-hosted live observability product.
- No built-in Opik/Phoenix tracing runner inside `prepare`.
- No hidden-test or oracle leakage into manual prompts.
- No new benchmark-family-specific workspace schema.

## References

- Research wiki: `concepts/benchmark-provenance-workspace-patterns.md`
- Research wiki: `concepts/production-llm-evaluation-observability-loop.md`
- Research wiki: `concepts/harness-quality-evaluation.md`
- Existing AgentV workspace docs: `apps/web/src/content/docs/docs/guides/workspace-architecture.mdx`

