Concepts

Shared domain vocabulary for this project — entities, named processes, and status concepts with project-specific meaning. Seeded with core domain vocabulary, then accretes as ce-compound and ce-compound-refresh process learnings; direct edits are fine. Glossary only, not a spec or catch-all.

Providers and Targets

Provider — an adapter plugin that connects AgentV's evaluation engine to a specific AI system (e.g., copilot CLI, copilot SDK, Claude API, pi). Each provider implements the request/response contract: given a test case, invoke the AI system and return its output. Providers are selected per-target in eval YAML and can be extended via the provider registry.

Target — the eval YAML declaration that activates a specific provider for an evaluation run. A target names the provider, supplies configuration (model, API keys, timeouts, passthrough args), and scopes to a subset of test cases when needed. A single eval file can declare multiple targets to compare AI systems side by side.

Provider runtime boundary — the process boundary between AgentV's evaluation orchestrator and the agent runtime a provider invokes. CLI-backed providers place the agent runtime outside the orchestrator; in-process SDK providers share the orchestrator process and need either a targeted transport fix or subprocess-style isolation when runtime teardown can threaten run artifact finalization.

Evaluation Model

Eval — The frozen task and grading definition: prompts, datasets, input files, fixtures, assertions, and judge criteria. An eval defines what is being tested, not which agent, model, setup variant, or run policy executes it.

Experiment — A committed run variant that selects how evals are executed: target or target matrix, setup, scripts, eval filters, repeat counts, timeouts, workers, budgets, and related run knobs. Experiments make A/B setup differences explicit while pointing at stable eval tasks.

Run manifest — The root index.jsonl file in a run bundle. It is the dashboard and tooling loading contract for per-case result rows and artifact locations, including fields such as artifact_dir, task_dir, summary_path, and grading_path.

Artifact sidecar — A file beside or below a test-case artifact directory that provides evidence for a result, such as summary.json, grading.json, result.json, transcripts, logs, or outputs. Sidecars are evidence, not the primary discovery mechanism for a run.

Evaluation Reliability

Repeat run — A configured request to execute the same eval case and target more than once in the same timestamped run bundle. Repeat runs measure stochastic reliability, verifier stability, and drift; they are not the default CI path.

Attempt — One concrete execution inside a repeat run. Attempts keep their own score, status, timing, trace, transcript, logs, and artifacts so aggregate results never hide individual evidence.

Pass rate — Assertion or expectation pass rate inside a grading result: passed assertions or expectations divided by total assertions or expectations. AgentV does not use pass_rate for repeat-attempt success frequency.

Attempt success rate — Repeat-run reliability metric equal to successful counted attempts divided by counted attempts. This is distinct from pass_rate, which is reserved for assertion or expectation pass rate within a grading result.

Gate policy — The explicit rule that decides whether repeated attempts pass CI, such as all_attempts_successful, any_attempt_successful, attempt_success_rate_at_least, or mean_pass_rate_at_least. Without a repeat-run gate policy, AgentV preserves the normal single-run gate behavior and treats repeat statistics as report data.

Flaky eval outcome — A repeat-run aggregate whose attempts disagree, or whose failure classification points at verifier, infrastructure, or timeout instability rather than a stable model-quality failure.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Concepts

Providers and Targets

Evaluation Model

Evaluation Reliability

FilesExpand file tree

CONCEPTS.md

Latest commit

History

CONCEPTS.md

File metadata and controls

Concepts

Providers and Targets

Evaluation Model

Evaluation Reliability