[v0.5.0] benchmark runner — add reproducible CLI and artifact layout

## Context

Parent: #63. Methodology: `docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md`.

This is an agent-ready plumbing issue. The runner should make the benchmark reproducible from the shell without deciding the corpus or claiming results.

## Goal

Add a reproducible benchmark runner CLI that loads a corpus and competitor manifest, executes configured benchmark cells, and writes raw artifacts in a stable layout.
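The planning step can be sketched as below, assuming a corpus is a list of entries with an `id` field and the competitor manifest is a list of `tool`/`model` pairs. Those field names, and the `Cell`, `validate_corpus_ids`, and `plan_cells` names, are hypothetical; the real schemas come from the methodology document, not from this sketch.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One planned benchmark cell: a tool/model pair and the queries it will run."""
    tool: str
    model: str
    query_ids: tuple[str, ...]


def validate_corpus_ids(entries):
    """Return the query IDs in corpus order; raise ValueError on missing or duplicate IDs."""
    ids = []
    for index, entry in enumerate(entries):
        query_id = entry.get("id")  # "id" is an assumed field name
        if not query_id:
            raise ValueError(f"corpus entry {index} has no id")
        ids.append(query_id)
    duplicates = sorted(i for i, count in Counter(ids).items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate corpus ids: {duplicates}")
    return ids


def plan_cells(entries, competitors):
    """Build one cell per competitor without calling any provider."""
    ids = tuple(validate_corpus_ids(entries))
    return [Cell(c["tool"], c["model"], ids) for c in competitors]
```

Because planning touches only the parsed inputs, the same function could serve both dry-run output and the real run.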

## Acceptance criteria

- [ ] A tool failure, timeout, or MCP protocol crash on an individual query is explicitly recorded as an error and scored as `0.0` for correctness; the denominator remains exactly 50 for every eligible tool/model cell.
- [ ] A benchmark CLI exists, for example `uv run python -m benchmarks run --corpus docs/benchmarks/corpus.yml --manifest docs/benchmarks/competitors.yml --out docs/benchmarks/results/<run-id>/`.
- [ ] The CLI writes raw artifacts: transcripts, token records, latency records, scoring placeholders, competitor manifest snapshot, environment metadata, repo commit SHA, and run summary.
- [ ] The runner supports a dry-run mode that validates corpus/manifest and emits planned benchmark cells without calling external providers.
- [ ] The runner can execute at least the no-MCP baseline adapter through a fake/mock provider in tests.
- [ ] Tests cover artifact paths, metadata capture, dry-run behavior, duplicate/missing corpus IDs, and failure recording for a competitor cell.
- [ ] No README benchmark claims are added.
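The failure-recording criterion can be illustrated with a minimal sketch. `QueryRecord`, `record_failure`, and `cell_correctness` are hypothetical names, and the scores for successful queries are assumed to arrive from elsewhere, since scoring automation is out of scope here. The only point shown is that a failed query stays in the denominator with a score of `0.0`.

```python
from dataclasses import dataclass
from typing import Optional

EXPECTED_QUERIES = 50  # denominator stated in the acceptance criteria


@dataclass(frozen=True)
class QueryRecord:
    """Outcome of one query inside a tool/model cell."""
    query_id: str
    score: float
    error: Optional[str] = None  # set only when the query failed


def record_failure(query_id, kind, detail):
    """Record a tool failure, timeout, or protocol crash as an explicit zero."""
    return QueryRecord(query_id=query_id, score=0.0, error=f"{kind}: {detail}")


def cell_correctness(records, expected=EXPECTED_QUERIES):
    """Average score over a fixed denominator; refuse cells with dropped queries."""
    if len(records) != expected:
        raise ValueError(f"expected {expected} records, got {len(records)}")
    return sum(r.score for r in records) / expected
```

Raising on a short record list, rather than dividing by `len(records)`, is one way to keep a crashed query from silently shrinking the denominator.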

## Scope boundaries

In scope:
- Benchmark runner package/module.
- CLI argument parsing.
- Artifact schema/layout.
- Dry-run and fake-provider tests.
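A possible shape for the argument parsing and artifact layout, using only `argparse` and `pathlib` from the standard library. The `--corpus`, `--manifest`, and `--out` options mirror the example command in the acceptance criteria. The `--dry-run` flag name and every file name in `ARTIFACT_FILES` are assumptions made for illustration, not a decided layout.

```python
import argparse
from pathlib import Path

# Hypothetical file names, one per artifact kind named in the acceptance criteria.
ARTIFACT_FILES = {
    "transcripts": "transcripts.jsonl",
    "tokens": "tokens.jsonl",
    "latency": "latency.jsonl",
    "scoring": "scoring-placeholders.json",
    "manifest_snapshot": "competitors.snapshot.yml",
    "environment": "environment.json",
    "commit": "commit.txt",
    "summary": "summary.json",
}


def build_parser():
    """Parser for `python -m benchmarks run ...`."""
    parser = argparse.ArgumentParser(prog="benchmarks")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="plan or execute benchmark cells")
    run.add_argument("--corpus", type=Path, required=True)
    run.add_argument("--manifest", type=Path, required=True)
    run.add_argument("--out", type=Path, required=True)
    # Assumed flag name: validate inputs and print planned cells only.
    run.add_argument("--dry-run", action="store_true")
    return parser


def artifact_paths(out_dir):
    """Map each artifact kind to its path under the run's output directory."""
    return {kind: out_dir / name for kind, name in ARTIFACT_FILES.items()}
```

Keeping the layout in a single mapping would let tests assert on artifact paths without running any benchmark cell.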

Out of scope:
- Real OpenAI/Google API calls.
- Competitor-specific MCP adapters beyond interfaces/stubs.
- Correctness scoring automation.
- README result table.

## Forbidden-territory reminder

Do not modify MCP tool names, parameters, or return shapes, or any of `schema.sql`, `.github/workflows/`, `pyproject.toml` project metadata, `.planning/POSITIONING.md`, the README hero section, `LICENSE`, or `SECURITY.md`. Do not weaken or delete assertions in existing tests.

## Validation commands

```bash
uv run ruff check src/ tests/
uv run pyright src/
uv run pytest --tb=short -q
uv run python-docs-mcp-server doctor
```

Run any new benchmark-runner tests directly as well.

## PR template

Use `Refs #63`, not `Closes #63`.

The PR must include:
- Summary of artifact layout.
- Dry-run command output.
- Test output.
- Confirmation that no external provider calls were made.
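One way to back the last item with evidence is a fake provider that records every call it receives. The sketch below assumes a provider interface with a single `complete` method. That interface, `FakeProvider`, and `run_baseline` are hypothetical names, not the project's actual adapter API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Provider(Protocol):
    """Assumed minimal provider interface for the no-MCP baseline."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class FakeProvider:
    """In-memory provider: returns canned answers and logs each prompt."""
    canned: dict[str, str]
    calls: list[str] = field(default_factory=list)

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        if prompt not in self.canned:
            raise KeyError(f"no canned answer for prompt: {prompt!r}")
        return self.canned[prompt]


def run_baseline(provider: Provider, prompts):
    """Send each prompt to the provider and collect the answers by prompt."""
    return {prompt: provider.complete(prompt) for prompt in prompts}
```

A test can then assert that `fake.calls` holds exactly the expected prompts, which shows the run went through the fake rather than a network client.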

## Recovery

If a stable artifact layout requires a methodology change, stop and comment on #63 before implementing.

## Effort estimate

4-6 hours.

