This is an agent-ready plumbing issue. The runner should make the benchmark reproducible from the shell without deciding the corpus or claiming results.
Goal
Add a reproducible benchmark runner CLI that loads a corpus and competitor manifest, executes configured benchmark cells, and writes raw artifacts in a stable layout.
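As a rough illustration of the goal, the sketch below shows one way the command-line surface and the cell planning could look. The --corpus, --manifest, and --out flags are taken from the example invocation in the acceptance criteria; the --dry-run flag name, the plan_cells helper, and the idea that a cell is a (tool, model) pair built from two plain lists are assumptions for illustration, since the issue does not define the manifest schema.

```python
import argparse
from itertools import product


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags in the example invocation."""
    parser = argparse.ArgumentParser(prog="benchmarks")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="execute configured benchmark cells")
    run.add_argument("--corpus", required=True, help="path to the corpus file")
    run.add_argument("--manifest", required=True, help="path to the competitor manifest")
    run.add_argument("--out", required=True, help="directory for raw artifacts")
    # Hypothetical flag name: the issue only asks for "a dry-run mode".
    run.add_argument("--dry-run", action="store_true",
                     help="validate inputs and list planned cells only")
    return parser


def plan_cells(tools, models):
    """Return every (tool, model) cell in a deterministic, sorted order.

    Hypothetical helper: a stable ordering keeps dry-run output and
    artifact paths reproducible across runs.
    """
    return [{"tool": t, "model": m}
            for t, m in product(sorted(tools), sorted(models))]
```

Sorting before taking the product is a deliberate choice here: it makes the plan independent of the order in which entries happen to appear in the manifest.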
Acceptance criteria
A tool failure, timeout, or MCP protocol crash on an individual query is explicitly recorded as an error and scored as 0.0 for correctness; the denominator remains exactly 50 for every eligible tool/model cell.
A benchmark CLI exists, for example uv run python -m benchmarks run --corpus docs/benchmarks/corpus.yml --manifest docs/benchmarks/competitors.yml --out docs/benchmarks/results/<run-id>/.
The CLI writes raw artifacts: transcripts, token records, latency records, scoring placeholders, competitor manifest snapshot, environment metadata, repo commit SHA, and run summary.
The runner supports a dry-run mode that validates corpus/manifest and emits planned benchmark cells without calling external providers.
The runner can execute at least the no-MCP baseline adapter through a fake/mock provider in tests.
Tests cover artifact paths, metadata capture, dry-run behavior, duplicate/missing corpus IDs, and failure recording for a competitor cell.
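Two of the criteria above are precise enough to sketch: rejecting duplicate corpus IDs, and scoring a cell so that errors count as 0.0 while the denominator stays fixed. The following is a minimal sketch, not the runner's actual design. The QueryResult type, the function names, the error labels, and the choice to treat a query with no result at all as an error are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

EXPECTED_QUERIES = 50  # fixed denominator stated in the acceptance criteria


@dataclass
class QueryResult:
    """Hypothetical per-query record for one tool/model cell."""
    query_id: str
    correctness: float = 0.0
    error: Optional[str] = None  # e.g. "timeout"; label values are illustrative


def find_duplicate_ids(ids):
    """Return corpus IDs that appear more than once, sorted."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


def score_cell(corpus_ids, results):
    """Score one cell over the whole corpus.

    Errored queries contribute 0.0 and are listed explicitly. Queries with
    no result are also recorded as errors (an assumption, not from the issue).
    The denominator is always the corpus size, never the number of successes.
    """
    if len(corpus_ids) != EXPECTED_QUERIES:
        raise ValueError(f"expected {EXPECTED_QUERIES} queries, got {len(corpus_ids)}")
    by_id = {r.query_id: r for r in results}
    total = 0.0
    errors = []
    for qid in corpus_ids:
        result = by_id.get(qid)
        if result is None:
            errors.append({"query_id": qid, "error": "missing_result"})
        elif result.error is not None:
            errors.append({"query_id": qid, "error": result.error})
        else:
            total += result.correctness
    return {
        "denominator": len(corpus_ids),
        "score": total / len(corpus_ids),
        "errors": errors,
    }
```

Dividing by the corpus size rather than by the number of completed queries is what prevents a flaky tool from looking better simply because its failures were dropped.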
Context
Parent: #63. Methodology: docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md.
Scope boundaries
In scope:
Out of scope:
Forbidden-territory reminder
Do not modify MCP tool names, parameters, return shapes, schema.sql, .github/workflows/, pyproject.toml project metadata, .planning/POSITIONING.md, the README hero section, LICENSE, SECURITY.md, or existing tests by weakening/deleting assertions.
Validation commands
Run any new benchmark-runner tests directly as well.
PR template
Use Refs #63, not Closes #63.
The PR must include:
Recovery
If a stable artifact layout requires a methodology change, stop and comment on #63 before implementing.
Effort estimate
4-6 hours.