[v0.5.0] benchmark runner — add reproducible CLI and artifact layout #72

@ayhammouda

Context

Parent: #63. Methodology: docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md.

This is an agent-ready plumbing issue. The runner should make the benchmark reproducible from the shell without deciding the corpus or claiming results.

Goal

Add a reproducible benchmark runner CLI that loads a corpus and competitor manifest, executes configured benchmark cells, and writes raw artifacts in a stable layout.
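
A minimal sketch of how the CLI surface could look, using only the standard library. The `run` subcommand and the `--corpus`, `--manifest`, and `--out` options mirror the example command in the acceptance criteria; the `--dry-run` flag name and the `example-run` directory are assumptions for illustration, not decided names.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical `benchmarks run` subcommand."""
    parser = argparse.ArgumentParser(prog="benchmarks")
    subcommands = parser.add_subparsers(dest="command", required=True)

    run = subcommands.add_parser("run", help="execute or plan benchmark cells")
    run.add_argument("--corpus", type=Path, required=True)
    run.add_argument("--manifest", type=Path, required=True)
    run.add_argument("--out", type=Path, required=True)
    # Assumed flag name; the issue only asks for "a dry-run mode".
    run.add_argument("--dry-run", action="store_true")
    return parser


# Parse an explicit argument list so the sketch runs without a real shell.
args = build_parser().parse_args(
    [
        "run",
        "--corpus", "docs/benchmarks/corpus.yml",
        "--manifest", "docs/benchmarks/competitors.yml",
        "--out", "docs/benchmarks/results/example-run",  # hypothetical run id
        "--dry-run",
    ]
)
```

Making all three paths required keeps a run fully specified by its command line, which is the point of a reproducible runner.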

Acceptance criteria

  • A tool failure, timeout, or MCP protocol crash on an individual query is explicitly recorded as an error and scored as 0.0 for correctness; the denominator remains exactly 50 for every eligible tool/model cell.
  • A benchmark CLI exists, for example uv run python -m benchmarks run --corpus docs/benchmarks/corpus.yml --manifest docs/benchmarks/competitors.yml --out docs/benchmarks/results/<run-id>/.
  • The CLI writes raw artifacts: transcripts, token records, latency records, scoring placeholders, competitor manifest snapshot, environment metadata, repo commit SHA, and run summary.
  • The runner supports a dry-run mode that validates corpus/manifest and emits planned benchmark cells without calling external providers.
  • The runner can execute at least the no-MCP baseline adapter through a fake/mock provider in tests.
  • Tests cover artifact paths, metadata capture, dry-run behavior, duplicate/missing corpus IDs, and failure recording for a competitor cell.
  • No README benchmark claims are added.
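
The failure-recording criterion could be sketched roughly as follows. `QueryRecord`, `run_query`, and `cell_correctness` are hypothetical names, and the error string format is an assumption. Only the 0.0 score for errors and the fixed denominator of 50 come from the criteria above. Successful queries keep `correctness=None` as a scoring placeholder, since scoring automation is out of scope.

```python
from dataclasses import dataclass
from typing import Callable, Optional

QUERIES_PER_CELL = 50  # fixed denominator from the acceptance criteria


@dataclass
class QueryRecord:
    query_id: str
    answer: Optional[str] = None
    error: Optional[str] = None
    correctness: Optional[float] = None  # placeholder until scored


def run_query(query_id: str, call: Callable[[str], str]) -> QueryRecord:
    """Run one query; any failure becomes an explicit error record scored 0.0."""
    try:
        return QueryRecord(query_id, answer=call(query_id))
    except Exception as exc:  # tool failure, timeout, protocol crash, ...
        message = f"{type(exc).__name__}: {exc}"
        return QueryRecord(query_id, error=message, correctness=0.0)


def cell_correctness(records: list[QueryRecord]) -> float:
    """Mean correctness for a cell, always divided by QUERIES_PER_CELL."""
    if len(records) != QUERIES_PER_CELL:
        raise ValueError(f"expected {QUERIES_PER_CELL} records, got {len(records)}")
    scores = [r.correctness for r in records if r.correctness is not None]
    if len(scores) != len(records):
        raise ValueError("cell contains unscored records")
    return sum(scores) / QUERIES_PER_CELL
```

Because a failed query still produces a record, errors cannot shrink the denominator and inflate a cell's score.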

Scope boundaries

In scope:

  • Benchmark runner package/module.
  • CLI argument parsing.
  • Artifact schema/layout.
  • Dry-run and fake-provider tests.
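
A rough sketch of the dry-run path: validate corpus IDs, then emit planned cells without touching a provider. The corpus and manifest shapes (a list of entries with an `id` key, plus lists of tool and model names) are assumptions, since the real schemas are not fixed by this issue. `validate_corpus` and `plan_cells` are hypothetical helper names.

```python
from collections import Counter
from itertools import product


def validate_corpus(entries: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the corpus passed."""
    problems = []
    ids = []
    for index, entry in enumerate(entries):
        entry_id = entry.get("id")
        if not entry_id:
            problems.append(f"entry {index}: missing id")
        else:
            ids.append(entry_id)
    for entry_id, count in Counter(ids).items():
        if count > 1:
            problems.append(f"duplicate id {entry_id!r} ({count} occurrences)")
    return problems


def plan_cells(tools: list[str], models: list[str]) -> list[dict]:
    """Enumerate tool/model cells in a stable order; no provider is called."""
    return [{"tool": tool, "model": model} for tool, model in product(tools, models)]
```

Returning problems as a list instead of raising on the first one lets a dry run report every corpus issue in a single pass.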

Out of scope:

  • Real OpenAI/Google API calls.
  • Competitor-specific MCP adapters beyond interfaces/stubs.
  • Correctness scoring automation.
  • README result table.

Forbidden-territory reminder

Do not modify MCP tool names, parameters, return shapes, schema.sql, .github/workflows/, pyproject.toml project metadata, .planning/POSITIONING.md, the README hero section, LICENSE, SECURITY.md, or existing tests by weakening/deleting assertions.

Validation commands

uv run ruff check src/ tests/
uv run pyright src/
uv run pytest --tb=short -q
uv run python-docs-mcp-server doctor

Run any new benchmark-runner tests directly as well.

PR template

Use Refs #63, not Closes #63.

The PR must include:

  • Summary of artifact layout.
  • Dry-run command output.
  • Test output.
  • Confirmation that no external provider calls were made.
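
For the artifact-layout summary, one possible layout is sketched below. The artifact kinds come from the acceptance criteria; every file name, the JSON/JSONL formats, and the `run-1` directory are assumptions made for illustration. The commit SHA is passed in as a plain string here; how the runner obtains it is left open.

```python
import json
import platform
import tempfile
from pathlib import Path

# Hypothetical file names, one per artifact kind listed in the acceptance criteria.
ARTIFACT_FILES = {
    "transcripts": "transcripts.jsonl",
    "tokens": "tokens.jsonl",
    "latency": "latency.jsonl",
    "scoring": "scoring-placeholders.jsonl",
    "manifest_snapshot": "competitors.snapshot.yml",
    "environment": "environment.json",
    "summary": "summary.json",
}


def artifact_paths(run_dir: Path) -> dict[str, Path]:
    """Map each artifact kind to its path under the run directory."""
    return {kind: run_dir / name for kind, name in ARTIFACT_FILES.items()}


def write_environment(run_dir: Path, commit_sha: str) -> Path:
    """Write environment metadata, including the repo commit SHA, as JSON."""
    run_dir.mkdir(parents=True, exist_ok=True)
    path = artifact_paths(run_dir)["environment"]
    payload = {
        "commit_sha": commit_sha,
        "platform": platform.platform(),
        "python": platform.python_version(),
    }
    path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path


# Usage against a throwaway directory, so nothing lands in the repo.
with tempfile.TemporaryDirectory() as tmp:
    written = write_environment(Path(tmp) / "run-1", commit_sha="deadbeef")
    recorded = json.loads(written.read_text(encoding="utf-8"))
```

Keeping the name mapping in one table means tests can assert artifact paths without duplicating string literals.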

Recovery

If a stable artifact layout requires a methodology change, stop and comment on #63 before implementing.

Effort estimate

4-6 hours.

Metadata

Assignees

No one assigned

Labels

  • agent-ready (Issue passed AGENT-EXECUTION-PIPELINE.md §10 pre-flight; scoped for an autonomous agent)
  • enhancement (New feature or request)
  • priority:P2 (Medium priority)
