[v0.5.0] benchmark runner — add reproducible CLI and artifact layout by ayhammouda · Pull Request #75 · ayhammouda/python-docs-mcp-server

ayhammouda · 2026-06-08T21:52:39Z

Refs #63
Fixes #72

Summary

Added uv run python -m benchmarks run with --corpus, --manifest, --out, --run-id, and --dry-run.
Writes a stable raw artifact layout under the requested run directory: snapshots/, environment.json, planned-cells.json, run-summary.json, and, for executed cells, transcripts/, tokens/, latency/, scoring/, plus failures/ for failed cells.
Supports local fake/no-MCP baseline execution only; unsupported adapters are recorded as cell failures rather than contacted.
Records tool failures, timeouts, and MCP protocol crashes as explicit errors with correctness score: 0.0, keeps failed queries in the correctness denominator, and includes tool_model_key plus error categories for future [v0.5.0] benchmark reporting — generate raw report and README-safe summary #74 error/timeout reporting by tool+model pairing.
Added tests for artifact paths, metadata capture, dry-run behavior, duplicate/missing corpus IDs, and competitor-cell failure recording.

Dry-run output

uv run python -m benchmarks run --corpus "$tmpdir/corpus.yml" --manifest "$tmpdir/competitors.yml" --out "$tmpdir/results/run-1" --run-id run-1 --dry-run

{
  "dry_run": true,
  "external_provider_calls": false,
  "failed_cells": 0,
  "planned_cells": 1,
  "run_id": "run-1",
  "succeeded_cells": 0
}

Dry-run artifact files written:

environment.json
planned-cells.json
run-summary.json
snapshots/competitor-manifest.yml
snapshots/corpus.yml

Verification

uv run ruff check src/ tests/
# All checks passed!

uv run pyright src/
# 0 errors, 0 warnings, 0 informations

uv run pytest --tb=short -q
# 315 passed in 21.68s

uv run python-docs-mcp-server doctor
# All checks passed.

uv run pytest tests/benchmarks/test_runner.py -q
# 8 passed in 0.17s

Additional sanity checks for the new top-level benchmark module:

uv run ruff check benchmarks/ tests/benchmarks/test_runner.py
# All checks passed!

uv run pyright benchmarks/
# 0 errors, 0 warnings, 0 informations

External provider calls

No external provider calls were made. The runner records external_provider_calls: false, and issue #72 only implements local fake/no-MCP baseline plumbing.

Why this triggered supervisor review

The diff is over the autonomous-agent pipeline's 500-line review threshold because issue #72 adds the runner module and focused tests together. No forbidden files were modified, no runtime network access was added, and no public MCP tool surface changed.

coderabbitai · 2026-06-08T21:52:51Z

📝 Walkthrough

Walkthrough

A new test module tests/benchmarks/test_runner.py adds 206 lines of end-to-end coverage for the benchmark runner. It provides helpers to generate test corpus and manifest data, then validates artifact creation, metadata capture, input validation, dry-run behavior, failure recording, and CLI integration.

Changes

Benchmark runner test coverage

Layer / File(s)	Summary
Test helpers for corpus and manifest generation `tests/benchmarks/test_runner.py`	Three module-level helpers write YAML corpus and manifest files with default questions and competitors for test setup.
Artifact output and environment metadata validation `tests/benchmarks/test_runner.py`	One test verifies deterministic creation of output files including transcripts, tokens, latency, scoring, and run summaries. A second test asserts environment.json contains repo commit, non-unknown commit value, Python version prefix, and external_provider_calls metadata.
Corpus input validation `tests/benchmarks/test_runner.py`	Two tests verify that run_benchmark raises BenchmarkValidationError with specific messages when corpus IDs are duplicated or when required id fields are missing.
Dry-run, failure handling, and CLI integration `tests/benchmarks/test_runner.py`	A dry-run test asserts planning output is written while cell execution outputs are not produced. A failure test verifies error status and message are recorded in failure artifacts while external_provider_calls remains False. A CLI test invokes the benchmark runner via subprocess with --dry-run, asserts zero exit code, and validates JSON summary output.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A benchmark tester hops in with pride,
Eight tests now guard the runner's stride,
From corpus checks to CLI calls,
Each artifact path safely falls,
The dry run dances, failures are caught!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title '[v0.5.0] benchmark runner — add reproducible CLI and artifact layout' directly summarizes the main change: implementing a benchmark runner CLI with stable artifact layout.
Linked Issues check	✅ Passed	The PR fully addresses all acceptance criteria from `#72`: artifact paths, metadata capture, dry-run behavior, duplicate/missing corpus ID validation, and failure recording for competitor cells are all covered by the new test module.
Out of Scope Changes check	✅ Passed	All changes are scoped to the test module for the benchmark runner; no modifications to forbidden files, MCP tools, or existing tests are present.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch agent/72-benchmark-runner-cli

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbiteu · 2026-06-08T21:53:39Z

📝 Walkthrough

Walkthrough

This PR adds a comprehensive pytest-based test suite for the benchmark runner. The module exercises the run_benchmark function and python -m benchmarks run CLI, validating artifact output paths, environment metadata capture, dry-run mode, input validation, failure recording, and end-to-end CLI behavior without invoking external providers.

Changes

Benchmark Runner Test Suite

Layer / File(s)	Summary
Test utilities and fixture builders `tests/benchmarks/test_runner.py`	Imports for filesystem, serialization, and subprocess operations; `_write_yaml` persists YAML content; `_corpus` and `_manifest` helpers construct minimal YAML fixtures for benchmark questions and competitors.
Artifact output and environment metadata verification `tests/benchmarks/test_runner.py`	`test_runner_writes_stable_artifact_paths` validates deterministic output files and directories; `test_environment_metadata_captures_repo_commit_and_python` checks `environment.json` contains passed `run_id`, non-`unknown` `repo_commit`, Python version prefix, and `external_provider_calls=False`.
Dry-run execution mode `tests/benchmarks/test_runner.py`	`test_dry_run_writes_plan_without_cell_execution` asserts `dry_run=True` marks the run as dry, reports correct planned/succeeded/failed cell counts, writes `planned-cells.json`, and does not create `transcripts` directory.
Input validation error handling `tests/benchmarks/test_runner.py`	`test_duplicate_corpus_ids_are_rejected` and `test_missing_corpus_id_is_rejected` assert `run_benchmark` raises `BenchmarkValidationError` with specific messages for duplicate corpus IDs and missing required `id` fields.
Failure recording and transcripts `tests/benchmarks/test_runner.py`	`test_competitor_cell_failure_is_recorded` forces a competitor cell failure and verifies failure and transcript JSON output record expected status, error message content, and `external_provider_calls=False`.
CLI dry-run end-to-end integration `tests/benchmarks/test_runner.py`	`test_cli_dry_run_outputs_summary` invokes `python -m benchmarks run --dry-run` via subprocess, asserts zero exit code, validates JSON summary output to stdout with key fields, and checks `run-summary.json` is written.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A suite of tests, so fine and clean,
Benchmarks running, all scenes unseen,
Dry runs planned, metadata bright,
Failures caught in JSON's light,
The CLI hops—end-to-end in sight! 🏃✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Linked Issues check	⚠️ Warning	The change set adds only benchmark tests and does not implement the CLI runner, artifact layout, or failure-scoring behavior required by `#72`.	Implement the benchmark runner/CLI and artifact-generation logic first, then keep these tests as coverage for the required behaviors.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title is concise and accurately summarizes the benchmark runner's reproducible CLI and artifact-layout work.
Out of Scope Changes check	✅ Passed	No clear out-of-scope changes are evident; the added benchmark tests align with the requested runner coverage.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch agent/72-benchmark-runner-cli

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbiteu

🧹 Nitpick comments (1)

tests/benchmarks/test_runner.py (1)
146-176: ⚡ Quick win

Add assertions for failed-cell scoring artifacts and fixed denominator.

This test validates failure/transcript persistence, but it doesn’t assert the scoring contract from the runner objectives (failed eligible cells scored 0.0 with fixed denominator). Add assertions against the scoring artifact and/or summary fields to lock this behavior and prevent regressions.

As per coding guidelines, tests/**: "Focus review on meaningful assertions, regression coverage, fixture correctness, deterministic behavior, and avoiding network- or environment-dependent tests unless the test is explicitly marked as an integration smoke test."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmarks/test_runner.py` around lines 146 - 176, The test
test_competitor_cell_failure_is_recorded currently checks failure/transcript
persistence but lacks assertions that failed eligible cells are scored 0.0 with
the fixed denominator; update the test to read the scoring artifact (e.g., under
out_dir / "scores" / "failing-baseline" / "q001.json") and assert the recorded
score for that cell is 0.0 and that the denominator used for scoring is the
expected fixed value (e.g., 1), and also assert summary contains the
corresponding scoring fields (for example summary["scores"] or
summary["denominator"] and summary["failed_cells"]) to lock the scoring contract
produced by run_benchmark / BenchmarkConfig.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/benchmarks/test_runner.py`:
- Around line 146-176: The test test_competitor_cell_failure_is_recorded
currently checks failure/transcript persistence but lacks assertions that failed
eligible cells are scored 0.0 with the fixed denominator; update the test to
read the scoring artifact (e.g., under out_dir / "scores" / "failing-baseline" /
"q001.json") and assert the recorded score for that cell is 0.0 and that the
denominator used for scoring is the expected fixed value (e.g., 1), and also
assert summary contains the corresponding scoring fields (for example
summary["scores"] or summary["denominator"] and summary["failed_cells"]) to lock
the scoring contract produced by run_benchmark / BenchmarkConfig.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2b2430c5-9c12-4d44-9828-5bd0144c8fd5

📥 Commits

Reviewing files that changed from the base of the PR and between 80a093c and 14f615f.

⛔ Files ignored due to path filters (3)

benchmarks/__init__.py is excluded by none and included by none
benchmarks/__main__.py is excluded by none and included by none
benchmarks/runner.py is excluded by none and included by none

📒 Files selected for processing (1)

tests/benchmarks/test_runner.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/benchmarks/test_runner.py`:
- Line 92: Replace the environment-dependent assertion that
metadata["repo_commit"] != "unknown" with a deterministic presence/type check:
ensure the "repo_commit" key exists in metadata and its value is a non-empty
string (e.g., assert "repo_commit" in metadata and
isinstance(metadata["repo_commit"], str) and metadata["repo_commit"].strip() !=
""), rather than assuming git is available; this touches the test in
tests/benchmarks/test_runner.py and relates to the _repo_commit_sha() behavior
so you can still allow the value "unknown" in environments without git.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: bcb56153-286b-4a24-adca-1249fd4574a2

📥 Commits

Reviewing files that changed from the base of the PR and between 80a093c and 14f615f.

⛔ Files ignored due to path filters (3)

benchmarks/__init__.py is excluded by none and included by none
benchmarks/__main__.py is excluded by none and included by none
benchmarks/runner.py is excluded by none and included by none

📒 Files selected for processing (1)

tests/benchmarks/test_runner.py

coderabbitai · 2026-06-08T21:56:43Z

+    metadata = json.loads((out_dir / "environment.json").read_text(encoding="utf-8"))
+    assert metadata["run_id"] == "metadata"
+    assert metadata["repo_commit"]
+    assert metadata["repo_commit"] != "unknown"


⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoff

Remove or relax the environment-dependent assertion.

The assertion metadata["repo_commit"] != "unknown" will fail in environments without git (e.g., minimal CI containers, non-git test runs). Context snippet 4 shows that _repo_commit_sha() returns "unknown" when git is unavailable. This creates fragile, environment-dependent test behavior.

🛡️ Proposed fix to validate field presence without assuming git availability

assert metadata["run_id"] == "metadata" - assert metadata["repo_commit"] - assert metadata["repo_commit"] != "unknown" + assert "repo_commit" in metadata + assert isinstance(metadata["repo_commit"], str) assert metadata["python_version"].startswith(f"{sys.version_info.major}.")

As per coding guidelines: "Focus review on ... deterministic behavior, and avoiding network- or environment-dependent tests unless the test is explicitly marked as an integration smoke test."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/benchmarks/test_runner.py` at line 92, Replace the environment-dependent assertion that metadata["repo_commit"] != "unknown" with a deterministic presence/type check: ensure the "repo_commit" key exists in metadata and its value is a non-empty string (e.g., assert "repo_commit" in metadata and isinstance(metadata["repo_commit"], str) and metadata["repo_commit"].strip() != ""), rather than assuming git is available; this touches the test in tests/benchmarks/test_runner.py and relates to the _repo_commit_sha() behavior so you can still allow the value "unknown" in environments without git.

Source: Coding guidelines

agent: add benchmark runner CLI artifacts

14f615f

ayhammouda added the supervisor-review Vision supervisor decision required before further automation label Jun 8, 2026

coderabbiteu Bot reviewed Jun 8, 2026

View reviewed changes

agent: record benchmark cell failures as zero scores

e06914d

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[v0.5.0] benchmark runner — add reproducible CLI and artifact layout#75

[v0.5.0] benchmark runner — add reproducible CLI and artifact layout#75
ayhammouda wants to merge 2 commits into
mainfrom
agent/72-benchmark-runner-cli

ayhammouda commented Jun 8, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited by coderabbiteu Bot

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbiteu Bot commented Jun 8, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Uh oh!

coderabbiteu Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ayhammouda commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Dry-run output

Verification

External provider calls

Why this triggered supervisor review

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited by coderabbiteu Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbiteu Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Uh oh!

coderabbiteu Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ayhammouda commented Jun 8, 2026 •

edited

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited by coderabbiteu Bot

Loading

coderabbiteu Bot commented Jun 8, 2026 •

edited

Loading