Skip to content

[v0.5.0] benchmark runner — add reproducible CLI and artifact layout#75

Open
ayhammouda wants to merge 2 commits into
mainfrom
agent/72-benchmark-runner-cli
Open

[v0.5.0] benchmark runner — add reproducible CLI and artifact layout#75
ayhammouda wants to merge 2 commits into
mainfrom
agent/72-benchmark-runner-cli

Conversation

@ayhammouda

@ayhammouda ayhammouda commented Jun 8, 2026

Copy link
Copy Markdown
Owner

Refs #63
Fixes #72

Summary

  • Added uv run python -m benchmarks run with --corpus, --manifest, --out, --run-id, and --dry-run.
  • Writes a stable raw artifact layout under the requested run directory: snapshots/, environment.json, planned-cells.json, run-summary.json, and, for executed cells, transcripts/, tokens/, latency/, scoring/, plus failures/ for failed cells.
  • Supports local fake/no-MCP baseline execution only; unsupported adapters are recorded as cell failures rather than contacted.
  • Records tool failures, timeouts, and MCP protocol crashes as explicit errors with correctness score: 0.0, keeps failed queries in the correctness denominator, and includes tool_model_key plus error categories for future [v0.5.0] benchmark reporting — generate raw report and README-safe summary #74 error/timeout reporting by tool+model pairing.
  • Added tests for artifact paths, metadata capture, dry-run behavior, duplicate/missing corpus IDs, and competitor-cell failure recording.

Dry-run output

uv run python -m benchmarks run --corpus "$tmpdir/corpus.yml" --manifest "$tmpdir/competitors.yml" --out "$tmpdir/results/run-1" --run-id run-1 --dry-run
{
  "dry_run": true,
  "external_provider_calls": false,
  "failed_cells": 0,
  "planned_cells": 1,
  "run_id": "run-1",
  "succeeded_cells": 0
}

Dry-run artifact files written:

environment.json
planned-cells.json
run-summary.json
snapshots/competitor-manifest.yml
snapshots/corpus.yml

Verification

uv run ruff check src/ tests/
# All checks passed!

uv run pyright src/
# 0 errors, 0 warnings, 0 informations

uv run pytest --tb=short -q
# 315 passed in 21.68s

uv run python-docs-mcp-server doctor
# All checks passed.

uv run pytest tests/benchmarks/test_runner.py -q
# 8 passed in 0.17s

Additional sanity checks for the new top-level benchmark module:

uv run ruff check benchmarks/ tests/benchmarks/test_runner.py
# All checks passed!

uv run pyright benchmarks/
# 0 errors, 0 warnings, 0 informations

External provider calls

No external provider calls were made. The runner records external_provider_calls: false, and issue #72 only implements local fake/no-MCP baseline plumbing.

Why this triggered supervisor review

The diff is over the autonomous-agent pipeline's 500-line review threshold because issue #72 adds the runner module and focused tests together. No forbidden files were modified, no runtime network access was added, and no public MCP tool surface changed.

@ayhammouda ayhammouda added the supervisor-review Vision supervisor decision required before further automation label Jun 8, 2026
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

A new test module tests/benchmarks/test_runner.py adds 206 lines of end-to-end coverage for the benchmark runner. It provides helpers to generate test corpus and manifest data, then validates artifact creation, metadata capture, input validation, dry-run behavior, failure recording, and CLI integration.

Changes

Benchmark runner test coverage

Layer / File(s) Summary
Test helpers for corpus and manifest generation
tests/benchmarks/test_runner.py
Three module-level helpers write YAML corpus and manifest files with default questions and competitors for test setup.
Artifact output and environment metadata validation
tests/benchmarks/test_runner.py
One test verifies deterministic creation of output files including transcripts, tokens, latency, scoring, and run summaries. A second test asserts environment.json contains repo commit, non-unknown commit value, Python version prefix, and external_provider_calls metadata.
Corpus input validation
tests/benchmarks/test_runner.py
Two tests verify that run_benchmark raises BenchmarkValidationError with specific messages when corpus IDs are duplicated or when required id fields are missing.
Dry-run, failure handling, and CLI integration
tests/benchmarks/test_runner.py
A dry-run test asserts planning output is written while cell execution outputs are not produced. A failure test verifies error status and message are recorded in failure artifacts while external_provider_calls remains False. A CLI test invokes the benchmark runner via subprocess with --dry-run, asserts zero exit code, and validates JSON summary output.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A benchmark tester hops in with pride,
Eight tests now guard the runner's stride,
From corpus checks to CLI calls,
Each artifact path safely falls,
The dry run dances, failures are caught!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title '[v0.5.0] benchmark runner — add reproducible CLI and artifact layout' directly summarizes the main change: implementing a benchmark runner CLI with stable artifact layout.
Linked Issues check ✅ Passed The PR fully addresses all acceptance criteria from #72: artifact paths, metadata capture, dry-run behavior, duplicate/missing corpus ID validation, and failure recording for competitor cells are all covered by the new test module.
Out of Scope Changes check ✅ Passed All changes are scoped to the test module for the benchmark runner; no modifications to forbidden files, MCP tools, or existing tests are present.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch agent/72-benchmark-runner-cli

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbiteu

coderabbiteu Bot commented Jun 8, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

This PR adds a comprehensive pytest-based test suite for the benchmark runner. The module exercises the run_benchmark function and python -m benchmarks run CLI, validating artifact output paths, environment metadata capture, dry-run mode, input validation, failure recording, and end-to-end CLI behavior without invoking external providers.

Changes

Benchmark Runner Test Suite

Layer / File(s) Summary
Test utilities and fixture builders
tests/benchmarks/test_runner.py
Imports for filesystem, serialization, and subprocess operations; _write_yaml persists YAML content; _corpus and _manifest helpers construct minimal YAML fixtures for benchmark questions and competitors.
Artifact output and environment metadata verification
tests/benchmarks/test_runner.py
test_runner_writes_stable_artifact_paths validates deterministic output files and directories; test_environment_metadata_captures_repo_commit_and_python checks environment.json contains passed run_id, non-unknown repo_commit, Python version prefix, and external_provider_calls=False.
Dry-run execution mode
tests/benchmarks/test_runner.py
test_dry_run_writes_plan_without_cell_execution asserts dry_run=True marks the run as dry, reports correct planned/succeeded/failed cell counts, writes planned-cells.json, and does not create transcripts directory.
Input validation error handling
tests/benchmarks/test_runner.py
test_duplicate_corpus_ids_are_rejected and test_missing_corpus_id_is_rejected assert run_benchmark raises BenchmarkValidationError with specific messages for duplicate corpus IDs and missing required id fields.
Failure recording and transcripts
tests/benchmarks/test_runner.py
test_competitor_cell_failure_is_recorded forces a competitor cell failure and verifies failure and transcript JSON output record expected status, error message content, and external_provider_calls=False.
CLI dry-run end-to-end integration
tests/benchmarks/test_runner.py
test_cli_dry_run_outputs_summary invokes python -m benchmarks run --dry-run via subprocess, asserts zero exit code, validates JSON summary output to stdout with key fields, and checks run-summary.json is written.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A suite of tests, so fine and clean,
Benchmarks running, all scenes unseen,
Dry runs planned, metadata bright,
Failures caught in JSON's light,
The CLI hops—end-to-end in sight! 🏃✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The change set adds only benchmark tests and does not implement the CLI runner, artifact layout, or failure-scoring behavior required by #72. Implement the benchmark runner/CLI and artifact-generation logic first, then keep these tests as coverage for the required behaviors.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is concise and accurately summarizes the benchmark runner's reproducible CLI and artifact-layout work.
Out of Scope Changes check ✅ Passed No clear out-of-scope changes are evident; the added benchmark tests align with the requested runner coverage.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch agent/72-benchmark-runner-cli

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbiteu coderabbiteu Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/benchmarks/test_runner.py (1)

146-176: ⚡ Quick win

Add assertions for failed-cell scoring artifacts and fixed denominator.

This test validates failure/transcript persistence, but it doesn’t assert the scoring contract from the runner objectives (failed eligible cells scored 0.0 with fixed denominator). Add assertions against the scoring artifact and/or summary fields to lock this behavior and prevent regressions.

As per coding guidelines, tests/**: "Focus review on meaningful assertions, regression coverage, fixture correctness, deterministic behavior, and avoiding network- or environment-dependent tests unless the test is explicitly marked as an integration smoke test."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmarks/test_runner.py` around lines 146 - 176, The test
test_competitor_cell_failure_is_recorded currently checks failure/transcript
persistence but lacks assertions that failed eligible cells are scored 0.0 with
the fixed denominator; update the test to read the scoring artifact (e.g., under
out_dir / "scores" / "failing-baseline" / "q001.json") and assert the recorded
score for that cell is 0.0 and that the denominator used for scoring is the
expected fixed value (e.g., 1), and also assert summary contains the
corresponding scoring fields (for example summary["scores"] or
summary["denominator"] and summary["failed_cells"]) to lock the scoring contract
produced by run_benchmark / BenchmarkConfig.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/benchmarks/test_runner.py`:
- Around line 146-176: The test test_competitor_cell_failure_is_recorded
currently checks failure/transcript persistence but lacks assertions that failed
eligible cells are scored 0.0 with the fixed denominator; update the test to
read the scoring artifact (e.g., under out_dir / "scores" / "failing-baseline" /
"q001.json") and assert the recorded score for that cell is 0.0 and that the
denominator used for scoring is the expected fixed value (e.g., 1), and also
assert summary contains the corresponding scoring fields (for example
summary["scores"] or summary["denominator"] and summary["failed_cells"]) to lock
the scoring contract produced by run_benchmark / BenchmarkConfig.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2b2430c5-9c12-4d44-9828-5bd0144c8fd5

📥 Commits

Reviewing files that changed from the base of the PR and between 80a093c and 14f615f.

⛔ Files ignored due to path filters (3)
  • benchmarks/__init__.py is excluded by none and included by none
  • benchmarks/__main__.py is excluded by none and included by none
  • benchmarks/runner.py is excluded by none and included by none
📒 Files selected for processing (1)
  • tests/benchmarks/test_runner.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/benchmarks/test_runner.py`:
- Line 92: Replace the environment-dependent assertion that
metadata["repo_commit"] != "unknown" with a deterministic presence/type check:
ensure the "repo_commit" key exists in metadata and its value is a non-empty
string (e.g., assert "repo_commit" in metadata and
isinstance(metadata["repo_commit"], str) and metadata["repo_commit"].strip() !=
""), rather than assuming git is available; this touches the test in
tests/benchmarks/test_runner.py and relates to the _repo_commit_sha() behavior
so you can still allow the value "unknown" in environments without git.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: bcb56153-286b-4a24-adca-1249fd4574a2

📥 Commits

Reviewing files that changed from the base of the PR and between 80a093c and 14f615f.

⛔ Files ignored due to path filters (3)
  • benchmarks/__init__.py is excluded by none and included by none
  • benchmarks/__main__.py is excluded by none and included by none
  • benchmarks/runner.py is excluded by none and included by none
📒 Files selected for processing (1)
  • tests/benchmarks/test_runner.py

metadata = json.loads((out_dir / "environment.json").read_text(encoding="utf-8"))
assert metadata["run_id"] == "metadata"
assert metadata["repo_commit"]
assert metadata["repo_commit"] != "unknown"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoff

Remove or relax the environment-dependent assertion.

The assertion metadata["repo_commit"] != "unknown" will fail in environments without git (e.g., minimal CI containers, non-git test runs). Context snippet 4 shows that _repo_commit_sha() returns "unknown" when git is unavailable. This creates fragile, environment-dependent test behavior.

🛡️ Proposed fix to validate field presence without assuming git availability
     assert metadata["run_id"] == "metadata"
-    assert metadata["repo_commit"]
-    assert metadata["repo_commit"] != "unknown"
+    assert "repo_commit" in metadata
+    assert isinstance(metadata["repo_commit"], str)
     assert metadata["python_version"].startswith(f"{sys.version_info.major}.")

As per coding guidelines: "Focus review on ... deterministic behavior, and avoiding network- or environment-dependent tests unless the test is explicitly marked as an integration smoke test."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/benchmarks/test_runner.py` at line 92, Replace the
environment-dependent assertion that metadata["repo_commit"] != "unknown" with a
deterministic presence/type check: ensure the "repo_commit" key exists in
metadata and its value is a non-empty string (e.g., assert "repo_commit" in
metadata and isinstance(metadata["repo_commit"], str) and
metadata["repo_commit"].strip() != ""), rather than assuming git is available;
this touches the test in tests/benchmarks/test_runner.py and relates to the
_repo_commit_sha() behavior so you can still allow the value "unknown" in
environments without git.

Source: Coding guidelines

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

supervisor-review Vision supervisor decision required before further automation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[v0.5.0] benchmark runner — add reproducible CLI and artifact layout

1 participant