[v0.5.0] benchmark runner — add reproducible CLI and artifact layout#75
[v0.5.0] benchmark runner — add reproducible CLI and artifact layout#75ayhammouda wants to merge 2 commits into
Conversation
📝 WalkthroughWalkthroughA new test module ChangesBenchmark runner test coverage
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
📝 WalkthroughWalkthroughThis PR adds a comprehensive pytest-based test suite for the benchmark runner. The module exercises the ChangesBenchmark Runner Test Suite
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/benchmarks/test_runner.py (1)
146-176: ⚡ Quick winAdd assertions for failed-cell scoring artifacts and fixed denominator.
This test validates failure/transcript persistence, but it doesn’t assert the scoring contract from the runner objectives (failed eligible cells scored
0.0with fixed denominator). Add assertions against the scoring artifact and/or summary fields to lock this behavior and prevent regressions.As per coding guidelines,
tests/**: "Focus review on meaningful assertions, regression coverage, fixture correctness, deterministic behavior, and avoiding network- or environment-dependent tests unless the test is explicitly marked as an integration smoke test."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/benchmarks/test_runner.py` around lines 146 - 176, The test test_competitor_cell_failure_is_recorded currently checks failure/transcript persistence but lacks assertions that failed eligible cells are scored 0.0 with the fixed denominator; update the test to read the scoring artifact (e.g., under out_dir / "scores" / "failing-baseline" / "q001.json") and assert the recorded score for that cell is 0.0 and that the denominator used for scoring is the expected fixed value (e.g., 1), and also assert summary contains the corresponding scoring fields (for example summary["scores"] or summary["denominator"] and summary["failed_cells"]) to lock the scoring contract produced by run_benchmark / BenchmarkConfig.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/benchmarks/test_runner.py`:
- Around line 146-176: The test test_competitor_cell_failure_is_recorded
currently checks failure/transcript persistence but lacks assertions that failed
eligible cells are scored 0.0 with the fixed denominator; update the test to
read the scoring artifact (e.g., under out_dir / "scores" / "failing-baseline" /
"q001.json") and assert the recorded score for that cell is 0.0 and that the
denominator used for scoring is the expected fixed value (e.g., 1), and also
assert summary contains the corresponding scoring fields (for example
summary["scores"] or summary["denominator"] and summary["failed_cells"]) to lock
the scoring contract produced by run_benchmark / BenchmarkConfig.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2b2430c5-9c12-4d44-9828-5bd0144c8fd5
⛔ Files ignored due to path filters (3)
benchmarks/__init__.pyis excluded by none and included by nonebenchmarks/__main__.pyis excluded by none and included by nonebenchmarks/runner.pyis excluded by none and included by none
📒 Files selected for processing (1)
tests/benchmarks/test_runner.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/benchmarks/test_runner.py`:
- Line 92: Replace the environment-dependent assertion that
metadata["repo_commit"] != "unknown" with a deterministic presence/type check:
ensure the "repo_commit" key exists in metadata and its value is a non-empty
string (e.g., assert "repo_commit" in metadata and
isinstance(metadata["repo_commit"], str) and metadata["repo_commit"].strip() !=
""), rather than assuming git is available; this touches the test in
tests/benchmarks/test_runner.py and relates to the _repo_commit_sha() behavior
so you can still allow the value "unknown" in environments without git.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: bcb56153-286b-4a24-adca-1249fd4574a2
⛔ Files ignored due to path filters (3)
benchmarks/__init__.pyis excluded by none and included by nonebenchmarks/__main__.pyis excluded by none and included by nonebenchmarks/runner.pyis excluded by none and included by none
📒 Files selected for processing (1)
tests/benchmarks/test_runner.py
| metadata = json.loads((out_dir / "environment.json").read_text(encoding="utf-8")) | ||
| assert metadata["run_id"] == "metadata" | ||
| assert metadata["repo_commit"] | ||
| assert metadata["repo_commit"] != "unknown" |
There was a problem hiding this comment.
Remove or relax the environment-dependent assertion.
The assertion metadata["repo_commit"] != "unknown" will fail in environments without git (e.g., minimal CI containers, non-git test runs). Context snippet 4 shows that _repo_commit_sha() returns "unknown" when git is unavailable. This creates fragile, environment-dependent test behavior.
🛡️ Proposed fix to validate field presence without assuming git availability
assert metadata["run_id"] == "metadata"
- assert metadata["repo_commit"]
- assert metadata["repo_commit"] != "unknown"
+ assert "repo_commit" in metadata
+ assert isinstance(metadata["repo_commit"], str)
assert metadata["python_version"].startswith(f"{sys.version_info.major}.")As per coding guidelines: "Focus review on ... deterministic behavior, and avoiding network- or environment-dependent tests unless the test is explicitly marked as an integration smoke test."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/benchmarks/test_runner.py` at line 92, Replace the
environment-dependent assertion that metadata["repo_commit"] != "unknown" with a
deterministic presence/type check: ensure the "repo_commit" key exists in
metadata and its value is a non-empty string (e.g., assert "repo_commit" in
metadata and isinstance(metadata["repo_commit"], str) and
metadata["repo_commit"].strip() != ""), rather than assuming git is available;
this touches the test in tests/benchmarks/test_runner.py and relates to the
_repo_commit_sha() behavior so you can still allow the value "unknown" in
environments without git.
Source: Coding guidelines
Refs #63
Fixes #72
Summary
uv run python -m benchmarks runwith--corpus,--manifest,--out,--run-id, and--dry-run.snapshots/,environment.json,planned-cells.json,run-summary.json, and, for executed cells,transcripts/,tokens/,latency/,scoring/, plusfailures/for failed cells.score: 0.0, keeps failed queries in the correctness denominator, and includestool_model_keyplus error categories for future [v0.5.0] benchmark reporting — generate raw report and README-safe summary #74 error/timeout reporting by tool+model pairing.Dry-run output
{ "dry_run": true, "external_provider_calls": false, "failed_cells": 0, "planned_cells": 1, "run_id": "run-1", "succeeded_cells": 0 }Dry-run artifact files written:
Verification
Additional sanity checks for the new top-level benchmark module:
External provider calls
No external provider calls were made. The runner records
external_provider_calls: false, and issue #72 only implements local fake/no-MCP baseline plumbing.Why this triggered supervisor review
The diff is over the autonomous-agent pipeline's 500-line review threshold because issue #72 adds the runner module and focused tests together. No forbidden files were modified, no runtime network access was added, and no public MCP tool surface changed.