feat(benchmark): pass^k reliability metric + LLM-judge stub #62
Merged
Four publication-blocking changes to the benchmark harness:

- Retry transient 429/5xx and URLError on live provider HTTP with bounded backoff + jitter; honors Retry-After. Distinguishes transient from terminal errors so a single hiccup no longer leaves a failed cell in the published heatmap.
- Add --runs N: per-run output/cache subdirs (run-1/, run-2/, ...), median + min/max aggregation, full per-run history preserved. Cached pass is forced off for N>1 since it composes poorly with repetition.
- Floor benchmark_score to 0 when records is empty instead of letting cleanliness/source_fidelity/freshness vacuously return 100 and arithmetic-averaging to ~50. Zero records is a routing gap.
- Article generator discloses methodology: heuristic-weight caveat, boilerplate substring-sniff caveat, freshness presence-test caveat, N-run repetition note, and a new Workload Disclosure table showing per-workflow page counts so readers can see comparison terms.

16 benchmark tests pass (12 existing + 4 new for retry, retry-exhaust, empty-pack floor, and N=3 aggregation). Full 476-test suite clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
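To illustrate the N-run aggregation and the empty-records floor described above, here is a minimal sketch. The function names `aggregate_runs` and `floored_score`, the dict layout, and the plain arithmetic mean are assumptions made for illustration; they are not taken from `benchmark.py`.

```python
from statistics import median


def aggregate_runs(scores: list[float]) -> dict[str, float]:
    """Summarize one cell's per-run scores as median plus min/max range.

    Hypothetical helper: the real harness may store and name these differently.
    """
    if not scores:
        raise ValueError("need at least one run to aggregate")
    return {"median": median(scores), "min": min(scores), "max": max(scores)}


def floored_score(records: list[dict], sub_scores: dict[str, float]) -> float:
    """Average the sub-scores, but return 0 when no records were collected.

    Without the floor, dimensions that vacuously score 100 on an empty pack
    would pull the average up even though nothing was fetched.
    """
    if not records:
        return 0.0
    return sum(sub_scores.values()) / len(sub_scores)


# Three runs of one cell (made-up numbers).
summary = aggregate_runs([80.0, 95.0, 90.0])

# An empty pack: two dimensions vacuously 100, two at 0 (made-up numbers).
empty = floored_score([], {"cleanliness": 100.0, "freshness": 100.0,
                           "coverage": 0.0, "depth": 0.0})
```

With these inputs the unfloored average would be 50, which is the misleading value the change removes.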
- _workload_disclosure_lines: annotate med_text/range_text as str so mypy stops inferring int from the populated branch (#61 CI).
- _retry_delay_seconds: # nosec B311 on the jitter random.uniform call; backoff jitter is not security-sensitive.
- ruff format pass on the file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
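For readers unfamiliar with the pattern, a bounded backoff with jitter that honors Retry-After might look roughly like the sketch below. This is not the PR's `_retry_delay_seconds`; the signature, the base and cap values, the 25% jitter window, and the `is_transient` helper are all assumptions.

```python
import random
from typing import Optional


def is_transient(status: int) -> bool:
    """Treat 429 and any 5xx as retryable; everything else is terminal."""
    return status == 429 or 500 <= status < 600


def retry_delay_seconds(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 0.5,
    cap: float = 30.0,
) -> float:
    """Return how long to sleep before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Only the delta-seconds form of Retry-After is handled here;
            # the HTTP-date form falls through to computed backoff.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass
    delay = base * (2 ** attempt)
    # Jitter spreads out simultaneous retries; it is not security-sensitive.
    delay += random.uniform(0, delay * 0.25)  # nosec B311
    return min(cap, delay)
```

Capping after adding jitter keeps the worst-case sleep at `cap`, so the total retry budget stays bounded.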
pass^k (worst-of-k-trials clears the bar) is now computed alongside the existing median and rendered in benchmark.summary.md. The N3 publication report's headline median hid a 33-point reliability gap between providers at the 90 threshold; pass^3 surfaces it (e.g. docpull 88% vs. tavily 50%).

The LLM-judge module is an advisory stub: separate from benchmark_score, key-gated, with a four-dimension rubric (coverage / groundedness / source_authority / synthesis_readiness) per the Anthropic eval guidance. Not wired into the report until calibrated against a hand-graded set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
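As a rough illustration of the worst-of-k idea, the sketch below counts a cell as passing only if its lowest run meets the threshold, then reports the passing fraction. The function name `pass_k`, the dict-of-lists input, the inclusive `>=` comparison, and the sample scores are assumptions, not the code in `tests/test_passk.py` or the real run data.

```python
from statistics import median


def pass_k(runs_by_cell: dict[str, list[float]], threshold: float) -> float:
    """Fraction of cells whose worst run still meets the threshold."""
    if not runs_by_cell:
        return 0.0
    passed = sum(
        1
        for scores in runs_by_cell.values()
        if scores and min(scores) >= threshold
    )
    return passed / len(runs_by_cell)


def median_pass(runs_by_cell: dict[str, list[float]], threshold: float) -> float:
    """Same fraction, but judged on each cell's median run instead."""
    if not runs_by_cell:
        return 0.0
    passed = sum(
        1
        for scores in runs_by_cell.values()
        if scores and median(scores) >= threshold
    )
    return passed / len(runs_by_cell)


# Made-up scores for four workflows, three runs each.
cells = {
    "workflow-a": [95.0, 92.0, 91.0],
    "workflow-b": [96.0, 89.0, 94.0],  # median clears 90, worst run does not
    "workflow-c": [90.0, 90.0, 90.0],
    "workflow-d": [99.0, 97.0, 93.0],
}
strict = pass_k(cells, 90.0)
lenient = median_pass(cells, 90.0)
```

Here `workflow-b` passes on its median but fails pass^3, which is the kind of gap a median-only headline can hide.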
CI runs `ruff format --check` (stricter than `ruff check`) and bandit separately; my local sweep used only `ruff check`. Apply formatting and switch the urlopen suppression from `# noqa: S310` (ruff prefix) to `# nosec B310` (bandit's directive, matching the existing pattern at benchmark.py:2491).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
- pass^k is computed alongside the existing median and rendered in `benchmark.summary.md`. Stricter framing than median for the publication: "users expect reliable behavior every time" per the Anthropic "Demystifying evals for AI agents" post.
- The LLM-judge stub (`docpull.judge`) lands as an advisory module: separate from `benchmark_score`, key-gated (skips with a structured reason when `ANTHROPIC_API_KEY` is unset), with a four-dimension rubric (coverage / groundedness / source_authority / synthesis_readiness). Not wired into the report until calibrated against a hand-graded set.
- Fix CI: mypy assignment, ruff format, bandit B311 (already pushed).

Why this hardens the N3 publication numbers

On the existing `n3-rerun-publication` run, the headline median hid a 33-point reliability gap at the 90 threshold that pass^3 surfaces:

[Table covering `pack_score` and `benchmark_score`; values not preserved.]

Per-provider on `benchmark_score`@90: docpull 88%, exa 75%, parallel 57%, tavily 50%.

Test plan

- `pytest tests/test_passk.py tests/test_judge.py`: 18 passed
- `ruff check` + `mypy` clean on new files and `benchmark.py`.
- `bench/runs/n3-rerun-publication/benchmark.summary.md`: verify the Reliability section renders

🤖 Generated with Claude Code