Make publication benchmark numbers harder to dismiss #61
Merged
Four publication-blocking changes to the benchmark harness:

- Retry transient 429/5xx and URLError on live provider HTTP with bounded backoff + jitter; honors Retry-After. Distinguishes transient from terminal errors so a single hiccup no longer leaves a failed cell in the published heatmap.
- Add --runs N: per-run output/cache subdirs (run-1/, run-2/, ...), median + min/max aggregation, full per-run history preserved. Cached pass is forced off for N>1 since it composes poorly with repetition.
- Floor benchmark_score to 0 when records is empty instead of letting cleanliness/source_fidelity/freshness vacuously return 100 and arithmetic-averaging to ~50. Zero records is a routing gap.
- Article generator discloses methodology: heuristic-weight caveat, boilerplate substring-sniff caveat, freshness presence-test caveat, N-run repetition note, and a new Workload Disclosure table showing per-workflow page counts so readers can see comparison terms.

16 benchmark tests pass (12 existing + 4 new for retry, retry-exhaust, empty-pack floor, and N=3 aggregation). Full 476-test suite clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
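The retry behavior described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names, the base delay, the cap, the 25% jitter fraction, and the attempt limit are all assumptions chosen for the example, and only the numeric-seconds form of Retry-After is handled.

```python
import random
import time
import urllib.error
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def is_transient(exc: Exception) -> bool:
    """429 and 5xx responses, plus plain URLError, are treated as retryable."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code == 429 or 500 <= exc.code <= 599
    return isinstance(exc, urllib.error.URLError)


def retry_delay_seconds(attempt: int, retry_after: Optional[str] = None,
                        base: float = 1.0, cap: float = 30.0) -> float:
    """Delay before the next attempt (attempt is 0-based), never above cap."""
    if retry_after is not None:
        try:
            # Honor the server's hint, but keep it bounded.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    backoff = base * (2 ** attempt)
    jitter = random.uniform(0.0, backoff * 0.25)  # hypothetical jitter fraction
    return min(cap, backoff + jitter)


def call_with_retry(fn: Callable[[], T], max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying transient failures; terminal errors propagate at once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except urllib.error.URLError as exc:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not is_transient(exc):
                raise
            retry_after = None
            if isinstance(exc, urllib.error.HTTPError) and exc.headers is not None:
                retry_after = exc.headers.get("Retry-After")
            sleep(retry_delay_seconds(attempt, retry_after))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the loop testable without real waiting, which is one plausible way to cover the retry and retry-exhaust cases mentioned above.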
- _workload_disclosure_lines: annotate med_text/range_text as str so mypy stops inferring int from the populated branch (#61 CI).
- _retry_delay_seconds: # nosec B311 on the jitter random.uniform call; backoff jitter is not security-sensitive.
- ruff format pass on the file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
admin-raintree added a commit that referenced this pull request on Jun 9, 2026
* Make publication benchmark numbers harder to dismiss

Four publication-blocking changes to the benchmark harness:

- Retry transient 429/5xx and URLError on live provider HTTP with bounded backoff + jitter; honors Retry-After. Distinguishes transient from terminal errors so a single hiccup no longer leaves a failed cell in the published heatmap.
- Add --runs N: per-run output/cache subdirs (run-1/, run-2/, ...), median + min/max aggregation, full per-run history preserved. Cached pass is forced off for N>1 since it composes poorly with repetition.
- Floor benchmark_score to 0 when records is empty instead of letting cleanliness/source_fidelity/freshness vacuously return 100 and arithmetic-averaging to ~50. Zero records is a routing gap.
- Article generator discloses methodology: heuristic-weight caveat, boilerplate substring-sniff caveat, freshness presence-test caveat, N-run repetition note, and a new Workload Disclosure table showing per-workflow page counts so readers can see comparison terms.

16 benchmark tests pass (12 existing + 4 new for retry, retry-exhaust, empty-pack floor, and N=3 aggregation). Full 476-test suite clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Fix CI: mypy assignment, ruff format, bandit B311

- _workload_disclosure_lines: annotate med_text/range_text as str so mypy stops inferring int from the populated branch (#61 CI).
- _retry_delay_seconds: # nosec B311 on the jitter random.uniform call; backoff jitter is not security-sensitive.
- ruff format pass on the file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(benchmark): add pass^k reliability metric + LLM-judge stub

pass^k (worst-of-k-trials clears the bar) is now computed alongside the existing median and rendered in benchmark.summary.md. The N3 publication report's headline median hid a 33-point reliability gap between providers at the 90 threshold; pass^3 surfaces it (e.g. docpull 88% vs. tavily 50%).

The LLM-judge module is an advisory stub: separate from benchmark_score, key-gated, with a four-dimension rubric (coverage / groundedness / source_authority / synthesis_readiness) per the Anthropic eval guidance. Not wired into the report until calibrated against a hand-graded set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: ruff format + bandit B310 suppression on judge urlopen

CI runs `ruff format --check` (stricter than `ruff check`) and bandit separately; my local sweep used only `ruff check`. Apply formatting and switch the urlopen suppression from `# noqa: S310` (ruff prefix) to `# nosec B310` (bandit's directive, matching the existing pattern at benchmark.py:2491).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: admin-raintree <277948009+admin-raintree@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
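To illustrate why pass^k can diverge from the median, here is a small sketch under stated assumptions: a "cell" is taken to be one provider/workflow pair holding its k per-run scores, and a provider's pass^k is read as the share of its cells whose worst run clears the threshold. The function names and the sample scores are hypothetical, not taken from the harness or the report.

```python
from statistics import median
from typing import Dict, List


def aggregate_runs(scores: List[float]) -> Dict[str, float]:
    """Median plus min/max across the k runs of one cell."""
    return {"median": median(scores), "min": min(scores), "max": max(scores)}


def pass_power_k(cells: Dict[str, List[float]], threshold: float = 90.0) -> float:
    """Fraction of cells whose worst run still clears the threshold."""
    if not cells:
        return 0.0
    passed = sum(1 for runs in cells.values() if runs and min(runs) >= threshold)
    return passed / len(cells)


# Hypothetical per-workflow scores for one provider over three runs.
provider_cells = {
    "workflow-a": [95, 92, 91],  # every run clears 90
    "workflow-b": [96, 94, 70],  # median 94 looks fine, but one run fails
}

summary = {name: aggregate_runs(runs) for name, runs in provider_cells.items()}
reliability = pass_power_k(provider_cells, threshold=90.0)
```

In this made-up data both medians clear 90, yet only one of the two cells passes on every run, so the reliability figure is 0.5. That is the kind of gap a headline median alone would hide.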
Summary
Four publication-blocking changes to the benchmark harness — prep for the N=3 republish of the DocPull provider matrix article.
- Retry transient 429/5xx and URLError on live provider HTTP with bounded backoff + jitter; honors `Retry-After`. A single hiccup no longer leaves a failed cell in the published heatmap.
- `--runs N` with per-run output/cache subdirs (`run-1/`, `run-2/`, ...), median + min/max aggregation across runs, full per-run history preserved under `runs[]`. Cached pass is forced off for N>1.
- `benchmark_score` returns 0 when records is empty, instead of letting cleanliness/source_fidelity/freshness vacuously return 100 and arithmetic-averaging to ~50. Real example: `docpull_docs/parallel-search` went from 50 → 0.

Test plan
- `.bench/runs/n3-rerun-publication/` with valid JSON + markdown + article artifacts ($0.414 actual spend, 130 Raindrop signals)
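The empty-records floor from the Summary can be illustrated with a short sketch. The component set and the equal weighting below are assumptions made only to reproduce the ~50 arithmetic: the three metrics named in the PR sit at a vacuous 100, and three hypothetical record-dependent components (`coverage`, `recall`, `relevance`) sit at 0. The real harness uses heuristic weights that are not shown in this PR.

```python
from statistics import mean
from typing import Dict, List


def unfloored_score(components: Dict[str, float]) -> float:
    """Plain arithmetic mean of component scores (equal weights assumed)."""
    return mean(components.values())


def benchmark_score_sketch(records: List[dict], components: Dict[str, float]) -> float:
    """Same mean, but zero records is scored as 0 rather than averaged."""
    if not records:
        # No records means nothing was retrieved, so metrics that "pass"
        # vacuously should not be allowed to lift the score.
        return 0.0
    return unfloored_score(components)


# With no records, per-record checks have nothing to fail on and report 100.
empty_pack_components = {
    "cleanliness": 100.0,
    "source_fidelity": 100.0,
    "freshness": 100.0,
    "coverage": 0.0,    # hypothetical component
    "recall": 0.0,      # hypothetical component
    "relevance": 0.0,   # hypothetical component
}

before = unfloored_score(empty_pack_components)
after = benchmark_score_sketch([], empty_pack_components)
```

Under these assumptions `before` comes out at 50 and `after` at 0, matching the direction of the `docpull_docs/parallel-search` change reported above.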