Make publication benchmark numbers harder to dismiss by admin-raintree · Pull Request #61 · raintree-technology/docpull

admin-raintree · 2026-06-09T22:37:10Z

Summary

Four publication-blocking changes to the benchmark harness, in preparation for the N=3 republish of the DocPull provider matrix article.

- Retry transient 429/5xx and URLError on live provider HTTP with bounded backoff + jitter; honors Retry-After. A single hiccup no longer leaves a failed cell in the published heatmap.
- --runs N with per-run output/cache subdirs (run-1/, run-2/, ...), median + min/max aggregation across runs, and full per-run history preserved under runs[]. Cached pass is forced off for N>1.
- Empty-pack floor: benchmark_score returns 0 when records is empty, instead of letting cleanliness/source_fidelity/freshness vacuously return 100 and arithmetic-averaging to ~50. Real example: docpull_docs/parallel-search went from 50 → 0.
- Article methodology disclosure: heuristic-weight caveat, boilerplate substring-sniff caveat, freshness presence-test caveat, N-run repetition note, and a new Workload Disclosure table showing per-workflow page counts so readers can see comparison terms.
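To make the first item concrete, here is a minimal sketch of a bounded backoff delay with jitter that honors a numeric Retry-After header. It is not the harness's actual code: the constants, the function names `is_transient` and `retry_delay_seconds_sketch`, and the choice of "half to full ceiling" jitter are all assumptions made for illustration, and the HTTP-date form of Retry-After is deliberately left unparsed.

```python
import random
from typing import Optional

# Hypothetical constants; the PR does not state the real values.
BASE_DELAY_S = 0.5
MAX_DELAY_S = 8.0


def is_transient(status: int) -> bool:
    """Assumed rule: 429 and any 5xx are retryable, everything else is terminal."""
    return status == 429 or 500 <= status <= 599


def retry_delay_seconds_sketch(
    attempt: int,
    retry_after: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> float:
    """Delay before retry number `attempt` (0-based), never above MAX_DELAY_S."""
    if retry_after is not None:
        try:
            # Honor a numeric Retry-After value, still bounded by the cap.
            return min(max(float(retry_after), 0.0), MAX_DELAY_S)
        except ValueError:
            pass  # Not a number (e.g. an HTTP-date); fall back to backoff.
    rng = rng if rng is not None else random.Random()
    # Exponential growth, bounded so a long outage cannot stall the run.
    ceiling = min(BASE_DELAY_S * (2 ** attempt), MAX_DELAY_S)
    # Jitter keeps concurrent workers from retrying in lockstep.
    return rng.uniform(ceiling / 2, ceiling)
```

A caller would loop over a fixed number of attempts, sleep for the returned delay only when `is_transient` is true, and re-raise immediately on terminal errors so that genuine failures still show up in the heatmap.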

Test plan

- 16 benchmark tests pass (12 existing + 4 new for retry-then-succeed, retry-exhaust, empty-pack floor, N=3 aggregation)
- Full 476-test suite clean
- Ruff lint clean
- End-to-end smoke run produced .bench/runs/n3-rerun-publication/ with valid JSON + markdown + article artifacts ($0.414 actual spend, 130 Raindrop signals)
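The empty-pack floor exercised by the new test can be illustrated with a small sketch. The two component functions below are simplified stand-ins (the real scorer combines more components, and its weights and signatures are not shown in this PR); `coverage_sketch` and the `floor_empty` switch are hypothetical and exist only to show how a vacuous 100 can average out to 50.

```python
from statistics import mean
from typing import Sequence


def cleanliness_sketch(records: Sequence[dict]) -> float:
    """Share of records without boilerplate (0-100); vacuously 100 when empty."""
    if not records:
        return 100.0
    clean = sum(1 for r in records if not r.get("boilerplate", False))
    return 100.0 * clean / len(records)


def coverage_sketch(records: Sequence[dict], expected: int) -> float:
    """Share of expected pages that were retrieved (0-100)."""
    if expected <= 0:
        return 0.0
    return 100.0 * min(len(records), expected) / expected


def benchmark_score_sketch(
    records: Sequence[dict], expected: int, floor_empty: bool = True
) -> float:
    """Unweighted mean of the components, floored to 0 for an empty pack."""
    if floor_empty and not records:
        # Zero records means nothing was retrieved, so no partial credit.
        return 0.0
    return mean([cleanliness_sketch(records), coverage_sketch(records, expected)])
```

With `floor_empty=False`, an empty pack scores the mean of one vacuous 100 and one 0, which is the misleading mid-range result the floor removes.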

Four publication-blocking changes to the benchmark harness:

- Retry transient 429/5xx and URLError on live provider HTTP with bounded backoff + jitter; honors Retry-After. Distinguishes transient from terminal errors so a single hiccup no longer leaves a failed cell in the published heatmap.
- Add --runs N: per-run output/cache subdirs (run-1/, run-2/, ...), median + min/max aggregation, full per-run history preserved. Cached pass is forced off for N>1 since it composes poorly with repetition.
- Floor benchmark_score to 0 when records is empty instead of letting cleanliness/source_fidelity/freshness vacuously return 100 and arithmetic-averaging to ~50. Zero records is a routing gap.
- Article generator discloses methodology: heuristic-weight caveat, boilerplate substring-sniff caveat, freshness presence-test caveat, N-run repetition note, and a new Workload Disclosure table showing per-workflow page counts so readers can see comparison terms.

16 benchmark tests pass (12 existing + 4 new for retry, retry-exhaust, empty-pack floor, and N=3 aggregation). Full 476-test suite clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
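The `--runs N` aggregation described in this commit could look roughly like the sketch below. The input shape (one dict of cell name to score per run), the output keys, and the helper names `run_dir` and `aggregate_runs` are assumptions for illustration; only the `run-1/`, `run-2/` directory naming and the median + min/max + per-run history idea come from the PR text.

```python
from pathlib import Path
from statistics import median
from typing import Dict, List


def run_dir(base: Path, run_index: int) -> Path:
    """Per-run subdirectory, 1-based: base/run-1, base/run-2, ..."""
    return base / f"run-{run_index}"


def aggregate_runs(run_scores: List[Dict[str, float]]) -> Dict[str, dict]:
    """Collapse N runs into median/min/max per cell, keeping every raw value."""
    cells = sorted({cell for run in run_scores for cell in run})
    summary: Dict[str, dict] = {}
    for cell in cells:
        # A cell missing from a run is skipped rather than counted as 0.
        values = [run[cell] for run in run_scores if cell in run]
        summary[cell] = {
            "median": median(values),
            "min": min(values),
            "max": max(values),
            "runs": values,  # full per-run history, in run order
        }
    return summary
```

Reporting min and max next to the median lets a reader see how much a cell moved between runs instead of trusting a single number.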

vercel · 2026-06-09T22:37:16Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
docpull	Ready	Preview, Comment	Jun 9, 2026 10:44pm

- _workload_disclosure_lines: annotate med_text/range_text as str so mypy stops inferring int from the populated branch (#61 CI).
- _retry_delay_seconds: # nosec B311 on the jitter random.uniform call; backoff jitter is not security-sensitive.
- ruff format pass on the file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Make publication benchmark numbers harder to dismiss

* Fix CI: mypy assignment, ruff format, bandit B311

* feat(benchmark): add pass^k reliability metric + LLM-judge stub

  pass^k (the worst of k trials clears the bar) is now computed alongside the existing median and rendered in benchmark.summary.md. The N3 publication report's headline median hid a 33-point reliability gap between providers at the 90 threshold; pass^3 surfaces it (e.g. docpull 88% vs. tavily 50%).

  The LLM-judge module is an advisory stub: separate from benchmark_score, key-gated, with a four-dimension rubric (coverage / groundedness / source_authority / synthesis_readiness) per the Anthropic eval guidance. Not wired into the report until calibrated against a hand-graded set.

* chore: ruff format + bandit B310 suppression on judge urlopen

  CI runs `ruff format --check` (stricter than `ruff check`) and bandit separately; my local sweep used only `ruff check`. Apply formatting and switch the urlopen suppression from `# noqa: S310` (ruff prefix) to `# nosec B310` (bandit's directive, matching the existing pattern at benchmark.py:2491).

---------

Co-authored-by: admin-raintree <277948009+admin-raintree@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
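As a rough illustration of the pass^k idea, the sketch below treats a cell as passing only when its worst trial clears the threshold, then reports the fraction of cells that pass. Reading the quoted percentages as a per-provider fraction of cells is an assumption, as are the function names and the input shape; the harness's own computation is not shown in this PR.

```python
from statistics import median
from typing import Dict, List


def passes_all_k(trial_scores: List[float], threshold: float) -> bool:
    """True only when the worst of the k trials clears the threshold."""
    return bool(trial_scores) and min(trial_scores) >= threshold


def pass_hat_k_rate(cells: Dict[str, List[float]], threshold: float = 90.0) -> float:
    """Fraction of cells in which every trial clears the threshold."""
    if not cells:
        return 0.0
    passed = sum(1 for scores in cells.values() if passes_all_k(scores, threshold))
    return passed / len(cells)


# Made-up scores: both cells have a median of at least 90,
# but only the first clears the bar on all three trials.
example = {
    "workflow_x": [95.0, 92.0, 91.0],
    "workflow_y": [96.0, 89.0, 94.0],
}
medians_clear = all(median(scores) >= 90.0 for scores in example.values())
rate = pass_hat_k_rate(example)
```

In this made-up example the median view says both cells clear 90, while pass^3 reports that only half of them do so reliably, which is the kind of gap a headline median can hide.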

admin-raintree enabled auto-merge (squash) June 9, 2026 22:37

vercel Bot deployed to Preview June 9, 2026 22:37 View deployment

vercel Bot deployed to Preview June 9, 2026 22:44 View deployment

admin-raintree merged commit 30c63c3 into main Jun 9, 2026
17 checks passed

admin-raintree deleted the feat/n3-publication-rerun branch June 9, 2026 22:45

admin-raintree merged 2 commits into main from feat/n3-publication-rerun
