Make publication benchmark numbers harder to dismiss #61

Merged: admin-raintree merged 2 commits into main from feat/n3-publication-rerun on Jun 9, 2026
@admin-raintree

Summary

Four publication-blocking changes to the benchmark harness, in preparation for the N=3 republish of the DocPull provider matrix article.

  • Retry transient 429/5xx and URLError on live provider HTTP with bounded backoff + jitter; honors Retry-After. A single hiccup no longer leaves a failed cell in the published heatmap.
  • --runs N with per-run output/cache subdirs (run-1/, run-2/, ...), median + min/max aggregation across runs, full per-run history preserved under runs[]. Cached pass is forced off for N>1.
  • Empty-pack floor: benchmark_score returns 0 when records is empty. Previously cleanliness/source_fidelity/freshness vacuously returned 100 on an empty pack, which pulled the arithmetic mean up to ~50. Real example: docpull_docs/parallel-search went from 50 → 0.
  • Article methodology disclosure: heuristic-weight caveat, boilerplate substring-sniff caveat, freshness presence-test caveat, N-run repetition note, and a new Workload Disclosure table showing per-workflow page counts so readers can see comparison terms.
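
For readers unfamiliar with the retry pattern in the first bullet, here is a minimal sketch of bounded backoff with jitter that honors `Retry-After`. It is not the harness code: the names `is_transient`, `retry_delay_seconds`, and `fetch_with_retry`, the base/cap values, and the attempt count are all assumptions chosen for illustration, and only the delta-seconds form of `Retry-After` is handled.

```python
import random
import time
import urllib.error
import urllib.request


def is_transient(exc: BaseException) -> bool:
    """True for errors worth retrying: HTTP 429/5xx or a network-level URLError."""
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code == 429 or 500 <= exc.code < 600
    return isinstance(exc, urllib.error.URLError)


def retry_delay_seconds(attempt, retry_after=None, base=0.5, cap=30.0):
    """Delay before the next attempt; base and cap are illustrative values."""
    if retry_after is not None:
        try:
            # Honor the server's hint, but never wait longer than the cap.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # HTTP-date form is not parsed in this sketch; fall back to backoff.
    backoff = min(cap, base * (2 ** attempt))
    # Jitter spreads retries out; it is not security-sensitive.
    return random.uniform(backoff / 2, backoff)  # nosec B311


def fetch_with_retry(url, max_attempts=4, opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying transient failures and re-raising terminal ones at once."""
    for attempt in range(max_attempts):
        try:
            with opener(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
            if not is_transient(exc) or attempt == max_attempts - 1:
                raise
            headers = getattr(exc, "headers", None)
            hint = headers.get("Retry-After") if headers is not None else None
            sleep(retry_delay_seconds(attempt, hint))
    raise RuntimeError("max_attempts must be at least 1")
```

Injecting `opener` and `sleep` keeps the retry-then-succeed and retry-exhaust paths testable without touching the network or waiting on real delays.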

Test plan

  • 16 benchmark tests pass (12 existing + 4 new for retry-then-succeed, retry-exhaust, empty-pack floor, N=3 aggregation)
  • Full 476-test suite clean
  • Ruff lint clean
  • End-to-end smoke run produced .bench/runs/n3-rerun-publication/ with valid JSON + markdown + article artifacts ($0.414 actual spend, 130 Raindrop signals)
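
The empty-pack floor test can be illustrated with a toy scorer. This is a sketch under stated assumptions, not the real `benchmark_score`: it uses only two equally weighted dimensions, and `coverage` and the `boilerplate` record field are hypothetical stand-ins. It shows how an `all()`-style check is vacuously perfect on an empty list and drags the mean up to 50.

```python
from statistics import fmean


def cleanliness(records):
    # all() over an empty list is True, so an empty pack looks perfectly clean.
    return 100.0 if all(not r.get("boilerplate") for r in records) else 0.0


def coverage(records, expected):
    # Hypothetical count-based dimension: an empty pack correctly scores 0 here.
    return 100.0 * min(len(records), expected) / expected


def naive_score(records, expected=10):
    # Old behaviour in miniature: a plain mean over the dimension scores.
    return fmean([cleanliness(records), coverage(records, expected)])


def benchmark_score(records, expected=10):
    # Floor: zero records is a routing gap, not a half-good result.
    if not records:
        return 0.0
    return naive_score(records, expected)


print(naive_score([]))      # vacuous 100 averaged with 0
print(benchmark_score([]))  # floored
```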

Four publication-blocking changes to the benchmark harness:

- Retry transient 429/5xx and URLError on live provider HTTP with
  bounded backoff + jitter; honors Retry-After. Distinguishes transient
  from terminal errors so a single hiccup no longer leaves a failed
  cell in the published heatmap.
- Add --runs N: per-run output/cache subdirs (run-1/, run-2/, ...),
  median + min/max aggregation, full per-run history preserved. Cached
  pass is forced off for N>1 since it composes poorly with repetition.
- Floor benchmark_score to 0 when records is empty instead of letting
  cleanliness/source_fidelity/freshness vacuously return 100 and
  arithmetic-averaging to ~50. Zero records is a routing gap.
- Article generator discloses methodology: heuristic-weight caveat,
  boilerplate substring-sniff caveat, freshness presence-test caveat,
  N-run repetition note, and a new Workload Disclosure table showing
  per-workflow page counts so readers can see comparison terms.
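
A rough sketch of the --runs N aggregation described above, assuming each run yields a mapping from a cell key to a score. The function names, the cell-key format, and the output field names other than runs are hypothetical; the real harness layout may differ.

```python
from pathlib import Path
from statistics import median


def run_dirs(root, n):
    """Per-run subdirectories run-1/, run-2/, ... under root (paths only, not created)."""
    return [Path(root) / f"run-{i}" for i in range(1, n + 1)]


def aggregate_runs(run_scores):
    """Collapse per-run scores into median/min/max while keeping the full history.

    run_scores is a list with one dict per run, mapping a cell key to its score.
    """
    cells = sorted({cell for run in run_scores for cell in run})
    summary = {}
    for cell in cells:
        values = [run[cell] for run in run_scores if cell in run]
        summary[cell] = {
            "median": median(values),
            "min": min(values),
            "max": max(values),
            "runs": values,  # per-run history preserved in run order
        }
    return summary
```

Reporting min and max next to the median lets a reader see the spread across runs instead of trusting a single point estimate.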

16 benchmark tests pass (12 existing + 4 new for retry, retry-exhaust,
empty-pack floor, and N=3 aggregation). Full 476-test suite clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@admin-raintree admin-raintree enabled auto-merge (squash) June 9, 2026 22:37
@vercel

vercel Bot commented Jun 9, 2026

Vercel deployment status: project docpull, deployment Ready (updated Jun 9, 2026 10:44pm UTC).

- _workload_disclosure_lines: annotate med_text/range_text as str so
  mypy stops inferring int from the populated branch (#61 CI).
- _retry_delay_seconds: # nosec B311 on the jitter random.uniform call;
  backoff jitter is not security-sensitive.
- ruff format pass on the file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@admin-raintree admin-raintree merged commit 30c63c3 into main Jun 9, 2026
17 checks passed
@admin-raintree admin-raintree deleted the feat/n3-publication-rerun branch June 9, 2026 22:45
admin-raintree added a commit that referenced this pull request Jun 9, 2026
* Make publication benchmark numbers harder to dismiss


* Fix CI: mypy assignment, ruff format, bandit B311


* feat(benchmark): add pass^k reliability metric + LLM-judge stub

pass^k (worst-of-k-trials clears the bar) is now computed alongside the
existing median and rendered in benchmark.summary.md. The N3 publication
report's headline median hid a 33-point reliability gap between providers
at the 90 threshold; pass^3 surfaces it (e.g. docpull 88% vs. tavily 50%).
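
A minimal sketch of pass^k as described above: a cell passes only if its worst trial clears the bar. The function names, the use of `>=` for "clears", and the per-provider rate over cells are assumptions; the scores in the usage lines are invented to show how a median can hide a failed trial.

```python
from statistics import median


def passes_all_k(trials, threshold=90.0):
    """pass^k for one cell: True only if the worst of the k trials meets the threshold."""
    if not trials:
        raise ValueError("need at least one trial")
    return min(trials) >= threshold


def pass_k_rate(cells, threshold=90.0):
    """Share of cells whose every trial clears the threshold.

    cells maps a cell key to its list of k trial scores.
    """
    if not cells:
        raise ValueError("need at least one cell")
    passed = sum(1 for trials in cells.values() if passes_all_k(trials, threshold))
    return passed / len(cells)


# Invented scores: both medians clear 90, but one cell has a bad trial.
cells = {"workflow-a": [95.0, 93.0, 91.0], "workflow-b": [96.0, 92.0, 70.0]}
print([median(t) for t in cells.values()])
print(pass_k_rate(cells))
```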

The LLM-judge module is an advisory stub — separate from benchmark_score,
key-gated, with a four-dimension rubric (coverage / groundedness /
source_authority / synthesis_readiness) per the Anthropic eval guidance.
Not wired into the report until calibrated against a hand-graded set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: ruff format + bandit B310 suppression on judge urlopen

CI runs `ruff format --check` (stricter than `ruff check`) and bandit
separately; my local sweep used only `ruff check`. Apply formatting and
switch the urlopen suppression from `# noqa: S310` (ruff prefix) to
`# nosec B310` (bandit's directive, matching the existing pattern at
benchmark.py:2491).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: admin-raintree <277948009+admin-raintree@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>