feat(benchmark): pass^k reliability metric + LLM-judge stub by admin-raintree · Pull Request #62 · raintree-technology/docpull

admin-raintree · 2026-06-09T23:22:31Z

Summary

pass^k (worst-of-k-trials clears the threshold) is now computed alongside the existing median and rendered in benchmark.summary.md. Stricter framing than median for the publication: "users expect reliable behavior every time" per the Anthropic "Demystifying evals for AI agents" post.
LLM-judge stub (docpull.judge) lands as an advisory module — separate from benchmark_score, key-gated (skips with structured reason when ANTHROPIC_API_KEY unset), four-dimension rubric (coverage / groundedness / source_authority / synthesis_readiness). Not wired into the report until calibrated against a hand-graded set.
Also includes Fix CI: mypy assignment, ruff format, bandit B311 (already pushed).
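
To illustrate the key-gating described above, here is a minimal sketch of how an advisory judge could skip with a structured reason when `ANTHROPIC_API_KEY` is unset. The names `JudgeResult` and `judge_pack`, the `env` parameter, and the reason string are hypothetical and are not taken from `docpull.judge`; only the environment variable and the four rubric dimensions come from this PR.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, Mapping, Optional

# The four rubric dimensions named in the PR summary.
RUBRIC = ("coverage", "groundedness", "source_authority", "synthesis_readiness")


@dataclass
class JudgeResult:
    """Hypothetical result container; not the actual docpull.judge type."""
    skipped: bool
    reason: Optional[str] = None
    scores: Dict[str, float] = field(default_factory=dict)


def judge_pack(pack_text: str, env: Optional[Mapping[str, str]] = None) -> JudgeResult:
    """Return a structured skip when no API key is configured.

    `env` is injectable so the gate can be exercised without touching
    the real process environment (an assumption made for this sketch).
    """
    env = os.environ if env is None else env
    if not env.get("ANTHROPIC_API_KEY"):
        # Advisory module: a missing key is a skip, not an error.
        return JudgeResult(skipped=True, reason="ANTHROPIC_API_KEY unset")
    # A calibrated judge would score pack_text on each RUBRIC dimension here.
    # This sketch deliberately stops short of any model call.
    raise NotImplementedError("judge scoring is not part of this sketch")


result = judge_pack("example pack text", env={})
print(result.skipped, result.reason)
```

Returning a result object rather than raising keeps the judge separate from `benchmark_score`: a caller can record the skip reason without the benchmark run failing.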

Why this hardens the N3 publication numbers

On the existing n3-rerun-publication run, the headline median hid a 33-point reliability gap at the 90 threshold that pass^3 surfaces:

| score | @70 | @80 | @90 |
| --- | --- | --- | --- |
| `pack_score` | 94.4% | 94.4% | 91.7% |
| `benchmark_score` | 94.4% | 86.1% | 66.7% |

Per-provider on benchmark_score@90: docpull 88%, exa 75%, parallel 57%, tavily 50%.
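
A minimal sketch of the worst-of-k idea behind pass^k, next to a median-based pass rate for contrast. The function names, the cell keys, and the trial scores below are hypothetical illustration data, not values from the n3-rerun-publication run, and the sketch assumes that "clears the threshold" means greater than or equal to it.

```python
from statistics import median
from typing import Dict, List


def pass_hat_k(trials: Dict[str, List[float]], threshold: float) -> float:
    """Fraction of cells whose worst trial still meets the threshold."""
    if not trials:
        return 0.0
    passed = sum(
        1 for scores in trials.values() if scores and min(scores) >= threshold
    )
    return passed / len(trials)


def median_pass_rate(trials: Dict[str, List[float]], threshold: float) -> float:
    """Fraction of cells whose median trial meets the threshold."""
    if not trials:
        return 0.0
    passed = sum(
        1 for scores in trials.values() if scores and median(scores) >= threshold
    )
    return passed / len(trials)


# Hypothetical scores for three cells, k = 3 trials each.
example = {
    "cell-a": [95.0, 92.0, 91.0],  # every trial clears 90
    "cell-b": [96.0, 93.0, 85.0],  # median clears 90, worst trial does not
    "cell-c": [72.0, 75.0, 71.0],  # clears 70 only
}

for threshold in (70, 80, 90):
    print(
        threshold,
        round(median_pass_rate(example, threshold), 3),
        round(pass_hat_k(example, threshold), 3),
    )
```

In this made-up data, cell-b passes at 90 on its median but fails on its worst trial, which is the kind of gap a median headline can hide.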

Test plan

pytest tests/test_passk.py tests/test_judge.py — 18 passed
Full unit sweep (excluding live/perf benchmarks) — 494 passed
ruff check + mypy clean on new files and benchmark.py
Regenerated .bench/runs/n3-rerun-publication/benchmark.summary.md to verify the Reliability section renders
CI green

🤖 Generated with Claude Code

Four publication-blocking changes to the benchmark harness:

- Retry transient 429/5xx and URLError on live provider HTTP with bounded backoff + jitter; honors Retry-After. Distinguishes transient from terminal errors so a single hiccup no longer leaves a failed cell in the published heatmap.
- Add --runs N: per-run output/cache subdirs (run-1/, run-2/, ...), median + min/max aggregation, full per-run history preserved. Cached pass is forced off for N>1 since it composes poorly with repetition.
- Floor benchmark_score to 0 when records is empty instead of letting cleanliness/source_fidelity/freshness vacuously return 100 and arithmetic-averaging to ~50. Zero records is a routing gap.
- Article generator discloses methodology: heuristic-weight caveat, boilerplate substring-sniff caveat, freshness presence-test caveat, N-run repetition note, and a new Workload Disclosure table showing per-workflow page counts so readers can see comparison terms.

16 benchmark tests pass (12 existing + 4 new for retry, retry-exhaust, empty-pack floor, and N=3 aggregation). Full 476-test suite clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
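
The retry behaviour described in the commit message above could look roughly like the following sketch. It is not the harness's actual `_retry_delay_seconds`: the function names, the base delay, the cap, the jitter fraction, and the injectable `rng` parameter are all assumptions chosen for illustration. Only the ideas (transient 429/5xx, bounded backoff, jitter, honoring Retry-After) come from the commit message.

```python
import random
from typing import Callable, Optional


def is_transient(status: int) -> bool:
    """Treat 429 and any 5xx as retryable; everything else is terminal."""
    return status == 429 or 500 <= status <= 599


def retry_delay_seconds(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 0.5,   # assumed base delay in seconds
    cap: float = 30.0,   # assumed upper bound on any single delay
    rng: Callable[[float, float], float] = random.uniform,
) -> float:
    """Delay before retry number `attempt` (0-based), bounded by `cap`."""
    if retry_after is not None:
        try:
            # Honor a Retry-After given in seconds, clamped to [0, cap].
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    backoff = min(cap, base * (2 ** attempt))
    # Jitter only spreads retries out; it is not security-sensitive.
    jitter = rng(0.0, backoff * 0.25)  # nosec B311
    return min(cap, backoff + jitter)


for attempt in range(4):
    print(attempt, round(retry_delay_seconds(attempt), 3))
```

Clamping both the Retry-After value and the computed backoff to the same cap keeps the worst-case wait per attempt predictable, which matters when a benchmark run makes many live provider calls.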

- _workload_disclosure_lines: annotate med_text/range_text as str so mypy stops inferring int from the populated branch (#61 CI).
- _retry_delay_seconds: # nosec B311 on the jitter random.uniform call; backoff jitter is not security-sensitive.
- ruff format pass on the file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

pass^k (worst-of-k-trials clears the bar) is now computed alongside the existing median and rendered in benchmark.summary.md. The N3 publication report's headline median hid a 33-point reliability gap between providers at the 90 threshold; pass^3 surfaces it (e.g. docpull 88% vs. tavily 50%). The LLM-judge module is an advisory stub — separate from benchmark_score, key-gated, with a four-dimension rubric (coverage / groundedness / source_authority / synthesis_readiness) per the Anthropic eval guidance. Not wired into the report until calibrated against a hand-graded set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

vercel · 2026-06-09T23:22:33Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docpull | Ready | Preview, Comment | Jun 9, 2026 11:25pm |

CI runs `ruff format --check` (stricter than `ruff check`) and bandit separately; my local sweep used only `ruff check`. Apply formatting and switch the urlopen suppression from `# noqa: S310` (ruff prefix) to `# nosec B310` (bandit's directive, matching the existing pattern at benchmark.py:2491). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

admin-raintree and others added 3 commits June 9, 2026 15:36

vercel Bot deployed to Preview June 9, 2026 23:22 View deployment

admin-raintree enabled auto-merge (squash) June 9, 2026 23:22

vercel Bot deployed to Preview June 9, 2026 23:25 View deployment

admin-raintree merged commit 8357fa7 into main Jun 9, 2026
17 checks passed

admin-raintree deleted the feat/n3-publication-rerun branch June 9, 2026 23:26

admin-raintree merged 4 commits into main from feat/n3-publication-rerun
