feat(benchmark): pass^k reliability metric + LLM-judge stub #62

Merged: admin-raintree merged 4 commits into main from feat/n3-publication-rerun on Jun 9, 2026

@admin-raintree

Summary

  • pass^k (the worst of k trials clears the threshold) is now computed alongside the existing median and rendered in benchmark.summary.md. It is a stricter framing than the median for the publication: "users expect reliable behavior every time", per the Anthropic "Demystifying evals for AI agents" post.
  • The LLM-judge stub (docpull.judge) lands as an advisory module: separate from benchmark_score, key-gated (skips with a structured reason when ANTHROPIC_API_KEY is unset), and scored on a four-dimension rubric (coverage / groundedness / source_authority / synthesis_readiness). It is not wired into the report until it has been calibrated against a hand-graded set.
  • Also includes the already-pushed "Fix CI" commit (mypy assignment, ruff format, bandit B311).
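
For readers unfamiliar with key-gating, the skip behavior described in the second bullet could look roughly like the sketch below. This is not the actual docpull.judge code: `JudgeResult`, `judge_pack`, the `missing_api_key` reason string, and the `env` parameter are all hypothetical names chosen for illustration. Only the environment variable and the four rubric dimensions come from this PR.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# The four rubric dimensions named in the PR summary.
RUBRIC_DIMENSIONS = (
    "coverage",
    "groundedness",
    "source_authority",
    "synthesis_readiness",
)


@dataclass(frozen=True)
class JudgeResult:
    """Hypothetical result type: either a structured skip or rubric scores."""

    skipped: bool
    reason: Optional[str] = None
    scores: Optional[dict] = None


def judge_pack(pack_text: str, env: Mapping[str, str] = os.environ) -> JudgeResult:
    """Return a structured skip when no API key is configured.

    Assumption: the real module may gate differently; this only shows the shape.
    """
    if not env.get("ANTHROPIC_API_KEY"):
        return JudgeResult(skipped=True, reason="missing_api_key")
    # A real judge would call the model here. The stub leaves every
    # dimension unscored so callers cannot mistake it for a verdict.
    return JudgeResult(skipped=False, scores={dim: None for dim in RUBRIC_DIMENSIONS})
```

Returning a structured skip rather than raising lets a benchmark run without a key complete normally while still recording why no judge scores are present.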

Why this hardens the N3 publication numbers

On the existing n3-rerun-publication run, the headline median hid a 33-point reliability gap at the 90 threshold that pass^3 surfaces:

| score | @70 | @80 | @90 |
| --- | --- | --- | --- |
| pack_score | 94.4% | 94.4% | 91.7% |
| benchmark_score | 94.4% | 86.1% | 66.7% |

Per-provider on benchmark_score@90: docpull 88%, exa 75%, parallel 57%, tavily 50%.
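
To make the median versus pass^k distinction concrete, here is a minimal sketch. The function names, the cell keys, and the trial scores are invented for illustration and are not taken from the n3-rerun-publication run. It also assumes "clears the threshold" means greater than or equal to it.

```python
from statistics import median
from typing import Callable, Mapping, Sequence

Predicate = Callable[[Sequence[float], float], bool]


def passes_median(trial_scores: Sequence[float], threshold: float) -> bool:
    """A cell passes if its median trial clears the threshold."""
    return median(trial_scores) >= threshold


def passes_all_k(trial_scores: Sequence[float], threshold: float) -> bool:
    """pass^k: a cell passes only if its worst trial clears the threshold."""
    return min(trial_scores) >= threshold


def pass_rate(
    cells: Mapping[tuple, Sequence[float]], threshold: float, predicate: Predicate
) -> float:
    """Fraction of cells for which the predicate holds."""
    if not cells:
        raise ValueError("no cells to score")
    return sum(predicate(s, threshold) for s in cells.values()) / len(cells)


# Made-up scores for three (provider, workflow) cells, k = 3 trials each.
cells = {
    ("provider_a", "workflow_1"): [95.0, 93.0, 91.0],
    ("provider_a", "workflow_2"): [96.0, 92.0, 71.0],  # one bad trial
    ("provider_b", "workflow_1"): [94.0, 90.0, 88.0],  # one near miss
}

median_rate = pass_rate(cells, 90.0, passes_median)  # all three medians are >= 90
pass3_rate = pass_rate(cells, 90.0, passes_all_k)    # only the first cell survives
```

In this toy data every cell passes on the median, but only one of three passes under pass^3, which is the kind of gap a median-only headline can hide.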

Test plan

  • pytest tests/test_passk.py tests/test_judge.py — 18 passed
  • Full unit sweep (excluding live/perf benchmarks) — 494 passed
  • ruff check + mypy clean on new files and benchmark.py
  • Regenerated .bench/runs/n3-rerun-publication/benchmark.summary.md to verify the Reliability section renders
  • CI green

🤖 Generated with Claude Code

admin-raintree and others added 3 commits June 9, 2026 15:36
Four publication-blocking changes to the benchmark harness:

- Retry transient 429/5xx and URLError on live provider HTTP with
  bounded backoff + jitter; honors Retry-After. Distinguishes transient
  from terminal errors so a single hiccup no longer leaves a failed
  cell in the published heatmap.
- Add --runs N: per-run output/cache subdirs (run-1/, run-2/, ...),
  median + min/max aggregation, full per-run history preserved. Cached
  pass is forced off for N>1 since it composes poorly with repetition.
- Floor benchmark_score to 0 when records is empty instead of letting
  cleanliness/source_fidelity/freshness vacuously return 100 and pull
  the arithmetic mean up to ~50. Zero records is a routing gap.
- Article generator discloses methodology: heuristic-weight caveat,
  boilerplate substring-sniff caveat, freshness presence-test caveat,
  N-run repetition note, and a new Workload Disclosure table showing
  per-workflow page counts so readers can see comparison terms.
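
As a rough illustration of the retry policy in the first bullet, the delay calculation could be sketched as below. This is not the harness's `_retry_delay_seconds`; the names `is_transient` and `retry_delay_seconds`, the base and cap values, and the choice of full jitter are assumptions made for this example. Only the seconds form of Retry-After is handled here.

```python
import random
from typing import Callable, Optional


def is_transient(status: int) -> bool:
    """Treat 429 and any 5xx as retryable; everything else is terminal."""
    return status == 429 or 500 <= status <= 599


def retry_delay_seconds(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[float, float], float] = random.uniform,
) -> float:
    """Delay before retry number `attempt` (0-based), bounded by `cap`."""
    if retry_after is not None:
        try:
            # Honor the server's hint, clamped to [0, cap].
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # HTTP-date form is not parsed in this sketch; fall back
    backoff = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, backoff]. Not security-sensitive.
    return rng(0.0, backoff)
```

A caller would loop up to a fixed number of attempts, sleep for the returned delay when `is_transient` is true, and give up immediately on terminal statuses.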

16 benchmark tests pass (12 existing + 4 new for retry, retry-exhaust,
empty-pack floor, and N=3 aggregation). Full 476-test suite clean.
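
The empty-pack floor can be illustrated with a toy scorer. The two components below (`share_passing`, `volume_score`) and the `expected` page count are hypothetical stand-ins, not the harness's real cleanliness/source_fidelity/freshness functions. They only reproduce the effect described above: one kind of component is vacuously perfect on an empty pack, another is zero, and the mean lands near 50.

```python
from statistics import mean
from typing import Callable, Sequence


def share_passing(records: Sequence[dict], check: Callable[[dict], bool]) -> float:
    """Percent of records passing `check`; vacuously 100 when there are none."""
    if not records:
        return 100.0
    return 100.0 * sum(1 for r in records if check(r)) / len(records)


def volume_score(records: Sequence[dict], expected: int) -> float:
    """Percent of the expected page count actually returned (0 when empty)."""
    return 100.0 * min(len(records), expected) / expected


def naive_score(records: Sequence[dict], expected: int = 10) -> float:
    """Unfloored mean of the two toy components."""
    return mean(
        [
            share_passing(records, lambda r: r.get("clean", False)),
            volume_score(records, expected),
        ]
    )


def floored_score(records: Sequence[dict], expected: int = 10) -> float:
    """Same mean, but an empty pack scores 0 instead of a misleading ~50."""
    if not records:
        return 0.0
    return naive_score(records, expected)
```

With these definitions, `naive_score([])` evaluates to 50.0 while `floored_score([])` is 0.0, and non-empty packs score identically under both.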

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- _workload_disclosure_lines: annotate med_text/range_text as str so
  mypy stops inferring int from the populated branch (#61 CI).
- _retry_delay_seconds: # nosec B311 on the jitter random.uniform call;
  backoff jitter is not security-sensitive.
- ruff format pass on the file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

pass^k (worst-of-k-trials clears the bar) is now computed alongside the
existing median and rendered in benchmark.summary.md. The N3 publication
report's headline median hid a 33-point reliability gap between providers
at the 90 threshold; pass^3 surfaces it (e.g. docpull 88% vs. tavily 50%).

The LLM-judge module is an advisory stub — separate from benchmark_score,
key-gated, with a four-dimension rubric (coverage / groundedness /
source_authority / synthesis_readiness) per the Anthropic eval guidance.
Not wired into the report until calibrated against a hand-graded set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 9, 2026

The latest updates on your projects:

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docpull | Ready | Preview, Comment | Jun 9, 2026 11:25pm |

admin-raintree enabled auto-merge (squash) June 9, 2026 23:22

CI runs `ruff format --check` (a formatting check that `ruff check`
does not cover) and bandit as separate steps; my local sweep used only
`ruff check`. Apply formatting and switch the urlopen suppression from
`# noqa: S310` (ruff's rule code) to `# nosec B310` (bandit's
directive, matching the existing pattern at benchmark.py:2491).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

admin-raintree merged commit 8357fa7 into main Jun 9, 2026
17 checks passed
admin-raintree deleted the feat/n3-publication-rerun branch June 9, 2026 23:26