[v0.5.0] benchmark corpus — define schema and 50-question eval pack

## Context

Parent: #63. Methodology: `docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md`.

This is the human-led corpus issue. The public benchmark is only credible if the question set is fixed, balanced, version-aware, and auditable before results are known.

## Goal

Define the benchmark corpus schema and the 50-question Python stdlib evaluation pack used by the v0.5.0 public benchmark.

## Acceptance criteria

- [ ] `docs/benchmarks/corpus.schema.json` defines the corpus shape: stable ID, category, Python version or version pair, prompt, official-doc answer key, required source sections, expected answer properties, ambiguity notes.
- [ ] `docs/benchmarks/corpus.yml` contains exactly 50 questions matching the methodology distribution: 15 exact-symbol, 10 concept/API usage, 15 cross-version, 5 PEP-adjacent, 5 applied stdlib-selection.
- [ ] Cross-version questions lead with `compare_versions`-style diffs and cite the relevant official Python versions.
- [ ] Every answer key cites CPython docs source, generated official docs, or official What's New pages. No blogs, snippets, LLM answers, or third-party mirrors as truth source.
- [ ] A validation command fails if the corpus has duplicate IDs, missing citations, wrong category counts, or unsupported categories.
- [ ] The corpus is frozen before any benchmark result is used for README/PyPI/launch copy.

## Scope boundaries

In scope:
- Corpus schema.
- Corpus validation command/test.
- The 50-question corpus itself.
- Human review of question quality and category balance.

Out of scope:
- Running the benchmark.
- Scoring model answers.
- Publishing README benchmark claims.
- Adding provider/model API calls.

## Forbidden-territory reminder

Do not modify MCP tool names, parameters, return shapes, `schema.sql`, `.github/workflows/`, `pyproject.toml` project metadata, `.planning/POSITIONING.md`, the README hero section, `LICENSE`, `SECURITY.md`, or existing tests by weakening/deleting assertions.

## Validation commands

```bash
uv run ruff check src/ tests/
uv run pyright src/
uv run pytest --tb=short -q
uv run python-docs-mcp-server doctor
```

Add the new corpus validation command/test once implemented and include its output in the PR.

## PR template

Use `Refs #63`, not `Closes #63`.

The PR must include:
- Category count summary.
- Corpus validation output.
- Notes for any ambiguous questions and why they remain acceptable.
- Confirmation that no benchmark results were consulted while editing the corpus.

## Recovery

If the corpus cannot reach 50 high-quality questions without ambiguity or source gaps, stop and comment with the weak categories and proposed fixes.

## Effort estimate

4-8 hours. This is intentionally human-led; do not label it `agent-ready` until the corpus questions are reviewed by Vision.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[v0.5.0] benchmark corpus — define schema and 50-question eval pack #71

Context

Goal

Acceptance criteria

Scope boundaries

Forbidden-territory reminder

Validation commands

PR template

Recovery

Effort estimate

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

[v0.5.0] benchmark corpus — define schema and 50-question eval pack #71

Description

Context

Goal

Acceptance criteria

Scope boundaries

Forbidden-territory reminder

Validation commands

PR template

Recovery

Effort estimate

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions