This is the human-led corpus issue. The public benchmark is only credible if the question set is fixed, balanced, version-aware, and auditable before results are known.
Goal
Define the benchmark corpus schema and the 50-question Python stdlib evaluation pack used by the v0.5.0 public benchmark.
Acceptance criteria
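docs/benchmarks/corpus.schema.json defines the corpus shape: stable ID, category, Python version or version pair, prompt, official-doc answer key, required source sections, expected answer properties, ambiguity notes.
docs/benchmarks/corpus.yml contains exactly 50 questions matching the methodology distribution: 15 exact-symbol, 10 concept/API usage, 15 cross-version, 5 PEP-adjacent, 5 applied stdlib-selection.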
Cross-version questions lead with compare_versions-style diffs and cite the relevant official Python versions.
Every answer key cites CPython docs source, generated official docs, or official What's New pages. No blogs, snippets, LLM answers, or third-party mirrors as truth source.
A validation command fails if the corpus has duplicate IDs, missing citations, wrong category counts, or unsupported categories.
The corpus is frozen before any benchmark result is used for README/PyPI/launch copy.
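As a rough illustration of the criteria above, one corpus entry and a truth-source check might look like the sketch below. The field names, the CorpusEntry class, the is_official_source helper, and the host allowlist are all assumptions made for this example; the real shape is whatever docs/benchmarks/corpus.schema.json ends up defining.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical allowlist: the issue names the allowed source types,
# not these exact hosts or paths.
OFFICIAL_HOSTS = {"docs.python.org"}
CPYTHON_REPO_PREFIX = "/python/cpython/"


@dataclass(frozen=True)
class CorpusEntry:
    """One benchmark question. Field names are illustrative, not the final schema."""

    id: str                            # stable ID, never reused
    category: str
    python_versions: tuple[str, ...]   # one version, or a pair for cross-version
    prompt: str
    answer_key: str
    citations: tuple[str, ...]         # official sources backing the answer key
    required_sections: tuple[str, ...] = ()
    expected_properties: tuple[str, ...] = ()
    ambiguity_notes: str = ""


def is_official_source(url: str) -> bool:
    """Accept only HTTPS links to the official docs or the CPython repository."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    if parsed.hostname in OFFICIAL_HOSTS:
        return True
    return parsed.hostname == "github.com" and parsed.path.startswith(CPYTHON_REPO_PREFIX)


# Illustrative entry only; not part of the actual corpus.
example = CorpusEntry(
    id="xver-001",
    category="cross-version",
    python_versions=("3.11", "3.12"),
    prompt="Is pathlib.Path.walk() available in both versions?",
    answer_key="Added in Python 3.12; not available in 3.11.",
    citations=("https://docs.python.org/3.12/library/pathlib.html#pathlib.Path.walk",),
)
```

A frozen dataclass is used here only to mirror the "stable, auditable" intent; a plain dict validated against the JSON schema would serve equally well.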
Scope boundaries
In scope:
Corpus schema.
Corpus validation command/test.
The 50-question corpus itself.
Human review of question quality and category balance.
Out of scope:
Running the benchmark.
Scoring model answers.
Publishing README benchmark claims.
Adding provider/model API calls.
Forbidden-territory reminder
Do not modify MCP tool names, parameters, return shapes, schema.sql, .github/workflows/, pyproject.toml project metadata, .planning/POSITIONING.md, the README hero section, LICENSE, SECURITY.md, or existing tests by weakening/deleting assertions.
Validation commands
uv run ruff check src/ tests/
uv run pyright src/
uv run pytest --tb=short -q
uv run python-docs-mcp-server doctor
Add the new corpus validation command/test once implemented and include its output in the PR.
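The corpus validation command does not exist yet, so the following is only a sketch of the checks it is expected to perform: duplicate IDs, missing citations, wrong category counts, and unsupported categories. The function name validate_corpus, the dict keys, and the category slug spellings are assumptions; the per-category numbers follow the methodology distribution (15/10/15/5/5).

```python
from collections import Counter

# Counts come from the methodology distribution; the slug spellings are assumed.
EXPECTED_COUNTS = {
    "exact-symbol": 15,
    "concept-api-usage": 10,
    "cross-version": 15,
    "pep-adjacent": 5,
    "applied-stdlib-selection": 5,
}


def validate_corpus(entries: list[dict]) -> list[str]:
    """Return human-readable errors; an empty list means the corpus passes."""
    errors: list[str] = []

    # Duplicate IDs.
    id_counts = Counter(entry.get("id") for entry in entries)
    for entry_id, count in sorted(id_counts.items(), key=lambda item: str(item[0])):
        if count > 1:
            errors.append(f"duplicate id: {entry_id}")

    # Missing citations and unsupported categories, checked per entry.
    for entry in entries:
        if not entry.get("citations"):
            errors.append(f"missing citations: {entry.get('id')}")
        if entry.get("category") not in EXPECTED_COUNTS:
            errors.append(
                f"unsupported category: {entry.get('category')!r} in {entry.get('id')}"
            )

    # Category balance against the expected distribution.
    category_counts = Counter(entry.get("category") for entry in entries)
    for category, expected in EXPECTED_COUNTS.items():
        actual = category_counts.get(category, 0)
        if actual != expected:
            errors.append(
                f"wrong count for {category}: expected {expected}, found {actual}"
            )

    return errors
```

A command-line wrapper could print each error and exit non-zero when the list is non-empty, which is what lets the check fail a test run or CI job.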
PR template
Use Refs #63, not Closes #63.
The PR must include:
Category count summary.
Corpus validation output.
Notes for any ambiguous questions and why they remain acceptable.
Confirmation that no benchmark results were consulted while editing the corpus.
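For the category count summary, a small helper along these lines could produce text to paste into the PR. The digest function is an additional, hypothetical way to make the freeze auditable (recording a SHA-256 of the corpus file before any results exist); neither helper is prescribed by this issue, and both names are invented for the example.

```python
import hashlib
from collections import Counter


def category_summary(entries: list[dict]) -> str:
    """Render one 'category: count' line per category, plus a total."""
    counts = Counter(entry["category"] for entry in entries)
    lines = [f"{category}: {count}" for category, count in sorted(counts.items())]
    lines.append(f"total: {sum(counts.values())}")
    return "\n".join(lines)


def corpus_digest(corpus_bytes: bytes) -> str:
    """SHA-256 of the raw corpus file, usable as a freeze marker."""
    return hashlib.sha256(corpus_bytes).hexdigest()
```

Passing the bytes of docs/benchmarks/corpus.yml to corpus_digest and quoting the result in the PR would let a reviewer confirm later that the corpus did not change after benchmark results became available.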
Recovery
If the corpus cannot reach 50 high-quality questions without ambiguity or source gaps, stop and comment with the weak categories and proposed fixes.
Effort estimate
4-8 hours. This is intentionally human-led; do not label it agent-ready until the corpus questions are reviewed by Vision.
Context
Parent: #63. Methodology: docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md.