[v0.5.0] benchmark corpus — define schema and 50-question eval pack #71

@ayhammouda

Context

Parent: #63. Methodology: docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md.

This is the human-led corpus issue. The public benchmark is only credible if the question set is fixed, balanced, version-aware, and auditable before results are known.

Goal

Define the benchmark corpus schema and the 50-question Python stdlib evaluation pack used by the v0.5.0 public benchmark.

Acceptance criteria

  • docs/benchmarks/corpus.schema.json defines the corpus shape: stable ID, category, Python version or version pair, prompt, official-doc answer key, required source sections, expected answer properties, ambiguity notes.
  • docs/benchmarks/corpus.yml contains exactly 50 questions matching the methodology distribution: 15 exact-symbol, 10 concept/API usage, 15 cross-version, 5 PEP-adjacent, 5 applied stdlib-selection.
  • Cross-version questions lead with compare_versions-style diffs and cite the relevant official Python versions.
  • Every answer key cites CPython docs source, generated official docs, or official What's New pages. Blogs, snippets, LLM answers, and third-party mirrors are not accepted as a source of truth.
  • A validation command fails if the corpus has duplicate IDs, missing citations, wrong category counts, or unsupported categories.
  • The corpus is frozen before any benchmark result is used for README/PyPI/launch copy.
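
A minimal sketch of the checks the validation command could run, covering the four failure conditions listed above. It assumes each question is a dict with hypothetical `id`, `category`, and `citations` keys, and the category slugs are assumed spellings of the methodology categories; the real schema and loader may differ.

```python
from collections import Counter

# Category targets from the acceptance criteria; the slug spellings are assumed.
EXPECTED_COUNTS = {
    "exact-symbol": 15,
    "concept-api-usage": 10,
    "cross-version": 15,
    "pep-adjacent": 5,
    "applied-stdlib-selection": 5,
}


def validate_corpus(questions):
    """Return a list of error strings; an empty list means the corpus passes."""
    errors = []

    # Duplicate IDs: every stable ID must appear exactly once.
    id_counts = Counter(q.get("id") for q in questions)
    for qid, n in id_counts.items():
        if n > 1:
            errors.append(f"duplicate id: {qid}")

    # Per-question checks: supported category and at least one citation.
    for q in questions:
        if q.get("category") not in EXPECTED_COUNTS:
            errors.append(f"{q.get('id')}: unsupported category {q.get('category')!r}")
        if not q.get("citations"):
            errors.append(f"{q.get('id')}: missing citations")

    # Category counts must match the methodology distribution exactly.
    actual = Counter(q.get("category") for q in questions)
    for category, expected in EXPECTED_COUNTS.items():
        found = actual.get(category, 0)
        if found != expected:
            errors.append(f"{category}: expected {expected} questions, found {found}")

    return errors
```

A command wrapping this function would exit non-zero whenever the returned list is non-empty, so the same check can run in a test and from the command line.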

Scope boundaries

In scope:

  • Corpus schema.
  • Corpus validation command/test.
  • The 50-question corpus itself.
  • Human review of question quality and category balance.

Out of scope:

  • Running the benchmark.
  • Scoring model answers.
  • Publishing README benchmark claims.
  • Adding provider/model API calls.

Forbidden-territory reminder

Do not modify MCP tool names, parameters, return shapes, schema.sql, .github/workflows/, pyproject.toml project metadata, .planning/POSITIONING.md, the README hero section, LICENSE, or SECURITY.md. Do not weaken or delete assertions in existing tests.

Validation commands

uv run ruff check src/ tests/
uv run pyright src/
uv run pytest --tb=short -q
uv run python-docs-mcp-server doctor

Add the new corpus validation command/test once implemented and include its output in the PR.

PR template

Use Refs #63, not Closes #63.

The PR must include:

  • Category count summary.
  • Corpus validation output.
  • Notes for any ambiguous questions and why they remain acceptable.
  • Confirmation that no benchmark results were consulted while editing the corpus.
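
The category count summary for the PR could be produced with a small helper along these lines. This is only a sketch: the `category` key and the plain-text output format are assumptions, not something the methodology prescribes.

```python
from collections import Counter


def category_summary(questions):
    """Return one 'category: count' line per category, followed by a total line."""
    counts = Counter(q["category"] for q in questions)
    lines = [f"{category}: {n}" for category, n in sorted(counts.items())]
    lines.append(f"total: {sum(counts.values())}")
    return "\n".join(lines)
```

Sorting by category name keeps the output stable between runs, which makes it easy to compare the summary across PR revisions.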

Recovery

If the corpus cannot reach 50 high-quality questions without ambiguity or source gaps, stop and comment with the weak categories and proposed fixes.

Effort estimate

4-8 hours. This is intentionally human-led; do not label it agent-ready until the corpus questions are reviewed by Vision.

Metadata

Assignees

No one assigned

Labels

documentation, enhancement, priority:P2
