Skip to content

[v0.5.0] benchmark adapters — define OpenAI/Google model matrix #73

@ayhammouda

Description

@ayhammouda

Context

Parent: #63. Methodology: docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md.

Aymen explicitly wants OpenAI and Google model families included. This issue defines the model/client matrix and provider adapter contracts without letting model choice blur into a single fake tool-quality score.

Goal

Add the benchmark model matrix and provider adapter contracts for OpenAI and Google-backed runs, with tests that use mocks rather than paid/live calls.

Acceptance criteria

  • The runner/model design cross-multiplies the evaluation matrix: every eligible tool is evaluated against every model in the matrix independently.
  • Tool scores are not averaged across different model families; any aggregate must remain secondary and clearly labeled.
  • Token metrics are labeled as Claude Tokens (Normalized Payload) when using the methodology tokenizer, so they are not confused with OpenAI or Google API billing tokens.
  • docs/benchmarks/model-matrix.yml defines the model/client families to run, including at least one OpenAI-backed and one Google-backed model family.
  • The matrix format records provider, model ID, client/wrapper, token-count method, latency scope, and whether the result is eligible for headline claims.
  • Provider adapter interfaces support request, response, latency, token metadata, and failure capture.
  • Tests cover mock OpenAI and mock Google adapter behavior without making network calls.
  • The report design keeps per-model results visible and marks any aggregate as aggregate.
  • The implementation refuses to run live providers unless explicit environment variables/config are present.

Scope boundaries

In scope:

  • Model matrix spec.
  • Provider adapter interfaces.
  • Mock adapter tests.
  • Live-call guardrails.

Out of scope:

  • Running the full benchmark.
  • Spending API credits.
  • Adding README claims.
  • Choosing final headline wording.

Forbidden-territory reminder

Do not modify MCP tool names, parameters, return shapes, schema.sql, .github/workflows/, pyproject.toml project metadata, .planning/POSITIONING.md, the README hero section, LICENSE, SECURITY.md, or existing tests by weakening/deleting assertions.

If a new third-party runtime dependency is proposed, trigger supervisor review and justify why stdlib/httpx is insufficient.

Validation commands

uv run ruff check src/ tests/
uv run pyright src/
uv run pytest --tb=short -q
uv run python-docs-mcp-server doctor

Run any new adapter tests directly as well.

PR template

Use Refs #63, not Closes #63.

The PR must include:

  • Model matrix summary.
  • Explanation of live-call guardrails.
  • Test output.
  • Confirmation that no paid/live provider calls were made.

Recovery

If token counting cannot be made comparable across providers, stop and comment with the exact mismatch instead of averaging incompatible numbers.

Effort estimate

4-6 hours. This can be agent-ready for mocked adapter plumbing only; live benchmark execution remains Vision-controlled.

Metadata

Metadata

Assignees

No one assigned

    Labels

    agent-readyIssue passed AGENT-EXECUTION-PIPELINE.md §10 pre-flight; scoped for an autonomous agentenhancementNew feature or requestpriority:P2Medium priority

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions