You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Aymen explicitly wants OpenAI and Google model families included. This issue defines the model/client matrix and provider adapter contracts without letting model choice blur into a single fake tool-quality score.
Goal
Add the benchmark model matrix and provider adapter contracts for OpenAI and Google-backed runs, with tests that use mocks rather than paid/live calls.
Acceptance criteria
The runner/model design cross-multiplies the evaluation matrix: every eligible tool is evaluated against every model in the matrix independently.
Tool scores are not averaged across different model families; any aggregate must remain secondary and clearly labeled.
Token metrics are labeled as Claude Tokens (Normalized Payload) when using the methodology tokenizer, so they are not confused with OpenAI or Google API billing tokens.
docs/benchmarks/model-matrix.yml defines the model/client families to run, including at least one OpenAI-backed and one Google-backed model family.
The matrix format records provider, model ID, client/wrapper, token-count method, latency scope, and whether the result is eligible for headline claims.
Provider adapter interfaces support request, response, latency, token metadata, and failure capture.
Tests cover mock OpenAI and mock Google adapter behavior without making network calls.
The report design keeps per-model results visible and marks any aggregate as aggregate.
The implementation refuses to run live providers unless explicit environment variables/config are present.
Scope boundaries
In scope:
Model matrix spec.
Provider adapter interfaces.
Mock adapter tests.
Live-call guardrails.
Out of scope:
Running the full benchmark.
Spending API credits.
Adding README claims.
Choosing final headline wording.
Forbidden-territory reminder
Do not modify MCP tool names, parameters, return shapes, schema.sql, .github/workflows/, pyproject.toml project metadata, .planning/POSITIONING.md, the README hero section, LICENSE, SECURITY.md, or existing tests by weakening/deleting assertions.
If a new third-party runtime dependency is proposed, trigger supervisor review and justify why stdlib/httpx is insufficient.
Validation commands
uv run ruff check src/ tests/
uv run pyright src/
uv run pytest --tb=short -q
uv run python-docs-mcp-server doctor
Run any new adapter tests directly as well.
PR template
Use Refs #63, not Closes #63.
The PR must include:
Model matrix summary.
Explanation of live-call guardrails.
Test output.
Confirmation that no paid/live provider calls were made.
Recovery
If token counting cannot be made comparable across providers, stop and comment with the exact mismatch instead of averaging incompatible numbers.
Effort estimate
4-6 hours. This can be agent-ready for mocked adapter plumbing only; live benchmark execution remains Vision-controlled.
Context
Parent: #63. Methodology:
docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md.Aymen explicitly wants OpenAI and Google model families included. This issue defines the model/client matrix and provider adapter contracts without letting model choice blur into a single fake tool-quality score.
Goal
Add the benchmark model matrix and provider adapter contracts for OpenAI and Google-backed runs, with tests that use mocks rather than paid/live calls.
Acceptance criteria
Claude Tokens (Normalized Payload)when using the methodology tokenizer, so they are not confused with OpenAI or Google API billing tokens.docs/benchmarks/model-matrix.ymldefines the model/client families to run, including at least one OpenAI-backed and one Google-backed model family.Scope boundaries
In scope:
Out of scope:
Forbidden-territory reminder
Do not modify MCP tool names, parameters, return shapes,
schema.sql,.github/workflows/,pyproject.tomlproject metadata,.planning/POSITIONING.md, the README hero section,LICENSE,SECURITY.md, or existing tests by weakening/deleting assertions.If a new third-party runtime dependency is proposed, trigger supervisor review and justify why stdlib/httpx is insufficient.
Validation commands
Run any new adapter tests directly as well.
PR template
Use
Refs #63, notCloses #63.The PR must include:
Recovery
If token counting cannot be made comparable across providers, stop and comment with the exact mismatch instead of averaging incompatible numbers.
Effort estimate
4-6 hours. This can be agent-ready for mocked adapter plumbing only; live benchmark execution remains Vision-controlled.