[v0.5.0] benchmark adapters — define OpenAI/Google model matrix

## Context

Parent: #63. Methodology: `docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md`.

Aymen explicitly wants OpenAI and Google model families included. This issue defines the model/client matrix and provider adapter contracts without letting model choice blur into a single fake tool-quality score.

## Goal

Add the benchmark model matrix and provider adapter contracts for OpenAI and Google-backed runs, with tests that use mocks rather than paid/live calls.

## Acceptance criteria

- [ ] The runner/model design cross-multiplies the evaluation matrix: every eligible tool is evaluated against every model in the matrix independently.
- [ ] Tool scores are not averaged across different model families; any aggregate must remain secondary and clearly labeled.
- [ ] Token metrics are labeled as `Claude Tokens (Normalized Payload)` when using the methodology tokenizer, so they are not confused with OpenAI or Google API billing tokens.
- [ ] `docs/benchmarks/model-matrix.yml` defines the model/client families to run, including at least one OpenAI-backed and one Google-backed model family.
- [ ] The matrix format records provider, model ID, client/wrapper, token-count method, latency scope, and whether the result is eligible for headline claims.
- [ ] Provider adapter interfaces support request, response, latency, token metadata, and failure capture.
- [ ] Tests cover mock OpenAI and mock Google adapter behavior without making network calls.
- [ ] The report design keeps per-model results visible and marks any aggregate as aggregate.
- [ ] The implementation refuses to run live providers unless explicit environment variables/config are present.

## Scope boundaries

In scope:
- Model matrix spec.
- Provider adapter interfaces.
- Mock adapter tests.
- Live-call guardrails.

Out of scope:
- Running the full benchmark.
- Spending API credits.
- Adding README claims.
- Choosing final headline wording.

## Forbidden-territory reminder

Do not modify MCP tool names, parameters, return shapes, `schema.sql`, `.github/workflows/`, `pyproject.toml` project metadata, `.planning/POSITIONING.md`, the README hero section, `LICENSE`, `SECURITY.md`, or existing tests by weakening/deleting assertions.

If a new third-party runtime dependency is proposed, trigger supervisor review and justify why stdlib/httpx is insufficient.

## Validation commands

```bash
uv run ruff check src/ tests/
uv run pyright src/
uv run pytest --tb=short -q
uv run python-docs-mcp-server doctor
```

Run any new adapter tests directly as well.

## PR template

Use `Refs #63`, not `Closes #63`.

The PR must include:
- Model matrix summary.
- Explanation of live-call guardrails.
- Test output.
- Confirmation that no paid/live provider calls were made.

## Recovery

If token counting cannot be made comparable across providers, stop and comment with the exact mismatch instead of averaging incompatible numbers.

## Effort estimate

4-6 hours. This can be agent-ready for mocked adapter plumbing only; live benchmark execution remains Vision-controlled.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[v0.5.0] benchmark adapters — define OpenAI/Google model matrix #73

Context

Goal

Acceptance criteria

Scope boundaries

Forbidden-territory reminder

Validation commands

PR template

Recovery

Effort estimate

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

[v0.5.0] benchmark adapters — define OpenAI/Google model matrix #73

Description

Context

Goal

Acceptance criteria

Scope boundaries

Forbidden-territory reminder

Validation commands

PR template

Recovery

Effort estimate

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions