CoEval supports 18 model interfaces spanning every major cloud provider, OpenAI-compatible APIs, local GPU inference, and virtual benchmark teachers.
| Interface | Provider / Runtime | Batch API | 50% Discount | Auth |
|---|---|---|---|---|
| openai | OpenAI (GPT-4o, o3, o1, GPT-3.5, …) | ✅ OpenAI Batch API | ✅ | OPENAI_API_KEY |
| anthropic | Anthropic (Claude 3.5 Sonnet/Haiku, Claude 3 Opus) | ✅ Message Batches API | ✅ | ANTHROPIC_API_KEY |
| gemini | Google Gemini 2.0 Flash, 1.5 Pro/Flash | ⚡ Concurrent¹ | — | GEMINI_API_KEY or GOOGLE_API_KEY |
| azure_openai | Azure OpenAI deployments (any GPT model) | ✅ Azure Batch API | ✅ | AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT |
| azure_ai | Azure AI Foundry / GitHub Models | — | — | AZURE_AI_API_KEY or GITHUB_TOKEN |
| bedrock | AWS Bedrock — all foundation models | ✅ Model Invocation Jobs² | ✅ | Native API key or IAM |
| vertex | Google Vertex AI (Gemini on GCP) | ✅ Batch Prediction Jobs² | ✅ | GOOGLE_CLOUD_PROJECT + ADC |
| openrouter | OpenRouter — 300+ models | — | — | OPENROUTER_API_KEY |
| groq | Groq Cloud (ultra-fast inference) | — | — | GROQ_API_KEY |
| deepseek | DeepSeek API | — | — | DEEPSEEK_API_KEY |
| mistral | Mistral AI | ✅ Mistral Batch API | ✅ | MISTRAL_API_KEY |
| deepinfra | DeepInfra | — | — | DEEPINFRA_API_KEY |
| cerebras | Cerebras (ultra-fast inference) | — | — | CEREBRAS_API_KEY |
| cohere | Cohere (Command R/R+/A family) | — | — | COHERE_API_KEY |
| huggingface_api | HuggingFace Inference API (50k+ Hub models) | — | — | HF_TOKEN or HUGGINGFACE_HUB_TOKEN |
| ollama | Ollama — local model server | — | — | none (no key needed) |
| huggingface | Any HuggingFace model (local GPU) | — | — | HF_TOKEN or HUGGINGFACE_HUB_TOKEN |
| benchmark | Virtual — pre-ingested dataset responses | N/A | N/A | none |
¹ Gemini concurrent mode: Google's Generative AI API does not offer a native asynchronous batch endpoint. CoEval submits all Gemini requests concurrently via a thread pool (`GeminiBatchRunner`). This is faster than sequential calls but does not provide a 50% batch discount — you pay standard per-token rates.

² Bedrock and Vertex batch use cloud storage (S3/GCS) as the job transport. Additional setup is required: an S3 bucket + IAM service role for Bedrock; a GCS bucket + ADC for Vertex. See the Batch API section below.
The default and most feature-complete interface. Supports OpenAI Batch API for Phase 4 (response collection) and Phase 5 (evaluation) with a 50% cost discount. Batch jobs are submitted automatically and polled until completion.
```yaml
- name: gpt-4o-mini
  interface: openai
  parameters:
    model: gpt-4o-mini
  roles: [teacher, student, judge]
```

Supports Anthropic's Message Batches API with a 50% batch discount. Requires `pip install anthropic`.
```yaml
- name: claude-3-5-haiku
  interface: anthropic
  parameters:
    model: claude-3-5-haiku-20241022
  roles: [student, judge]
```

Submits requests concurrently via a thread pool (pseudo-batch mode). Google's Generative AI API does not offer a native async batch endpoint — there is no 50% batch discount. Requires `pip install google-genai`.
```yaml
- name: gemini-2.0-flash
  interface: gemini
  parameters:
    model: gemini-2.0-flash
  roles: [student, judge]
```

Connects to Azure OpenAI deployments. Requires the deployment name in `model`, the endpoint URL, and the API version. Supports the Azure Batch API with a 50% discount (see Batch API below).
```yaml
- name: my-gpt4-deployment
  interface: azure_openai
  parameters:
    model: my-deployment-name
    azure_endpoint: https://my-resource.openai.azure.com/
    api_version: 2024-08-01-preview
  roles: [student]
```

AWS Bedrock supports two mutually exclusive authentication modes:
Native API key (no boto3 required):

```yaml
bedrock:
  api_key: BedrockAPIKey-...:...
  region: us-east-1
```

IAM credentials (requires `pip install boto3`):

```yaml
bedrock:
  access_key_id: AKIA...
  secret_access_key: ...
  region: us-east-1
```

The native API key takes priority if both are present.
OpenRouter provides access to 300+ models (Llama, Mistral, Qwen, DeepSeek, Cohere, Gemma, and more) through a single API key and OpenAI-compatible interface. Ideal for multi-model comparisons without managing individual provider accounts.
```yaml
- name: llama-3.3-70b
  interface: openrouter
  parameters:
    model: meta-llama/llama-3.3-70b-instruct
  roles: [teacher, student, judge]
```

Groq and Cerebras are ultra-fast inference providers (500–1000 tokens/second throughput). Best for experiments where latency matters more than cost. Both use the OpenAI SDK wire format.
```yaml
- name: llama-3.1-70b-groq
  interface: groq
  parameters:
    model: llama-3.1-70b-versatile
  roles: [student]
```

Cohere's Command R/R+/A family via their OpenAI-compatible endpoint (https://api.cohere.com/compatibility/v1). Best-in-class for retrieval-augmented generation (RAG) and long-context tasks. No batch discount — real-time only.
```yaml
- name: command-r-plus
  interface: cohere
  parameters:
    model: command-r-plus-08-2024
    temperature: 0.7
    max_tokens: 512
  roles: [teacher, student, judge]
```

Pricing (Command family):
| Model | Input ($/1M) | Output ($/1M) |
|---|---|---|
| command-a-03-2025 | $2.50 | $10.00 |
| command-r-plus-08-2024 | $2.50 | $10.00 |
| command-r-08-2024 | $0.15 | $0.60 |
| command-r7b-12-2024 | $0.04 | $0.15 |
Key file:

```yaml
providers:
  cohere: co-...
```

Access any of the 50,000+ hosted models on HuggingFace Hub via their serverless Inference API (https://api-inference.huggingface.co/v1). No GPU required — inference runs on HuggingFace's infrastructure. Model IDs must be full Hub paths (e.g. `mistralai/Mistral-7B-Instruct-v0.3`).
```yaml
- name: mistral-7b-hf
  interface: huggingface_api
  parameters:
    model: mistralai/Mistral-7B-Instruct-v0.3
    temperature: 0.7
    max_tokens: 512
  roles: [student]
```

Pricing: pay-per-token for serverless inference. Rates depend on model size; popular models cost $0.04–$0.23/M tokens. HuggingFace PRO subscribers get a free monthly quota on many models.
| Model | Input & Output ($/1M) |
|---|---|
| meta-llama/Llama-3.1-8B-Instruct | $0.06 |
| meta-llama/Llama-3.3-70B-Instruct | $0.23 |
| mistralai/Mistral-7B-Instruct-v0.3 | $0.04 |
| Qwen/Qwen2.5-72B-Instruct | $0.23 |
| google/gemma-2-9b-it | $0.06 |
Note: `huggingface_api` (cloud inference, this section) is distinct from `huggingface` (local GPU inference, below). Use `huggingface_api` when you don't have a GPU; use `huggingface` for private or quantized models that must run locally.
Key file:

```yaml
providers:
  huggingface_api: hf_...  # same token as huggingface; accepts HF_TOKEN or HUGGINGFACE_HUB_TOKEN
```

Runs any instruction-tuned HuggingFace model locally. Requires `pip install -e ".[huggingface]"` and a CUDA GPU for reasonable throughput. Models are loaded into VRAM and run sequentially (no concurrent batching for GPU-bound inference).
```yaml
- name: qwen2p5-1b5
  interface: huggingface
  parameters:
    model: Qwen/Qwen2.5-1.5B-Instruct
    device: auto
    max_new_tokens: 512
  roles: [teacher, student, judge]
```

Ollama runs any supported open-weight model locally using a lightweight server that exposes an OpenAI-compatible REST API. No API key is required. Ideal for privacy-sensitive experiments or air-gapped environments.
Install Ollama from https://ollama.com, then pull a model:

```bash
ollama pull llama3.2
ollama pull phi4
ollama pull gemma3
```

Minimal config (default localhost):
```yaml
- name: llama3.2-local
  interface: ollama
  parameters:
    model: llama3.2
    temperature: 0.7
    max_tokens: 512
  roles: [student]
```

Custom host (remote Ollama server or non-default port):
```yaml
- name: llama3.2-remote
  interface: ollama
  parameters:
    model: llama3.2
    base_url: http://192.168.1.50:11434/v1  # overrides default localhost
  roles: [student]
```

Alternatively, set `OLLAMA_HOST=http://192.168.1.50:11434` in your environment.
Key file entry (optional base_url override):

```yaml
providers:
  ollama:
    base_url: http://192.168.1.50:11434/v1
```

Notes:
- Ollama is treated as a network interface (no GPU pool management in CoEval)
- No batch mode — requests go through the standard real-time request path
- Cost estimation returns $0 for Ollama models (no per-token cost)
- `coeval probe` calls `models.list()` on the Ollama server to verify connectivity (see the sketch below)
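For reference, an equivalent manual connectivity check looks roughly like this (a sketch assuming the `openai` Python SDK; Ollama's OpenAI-compatible endpoint accepts any placeholder API key):

```python
# Minimal connectivity check against a local Ollama server (illustrative only).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key value is ignored by Ollama
print([m.id for m in client.models.list()])  # e.g. ['llama3.2', 'phi4']
```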
The benchmark interface is a zero-cost virtual teacher that replays pre-ingested responses from real datasets. Phase 3 is skipped entirely for benchmark models — data was already ingested by coeval ingest or benchmark/setup_mixed.py.
```yaml
- name: xsum
  interface: benchmark
  parameters:
    dataset: xsum
    split: test
  roles: [teacher]
```

Pre-ingested datasets available via `benchmark/setup_mixed.py`:
- `xsum` — BBC news articles with one-sentence summaries
- `codesearchnet-python` — Python functions with docstring explanations
- `aeslc` — Email bodies with subject lines
- `wikitablequestions` — Wikipedia tables with natural language questions
Additional datasets via `coeval ingest`: `mmlu`, `hellaswag`, `truthfulqa`, `humaneval`, `medqa`, `gsm8k`.
The benchmark teacher's `name` field should match the dataset identifier used during ingestion. The virtual interface looks for pre-ingested JSONL files named:

```
phase3_datapoints/{task_id}.{model_name}.datapoints.jsonl
```
Recommended naming pattern:
| Model name | Dataset source |
|---|---|
| benchmark | Default / unspecified ingested data |
| benchmark-xsum | XSum summarization dataset |
| benchmark-aeslc | AESLC email subject-line corpus |
| benchmark-codesearchnet | CodeSearchNet code+docstring pairs |
| benchmark-wikitableqa | WikiTableQuestions dataset |
Six interfaces support true asynchronous batch processing with a ~50% cost discount. Gemini uses a concurrent runner (faster than sequential but no discount). See the per-interface sections for setup details.
| Interface | Batch Mode | Discount | Extra Requirements |
|---|---|---|---|
| openai | OpenAI Batch API — async, 24h window | ✅ 50% | None |
| anthropic | Message Batches API — async, 24h window | ✅ 50% | None |
| azure_openai | Azure Global Batch API — async | ✅ 50% | Azure endpoint |
| mistral | Mistral Batch API — async (OpenAI-compat format) | ✅ ~50% | None |
| bedrock | AWS Model Invocation Jobs — async | ✅ ~50% | S3 bucket + IAM role |
| vertex | Vertex AI Batch Prediction — async | ✅ 50% | GCS bucket + ADC |
| gemini | Concurrent thread pool (pseudo-batch) | ❌ none | None |
Enable per provider and per phase in the experiment config:
```yaml
experiment:
  batch:
    openai:
      response_collection: true
      evaluation: true
    anthropic:
      response_collection: true
      evaluation: true
    azure_openai:
      response_collection: true
      evaluation: true
    mistral:        # Mistral Batch API — async, ~50% off; no extra requirements
      response_collection: true
      evaluation: true
    bedrock:        # requires batch_s3_bucket + batch_role_arn in model params
      response_collection: true
      evaluation: true
    vertex:         # requires batch_gcs_bucket in model params
      response_collection: true
      evaluation: true
    gemini:         # concurrent mode — no discount, but faster than sequential
      response_collection: true
      evaluation: true
```

Batch jobs are submitted at the start of each phase and polled automatically. Use `coeval status --fetch-batches` to check completion status manually.
How async batch works (OpenAI / Anthropic / Azure / Bedrock / Vertex):
- At the start of a batch-enabled phase, CoEval submits all requests as a batch job
- The process polls the provider API at intervals until the job completes
- Results are downloaded and processed identically to real-time responses
- Batch mode is transparent to the rest of the pipeline — no changes to output format or downstream analysis
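For concreteness, here is a minimal sketch of the raw provider-side lifecycle that a batch runner automates, using the OpenAI Batch API as the example (assumes the `openai` Python SDK; the file name is illustrative, and CoEval's own runners handle this for you):

```python
# Illustrative sketch of the OpenAI Batch API lifecycle (submit -> poll -> download).
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload a JSONL file of requests and submit the batch job
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 2. Poll until the job reaches a terminal state
while batch.status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)
    batch = client.batches.retrieve(batch.id)

# 3. Download results; they are then processed like real-time responses
if batch.status == "completed":
    results_jsonl = client.files.content(batch.output_file_id).text
```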
Full batch status:
| Interface | Batch Mode | CoEval Implementation | Discount |
|---|---|---|---|
| openai | OpenAI Batch API (async) | ✅ OpenAIBatchRunner | 50% |
| anthropic | Message Batches API (async) | ✅ AnthropicBatchRunner | 50% |
| azure_openai | Azure Global Batch (async) | ✅ AzureBatchRunner | 50% |
| mistral | Mistral Batch API (async, OpenAI-compat) | ✅ MistralBatchRunner | ~50% |
| bedrock | AWS Model Invocation Jobs (async) | ✅ BedrockBatchRunner | ~50% |
| vertex | Vertex AI Batch Prediction (async) | ✅ VertexBatchRunner | 50% |
| gemini | Concurrent thread pool (pseudo-batch) | ✅ GeminiBatchRunner | ❌ none |
| openrouter | None (real-time) | N/A | — |
| cohere | None (real-time) | N/A | — |
| huggingface_api | None (real-time) | N/A | — |
| huggingface | None (local GPU) | N/A | — |
Prices as of 2026-03-02. Verify at provider pricing pages before running large experiments.
| Model | Input ($/1M) | Output ($/1M) | Batch (50% off) |
|---|---|---|---|
| gpt-4o | $2.50 | $10.00 | ✅ $1.25 / $5.00 |
| gpt-4o-mini | $0.15 | $0.60 | ✅ $0.075 / $0.30 |
| gpt-4.1 | $2.00 | $8.00 | ✅ $1.00 / $4.00 |
| gpt-4.1-mini | $0.40 | $1.60 | ✅ $0.20 / $0.80 |
| gpt-4.1-nano | $0.10 | $0.40 | ✅ $0.05 / $0.20 |
| gpt-3.5-turbo | $0.50 | $1.50 | ✅ $0.25 / $0.75 |
| o3-mini | $1.10 | $4.40 | ✅ $0.55 / $2.20 |
| o4-mini | $1.10 | $4.40 | ✅ $0.55 / $2.20 |
Batch discount: 50% via OpenAI Batch API. Batches are queued asynchronously; use coeval status to poll.
Configuration:
```yaml
providers:
  openai: sk-...
```

| Model | Input ($/1M) | Output ($/1M) | Batch (50% off) |
|---|---|---|---|
| claude-3-5-haiku-20241022 | $0.80 | $4.00 | ✅ $0.40 / $2.00 |
| claude-3-5-sonnet-20241022 | $3.00 | $15.00 | ✅ $1.50 / $7.50 |
| claude-sonnet-4-6 | $3.00 | $15.00 | ✅ $1.50 / $7.50 |
| claude-haiku-4-6 | $0.80 | $4.00 | ✅ $0.40 / $2.00 |
| claude-opus-4-6 | $15.00 | $75.00 | ✅ $7.50 / $37.50 |
Batch discount: 50% via Anthropic Message Batches API. Async; poll with coeval status.
Configuration:
```yaml
providers:
  anthropic: sk-ant-...
```

| Model | Input ($/1M) | Output ($/1M) | Batch (50% off) |
|---|---|---|---|
| gemini-2.0-flash | $0.10 | $0.40 | — |
| gemini-2.0-flash-lite | $0.075 | $0.30 | — |
| gemini-2.5-flash | $0.15 | $0.60 | — |
| gemini-2.5-flash-lite | $0.075 | $0.30 | — |
| gemini-1.5-flash | $0.075 | $0.30 | — |
| gemini-1.5-pro | $1.25 | $5.00 | — |
| gemini-2.5-pro | $1.25 | $10.00 | — |
No batch discount. CoEval uses concurrent requests (thread pool) for Gemini — you pay standard per-token rates. Google does not expose a batch discount API comparable to OpenAI or Anthropic.
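Conceptually, the concurrent mode is just a thread pool over standard calls. A rough sketch (assuming the `google-genai` SDK; this is not CoEval's actual `GeminiBatchRunner`, and the prompts are placeholders):

```python
# Illustrative pseudo-batch: fan out standard Gemini calls over a thread pool.
from concurrent.futures import ThreadPoolExecutor
from google import genai

client = genai.Client(api_key="AIza...")  # or set GEMINI_API_KEY / GOOGLE_API_KEY

def generate(prompt: str) -> str:
    resp = client.models.generate_content(model="gemini-2.0-flash", contents=prompt)
    return resp.text

prompts = ["Summarize ...", "Explain ...", "Translate ..."]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(generate, prompts))  # concurrent, but billed at standard rates
```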
Configuration:
```yaml
providers:
  gemini: AIza...
```

OpenRouter is a meta-router providing a single OpenAI-compatible API for hundreds of open and commercial models. It is the recommended interface for open-weight models (Llama, Mistral, Qwen, DeepSeek) because:
- Single API key and interface covers all open models
- Routes to the cheapest available backend automatically
- Supports all CoEval model parameters (`temperature`, `max_tokens`)
No batch discount. OpenRouter is real-time only.
| Model ID | Input ($/1M) | Output ($/1M) |
|---|---|---|
| meta-llama/llama-3.3-70b-instruct | $0.12 | $0.40 |
| meta-llama/llama-3.1-70b-instruct | $0.10 | $0.28 |
| meta-llama/llama-3.1-8b-instruct | $0.05 | $0.08 |
| mistralai/mistral-small-24b | $0.10 | $0.30 |
| mistralai/mistral-large-2411 | $2.00 | $6.00 |
| deepseek/deepseek-chat | $0.14 | $0.28 |
| deepseek/deepseek-r1 | $0.55 | $2.19 |
| qwen/qwen-2.5-72b-instruct | $0.12 | $0.39 |
| google/gemini-2.0-flash-001 | $0.10 | $0.40 |
Configuration:
```yaml
providers:
  openrouter: sk-or-v1-...
```

Direct access to Mistral models with native Batch API support (~50% off). MistralBatchRunner is a thin wrapper over OpenAIBatchRunner that points the OpenAI SDK at https://api.mistral.ai/v1 — the batch API format is identical to OpenAI's (same /v1/files + /v1/batches endpoints).
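As an illustration of that wrapper idea (not CoEval's actual code, and assuming Mistral's chat endpoint accepts the OpenAI wire format just as its batch endpoints do), the standard `openai` SDK can simply be pointed at Mistral's base URL:

```python
# Illustrative: reusing the OpenAI SDK against Mistral's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="https://api.mistral.ai/v1", api_key="...")  # MISTRAL_API_KEY
resp = client.chat.completions.create(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```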
| Model | Input ($/1M) | Output ($/1M) | Batch (~50% off) |
|---|---|---|---|
| mistral-small-latest | $0.10 | $0.30 | ✅ $0.05 / $0.15 |
| mistral-large-latest | $2.00 | $6.00 | ✅ $1.00 / $3.00 |
| codestral-latest | $0.20 | $0.60 | ✅ $0.10 / $0.30 |
| ministral-8b-latest | $0.10 | $0.10 | ✅ $0.05 / $0.05 |
| open-mistral-nemo | $0.15 | $0.15 | ✅ $0.075 / $0.075 |
| pixtral-12b-2409 | $0.15 | $0.15 | ✅ $0.075 / $0.075 |
Enable batch in config:
```yaml
experiment:
  batch:
    mistral:
      response_collection: true
      evaluation: true
```

Mistral's Batch API follows the same OpenAI-compatible format: upload a JSONL file to /v1/files, create a batch job via /v1/batches, poll until complete, download results. CoEval's MistralBatchRunner automates the full workflow — no additional setup beyond a valid API key.
Supported models for batch: All models listed above. See Mistral Batch docs for updates.
Configuration:
```yaml
providers:
  mistral: ...  # MISTRAL_API_KEY
```

Command R/R+/A family via the OpenAI-compatible endpoint. Best for RAG-oriented tasks and long-context reasoning. No batch discount — real-time only.
| Model | Input ($/1M) | Output ($/1M) |
|---|---|---|
| command-a-03-2025 | $2.50 | $10.00 |
| command-r-plus-08-2024 | $2.50 | $10.00 |
| command-r-08-2024 | $0.15 | $0.60 |
| command-r7b-12-2024 | $0.04 | $0.15 |
Configuration:
```yaml
providers:
  cohere: co-...  # COHERE_API_KEY
```

Serverless pay-per-token inference for 50k+ Hub models. Rates depend on model size.
| Model | Input & Output ($/1M) |
|---|---|
| meta-llama/Llama-3.1-8B-Instruct | $0.06 |
| meta-llama/Llama-3.3-70B-Instruct | $0.23 |
| mistralai/Mistral-7B-Instruct-v0.3 | $0.04 |
| Qwen/Qwen2.5-72B-Instruct | $0.23 |
| google/gemma-2-9b-it | $0.06 |
HuggingFace PRO subscribers receive free monthly quota on many popular models. See HuggingFace pricing for the full model list and rates.
Configuration:
```yaml
providers:
  huggingface_api: hf_...  # HF_TOKEN (same token as the huggingface interface)
```

CoEval supports two Bedrock authentication modes for real-time inference:
Native API key (for real-time inference):
```yaml
providers:
  bedrock:
    api_key: BedrockAPIKey-...:...
    region: us-east-1
```

IAM credentials (also required for batch inference):
```yaml
providers:
  bedrock:
    access_key_id: AKIA...
    secret_access_key: ...
    region: us-east-1
    batch_role_arn: arn:aws:iam::123456789012:role/BedrockBatchRole  # for batch
```

Note: the native API key is only for the real-time Converse API. Batch inference (Model Invocation Jobs) requires IAM auth and an IAM service role. Both can coexist — CoEval uses the native key for real-time calls and the IAM credentials for batch.
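For reference, a real-time call in IAM mode boils down to boto3's Converse API — a minimal sketch (model ID and region are examples; CoEval wraps this for you):

```python
# Sketch: real-time Bedrock call via boto3 (IAM credentials resolved from the environment).
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
resp = client.converse(
    modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
    messages=[{"role": "user", "content": [{"text": "Say hello."}]}],
)
print(resp["output"]["message"]["content"][0]["text"])
```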
AWS Bedrock's Model Invocation Jobs API provides ~50% off for supported models. CoEval's BedrockBatchRunner automates the full workflow — upload JSONL to S3, submit the job, poll, download results.
Additional model parameters required:
```yaml
- name: claude-bedrock
  interface: bedrock
  batch_enabled: true
  parameters:
    model: anthropic.claude-3-5-haiku-20241022-v1:0
    region: us-east-1
    batch_s3_bucket: my-coeval-batch-bucket   # required: S3 bucket in the same region
    batch_s3_prefix: coeval-jobs              # optional (default: "coeval")
    batch_role_arn: arn:aws:iam::123456789012:role/BedrockBatchRole  # required
  roles: [student, judge]
```

IAM service role setup (one-time):
- Create an S3 bucket in the same region as your Bedrock endpoint.
- Create an IAM role with a trust policy for `bedrock.amazonaws.com`:

  ```json
  {
    "Effect": "Allow",
    "Principal": {"Service": "bedrock.amazonaws.com"},
    "Action": "sts:AssumeRole",
    "Condition": {"StringEquals": {"aws:SourceAccount": "<ACCOUNT_ID>"}}
  }
  ```

- Attach an inline policy granting S3 access to the bucket:

  ```json
  {
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::my-coeval-batch-bucket",
      "arn:aws:s3:::my-coeval-batch-bucket/*"
    ]
  }
  ```

- Copy the role ARN (`arn:aws:iam::<account>:role/<name>`) into `batch_role_arn`.
Supported models: Anthropic Claude, Amazon Nova, Meta Llama 3.x, Mistral Large, AI21 Jamba. Not all models are available in all regions — see AWS Batch Inference docs.
Selected Bedrock model prices:
| Model | Input ($/1M) | Output ($/1M) | Batch (~50% off) |
|---|---|---|---|
| anthropic.claude-3-5-haiku-20241022-v1 | $0.80 | $4.00 | ✅ ~$0.40 / $2.00 |
| anthropic.claude-3-5-sonnet-20241022-v2 | $3.00 | $15.00 | ✅ ~$1.50 / $7.50 |
| amazon.nova-micro-v1 | $0.035 | $0.14 | ✅ ~$0.018 / $0.07 |
| amazon.nova-lite-v1 | $0.06 | $0.24 | ✅ ~$0.03 / $0.12 |
| amazon.nova-pro-v1 | $0.80 | $3.20 | ✅ ~$0.40 / $1.60 |
| meta.llama3-70b-instruct-v1 | $0.99 | $0.99 | ✅ ~$0.50 / $0.50 |
Azure OpenAI supports Global Batch (50% discount) — AzureBatchRunner is already implemented in CoEval.
```yaml
providers:
  azure_openai:
    api_key: ...
    endpoint: https://my-resource.openai.azure.com/
    api_version: 2024-08-01-preview
```

Enable batch per phase in the experiment config:
```yaml
experiment:
  batch:
    azure_openai:
      response_collection: true
      evaluation: true
```

Selected Azure OpenAI model prices:
| Deployment | Input ($/1M) | Output ($/1M) | Batch (50% off) |
|---|---|---|---|
| gpt-4o | $2.50 | $10.00 | ✅ $1.25 / $5.00 |
| gpt-4o-mini | $0.165 | $0.66 | ✅ $0.083 / $0.33 |
Note: Azure prices include a small Azure markup over native OpenAI pricing (e.g. GPT-4o-mini $0.165 vs $0.15). After the 50% batch discount, the effective price is similar to OpenAI native batch.
```yaml
providers:
  vertex:
    project: my-gcp-project
    location: us-central1
    # Optional: service account key file path (default: ADC)
    # service_account_key: /path/to/sa-key.json
```

Requires Application Default Credentials (run `gcloud auth application-default login`). Supports the same Gemini models as the Gemini AI Studio interface, with enterprise-grade SLAs.
Vertex AI Batch Prediction Jobs offer 50% off for Gemini models. CoEval's VertexBatchRunner uploads a JSONL file to GCS, submits the batch job, polls until completion, and downloads results.
Additional model parameters required:
```yaml
- name: gemini-vertex
  interface: vertex
  batch_enabled: true
  parameters:
    model: gemini-2.0-flash-001
    project: my-gcp-project
    location: us-central1
    batch_gcs_bucket: gs://my-coeval-batch-bucket  # required for batch
    batch_gcs_prefix: coeval-jobs                  # optional (default: "coeval")
  roles: [student, judge]
```

Prerequisites:
- A GCS bucket in the same region as the Vertex AI endpoint
- IAM permissions on the service account: `aiplatform.batchPredictionJobs.create`, `aiplatform.batchPredictionJobs.get`, `storage.objects.create/get`, `storage.buckets.get`
- `pip install google-cloud-aiplatform google-cloud-storage`
Supported Gemini models for batch: Gemini 2.5 Pro/Flash, Gemini 2.0 Flash, Gemini 1.5 Pro/Flash. See Vertex Batch docs for updates.
Selected Vertex AI (Gemini) model prices:
| Model | Input ($/1M) | Output ($/1M) | Batch (50% off) |
|---|---|---|---|
| gemini-2.0-flash-001 | $0.10 | $0.40 | ✅ $0.05 / $0.20 |
| gemini-2.5-flash | $0.15 | $0.60 | ✅ $0.075 / $0.30 |
| gemini-1.5-flash | $0.075 | $0.30 | ✅ $0.038 / $0.15 |
| gemini-1.5-pro | $1.25 | $5.00 | ✅ $0.63 / $2.50 |
Async batch discounts are tied to specific providers and the models they host:
| Batch discount | Applies to |
|---|---|
| OpenAI Batch API (50%) | GPT-4o, GPT-4o-mini, GPT-4.1, o-series only |
| Anthropic Message Batches (50%) | Claude 3.x / Claude 4.x only |
| Azure Global Batch (50%) | GPT-4o, GPT-4o-mini (Azure deployments) |
| Mistral Batch API (~50%) | All Mistral-hosted models via interface: mistral |
| AWS Bedrock Batch (~50%) | Claude, Amazon Nova, Llama, Mistral on Bedrock |
| Vertex AI Batch Prediction (50%) | Gemini 1.5 / 2.0 / 2.5 models on GCP |
| Gemini AI Studio (concurrent, no discount) | Gemini models — thread pool only |
| Open-weight models via OpenRouter / DeepInfra / Cerebras | No batch discount |
For `interface: auto`, frontier models are automatically routed to their native batch-enabled provider. Open-weight models go to OpenRouter because no third-party hosting provider offers a batch discount for them; OpenRouter at $0.04–$0.12/M is already the cheapest option for these models.
Mistral exception: unlike other open-weight model providers, Mistral AI offers ~50% off on its own hosted models via `interface: mistral`. If you are using Mistral Small, Mistral Large, or Codestral, prefer `interface: mistral` with batch enabled over `interface: openrouter`.
Setting `interface: auto` in a model configuration tells CoEval to automatically select the cheapest available provider for the given model, based on:

- The `auto_routing` table in `Config/provider_pricing.yaml`
- Which providers have credentials configured in `keys.yaml`
Example:
```yaml
- name: deepseek-v3
  interface: auto          # CoEval resolves to openrouter (cheapest with credentials)
  parameters:
    model: deepseek/deepseek-chat
    temperature: 0.7
    max_tokens: 512
  roles: [student]
```

CoEval scans the `auto_routing` table top-to-bottom (cheapest first), finds the first fragment matching `deepseek/deepseek-chat`, and resolves to `openrouter` if those credentials exist. If OpenRouter is not configured, it tries the next provider in the routing table.
Resolution happens at config load time — the interface is permanently set before validation, so `coeval plan`, `coeval probe`, and `coeval run` all see the resolved interface.
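Conceptually, the resolution is a first-match scan over an ordered table. A rough, hypothetical sketch (not CoEval's actual implementation; names are illustrative):

```python
# Illustrative cheapest-first auto-routing resolution.
def resolve_auto_interface(model_id, auto_routing, configured_providers):
    # auto_routing preserves YAML order, cheapest entries first
    for fragment, entry in auto_routing.items():
        if fragment in model_id and entry["interface"] in configured_providers:
            return entry["interface"]
    return None  # no match with credentials: fall back to an explicit interface

routing = {
    "deepseek/deepseek-r1": {"interface": "openrouter"},
    "deepseek": {"interface": "openrouter"},
}
print(resolve_auto_interface("deepseek/deepseek-chat", routing, {"openrouter"}))  # -> openrouter
```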
How to update the routing table: edit `Config/provider_pricing.yaml` and modify the `auto_routing` section (entries are ordered cheapest-first):

```yaml
auto_routing:
  deepseek/deepseek-r1: {interface: openrouter, notes: "reasoning model"}
  deepseek: {interface: openrouter, notes: "default deepseek"}
```

To update prices: edit `Config/provider_pricing.yaml` — find the model under its `providers:` block and update the `input:` / `output:` values. The cost estimator loads this file at runtime. The hardcoded `PRICE_TABLE` in `cost_estimator.py` is only a fallback if the YAML is unavailable.
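A pricing entry might look roughly like the following (the exact field layout is an assumption — check `Config/provider_pricing.yaml` itself for the real schema):

```yaml
# Hypothetical shape of a per-provider pricing entry ($/1M tokens).
providers:
  openrouter:
    meta-llama/llama-3.3-70b-instruct:
      input: 0.12    # $ per 1M input tokens
      output: 0.40   # $ per 1M output tokens
```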
Store all credentials in one place. CoEval discovers and resolves them automatically.
Search order:

1. `--keys PATH` CLI flag
2. `COEVAL_KEYS_FILE` environment variable
3. `keys.yaml` at the project root
4. `~/.coeval/keys.yaml`
Full key file format:
```yaml
# ~/.coeval/keys.yaml (or project root keys.yaml)
providers:
  openai: sk-...
  anthropic: sk-ant-...
  gemini: AIza...
  huggingface: hf_...
  openrouter: sk-or-v1-...
  groq: gsk_...
  deepseek: sk-...
  mistral: ...
  deepinfra: di-...
  cerebras: csk-...
  cohere: co-...
  huggingface_api: hf_...   # same token as huggingface; accepts HF_TOKEN or HUGGINGFACE_HUB_TOKEN

  # Ollama — no key needed; only set if using a non-default host
  ollama:
    base_url: http://192.168.1.50:11434/v1   # optional

  azure_openai:
    api_key: ...
    endpoint: https://my-resource.openai.azure.com/
    api_version: 2024-08-01-preview

  bedrock:
    api_key: BedrockAPIKey-...:...   # native API key (no boto3 needed; real-time only)
    region: us-east-1
  # — OR — IAM credentials (supports both real-time and batch):
  # bedrock:
  #   access_key_id: AKIA...
  #   secret_access_key: ...
  #   region: us-east-1
  #   batch_role_arn: arn:aws:iam::123456789012:role/BedrockBatchRole   # for batch

  vertex:
    project: my-gcp-project
    location: us-central1
    service_account_key: /path/to/key.json   # optional; uses ADC if omitted
```

Credential resolution order per model:
`model.access_key` (in YAML) → provider entry in `keys.yaml` → environment variable
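For example, a per-model key override takes priority over everything else for that one model (a hypothetical model entry; only the `access_key` field is the point here, and its exact placement should be checked against your config schema):

```yaml
# Hypothetical per-model override: access_key wins over keys.yaml and env vars for this model only.
- name: gpt-4o-mini-alt-key
  interface: openai
  access_key: sk-...        # used instead of providers.openai / OPENAI_API_KEY
  parameters:
    model: gpt-4o-mini
  roles: [student]
```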
Security: `keys.yaml`, `*.keys.yaml`, and `.coeval/` are included in `.gitignore` by default. Never commit credentials to version control.
Q: Does Gemini get a 50% batch discount like OpenAI and Anthropic?
A: It depends on the interface. When you use interface: gemini (Google AI Studio), CoEval's GeminiBatchRunner submits requests concurrently via a thread pool — faster than sequential calls but at standard per-token rates with no batch discount. However, if you access Gemini models via interface: vertex (Google Cloud Vertex AI), Vertex AI Batch Prediction Jobs do provide a ~50% cost reduction. See the Vertex AI section for setup details.
Q: How do I use Ollama for local models without any API key?
A: Install Ollama from https://ollama.com, pull a model (e.g., ollama pull llama3.2), and set interface: ollama in your config with model: llama3.2. No API key is required. If your Ollama server is on a different host or port, set base_url: http://<host>:11434/v1 either in the model parameters or in keys.yaml under providers.ollama.base_url.
Q: Which providers support the 50% batch discount?
A: Six interfaces support true asynchronous batch processing with a ~50% cost discount: openai (OpenAI Batch API), anthropic (Message Batches API), azure_openai (Azure Global Batch API), mistral (Mistral Batch API — same OpenAI-compatible format, no extra setup), bedrock (AWS Model Invocation Jobs), and vertex (Vertex AI Batch Prediction Jobs). gemini uses a concurrent thread pool (faster than sequential, but no discount). Enable batch per-phase in the experiment.batch config block.
Q: What is interface: auto and how does it pick a provider?
A: interface: auto tells CoEval to select the cheapest available provider for the given model at config load time. It scans the auto_routing table in Config/provider_pricing.yaml top-to-bottom and picks the first interface for which credentials exist in your key file. The resolved interface is logged at DEBUG level, and coeval plan shows the selected provider before any calls are made.
Q: What is the difference between using Bedrock with a native API key vs. IAM credentials?
A: Bedrock's native API key mode (api_key: BedrockAPIKey-...:...) uses direct HTTP with an x-amzn-bedrock-key header and requires no extra library — it works with CoEval's core install. IAM credentials (access_key_id + secret_access_key) use the boto3 SDK, which must be installed separately with pip install boto3. Native API key takes priority if both are present. Note: For Bedrock batch jobs, IAM credentials are always required — the native API key cannot be used to manage Model Invocation Jobs.
Q: How do I set up Bedrock or Vertex batch jobs?
A: Both require cloud storage for job I/O and a service identity with write access:
- Bedrock: create an S3 bucket and an IAM role that trusts `bedrock.amazonaws.com` with `s3:GetObject` / `s3:PutObject` / `s3:ListBucket`. Add `batch_s3_bucket` and `batch_role_arn` to the model's `parameters` block (see AWS Bedrock).
- Vertex: create a GCS bucket and enable Vertex AI in your project. Add `batch_gcs_bucket` and `project` to the model's `parameters` block. Authentication uses Application Default Credentials (`gcloud auth application-default login`) or a service account key file (see Google Vertex AI).
Q: Can I access open-weight models like Llama or Mistral without managing individual provider accounts?
A: Yes — use interface: openrouter. OpenRouter provides a single OpenAI-compatible API and a single key covering 300+ models including Llama, Mistral, Qwen, DeepSeek, Cohere, and Gemma. It is the recommended interface for open-weight models when you want broad model access without juggling multiple API keys.