Enforce per-model context window limits at the front-end to prevent requests that exceed model capabilities.
The Context Window Enforcement feature enforces per-model context window limits at the front-end, rejecting requests that exceed model capabilities before they reach backend providers and returning clear error messages. This protects against excessive token usage, avoids unnecessary API calls and costs, and gives clients structured error responses that help them understand and fix the issue.
- Customizable Limits: Configure different context window sizes per model and backend
- Input Token Enforcement: Blocks requests that exceed input token limits with structured error responses
- Front-end Protection: Prevents unnecessary API calls and costs by validating limits before backend requests
- Flexible Configuration: Supports both `context_window` and `max_input_tokens` for fine-grained control
- Clear Error Messages: Provides detailed information about limits and measured token counts
- Model-Specific Tokenizers: Uses model-specific tokenizers when available for accurate counting
- Registry-Backed Modality Checks: Rejects requests with unsupported input modalities (image/audio) when registry metadata is available
Context windows are configured in backend-specific YAML files or model defaults.
When the model registry is enabled, the proxy can also enforce modality support based on registry metadata. If the registry file is missing or unparsable, or the requested model is absent from the registry, both modality and context enforcement are skipped (fail-open). If the model exists but its modalities are missing, only modality validation is skipped.
```yaml
# config/backends/custom/backend.yaml
models:
  "your-model-name":
    limits:
      context_window: 262144    # Total context window size (tokens)
      max_input_tokens: 200000  # Input token limit (tokens)
      max_output_tokens: 62144  # Output token limit (tokens)
      requests_per_minute: 60   # Rate limits
      tokens_per_minute: 1000000
```

Model defaults can also be set in the main configuration file:

```yaml
# config.yaml
model_defaults:
  "your-model-name":
    limits:
      context_window: 128000    # 128K context window
      max_input_tokens: 100000  # 100K input limit
```

Force a specific context window size for all models:
```bash
--force-context-window 8000  # Override all model context windows to 8000 tokens
```

Example limits for a large-context model:

```yaml
models:
  "large-context-model":
    limits:
      context_window: 262144    # 256K total context window
      max_input_tokens: 200000  # 200K input limit (leaves room for response)
      requests_per_minute: 30   # Conservative rate limits
```

Example limits for a small, fast model:

```yaml
models:
  "small-fast-model":
    limits:
      context_window: 8192     # 8K context window
      max_input_tokens: 6000   # 6K input limit
      requests_per_minute: 120 # Higher rate for smaller model
```

Start the proxy with a forced context window:

```bash
python -m src.core.cli \
  --default-backend openai \
  --force-context-window 8000
```

Set per-model defaults in the main configuration:

```yaml
model_defaults:
  "gpt-4":
    limits:
      context_window: 128000
      max_input_tokens: 100000
  "gpt-3.5-turbo":
    limits:
      context_window: 16384
      max_input_tokens: 12000
  "claude-3-opus":
    limits:
      context_window: 200000
      max_input_tokens: 180000
```

When limits are exceeded, the proxy returns a structured 400 error:
```json
{
  "detail": {
    "code": "input_limit_exceeded",
    "message": "Input token limit exceeded",
    "details": {
      "model": "your-model-name",
      "limit": 100000,
      "measured": 125432
    }
  }
}
```

- `code`: Error code (`input_limit_exceeded`)
- `message`: Human-readable error message
- `details.model`: Model name that exceeded the limit
- `details.limit`: Configured token limit
- `details.measured`: Actual token count in the request
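Because the limit and the measured count are both machine-readable, a client can compute exactly how far over the limit it is. A minimal sketch, assuming the error body has already been parsed from the 400 response (the trimming strategy itself is illustrative):

```python
# Interpret the proxy's structured 400 body on the client side.

def tokens_to_trim(error_body: dict) -> int:
    """Return how many tokens the client must remove from its
    prompt before retrying, based on the structured error details."""
    details = error_body["detail"]["details"]
    return max(0, details["measured"] - details["limit"])

body = {
    "detail": {
        "code": "input_limit_exceeded",
        "message": "Input token limit exceeded",
        "details": {"model": "your-model-name", "limit": 100000, "measured": 125432},
    }
}
print(tokens_to_trim(body))  # 25432
```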
Prevent accidental large-context requests with expensive models:
```yaml
models:
  "expensive-model":
    limits:
      max_input_tokens: 50000  # Limit to 50K tokens to control costs
```

Ensure agents with long conversations don't exceed model limits:

```yaml
models:
  "agent-model":
    limits:
      context_window: 128000
      max_input_tokens: 100000  # Leave room for the agent's response
```

Optimize different models for different use cases:
```yaml
models:
  "fast-model":
    limits:
      context_window: 8192    # Small context for fast responses
  "quality-model":
    limits:
      context_window: 200000  # Large context for quality
```

Configure different limits for different user tiers or applications:
```yaml
# Free tier
models:
  "free-tier-model":
    limits:
      max_input_tokens: 4000
```

```yaml
# Premium tier
models:
  "premium-tier-model":
    limits:
      max_input_tokens: 100000
```

Test how agents behave with smaller context windows:
```bash
python -m src.core.cli \
  --default-backend openai \
  --force-context-window 4000  # Test with 4K context
```

Use fixed context windows to isolate issues:

```bash
python -m src.core.cli \
  --force-context-window 8000  # Fixed 8K context for debugging
```

- Uses model-specific tokenizers when available for accurate counting
- Falls back to generic tokenizers for unknown models
- Counts all message content, system prompts, and tool definitions
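The tokenizer fallback described above can be sketched as a lookup with a generic default. The counters here are toy stand-ins, not the proxy's real tokenizers, and the 4-characters-per-token heuristic is only a rough approximation:

```python
# Illustrative tokenizer selection with a generic fallback (hypothetical names).

def generic_count(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

MODEL_TOKENIZERS = {
    # Hypothetical model-specific counter for illustration.
    "your-model-name": lambda text: len(text.split()),
}

def count_tokens(model: str, text: str) -> int:
    """Use the model-specific tokenizer when available,
    otherwise fall back to the generic counter."""
    counter = MODEL_TOKENIZERS.get(model, generic_count)
    return counter(text)

assert count_tokens("your-model-name", "one two three") == 3
assert count_tokens("unknown-model", "x" * 40) == 10  # generic fallback
```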
- `context_window`: Total context window size (input + output)
- `max_input_tokens`: Maximum input tokens (recommended for enforcement)
- `max_output_tokens`: Maximum output tokens (handled by backend providers)
- If `max_input_tokens` is not specified, `context_window` is used as a fallback
- If neither is specified, no limit is enforced
- Configuration supports both `backend:model` and plain `model` key formats
- Input limits are enforced strictly at the front-end
- Output limits are handled by backend providers
- Enforcement happens before backend API calls to save costs
Limits not being enforced:
- Verify limits are configured for the model
- Check that the model name matches the configuration key
- Review logs for limit enforcement messages
- Ensure the feature is not disabled
Incorrect token counts:
- Verify the correct tokenizer is being used for the model
- Check if special tokens are being counted
- Review logs for tokenization messages
- Consider if the model uses a non-standard tokenizer
Requests being blocked incorrectly:
- Verify the configured limits are appropriate for the model
- Check if the token count includes unexpected content
- Review the error response for measured vs limit tokens
- Consider increasing the limit if it's too restrictive
Performance impact:
- Token counting adds minimal overhead (<10ms per request)
- The benefit (preventing expensive API calls) far outweighs the cost
- Disable enforcement if not needed to eliminate overhead
CLI override not working:
- Verify the `--force-context-window` flag is being used
- Check that the value is a valid integer
- Review logs for override messages
- Ensure no configuration is overriding the CLI flag
- Session Management - Intelligent session continuity
- Pytest Compression - Compress output to save tokens
- Pytest Context Saving - Add context-saving flags
- Edit Precision Tuning - Automatic parameter adjustment