Request deduplication prevents duplicate requests from being sent to backend APIs within a configurable time window. This feature protects against rate limit exhaustion caused by client retry behavior.
When agentic clients experience network latency or timeouts, they may re-send identical requests in rapid succession. Without deduplication, each retry consumes backend quota and can quickly exhaust rate limits.
The deduplication service:
- **Detects duplicate requests** - Computes a content hash of each request (session ID, model, messages, tools)
- **Swallows duplicates** - Identical requests within the dedup window are blocked with a `DuplicateRequestError` (HTTP 429)
- **Logs blocked requests** - Each blocked duplicate is logged at WARNING level for visibility
- **Tracks statistics** - Maintains counters for processed requests, blocked duplicates, and dedup rate
```shell
# Set deduplication window in seconds (default: 3.0)
--request-dedup-window 3.0

# Disable deduplication entirely
--disable-request-dedup
```

| Variable | Default | Description |
|---|---|---|
| `LLM_REQUEST_DEDUP_WINDOW` | `3.0` | Time window in seconds for duplicate detection. Set to 0 to disable. |
```yaml
# config/config.yaml
request_dedup_window: 3.0  # seconds
```

Configuration values are resolved in this order (highest priority first):
- CLI parameter (`--request-dedup-window` or `--disable-request-dedup`)
- Environment variable (`LLM_REQUEST_DEDUP_WINDOW`)
- Configuration file (`request_dedup_window`)
- Default value (`3.0` seconds)
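The precedence chain above can be sketched as follows. This is an illustrative helper, not the proxy's actual code; the function name and parameters are assumptions:

```python
import os

DEFAULT_DEDUP_WINDOW = 3.0  # seconds (documented default)

def resolve_dedup_window(cli_value=None, config_file_value=None, environ=None):
    """Return the effective dedup window, checking sources in priority order."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:                       # 1. CLI parameter
        return float(cli_value)
    env_value = environ.get("LLM_REQUEST_DEDUP_WINDOW")
    if env_value is not None:                       # 2. environment variable
        return float(env_value)
    if config_file_value is not None:               # 3. configuration file
        return float(config_file_value)
    return DEFAULT_DEDUP_WINDOW                     # 4. built-in default

# The CLI value wins even when lower-priority sources are also set.
print(resolve_dedup_window(cli_value=1.5, config_file_value=5.0,
                           environ={"LLM_REQUEST_DEDUP_WINDOW": "2.0"}))  # prints 1.5
```

A resolved value of `0` disables deduplication, matching the table above.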
Each request is hashed based on:
- Session ID - Requests from different sessions are never considered duplicates
- Model name - Same message to different models is not a duplicate
- Messages - The full message history including roles and content
- Tools - Tool definitions if present
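A hash over these fields might look like the sketch below. The exact algorithm and serialization the service uses are not specified here, so treat this as a hypothetical illustration; the session ID is left out of the digest because it appears as the cache-key prefix instead:

```python
import hashlib
import json

def content_hash(model, messages, tools=None):
    """Stable short digest over the fields that define a duplicate."""
    payload = json.dumps(
        {"model": model, "messages": messages, "tools": tools},
        sort_keys=True,             # key order must not change the hash
        separators=(",", ":"),      # canonical, whitespace-free encoding
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
```

Identical inputs always produce the same digest; changing any hashed field (e.g. the model) produces a different one.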
When a request arrives:
- Content hash is computed
- Cache key is formed:
{session_id}:{content_hash} - If key exists in cache AND entry is within the dedup window → DUPLICATE
- Otherwise, request is registered in cache and processed normally
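The steps above boil down to a timestamped lookup. A minimal sketch (class and method names are assumptions, and an injectable clock is used for testability):

```python
import time

class DedupCache:
    """Minimal sketch of the duplicate-detection flow described above."""

    def __init__(self, window_seconds=3.0):
        self.window = window_seconds
        self._seen = {}  # cache key -> timestamp of first sighting

    def is_duplicate(self, session_id, content_hash, now=None):
        now = time.monotonic() if now is None else now
        key = f"{session_id}:{content_hash}"
        first_seen = self._seen.get(key)
        if first_seen is not None and now - first_seen <= self.window:
            return True            # within the window -> duplicate, block it
        self._seen[key] = now      # register and let the request proceed
        return False
```

Because the session ID is part of the key, identical requests from different sessions never collide.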
The service automatically cleans up expired entries:
- Time-based cleanup - Every 30 seconds, entries older than the dedup window are removed
- Size-based cleanup - When cache exceeds 10,000 entries, oldest entries are evicted
- On-access cleanup - Cleanup checks happen during normal request processing
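The time- and size-based policies can be sketched as a single sweep function. The 10,000-entry cap comes from the text above; the function itself is a hypothetical illustration using insertion-ordered entries:

```python
from collections import OrderedDict

def cleanup(cache, window, now, max_entries=10_000):
    """Drop expired entries, then evict oldest entries past the size cap.

    `cache` maps cache keys to first-seen timestamps in insertion order.
    Returns the number of entries removed.
    """
    removed = 0
    for key in [k for k, ts in cache.items() if now - ts > window]:
        del cache[key]             # time-based: older than the dedup window
        removed += 1
    while len(cache) > max_entries:
        cache.popitem(last=False)  # size-based: OrderedDict pops oldest first
        removed += 1
    return removed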
```
T=0.0s: Client sends request A → Processed normally
T=1.5s: Client retries request A (timeout) → BLOCKED (duplicate)
T=2.8s: Client retries request A again → BLOCKED (duplicate)
T=4.0s: Client sends new request B → Processed normally
```

```
Session-1 sends request A → Processed normally
Session-2 sends identical request A → Processed normally (different session)
```

```
T=0.0s: Request A processed
T=3.5s: Same request A sent again → Processed normally (window expired)
```
When a duplicate is detected and blocked:
```
WARNING Duplicate request swallowed: hash=a1b2c3d4 session=sess-123 model=gpt-4
```
Debug logging provides additional detail:
```
DEBUG Duplicate detected: hash=a1b2c3d4, session=sess-123, age=1.25s
DEBUG Request deduplication enabled with window=3.0s
DEBUG Dedup cache cleanup: removed 15 expired entries, cache_size=42
```
The service tracks these metrics (accessible programmatically):
| Metric | Description |
|---|---|
| `requests_processed` | Total requests checked by the dedup service |
| `duplicates_blocked` | Number of duplicate requests blocked |
| `cache_size` | Current number of entries in the dedup cache |
| `dedup_rate` | Ratio of blocked duplicates to total requests |
| `window_seconds` | Configured dedup window |
| `enabled` | Whether deduplication is active |
To disable deduplication:
```shell
# Via CLI
--disable-request-dedup

# Or set window to 0
--request-dedup-window 0

# Via environment
export LLM_REQUEST_DEDUP_WINDOW=0
```

The deduplication service is fully thread-safe:
- All cache operations are protected by an asyncio lock
- Statistics reads are non-blocking (approximate values)
- Concurrent requests from multiple sessions are handled correctly
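A sketch of the locking pattern described above, with a lock around cache mutation and a lock-free statistics read. Class and method names here are assumptions, not the service's actual API:

```python
import asyncio
import time

class SafeDedupService:
    """Sketch: asyncio-lock-protected dedup cache with approximate stats."""

    def __init__(self, window=3.0):
        self.window = window
        self._cache = {}                 # cache key -> first-seen timestamp
        self._lock = asyncio.Lock()
        self.stats = {"requests_processed": 0, "duplicates_blocked": 0}

    async def check(self, key):
        async with self._lock:           # serialize all cache mutation
            now = time.monotonic()
            self.stats["requests_processed"] += 1
            ts = self._cache.get(key)
            if ts is not None and now - ts <= self.window:
                self.stats["duplicates_blocked"] += 1
                return True
            self._cache[key] = now
            return False

    def stats_snapshot(self):
        return dict(self.stats)          # non-blocking, approximate read
```

Two concurrent checks of the same key resolve deterministically: the first registers the request, the second is flagged as a duplicate.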
- Failure Handling - Automatic retry and failover for backend errors
- Health Checks - Backend health monitoring and circuit breaker
- Session Management - Session handling and state management
If legitimate requests are being blocked:
- Check if requests are truly identical (same messages, model, tools)
- Decrease the dedup window (e.g. `--request-dedup-window 1.0`)
- Or disable deduplication: `--disable-request-dedup`
A high duplicate rate (e.g., >10%) suggests client-side issues:
- Check client timeout settings - may be too aggressive
- Review network latency between client and proxy
- Consider increasing client timeouts rather than disabling dedup
The cache is bounded to 10,000 entries maximum. With a 3-second window and typical request patterns, memory usage is minimal (<1MB).