The LLM Interactive Proxy provides comprehensive monitoring and analytics capabilities to help you understand usage patterns, track costs, optimize performance, and ensure backend availability.
Backend Health Checks monitors the health of backend API endpoints and automatically excludes unhealthy backends from request routing.
Key Capabilities:
- ICMP ping checks for network reachability
- HTTP probe checks for application-level health
- Circuit breaker to exclude unhealthy backends from routing
- Real-time backend notifications on health changes
- REST API for health status monitoring
Use Cases:
- Automatic failover when backend APIs become unavailable
- Multi-region deployment with intelligent routing
- Proactive monitoring and alerting
- Preventing requests to unreachable endpoints
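The circuit-breaker behavior listed above can be illustrated with a minimal sketch. It assumes that a backend is excluded after a run of consecutive failed probes and readmitted after one success; the proxy's actual recovery policy may differ. `BackendHealth` and `routable` are hypothetical names, not part of the proxy's API.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Tracks consecutive probe failures for one backend (hypothetical helper)."""

    failure_threshold: int = 3  # echoes the ping failure_threshold in the sample config
    consecutive_failures: int = 0

    def record(self, probe_succeeded: bool) -> None:
        # Assumption: a single success resets the counter; failures accumulate.
        if probe_succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def healthy(self) -> bool:
        # The backend is excluded once failures reach the threshold.
        return self.consecutive_failures < self.failure_threshold


def routable(backends: dict[str, BackendHealth]) -> list[str]:
    """Return the names of backends that should still receive traffic."""
    return [name for name, state in backends.items() if state.healthy]
```

Keeping the counter per backend means one failing endpoint never affects routing decisions for the others.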
Quick Start:
# Check health status
curl "http://localhost:8000/internal/health" | jq '.endpoint_health'
# View summary
curl "http://localhost:8000/internal/health" | jq '.endpoint_health.summary'
Usage Tracking and Statistics provides detailed monitoring of all traffic passing through the proxy.
Key Capabilities:
- Multi-point token tracking (verbatim and mutated)
- Performance metrics (TTFT, latency, throughput)
- Cost tracking and billing reconciliation
- Request/response monitoring
- Tool call tracking
- Multi-dimensional filtering and aggregation
Use Cases:
- Billing reconciliation with provider invoices
- Performance monitoring and optimization
- Cost analysis by backend, model, and time period
- Tool usage analysis
- Session and conversation analysis
- Error rate monitoring
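As a rough illustration of the billing-reconciliation and cost-analysis use cases, the sketch below post-processes an aggregated statistics payload that has already been fetched and parsed. The field names (`total_tokens`, `backend_reported_total_tokens`, `total_cost`, `request_count`) are taken from the jq filters used elsewhere in this document; the overall response shape is otherwise an assumption, and `reconcile` is a hypothetical helper.

```python
def reconcile(stats: dict) -> dict:
    """Summarize token drift and cost per request from an aggregated stats payload."""
    # Missing fields default to zero so a sparse payload does not raise.
    proxy_tokens = stats.get("total_tokens", 0)
    backend_tokens = stats.get("backend_reported_total_tokens", 0)
    requests = stats.get("request_count", 0)
    total_cost = stats.get("total_cost", 0.0)
    return {
        "token_difference": proxy_tokens - backend_tokens,
        # Avoid dividing by zero when no requests match the filter.
        "cost_per_request": total_cost / requests if requests else None,
    }
```

A non-zero `token_difference` is a prompt to compare the period against the provider invoice, not proof of a billing error.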
Quick Start:
# Get aggregated statistics
curl "http://localhost:8000/v1/usage/stats?backend_type=openai&model=gpt-4"
# Get recent usage records
curl "http://localhost:8000/v1/usage/recent?limit=50"
# Export usage data
curl "http://localhost:8000/v1/usage/export?start_date=2025-12-01T00:00:00Z&end_date=2025-12-31T23:59:59Z"
Replacement Metrics tracks activation rates and effectiveness of the random model replacement feature.
Key Capabilities:
- Track replacement activation rates
- Monitor turn counts before replacement
- Analyze opt-out patterns
- Measure replacement effectiveness
Wire Capture records full HTTP requests and responses for debugging.
Key Capabilities:
- Record complete request/response cycles
- Analyze traffic patterns
- Debug integration issues
- Replay captured sessions
| Feature | Token Tracking | Cost Tracking | Performance Metrics | Request Details | Real-time API | Health Monitoring |
|---|---|---|---|---|---|---|
| Health Checks | - | - | ✓ (latency) | - | ✓ | ✓ (ping, HTTP) |
| Usage Tracking | ✓ (4 points) | ✓ | ✓ (TTFT, latency) | ✓ (full details) | ✓ | - |
| Replacement Metrics | - | - | - | ✓ (replacements) | ✓ | - |
| Wire Capture | - | - | - | ✓ (full payload) | - | - |
# Health checks (enabled by default)
health_check:
  enabled: true
  circuit_breaker_enabled: true
  notify_backends: true
  ping:
    enabled: true
    interval_seconds: 30.0
    failure_threshold: 3
  http:
    enabled: true
    interval_seconds: 60.0
    failure_threshold: 2
# Usage tracking (enabled by default)
usage_tracking:
  enabled: true
  persistence_path: "./var/usage_data.json"
  flush_interval_seconds: 30.0
# Wire capture (disabled by default)
wire_capture:
  enabled: true
  output_dir: "./var/wire_captures"
# Usage tracking
export USAGE_TRACKING_ENABLED=true
export USAGE_TRACKING_PERSISTENCE_PATH="./var/usage_data.json"
# Wire capture
export WIRE_CAPTURE_ENABLED=true
export WIRE_CAPTURE_OUTPUT_DIR="./var/wire_captures"
For security, the proxy binds to 127.0.0.1:8000 (localhost only) by default. This means:
- REST API endpoints are only accessible from the local machine
- External access requires explicit configuration
- Authentication is enforced when binding to external interfaces
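The rule in the list above (local-only by default, authentication required for external binds) can be expressed as a small pre-flight check. This is only a sketch of the policy as described here, not the proxy's own validation code; `is_loopback` and `bind_is_safe` are hypothetical helpers, and `disable_auth` mirrors the `auth.disable_auth` configuration key.

```python
import ipaddress


def is_loopback(host: str) -> bool:
    """Return True when host is a loopback address or the name 'localhost'."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name); treat it as externally reachable.
        return False


def bind_is_safe(host: str, disable_auth: bool) -> bool:
    """A bind is acceptable if it is local-only or authentication stays on."""
    return is_loopback(host) or not disable_auth
```

A deployment script could call `bind_is_safe` before starting the proxy and refuse to continue when it returns `False`.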
To allow external access:
# config.yaml
host: "0.0.0.0" # Bind to all interfaces
auth:
  disable_auth: false # Keep authentication enabled
Or via command line:
python -m src.core.cli --host 0.0.0.0
Warning: When exposing the API externally, always ensure authentication is enabled to prevent unauthorized access to usage data.
# Get statistics for a specific backend
curl "http://localhost:8000/v1/usage/stats?backend_type=openai"
# Get statistics with multiple filters
curl "http://localhost:8000/v1/usage/stats?backend_type=anthropic&model=claude-3-5-sonnet&start_date=2025-12-01T00:00:00Z"
# Export usage data for a date range
curl "http://localhost:8000/v1/usage/export?start_date=2025-12-01T00:00:00Z&end_date=2025-12-31T23:59:59Z" > usage_data.json
# Check TTFT and latency metrics
curl "http://localhost:8000/v1/usage/stats" | jq '.ttft_stats, .duration_stats'
Use Usage Tracking to compare proxy-calculated tokens with backend-reported tokens:
curl "http://localhost:8000/v1/usage/stats?backend_type=openai" | jq '{
  proxy_tokens: .total_tokens,
  backend_tokens: .backend_reported_total_tokens,
  difference: (.total_tokens - .backend_reported_total_tokens)
}'
Track Time to First Token (TTFT) and identify slow responses:
curl "http://localhost:8000/v1/usage/stats" | jq '.ttft_stats'
Analyze costs by backend and model:
curl "http://localhost:8000/v1/usage/stats?backend_type=anthropic" | jq '{
  total_cost: .total_cost,
  requests: .request_count,
  cost_per_request: (.total_cost / .request_count)
}'
Use Wire Capture to record and analyze problematic requests:
# Enable wire capture
python -m src.core.cli --wire-capture-enabled
# Analyze captured traffic
python scripts/inspect_cbor_capture.py var/wire_captures_cbor/capture_file.cbor --detect-issues
Track replacement effectiveness with Replacement Metrics:
curl "http://localhost:8000/v1/replacement/metrics"
Set up automated monitoring to track key metrics:
- Token consumption trends
- Error rates (HTTP status codes)
- Performance degradation (TTFT increases)
- Cost anomalies
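One simple way to automate checks like these is to compare current values against a stored baseline and flag anything that has grown by more than a chosen tolerance. The sketch below assumes the metrics have already been extracted into flat dictionaries; the metric names and the 20% default tolerance are illustrative assumptions, and `find_regressions` is a hypothetical helper.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.2) -> list[str]:
    """Return the metric names whose current value exceeds baseline by more than tolerance."""
    flagged = []
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None or base_value <= 0:
            continue  # Skip metrics that are missing or have no usable baseline.
        # Relative increase, e.g. 0.3 means the metric grew by 30%.
        if (value - base_value) / base_value > tolerance:
            flagged.append(name)
    return flagged
```

Running this on a schedule against the statistics endpoint and alerting on a non-empty result covers TTFT increases and cost anomalies with the same logic.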
Configure appropriate retention policies:
usage_tracking:
  max_records_in_memory: 100000 # Adjust based on traffic volume
  flush_interval_seconds: 30.0 # Balance between durability and performance
Usage tracking does NOT record:
- Message content (prompts or completions)
- API keys or authentication tokens
- Personally identifiable information (PII)
All monitoring features are designed for minimal overhead:
- Usage tracking: <1ms per request
- Wire capture: ~2-5ms per request (when enabled)
- Replacement metrics: <0.1ms per request
If usage tracking consumes too much memory:
- Reduce max_records_in_memory
- Increase flush_interval_seconds
- Implement custom archival logic
If usage data is missing after restart:
- Check that persistence_path is writable
- Verify the file exists at the configured path
- Check proxy logs for persistence errors
If statistics queries are slow:
- Add more specific filters to reduce result set
- Use date range filters to limit scope
- Consider implementing database persistence for large deployments
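To make narrow, date-bounded queries easy to issue, a client can build the statistics URL from a set of filters. The sketch below uses the query parameters that appear in this document's curl examples (`backend_type`, `model`, `start_date`, `end_date`); `stats_url` is a hypothetical helper, and it does not validate parameter names against the API.

```python
from urllib.parse import urlencode


def stats_url(base: str = "http://localhost:8000", **filters: str) -> str:
    """Build a /v1/usage/stats URL, dropping filters that are empty."""
    # urlencode percent-encodes reserved characters such as ':' in timestamps.
    query = urlencode({key: value for key, value in filters.items() if value})
    path = f"{base}/v1/usage/stats"
    return f"{path}?{query}" if query else path
```

For example, `stats_url(backend_type="openai", start_date="2025-12-01T00:00:00Z")` limits the query to one backend and to records from December 2025 onward.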
- Health Checks Guide - Backend health monitoring and circuit breaker
- Usage Tracking User Guide - Complete feature documentation
- Usage Tracking Integration Guide - Developer integration guide
- Wire Capture Guide - Traffic recording and analysis
- Replacement Metrics Guide - Model replacement tracking
- Configuration Guide - Complete configuration reference