This guide covers the comprehensive observability features added to the Weather MCP server, including metrics, logging, tracing, and monitoring dashboards.
The Weather MCP server now includes a complete observability stack with:
- Prometheus - Metrics collection and alerting
- Grafana - Visualization and dashboards
- Loki - Log aggregation and analysis
- AlertManager - Alert routing and notifications
- Structured Logging - JSON-formatted logs with correlation IDs
- Distributed Tracing - Request flow tracking
- Health Endpoints - Service health monitoring
# Option 1: Use the convenience script
make start-observability
# Option 2: Start manually
make observability-up

- Weather MCP API: http://localhost:8000
- Grafana Dashboard: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9091
- AlertManager: http://localhost:9093
- Health Check: http://localhost:8001/health (served on the API port + 1)
- Metrics Endpoint: http://localhost:8001/metrics (served on the API port + 1)
# Check all services
make observability-status
# Check specific endpoints
make health-check
make metrics-check

The following metrics are automatically collected:
- weather_api_requests_total - Total API requests by endpoint and status
- weather_api_request_duration_seconds - Request latency histograms
- weather_errors_total - Error counts by service and type
- nws_api_requests_total - National Weather Service API requests
- nws_api_request_duration_seconds - NWS API response times
- cache_operations_total - Cache operations (hit/miss/set/error)
- sse_connections_active - Current active SSE connections
- sse_connections_total - Total SSE connections by status
- weather_data_freshness_seconds - Age of weather data by location
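To make concrete what a request counter and a latency histogram record, here is a minimal, stdlib-only sketch of a decorator that counts calls by outcome and stores each call's duration. It is not the weather_mcp implementation: `track_calls`, `REQUESTS_TOTAL`, `DURATIONS`, and the two demo handlers are hypothetical names used only for illustration, and a real setup would normally hand these values to a Prometheus client library rather than keep them in plain dictionaries.

```python
import asyncio
import functools
import time
from collections import Counter, defaultdict

# Hypothetical in-memory stand-ins for a counter and a histogram.
REQUESTS_TOTAL = Counter()      # keyed by (endpoint, status)
DURATIONS = defaultdict(list)   # endpoint -> list of durations in seconds


def track_calls(endpoint):
    """Count calls by outcome and record how long each one took."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "success"
            try:
                return await func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on both paths, so failed calls are measured too.
                DURATIONS[endpoint].append(time.perf_counter() - start)
                REQUESTS_TOTAL[(endpoint, status)] += 1
        return wrapper
    return decorator


@track_calls("current_weather")
async def get_current_weather(city):
    await asyncio.sleep(0.01)  # placeholder for real work
    return {"city": city}


@track_calls("forecast")
async def get_forecast(city):
    raise RuntimeError("upstream unavailable")  # simulated failure


async def demo():
    await get_current_weather("Boston")
    try:
        await get_forecast("Boston")
    except RuntimeError:
        pass  # the error is still counted by the decorator


asyncio.run(demo())
```

After the demo runs, `REQUESTS_TOTAL` holds one success for `current_weather` and one error for `forecast`, and `DURATIONS` holds one timing per endpoint.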
You can add custom metrics using the observability decorators:
from weather_mcp.observability import track_api_request, track_nws_request

@track_api_request("my_endpoint", "GET")
async def my_api_function():
    # Your API logic here
    pass

@track_nws_request("weather")
async def call_nws_api():
    # NWS API call logic
    pass

Logs are output in structured JSON format with the following fields:
{
"time": "2024-01-15 10:30:45.123",
"level": "INFO",
"name": "weather_mcp.services.weather_service",
"function": "get_current_weather",
"line": 25,
"message": "API request completed",
"extra": {
"correlation_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"endpoint": "current_weather",
"method": "GET",
"status_code": "200",
"duration": 0.234,
"success": true
}
}

Every request is assigned a unique correlation ID that is carried through all services and log entries, so a single request can be traced end to end when debugging.
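One common way to attach a correlation ID to every log line is a context variable that a JSON formatter reads at emit time. The sketch below uses only the standard library and mirrors the shape of the log entry shown above. It is an illustration under stated assumptions rather than the server's actual logging setup: `JsonFormatter`, `build_logger`, `handle_request`, and the logger name `correlation_demo` are hypothetical.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "name": record.name,
            "function": record.funcName,
            "line": record.lineno,
            "message": record.getMessage(),
            "extra": {"correlation_id": correlation_id.get()},
        })


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("correlation_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_request(logger):
    # Assign a fresh ID at the start of each request; every log call made
    # while handling it picks the ID up from the context variable.
    token = correlation_id.set(str(uuid.uuid4()))
    try:
        logger.info("API request started")
        logger.info("API request completed")
    finally:
        correlation_id.reset(token)  # do not leak the ID to the next request


buffer = io.StringIO()
handle_request(build_logger(buffer))
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Both entries in `records` carry the same `correlation_id`, which is what makes it possible to filter an aggregated log down to a single request.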
- DEBUG - Detailed debugging information
- INFO - General operational information
- WARNING - Warning conditions
- ERROR - Error conditions
- CRITICAL - Critical error conditions
General health check with basic metrics:
curl http://localhost:8000/health

Response:
{
"status": "healthy",
"timestamp": 1642234567.123,
"metrics": {
"active_sse_connections": 5,
"total_api_requests": 1234,
"total_nws_requests": 567,
"total_errors": 2
}
}

Readiness check for Kubernetes deployments:
curl http://localhost:8000/health/ready

Liveness check for Kubernetes deployments:
curl http://localhost:8000/health/live

Prometheus metrics endpoint:
curl http://localhost:8000/metrics

The included dashboard provides:
- API request rate
- 95th percentile latency
- Error rate
- Active SSE connections
- Request rate by endpoint
- Response time percentiles (50th, 95th, 99th)
- Status code distribution
- NWS API request rate and latency
- Dependency health status
- Cache hit rate
- SSE connection patterns
- Connection lifecycle metrics
- Open Grafana: http://localhost:3000
- Log in with admin/admin
- Navigate to the "Weather MCP Service Dashboard"
The system includes the following alert rules:
- Service Down - Weather MCP service is unreachable
- NWS API Failures - High failure rate from National Weather Service
- High Error Rate - Overall error rate > 10%
- High API Latency - 95th percentile > 2 seconds
- Low Cache Hit Rate - Cache hit rate < 70%
- Too Many SSE Connections - Active connections > 80
- Stale Weather Data - Data older than 1 hour
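The threshold-based rules in this list can be read as simple predicates over a snapshot of current values. The sketch below restates them in Python purely as an illustration; the actual rules are Prometheus expressions, and those typically also require a condition to hold for some duration before firing, which this sketch ignores. `Snapshot`, its field names, and `firing_alerts` are hypothetical. The Service Down and NWS API Failures rules are left out because the list gives no numeric threshold for them.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time readings; field names are illustrative."""
    error_rate: float              # fraction of failed requests, 0.0 to 1.0
    p95_latency_seconds: float     # 95th percentile API latency
    cache_hit_rate: float          # fraction of cache lookups that hit
    active_sse_connections: int
    data_age_seconds: float        # age of the oldest weather data


def firing_alerts(s):
    """Return the names of the threshold-based alerts that would fire."""
    checks = [
        ("High Error Rate", s.error_rate > 0.10),
        ("High API Latency", s.p95_latency_seconds > 2.0),
        ("Low Cache Hit Rate", s.cache_hit_rate < 0.70),
        ("Too Many SSE Connections", s.active_sse_connections > 80),
        ("Stale Weather Data", s.data_age_seconds > 3600),
    ]
    return [name for name, breached in checks if breached]


# Made-up readings, chosen only to exercise the thresholds.
healthy = Snapshot(0.01, 0.4, 0.92, 12, 300)
degraded = Snapshot(0.15, 2.5, 0.92, 81, 4000)
```

With these made-up readings, `firing_alerts(healthy)` is empty, while `firing_alerts(degraded)` reports every rule except Low Cache Hit Rate.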
Alerts are configured in monitoring/prometheus/alert_rules.yml and routed through AlertManager (monitoring/alertmanager/alertmanager.yml).
To configure Slack notifications:
- Edit monitoring/alertmanager/alertmanager.yml
- Uncomment the Slack configuration section
- Add your Slack webhook URL
- Restart AlertManager: make observability-restart
The observability stack includes:
- clima-mcp - Main Weather MCP service with metrics enabled
- prometheus - Metrics collection (port 9091)
- grafana - Dashboards and visualization (port 3000)
- loki - Log aggregation (port 3100)
- promtail - Log shipping agent
- alertmanager - Alert routing (port 9093)
- node-exporter - System metrics (port 9100)
- cadvisor - Container metrics (port 8080)
For development with hot reload:
# Start development stack
docker-compose -f docker-compose.observability.yml --profile dev up
# Or using Makefile
make observability-up

Observability features can be configured via environment variables:
# Enable/disable features
ENABLE_METRICS=true
ENABLE_TRACING=true
STRUCTURED_LOGGING=true
# Ports
METRICS_PORT=9090
# Logging
LOG_LEVEL=INFO
LOG_FILE=logs/clima-mcp.log

Edit these files to customize the observability stack:
- monitoring/prometheus/prometheus.yml - Prometheus configuration
- monitoring/grafana/provisioning/ - Grafana datasources and dashboards
- monitoring/loki/loki.yml - Loki log aggregation settings
- monitoring/alertmanager/alertmanager.yml - Alert routing rules
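For the environment variables listed earlier in this section (ENABLE_METRICS, ENABLE_TRACING, STRUCTURED_LOGGING, METRICS_PORT, LOG_LEVEL, LOG_FILE), the following sketch shows one way such settings could be read and type-checked in Python. It is not the server's configuration code: `ObservabilitySettings`, `load_settings`, and `_env_bool` are hypothetical, and the defaults simply copy the example values from this guide, since the server's real defaults are not stated here.

```python
import os
from dataclasses import dataclass


def _env_bool(env, name, default):
    """Interpret common truthy spellings; any other value is False."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ObservabilitySettings:
    enable_metrics: bool
    enable_tracing: bool
    structured_logging: bool
    metrics_port: int
    log_level: str
    log_file: str


def load_settings(env=os.environ):
    # Defaults mirror the example values in this guide (an assumption,
    # not the server's documented defaults).
    return ObservabilitySettings(
        enable_metrics=_env_bool(env, "ENABLE_METRICS", True),
        enable_tracing=_env_bool(env, "ENABLE_TRACING", True),
        structured_logging=_env_bool(env, "STRUCTURED_LOGGING", True),
        metrics_port=int(env.get("METRICS_PORT", "9090")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        log_file=env.get("LOG_FILE", "logs/clima-mcp.log"),
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_settings({"ENABLE_TRACING": "false", "METRICS_PORT": "9100"})
```

Parsing everything in one place means a malformed value such as a non-numeric METRICS_PORT fails at startup with a `ValueError` instead of surfacing later at the point of use.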
# Check Docker logs
make observability-logs
# Check individual service logs
docker-compose -f docker-compose.observability.yml logs prometheus
docker-compose -f docker-compose.observability.yml logs grafana

- Verify metrics endpoint: curl http://localhost:8000/metrics
- Check Prometheus targets: http://localhost:9091/targets
- Verify Prometheus can scrape the service
- Check Prometheus datasource connection in Grafana
- Verify metrics are being collected: http://localhost:9091/graph
- Check dashboard queries match your metric names
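When a dashboard panel stays empty, one quick check is whether the metric names its queries use actually appear in the /metrics output. The sketch below parses text in the Prometheus exposition format and reports which expected names are missing. The helper names (`exposed_metric_names`, `missing_metrics`) and the sample payload, including its label names and values, are made up for illustration; in practice the text would come from the metrics endpoint, for example via `urllib.request.urlopen`.

```python
import re

# Matches the metric name at the start of a sample line in the
# Prometheus text exposition format, e.g. `name{label="x"} 1.0`.
_SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+\S+")
_HISTOGRAM_SUFFIXES = ("_bucket", "_sum", "_count")


def exposed_metric_names(text):
    """Collect metric names, adding the base name for histogram series."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if not match:
            continue
        name = match.group(1)
        names.add(name)
        for suffix in _HISTOGRAM_SUFFIXES:
            if name.endswith(suffix):
                names.add(name[: -len(suffix)])
                break
    return names


def missing_metrics(text, expected):
    """Return the expected metric names that the payload does not expose."""
    return sorted(set(expected) - exposed_metric_names(text))


# Made-up payload; label names and values are assumptions.
SAMPLE = """\
# HELP weather_api_requests_total Total API requests
# TYPE weather_api_requests_total counter
weather_api_requests_total{endpoint="current_weather",status="200"} 1234
weather_api_request_duration_seconds_bucket{le="0.5"} 1200
weather_api_request_duration_seconds_sum 280.5
weather_api_request_duration_seconds_count 1234
sse_connections_active 5
"""

missing = missing_metrics(SAMPLE, [
    "weather_api_requests_total",
    "weather_api_request_duration_seconds",
    "sse_connections_active",
    "cache_operations_total",
])
```

Here `missing` contains only `cache_operations_total`, which would point to a metric that is either not yet emitted or named differently from what the dashboard query expects.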
- Check Promtail configuration: monitoring/promtail/promtail.yml
- Verify log file permissions and paths
- Check Promtail logs: docker-compose -f docker-compose.observability.yml logs promtail
Adjust retention settings in:
- monitoring/prometheus/prometheus.yml (storage.tsdb.retention.time)
- monitoring/loki/loki.yml (retention policies)
Configure log rotation:
- Prometheus data retention
- Loki log retention
- Application log rotation (already configured)
Start with the Golden Signals:
- Latency (response times)
- Traffic (request rates)
- Errors (error rates)
- Saturation (resource utilization)

Set Meaningful Alerts:
- Focus on symptoms, not causes
- Alert on user-impacting issues
- Avoid alert fatigue

Use Dashboards Effectively:
- Start with overview, drill down to details
- Include both technical and business metrics
- Make dashboards actionable

Local Development:
make start-observability  # Start full stack
make run                  # Start Weather MCP in another terminal

Testing Changes:
make health-check    # Verify service health
make metrics-check   # Check metrics format

Debugging Issues:
make observability-logs  # View all service logs
# Check correlation IDs in logs to trace requests

- Authentication: Configure Grafana with proper authentication
- Network Security: Use proper firewall rules and VPNs
- Secrets Management: Use proper secret management for alert webhooks
- TLS: Enable TLS for all external connections
- Prometheus: Consider federation for multiple instances
- Grafana: Use external database for HA setups
- Loki: Configure object storage for large deployments
- Resource Limits: Set appropriate CPU/memory limits
- Prometheus Data: Regular snapshots of TSDB
- Grafana Dashboards: Export dashboards and datasources
- Configuration: Version control all configuration files
# Quick start
make start-observability
# Status checks
make observability-status
make health-check
make metrics-check
# Management
make observability-up
make observability-down
make observability-restart
make observability-logs
# Open tools
make grafana-open
make prometheus-open
# Development
docker-compose -f docker-compose.observability.yml --profile dev up

For observability-related issues:
- Check the troubleshooting section above
- Review service logs: make observability-logs
- Verify configuration files in the monitoring/ directory
- Check Docker container status: docker ps
The observability stack provides comprehensive monitoring for the Weather MCP service, enabling proactive issue detection and performance optimization.