Merged
1 change: 1 addition & 0 deletions README.md
@@ -81,6 +81,7 @@ The observability stack provides full monitoring capabilities through metrics, l
 - Loki
 - Traefik
 - APISIX Gateway
+- Traefik and APISIX are cross-stack targets resolved via Swarm's `<stack>_<service>` DNS naming on the shared overlay network.

 #### Logs

11 changes: 11 additions & 0 deletions apisix/api-gateway/docker-compose.yml
@@ -19,6 +19,17 @@ services:
       - ./config/apisix.yaml:/usr/local/apisix/conf/config.yaml:ro
       - cache:/tmp/apisix-cache/
     restart: unless-stopped
+    healthcheck:
+      test:
+        - CMD
+        - wget
+        - --no-verbose
+        - --tries=1
+        - --spider
+        - http://localhost:9091/apisix/prometheus/metrics
+      interval: 20s
+      timeout: 5s
+      retries: 3
     labels:
       - "traefik.enable=${TRAEFIK_ENABLE}"
       - "traefik.http.routers.apisix.rule=Host(`${HOST}`)"
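To gauge how quickly a stuck container would be flagged, the timing values in the healthcheck blocks of this change can be turned into a rough upper bound. The sketch below assumes Docker's documented behavior (a probe runs `interval` seconds after the previous one completes, and `retries` consecutive failures mark the container unhealthy), that every probe hangs until its `timeout`, and that no `start_period` is configured. The function name is hypothetical.

```python
def seconds_until_unhealthy(interval: float, timeout: float, retries: int) -> float:
    """Rough upper bound on the time from container start to 'unhealthy'.

    Assumes every probe runs into its timeout and no start_period is set.
    """
    # Each attempt waits `interval`, then runs for at most `timeout`.
    return retries * (interval + timeout)


# Values copied from the healthcheck blocks in this change.
apisix_worst_case = seconds_until_unhealthy(interval=20, timeout=5, retries=3)
prometheus_worst_case = seconds_until_unhealthy(interval=30, timeout=5, retries=3)
print(apisix_worst_case, prometheus_worst_case)
```

Under these assumptions APISIX would be reported unhealthy after about 75 seconds and Prometheus after about 105 seconds; probes that fail fast (for example, connection refused) shorten both figures.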
5 changes: 5 additions & 0 deletions observability/prometheus/config/prometheus.yml
@@ -8,6 +8,11 @@ global:
 rule_files:
   - 'alert.rules'

+# DNS naming convention:
+# Same-stack targets use short names (e.g., 'tempo:3200')
+# Cross-stack targets use <stack>_<service> (e.g., 'infrastructure_traefik:8080')
+# All stacks share the 'traefik-public' overlay network.
+
 scrape_configs:
   # Prometheus self-monitoring
   - job_name: 'prometheus'
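The naming convention described in the new `prometheus.yml` comment can be expressed as a small helper. This is only an illustrative sketch: the function `scrape_target` is hypothetical and not part of the repository, and it assumes the caller knows whether a service lives in the same stack as Prometheus.

```python
from typing import Optional


def scrape_target(service: str, port: int, stack: Optional[str] = None) -> str:
    """Build a host:port scrape target following the documented convention.

    Pass `stack` only for services deployed in a different stack than
    Prometheus; same-stack services are addressed by their short name.
    """
    host = service if stack is None else f"{stack}_{service}"
    return f"{host}:{port}"


# The two examples from the comment in prometheus.yml.
print(scrape_target("tempo", 3200))
print(scrape_target("traefik", 8080, stack="infrastructure"))
```

Both forms rely on the services sharing the `traefik-public` overlay network; the helper only formats the name and does not check that it resolves.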
11 changes: 11 additions & 0 deletions observability/prometheus/docker-compose.yml
@@ -30,6 +30,17 @@ services:
     networks:
       - default
     restart: unless-stopped
+    healthcheck:
+      test:
+        - CMD
+        - wget
+        - --no-verbose
+        - --tries=1
+        - --spider
+        - http://localhost:9090/-/ready
+      interval: 30s
+      timeout: 5s
+      retries: 3
     logging:
       options:
         max-size: "10m"
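For debugging the probe from outside the container, a comparable readiness check can be sketched with the Python standard library. This is an approximation, not a reimplementation of the `wget` command above: it issues a plain GET, treats any 2xx answer as ready, and the function name `is_ready` is hypothetical.

```python
import urllib.error
import urllib.request


def is_ready(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses
        # (HTTPError is a subclass of URLError) all count as "not ready".
        return False
```

With the Prometheus port published locally, `is_ready("http://localhost:9090/-/ready")` would mirror the endpoint used by the healthcheck; the 5-second default matches the `timeout` configured in the compose file.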
2 changes: 1 addition & 1 deletion stacks/README.md
@@ -129,7 +129,7 @@ python3 tools/render_compose.py -i stacks/infrastructure.yml -o /tmp/check.rende

 - Ensure each service folder has a `.env` available. For local development, copy from `.env.example`; for production, use `./stackctl.sh secrets deploy` (see [Managing Secrets](../docs/Managing%20Secrets.md)).
 - APISIX dashboard uses `apisix/api-dashboard/config/conf.yaml` (generated from `conf.example.yml`).
-- Consider adding healthchecks for critical dependencies to improve startup reliability.
+- Healthchecks have been added for Prometheus and APISIX. Consider adding them for other services as needed.

 ### Resource caps & logging

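To follow up on the note about the remaining services, one way to find candidates is to scan a parsed stack file for services that define no `healthcheck` key. The sketch below works on an already-parsed mapping (how the YAML is loaded is left open), and both the function name and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def services_without_healthcheck(compose: Dict[str, Any]) -> List[str]:
    """Return the sorted names of services that define no `healthcheck` key."""
    services = compose.get("services") or {}
    return sorted(
        name
        for name, spec in services.items()
        if "healthcheck" not in (spec or {})
    )


# Hypothetical, trimmed-down stand-in for a parsed stack file.
example = {
    "services": {
        "prometheus": {
            "healthcheck": {
                "test": ["CMD", "wget", "--spider", "http://localhost:9090/-/ready"]
            }
        },
        "loki": {"image": "example/loki"},
    }
}
print(services_without_healthcheck(example))
```

The check only looks for the presence of the key, so a service that explicitly disables its healthcheck would still be treated as covered.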
11 changes: 11 additions & 0 deletions stacks/infrastructure.yml
@@ -62,6 +62,17 @@ services:
     volumes:
       - ./apisix/api-gateway/config/apisix.yaml:/usr/local/apisix/conf/config.yaml:ro
       - cache:/tmp/apisix-cache/
+    healthcheck:
+      test:
+        - CMD
+        - wget
+        - --no-verbose
+        - --tries=1
+        - --spider
+        - http://localhost:9091/apisix/prometheus/metrics
+      interval: 20s
+      timeout: 5s
+      retries: 3
     labels:
       - traefik.enable=${TRAEFIK_ENABLE}
       - traefik.http.routers.apisix.rule=Host(`${HOST}`)
11 changes: 11 additions & 0 deletions stacks/observability.yml
@@ -195,6 +195,17 @@ services:
       - traefik.http.services.prometheus.loadbalancer.server.port=${PORT}
     networks:
       - default
+    healthcheck:
+      test:
+        - CMD
+        - wget
+        - --no-verbose
+        - --tries=1
+        - --spider
+        - http://localhost:9090/-/ready
+      interval: 30s
+      timeout: 5s
+      retries: 3
     logging:
       options:
         max-size: 10m