Summary
Agent health is currently detected by polling every 30 seconds from the backend. This means an agent crash takes up to 30s to detect, and during that window other agents continue sending requests to a dead container. Combined with no circuit breaker (see RELIABILITY-001), this causes long cascading timeouts.
Solution
Flip from pull to push: agents send a lightweight heartbeat to the platform every 5 seconds. A missed heartbeat triggers immediate status change and (with RELIABILITY-001) opens the circuit breaker preemptively.
Scope
Backend:
- New internal endpoint:
POST /api/internal/heartbeat (no auth, internal network only)
- Store heartbeat timestamp in Redis with 15s TTL:
heartbeat:{agent_name}
- Health check reads Redis key — missing key = agent unresponsive
- Feed heartbeat status into monitoring service and circuit breaker (if RELIABILITY-001 is implemented)
Agent-side:
- Add background task in
agent-server.py that POSTs heartbeat every 5s
- Include basic health info in payload: memory usage, active execution count, uptime
Acceptance Criteria
Key Files
src/backend/routers/internal.py — new heartbeat endpoint
src/backend/services/monitoring_service.py — consume heartbeat data
docker/base-image/agent_server/ — add heartbeat background task
Dependencies
- Enhances RELIABILITY-001 (circuit breaker) — heartbeat loss can proactively open circuit
- Independent of other reliability issues
Summary
Agent health is currently detected by polling every 30 seconds from the backend. This means an agent crash takes up to 30s to detect, and during that window other agents continue sending requests to a dead container. Combined with no circuit breaker (see RELIABILITY-001), this causes long cascading timeouts.
Solution
Flip from pull to push: agents send a lightweight heartbeat to the platform every 5 seconds. A missed heartbeat triggers immediate status change and (with RELIABILITY-001) opens the circuit breaker preemptively.
Scope
Backend:
POST /api/internal/heartbeat(no auth, internal network only)heartbeat:{agent_name}Agent-side:
agent-server.pythat POSTs heartbeat every 5sAcceptance Criteria
Key Files
src/backend/routers/internal.py— new heartbeat endpointsrc/backend/services/monitoring_service.py— consume heartbeat datadocker/base-image/agent_server/— add heartbeat background taskDependencies