feat: Agent heartbeat push for fast failure detection (RELIABILITY-004) #307

@vybe

Summary

Agent health is currently detected by the backend polling every 30 seconds, so an agent crash can take up to 30s to detect. During that window, other agents keep sending requests to a dead container. With no circuit breaker in place (see RELIABILITY-001), this causes long cascading timeouts.

Solution

Flip from pull to push: agents send a lightweight heartbeat to the platform every 5 seconds. When heartbeats stop arriving and the agent's heartbeat record expires, the platform marks the agent unresponsive and (with RELIABILITY-001) opens the circuit breaker preemptively.

Scope

Backend:

  • New internal endpoint: POST /api/internal/heartbeat (no auth, internal network only)
  • Store heartbeat timestamp in Redis with 15s TTL: heartbeat:{agent_name}
  • Health check reads Redis key — missing key = agent unresponsive
  • Feed heartbeat status into monitoring service and circuit breaker (if RELIABILITY-001 is implemented)
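The storage side of the backend items above could look roughly like the sketch below. It is not the actual implementation: the function names, the JSON value layout, and the `InMemoryTTLStore` stand-in are assumptions made for illustration. The `store` argument is expected to offer `set(key, value, ex=...)` and `exists(key)`, which is how the redis-py client exposes them, so a real Redis client could be passed in instead of the stand-in. The HTTP endpoint itself is left out because the issue does not say how routes in internal.py are declared.

```python
import json
import time

HEARTBEAT_TTL_SECONDS = 15  # from the issue: key expires 15s after the last heartbeat


def heartbeat_key(agent_name: str) -> str:
    """Key layout from the issue: heartbeat:{agent_name}."""
    return f"heartbeat:{agent_name}"


def record_heartbeat(store, agent_name: str, payload: dict) -> None:
    """Write (or refresh) the heartbeat key; each call restarts the TTL."""
    value = json.dumps({"received_at": time.time(), **payload})
    store.set(heartbeat_key(agent_name), value, ex=HEARTBEAT_TTL_SECONDS)


def is_agent_responsive(store, agent_name: str) -> bool:
    """Missing key means the agent is treated as unresponsive."""
    return bool(store.exists(heartbeat_key(agent_name)))


class InMemoryTTLStore:
    """Hypothetical stand-in for Redis SET ... EX / EXISTS, for local runs only."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry time on the injected clock)

    def set(self, key, value, ex):
        self._data[key] = (value, self._clock() + ex)

    def exists(self, key):
        entry = self._data.get(key)
        if entry is None:
            return 0
        if self._clock() >= entry[1]:
            # Expired: drop the key, as Redis would have done already.
            del self._data[key]
            return 0
        return 1
```

Relying on key expiry means the backend needs no sweeper job: an agent that stops reporting simply disappears from the store once the TTL runs out.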

Agent-side:

  • Add background task in agent-server.py that POSTs heartbeat every 5s
  • Include basic health info in payload: memory usage, active execution count, uptime
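A minimal sketch of the agent-side task, assuming the agent server runs an asyncio event loop. The `send` coroutine, the payload field names, and `get_active_executions` are hypothetical; in practice `send` would wrap the POST to /api/internal/heartbeat using whatever HTTP client the agent server already has. Memory is read with the standard-library `resource` module, which reports peak resident set size (in kilobytes on Linux) and is only available on Unix.

```python
import asyncio
import resource
import time

HEARTBEAT_INTERVAL_SECONDS = 5  # from the issue: one heartbeat every 5s
_PROCESS_START = time.monotonic()


def build_payload(agent_name: str, active_executions: int) -> dict:
    """Collect the basic health info listed in the issue (field names assumed)."""
    return {
        "agent_name": agent_name,
        # Peak RSS of this process; kilobytes on Linux.
        "memory_kb": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
        "active_executions": active_executions,
        "uptime_seconds": round(time.monotonic() - _PROCESS_START, 1),
    }


async def heartbeat_loop(send, agent_name, get_active_executions,
                         interval=HEARTBEAT_INTERVAL_SECONDS):
    """Send a heartbeat every `interval` seconds until the task is cancelled."""
    while True:
        try:
            await send(build_payload(agent_name, get_active_executions()))
        except Exception:
            # Backend unreachable or slow: fail silently so the agent keeps
            # serving requests. Task cancellation is not swallowed here.
            pass
        await asyncio.sleep(interval)
```

The loop would be started once at server startup, for example with `asyncio.create_task(heartbeat_loop(send, name, counter))`, and cancelled on shutdown. Giving `send` its own short timeout keeps a hung backend from delaying later heartbeats.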

Acceptance Criteria

  • Running agents send heartbeat every 5 seconds
  • Backend detects missing heartbeat within 15 seconds (vs up to 30s currently)
  • Heartbeat status reflected in fleet health endpoint
  • Agent server continues working normally if backend is temporarily unreachable (heartbeat fails silently)
  • ~40 lines total (backend endpoint + agent-side background task)
  • No new infrastructure — uses existing Redis
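The 15-second detection bound follows from the TTL and the send interval. The helper below is only a worked calculation, with a hypothetical function name, and it assumes heartbeats arrive on schedule and that the health check notices the missing key as soon as it expires. A crash right after a heartbeat leaves a key that lives for the full TTL; a crash right before the next heartbeat leaves a key that was refreshed one interval earlier.

```python
def detection_window(ttl_seconds: float, interval_seconds: float) -> tuple:
    """Return (best case, worst case) seconds between a crash and key expiry."""
    if interval_seconds <= 0 or ttl_seconds < interval_seconds:
        raise ValueError("TTL must be at least one heartbeat interval")
    # Best case: the last heartbeat landed a full interval before the crash.
    best = ttl_seconds - interval_seconds
    # Worst case: the last heartbeat landed just before the crash.
    worst = ttl_seconds
    return (best, worst)


print(detection_window(15, 5))  # (10, 15)
```

With the values in this issue, a crashed agent's key expires between 10 and 15 seconds after the crash, and any delay between expiry and the next health check read adds to that.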

Key Files

  • src/backend/routers/internal.py — new heartbeat endpoint
  • src/backend/services/monitoring_service.py — consume heartbeat data
  • docker/base-image/agent_server/ — add heartbeat background task

Dependencies

  • Enhances RELIABILITY-001 (circuit breaker) — heartbeat loss can proactively open circuit
  • Independent of other reliability issues
