|
| 1 | +# Reference |
| 2 | +* Original RFC [#46](https://github.com/concourse/rfcs/pull/46) |
| 3 | +* Original POC [#4818](https://github.com/concourse/concourse/pull/4818) |
| 4 | + |
| 5 | +# Summary |
| 6 | +This proposal outlines initial support for a `health` endpoint, backed by a simple service that monitors crucial Concourse interfaces such as database connectivity, the number of healthy workers (*compared against a threshold that depends on the update strategy*), the state of the web nodes (ATC/TSA), and others. |
| 7 | + |
| 8 | +# Motivation |
| 9 | +#### Currently, Concourse does not expose a dedicated, standardized health endpoint that external systems can query to determine the system’s overall health. This creates challenges in the following areas: |
| 10 | + |
| 11 | +### 1. Monitoring & Alerting |
| 12 | +Operators and platform teams often integrate Concourse with monitoring systems (e.g. Prometheus, Datadog, Kubernetes liveness/readiness probes). Without a clear health endpoint, they must rely on indirect signals (such as API responses, metrics, or manual checks), which can be unreliable or difficult to standardize. |
| 13 | + |
| 14 | +### 2. Automation & Self-Healing |
| 15 | +Modern infrastructure frequently depends on health endpoints for automated actions like restarting unhealthy pods, removing failing nodes from load balancers, or scaling workloads. The lack of a health endpoint makes such automation harder to implement for Concourse. |
| 16 | + |
| 17 | +### 3. User Experience |
| 18 | +When Concourse becomes partially degraded (e.g. workers are down, ATC is unresponsive, DB is lagging), it is not immediately obvious to users or operators. A health endpoint would provide a quick, single source of truth for identifying issues. |
| 19 | + |
| 20 | +### 4. Consistency with Industry Standards |
| 21 | +Most modern distributed systems (e.g. Kubernetes components, CI/CD systems, databases) expose health endpoints (commonly `/healthz`, `/readyz`, `/livez`). Introducing a similar endpoint in Concourse aligns it with best practices and user expectations. |
| 22 | + |
| 23 | +## What will it bring? |
| 24 | +By introducing a health endpoint, we make it easier to operate Concourse reliably in production environments, reduce the burden on operators, and enable better integration with external observability and orchestration systems. |
| 25 | + |
| 26 | +# Proposal |
| 27 | +## API Changes |
| 28 | +The proposal is a simple **unauthenticated** HTTP endpoint (e.g. `/health`) that returns a JSON payload indicating the overall health status of the Concourse system. The payload could be something as simple as: |
| 29 | +```json |
| 30 | +{ |
| 31 | + "status": "healthy/unhealthy", |
| 32 | + "details": { |
| 33 | + "database": "healthy/unhealthy", |
| 34 | + "workers": "healthy/unhealthy" |
| 35 | + } |
| 36 | +} |
| 37 | +``` |
| 38 | + |
| 39 | +## Backend Service changes |
| 40 | +A new service (e.g. `HealthChecker`) will be introduced to periodically check the health of critical components: |
| 41 | +- **Database Connectivity**: Ensure the database is reachable and responsive, e.g. via a simple query or by checking logs for errors. |
| 42 | +- **Worker Health**: Monitor the number of healthy workers and their responsiveness. The desired worker count is already known, so a simple threshold property (e.g. 80% of desired workers) can determine whether the system has enough registered workers to handle load. The threshold can be calculated from the update strategy (e.g. rolling updates might tolerate fewer workers temporarily, depending on the configured *in parallel/max in flight* count). |
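
The worker threshold described above could be sketched roughly as follows. The function names and the integer-percentage representation are assumptions made for illustration, not existing Concourse code, and the rolling-update rule (tolerate as many missing workers as are updated at once) is only one possible interpretation:

```go
package main

// WorkersHealthy reports whether the number of running workers meets the
// threshold, expressed as a percentage of the desired worker count.
// Integer math avoids floating-point rounding; the minimum is rounded up,
// so 80% of 3 desired workers requires all 3.
// Note: with desired == 0 the required minimum is 0, so this returns true.
func WorkersHealthy(running, desired, thresholdPercent int) bool {
	minRequired := (desired*thresholdPercent + 99) / 100
	return running >= minRequired
}

// MinWorkersDuringRollingUpdate is a hypothetical helper that derives the
// minimum acceptable worker count from the update strategy: if maxInFlight
// workers are updated in parallel, that many may be missing temporarily.
func MinWorkersDuringRollingUpdate(desired, maxInFlight int) int {
	minRequired := desired - maxInFlight
	if minRequired < 0 {
		return 0
	}
	return minRequired
}
```

For example, with 10 desired workers and an 80% threshold, 8 running workers count as healthy and 7 do not; with a max in flight of 3, a rolling update would tolerate dropping to 7.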
| 43 | + |
| 44 | +## Alternatives |
| 45 | +* There are solutions like [SLI runner](https://github.com/cirocosta/slirunner) that could potentially be leveraged for health checking in Concourse, but they require SLA suites and additional configuration, and they are much more granular. The proposition here is a simple, out-of-the-box health endpoint for basic high-level health checks on a standard Concourse deployment. People can always build on top of that for more complex use cases. |
| 46 | +* Extending the dataset of the `/info` endpoint to include a health JSON object is another alternative, but that endpoint is about static information on the Concourse instance rather than its dynamic health state. |
| 47 | + |
| 48 | +# Open Questions |
| 49 | +- I think it wouldn't require many changes to the existing infrastructure, but would it be better to have a dedicated microservice for this, or to integrate it into the existing ATC service? Based on previous approaches and discussions, the idea is to have it within the ATC. |
| 50 | +- Should we have a `degraded` state for the workers, where the number is below the expected count but not zero? In any case that could be added in the future; as a start, a simple healthy/unhealthy state should suffice. |
| 51 | + |
| 52 | +# Answered Questions |
| 53 | +... TBD ... |
| 54 | + |
| 55 | +# New Implications |
| 56 | +I do not see any immediate negative implications of this change; rather, it would improve the overall reliability and operability of Concourse in production environments. |