Commit e4b0430

Create RFC for health endpoint
Signed-off-by: Kump3r <tonevkalin@gmail.com>
1 parent 66dc7ff commit e4b0430

1 file changed

Lines changed: 56 additions & 0 deletions

141-health-endpoint/proposal.md

@@ -0,0 +1,56 @@
1+
# Reference
2+
* Original RFC [#46](https://github.com/concourse/rfcs/pull/46)
3+
* Original POC [#4818](https://github.com/concourse/concourse/pull/4818)
4+
5+
# Summary
6+
This proposal outlines initial support for a `health` endpoint, backed by a simple service that monitors crucial Concourse interfaces: database connectivity, the number of healthy workers (*measured against a threshold that depends on the update strategy*), the state of the web nodes (ATC/TSA), and others.
7+
8+
# Motivation
9+
Currently, Concourse does not expose a dedicated, standardized health endpoint that external systems can query to determine the system’s overall health. This creates challenges in the following areas:
10+
11+
### 1. Monitoring & Alerting
12+
Operators and platform teams often integrate Concourse with monitoring systems (e.g. Prometheus, Datadog, Kubernetes liveness/readiness probes). Without a clear health endpoint, they must rely on indirect signals (such as API responses, metrics, or manual checks), which can be unreliable or difficult to standardize.
13+
14+
### 2. Automation & Self-Healing
15+
Modern infrastructure frequently depends on health endpoints for automated actions like restarting unhealthy pods, removing failing nodes from load balancers, or scaling workloads. The lack of a health endpoint makes such automation harder to implement for Concourse.
16+
17+
### 3. User Experience
18+
When Concourse becomes partially degraded (e.g. workers are down, ATC is unresponsive, DB is lagging), it is not immediately obvious to users or operators. A health endpoint would provide a quick, single source of truth for identifying issues.
19+
20+
### 4. Consistency with Industry Standards
21+
Most modern distributed systems (e.g. Kubernetes components, CI/CD systems, databases) expose health endpoints (commonly `/healthz`, `/readyz`, `/livez`). Introducing a similar endpoint in Concourse aligns it with best practices and user expectations.
22+
23+
## What will it bring?
24+
By introducing a health endpoint, we make it easier to operate Concourse reliably in production environments, reduce the burden on operators, and enable better integration with external observability and orchestration systems.
25+
26+
# Proposal
27+
## API Changes
28+
The proposed change is a simple **unauthenticated** HTTP endpoint (e.g. `/health`) that returns a JSON payload indicating the overall health of the Concourse system. It could be as simple as:
29+
```json
{
  "status": "healthy/unhealthy",
  "details": {
    "database": "healthy/unhealthy",
    "workers": "healthy/unhealthy"
  }
}
```
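To make the payload above concrete, here is a minimal Go sketch of how such an endpoint might be served. Every name in it (`HealthResponse`, `HealthDetails`, `NewHealthResponse`, `StatusCode`, `HealthHandler`) is hypothetical and not part of the Concourse codebase. Mapping an unhealthy result to HTTP 503 is also an assumption, chosen because load balancers and probes usually key off the status code; the proposal itself only specifies the JSON body.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// Status strings taken from the example payload in the proposal.
const (
	Healthy   = "healthy"
	Unhealthy = "unhealthy"
)

// HealthDetails and HealthResponse are hypothetical types that mirror
// the proposed JSON payload.
type HealthDetails struct {
	Database string `json:"database"`
	Workers  string `json:"workers"`
}

type HealthResponse struct {
	Status  string        `json:"status"`
	Details HealthDetails `json:"details"`
}

func statusOf(ok bool) string {
	if ok {
		return Healthy
	}
	return Unhealthy
}

// NewHealthResponse reports the system as healthy only when every
// checked component is healthy.
func NewHealthResponse(dbOK, workersOK bool) HealthResponse {
	return HealthResponse{
		Status: statusOf(dbOK && workersOK),
		Details: HealthDetails{
			Database: statusOf(dbOK),
			Workers:  statusOf(workersOK),
		},
	}
}

// StatusCode maps the overall status to an HTTP status code.
// Assumption: 200 when healthy, 503 otherwise.
func StatusCode(r HealthResponse) int {
	if r.Status == Healthy {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// HealthHandler serves whatever the supplied check function returns.
// It performs no authentication, matching the proposal.
func HealthHandler(check func() HealthResponse) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		resp := check()
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(StatusCode(resp))
		_ = json.NewEncoder(w).Encode(resp)
	}
}
```

The handler could be registered with something like `http.Handle("/health", HealthHandler(check))`, where `check` returns the most recent result instead of probing the database on every request.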
38+
39+
## Backend Service changes
40+
A new service (e.g. `HealthChecker`) will be introduced to periodically check the health of critical components:
41+
- **Database Connectivity**: Ensure the database is reachable and responsive, e.g. by running a simple query or checking logs for errors.
42+
- **Worker Health**: Monitor the number of healthy workers and their responsiveness. The desired worker count is already known, so a simple threshold property (e.g. 80% of desired workers) is enough to determine whether the system has enough registered workers to handle load. The threshold can be derived from the update strategy (e.g. rolling updates might temporarily tolerate fewer workers, depending on the configured *in parallel/max in flight* count).
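
The two checks above could be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the intended implementation: `Pinger`, `WorkerCounter`, `Snapshot` and the method names are hypothetical (`*sql.DB` happens to satisfy `Pinger` through its `PingContext` method), and `RollingUpdateThresholdPercent` is only one possible reading of "calculated based on the update strategy", namely that up to `maxInFlight` workers may be missing at once. Treating a zero desired count as unhealthy is likewise an assumption.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// Pinger is a hypothetical seam for the database check.
type Pinger interface {
	PingContext(ctx context.Context) error
}

// WorkerCounter is a hypothetical source of worker counts.
type WorkerCounter interface {
	Counts(ctx context.Context) (healthy, desired int, err error)
}

// WorkersHealthy reports whether at least thresholdPercent of the
// desired workers are healthy. Integer math avoids float rounding.
// Assumption: no desired workers counts as unhealthy.
func WorkersHealthy(healthy, desired, thresholdPercent int) bool {
	if desired <= 0 {
		return false
	}
	return healthy*100 >= desired*thresholdPercent
}

// RollingUpdateThresholdPercent tolerates up to maxInFlight workers
// being absent during a rolling update (an assumed interpretation).
func RollingUpdateThresholdPercent(desired, maxInFlight int) int {
	if desired <= 0 || maxInFlight >= desired {
		return 0
	}
	return (desired - maxInFlight) * 100 / desired
}

// Snapshot is the result of one round of checks.
type Snapshot struct {
	DatabaseOK bool
	WorkersOK  bool
}

// HealthChecker periodically runs the checks and caches the result.
type HealthChecker struct {
	DB               Pinger
	Workers          WorkerCounter
	ThresholdPercent int

	mu   sync.RWMutex
	last Snapshot
}

// CheckOnce runs both checks and stores the outcome.
func (h *HealthChecker) CheckOnce(ctx context.Context) Snapshot {
	s := Snapshot{DatabaseOK: h.DB.PingContext(ctx) == nil}
	if healthy, desired, err := h.Workers.Counts(ctx); err == nil {
		s.WorkersOK = WorkersHealthy(healthy, desired, h.ThresholdPercent)
	}
	h.mu.Lock()
	h.last = s
	h.mu.Unlock()
	return s
}

// Latest returns the most recently stored snapshot.
func (h *HealthChecker) Latest() Snapshot {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.last
}

// Run checks immediately, then once per interval until ctx is done.
func (h *HealthChecker) Run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		h.CheckOnce(ctx)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```

Caching the snapshot means the endpoint can answer from `Latest()` without touching the database per request, which matters for an unauthenticated endpoint that probes may hit frequently.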
43+
44+
## Alternatives
45+
* Solutions like [SLI runner](https://github.com/cirocosta/slirunner) could potentially be leveraged for health checking in Concourse, but they require SLA suites and additional configuration and are much more granular. The proposal here is a simple, out-of-the-box health endpoint for basic high-level health checks on a standard Concourse deployment. People can always build on top of it for more complex use cases.
46+
* Extending the payload of the `/info` endpoint to include a health JSON object is another alternative, but that endpoint is about static information on the Concourse instance rather than its dynamic health state.
47+
48+
# Open Questions
49+
- I think it wouldn't require many changes to the existing infrastructure, but would it be better to run a dedicated microservice for this or to integrate it into the existing ATC service? Based on previous approaches and discussions, the idea is to keep it within the ATC.
50+
- Should we have a `degraded` state for the workers, where the number is below the expected count but not zero? I think that could be added in the future; as a start, a simple healthy/unhealthy state should suffice.
51+
52+
# Answered Questions
53+
... TBD ...
54+
55+
# New Implications
56+
I do not see any immediate negative implications of this change; rather, it would improve the overall reliability and operability of Concourse in production environments.