Skip to content

feat: /health endpoints#7727

Merged
rubenfiszel merged 11 commits intomainfrom
alp/health_endpoint
Feb 12, 2026
Merged

feat: /health endpoints#7727
rubenfiszel merged 11 commits intomainfrom
alp/health_endpoint

Conversation

@alpetric
Copy link
Copy Markdown
Collaborator

/api/health (simple, no auth optional HEALTH_CHECK_SECRET env):

{
  "status": "healthy",
  "timestamp": "2026-01-28T18:40:16.008701263Z",
  "database_healthy": true,
  "workers_alive": 1
}

/api/health/detailed (authed):

{
  "status": "healthy",
  "timestamp": "2026-01-28T18:40:22.585047186Z",
  "version": "EE v1.615.3-6-g81594b8b56",
  "checks": {
    "database": {
      "healthy": true,
      "latency_ms": 0,
      "pool": {"size": 15, "idle": 13, "max_connections": 50}
    },
    "workers": {
      "healthy": true,
      "active_count": 1,
      "worker_groups": ["default"],
      "min_version": "v1.420.0",
      "versions": ["v1.615.3-6-g81594b8b56"]
    },
    "queue": {"pending_jobs": 8, "running_jobs": 32},
    "readiness": {"healthy": true}
  }
}

@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented Jan 28, 2026

Deploying windmill with  Cloudflare Pages  Cloudflare Pages

Latest commit: a19b774
Status:⚡️  Build in progress...

View logs

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Jan 28, 2026

Claude finished @alpetric's task in 2m 56s —— View job


PR Review Complete

  • Analyze PR changes and context
  • Review health.rs implementation
  • Review OpenAPI specification changes
  • Review lib.rs integration
  • Check for security considerations
  • Check for potential bugs
  • Provide feedback and testing instructions

Summary

This PR adds two new health check endpoints to Windmill:

  • /api/health - Simple health check for load balancers/k8s probes (optional static secret auth)
  • /api/health/detailed - Detailed health check requiring authentication

Overall: The implementation is well-structured and follows existing patterns in the codebase. The separation of public (simple) and authenticated (detailed) endpoints is a good design choice.

Key Observations

Strengths:

  • Clean separation of concerns with public and authenticated services
  • Good use of existing queue counting functions
  • Proper OpenAPI documentation with all response schemas
  • Appropriate HTTP status codes (200 for healthy/degraded, 503 for unhealthy)
  • Timeout protection on database checks (5 second limit)

Items to Address:

  1. Security (Important): The secret comparison at health.rs:245 uses direct string comparison (!=) which is vulnerable to timing attacks. See inline comment for fix.

  2. Simplification opportunity: The HEALTH_CHECK_SECRET uses Arc<RwLock<Option<String>>> for an environment variable that won't change at runtime. A simpler Option<String> would suffice.

  3. Minor null handling: The wm_version field in worker_ping could potentially be null for older workers. The code filters empty strings but assumes the field is always present.

Testing Instructions

Simple health endpoint (/api/health):

Navigate to the Windmill instance and call GET /api/health without authentication. The endpoint should return:

  • Status 200 with "status": "healthy" when database is up and workers are active
  • Status 200 with "status": "degraded" when database is up but no workers are active
  • Status 503 with "status": "unhealthy" when database is unreachable

To test secret authentication: Set HEALTH_CHECK_SECRET=mysecret environment variable, restart the server, and verify that requests without the X-Health-Secret: mysecret header return 401 Unauthorized.

Detailed health endpoint (/api/health/detailed):

While logged into Windmill, call GET /api/health/detailed (requires authentication via bearer token or session cookie). The response should include database pool stats, worker information with groups and versions, queue counts (pending and running jobs), and readiness status.

Comment thread backend/windmill-api/src/health.rs Outdated
Extension(db): Extension<DB>,
) -> impl IntoResponse {
// Check static secret if configured
let secret = HEALTH_CHECK_SECRET.read().await;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using a constant‐time comparison when validating the X-Health-Secret header to reduce any potential timing side‐channels.

Comment thread backend/windmill-api/src/health.rs Outdated

// ============ Check Functions ============

async fn check_database_healthy(db: &DB) -> bool {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider logging or handling database errors explicitly in check_database_healthy rather than silently mapping failures to false.


let active_count = workers.len() as i64;

let worker_groups: Vec<String> = workers
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collecting worker_groups and versions into a HashSet does not preserve order. If order matters, consider sorting the results.

Comment thread backend/windmill-api/src/lib.rs Outdated
.nest("/ai", ai::global_service())
.nest("/inkeep", inkeep_oss::global_service())
.nest("/mcp/w/:workspace_id/list_tools", mcp_list_tools_service)
.nest("/health/detailed", health::authed_service())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure that the detailed health endpoint (/health/detailed) is prioritized correctly to avoid potential routing conflicts with the public endpoint.

Comment thread backend/windmill-api/src/lib.rs Outdated
}
})
.route("/version", get(git_v))
.nest("/health", health::public_service())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verify that the public health endpoint (/health) does not conflict with more specific subroutes like /health/detailed.

Comment thread backend/windmill-api/src/health.rs Outdated
.get("X-Health-Secret")
.and_then(|v| v.to_str().ok());

if provided_secret != Some(expected_secret.as_str()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Concern: The secret comparison using != is vulnerable to timing attacks. When HEALTH_CHECK_SECRET is configured to protect this endpoint, an attacker could potentially extract the secret character-by-character by measuring response times.

Consider using constant-time comparison instead:

Suggested change
if provided_secret != Some(expected_secret.as_str()) {
// Use constant-time comparison to prevent timing attacks
let secrets_match = provided_secret
.map(|p| p.len() == expected_secret.len() &&
p.bytes().zip(expected_secret.bytes()).fold(0u8, |acc, (a, b)| acc | (a ^ b)) == 0)
.unwrap_or(false);
if !secrets_match {

Alternatively, you could add constant_time_eq as a dependency (it's already used in the http_trigger feature) for a cleaner solution.

let active_count = workers.len() as i64;

let worker_groups: Vec<String> = workers
.iter()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: The .clone() here could potentially be avoided by using a reference, though the performance impact is minimal since worker groups are typically few and short strings:

let worker_groups: Vec<String> = workers
    .iter()
    .filter_map(|w| w.worker_group.as_ref())
    .collect::<HashSet<_>>()
    .into_iter()
    .cloned()
    .collect();

However, this is a minor optimization - the current implementation is clear and readable.

let pending_counts = windmill_common::queue::get_queue_counts(db).await;
let running_counts = windmill_common::queue::get_queue_running_counts(db).await;

let pending_jobs: u64 = pending_counts.values().map(|&v| v as u64).sum();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor type consideration: The get_queue_counts and get_queue_running_counts functions return HashMap<String, u32>, but you're casting to u64 here. This is safe, but you might want to keep consistency with the OpenAPI spec which shows int64 for these fields.

The cast v as u64 is fine since u32 always fits in u64, but just noting for awareness.


/// Detailed health check - requires DB authentication
async fn health_detailed(
_authed: ApiAuthed,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice: Good use of the _authed prefix to indicate the parameter is used only for authentication extraction but not otherwise accessed. This makes the authentication requirement clear while showing the value itself isn't needed.

Comment thread backend/windmill-api/src/health.rs Outdated
Comment on lines +29 to +30
lazy_static::lazy_static! {
pub static ref HEALTH_CHECK_SECRET: Arc<RwLock<Option<String>>> =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Using lazy_static with RwLock for an environment variable that won't change at runtime seems like overkill. The secret is read from the environment once at startup.

Consider simplifying to:

lazy_static::lazy_static! {
    pub static ref HEALTH_CHECK_SECRET: Option<String> = std::env::var("HEALTH_CHECK_SECRET").ok();
}

This would also eliminate the need for the async read lock in the handler. The RwLock pattern would only be needed if you wanted to support hot-reloading the secret, which doesn't seem like a requirement.

Comment on lines +182 to +183
let versions: Vec<String> = workers
.iter()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: The wm_version field in worker_ping is character varying, which in Rust maps to String. The query returns it directly, so the w.wm_version.clone() should work. However, checking the schema, wm_version could potentially be null for old workers. The query might panic if a null value is encountered.

Looking at the query result type, this should be handled correctly by sqlx, but you might want to filter null versions:

let versions: Vec<String> = workers
    .iter()
    .filter_map(|w| {
        let v = &w.wm_version;
        if !v.is_empty() { Some(v.clone()) } else { None }
    })
    .collect::<HashSet<_>>()
    .into_iter()
    .collect();

The current code handles empty strings but assumes wm_version is always present.

operationId: getHealthSimple
tags:
- settings
security: []
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good: Correctly specified security: [] to indicate this endpoint doesn't require authentication (when HEALTH_CHECK_SECRET is not set). This allows the simple health endpoint to work with Kubernetes probes without auth.

Comment thread backend/windmill-api/src/lib.rs Outdated
.nest("/ai", ai::global_service())
.nest("/inkeep", inkeep_oss::global_service())
.nest("/mcp/w/:workspace_id/list_tools", mcp_list_tools_service)
.nest("/health/detailed", health::authed_service())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good placement: The detailed health endpoint is correctly placed inside the authenticated route layer (route_layer(from_extractor::<ApiAuthed>())), ensuring it requires proper authentication.

Comment thread backend/windmill-api/src/lib.rs Outdated
}
})
.route("/version", get(git_v))
.nest("/health", health::public_service())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good placement: The public health endpoint is correctly placed outside the authentication layers, allowing it to be accessed without credentials (useful for k8s probes and load balancers).

@alpetric
Copy link
Copy Markdown
Collaborator Author

/updatesqlx

@windmill-internal-app
Copy link
Copy Markdown
Contributor

Starting sqlx update...

View workflow run

@windmill-internal-app
Copy link
Copy Markdown
Contributor

Successfully ran sqlx update

Comment thread backend/windmill-api/src/health.rs Outdated
if METRICS_ENABLED.load(std::sync::atomic::Ordering::Relaxed) {
Some(prometheus::register_gauge!(
"health_status",
"Health status: 1=healthy, 0.5=degraded, 0=unhealthy"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not think it's a good idea. You need health_status_phase with 3 labels, healthy, degraded, unhealty. Otherwise won't be usable by our alerts

Comment thread backend/windmill-api/src/health.rs Outdated
static ref HEALTH_DATABASE_HEALTHY: Option<prometheus::Gauge> =
if METRICS_ENABLED.load(std::sync::atomic::Ordering::Relaxed) {
Some(prometheus::register_gauge!(
"health_database_healthy",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are multiple servers/workers so this could have a confusing name, I would provide the following:

health_database_latency (gauge)
health_database_unresponsive (1/0)
health_database_errors (counter) this one require a bit more work and you may not want to do it in this version but should be incremented anytime there is an error that's related to PoolTimeout, etc. It can probably be done. There would be one label per type of error

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also add the pool size metrics

Comment thread backend/windmill-api/src/health.rs Outdated
Comment on lines +64 to +73
static ref HEALTH_WORKERS_ALIVE: Option<prometheus::Gauge> =
if METRICS_ENABLED.load(std::sync::atomic::Ordering::Relaxed) {
Some(prometheus::register_gauge!(
"health_workers_alive",
"Number of workers alive"
).unwrap())
} else {
None
};
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove there are other ways to get that metric and given there are multiple servers it will be more confusing than helpful

@rubenfiszel rubenfiszel merged commit 7df4aa4 into main Feb 12, 2026
7 of 9 checks passed
@rubenfiszel rubenfiszel deleted the alp/health_endpoint branch February 12, 2026 10:53
@github-actions github-actions Bot locked and limited conversation to collaborators Feb 12, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants