Conversation
|
Claude finished @alpetric's task in 2m 56s —— View job PR Review Complete
SummaryThis PR adds two new health check endpoints to Windmill:
Overall: The implementation is well-structured and follows existing patterns in the codebase. The separation of public (simple) and authenticated (detailed) endpoints is a good design choice. Key ObservationsStrengths:
Items to Address:
Testing InstructionsSimple health endpoint ( Navigate to the Windmill instance and call
To test secret authentication: Set Detailed health endpoint ( While logged into Windmill, call |
| Extension(db): Extension<DB>, | ||
| ) -> impl IntoResponse { | ||
| // Check static secret if configured | ||
| let secret = HEALTH_CHECK_SECRET.read().await; |
There was a problem hiding this comment.
Consider using a constant‐time comparison when validating the X-Health-Secret header to reduce any potential timing side‐channels.
|
|
||
| // ============ Check Functions ============ | ||
|
|
||
| async fn check_database_healthy(db: &DB) -> bool { |
There was a problem hiding this comment.
Consider logging or handling database errors explicitly in check_database_healthy rather than silently mapping failures to false.
|
|
||
| let active_count = workers.len() as i64; | ||
|
|
||
| let worker_groups: Vec<String> = workers |
There was a problem hiding this comment.
Collecting worker_groups and versions into a HashSet does not preserve order. If order matters, consider sorting the results.
| .nest("/ai", ai::global_service()) | ||
| .nest("/inkeep", inkeep_oss::global_service()) | ||
| .nest("/mcp/w/:workspace_id/list_tools", mcp_list_tools_service) | ||
| .nest("/health/detailed", health::authed_service()) |
There was a problem hiding this comment.
Ensure that the detailed health endpoint (/health/detailed) is prioritized correctly to avoid potential routing conflicts with the public endpoint.
| } | ||
| }) | ||
| .route("/version", get(git_v)) | ||
| .nest("/health", health::public_service()) |
There was a problem hiding this comment.
Verify that the public health endpoint (/health) does not conflict with more specific subroutes like /health/detailed.
| .get("X-Health-Secret") | ||
| .and_then(|v| v.to_str().ok()); | ||
|
|
||
| if provided_secret != Some(expected_secret.as_str()) { |
There was a problem hiding this comment.
Security Concern: The secret comparison using != is vulnerable to timing attacks. When HEALTH_CHECK_SECRET is configured to protect this endpoint, an attacker could potentially extract the secret character-by-character by measuring response times.
Consider using constant-time comparison instead:
| if provided_secret != Some(expected_secret.as_str()) { | |
| // Use constant-time comparison to prevent timing attacks | |
| let secrets_match = provided_secret | |
| .map(|p| p.len() == expected_secret.len() && | |
| p.bytes().zip(expected_secret.bytes()).fold(0u8, |acc, (a, b)| acc | (a ^ b)) == 0) | |
| .unwrap_or(false); | |
| if !secrets_match { |
Alternatively, you could add constant_time_eq as a dependency (it's already used in the http_trigger feature) for a cleaner solution.
| let active_count = workers.len() as i64; | ||
|
|
||
| let worker_groups: Vec<String> = workers | ||
| .iter() |
There was a problem hiding this comment.
Minor: The .clone() here could potentially be avoided by using a reference, though the performance impact is minimal since worker groups are typically few and short strings:
let worker_groups: Vec<String> = workers
.iter()
.filter_map(|w| w.worker_group.as_ref())
.collect::<HashSet<_>>()
.into_iter()
.cloned()
.collect();However, this is a minor optimization - the current implementation is clear and readable.
| let pending_counts = windmill_common::queue::get_queue_counts(db).await; | ||
| let running_counts = windmill_common::queue::get_queue_running_counts(db).await; | ||
|
|
||
| let pending_jobs: u64 = pending_counts.values().map(|&v| v as u64).sum(); |
There was a problem hiding this comment.
Minor type consideration: The get_queue_counts and get_queue_running_counts functions return HashMap<String, u32>, but you're casting to u64 here. This is safe, but you might want to keep consistency with the OpenAPI spec which shows int64 for these fields.
The cast v as u64 is fine since u32 always fits in u64, but just noting for awareness.
|
|
||
| /// Detailed health check - requires DB authentication | ||
| async fn health_detailed( | ||
| _authed: ApiAuthed, |
There was a problem hiding this comment.
Nice: Good use of the _authed prefix to indicate the parameter is used only for authentication extraction but not otherwise accessed. This makes the authentication requirement clear while showing the value itself isn't needed.
| lazy_static::lazy_static! { | ||
| pub static ref HEALTH_CHECK_SECRET: Arc<RwLock<Option<String>>> = |
There was a problem hiding this comment.
Question: Using lazy_static with RwLock for an environment variable that won't change at runtime seems like overkill. The secret is read from the environment once at startup.
Consider simplifying to:
lazy_static::lazy_static! {
pub static ref HEALTH_CHECK_SECRET: Option<String> = std::env::var("HEALTH_CHECK_SECRET").ok();
}This would also eliminate the need for the async read lock in the handler. The RwLock pattern would only be needed if you wanted to support hot-reloading the secret, which doesn't seem like a requirement.
| let versions: Vec<String> = workers | ||
| .iter() |
There was a problem hiding this comment.
Note: The wm_version field in worker_ping is character varying, which in Rust maps to String. The query returns it directly, so the w.wm_version.clone() should work. However, checking the schema, wm_version could potentially be null for old workers. The query might panic if a null value is encountered.
Looking at the query result type, this should be handled correctly by sqlx, but you might want to filter null versions:
let versions: Vec<String> = workers
.iter()
.filter_map(|w| {
let v = &w.wm_version;
if !v.is_empty() { Some(v.clone()) } else { None }
})
.collect::<HashSet<_>>()
.into_iter()
.collect();The current code handles empty strings but assumes wm_version is always present.
| operationId: getHealthSimple | ||
| tags: | ||
| - settings | ||
| security: [] |
There was a problem hiding this comment.
Good: Correctly specified security: [] to indicate this endpoint doesn't require authentication (when HEALTH_CHECK_SECRET is not set). This allows the simple health endpoint to work with Kubernetes probes without auth.
| .nest("/ai", ai::global_service()) | ||
| .nest("/inkeep", inkeep_oss::global_service()) | ||
| .nest("/mcp/w/:workspace_id/list_tools", mcp_list_tools_service) | ||
| .nest("/health/detailed", health::authed_service()) |
There was a problem hiding this comment.
Good placement: The detailed health endpoint is correctly placed inside the authenticated route layer (route_layer(from_extractor::<ApiAuthed>())), ensuring it requires proper authentication.
| } | ||
| }) | ||
| .route("/version", get(git_v)) | ||
| .nest("/health", health::public_service()) |
There was a problem hiding this comment.
Good placement: The public health endpoint is correctly placed outside the authentication layers, allowing it to be accessed without credentials (useful for k8s probes and load balancers).
|
/updatesqlx |
|
Starting sqlx update... |
|
Successfully ran sqlx update |
| if METRICS_ENABLED.load(std::sync::atomic::Ordering::Relaxed) { | ||
| Some(prometheus::register_gauge!( | ||
| "health_status", | ||
| "Health status: 1=healthy, 0.5=degraded, 0=unhealthy" |
There was a problem hiding this comment.
do not think it's a good idea. You need health_status_phase with 3 labels, healthy, degraded, unhealty. Otherwise won't be usable by our alerts
| static ref HEALTH_DATABASE_HEALTHY: Option<prometheus::Gauge> = | ||
| if METRICS_ENABLED.load(std::sync::atomic::Ordering::Relaxed) { | ||
| Some(prometheus::register_gauge!( | ||
| "health_database_healthy", |
There was a problem hiding this comment.
there are multiple servers/workers so this could have a confusing name, I would provide the following:
health_database_latency (gauge)
health_database_unresponsive (1/0)
health_database_errors (counter) this one require a bit more work and you may not want to do it in this version but should be incremented anytime there is an error that's related to PoolTimeout, etc. It can probably be done. There would be one label per type of error
There was a problem hiding this comment.
can also add the pool size metrics
| static ref HEALTH_WORKERS_ALIVE: Option<prometheus::Gauge> = | ||
| if METRICS_ENABLED.load(std::sync::atomic::Ordering::Relaxed) { | ||
| Some(prometheus::register_gauge!( | ||
| "health_workers_alive", | ||
| "Number of workers alive" | ||
| ).unwrap()) | ||
| } else { | ||
| None | ||
| }; | ||
| } |
There was a problem hiding this comment.
remove there are other ways to get that metric and given there are multiple servers it will be more confusing than helpful
/api/health (simple, no auth optional HEALTH_CHECK_SECRET env):
/api/health/detailed (authed):