Webhook TLS certificate rotation uses monotonic time, but cert validity uses wall-clock time #1174

@lfrancke

Description

Summary

The webhook TLS certificate rotation interval is scheduled using tokio::time::Instant (monotonic clock), but certificate validity is based on wall-clock time (SystemTime via x509_cert::time::Validity::from_now). When these two clocks diverge — e.g. during system hibernation, VM live migration, or cgroup freezing — the certificate can expire before the rotation fires.

Root Cause

In crates/stackable-webhook/src/tls/mod.rs (lines 156–157), the rotation is scheduled using monotonic time:

let start = tokio::time::Instant::now() + *WEBHOOK_CERTIFICATE_ROTATION_INTERVAL;
let mut interval = tokio::time::interval_at(start, *WEBHOOK_CERTIFICATE_ROTATION_INTERVAL);

The constants are:

  • WEBHOOK_CERTIFICATE_LIFETIME: 24 hours (wall-clock)
  • WEBHOOK_CERTIFICATE_ROTATION_INTERVAL: 20 hours (monotonic)

This gives a 4-hour buffer — but only if the monotonic and wall clocks stay in sync.
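To make the buffer arithmetic concrete, here is a minimal sketch of how suspended time eats into it. It assumes the monotonic timer does not advance while the process is suspended. The constant and function names are stand-ins for illustration, not the ones used in stackable-webhook.

```rust
use std::time::Duration;

// Stand-ins for WEBHOOK_CERTIFICATE_LIFETIME and
// WEBHOOK_CERTIFICATE_ROTATION_INTERVAL (values taken from this issue).
const LIFETIME: Duration = Duration::from_secs(24 * 60 * 60);
const ROTATION_INTERVAL: Duration = Duration::from_secs(20 * 60 * 60);

/// Wall-clock time (measured from certificate creation) at which the
/// monotonic rotation timer fires, if the process spent `suspended`
/// in a state where the monotonic clock did not advance.
fn wall_clock_rotation_time(suspended: Duration) -> Duration {
    ROTATION_INTERVAL + suspended
}

/// True if the certificate's wall-clock validity ends before the
/// monotonic timer gets a chance to rotate it.
fn expires_before_rotation(suspended: Duration) -> bool {
    wall_clock_rotation_time(suspended) > LIFETIME
}

fn main() {
    for hours in [0u64, 3, 4, 5] {
        let suspended = Duration::from_secs(hours * 60 * 60);
        println!(
            "suspended {hours}h -> cert expires before rotation: {}",
            expires_before_rotation(suspended)
        );
    }
}
```

Any total suspension longer than 4 hours within one rotation period leaves the server with an expired certificate until the timer finally fires.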

Meanwhile, in crates/stackable-webhook/src/tls/cert_resolver.rs (line 125), the leaf certificate is generated with a wall-clock-based validity:

let certificate_pair = ca
    .generate_ecdsa_leaf_certificate(
        "Leaf",
        "webhook",
        subject_alterative_dns_names.iter().map(|san| san.as_str()),
        WEBHOOK_CERTIFICATE_LIFETIME,  // 24h wall-clock validity
    )

This in turn calls Validity::from_now(*validity) in crates/stackable-certs/src/ca/mod.rs (line 316), which uses SystemTime.

How to Reproduce

  1. Deploy the secret-operator (or any operator using stackable-webhook)
  2. Hibernate / suspend the host machine for > 4 hours
  3. Resume the machine
  4. Try to apply a resource that triggers the conversion webhook (e.g. a SecretClass using v1alpha1)

The monotonic clock is paused during hibernation, so from tokio's perspective the 20h rotation interval has not yet elapsed. Wall-clock time, however, has advanced past the cert's 24h not_after.

Observed Behavior

The conversion webhook fails with:

tls: failed to verify certificate: x509: certificate has expired or is not yet valid:
current time 2026-03-11T22:32:22Z is after 2026-03-11T21:07:50Z

The operator logs show only the initial certificate rotation at startup — no subsequent rotation occurred despite 25+ hours of wall-clock uptime. The TLS server continues serving the expired certificate.

Additional observations from diagnostics:

  • The caBundle in both the SecretClass and TrustStore CRDs contains the same expired leaf certificate
  • Existing resources stored as the current storage version (v1alpha2) can still be read (no conversion needed)
  • Applying resources using the storage version directly (v1alpha2) still works — only the v1alpha1 → v1alpha2 conversion path is broken
  • The webhook TLS server is still running and accepting connections, just with an expired cert

Impact

Any operation that triggers a CRD conversion webhook call will fail. This affects:

  • Creating/updating resources using non-storage API versions (e.g. v1alpha1 when storage is v1alpha2)
  • Any client (including test frameworks) that defaults to older API versions

Scenarios Where This Could Happen on Real Clusters

While developer laptop hibernation is the easiest trigger, this could also affect production:

  • VM live migration: Cloud providers (AWS, GCP, Azure) pause VMs during host migration. The pause is usually brief, but if earlier pauses have already used up most of the 4h buffer, even a short one could be enough.
  • Spot/preemptible instance suspension: Some providers freeze rather than terminate spot instances.
  • cgroup freezer: Container runtimes or node management tools can freeze processes, pausing monotonic clocks.
  • Large NTP clock corrections: If the wall clock jumps forward significantly (e.g. after a clock desync), the cert expires instantly while the monotonic timer hasn't advanced.
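  • VM snapshot/restore: DR scenarios where a VM is restored from an older snapshot. The wall clock is typically resynchronized to the current time after the restore, while the monotonic timer resumes from the state captured in the snapshot.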

Suggested Fix

Add a wall-clock-based check alongside the monotonic interval in TlsServer::run(). For example, add a short periodic check (e.g. every few minutes) that compares SystemTime::now() against the current certificate's not_after, and triggers early rotation if the cert is within some threshold of expiry (or already expired).

This could be done by storing the not_after time from the current cert in the CertificateResolver and exposing a method like is_rotation_needed() that checks wall-clock time.
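A minimal sketch of such a check, using only the standard library. The CertificateExpiry type, its not_after field, and the threshold parameter are hypothetical; the real CertificateResolver would need to extract not_after from the generated certificate, and the signature of is_rotation_needed() is only a suggestion.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical holder for the expiry of the certificate currently
/// being served (in practice this could live in CertificateResolver).
struct CertificateExpiry {
    not_after: SystemTime,
}

impl CertificateExpiry {
    /// Returns true if, at wall-clock time `now`, the certificate has
    /// already expired or will expire within `threshold`.
    fn is_rotation_needed(&self, now: SystemTime, threshold: Duration) -> bool {
        match self.not_after.duration_since(now) {
            // `now` is at or before `not_after`: rotate if little time is left.
            Ok(remaining) => remaining <= threshold,
            // `now` is after `not_after`: the certificate is already expired.
            Err(_) => true,
        }
    }
}

fn main() {
    let issued = SystemTime::now();
    let expiry = CertificateExpiry {
        not_after: issued + Duration::from_secs(24 * 60 * 60),
    };
    // Assumed threshold: rotate once less than 4h of validity remain.
    let threshold = Duration::from_secs(4 * 60 * 60);

    for hours in [1u64, 21, 25] {
        let now = issued + Duration::from_secs(hours * 60 * 60);
        println!(
            "{hours}h after issuance -> rotation needed: {}",
            expiry.is_rotation_needed(now, threshold)
        );
    }
}
```

In TlsServer::run() this check could be driven by a second, short interval next to the existing 20h one. The short interval is still monotonic, but because it fires soon after a resume and compares against SystemTime::now(), the window in which an expired certificate is served would shrink to roughly the check period.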

Affected Code

  • crates/stackable-webhook/src/tls/mod.rs — rotation scheduling in TlsServer::run()
  • crates/stackable-webhook/src/tls/cert_resolver.rs — certificate generation and storage
