
Fix rolling update deadlock when pods are stuck in non-running state #3051

Open
annielzy wants to merge 4 commits into zalando:master from
annielzy:fix/break-rolling-update-deadlock-on-non-running-pods

@annielzy annielzy commented Mar 6, 2026

Problem

The postgres-operator can enter a permanent deadlock state where pods that need
recreation are never recreated because the Patroni API safety checks fail on
non-running pods.

This occurs when:

  1. The operator is upgraded and updates the StatefulSet template (e.g. changing
    a secret reference from postgres-operator-<VERSION A>-secret to
    postgres-operator-<VERSION B>-secret)
  2. The operator marks pods with the zalando-postgres-operator-rolling-update-required
    annotation
  3. The operator crashes or restarts before it can execute recreatePods to
    complete the rolling update
  4. The old secret is removed as part of the operator upgrade
  5. The pod restarts and enters CreateContainerConfigError because the old secret
    no longer exists
  6. On subsequent syncs, syncPatroniConfig and restartInstances fail because
    they cannot reach the Patroni API on the non-running pod
  7. These failures set isSafeToRecreatePods = false, which blocks recreatePods
  8. The pod can only be fixed by recreation, but recreation is blocked: a deadlock

This affects both single-node and multi-node clusters. In a 3-node cluster, one
broken pod blocks the rolling update of all pods, including the healthy ones.

Root Cause

syncStatefulSet unconditionally sets isSafeToRecreatePods = false when
syncPatroniConfig or restartInstances returns any error. It does not
distinguish between:

  • A genuinely unhealthy Patroni cluster where recreation would be risky
  • Expected errors from non-running pods that can never be reached via the API

Fix

pod.go

  • Add podIsNotRunning() helper that checks if a pod is stuck in a non-running
    state (e.g. CreateContainerConfigError, CrashLoopBackOff, ImagePullBackOff,
    terminated containers, non-Running phase)
  • Add allPodsRunning() helper used by syncStatefulSet to determine if Patroni
    API errors are expected
  • In recreatePods, skip the switchover attempt when the master pod is not running,
    since the Patroni API is unreachable and Patroni has likely already triggered
    automatic failover
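
For readers without the diff at hand, the two helpers could look roughly like the sketch below. It is not the code from this PR: the types are minimal stand-ins for the corev1 pod status types so that the snippet compiles on its own, and returning true for an empty pod list is an assumption of this sketch, not something the description states.

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes corev1 types (assumption: only the
// fields this sketch reads are modeled).
type ContainerStateWaiting struct{ Reason string }
type ContainerStateRunning struct{}
type ContainerStateTerminated struct{ ExitCode int32 }

type ContainerState struct {
	Waiting    *ContainerStateWaiting
	Running    *ContainerStateRunning
	Terminated *ContainerStateTerminated
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

type Pod struct {
	Name              string
	Phase             string // "Pending", "Running", "Failed", ...
	ContainerStatuses []ContainerStatus
}

// podIsNotRunning reports whether the pod is outside the Running phase or
// has any container that is waiting (e.g. CreateContainerConfigError,
// CrashLoopBackOff, ImagePullBackOff) or terminated.
func podIsNotRunning(p *Pod) bool {
	if p.Phase != "Running" {
		return true
	}
	for _, cs := range p.ContainerStatuses {
		if cs.State.Waiting != nil || cs.State.Terminated != nil {
			return true
		}
	}
	return false
}

// allPodsRunning reports whether no pod in the list is stuck.
// An empty list yields true (a choice made for this sketch).
func allPodsRunning(pods []Pod) bool {
	for i := range pods {
		if podIsNotRunning(&pods[i]) {
			return false
		}
	}
	return true
}

func main() {
	broken := Pod{Name: "pg-0", Phase: "Pending", ContainerStatuses: []ContainerStatus{
		{Name: "postgres", State: ContainerState{Waiting: &ContainerStateWaiting{Reason: "CreateContainerConfigError"}}},
	}}
	healthy := Pod{Name: "pg-1", Phase: "Running", ContainerStatuses: []ContainerStatus{
		{Name: "postgres", State: ContainerState{Running: &ContainerStateRunning{}}},
	}}
	fmt.Println(podIsNotRunning(&broken), podIsNotRunning(&healthy), allPodsRunning([]Pod{broken, healthy}))
}
```

The check is deliberately coarse: any waiting or terminated container marks the whole pod as not running, which covers the mixed container state case listed under Testing.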

sync.go

  • When syncPatroniConfig or restartInstances fails, only set
    isSafeToRecreatePods = false if all pods are actually running. If some pods
    are not running, the Patroni API errors are expected and should not block the
    pod recreation that would fix them.

Behavior change summary

| Scenario | Before | After |
|---|---|---|
| All pods running, Patroni sync fails | Postpone ✅ | Postpone ✅ (unchanged) |
| 1-node: master in CreateContainerConfigError | Deadlock ❌ | Recreate master ✅ |
| 3-node: 1 replica broken, others healthy | All 3 postponed ❌ | All 3 recreated via normal rolling update ✅ |
| 3-node: master broken, replicas healthy | All 3 postponed ❌ | Skip switchover, recreate all ✅ |
| 3-node: all pods broken | All postponed ❌ | Recreate all ✅ |
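
The "skip switchover" row can be illustrated with a small planning function. This is a sketch under assumptions: it presumes that recreatePods handles replicas first and the master last, with a switchover in between when replicas exist, and the podInfo type and planRecreation name are hypothetical.

```go
package main

import "fmt"

// podInfo is a hypothetical, reduced view of a cluster pod.
type podInfo struct {
	Name     string
	IsMaster bool
	Running  bool
}

// planRecreation lists the steps in order: replicas first, then a
// switchover away from the master (only if the master is running and a
// replica exists to take over), then the master itself.
func planRecreation(pods []podInfo) []string {
	var steps []string
	var master *podInfo
	replicas := 0
	for i := range pods {
		if pods[i].IsMaster {
			master = &pods[i]
			continue
		}
		replicas++
		steps = append(steps, "recreate "+pods[i].Name)
	}
	if master == nil {
		return steps
	}
	// A non-running master cannot serve the Patroni API, so the
	// switchover request would fail; skip it and recreate directly.
	if replicas > 0 && master.Running {
		steps = append(steps, "switchover from "+master.Name)
	}
	return append(steps, "recreate "+master.Name)
}

func main() {
	pods := []podInfo{
		{Name: "pg-0", IsMaster: true, Running: false},
		{Name: "pg-1", Running: true},
		{Name: "pg-2", Running: true},
	}
	for _, step := range planRecreation(pods) {
		fmt.Println(step)
	}
}
```

For a single-node cluster the plan degenerates to recreating the master, which matches the 1-node row of the table.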

Testing

  • Added TestPodIsNotRunning — verifies detection of various non-running states
    (CreateContainerConfigError, CrashLoopBackOff, ImagePullBackOff, terminated
    containers, mixed container states, pending/failed phase)
  • Added TestAllPodsRunning — verifies correct behavior with all-running,
    mixed, all-broken, and empty pod lists

@zalando-robot

Cannot start a pipeline due to:

No accountable user for this pipeline: no Zalando employee associated to this GitHub username

Click on pipeline status check Details link below for more information.

@annielzy annielzy marked this pull request as draft March 6, 2026 19:37

@annielzy annielzy marked this pull request as ready for review March 6, 2026 21:43