Fix rolling update deadlock when pods are stuck in non-running state #3051
Open
annielzy wants to merge 4 commits into zalando:master
Problem
The postgres-operator can enter a permanent deadlock state where pods that need
recreation are never recreated because the Patroni API safety checks fail on
non-running pods.
This occurs when:
secret reference from postgres-operator-<VERSION A>-secret to postgres-operator-<VERSION B>-secret)
zalando-postgres-operator-rolling-update-required annotation
recreatePods to complete the rolling update
CreateContainerConfigError because the old secret no longer exists
syncPatroniConfig and restartInstances fail because they cannot reach the Patroni API on the non-running pod
isSafeToRecreatePods = false, which blocks recreatePods
This affects both single-node and multi-node clusters. In a 3-node cluster, one broken pod blocks the rolling update of all pods, including the healthy ones.
Root Cause
syncStatefulSet unconditionally sets isSafeToRecreatePods = false when syncPatroniConfig or restartInstances returns any error. It does not distinguish between:
Fix
pod.go
podIsNotRunning() helper that checks if a pod is stuck in a non-running state (e.g. CreateContainerConfigError, CrashLoopBackOff, ImagePullBackOff, terminated containers, non-Running phase)
allPodsRunning() helper used by syncStatefulSet to determine if Patroni API errors are expected
recreatePods, skip the switchover attempt when the master pod is not running, since the Patroni API is unreachable and Patroni has likely already triggered automatic failover
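For reviewers who want to see the shape of these helpers without opening the diff, here is a minimal sketch. It is not the code from this PR: the pod and containerStatus types are hypothetical stand-ins for the k8s.io/api/core/v1 objects the operator really uses, shouldAttemptSwitchover is an invented name for the recreatePods guard, and treating any waiting container as "not running" is an assumption (the description only lists example reasons).

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the Kubernetes pod status types.
// The real operator works with k8s.io/api/core/v1 objects instead.
type containerStatus struct {
	WaitingReason string // e.g. "CrashLoopBackOff"; empty if not waiting
	Terminated    bool   // true if the container has terminated
}

type pod struct {
	Name       string
	Phase      string // "Running", "Pending", "Failed", ...
	Containers []containerStatus
}

// podIsNotRunning reports whether a pod looks stuck in a non-running state:
// its phase is not Running, or any container is waiting or terminated.
// Assumption: every waiting reason counts, not only the listed examples.
func podIsNotRunning(p pod) bool {
	if p.Phase != "Running" {
		return true
	}
	for _, c := range p.Containers {
		if c.WaitingReason != "" || c.Terminated {
			return true
		}
	}
	return false
}

// allPodsRunning reports whether no pod in the list is stuck.
// Assumption: an empty list yields true here; the PR tests the empty
// case, but the description does not state the expected result.
func allPodsRunning(pods []pod) bool {
	for _, p := range pods {
		if podIsNotRunning(p) {
			return false
		}
	}
	return true
}

// shouldAttemptSwitchover is a hypothetical name for the recreatePods guard:
// a switchover needs the Patroni API, so skip it if the master is not running.
func shouldAttemptSwitchover(master pod) bool {
	return !podIsNotRunning(master)
}

func main() {
	pods := []pod{
		{Name: "db-0", Phase: "Running", Containers: []containerStatus{{}}},
		{Name: "db-1", Phase: "Pending", Containers: []containerStatus{
			{WaitingReason: "CreateContainerConfigError"},
		}},
	}
	for _, p := range pods {
		fmt.Printf("%s not running: %v\n", p.Name, podIsNotRunning(p))
	}
	fmt.Println("all pods running:", allPodsRunning(pods))
	fmt.Println("attempt switchover from db-1:", shouldAttemptSwitchover(pods[1]))
}
```

Note that a pod whose phase is Running can still be flagged, for example when one of its containers sits in CrashLoopBackOff, which is why the sketch checks container states as well as the phase.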
sync.go
syncPatroniConfig or restartInstances fails, only set isSafeToRecreatePods = false if all pods are actually running. If some pods are not running, the Patroni API errors are expected and should not block the pod recreation that would fix them.
Behavior change summary
Testing
TestPodIsNotRunning: verifies detection of various non-running states (CreateContainerConfigError, CrashLoopBackOff, ImagePullBackOff, terminated containers, mixed container states, pending/failed phase)
TestAllPodsRunning: verifies correct behavior with all-running, mixed, all-broken, and empty pod lists