[release-4.19] OCPBUGS-85275: Wait for revision stability before removing etcd members #1613
Previously, the ClusterMemberRemovalController would remove etcd members during revision rollouts, causing cluster degradation when simultaneously deleting multiple control plane machines with the OnDelete strategy.

During a revision rollout, etcd members can temporarily appear unhealthy while their pods are reinstalled to the latest revision. This is different from members being indefinitely unhealthy on a stable revision.

Additionally, the EtcdEndpointsController pauses during revision rollouts, so when a replacement machine is added and triggers a rollout, the etcd-endpoints configmap won't update. This causes API servers on the old revision to use removed member endpoints, leading to API unavailability.

This change adds a revision stability check before allowing member removal, ensuring we only remove members when revisions are stable and unhealthy members are truly unhealthy. This explicitly codifies the 4.17 behavior where the operator waited for all revisions to complete before removing members and lifecycle hooks.

Additionally, the ClusterMemberRemovalController now verifies that the live etcd membership matches the configmap before proceeding with member removal, preventing potential issues during rapid member deletion.

(cherry picked from commit 0168733)
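For readers skimming the description, the two gates it mentions (revision stability, then live membership matching the configmap) can be sketched roughly as below. This is an illustrative sketch only, not the code from this PR: `NodeStatus`, `revisionsStable`, `membershipMatches`, and `canRemoveMember` are hypothetical names, and it assumes that "stable" means every node is on the latest revision with no target revision pending.

```go
package main

import (
	"fmt"
	"sort"
)

// NodeStatus is a hypothetical, trimmed-down stand-in for the per-node
// static pod status tracked by the operator. It is not the real API type.
type NodeStatus struct {
	NodeName        string
	CurrentRevision int32
	TargetRevision  int32 // assumption: 0 means no rollout is pending on this node
}

// revisionsStable reports whether every node already runs the latest
// revision and none of them is in the middle of a rollout.
func revisionsStable(latest int32, nodes []NodeStatus) bool {
	if len(nodes) == 0 {
		return false // nothing observed yet, so do not claim stability
	}
	for _, n := range nodes {
		if n.CurrentRevision != latest || n.TargetRevision != 0 {
			return false
		}
	}
	return true
}

// membershipMatches reports whether the live etcd member endpoints and the
// endpoints recorded in the configmap contain the same entries, ignoring order.
func membershipMatches(live, configured []string) bool {
	if len(live) != len(configured) {
		return false
	}
	// Sort copies so the callers' slices are left untouched.
	a := append([]string(nil), live...)
	b := append([]string(nil), configured...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// canRemoveMember combines both gates and returns a reason when removal
// should be skipped for now.
func canRemoveMember(latest int32, nodes []NodeStatus, live, configured []string) (bool, string) {
	if !revisionsStable(latest, nodes) {
		return false, "revision rollout in progress"
	}
	if !membershipMatches(live, configured) {
		return false, "live membership does not match configmap"
	}
	return true, ""
}

func main() {
	// master-1 is still moving from revision 6 to 7, so removal is deferred.
	nodes := []NodeStatus{
		{NodeName: "master-0", CurrentRevision: 7},
		{NodeName: "master-1", CurrentRevision: 6, TargetRevision: 7},
	}
	endpoints := []string{"10.0.0.1", "10.0.0.2"}
	ok, reason := canRemoveMember(7, nodes, endpoints, endpoints)
	fmt.Println(ok, reason)
}
```

Checking revision stability first mirrors the order in the description: a member that looks unhealthy mid-rollout is not treated as a removal candidate until the rollout has settled.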
|
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI.
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml. Review profile: CHILL. Plan: Enterprise.
|
@openshift-cherrypick-robot: Jira Issue OCPBUGS-77313 has been cloned as Jira Issue OCPBUGS-85275. Will retitle bug to link to clone.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-85275, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
Requesting review from QA contact:
The bug has been updated to refer to the pull request using the external bug tracker.
|
/hold |
|
/cherrypick release-4.18 |
|
@hasbro17: once the present PR merges, I will cherry-pick it on top of
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/retest-required |
|
/lgtm |
|
/test e2e-aws-ovn-etcd-scaling |
|
/test e2e-aws-ovn-etcd-scaling
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: tjungblu
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
🤖 Automated labeling by agent on behalf of @tjungblu
This is an automated bot PR. Adding required Tide labels:
/verified by ci
This action was performed by an automated agent. If this is incorrect, please review and adjust manually.
|
@tjungblu: This PR has been marked as verified by
|
/test e2e-aws-ovn-etcd-scaling |
|
@openshift-cherrypick-robot: The following test failed, say
|
Okay, so the vertical scaling suite is super flaky, at least on 4.19, but we do have a passing test run for this fix.
|
/unhold |
Merged commit 024f8af into openshift:release-4.19
|
@openshift-cherrypick-robot: Jira Issue Verification Checks: Jira Issue OCPBUGS-85275
Jira Issue OCPBUGS-85275 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload.
|
@hasbro17: new pull request created: #1623
|
Fix included in release 4.19.0-0.nightly-2026-05-20-004750 |
This is an automated cherry-pick of #1571
/assign hasbro17