Fix MinorUpdateOVNDataplane race when OVN image unchanged between versions#1864
Fix MinorUpdateOVNDataplane race when OVN image unchanged between versions#1864stuggi wants to merge 1 commit intoopenstack-k8s-operators:mainfrom
Conversation
OpenStackControlPlane CRD Size Report
Threshold reference
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abays, stuggi The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hmm, looks like the minor update got stuck: Let's try again and see if it reproduces. /test openstack-operator-build-deploy-kuttl-4-18 |
/test openstack-operator-build-deploy-kuttl-4-18 |
|
@stuggi: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Same error as before: Might be a legitimate issue? |
Looks like a problem with the data plane reconciling an I didn't realize we used Metal3 provisioning in our Prow jobs. Is something messing up that is making the |
rabi
left a comment
There was a problem hiding this comment.
This is getting really messy 😄 We do create nodset/deployment to validate the CRs[1] after update. Probably we need to move the edpm_deploy_cleanup before we check for openstackctlplane ready in L195.
| serviceCache := make(map[string]*dataplanev1.OpenStackDataPlaneService) | ||
|
|
||
| for _, deployment := range deployments.Items { | ||
| // Skip completed deployments |
There was a problem hiding this comment.
There could be earlier failed deployments containing ovn. We should check if there is a running deployment. Also, I think probably a would still be a race here.
| } | ||
|
|
||
| if slices.Contains(svc.Spec.ContainerImageFields, containerImageField) { | ||
| return true, nil |
There was a problem hiding this comment.
We can probably just check serviceType 'ovn' and avoid the harcoded check for the image name.
There was a problem hiding this comment.
is there always a service type, which is stable and not custom?
There was a problem hiding this comment.
yes, serviceType is always fixed and the service name can change.
…sions Commit 63fd705 removed the nodeset.IsReady() check from DataplaneNodesetsOVNControllerImagesMatch to fix the minor update workflow getting stuck when unrelated deployments were running. That check was too strict — it blocked when any deployment was in progress, even if the OVN update had already completed. However, the remaining pure image comparison is too loose. When the OVN controller image does not change between two OpenStack versions, the nodeset already has the matching image from the previous update. The OpenStackVersion controller then sets MinorUpdateOVNDataplane=True immediately, before the edpm-ovn-update deployment finishes. This causes the subsequent minor update steps (controlplane update, edpm services update) to proceed while the OVN dataplane deployment is still running — resulting in both dataplane deployments running concurrently. Fix this with two mechanisms: 1. IsDataplaneDeploymentRunningForContainerImage: a new generic function that lists all in-progress OpenStackDataPlaneDeployment resources, resolves which services each deploys (from ServicesOverride or the nodeset's service list), and inspects each service's ContainerImageFields to determine if it manages the specified container image. This does not rely on deployment or service naming conventions and works with custom services. 2. Saved condition state tracking: when the OVN image is unchanged between the deployed and target versions, image comparison alone cannot distinguish "already updated" from "not yet updated." To handle this, we use the saved condition state (captured before Init resets conditions each reconciliation) to track whether a running OVN deployment has been observed during this update cycle: - Running OVN deployment seen → set condition False(RequestedReason) - No running deployment + previous condition was RequestedReason → deployment completed, proceed - No running deployment + previous condition was NOT RequestedReason → deployment not created yet, keep waiting The OpenStackVersion controller watches nodesets, so when a deployment starts (nodeset becomes not-Ready) and completes (nodeset becomes Ready), reconciliation is triggered, ensuring we observe the running deployment at least once. When the OVN image differs between versions, the existing image match check is sufficient — the nodeset's ContainerImages are only updated on successful deployment completion, so a match proves the deployment ran. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
|
New changes are detected. LGTM label has been removed. |
|
I have updated the logic which I think addresses checking if the job runs/was running. re the ci issue, wondering if we should stop testing edpm updates in prow as we now have the update job in zuul job which has a proper edpm node? |
Yeah we can drop that if we want. |
Uh oh!
There was an error while loading. Please reload this page.