
Fix MinorUpdateOVNDataplane race when OVN image unchanged between versions#1864

Open
stuggi wants to merge 1 commit into openstack-k8s-operators:main from stuggi:ovn_update_cond

Conversation

@stuggi
Contributor

@stuggi stuggi commented Mar 26, 2026

Commit 63fd70544f removed the nodeset.IsReady() check from
DataplaneNodesetsOVNControllerImagesMatch to fix the minor update
workflow getting stuck when unrelated deployments were running. That
check was too strict — it blocked when any deployment was in progress,
even if the OVN update had already completed.

However, the remaining pure image comparison is too loose. When the
OVN controller image does not change between two OpenStack versions,
the nodeset already has the matching image from the previous update.
The OpenStackVersion controller then sets MinorUpdateOVNDataplane=True
immediately, before the edpm-ovn-update deployment finishes. This
causes the subsequent minor update steps (controlplane update, edpm
services update) to proceed while the OVN dataplane deployment is
still running — resulting in both dataplane deployments running
concurrently.

Fix this with two mechanisms:

1. IsDataplaneDeploymentRunningForContainerImage: a new generic
   function that lists all in-progress OpenStackDataPlaneDeployment
   resources, resolves which services each deploys (from
   ServicesOverride or the nodeset's service list), and inspects each
   service's ContainerImageFields to determine if it manages the
   specified container image. This does not rely on deployment or
   service naming conventions and works with custom services.

2. Saved condition state tracking: when the OVN image is unchanged
   between the deployed and target versions, image comparison alone
   cannot distinguish "already updated" from "not yet updated." To
   handle this, we use the saved condition state (captured before
   Init resets conditions each reconciliation) to track whether a
   running OVN deployment has been observed during this update cycle:
   - Running OVN deployment seen → set condition False(RequestedReason)
   - No running deployment + previous condition was RequestedReason →
     deployment completed, proceed
   - No running deployment + previous condition was NOT
     RequestedReason → deployment not created yet, keep waiting

   The OpenStackVersion controller watches nodesets, so when a
   deployment starts (nodeset becomes not-Ready) and completes
   (nodeset becomes Ready), reconciliation is triggered, ensuring
   we observe the running deployment at least once.
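The deployment-scanning logic in mechanism 1 can be sketched as follows. This is a minimal, self-contained illustration with simplified stand-in types (`Deployment`, `Service`, `NodeSet` here are hypothetical placeholders for the OpenStackDataPlaneDeployment/OpenStackDataPlaneService/OpenStackDataPlaneNodeSet CRDs), not the actual operator code:

```go
package main

import (
	"fmt"
	"slices"
)

// Simplified stand-in types for the dataplane CRDs (hypothetical names,
// for illustration only).
type Service struct {
	Name                 string
	ContainerImageFields []string
}

type Deployment struct {
	Name             string
	Ready            bool     // deployment completed successfully
	ServicesOverride []string // optional explicit service list
}

type NodeSet struct {
	Services []string
}

// isDeploymentRunningForImageField reports whether any in-progress
// deployment deploys a service whose ContainerImageFields include the
// given field (e.g. the OVN controller image field). It resolves each
// deployment's services from ServicesOverride, falling back to the
// nodeset's service list, so it does not depend on naming conventions.
func isDeploymentRunningForImageField(
	deployments []Deployment,
	nodeset NodeSet,
	services map[string]Service,
	imageField string,
) bool {
	for _, dep := range deployments {
		if dep.Ready {
			continue // skip completed deployments
		}
		svcNames := dep.ServicesOverride
		if len(svcNames) == 0 {
			svcNames = nodeset.Services
		}
		for _, name := range svcNames {
			svc, ok := services[name]
			if !ok {
				continue
			}
			if slices.Contains(svc.ContainerImageFields, imageField) {
				return true
			}
		}
	}
	return false
}

func main() {
	services := map[string]Service{
		"ovn":  {Name: "ovn", ContainerImageFields: []string{"OvnControllerImage"}},
		"nova": {Name: "nova", ContainerImageFields: []string{"NovaImage"}},
	}
	nodeset := NodeSet{Services: []string{"ovn", "nova"}}
	running := []Deployment{{Name: "edpm-ovn-update", ServicesOverride: []string{"ovn"}}}
	fmt.Println(isDeploymentRunningForImageField(running, nodeset, services, "OvnControllerImage"))
}
```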

When the OVN image differs between versions, the existing image match
check is sufficient — the nodeset's ContainerImages are only updated
on successful deployment completion, so a match proves the deployment
ran.
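The saved-condition tracking from mechanism 2 amounts to a three-state decision. A minimal sketch, with hypothetical names (`decideOVNUpdate`, `ReasonRequested` stand in for the condition/reason plumbing described above):

```go
package main

import "fmt"

type Reason string

const (
	ReasonInit      Reason = "Init"      // condition never set during this update cycle
	ReasonRequested Reason = "Requested" // a running OVN deployment was observed
)

type Decision struct {
	Done   bool
	Reason Reason
}

// decideOVNUpdate sketches the state machine used when the OVN image is
// unchanged between versions: 'running' is whether an in-progress OVN
// deployment exists now, 'savedReason' is the condition reason captured
// before Init reset the conditions this reconciliation.
func decideOVNUpdate(running bool, savedReason Reason) Decision {
	switch {
	case running:
		// Running OVN deployment seen: record it, keep condition False.
		return Decision{Done: false, Reason: ReasonRequested}
	case savedReason == ReasonRequested:
		// Previously observed running, now gone: deployment completed.
		return Decision{Done: true, Reason: ReasonRequested}
	default:
		// Never observed running: deployment not created yet, keep waiting.
		return Decision{Done: false, Reason: savedReason}
	}
}

func main() {
	fmt.Println(decideOVNUpdate(true, ReasonInit))       // observe running deployment
	fmt.Println(decideOVNUpdate(false, ReasonRequested)) // completed, proceed
	fmt.Println(decideOVNUpdate(false, ReasonInit))      // not started, wait
}
```

Because the OpenStackVersion controller reconciles on every nodeset Ready transition, the `running=true` branch is guaranteed to be hit at least once per update cycle before the `savedReason == ReasonRequested` branch can fire.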

@openshift-ci openshift-ci bot requested review from dprince and slagle March 26, 2026 14:50
@github-actions

github-actions bot commented Mar 26, 2026

OpenStackControlPlane CRD Size Report

| Metric | Value |
| --- | --- |
| CRD JSON size | 322136 bytes (315KB) |
| Base branch size | 322136 bytes |
| Change | +0.00% |
| Status | 🟡 yellow — growing |

Threshold reference:

| Color | Range | Meaning |
| --- | --- | --- |
| 🟢 green | < 300KB | Comfortable |
| 🟡 yellow | 300–400KB | Growing |
| 🟠 orange | 400–750KB | Concerning |
| 🔴 red | > 750KB | Approaching 1.5MB etcd limit (cut in half to allow space for update) |

@stuggi stuggi requested a review from abays March 26, 2026 14:52
Contributor

@abays abays left a comment


/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Mar 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abays, stuggi


@abays
Contributor

abays commented Mar 26, 2026

@stuggi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/openstack-operator-build-deploy-kuttl-4-18 | 0585f27 | link | true | /test openstack-operator-build-deploy-kuttl-4-18 |

Full PR test history. Your PR dashboard.
Details

Hmm, looks like the minor update got stuck:

oc patch -n openstack openstackversion.core.openstack.org/openstack-galera --type merge --patch '{"spec": {"targetVersion": "0.0.1-1774540092"}}'
openstackversion.core.openstack.org/openstack-galera condition met
openstackversion.core.openstack.org/openstack-galera patched
+ sleep 10
+ oc wait openstackcontrolplane -n openstack --for=condition=Ready --timeout=1200s -l core.openstack.org/openstackcontrolplane
error: timed out waiting for the condition on openstackcontrolplanes/openstack-galera

Let's try again and see if it reproduces.

/test openstack-operator-build-deploy-kuttl-4-18

@abays
Contributor

abays commented Mar 26, 2026


* 2026-03-26T17:28:43Z 1x kubelet: Error: context deadline exceeded
* 2026-03-26T18:09:13Z 6x kubelet: Failed to inspect image "": rpc error: code = DeadlineExceeded desc = stream terminated by RST_STREAM with error code: CANCEL
* 2026-03-26T17:59:08Z 14x kubelet: Error: ImageInspectError
* 2026-03-26T17:33:02Z 1x kubelet: Error: stream terminated by RST_STREAM with error code: CANCEL
* 2026-03-26T18:25:16Z 20x kubelet: Failed to inspect image "": rpc error: code = DeadlineExceeded desc = context deadline exceeded

/test openstack-operator-build-deploy-kuttl-4-18

@openshift-ci
Contributor

openshift-ci bot commented Mar 26, 2026

@stuggi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/openstack-operator-build-deploy-kuttl-4-18 | 0585f27 | link | true | /test openstack-operator-build-deploy-kuttl-4-18 |

Full PR test history. Your PR dashboard.

Details


@abays
Contributor

abays commented Mar 26, 2026


Same error as before:

openstackversion.core.openstack.org/openstack-galera patched
+ sleep 10
+ oc wait openstackcontrolplane -n openstack --for=condition=Ready --timeout=1200s -l core.openstack.org/openstackcontrolplane
error: timed out waiting for the condition on openstackcontrolplanes/openstack-galera

Might be a legitimate issue?

@abays
Contributor

abays commented Mar 26, 2026


Looks like a problem with the data plane reconciling an OpenStackBaremetalSet:

2026-03-26T20:37:18.416334575Z 2026-03-26T20:37:18Z	INFO	Controllers.OpenStackDataPlaneNodeSet	Reconciling BaremetalSet	{"controller": "openstackdataplanenodeset", "controllerGroup": "dataplane.openstack.org", "controllerKind": "OpenStackDataPlaneNodeSet", "OpenStackDataPlaneNodeSet": {"name":"openstack-edpm-ipam","namespace":"openstack"}, "namespace": "openstack", "name": "openstack-edpm-ipam", "reconcileID": "0dbd65cf-8bb5-4af0-909a-085c2ccebaf1", "ObjectType": "*v1beta1.OpenStackDataPlaneNodeSet", "ObjectNamespace": "openstack", "ObjectName": "openstack-edpm-ipam"}
2026-03-26T20:37:18.430773614Z 2026-03-26T20:37:18Z	ERROR	Reconciler error	{"controller": "openstackdataplanenodeset", "controllerGroup": "dataplane.openstack.org", "controllerKind": "OpenStackDataPlaneNodeSet", "OpenStackDataPlaneNodeSet": {"name":"openstack-edpm-ipam","namespace":"openstack"}, "namespace": "openstack", "name": "openstack-edpm-ipam", "reconcileID": "0dbd65cf-8bb5-4af0-909a-085c2ccebaf1", "error": "admission webhook \"vopenstackbaremetalset-v1beta1.kb.io\" denied the request: unable to find 1 requested BaremetalHosts with labels [app:openstack] in namespace openstack for scale-up (0 in use, 0 available)"}

I didn't realize we used Metal3 provisioning in our Prow jobs. Is something going wrong that makes the OpenStackDataPlaneNodeSet think it needs to provision something first? Don't we use pre-provisioned nodes?

Contributor

@rabi rabi left a comment


This is getting really messy 😄 We do create a nodeset/deployment to validate the CRs [1] after the update. We probably need to move the edpm_deploy_cleanup before we check for openstackctlplane ready in L195.

https://github.com/openshift/release/blob/main/ci-operator/step-registry/openstack-k8s-operators/kuttl/openstack-k8s-operators-kuttl-commands.sh#L173

```go
serviceCache := make(map[string]*dataplanev1.OpenStackDataPlaneService)

for _, deployment := range deployments.Items {
	// Skip completed deployments
```
Contributor


There could be earlier failed deployments containing ovn. We should check if there is a running deployment. Also, I think there would probably still be a race here.

```go
	}

	if slices.Contains(svc.Spec.ContainerImageFields, containerImageField) {
		return true, nil
```
Contributor


We can probably just check for serviceType 'ovn' and avoid the hardcoded check for the image name.

Contributor Author


Is there always a serviceType that is stable and not custom?

Contributor


Yes, serviceType is always fixed, while the service name can change.
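The reviewers' suggestion above — matching on the stable service type instead of a hardcoded image-field name — could look roughly like this. A sketch with a simplified stand-in type; `EDPMServiceType` is assumed here to be the Go name of the service spec's stable type field, per the discussion:

```go
package main

import "fmt"

// Stand-in for the dataplane service spec (simplified; assumes the
// spec exposes a stable EDPMServiceType field as discussed above).
type ServiceSpec struct {
	Name            string // customizable by operators
	EDPMServiceType string // fixed per service type
}

// isOVNService matches on the stable service type rather than the
// customizable service name or a hardcoded image-field name.
func isOVNService(svc ServiceSpec) bool {
	return svc.EDPMServiceType == "ovn"
}

func main() {
	// A custom-named service is still recognized by its type.
	custom := ServiceSpec{Name: "my-custom-ovn", EDPMServiceType: "ovn"}
	fmt.Println(isOVNService(custom))
}
```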

…sions


Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm label Mar 27, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 27, 2026

New changes are detected. LGTM label has been removed.

@stuggi
Contributor Author

stuggi commented Mar 27, 2026

I have updated the logic, which I think addresses checking whether the job is running or was running. Re the CI issue, I'm wondering if we should stop testing EDPM updates in Prow, as we now have the update job in a Zuul job which has a proper EDPM node?

@rabi
Contributor

rabi commented Mar 27, 2026

I have updated the logic, which I think addresses checking whether the job is running or was running. Re the CI issue, I'm wondering if we should stop testing EDPM updates in Prow, as we now have the update job in a Zuul job which has a proper EDPM node?

Yeah we can drop that if we want.
