
[multiple] Co-locate provisionserver with metal3 to prevent DHCP failures #3738

Open
mnietoji wants to merge 1 commit into openstack-k8s-operators:main from mnietoji:dhcp_provisioning_with_fix

Conversation

@mnietoji
Contributor

@mnietoji mnietoji commented Mar 4, 2026


When metal3-dnsmasq pod restarts during a node's DHCP lease renewal on the provisioning network (172.23.0.0/24), NetworkManager fails to renew and sets ipv4.method=disabled. NMState operator then preserves this disabled state, causing permanent loss of provisioning network connectivity on that node.

The issue occurs when OpenStackProvisionServer and metal3 pods run on different nodes. If metal3 restarts while a node is attempting DHCP renewal, the temporary unavailability of metal3-dnsmasq causes the renewal to fail.

Solution:
Automatically detect the node running metal3 pod (via k8s-app=metal3 label) and configure provisionServerNodeSelector in baremetalSetTemplate to schedule OpenStackProvisionServer on the same node. This ensures provisioning network connectivity is maintained because metal3-static-ip-manager maintains a static IP (172.23.0.3) on the metal3 node regardless of dnsmasq restarts.
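As a rough illustration of the resulting scheduling constraint, the co-location boils down to a nodeSelector on the provision server. This is a sketch only: provisionServerNodeSelector and baremetalSetTemplate come from the change itself, but the kubernetes.io/hostname key and the example value are assumptions for illustration.

```yaml
# Sketch: pin OpenStackProvisionServer to the node hosting the metal3 pod.
# The hostname would be discovered from the pod carrying label k8s-app=metal3.
baremetalSetTemplate:
  provisionServerNodeSelector:
    kubernetes.io/hostname: master-0   # assumed label/value for illustration
```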

@openshift-ci
Contributor

openshift-ci bot commented Mar 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tosky for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,d660efa12350eb88ab3c89b1d91a04abcbc82293

@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from d660efa to 369ae18 Compare March 4, 2026 11:42
@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,369ae185bc2b7d5a266e63c93224f86f1d2723cd

@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from 369ae18 to 3fa51c9 Compare March 4, 2026 11:49
@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/ci-framework for 3738,3fa51c9d28a6f3c53f0c99dbbdef1baf476724d5

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/3217375613864e0d83f7f88f394dcfaa

openstack-k8s-operators-content-provider FAILURE in 7m 16s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal-minor-update SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ cifmw-pod-zuul-files SUCCESS in 4m 25s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 21s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 49s
✔️ cifmw-pod-pre-commit SUCCESS in 8m 44s
✔️ cifmw-architecture-validate-hci SUCCESS in 4m 54s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 24s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 14s

@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from d0cf92f to 6b9c8b0 Compare March 4, 2026 14:44
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from 6b9c8b0 to 1339a1d Compare March 4, 2026 17:48
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from 1339a1d to cf58db9 Compare March 5, 2026 10:55
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch 3 times, most recently from e29b915 to 08a2b2b Compare March 10, 2026 15:14
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/0b04bcb1f4f54d518d017da862888f74

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 03m 30s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 22m 08s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 25m 42s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 1h 49m 34s
✔️ cifmw-pod-zuul-files SUCCESS in 5m 28s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 35s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 49s
cifmw-pod-pre-commit TIMED_OUT in 31m 04s
✔️ cifmw-architecture-validate-hci SUCCESS in 4m 29s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 33s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 07s

[multiple] Co-locate provisionserver with metal3 to prevent DHCP failures II

When metal3-dnsmasq pod restarts during a node's DHCP lease renewal on the
provisioning network (172.23.0.0/24), NetworkManager fails to renew and sets
ipv4.method=disabled. NMState operator then preserves this disabled state,
causing permanent loss of provisioning network connectivity on that node.

The issue occurs when OpenStackProvisionServer and metal3 pods run on
different nodes. If metal3 restarts while a node is attempting DHCP renewal,
the temporary unavailability of metal3-dnsmasq causes the renewal to fail.

Solution:
Automatically detect the node running metal3 pod (via k8s-app=metal3 label)
and configure provisionServerNodeSelector in baremetalSetTemplate to schedule
OpenStackProvisionServer on the same node. This ensures provisioning network
connectivity is maintained because metal3-static-ip-manager maintains a static
IP (172.23.0.3) on the metal3 node regardless of dnsmasq restarts.

Signed-off-by: Miguel Angel Nieto Jimenez <mnietoji@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@mnietoji mnietoji force-pushed the dhcp_provisioning_with_fix branch from a7b18cb to 4ec3152 Compare March 18, 2026 15:15
Comment on lines +72 to +93
{% set _original_ansible_vars = (_original_nodeset.ansible | default({})).ansibleVars | default({}) %}
{% set _original_bootstrap = _original_ansible_vars.edpm_bootstrap_command | default('') %}
{% set _needs_policy_json = cifmw_ci_gen_kustomize_values_add_policy_json | default(true) | bool and 'policy.json' not in _original_bootstrap %}
{% if _needs_policy_json or _original_bootstrap %}
edpm_bootstrap_command: |
{% if _original_bootstrap %}
{{ _original_bootstrap }}

{% endif %}
{% if _needs_policy_json %}
mkdir -p /root/.config/containers/ && echo '{"default":[{"type":"insecureAcceptAnything"}]}' > /root/.config/containers/policy.json
{% endif %}
{% endif %}
{% if cifmw_ci_gen_kustomize_values_registry_logins is defined and cifmw_ci_gen_kustomize_values_registry_logins | length > 0 %}
edpm_container_registry_logins:
{% for registry, creds in cifmw_ci_gen_kustomize_values_registry_logins.items() %}
{{ registry }}:
{% for username, password in creds.items() %}
{{ username }}: {{ password }}
{% endfor %}
{% endfor %}
{% endif %}
Contributor

These lines are indented at 8 spaces (belonging under ansibleVars), but they are placed after the services: conditional (L63-70), which renders at 4-space indent. When the services conditional evaluates to true, the rendered YAML looks like this:

  nodeset:
    ansible:
      ansibleVars:
        edpm_fips_mode: ...
        edpm_sshd_allowed_ranges: ...
    services:                           # 4-space — leaves ansibleVars scope
      - "repo-setup"
      - "original-service"
        edpm_bootstrap_command: |       # 8-space — no valid parent here
          ...
        edpm_container_registry_logins: # 8-space — same problem
          ...
    nodes:

This produces invalid YAML, because the parser has already exited the ansible -> ansibleVars context by the time it hits the services: key.
For reference, the common template (common/edpm-nodeset-values/values.yaml.j2) has the correct ordering.

My suggestion: move this block (L72-93) before the services conditional (before L63), right after the edpm_sshd_allowed_ranges block, just as you did for the other blocks.
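For clarity, the intended rendering after the move would look roughly like this (values abbreviated, indentation mirroring the broken example above):

```yaml
  nodeset:
    ansible:
      ansibleVars:
        edpm_fips_mode: ...
        edpm_sshd_allowed_ranges: ...
        edpm_bootstrap_command: |       # 8-space — still inside ansibleVars
          ...
        edpm_container_registry_logins: # 8-space — same scope
          ...
    services:                           # 4-space — scope change is now harmless
      - "repo-setup"
      - "original-service"
    nodes:
```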

loop_control:
  loop_var: registry
ansible.builtin.include_role:
  name: edpm_pullsecret_sync
Contributor

I can't find the reference for this role. Is it being introduced in a separate PR?

cifmw_ci_gen_kustomize_values_registry_logins: >-
{%- set result = {} -%}
{%- for registry in (cifmw_reproducer_registry_list | default(['registry.stage.redhat.io'])) -%}
{%- set fact_name = 'cifmw_reproducer_registry_' + (registry | replace('.', '_') | replace('-', '_')) + '_creds_dict' -%}
Contributor

May I ask you to check this part?
L180 tells the role to store credentials in a fact ending in _creds, but this line looks up the fact with an extra _dict suffix.

The vars[fact_name] is defined check on the next line (L188) will therefore always be false, so cifmw_ci_gen_kustomize_values_registry_logins will always end up as an empty {}. No error is raised, because the rescue block only catches exceptions, not a silently empty result.
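If that reading is right, the fix would simply drop the extra suffix when building the fact name (sketch; only the trailing _creds_dict changes to _creds relative to the hunk above):

```jinja
{%- set fact_name = 'cifmw_reproducer_registry_' + (registry | replace('.', '_') | replace('-', '_')) + '_creds' -%}
```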

Comment on lines +136 to +173
- name: Detect metal3 pod node for baremetal nodeset provisioning
when:
- stage.path is defined
- stage.path is match('.*/edpm/nodeset.*')
block:
- name: Debug - Entering metal3 detection block
ansible.builtin.debug:
msg: "INSIDE metal3 detection block for stage: {{ stage.path }}"

- name: Get metal3 pod information
kubernetes.core.k8s_info:
kubeconfig: "{{ cifmw_openshift_kubeconfig }}"
kind: Pod
namespace: openshift-machine-api
label_selectors:
- k8s-app=metal3
register: _cifmw_kustomize_deploy_metal3_pod_info

- name: Set metal3 node for provisionserver nodeSelector
ansible.builtin.set_fact:
cifmw_kustomize_deploy_metal3_node: "{{ _cifmw_kustomize_deploy_metal3_pod_info.resources[0].spec.nodeName }}"
cacheable: true
when:
- _cifmw_kustomize_deploy_metal3_pod_info.resources is defined
- _cifmw_kustomize_deploy_metal3_pod_info.resources | length > 0

- name: Log metal3 node location
ansible.builtin.debug:
msg: "Metal3 pod is running on {{ cifmw_kustomize_deploy_metal3_node }}, provisionserver will be co-located there"
when: cifmw_kustomize_deploy_metal3_node is defined

- name: Debug - metal3 pod info result
ansible.builtin.debug:
msg:
- "Resources found: {{ _cifmw_kustomize_deploy_metal3_pod_info.resources | default([]) | length }}"
- "Variable set: {{ cifmw_kustomize_deploy_metal3_node is defined }}"
- "Variable value: {{ cifmw_kustomize_deploy_metal3_node | default('NOT_SET') }}"

Contributor

The file execute_step.yml has a very consistent pattern: every task that interacts with a live cluster is gated by not cifmw_kustomize_deploy_generate_crs_only | bool. You can see this pattern used four times in the file:

Line 122 — pre_stage hooks: not cifmw_kustomize_deploy_generate_crs_only | bool
Line 320 — oc apply: not cifmw_kustomize_deploy_generate_crs_only | bool
Line 339 — wait conditions: not cifmw_kustomize_deploy_generate_crs_only | bool
Line 364 — post_stage hooks: not cifmw_kustomize_deploy_generate_crs_only | bool
When cifmw_kustomize_deploy_generate_crs_only: true (dry-run / architecture test mode), all of these are skipped because there is no live cluster. The test only generates and validates YAML artifacts.

The new block at line 136 calls kubernetes.core.k8s_info to query pods on the cluster but is missing this gate.

If I'm not mistaken, for any scenario with an edpm/nodeset stage path (which is most baremetal scenarios), the k8s_info call will attempt to connect to a cluster that doesn't exist in dry-run mode. The block has no rescue or ignore_errors, so it will fail hard and break architecture tests.

Once you've checked that, here's my suggestion:

....
      when:
        - not cifmw_kustomize_deploy_generate_crs_only | bool
        - stage.path is defined
        - stage.path is match('.*/edpm/nodeset.*')
....
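Applied to the block from lines +136 to +173 above, the guarded version would start like this (a sketch combining the existing task header with the suggested condition; the inner tasks are unchanged):

```yaml
- name: Detect metal3 pod node for baremetal nodeset provisioning
  when:
    - not cifmw_kustomize_deploy_generate_crs_only | bool
    - stage.path is defined
    - stage.path is match('.*/edpm/nodeset.*')
  block:
    # ... unchanged tasks (k8s_info, set_fact, debug) ...
```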

@Valkyrie00 Valkyrie00 requested a review from a team March 20, 2026 12:06