Skip to content

MULTIARCH-6023: PowerVS IPI install signal handling to prevent deprovision failures#80917

Open
Neha-dot-Yadav wants to merge 2 commits into
openshift:mainfrom
Neha-dot-Yadav:powervs-update-trap
Open

MULTIARCH-6023: PowerVS IPI install signal handling to prevent deprovision failures#80917
Neha-dot-Yadav wants to merge 2 commits into
openshift:mainfrom
Neha-dot-Yadav:powervs-update-trap

Conversation

@Neha-dot-Yadav

@Neha-dot-Yadav Neha-dot-Yadav commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Problem

When PowerVS IPI installations timeout, the deprovision step fails with:

Skipping: /tmp/secret/metadata.json not found.

This prevents cleanup of cloud resources (VMs, networks, DNS records), leading to stale resource and subsequent job failures.

Root Cause

The script has two issues with signal handling:

  1. Trap overwriting: Two traps are set for SIGTERM, where the second overwrites the first:
    trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi' TERM
    trap 'prepare_next_steps' EXIT TERM  # ← Overwrites line above
    
  2. Pipeline blocking: Bash cannot execute trap handlers while blocked on pipeline execution:
    openshift-install ... | grep ... # Bash is blocked here

When SIGTERM arrives, it's queued but cannot be processed until the pipeline completes. If the pipeline never completes (stuck installation), SIGKILL arrives after the grace period and terminates everything without running cleanup.

Solution

  1. Combine traps: Merge both handlers into a single trap for EXIT/TERM signals
  2. Use background processes: Run openshift-install in background with process substitution instead of pipelines
  3. Use interruptible wait: The wait command can be interrupted by signals (unlike pipeline execution)

Summary by CodeRabbit

This PR improves the reliability of PowerVS IPI provisioning cleanup in OpenShift CI by fixing signal-handling behavior during install timeouts—preventing failures that previously left stale infrastructure behind (VMs, networks, and DNS records) and caused subsequent jobs to fail.

What changed (practical impact)

In ci-operator/step-registry/ipi/install/powervs/install/ipi-install-powervs-install-commands.sh, the install flow was adjusted so that when an install is interrupted (e.g., due to timeout), the script can still run its cleanup/“next steps” logic instead of exiting prematurely with errors like Skipping: /tmp/secret/metadata.json not found.

How it fixes the timeout/deprovision failure

  • Signal handling is consolidated: The script’s trap setup was reworked so the intended handler is not overwritten (ensuring termination signals trigger the correct logic).
  • Installer execution no longer blocks signal processing: Instead of running openshift-install ... | grep ... as a foreground pipeline (which can delay trap execution until the pipeline completes), openshift-install is now run in the background with output filtered via process substitution.
  • Uses wait for interruptible lifecycle control: The script captures the installer PID, waits on it, and propagates the resulting status—so SIGTERM/SIGKILL timing doesn’t prevent the script from executing its termination path.

These adjustments apply to both phases:

  • openshift-install --dir=... create cluster
  • openshift-install wait-for install-complete

Overall, PowerVS IPI jobs are less likely to strand cloud resources when interrupted, reducing downstream retries and cascading failures.

@openshift-ci openshift-ci Bot requested a review from dharaneeshvrd June 23, 2026 12:36
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

An error occurred during the review process. Please try again later.

Walkthrough

The script's signal handling is split from a single combined trap into separate TERM and EXIT traps. The TERM trap now kills background child processes and exits with status 143, while EXIT runs cleanup on normal termination. Both openshift-install create cluster and openshift-install wait-for install-complete invocations are converted from foreground pipelines (using PIPESTATUS[0]) to background processes with process-substitution output filtering, capturing the installer PID and deriving the exit code via wait.

Changes

Signal handling and process execution refactor

Layer / File(s) Summary
Signal trap refactor
ci-operator/step-registry/ipi/install/powervs/install/ipi-install-powervs-install-commands.sh
A single combined EXIT/TERM trap is split into a dedicated TERM trap that kills background child jobs via jobs -p, waits for them, suppresses the EXIT trap, then calls prepare_next_steps and exits with code 143; a separate EXIT trap continues to run prepare_next_steps on normal termination.
Background installer execution
ci-operator/step-registry/ipi/install/powervs/install/ipi-install-powervs-install-commands.sh
Both openshift-install create cluster and openshift-install wait-for install-complete are converted from foreground | pipelines with PIPESTATUS[0] to background processes with output filtered via process substitution (> >(grep ...)); installer PID is captured and exit code is obtained via wait.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error The PR introduces a redirection order bug that defeats secret-scrubbing: 2>&1 > >(grep ...) causes stderr to bypass the grep filter, exposing unredacted logs containing passwords and tokens. Swap redirection order to > >(grep ...) 2>&1 on lines 941 and 973 so stderr is filtered through grep, matching the AWS reference implementation.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix PowerVS IPI install signal handling to prevent deprovision failures' directly and clearly describes the main change in the pull request, which addresses signal handling issues in PowerVS IPI installation scripts.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Custom check for Ginkgo test name stability is not applicable. The PR modifies only a bash shell script (ipi-install-powervs-install-commands.sh) with no Ginkgo tests or test definitions.
Test Structure And Quality ✅ Passed Custom check for Ginkgo test code quality is not applicable to this PR, which modifies only a bash installation script with no test code.
Microshift Test Compatibility ✅ Passed This PR does not add any Ginkgo e2e tests. It only modifies a bash shell script for CI/CD infrastructure, making the MicroShift Test Compatibility check inapplicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR modifies a bash installation script (ipi-install-powervs-install-commands.sh), not Go/Ginkgo e2e test code. The SNO compatibility check applies only to new Ginkgo e2e tests, which are not p...
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only a bash CI/CD installation script's signal handling. No Kubernetes deployment manifests, operator code, or scheduling constraints are added/modified. Check applies only to deploymen...
Ote Binary Stdout Contract ✅ Passed PR modifies only a bash shell script for CI infrastructure; no Go code or OTE binaries involved. Custom check for OTE binary stdout contract violations is not applicable.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR modifies only a bash shell script (ipi-install-powervs-install-commands.sh), not Ginkgo e2e tests. The custom check applies only to new Ginkgo tests; no Go test files are present in this PR.
No-Weak-Crypto ✅ Passed The pull request modifies a bash shell script for PowerVS IPI installation, focusing on signal handling and process management. No cryptographic algorithms (MD5, SHA1, DES, RC4, etc.), custom crypt...
Container-Privileges ✅ Passed PR modifies bash script and step-registry YAML reference files with no container security definitions; no privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN, or allowPrivilegeEscalation: tr...
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot requested a review from hamzy June 23, 2026 12:36

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
ci-operator/step-registry/ipi/install/powervs/install/ipi-install-powervs-install-commands.sh (1)

972-976: 🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win

Same redirection-order bug as create cluster — stderr bypasses the filter here too.

🔒 Proposed fix
-  openshift-install wait-for install-complete --dir="${dir}" 2>&1 > >(grep --line-buffered -v 'password\|X-Auth-Token\|UserData:') &
+  openshift-install wait-for install-complete --dir="${dir}" > >(grep --line-buffered -v 'password\|X-Auth-Token\|UserData:') 2>&1 &
   INSTALL_PID=$!
   wait $INSTALL_PID
   ret=$?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/ipi/install/powervs/install/ipi-install-powervs-install-commands.sh`
around lines 972 - 976, The redirection order in the openshift-install wait-for
install-complete command is incorrect, causing stderr to bypass the grep filter.
The current order `2>&1 > >(grep ...)` redirects stderr before stdout is piped
to grep, so stderr goes to the original stdout instead of through the filter.
Move the `2>&1` redirection to come after the process substitution `> >(grep
...)` so that both stdout and stderr are properly filtered through the grep
command that removes sensitive information like passwords and authentication
tokens.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/ipi/install/powervs/install/ipi-install-powervs-install-commands.sh`:
- Around line 940-944: The redirection order in the openshift-install command is
incorrect, causing stderr to bypass the secret-scrubbing grep filter. Currently,
stderr is redirected to stdout before stdout is redirected to the grep process
substitution, allowing sensitive information like passwords and tokens to leak
into CI logs unredacted. Fix this by swapping the redirection order in the
openshift-install invocation so that stdout is first redirected to the grep
process substitution using > >(...), and then stderr is merged into it using
2>&1, ensuring both output streams pass through the password, X-Auth-Token, and
UserData filtering.
- Around line 707-710: Split the combined trap statement handling both EXIT and
TERM into two separate trap handlers to prevent prepare_next_steps from
executing twice and to preserve the correct exit code. Create a TERM trap that
kills background children and exits the script immediately, and a separate EXIT
trap that captures the exit code before running prepare_next_steps, ensuring the
function reads the actual installation result rather than the status from the
wait command.

---

Duplicate comments:
In
`@ci-operator/step-registry/ipi/install/powervs/install/ipi-install-powervs-install-commands.sh`:
- Around line 972-976: The redirection order in the openshift-install wait-for
install-complete command is incorrect, causing stderr to bypass the grep filter.
The current order `2>&1 > >(grep ...)` redirects stderr before stdout is piped
to grep, so stderr goes to the original stdout instead of through the filter.
Move the `2>&1` redirection to come after the process substitution `> >(grep
...)` so that both stdout and stderr are properly filtered through the grep
command that removes sensitive information like passwords and authentication
tokens.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b04622d5-638c-4b9d-86ce-8f97b938dad7

📥 Commits

Reviewing files that changed from the base of the PR and between c75ad6f and ae07ba4.

📒 Files selected for processing (1)
  • ci-operator/step-registry/ipi/install/powervs/install/ipi-install-powervs-install-commands.sh

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

[REHEARSALNOTIFIER]
@Neha-dot-Yadav: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-installer-main-e2e-powervs-capi-ovn openshift/installer presubmit Registry content changed
pull-ci-openshift-installer-release-5.1-e2e-powervs-capi-ovn openshift/installer presubmit Registry content changed
pull-ci-openshift-installer-release-5.0-e2e-powervs-capi-ovn openshift/installer presubmit Registry content changed
pull-ci-openshift-installer-release-4.23-e2e-powervs-capi-ovn openshift/installer presubmit Registry content changed
pull-ci-openshift-installer-release-4.22-e2e-powervs-capi-ovn openshift/installer presubmit Registry content changed
pull-ci-openshift-installer-release-4.21-e2e-powervs-capi-ovn openshift/installer presubmit Registry content changed
pull-ci-openshift-installer-release-4.20-e2e-powervs-capi-ovn openshift/installer presubmit Registry content changed
pull-ci-openshift-installer-release-4.19-e2e-powervs-capi-ovn openshift/installer presubmit Registry content changed
pull-ci-openshift-installer-release-4.18-e2e-powervs-capi-ovn openshift/installer presubmit Registry content changed
pull-ci-openshift-installer-release-4.17-e2e-powervs-ovn openshift/installer presubmit Registry content changed
pull-ci-openshift-installer-release-4.16-altinfra-e2e-powervs-capi-ovn openshift/installer presubmit Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.21-ppc64le-nightly-powervs-ipi-f14 N/A periodic Registry content changed
periodic-ci-openshift-multiarch-main-nightly-4.20-ocp-e2e-ovn-powervs-capi-multi-p-p N/A periodic Registry content changed
periodic-ci-openshift-multiarch-main-nightly-4.16-ocp-e2e-ovn-ppc64le-powervs-original N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.20-ppc64le-nightly-powervs-ipi-f14-destructive N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.18-ppc64le-nightly-powervs-ipi-f14 N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-5.0-ppc64le-nightly-powervs-ipi-f7-destructive N/A periodic Registry content changed
periodic-ci-openshift-release-main-nightly-4.16-e2e-powervs-ovn-techpreview N/A periodic Registry content changed
periodic-ci-openshift-release-main-ci-4.16-e2e-powervs-ovn-techpreview N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.17-ppc64le-nightly-powervs-ipi-f14-destructive N/A periodic Registry content changed
periodic-ci-openshift-multiarch-main-nightly-4.14-ocp-e2e-serial-ovn-ppc64le-powervs N/A periodic Registry content changed
periodic-ci-openshift-openshift-tests-private-release-4.21-ppc64le-nightly-powervs-ipi-f14-destructive N/A periodic Registry content changed
periodic-ci-openshift-multiarch-main-nightly-4.13-ocp-e2e-ovn-ppc64le-powervs N/A periodic Registry content changed
periodic-ci-openshift-multiarch-main-nightly-4.15-ocp-e2e-ovn-ppc64le-powervs-original N/A periodic Registry content changed
periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-ovn-powervs-capi-multi-p-p N/A periodic Registry content changed

A total of 43 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci

openshift-ci Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@Neha-dot-Yadav: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hamzy

hamzy commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-5.0-ocp-e2e-ovn-powervc-multi-p-p

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

@hamzy: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

@hamzy: job(s): periodic-ci-openshift-multiarch-main-nightly-5.0-ocp-e2e-ovn-powervc-multi-p-p either don't exist or were not found to be affected, and cannot be rehearsed

@hamzy

hamzy commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-ovn-powervs-capi-multi-p-p

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

@hamzy: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@prb112 prb112 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 23, 2026
@openshift-ci

openshift-ci Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Neha-dot-Yadav, prb112

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2026
@prb112

prb112 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

/retitle MULTIARCH-6023: PowerVS IPI install signal handling to prevent deprovision failures

@openshift-ci openshift-ci Bot changed the title Fix PowerVS IPI install signal handling to prevent deprovision failures MULTIARCH-6023: PowerVS IPI install signal handling to prevent deprovision failures Jun 23, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@Neha-dot-Yadav: This pull request references MULTIARCH-6023 which is a valid jira issue.

Details

In response to this:

Problem

When PowerVS IPI installations timeout, the deprovision step fails with:

Skipping: /tmp/secret/metadata.json not found.

This prevents cleanup of cloud resources (VMs, networks, DNS records), leading to stale resource and subsequent job failures.

Root Cause

The script has two issues with signal handling:

  1. Trap overwriting: Two traps are set for SIGTERM, where the second overwrites the first:
trap 'CHILDREN=$(jobs -p); if test -n "${CHILDREN}"; then kill ${CHILDREN} && wait; fi' TERM
trap 'prepare_next_steps' EXIT TERM  # ← Overwrites line above

2. Pipeline blocking: Bash cannot execute trap handlers while blocked on pipeline execution:
`openshift-install ... | grep ...  # Bash is blocked here`

When SIGTERM arrives, it's queued but cannot be processed until the pipeline completes. If the pipeline never completes (stuck installation), SIGKILL arrives after the grace period and terminates everything without running cleanup.

## Solution
1. Combine traps: Merge both handlers into a single trap for EXIT/TERM signals
2. Use background processes: Run openshift-install in background with process substitution instead of pipelines
3. Use interruptible wait: The wait command can be interrupted by signals (unlike pipeline execution)


<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

This PR improves the reliability of PowerVS IPI provisioning cleanup in OpenShift CI by fixing signal-handling behavior during install timeouts—preventing failures that previously left stale infrastructure behind (VMs, networks, and DNS records) and caused subsequent jobs to fail.

### What changed (practical impact)
In `ci-operator/step-registry/ipi/install/powervs/install/ipi-install-powervs-install-commands.sh`, the install flow was adjusted so that when an install is interrupted (e.g., due to timeout), the script can still run its cleanup/“next steps” logic instead of exiting prematurely with errors like `Skipping: /tmp/secret/metadata.json not found`.

### How it fixes the timeout/deprovision failure
- **Signal handling is consolidated:** The script’s trap setup was reworked so the intended handler is not overwritten (ensuring termination signals trigger the correct logic).
- **Installer execution no longer blocks signal processing:** Instead of running `openshift-install ... | grep ...` as a foreground pipeline (which can delay trap execution until the pipeline completes), `openshift-install` is now run in the background with output filtered via process substitution.
- **Uses `wait` for interruptible lifecycle control:** The script captures the installer PID, waits on it, and propagates the resulting status—so SIGTERM/SIGKILL timing doesn’t prevent the script from executing its termination path.

These adjustments apply to both phases:
- `openshift-install --dir=... create cluster`
- `openshift-install wait-for install-complete`

Overall, PowerVS IPI jobs are less likely to strand cloud resources when interrupted, reducing downstream retries and cascading failures.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 23, 2026
@hamzy

hamzy commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-ovn-powervs-capi-multi-p-p

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

@hamzy: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants