[CP 1274] [Fix] GPUOP-607 fail the ANR workflow when imagePullBackOff #500
Merged
sajmera-pensando merged 1 commit into ROCm:main on Apr 3, 2026
* [Fix] GPUOP-607 fail the ANR workflow when imagePullBackOff

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>

* Update internal/controllers/remediation/scripts/test.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update internal/controllers/remediation/scripts/test.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

(cherry picked from commit 344e480dacbe64c8c3e99129be5c3b09537fc5e5)
cp of pensando/gpu-operator#1274
Source PR Description (pensando/gpu-operator#1274):
https://pensando.atlassian.net/browse/GPUOP-607
Add Early Detection for ImagePullBackOff in Test Runner
Summary
Enhanced the test runner script to detect ImagePullBackOff and ErrImagePull errors early, preventing unnecessary waiting until timeout when container images fail to pull.
Problem Statement
Previously, when a test runner job encountered an ImagePullBackOff error (either in the main container or init container), the monitoring script would continue waiting until the full timeout period expired before reporting failure. This resulted in:
Changes Made
File Modified
internal/controllers/remediation/scripts/test.sh
Implementation Details
Added proactive pod status checking within the job monitoring loop (lines 209-225):
Container Status Checks: On each iteration, query the status of both:
* Main container (amd-test-runner)
* Init container (driver-init)
Early Exit Conditions: Detect and exit immediately when either:
* ImagePullBackOff state is detected
* ErrImagePull state is detected
Error Reporting: On detection, the script:
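For illustration only, the early-exit check described here could be sketched roughly as follows. This is not the code added to test.sh; the function names (is_image_pull_failure, waiting_reasons, check_image_pull) are hypothetical, and the sketch assumes kubectl is on PATH and that the caller already knows the pod name and namespace.

```shell
# Hypothetical sketch, not the actual contents of test.sh.

# Return 0 when a container "waiting" reason indicates an image pull failure.
is_image_pull_failure() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the waiting reason of every init container and main container in a pod,
# one per line. Containers that are not waiting produce an empty line.
waiting_reasons() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.state.waiting.reason}{"\n"}{end}{range .status.containerStatuses[*]}{.state.waiting.reason}{"\n"}{end}'
}

# Return 1 and report on stderr if any container in the pod is stuck pulling
# its image; return 0 otherwise (including when the pod cannot be queried).
check_image_pull() {
  local pod="$1" ns="$2" reasons reason
  reasons=$(waiting_reasons "$pod" "$ns") || return 0
  # Unquoted on purpose: word splitting skips the empty lines.
  for reason in $reasons; do
    if is_image_pull_failure "$reason"; then
      echo "pod $pod in namespace $ns: image pull failed ($reason)" >&2
      return 1
    fi
  done
  return 0
}
```

Inside a job monitoring loop, a call such as `check_image_pull "$pod" "$ns" || exit 1` on each iteration would stop the wait as soon as a pull failure shows up, rather than sleeping until the timeout.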
Cherrypick triggered by: ACP-Automation