[CP 1274] [Fix] GPUOP-607 fail the ANR workflow when imagePullBackOff #500

Merged
sajmera-pensando merged 1 commit into ROCm:main from ci-penbot-01:CP.O2O.pensando.gpu-operator.1274.rocm.gpu-operator.main
Apr 3, 2026

@ci-penbot-01
Contributor

Cherry-pick of pensando/gpu-operator#1274


Source PR Description (pensando/gpu-operator#1274):

https://pensando.atlassian.net/browse/GPUOP-607

Add Early Detection for ImagePullBackOff in Test Runner

Summary

Enhanced the test runner script to detect ImagePullBackOff and ErrImagePull errors early, preventing unnecessary waiting until timeout when container images fail to pull.

Problem Statement

Previously, when a test runner job encountered an ImagePullBackOff error (either in the main container or init container), the monitoring script would continue waiting until the full timeout period expired before reporting failure. This resulted in:

  • Wasted time waiting for timeouts (potentially hours for long-running tests)
  • Delayed feedback on configuration errors
  • Inefficient resource usage in CI/CD pipelines
  • Poor developer experience when debugging image pull issues

Changes Made

File Modified

  • internal/controllers/remediation/scripts/test.sh

Implementation Details

Added proactive pod status checking within the job monitoring loop (lines 209-225):

  1. Container Status Checks: On each iteration, query the status of both:
    • Main container (amd-test-runner)
    • Init container (driver-init)

  2. Early Exit Conditions: Detect and exit immediately when either:
    • ImagePullBackOff state is detected
    • ErrImagePull state is detected

  3. Error Reporting: On detection, the script:
    • Displays a clear error message indicating which component failed
    • Shows pod events with detailed image pull error information
    • Attempts to display any available logs
    • Exits with status code 1
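
For readers who want a concrete picture of the three steps above, here is a minimal sketch. It is not the code from test.sh: the function names (`is_image_pull_failure`, `waiting_reason`, `check_image_pull`) and the argument order are assumptions made for illustration. Only the container names `amd-test-runner` and `driver-init` and the two waiting reasons come from the description.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names are hypothetical, not from test.sh.

# Return 0 when a container waiting reason indicates an image pull failure.
is_image_pull_failure() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the waiting reason (empty if none) for one container of a pod.
# $1 = namespace, $2 = pod name,
# $3 = status field (containerStatuses or initContainerStatuses),
# $4 = container name
waiting_reason() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath="{.status.$3[?(@.name==\"$4\")].state.waiting.reason}" 2>/dev/null
}

# Check the main and init containers. On an image pull failure, print an
# error, the pod description (which includes events), and any available
# logs, then return 1. Return 0 otherwise.
check_image_pull() {
  local ns="$1" pod="$2" spec field name reason
  for spec in "containerStatuses:amd-test-runner" "initContainerStatuses:driver-init"; do
    field="${spec%%:*}"
    name="${spec##*:}"
    reason="$(waiting_reason "$ns" "$pod" "$field" "$name")"
    if is_image_pull_failure "$reason"; then
      echo "ERROR: container $name in pod $pod is in $reason" >&2
      kubectl describe pod "$pod" -n "$ns" >&2 || true
      kubectl logs "$pod" -n "$ns" -c "$name" >&2 || true
      return 1
    fi
  done
  return 0
}
```

Inside a monitoring loop, a call such as `check_image_pull "$NAMESPACE" "$POD" || exit 1` on each iteration would give the early exit described here; `NAMESPACE` and `POD` are placeholder variable names.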

Cherrypick triggered by: ACP-Automation

* [Fix] GPUOP-607 fail the ANR workflow when imagePullBackOff

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>

* Update internal/controllers/remediation/scripts/test.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update internal/controllers/remediation/scripts/test.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit 344e480dacbe64c8c3e99129be5c3b09537fc5e5)
@sajmera-pensando sajmera-pensando merged commit 5abf58b into ROCm:main Apr 3, 2026
3 checks passed
3 participants