Skip to content

[DO NOT MERGE] Revert "Only apply 5-minute sleep workaround for known multi-arch repos"#5055

Open
kunalmemane wants to merge 4 commits intoopenshift:mainfrom
kunalmemane:OCPBUGS-65845-race-condition
Open

[DO NOT MERGE] Revert "Only apply 5-minute sleep workaround for known multi-arch repos"#5055
kunalmemane wants to merge 4 commits intoopenshift:mainfrom
kunalmemane:OCPBUGS-65845-race-condition

Conversation

@kunalmemane
Copy link
Copy Markdown
Member

This reverts commit 1d12ac7.

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@kunalmemane
Copy link
Copy Markdown
Member Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 27, 2026

Walkthrough

Removed the predicate-based multi-arch workaround: call sites stopped passing a function-valued workaround predicate to handleBuild; ImageBuildOptions.NeedsMultiArchWorkaround and the multi-arch allowlist/predicate were removed; handleBuild signature changed and the previous conditional workaround sleep was deleted (no runtime delay remains).

Changes

Cohort / File(s) Summary
Call-site updates
pkg/steps/bundle_source.go, pkg/steps/project_image.go
Removed the inline func() bool { ... } argument passed to handleBuild; callers now invoke handleBuild(..., b) without a predicate.
Core multi-arch refactor
pkg/steps/source.go
Removed ImageBuildOptions.NeedsMultiArchWorkaround, deleted multiArchRepos and label-derived predicate logic, updated handleBuild signature to drop the predicate parameter, and removed the workaround-gated time.Sleep behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Tools execution failed with the following error:

Failed to run tools: 14 UNAVAILABLE: read ECONNRESET


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from bear-redhat and droslean March 27, 2026 08:01
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kunalmemane
Once this PR has been reviewed and has the lgtm label, please assign droslean for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/steps/source.go`:
- Line 493: The call to handleBuild(ctx, buildClient, podClient, b, true)
unconditionally enables the 5-minute multi-arch delay for all builds; revert
this by restoring the original gating logic (e.g., check the repository
allowlist or the predicate that previously determined if the multi-arch
workaround is needed) and only pass true when that check returns true. Locate
the callers in handleBuilds/handleBuild and replace the hardcoded true with the
original condition (for example needsMultiArchWorkaround(b.Repo) or
repoAllowlistContains(b.Repo)) or a feature-flag variable so only affected
repositories sleep 5 minutes. Ensure the symbolic names handleBuild,
handleBuilds and the repo allowlist/predicate are used to find and update the
code.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1dacf753-4d04-4642-b731-b8c57fef3213

📥 Commits

Reviewing files that changed from the base of the PR and between 6280ffc and df878a3.

📒 Files selected for processing (3)
  • pkg/steps/bundle_source.go
  • pkg/steps/project_image.go
  • pkg/steps/source.go

defer wg.Done()
metricsAgent.AddNodeWorkload(ctx, b.Namespace, fmt.Sprintf("%s-build", b.Name), b.Name, podClient)
if err := handleBuild(ctx, buildClient, podClient, b, needsMultiArchWorkaround); err != nil {
if err := handleBuild(ctx, buildClient, podClient, b, true); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Unconditional 5-minute delay applied to all multi-arch builds.

Passing true here means every build routed through handleBuilds will now sleep 5 minutes, regardless of whether the repository actually needs the multi-arch workaround. Previously, this was gated by a repository allowlist check.

Given the PR title indicates "DO NOT MERGE," this appears intentional for testing/investigation, but merging this would introduce a significant build-time regression.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/steps/source.go` at line 493, The call to handleBuild(ctx, buildClient,
podClient, b, true) unconditionally enables the 5-minute multi-arch delay for
all builds; revert this by restoring the original gating logic (e.g., check the
repository allowlist or the predicate that previously determined if the
multi-arch workaround is needed) and only pass true when that check returns
true. Locate the callers in handleBuilds/handleBuild and replace the hardcoded
true with the original condition (for example needsMultiArchWorkaround(b.Repo)
or repoAllowlistContains(b.Repo)) or a feature-flag variable so only affected
repositories sleep 5 minutes. Ensure the symbolic names handleBuild,
handleBuilds and the repo allowlist/predicate are used to find and update the
code.

@kunalmemane
Copy link
Copy Markdown
Member Author

/retest images

@kunalmemane
Copy link
Copy Markdown
Member Author

/test images

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
pkg/steps/source.go (1)

469-475: ⚠️ Potential issue | 🟠 Major

Restore the per-build gate for the 5-minute workaround.

Removing NeedsMultiArchWorkaround and sleeping unconditionally on Line 558 puts every handleBuild caller on the slow path, including the bundle-only callers in pkg/steps/bundle_source.go (Lines 82-85) and pkg/steps/project_image.go (Lines 73-79) that previously opted out. Because this sleep sits inside the retry loop, a flaky build can now burn 25 minutes before retries even recreate the build. Please thread the old predicate/allowlist back through handleBuilds/handleBuild and keep bundle callers on a constant false.

Minimal fix direction
 type ImageBuildOptions struct {
 	Architectures []string
+	NeedsMultiArchWorkaround func() bool
 }
-			if err := handleBuild(ctx, buildClient, podClient, b); err != nil {
+			if err := handleBuild(ctx, buildClient, podClient, b, o.NeedsMultiArchWorkaround); err != nil {
 				errChan <- fmt.Errorf("error occurred handling build %s: %w", b.Name, err)
 			}
-func handleBuild(ctx context.Context, client BuildClient, podClient kubernetes.PodClient, build buildapi.Build) error {
+func handleBuild(ctx context.Context, client BuildClient, podClient kubernetes.PodClient, build buildapi.Build, needsMultiArchWorkaround func() bool) error {
 	const attempts = 5
 	ns, name := build.Namespace, build.Name
 	var errs []error
 	if err := wait.ExponentialBackoff(wait.Backoff{Duration: time.Minute, Factor: 1.5, Steps: attempts}, func() (bool, error) {
 		var attempt buildapi.Build
-		// builds are using older src image, adding wait to avoid race condition
-		time.Sleep(5 * time.Minute)
+		if needsMultiArchWorkaround != nil && needsMultiArchWorkaround() {
+			// builds are using older src image, adding wait to avoid race condition
+			time.Sleep(5 * time.Minute)
+		}
 		build.DeepCopyInto(&attempt)

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 493-493, 551-558

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/steps/source.go` around lines 469 - 475, The unconditional 5-minute sleep
was reintroduced and affects all handleBuild callers; restore a per-build gate
by reintroducing the NeedsMultiArchWorkaround predicate and thread it through
handleBuilds -> handleBuild (and any build option structs like
ImageBuildOptions) so only builds that need the multi-arch workaround sleep;
ensure bundle-only callers (the callers in bundle_source.go and
project_image.go) pass/propagate a false value (or opt out) so they remain on
the fast path and the retry loop no longer incurs the long sleep for them.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/steps/source.go`:
- Around line 469-475: The unconditional 5-minute sleep was reintroduced and
affects all handleBuild callers; restore a per-build gate by reintroducing the
NeedsMultiArchWorkaround predicate and thread it through handleBuilds ->
handleBuild (and any build option structs like ImageBuildOptions) so only builds
that need the multi-arch workaround sleep; ensure bundle-only callers (the
callers in bundle_source.go and project_image.go) pass/propagate a false value
(or opt out) so they remain on the fast path and the retry loop no longer incurs
the long sleep for them.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f5e9dac1-7a50-452a-8991-e39200dea768

📥 Commits

Reviewing files that changed from the base of the PR and between df878a3 and 3ed7a4c.

📒 Files selected for processing (3)
  • pkg/steps/bundle_source.go
  • pkg/steps/project_image.go
  • pkg/steps/source.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/steps/project_image.go

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
pkg/steps/source.go (1)

551-558: ⚠️ Potential issue | 🟠 Major

Restore conditional gating for the pre-create sleep

Line 558 now sleeps unconditionally for every build attempt, and with the signature change at Line 551 there is no longer a way to scope this workaround to affected repos/builds. This adds avoidable latency to all builds and can materially slow CI throughput.

Proposed direction
 type ImageBuildOptions struct {
   Architectures []string
+  NeedsMultiArchWorkaround bool
 }

-func handleBuild(ctx context.Context, client BuildClient, podClient kubernetes.PodClient, build buildapi.Build) error {
+func handleBuild(ctx context.Context, client BuildClient, podClient kubernetes.PodClient, build buildapi.Build, needsMultiArchWorkaround bool) error {
   const attempts = 5
   ...
-  // builds are using older src image, adding wait to avoid race condition
-  time.Sleep(1 * time.Minute)
+  if needsMultiArchWorkaround {
+    // builds are using older src image, adding wait to avoid race condition
+    time.Sleep(1 * time.Minute)
+  }
   ...
 }

As per coding guidelines, "-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/steps/source.go` around lines 551 - 558, The unconditional time.Sleep(1 *
time.Minute) inside handleBuild (within wait.ExponentialBackoff) causes every
build attempt to pause and must be gated; change handleBuild to only perform the
pre-create sleep for builds that actually need the older-src-image workaround
(e.g., inspect build.Metadata/Annotations/Labels or
build.Spec.SourceImage/version) and only before the first attempt rather than on
every retry. Locate the sleep in handleBuild and replace it with a conditional
check (or a helper like needsOldSrcImage(build)) so the delay runs only for
affected builds; ensure the logic uses existing build fields
(labels/annotations/spec) and does not add global latency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/steps/source.go`:
- Around line 551-558: The unconditional time.Sleep(1 * time.Minute) inside
handleBuild (within wait.ExponentialBackoff) causes every build attempt to pause
and must be gated; change handleBuild to only perform the pre-create sleep for
builds that actually need the older-src-image workaround (e.g., inspect
build.Metadata/Annotations/Labels or build.Spec.SourceImage/version) and only
before the first attempt rather than on every retry. Locate the sleep in
handleBuild and replace it with a conditional check (or a helper like
needsOldSrcImage(build)) so the delay runs only for affected builds; ensure the
logic uses existing build fields (labels/annotations/spec) and does not add
global latency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 805e612d-abb1-4645-8564-0070ed445643

📥 Commits

Reviewing files that changed from the base of the PR and between 3ed7a4c and bc5fecd.

📒 Files selected for processing (1)
  • pkg/steps/source.go

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/steps/source.go`:
- Around line 551-559: In handleBuild, remove the commented-out sleep and its
explanatory comment (the dead code around the "builds are using older src image"
sleep) or replace it with a deterministic wait: implement a proper retry/wait
that checks the build/pod image readiness before proceeding (e.g., use
wait.PollImmediate or extend the existing ExponentialBackoff to poll the
relevant image/version field on the attempt/build or associated pod via
podClient until it reflects the expected src image), ensuring you update the
logic around build.DeepCopyInto(&attempt) accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c8023165-a9f4-4c49-b930-d6d27164d2c2

📥 Commits

Reviewing files that changed from the base of the PR and between bc5fecd and 543ae8e.

📒 Files selected for processing (1)
  • pkg/steps/source.go

Comment on lines +551 to 559
func handleBuild(ctx context.Context, client BuildClient, podClient kubernetes.PodClient, build buildapi.Build) error {
const attempts = 5
ns, name := build.Namespace, build.Name
var errs []error
if err := wait.ExponentialBackoff(wait.Backoff{Duration: time.Minute, Factor: 1.5, Steps: attempts}, func() (bool, error) {
var attempt buildapi.Build
// TODO: This is a workaround to avoid race condition with multiple ci-operator instances building different architectures
// See https://issues.redhat.com/browse/OCPBUGS-65845
if needsMultiArchWorkaround != nil && needsMultiArchWorkaround() {
time.Sleep(5 * time.Minute)
}
// builds are using older src image, adding wait to avoid race condition
//time.Sleep(1 * time.Minute)
build.DeepCopyInto(&attempt)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Commented-out race condition fix is a concern if merged.

Lines 557-558 contain dead code with a comment acknowledging a race condition ("builds are using older src image"). The sleep that previously mitigated this is now commented out.

Given the PR title "DO NOT MERGE" and /hold label, this appears intentional for investigation. However, if this were to be merged:

  • The race condition described in the comment would remain unaddressed
  • The commented-out code should be removed entirely or replaced with a proper fix

Consider removing the dead code and comment, or implementing a proper solution before any future merge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/steps/source.go` around lines 551 - 559, In handleBuild, remove the
commented-out sleep and its explanatory comment (the dead code around the
"builds are using older src image" sleep) or replace it with a deterministic
wait: implement a proper retry/wait that checks the build/pod image readiness
before proceeding (e.g., use wait.PollImmediate or extend the existing
ExponentialBackoff to poll the relevant image/version field on the attempt/build
or associated pod via podClient until it reflects the expected src image),
ensuring you update the logic around build.DeepCopyInto(&attempt) accordingly.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 27, 2026

@kunalmemane: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/validate-vendor 543ae8e link true /test validate-vendor
ci/prow/integration 543ae8e link true /test integration
ci/prow/images 543ae8e link true /test images
ci/prow/breaking-changes 543ae8e link false /test breaking-changes

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants