
WIP: coordinate parallel import steps in shared namespace with lock/done ConfigMaps for aggregated runs#5050

Closed

deepsm007 wants to merge 1 commit into openshift:main from deepsm007:fix/aggregate-namespace-input-hash

Conversation

@deepsm007 (Contributor) commented Mar 26, 2026

What I found while debugging

I was chasing failures in aggregated payload runs where release:initial would fail with:

failed to get CLI image: unable to wait for the 'cli' image in the stable stream to populate

Logs usually showed a multi-minute wait before the failure. What stood out was that only some of the parallel ci-operator jobs failed, not all of them, which didn’t look like “the whole cluster is slow.” It looked like several ci-operators in the same shared ci-op-* namespace stepping on each other.

When I traced it, the import_release path isn’t safe to run fully in parallel against the same namespace and same initial pullspec: all jobs touch the same release:<name> import, the same stable-<name>:cli expectation, the same extract pod / ConfigMap / ImageStream merge work. Races there line up with intermittent timeouts waiting for cli on stable-initial, even though we’re not trying to use different stream names; we need stable-initial to stay canonical.

So the parallel ci-operators are treating the same import as independent work, when it’s really one logical import that should complete once and be reused.

Constraints I had to respect

  • I still want one namespace for the aggregate (I’m not trying to fan out into N namespaces per shard).
  • I can’t fix this by renaming streams (e.g. suffixed stable-initial-*); the names need to stay standard.

What I’m doing in this PR

I added coordination in-namespace: lock + done markers (ConfigMaps) keyed by release name and a short hash of the pullspec, so one run does the import/extract/merge, and the others wait and skip once that work is finished—without changing stable-initial or the shared-namespace model.
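As a rough sketch of that flow (illustrative only: `importOnce`, `fakeCluster`, and the ConfigMap name prefixes are stand-ins I invented for this example, not the PR's actual code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// errAlreadyExists stands in for the apimachinery IsAlreadyExists case.
var errAlreadyExists = errors.New("already exists")

// fakeCluster models just enough of a namespace's ConfigMaps to show the flow.
type fakeCluster struct {
	mu  sync.Mutex
	cms map[string]bool
}

func newFakeCluster() *fakeCluster { return &fakeCluster{cms: map[string]bool{}} }

func (c *fakeCluster) create(name string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cms[name] {
		return errAlreadyExists
	}
	c.cms[name] = true
	return nil
}

func (c *fakeCluster) exists(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.cms[name]
}

func (c *fakeCluster) delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.cms, name)
}

// shortHash derives a short, deterministic suffix from the pullspec.
func shortHash(pullSpec string) string {
	sum := sha256.Sum256([]byte(pullSpec))
	return hex.EncodeToString(sum[:])[:8]
}

// importOnce: whoever creates the lock first does the import and writes the
// done marker; everyone else waits for the marker and skips the work.
func importOnce(c *fakeCluster, release, pullSpec string, doImport func()) bool {
	lock := fmt.Sprintf("release-import-lock-%s-%s", release, shortHash(pullSpec))
	done := fmt.Sprintf("release-import-done-%s-%s", release, shortHash(pullSpec))
	if c.exists(done) {
		return false // a previous run already finished this import
	}
	if err := c.create(lock); errors.Is(err, errAlreadyExists) {
		// another run holds the lock; real code polls with backoff and a
		// timeout (a holder that dies before writing done would hang us here)
		for !c.exists(done) {
		}
		return false
	}
	defer c.delete(lock) // release the lock even if the import fails
	doImport()
	_ = c.create(done) // mark the import finished for peers
	return true
}

func main() {
	c := newFakeCluster()
	imports := 0
	for i := 0; i < 3; i++ {
		importOnce(c, "initial", "quay.io/example@sha256:abc", func() { imports++ })
	}
	fmt.Println("imports performed:", imports) // one import, two skips
}
```

The key property is that the lock/done names are pure functions of (release name, pullspec), so every shard computes the same names without any extra coordination channel.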

/cc @openshift/test-platform

@openshift-ci-robot (Contributor)

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci bot requested review from droslean and pruan-rht March 26, 2026 13:35
@openshift-ci bot (Contributor) commented Mar 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepsm007

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 26, 2026
@coderabbitai bot commented Mar 26, 2026

Walkthrough

The change introduces distributed coordination using ConfigMap-based locking and done markers to prevent duplicate release imports. The run method checks for existing done markers, creates lock ConfigMaps, waits for completion markers, and refactors import logic into a separate helper function.

Changes

Cohort / File(s) Summary
Distributed Coordination via ConfigMaps
pkg/steps/release/import_release.go
Added deterministic coordination using namespaced ConfigMaps with SHA-256–derived resource suffixes. Implements lock acquisition with early return for completed imports, polling logic for done markers, and refactored core import work into a runImportBody helper. Includes three new unexported functions (releaseImportResourceSuffix, waitForReleaseImportDone, runImportBody) and two new constants (releaseImportLockPrefix, releaseImportDonePrefix) to manage distributed state.
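The SHA-256–derived suffix mentioned above could look something like this (a sketch; the exact truncation length and name prefixes are assumptions, not taken from the PR):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// releaseImportResourceSuffix derives a stable, collision-resistant suffix
// from the release pullspec, so that lock/done ConfigMap names come out
// identical across all parallel ci-operator runs in one namespace.
func releaseImportResourceSuffix(pullSpec string) string {
	sum := sha256.Sum256([]byte(pullSpec))
	return hex.EncodeToString(sum[:])[:10] // 10 hex chars is an assumed length
}

func main() {
	// every shard computes the same names for the same pullspec
	suffix := releaseImportResourceSuffix("registry.ci.openshift.org/ocp/release@sha256:deadbeef")
	fmt.Println("release-import-lock-initial-" + suffix)
	fmt.Println("release-import-done-initial-" + suffix)
}
```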

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Comment @coderabbitai help to get the list of available commands and usage tips.

@deepsm007 deepsm007 changed the title include target-additional-suffix in namespace input hash for aggregated runs WIP: include target-additional-suffix in namespace input hash for aggregated runs Mar 26, 2026
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 26, 2026
@openshift-ci-robot (Contributor)

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e

@danilo-gemoli (Contributor)

What's the rationale behind this change?

@deepsm007 deepsm007 force-pushed the fix/aggregate-namespace-input-hash branch from ce4580e to 75eda8f Compare March 26, 2026 21:11
@deepsm007 deepsm007 changed the title WIP: include target-additional-suffix in namespace input hash for aggregated runs WIP: coordinate parallel import steps in shared namespace with lock/done ConfigMaps for aggregated runs Mar 26, 2026
@deepsm007 (Contributor, Author)

@coderabbitai review

@coderabbitai bot commented Mar 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/steps/release/import_release.go`:
- Around line 151-159: The current branch only waits for doneName after seeing
kerrors.IsAlreadyExists on s.client.Create(ctx, lockCM), which can hang peers if
the lock holder exits without creating doneName; modify the logic in the
Create()/AlreadyExists handling (and the analogous block around lines 182-198)
to loop-retry Create: when Create returns AlreadyExists, poll the lock ConfigMap
via s.client.Get (or equivalent) with a short backoff until either the lock CM
is gone or DefaultImageImportTimeout elapses; if the lock is removed, retry
s.client.Create(ctx, lockCM); if doneName appears while waiting, return as
before; propagate errors appropriately and ensure waitForReleaseImportDone is
still used for the normal “wait for completion” case.
- Around line 161-166: The deferred lock cleanup currently reuses ctx (which may
be canceled by runImportBody()), causing the ConfigMap lock to be left behind;
change the defer to call s.client.Delete using a fresh context (e.g.
context.Background() or a short context.WithTimeout) instead of ctx when
deleting the ConfigMap (the block that constructs del with Namespace: ns, Name:
lockName and calls s.client.Delete). Keep the existing kerrors.IsNotFound check
and logging (logrus.WithError) but ensure the delete uses the newCtx so
cancellation of runImportBody() won't prevent lock cleanup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1ea9a93e-ad00-4d87-b065-b87fb4b4cb95

📥 Commits

Reviewing files that changed from the base of the PR and between ce4580e and 75eda8f.

📒 Files selected for processing (1)
  • pkg/steps/release/import_release.go

Comment on lines +151 to +159
if err := s.client.Create(ctx, lockCM); err != nil {
	if !kerrors.IsAlreadyExists(err) {
		return fmt.Errorf("create release import lock: %w", err)
	}
	logrus.Infof("waiting for another ci-operator to finish release %s import in this namespace", s.name)
	if err := waitForReleaseImportDone(ctx, s.client, ns, doneName, pullSpec, utils.DefaultImageImportTimeout); err != nil {
		return err
	}
	return nil

⚠️ Potential issue | 🟠 Major

Retry lock acquisition when the current importer exits early.

This path only waits for doneName. If the lock holder returns any error before creating the done ConfigMap, every peer that already saw AlreadyExists will sit there until DefaultImageImportTimeout even after the lock is gone. That turns one transient importer failure into a namespace-wide timeout. Re-check the lock while waiting and retry Create() when it disappears instead of waiting solely on doneName.

Also applies to: 182-198
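The retry the reviewer is asking for might look like this sketch (`fakeCluster` and `acquireOrWait` are illustrative stand-ins for the real client calls, not the PR's code):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errAlreadyExists = errors.New("already exists")

// fakeCluster models a namespace's ConfigMaps with just create/exists/delete.
type fakeCluster struct {
	mu  sync.Mutex
	cms map[string]bool
}

func newFakeCluster() *fakeCluster { return &fakeCluster{cms: map[string]bool{}} }

func (c *fakeCluster) create(name string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cms[name] {
		return errAlreadyExists
	}
	c.cms[name] = true
	return nil
}

func (c *fakeCluster) exists(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.cms[name]
}

func (c *fakeCluster) delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.cms, name)
}

// acquireOrWait implements the suggested fix: while waiting, watch both the
// done marker and the lock itself. If the lock disappears without a done
// marker (the holder crashed), retry Create instead of hanging on done.
func acquireOrWait(c *fakeCluster, lock, done string, maxPolls int) (acquired, finished bool) {
	for i := 0; i < maxPolls; i++ {
		if c.exists(done) {
			return false, true // a peer finished the import
		}
		if err := c.create(lock); err == nil {
			return true, false // lock is free (again): we do the import
		}
		// lock still held by a peer: real code sleeps with backoff here
	}
	return false, false // timed out
}

func main() {
	c := newFakeCluster()
	_ = c.create("lock") // simulate a peer currently holding the lock
	acquired, _ := acquireOrWait(c, "lock", "done", 3)
	fmt.Println("while held:", acquired) // while held: false

	c.delete("lock") // the holder exits without ever writing "done"
	acquired, _ = acquireOrWait(c, "lock", "done", 3)
	fmt.Println("after holder crash:", acquired) // after holder crash: true
}
```

With the wait-only version, the second call would block on the done marker forever; polling the lock turns a crashed holder into a retry instead of a namespace-wide timeout.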


Comment on lines +161 to +166
defer func() {
	del := &coreapi.ConfigMap{ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: lockName}}
	if err := s.client.Delete(ctx, del); err != nil && !kerrors.IsNotFound(err) {
		logrus.WithError(err).Warn("failed to delete release import lock configmap")
	}
}()

⚠️ Potential issue | 🟠 Major

Use a fresh context for deferred lock cleanup.

The deferred delete reuses ctx, so a cancellation or deadline from runImportBody() will usually leave the lock ConfigMap behind. Subsequent jobs then keep hitting AlreadyExists and wait on a done marker that will never be written.

Suggested fix
 	defer func() {
+		cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+		defer cancel()
 		del := &coreapi.ConfigMap{ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: lockName}}
-		if err := s.client.Delete(ctx, del); err != nil && !kerrors.IsNotFound(err) {
+		if err := s.client.Delete(cleanupCtx, del); err != nil && !kerrors.IsNotFound(err) {
 			logrus.WithError(err).Warn("failed to delete release import lock configmap")
 		}
 	}()

@petr-muller (Member)

When I traced it, the import_release path isn’t safe to run fully in parallel against the same namespace and same initial pullspec: all jobs touch the same release:<name> import, the same stable-<name>:cli expectation, the same extract pod / ConfigMap / ImageStream merge work.

This sounds more like conjecture than evidence. There's no reason why these operations should be inherently unsafe to do in parallel?

Locking is hard, and in the kube model there are not many use cases where locks would be necessary (and if so, it should not be a ConfigMap; if anything, it should be a Lease: https://kubernetes.io/docs/concepts/architecture/leases/#custom-workload).

But in general I do not see real argument for why this should need a lock.
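For comparison, a Lease in the coordination.k8s.io/v1 API carries holder identity and expiry natively (the names and values below are hypothetical, not from this PR):

```yaml
# Hypothetical Lease that could stand in for the ConfigMap lock.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: release-import-initial     # made-up name for illustration
  namespace: ci-op-znc6ijvz        # example shared namespace from this PR's logs
spec:
  holderIdentity: ci-operator-shard-0   # hypothetical holder
  leaseDurationSeconds: 60
  acquireTime: "2026-03-26T21:12:18.000000Z"
  renewTime: "2026-03-26T21:12:18.000000Z"
```

Unlike a bare ConfigMap, a stale Lease is detectable: peers can compare renewTime against leaseDurationSeconds and take over an expired lock, which sidesteps the crashed-holder hang described in the review above.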

@openshift-ci-robot (Contributor)

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e

@openshift-ci bot commented Mar 26, 2026

@deepsm007: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                  Commit   Details  Required  Rerun command
ci/prow/e2e                75eda8f  link     true      /test e2e
ci/prow/breaking-changes   75eda8f  link     false     /test breaking-changes

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@petr-muller (Member)

One other thought that comes to mind is why is this happening so often specifically for /payload-triggered jobs. It is not uncommon for a PR to trigger multiple presubmits at the same time (running in the same namespace...) and these also run the same code path, but we do not seem to see this in standard presubmits as often as in /payload jobs. So something is different for these jobs, but not sure what it is.

@deepsm007 (Contributor, Author)

One other thought that comes to mind is why is this happening so often specifically for /payload-triggered jobs. It is not uncommon for a PR to trigger multiple presubmits at the same time (running in the same namespace...) and these also run the same code path, but we do not seem to see this in standard presubmits as often as in /payload jobs. So something is different for these jobs, but not sure what it is.

Any two ci-operator runs can land in the same ci-op-* namespace whenever resolveInputs ends up with the same inputHash, because the namespace is literally ci-op-{id} and {id} is that hash; see https://github.com/openshift/ci-tools/blob/main/cmd/ci-operator/main.go#L1420-L1430. The hash comes from step Inputs(), the marshaled config, optional --input-hash extras, and the binary stamp (https://github.com/openshift/ci-tools/blob/main/cmd/ci-operator/main.go#L1372-L1421), so I’m not saying normal presubmits can never share a namespace; if two unrelated jobs ever hashed the same inputs, they’d collide. In real life, most concurrent presubmits differ enough (--target, refs, graph) that the hashes usually differ.

What feels specific to PRPQR / /payload-style aggregates is that we deliberately wire many shard ProwJobs so they don’t split that hash: generateProwjob adds --target-additional-suffix when aggregateIndex is set and always adds CustomHashInput(prpqrName) (which becomes --input-hash=<prpqr name> on the pod). Then resolveInputs strips the suffix from test.As only for hashing, so every shard still maps to one namespace.

So IIUC the story isn’t “payload runs different release-import code”; it’s topology: aggregates systematically run N parallel ci-operators in one namespace on the same initial import path. Typical presubmit fan-out usually isn’t trying to line up N sharded copies of the same test in one namespace on purpose; if a namespace is shared elsewhere, that’s more of an accident (identical hash inputs) than this designed fan-in.


@petr-muller (Member) commented Mar 30, 2026

I’m not saying normal presubmits can never share a namespace; if two unrelated jobs ever hashed the same inputs, they’d collide. In real life, most concurrent presubmits differ enough (--target, refs, graph) that the hashes usually differ.

That's unfortunately an utter hallucination. Of course most concurrent presubmits (for the same code revision) share a namespace. That's how ci-operator was designed to work, and it is intentional. Click random presubmits on this PR, pretty much all of them will have this:

INFO[2026-03-26T21:12:18Z] Using namespace https://console-openshift-console.apps.build10.ci.devcluster.openshift.com/k8s/cluster/projects/ci-op-znc6ijvz 

That's how ci-operator avoids building everything separately for each job.

The hallucination is accidentally correct on the following:

Typical presubmit fan-out usually isn’t trying to line up N sharded copies of the same test in one namespace on purpose;

But it misses the main point. "Standard" presubmits will happily run different e2e jobs in the same namespace (like e2e-aws, e2e-gcp, e2e-aws-upgrade and so on). But pretty much all the workloads preceding the actual multi-step workflow (mostly builds, but also stuff like input and prerequisite image imports) will be shared between the jobs. And many of these jobs will have the "import CLI image from a release" step as well, and it will be shared in normal presubmits. The aggregate payload jobs technically run the same jobs X times, which normally is not possible, and that is why there are the suffixes: you cannot have X ci-operators each running exactly e2e-aws (they would race over that name), but you can have each of them run e2e-aws-0, e2e-aws-1 and so on. The actual payload underneath these names then does not matter. It happens to be identical workloads, but from the ci-operator PoV they can be anything.

But somehow, the latter mode (run X ci-operator copies doing the same thing with different names) runs into the race importing the CLI image, while the former (run X ci-operator copies running different things) apparently does not.

Actually, writing this I realized there is actually a difference between a typical presubmit and the payload job. A typical presubmit will not import the CLI image from a release payload image (like a CI or Nightly payload); it will use the cli image from an imagestream that was populated from the integration imagestream. Importing a full payload image is very likely a different ci-operator path, not seen that much in presubmits. So that's where I would look.

@deepsm007 deepsm007 closed this Mar 31, 2026