
WIP: coordinate parallel import steps in shared namespace with lock/done ConfigMaps for aggregated runs#5050

Closed

deepsm007 wants to merge 1 commit into openshift:main from deepsm007:fix/aggregate-namespace-input-hash

Conversation

@deepsm007 (Contributor) commented Mar 26, 2026

What I found while debugging

I was chasing failures in aggregated payload runs where release:initial would fail with:

failed to get CLI image: unable to wait for the 'cli' image in the stable stream to populate

Logs usually showed a multi-minute wait before the failure. What stood out was that only some of the parallel ci-operator jobs failed, not all of them, which didn’t look like “the whole cluster is slow.” It looked like several ci-operators in the same shared ci-op-* namespace stepping on each other.

When I traced it, the import_release path isn’t safe to run fully in parallel against the same namespace and same initial pullspec: all jobs touch the same release:<name> import, the same stable-<name>:cli expectation, the same extract pod / ConfigMap / ImageStream merge work. Races there line up with intermittent timeouts waiting for cli on stable-initial, even though we’re not trying to use different stream names; we need stable-initial to stay canonical.

So the parallel ci-operators are treating the same import as independent work, when it’s really one logical import that should complete once and be reused.

Constraints I had to respect

  • I still want one namespace for the aggregate (I’m not trying to fan out into N namespaces per shard).
  • I can’t fix this by renaming streams (e.g. suffixed stable-initial-*); the names need to stay standard.

What I’m doing in this PR

I added coordination in-namespace: lock + done markers (ConfigMaps) keyed by release name and a short hash of the pullspec, so one run does the import/extract/merge, and the others wait and skip once that work is finished—without changing stable-initial or the shared-namespace model.
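As a rough sketch of that flow (illustrative only: `importOnce`, `fakeCluster`, and the ConfigMap name prefixes are stand-ins I invented for this example, not the PR's actual code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// errAlreadyExists stands in for the apimachinery IsAlreadyExists case.
var errAlreadyExists = errors.New("already exists")

// fakeCluster models just enough of a namespace's ConfigMaps to show the flow.
type fakeCluster struct {
	mu  sync.Mutex
	cms map[string]bool
}

func newFakeCluster() *fakeCluster { return &fakeCluster{cms: map[string]bool{}} }

func (c *fakeCluster) create(name string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cms[name] {
		return errAlreadyExists
	}
	c.cms[name] = true
	return nil
}

func (c *fakeCluster) exists(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.cms[name]
}

func (c *fakeCluster) delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.cms, name)
}

// shortHash derives a short, deterministic suffix from the pullspec.
func shortHash(pullSpec string) string {
	sum := sha256.Sum256([]byte(pullSpec))
	return hex.EncodeToString(sum[:])[:8]
}

// importOnce: whoever creates the lock first does the import and writes the
// done marker; everyone else waits for the marker and skips the work.
func importOnce(c *fakeCluster, release, pullSpec string, doImport func()) bool {
	lock := fmt.Sprintf("release-import-lock-%s-%s", release, shortHash(pullSpec))
	done := fmt.Sprintf("release-import-done-%s-%s", release, shortHash(pullSpec))
	if c.exists(done) {
		return false // a previous run already finished this import
	}
	if err := c.create(lock); errors.Is(err, errAlreadyExists) {
		// another run holds the lock; real code polls with backoff and a
		// timeout (a holder that dies before writing done would hang us here)
		for !c.exists(done) {
		}
		return false
	}
	defer c.delete(lock) // release the lock even if the import fails
	doImport()
	_ = c.create(done) // mark the import finished for peers
	return true
}

func main() {
	c := newFakeCluster()
	imports := 0
	for i := 0; i < 3; i++ {
		importOnce(c, "initial", "quay.io/example@sha256:abc", func() { imports++ })
	}
	fmt.Println("imports performed:", imports) // one import, two skips
}
```

The key property is that the lock/done names are pure functions of (release name, pullspec), so every shard computes the same names without any extra coordination channel.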

/cc @openshift/test-platform

@openshift-ci-robot (Contributor)

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci bot requested review from droslean and pruan-rht March 26, 2026 13:35
@openshift-ci bot (Contributor) commented Mar 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepsm007

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 26, 2026
@coderabbitai bot commented Mar 26, 2026

Walkthrough

The change introduces distributed coordination using ConfigMap-based locking and done markers to prevent duplicate release imports. The run method checks for existing done markers, creates lock ConfigMaps, waits for completion markers, and refactors import logic into a separate helper function.

Changes

Cohort / File(s) Summary
Distributed Coordination via ConfigMaps
pkg/steps/release/import_release.go
Added deterministic coordination using namespaced ConfigMaps with SHA-256–derived resource suffixes. Implements lock acquisition with early return for completed imports, polling logic for done markers, and refactored core import work into a runImportBody helper. Includes three new unexported functions (releaseImportResourceSuffix, waitForReleaseImportDone, runImportBody) and two new constants (releaseImportLockPrefix, releaseImportDonePrefix) to manage distributed state.
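The SHA-256–derived suffix mentioned above could look something like this (a sketch; the exact truncation length and name prefixes are assumptions, not taken from the PR):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// releaseImportResourceSuffix derives a stable, collision-resistant suffix
// from the release pullspec, so that lock/done ConfigMap names come out
// identical across all parallel ci-operator runs in one namespace.
func releaseImportResourceSuffix(pullSpec string) string {
	sum := sha256.Sum256([]byte(pullSpec))
	return hex.EncodeToString(sum[:])[:10] // 10 hex chars is an assumed length
}

func main() {
	// every shard computes the same names for the same pullspec
	suffix := releaseImportResourceSuffix("registry.ci.openshift.org/ocp/release@sha256:deadbeef")
	fmt.Println("release-import-lock-initial-" + suffix)
	fmt.Println("release-import-done-initial-" + suffix)
}
```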

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Comment @coderabbitai help to get the list of available commands and usage tips.

@deepsm007 deepsm007 changed the title include target-additional-suffix in namespace input hash for aggregated runs WIP: include target-additional-suffix in namespace input hash for aggregated runs Mar 26, 2026
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 26, 2026
@openshift-ci-robot (Contributor)

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e

@danilo-gemoli (Contributor)

What's the rationale behind this change?

@deepsm007 deepsm007 force-pushed the fix/aggregate-namespace-input-hash branch from ce4580e to 75eda8f Compare March 26, 2026 21:11
@deepsm007 deepsm007 changed the title WIP: include target-additional-suffix in namespace input hash for aggregated runs WIP: coordinate parallel import steps in shared namespace with lock/done ConfigMaps for aggregated runs Mar 26, 2026
@deepsm007 (Contributor, Author)

@coderabbitai review

@coderabbitai bot commented Mar 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/steps/release/import_release.go`:
- Around line 151-159: The current branch only waits for doneName after seeing
kerrors.IsAlreadyExists on s.client.Create(ctx, lockCM), which can hang peers if
the lock holder exits without creating doneName; modify the logic in the
Create()/AlreadyExists handling (and the analogous block around lines 182-198)
to loop-retry Create: when Create returns AlreadyExists, poll the lock ConfigMap
via s.client.Get (or equivalent) with a short backoff until either the lock CM
is gone or DefaultImageImportTimeout elapses; if the lock is removed, retry
s.client.Create(ctx, lockCM); if doneName appears while waiting, return as
before; propagate errors appropriately and ensure waitForReleaseImportDone is
still used for the normal “wait for completion” case.
- Around line 161-166: The deferred lock cleanup currently reuses ctx (which may
be canceled by runImportBody()), causing the ConfigMap lock to be left behind;
change the defer to call s.client.Delete using a fresh context (e.g.
context.Background() or a short context.WithTimeout) instead of ctx when
deleting the ConfigMap (the block that constructs del with Namespace: ns, Name:
lockName and calls s.client.Delete). Keep the existing kerrors.IsNotFound check
and logging (logrus.WithError) but ensure the delete uses the newCtx so
cancellation of runImportBody() won't prevent lock cleanup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1ea9a93e-ad00-4d87-b065-b87fb4b4cb95

📥 Commits

Reviewing files that changed from the base of the PR and between ce4580e and 75eda8f.

📒 Files selected for processing (1)
  • pkg/steps/release/import_release.go

Comment on lines +151 to +159
if err := s.client.Create(ctx, lockCM); err != nil {
	if !kerrors.IsAlreadyExists(err) {
		return fmt.Errorf("create release import lock: %w", err)
	}
	logrus.Infof("waiting for another ci-operator to finish release %s import in this namespace", s.name)
	if err := waitForReleaseImportDone(ctx, s.client, ns, doneName, pullSpec, utils.DefaultImageImportTimeout); err != nil {
		return err
	}
	return nil

⚠️ Potential issue | 🟠 Major

Retry lock acquisition when the current importer exits early.

This path only waits for doneName. If the lock holder returns any error before creating the done ConfigMap, every peer that already saw AlreadyExists will sit there until DefaultImageImportTimeout even after the lock is gone. That turns one transient importer failure into a namespace-wide timeout. Re-check the lock while waiting and retry Create() when it disappears instead of waiting solely on doneName.

Also applies to: 182-198
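The retry the reviewer is asking for might look like this sketch (`fakeCluster` and `acquireOrWait` are illustrative stand-ins for the real client calls, not the PR's code):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errAlreadyExists = errors.New("already exists")

// fakeCluster models a namespace's ConfigMaps with just create/exists/delete.
type fakeCluster struct {
	mu  sync.Mutex
	cms map[string]bool
}

func newFakeCluster() *fakeCluster { return &fakeCluster{cms: map[string]bool{}} }

func (c *fakeCluster) create(name string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cms[name] {
		return errAlreadyExists
	}
	c.cms[name] = true
	return nil
}

func (c *fakeCluster) exists(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.cms[name]
}

func (c *fakeCluster) delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.cms, name)
}

// acquireOrWait implements the suggested fix: while waiting, watch both the
// done marker and the lock itself. If the lock disappears without a done
// marker (the holder crashed), retry Create instead of hanging on done.
func acquireOrWait(c *fakeCluster, lock, done string, maxPolls int) (acquired, finished bool) {
	for i := 0; i < maxPolls; i++ {
		if c.exists(done) {
			return false, true // a peer finished the import
		}
		if err := c.create(lock); err == nil {
			return true, false // lock is free (again): we do the import
		}
		// lock still held by a peer: real code sleeps with backoff here
	}
	return false, false // timed out
}

func main() {
	c := newFakeCluster()
	_ = c.create("lock") // simulate a peer currently holding the lock
	acquired, _ := acquireOrWait(c, "lock", "done", 3)
	fmt.Println("while held:", acquired) // while held: false

	c.delete("lock") // the holder exits without ever writing "done"
	acquired, _ = acquireOrWait(c, "lock", "done", 3)
	fmt.Println("after holder crash:", acquired) // after holder crash: true
}
```

With the wait-only version, the second call would block on the done marker forever; polling the lock turns a crashed holder into a retry instead of a namespace-wide timeout.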


Comment on lines +161 to +166
defer func() {
	del := &coreapi.ConfigMap{ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: lockName}}
	if err := s.client.Delete(ctx, del); err != nil && !kerrors.IsNotFound(err) {
		logrus.WithError(err).Warn("failed to delete release import lock configmap")
	}
}()

⚠️ Potential issue | 🟠 Major

Use a fresh context for deferred lock cleanup.

The deferred delete reuses ctx, so a cancellation or deadline from runImportBody() will usually leave the lock ConfigMap behind. Subsequent jobs then keep hitting AlreadyExists and wait on a done marker that will never be written.

Suggested fix
 	defer func() {
+		cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+		defer cancel()
 		del := &coreapi.ConfigMap{ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: lockName}}
-		if err := s.client.Delete(ctx, del); err != nil && !kerrors.IsNotFound(err) {
+		if err := s.client.Delete(cleanupCtx, del); err != nil && !kerrors.IsNotFound(err) {
 			logrus.WithError(err).Warn("failed to delete release import lock configmap")
 		}
 	}()

@petr-muller (Member)

When I traced it, the import_release path isn’t safe to run fully in parallel against the same namespace and same initial pullspec: all jobs touch the same release:<name> import, the same stable-<name>:cli expectation, the same extract pod / ConfigMap / ImageStream merge work.

This sounds more like conjecture than evidence. There's no reason why these operations should be inherently unsafe to do in parallel?

Locking is hard, and in the kube model there are not many use cases where locks would be necessary (and if so, it should not be a ConfigMap; if anything, it should be a Lease: https://kubernetes.io/docs/concepts/architecture/leases/#custom-workload).

But in general I do not see real argument for why this should need a lock.
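For comparison, a Lease in the coordination.k8s.io/v1 API carries holder identity and expiry natively (the names and values below are hypothetical, not from this PR):

```yaml
# Hypothetical Lease that could stand in for the ConfigMap lock.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: release-import-initial     # made-up name for illustration
  namespace: ci-op-znc6ijvz        # example shared namespace from this PR's logs
spec:
  holderIdentity: ci-operator-shard-0   # hypothetical holder
  leaseDurationSeconds: 60
  acquireTime: "2026-03-26T21:12:18.000000Z"
  renewTime: "2026-03-26T21:12:18.000000Z"
```

Unlike a bare ConfigMap, a stale Lease is detectable: peers can compare renewTime against leaseDurationSeconds and take over an expired lock, which sidesteps the crashed-holder hang described in the review above.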

@openshift-ci-robot (Contributor)

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e

@openshift-ci bot commented Mar 26, 2026

@deepsm007: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                  Commit   Details  Required  Rerun command
ci/prow/e2e                75eda8f  link     true      /test e2e
ci/prow/breaking-changes   75eda8f  link     false     /test breaking-changes

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@petr-muller (Member)

One other thought that comes to mind is why is this happening so often specifically for /payload-triggered jobs. It is not uncommon for a PR to trigger multiple presubmits at the same time (running in the same namespace...) and these also run the same code path, but we do not seem to see this in standard presubmits as often as in /payload jobs. So something is different for these jobs, but not sure what it is.

@deepsm007 (Contributor, Author)

One other thought that comes to mind is why is this happening so often specifically for /payload-triggered jobs. It is not uncommon for a PR to trigger multiple presubmits at the same time (running in the same namespace...) and these also run the same code path, but we do not seem to see this in standard presubmits as often as in /payload jobs. So something is different for these jobs, but not sure what it is.

Any two ci-operator runs can land in the same ci-op-* namespace whenever resolveInputs ends up with the same inputHash, because the namespace is literally ci-op-{id} and {id} is that hash; see https://github.com/openshift/ci-tools/blob/main/cmd/ci-operator/main.go#L1420-L1430. The hash comes from step Inputs(), the marshaled config, optional --input-hash extras, and the binary stamp (https://github.com/openshift/ci-tools/blob/main/cmd/ci-operator/main.go#L1372-L1421), so I’m not saying normal presubmits can never share a namespace; if two unrelated jobs ever hashed the same inputs, they’d collide. In real life, most concurrent presubmits differ enough (--target, refs, graph) that the hashes usually differ.

What feels specific to PRPQR / /payload-style aggregates is that we deliberately wire many shard ProwJobs so they don’t split that hash: generateProwjob adds --target-additional-suffix when aggregateIndex is set and always adds CustomHashInput(prpqrName) (which becomes --input-hash=<prpqr name> on the pod). Then resolveInputs strips the suffix from test.As only for hashing, so every shard still maps to one namespace.

So IIUC the story isn’t “payload runs different release-import code”; it’s topology: aggregates systematically run N parallel ci-operators in one namespace on the same initial import path. Typical presubmit fan-out usually isn’t trying to line up N sharded copies of the same test in one namespace on purpose; if a namespace is shared elsewhere, that’s more of an accident (identical hash inputs) than this designed fan-in.


@petr-muller (Member) commented Mar 30, 2026

I’m not saying normal presubmits can never share a namespace; if two unrelated jobs ever hashed the same inputs, they’d collide. In real life, most concurrent presubmits differ enough (--target, refs, graph) that the hashes usually differ.

That's unfortunately an utter hallucination. Of course most concurrent presubmits (for the same code revision) share a namespace. That's how ci-operator was designed to work, and it is intentional. Click random presubmits on this PR, pretty much all of them will have this:

INFO[2026-03-26T21:12:18Z] Using namespace https://console-openshift-console.apps.build10.ci.devcluster.openshift.com/k8s/cluster/projects/ci-op-znc6ijvz 

That's how ci-operator avoids building everything separately for each job.

The hallucination is accidentally correct on the following:

Typical presubmit fan-out usually isn’t trying to line up N sharded copies of the same test in one namespace on purpose;

But it misses the main point. "Standard" presubmits will happily run different e2e jobs in the same namespace (like e2e-aws, e2e-gcp, e2e-aws-upgrade and so on). But pretty much all the workloads preceding the actual multi-step workflow (mostly builds, but also stuff like input and prerequisite image imports) will be shared between the jobs. And many of these jobs will have the "import CLI image from a release" step as well, and it will be shared in normal presubmits. The aggregate payload jobs technically run the same jobs X times, which normally is not possible, and that is why there are the suffixes: you cannot have X ci-operators each running exactly e2e-aws (they would race over that name), but you can have each of them run e2e-aws-0, e2e-aws-1 and so on. The actual payload underneath these names then does not matter. It happens to be identical workloads, but from the ci-operator PoV they can be anything.

But somehow, the latter mode (run X ci-operator copies doing the same thing with different names) runs into the race importing the CLI image, while the former (run X ci-operator copies running different things) apparently does not.

Actually, writing this I realized there is actually a difference between a typical presubmit and the payload job. A typical presubmit will not import the CLI image from a release payload image (like a CI or Nightly payload); it will use the cli image from an imagestream that was populated from the integration imagestream. Importing a full payload image is very likely a different ci-operator path, not seen that much in presubmits. So that's where I would look.

@deepsm007 deepsm007 closed this Mar 31, 2026