WIP: coordinate parallel import steps in shared namespace with lock/done ConfigMaps for aggregated runs (#5050)
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deepsm007.
Walkthrough: The change introduces distributed coordination using ConfigMap-based locking and done markers to prevent duplicate release imports. The run method checks for existing done markers, creates lock ConfigMaps, waits for completion markers, and refactors the import logic into a separate helper function.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Scheduling tests matching the
What's the rationale behind this change?
Force-pushed from ce4580e to 75eda8f (…ck/done ConfigMaps)
@coderabbitai review

✅ Actions performed: review triggered.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/steps/release/import_release.go`:
- Around line 151-159: The current branch only waits for doneName after seeing
kerrors.IsAlreadyExists on s.client.Create(ctx, lockCM), which can hang peers if
the lock holder exits without creating doneName; modify the logic in the
Create()/AlreadyExists handling (and the analogous block around lines 182-198)
to loop-retry Create: when Create returns AlreadyExists, poll the lock ConfigMap
via s.client.Get (or equivalent) with a short backoff until either the lock CM
is gone or DefaultImageImportTimeout elapses; if the lock is removed, retry
s.client.Create(ctx, lockCM); if doneName appears while waiting, return as
before; propagate errors appropriately and ensure waitForReleaseImportDone is
still used for the normal “wait for completion” case.
- Around line 161-166: The deferred lock cleanup currently reuses ctx (which may
be canceled by runImportBody()), causing the ConfigMap lock to be left behind;
change the defer to call s.client.Delete using a fresh context (e.g.
context.Background() or a short context.WithTimeout) instead of ctx when
deleting the ConfigMap (the block that constructs del with Namespace: ns, Name:
lockName and calls s.client.Delete). Keep the existing kerrors.IsNotFound check
and logging (logrus.WithError) but ensure the delete uses the newCtx so
cancellation of runImportBody() won't prevent lock cleanup.
📒 Files selected for processing (1): `pkg/steps/release/import_release.go`
```go
if err := s.client.Create(ctx, lockCM); err != nil {
	if !kerrors.IsAlreadyExists(err) {
		return fmt.Errorf("create release import lock: %w", err)
	}
	logrus.Infof("waiting for another ci-operator to finish release %s import in this namespace", s.name)
	if err := waitForReleaseImportDone(ctx, s.client, ns, doneName, pullSpec, utils.DefaultImageImportTimeout); err != nil {
		return err
	}
	return nil
```
Retry lock acquisition when the current importer exits early.
This path only waits for doneName. If the lock holder returns any error before creating the done ConfigMap, every peer that already saw AlreadyExists will sit there until DefaultImageImportTimeout even after the lock is gone. That turns one transient importer failure into a namespace-wide timeout. Re-check the lock while waiting and retry Create() when it disappears instead of waiting solely on doneName.
Also applies to: 182-198
```go
defer func() {
	del := &coreapi.ConfigMap{ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: lockName}}
	if err := s.client.Delete(ctx, del); err != nil && !kerrors.IsNotFound(err) {
		logrus.WithError(err).Warn("failed to delete release import lock configmap")
	}
}()
```
Use a fresh context for deferred lock cleanup.
The deferred delete reuses ctx, so a cancellation or deadline from runImportBody() will usually leave the lock ConfigMap behind. Subsequent jobs then keep hitting AlreadyExists and wait on a done marker that will never be written.
Suggested fix

```diff
 defer func() {
+	cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
 	del := &coreapi.ConfigMap{ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: lockName}}
-	if err := s.client.Delete(ctx, del); err != nil && !kerrors.IsNotFound(err) {
+	if err := s.client.Delete(cleanupCtx, del); err != nil && !kerrors.IsNotFound(err) {
 		logrus.WithError(err).Warn("failed to delete release import lock configmap")
 	}
 }()
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
defer func() {
	cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	del := &coreapi.ConfigMap{ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: lockName}}
	if err := s.client.Delete(cleanupCtx, del); err != nil && !kerrors.IsNotFound(err) {
		logrus.WithError(err).Warn("failed to delete release import lock configmap")
	}
}()
```
This sounds more like conjecture than evidence. There's no reason why these operations should be inherently unsafe to do in parallel. Locking is hard, and in the kube model there are not too many use cases where locks would be necessary (and if so, it should not be a ConfigMap; if anything, it should be a Lease, https://kubernetes.io/docs/concepts/architecture/leases/#custom-workload). But in general I do not see a real argument for why this should need a lock.
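For reference, the Lease object this comment points to lives in `coordination.k8s.io/v1`. A minimal manifest (the names and namespace here are made up for illustration) looks like the following; unlike a ConfigMap lock, a Lease carries an expiry, so if the holder dies and stops renewing, other workers can take it over instead of waiting on a stale lock:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: release-import-initial   # hypothetical name
  namespace: ci-op-example       # hypothetical shared test namespace
spec:
  holderIdentity: ci-operator-pod-a   # which run currently holds the lease
  leaseDurationSeconds: 30            # holder must renew within this window
  renewTime: "2024-01-01T00:00:00.000000Z"
```

In practice, Go clients usually manage this via client-go's `leaderelection` package with a `resourcelock.LeaseLock` rather than writing the object by hand.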
Scheduling tests matching the
@deepsm007: The following tests failed, say

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
One other thought that comes to mind is why is this happening so often specifically for
Any two ci-operator runs can land in the same

What feels specific to PRPQR /

So IIUC the story isn't "payload runs different release-import code", it's topology: aggregates systematically run N parallel ci-operators in one namespace on the same initial import path. Typical presubmit fan-out usually isn't trying to line up N sharded copies of the same test in one namespace on purpose; if a namespace is shared elsewhere, that's more of an accident (identical hash inputs) than this designed fan-in.

In short
That's unfortunately an utter hallucination. Of course most concurrent presubmits (for the same code revision) share a namespace. That's how ci-operator was designed to work, and it is intentional. Click random presubmits on this PR; pretty much all of them will have this. That's how ci-operator avoids building everything separately for each job.

The hallucination is accidentally correct on the following:

But it misses the main point. "Standard" presubmits will happily run different e2e jobs in the same namespace (like

But somehow, the latter mode (run X ci-operator copies doing the same thing with different names) runs into the race importing the CLI image, while the former (run X ci-operator copies running different things) apparently does not.

Actually, writing this I realized there is a difference between a typical presubmit and the payload job. A typical presubmit will not import the CLI image from a release payload image (like a CI or Nightly payload); it will use the
What I found while debugging
I was chasing failures in aggregated payload runs where `release:initial` would fail with:

`failed to get CLI image: unable to wait for the 'cli' image in the stable stream to populate`

Logs usually showed a multi-minute wait before the failure. What stood out was that only some parallel ci-operator jobs failed, not all of them, which didn't look like "the whole cluster is slow." It looked like several ci-operators in the same shared `ci-op-*` namespace stepping on each other.

When I traced it, the `import_release` path isn't safe to run fully in parallel against the same namespace and same initial pullspec: all jobs touch the same `release:<name>` import, the same `stable-<name>:cli` expectation, and the same extract pod / ConfigMap / ImageStream merge work. Races there line up with intermittent timeouts waiting for `cli` on `stable-initial`, even though we're not trying to use different stream names; we need `stable-initial` to stay canonical.

So the parallel ci-operators treat the same import as independent work when it's really one logical import that should complete once and be reused.
Constraints I had to respect
`stable-initial-*`); the names need to stay standard.
I added coordination in-namespace: lock and done markers (ConfigMaps) keyed by release name and a short hash of the pullspec, so one run does the import/extract/merge and the others wait and skip once that work is finished, without changing `stable-initial` or the shared-namespace model.

/cc @openshift/test-platform
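A deterministic naming scheme like the one described (release name plus a short hash of the pullspec) could look like the sketch below. The `release-import-lock-`/`release-import-done-` prefixes and the helper name are made up for illustration; they are not the PR's actual names. Hashing the pullspec matters because pullspecs contain `@`, `:`, and `/`, which are not legal in ConfigMap names.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// releaseImportMarkerNames derives lock and done ConfigMap names from the
// release name and a short hash of the pullspec, so every ci-operator in the
// namespace computes the same pair for the same logical import. The naming
// scheme here is a guess for illustration, not the PR's implementation.
func releaseImportMarkerNames(release, pullSpec string) (lock, done string) {
	h := sha256.Sum256([]byte(pullSpec))
	short := fmt.Sprintf("%x", h)[:8] // short hash keeps names within DNS-1123 limits
	return fmt.Sprintf("release-import-lock-%s-%s", release, short),
		fmt.Sprintf("release-import-done-%s-%s", release, short)
}

func main() {
	lock, done := releaseImportMarkerNames("initial", "registry.example/ocp/release@sha256:abc")
	fmt.Println(lock) // release-import-lock-initial-<short hash>
	fmt.Println(done) // release-import-done-initial-<short hash>
}
```

Because the names are pure functions of (release, pullspec), two runs importing different payloads never contend, while N sharded copies of the same aggregated job all converge on one lock/done pair.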