
OTA-1935: pkg: Use risk.Source framework to feed Upgradeable#1368

Open
wking wants to merge 4 commits into openshift:main from wking:upgradeable-as-update-risk

Conversation

@wking
Member

@wking wking commented Apr 6, 2026

Building on #1367, this simplifies the complex cvo package by shifting some logic into small, narrowly scoped helper packages. That makes for easier unit testing of that logic, and less complication when reading the remaining cvo package code.

Using #1367's risk.Source framework with the new aggregate implementation allows Upgradeable risks to feed into the conditionalUpdateRisks framework. This should help reduce confusion among users who see Upgradeable=False and think it applies to all updates, when in reality it applies to major and minor bumps but not to patch updates (especially since 6f8f984, #1314). It will also allow structured access to individual risks reported via this pipeline, where Upgradeable previously offered only an aggregated message string consumable by humans.

I'm also removing internal.UpgradeableAdminAckRequired and similar from ClusterVersion status.conditions. While these were visible by end users, they were mostly a mechanism for passing data from one part of the CVO to another. There are no known consumers outside of the CVO.

With the refactor providing clearer abstraction layers, I could drop the entire upgradeableCheckIntervals structure. Major touches in the history of that structure were bd05174 (#808) and cc94c95 (#882). The new approach is event-driven via the informer handlers for almost all of the new risk.Source implementations. The one source that still relies on polling is the deletion source, because it watches a CVO-side global variable rather than a Kube-side informer. It will still sync whenever ClusterVersion status is updated, though, which should be often enough for this usually-transient situation. If folks want lower latency, follow-up work could add informer-style handler callbacks to the deletion global.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 6, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 6, 2026

@wking: This pull request references OTA-1935 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

(the PR description quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Replaces ad-hoc Prometheus-alert handling with a pluggable risk.Source framework, adds multiple concrete risk sources and an aggregator, removes the legacy upgradeable subsystem and workqueue, and integrates aggregated risks into available-updates evaluation and ClusterVersion status computation.

Changes

  • Core operator plumbing (pkg/cvo/cvo.go, pkg/cvo/cvo_test.go): Removed AlertGetter and the dedicated upgradeable workqueue/logic; added an upgradeable risk.Source and a top-level risks risk.Source; wired aggregated risk sources into operator initialization and updated tests to use mock risk sources.
  • Available-updates logic & tests (pkg/cvo/availableupdates.go, pkg/cvo/availableupdates_test.go): Removed alert-specific risk evaluation; added a risks risk.Source to availableUpdates, used risk.Merge to attach risks/version mappings, and introduced upstream vs. evaluated cache fields and deep-copy helpers; tests refactored to inject risk.Source mocks.
  • Status computation (pkg/cvo/status.go, pkg/cvo/status_test.go): Threaded context and the upgradeable risk.Source into status computation; added setUpgradeableCondition to derive OperatorUpgradeable from aggregated risks and remove obsolete subconditions; tests updated for signature changes.
  • Removed upgradeable subsystem (pkg/cvo/upgradeable.go, pkg/cvo/upgradeable_test.go): Deleted the prior upgradeable implementation and its tests (throttled checks, admin-gate/ack handling, ClusterOperator aggregation, and related informers).
  • Risk framework core (pkg/risk/risk.go, pkg/risk/risk_test.go): Added the risk.Source interface and risk.Merge to validate/deduplicate risks, attach risks to releases/conditional updates, and report merged errors; tests validate merge behavior and idempotency.
  • Aggregate risk source (pkg/risk/aggregate/aggregate.go, pkg/risk/aggregate/aggregate_test.go): New composite source that merges multiple risk.Source instances, combines risks/version maps, and accumulates per-source errors; tests cover aggregation and duplicate handling.
  • Alert-based risk source (pkg/risk/alert/alert.go, pkg/risk/alert/alert_test.go): New PromQL-backed alert source converting Prometheus alerts into ConditionalUpdateRisk objects (summary/description/URL, de-duplication, LastTransitionTime); tests exercise conversion logic.
  • Admin-ack risk source (pkg/risk/adminack/adminack.go, pkg/risk/adminack/adminack_test.go): New source reading admin-gates/admin-acks ConfigMaps, validating gate names/applicability, and emitting conditional risks when acks are missing or invalid; includes informer-driven tests.
  • Overrides risk source (pkg/risk/overrides/overrides.go, pkg/risk/overrides/overrides_test.go): New source detecting spec.overrides with Unmanaged=true on a ClusterVersion and emitting conditional risks restricted to allowed major/minor updates; includes informer change-callback tests.
  • Deletion-in-progress risk source (pkg/risk/deletion/deletion.go): New source reporting resource deletions-in-progress as blocking risks, with version mapping and documentation-URL behavior.
  • Updating-in-progress risk source (pkg/risk/updating/updating.go, pkg/risk/updating/updating_test.go): New source that emits a risk when ClusterVersion OperatorProgressing is true and maps applicable versions; includes informer-driven tests.
  • ClusterOperator-based upgradeable source (pkg/risk/upgradeable/upgradeable.go, pkg/risk/upgradeable/upgradeable_test.go, pkg/risk/upgradeable/util.go): New source deriving risks from ClusterOperator OperatorUpgradeable conditions, plus the utility MajorAndMinorUpdates(...); includes informer-driven tests.
  • Test support: mock risk source (pkg/risk/mock/mock.go): Added a Mock risk.Source for tests with configurable risks, version maps, and error injection.
  • availableupdates deep-copy helpers (pkg/cvo/availableupdates.go): Introduced deepCopyReleases / deepCopyConditionalUpdates and split the cache into upstream vs. public evaluated fields.
  • Tests cleanup (multiple *_test.go updates): Removed alert-specific unit tests and upgradeable tests; updated many tests to use the new risk abstractions and context usage.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 6, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

🧹 Nitpick comments (1)
pkg/risk/adminack/adminack_test.go (1)

436-445: Assert both the error expectation and the boolean result.

Lines 437-445 only catch unexpected errors and false positives. A broken implementation returning (false, nil) for a matching gate—or (false, nil) when wantErr is true—would still pass these cases.

Suggested fix
 				isApplicable, err := gateApplicableToCurrentVersion(tt.gateName, version)
-				if err != nil && !tt.wantErr {
-					t.Fatalf("gateApplicableToCurrentVersion() unexpected error: %v", err)
-				}
-				if err != nil {
-					return
-				}
-				if isApplicable && !tt.expectedResult {
-					t.Fatalf("gateApplicableToCurrentVersion() %s should not apply", tt.gateName)
+				if (err != nil) != tt.wantErr {
+					t.Fatalf("gateApplicableToCurrentVersion() error = %v, wantErr %t", err, tt.wantErr)
+				}
+				if err == nil && isApplicable != tt.expectedResult {
+					t.Fatalf("gateApplicableToCurrentVersion() = %t, want %t", isApplicable, tt.expectedResult)
 				}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/adminack/adminack_test.go` around lines 436 - 445, The test only
checks unexpected errors and false positives; update the assertions around
gateApplicableToCurrentVersion so you verify both the error expectation and the
boolean result: after calling gateApplicableToCurrentVersion, assert that (err
!= nil) matches tt.wantErr (use t.Fatalf if they differ), if err != nil return,
then assert that isApplicable equals tt.expectedResult (use t.Fatalf with
context if they differ); reference the variables gateApplicableToCurrentVersion,
tt.wantErr, tt.expectedResult and isApplicable to locate and change the test
logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 164-166: The risk-source callback flow can be skipped by the
recent-fetch fast path causing stale risk data; modify the fast-path logic in
syncAvailableUpdates/riskSourceCallback so that a callback-triggered change
forces evaluation regardless of recent upstream fetch. Add a dirty bit or
timestamp on optrAvailableUpdates (e.g., a new field like risksDirtyAt or
lastRiskCallbackAt) that riskSourceCallback sets, then update the early-return
condition in syncAvailableUpdates to bypass the recent-fetch return when that
dirty bit/timestamp is newer than the last fetch; ensure you still clear/update
optrAvailableUpdates.risks and ShouldReconcileAcceptRisks and call
optrAvailableUpdates.evaluateConditionalUpdates(ctx) after bypassing the fast
path so OperatorUpgradeable and alerts are refreshed immediately.

In `@pkg/cvo/cvo.go`:
- Around line 305-317: The admin-ack informers (cmConfigManagedInformer and
cmConfigInformer) used by adminack.New are not being waited on, causing races;
update the operator's informer sync list so Run() waits for those informers
before enabling admin-ack: add cmConfigManagedInformer.HasSynced and
cmConfigInformer.HasSynced (or their informer sync funcs) into optr.cacheSynced
(the slice checked by Run()), so that optr.Run()/cache sync logic only enables
adminack.New after those caches are synced and avoids AdminAckUnknown at
startup.

In `@pkg/cvo/status.go`:
- Around line 620-629: The code currently ignores the per-version applicability
map returned by upgradeable.Risks and only filters explicit Applies=False
subconditions, which causes global risks (e.g., adminack) to wrongly mark
OperatorUpgradeable=false for patch-only clusters; restore and use the
applicability map from upgradeable.Risks (capture the second return value
instead of discarding it), then when iterating over risks skip any risk that the
applicability map shows as not applicable to the requested updateVersions (in
addition to the existing check of
internal.ConditionalUpdateRiskConditionTypeApplies on the risk's Conditions);
reference upgradeable.Risks, updateVersions, risks, risk.Source and
internal.ConditionalUpdateRiskConditionTypeApplies to locate and implement the
change.
- Around line 652-666: The code sets Condition.Reason from risks[0].Name which
can contain invalid chars; instead, do not use raw risk.Name — assign a
validated reason constant (e.g., introduce/use a constant like
recommendedReasonSingle, similar to recommendedReasonMultiple) for the
single-risk branch and use that when calling
resourcemerge.SetOperatorStatusCondition on cvStatus.Conditions (Type:
configv1.OperatorUpgradeable); if you need to preserve operator identity, map
risk.Name to a validated token or log the raw name elsewhere but do not publish
it as the Condition.Reason.

In `@pkg/risk/adminack/adminack.go`:
- Around line 195-199: The returned ConditionalUpdateRisk currently sets URL to
a placeholder ("https://example.com/FIXME-look-for-a-URI-in-the-message");
update the code that constructs and returns the configv1.ConditionalUpdateRisk
(the struct built alongside riskName and gateValue in adminack.go) to either set
URL to the real canonical admin-ack documentation link or to an empty string
until a proper docs URL exists so users are not directed to the placeholder;
ensure the change is applied to the ConditionalUpdateRisk.URL field where the
struct is created.
- Around line 5-10: The field a.lastSeen is read and written concurrently by
Risks() and eventHandler() (and workers/informers), causing a data race; protect
it by adding a dedicated mutex (e.g., lastSeenMu sync.RWMutex) on the adminack
struct and use Write locks around all mutations in Risks() and any place that
updates a.lastSeen, and Read locks around all diffs/reads in eventHandler() and
the status/available-updates/informer call sites; ensure you reference
a.lastSeen and the new lastSeenMu in those methods (Risks, eventHandler, and the
updater/informer routines) so every access is guarded.

In `@pkg/risk/deletion/deletion.go`:
- Around line 48-74: The code currently drops deletion blocker risks by
returning nils when deletes are in progress; change the early return in the
block that checks len(risks) to return the found risks instead of nils (i.e.,
return risks, nil, err) so the deletion risk is propagated; update the return at
the len(risks) > 0 branch (around the resourcedelete.DeletesInProgress() check)
from return nil, nil, nil to return risks, nil, err.

In `@pkg/risk/overrides/overrides_test.go`:
- Around line 35-39: The test's changeChannel is unbuffered so early invocations
of changeCallback can block the informer and flake the test; change the channel
allocation (changeChannel) to a buffered channel (e.g., make(chan struct{}, 1))
and keep the changeCallback semantics (sending a struct{} into changeChannel) so
startup notifications can be delivered without blocking; apply the same buffered
pattern to any other test notification channels used in the same file (the ones
drained in the startup notification drain logic) to avoid wedging.

In `@pkg/risk/overrides/overrides.go`:
- Around line 23-27: The pointer field sawOverrides on struct overrides is
written on the sync path and read from informer callback goroutines, causing a
data race; protect accesses by replacing sawOverrides with a concurrency-safe
primitive (e.g. use an atomic.Bool or a bool guarded by a sync.Mutex) and update
both the sync-path writer and the informer callback readers to use the chosen
atomic/locked accessors; update all locations that set or test
overrides.sawOverrides so they call the new atomic/locked setter/getter to
ensure no concurrent reads/writes occur.
- Around line 58-77: The code currently unconditionally parses
cv.Status.Desired.Version and errors if it's empty; change it to always
construct and return the ConditionalUpdateRisk (o.name, Message, MatchingRules)
but only attempt semver.Parse and compute majorAndMinorUpdates/versionMap when
cv.Status.Desired.Version is non-empty and parse succeeds; if the version string
is empty, skip parsing, set versionMap to nil (or empty) and use a non-versioned
fallback URL for risk.URL (or omit version formatting) so the override risk is
returned during initial-sync without failing on missing Status.Desired.Version.

In `@pkg/risk/updating/updating_test.go`:
- Around line 36-40: The test's changeChannel is unbuffered causing
changeCallback's send to block and potentially stall informer startup; make
changeChannel buffered (e.g., buffer size 1) so early notifications from
changeCallback won't block the sender and the drain loop that reads from
changeChannel in the test can consume them without blocking startup.

In `@pkg/risk/updating/updating.go`:
- Around line 24-29: The struct updating's wasUpdating pointer is accessed from
multiple goroutines (Risks and the informer callback) causing a race; add a
sync.Mutex (or sync.RWMutex) field to the updating struct and import sync, then
wrap every access and mutation of wasUpdating (reads in the informer callback
and writes in Risks and any other goroutines) with the mutex (Lock/Unlock or
RLock/RUnlock) to ensure atomic, synchronized reads/writes and eliminate the
data race.
- Around line 48-57: The Risks method is incorrectly treating any error from
isUpdating() (and thus from the underlying cvLister.Get) as a benign "not
updating" result; update Risks to only convert meta.IsNoMatchError or
apierrors.IsNotFound to a nil/no-risk return but propagate any other error from
isUpdating() back to the caller (i.e., return nil, nil, err) so lister/cache
failures are not swallowed; apply the same change where isUpdating() is checked
later in this file (the other branch around the same isUpdating() call) to
ensure errors are returned rather than silently clearing the update blocker.

In `@pkg/risk/upgradeable/upgradeable.go`:
- Around line 25-30: The upgradeable struct's slice field lastSeen is being
written in Risks and read in the informer callback without synchronization,
causing a data race; add a sync.Mutex (or sync.RWMutex) to upgradeable, import
sync, protect all accesses to lastSeen in the Risks method and the informer
callback (lock before mutating or reading), and when publishing a slice to
readers make a safe copy (or return a copy) so comparisons use a stable
snapshot; reference the upgradeable type, its lastSeen field, and the Risks
method/informer callback when applying the locks and copying.

---

Nitpick comments:
In `@pkg/risk/adminack/adminack_test.go`:
- Around line 436-445: The test only checks unexpected errors and false
positives; update the assertions around gateApplicableToCurrentVersion so you
verify both the error expectation and the boolean result: after calling
gateApplicableToCurrentVersion, assert that (err != nil) matches tt.wantErr (use
t.Fatalf if they differ), if err != nil return, then assert that isApplicable
equals tt.expectedResult (use t.Fatalf with context if they differ); reference
the variables gateApplicableToCurrentVersion, tt.wantErr, tt.expectedResult and
isApplicable to locate and change the test logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e7bc9e7-cddd-404b-a50d-7806e8b0f3e8

📥 Commits

Reviewing files that changed from the base of the PR and between 26e3765 and c0e4fd2.

📒 Files selected for processing (26)
  • pkg/cvo/availableupdates.go
  • pkg/cvo/availableupdates_test.go
  • pkg/cvo/cvo.go
  • pkg/cvo/cvo_test.go
  • pkg/cvo/status.go
  • pkg/cvo/status_test.go
  • pkg/cvo/upgradeable.go
  • pkg/cvo/upgradeable_test.go
  • pkg/risk/adminack/adminack.go
  • pkg/risk/adminack/adminack_test.go
  • pkg/risk/aggregate/aggregate.go
  • pkg/risk/aggregate/aggregate_test.go
  • pkg/risk/alert/alert.go
  • pkg/risk/alert/alert_test.go
  • pkg/risk/alert/testdata/alerts.json
  • pkg/risk/deletion/deletion.go
  • pkg/risk/mock/mock.go
  • pkg/risk/overrides/overrides.go
  • pkg/risk/overrides/overrides_test.go
  • pkg/risk/risk.go
  • pkg/risk/risk_test.go
  • pkg/risk/updating/updating.go
  • pkg/risk/updating/updating_test.go
  • pkg/risk/upgradeable/upgradeable.go
  • pkg/risk/upgradeable/upgradeable_test.go
  • pkg/risk/upgradeable/util.go
💤 Files with no reviewable changes (2)
  • pkg/cvo/upgradeable.go
  • pkg/cvo/upgradeable_test.go

Comment on lines +5 to +10
import (
"context"
"fmt"
"regexp"
"slices"
"strings"


⚠️ Potential issue | 🔴 Critical

Protect lastSeen from concurrent reads and writes.

Risks() mutates a.lastSeen, while eventHandler() diffs against it, and this source is invoked from the status worker, the available-updates worker, and informer goroutines. That is a real data race and can lead to missed or duplicated change notifications.

Suggested fix
 import (
 	"context"
 	"fmt"
 	"regexp"
 	"slices"
 	"strings"
+	"sync"
@@
 type adminAck struct {
+	mu               sync.Mutex
 	name             string
 	currentVersion   func() configv1.Release
 	adminGatesLister listerscorev1.ConfigMapNamespaceLister
 	adminAcksLister  listerscorev1.ConfigMapNamespaceLister
 	lastSeen         []configv1.ConditionalUpdateRisk
 }
@@
 	risks, version, err := a.risks()
 	if meta.IsNoMatchError(err) {
-		a.lastSeen = nil
+		a.mu.Lock()
+		a.lastSeen = nil
+		a.mu.Unlock()
 		return nil, nil, nil
 	}
@@
 	if len(risks) == 0 {
-		a.lastSeen = nil
+		a.mu.Lock()
+		a.lastSeen = nil
+		a.mu.Unlock()
 		return nil, nil, err
 	}
 
-	a.lastSeen = risks
+	a.mu.Lock()
+	a.lastSeen = append([]configv1.ConditionalUpdateRisk(nil), risks...)
+	a.mu.Unlock()
@@
 func (a *adminAck) eventHandler(changeCallback func()) {
 	risks, _, err := a.risks()
+	a.mu.Lock()
+	lastSeen := append([]configv1.ConditionalUpdateRisk(nil), a.lastSeen...)
+	a.mu.Unlock()
 	if err != nil {
 		if changeCallback != nil {
 			changeCallback()
@@
-	if diff := cmp.Diff(a.lastSeen, risks); diff != "" {
+	if diff := cmp.Diff(lastSeen, risks); diff != "" {

Also applies to: 35-40, 67-90, 137-149

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/adminack/adminack.go` around lines 5 - 10, The field a.lastSeen is
read and written concurrently by Risks() and eventHandler() (and
workers/informers), causing a data race; protect it by adding a dedicated mutex
(e.g., lastSeenMu sync.RWMutex) on the adminack struct and use Write locks
around all mutations in Risks() and any place that updates a.lastSeen, and Read
locks around all diffs/reads in eventHandler() and the
status/available-updates/informer call sites; ensure you reference a.lastSeen
and the new lastSeenMu in those methods (Risks, eventHandler, and the
updater/informer routines) so every access is guarded.

Comment on lines +36 to +40
changeChannel := make(chan struct{})
changeCallback := func() {
t.Logf("%s sending change notification", time.Now())
changeChannel <- struct{}{}
}


⚠️ Potential issue | 🟠 Major

Buffer changeChannel so early notifications can't block startup.

The drain loop on Lines 56-66 shows early callbacks are expected. With the unbuffered send on Line 39, that first callback blocks until the test starts receiving, which can stall informer progress and make this test hang or flake.

Suggested fix
-	changeChannel := make(chan struct{})
+	changeChannel := make(chan struct{}, 1)
 	changeCallback := func() {
 		t.Logf("%s sending change notification", time.Now())
-		changeChannel <- struct{}{}
+		select {
+		case changeChannel <- struct{}{}:
+		default:
+		}
 	}

Also applies to: 56-66

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/updating/updating_test.go` around lines 36 - 40, The test's
changeChannel is unbuffered causing changeCallback's send to block and
potentially stall informer startup; make changeChannel buffered (e.g., buffer
size 1) so early notifications from changeCallback won't block the sender and
the drain loop that reads from changeChannel in the test can consume them
without blocking startup.

Comment on lines +24 to +29
type updating struct {
name string
cvName string
cvLister configlistersv1.ClusterVersionLister
wasUpdating *bool
}


⚠️ Potential issue | 🔴 Critical

Guard wasUpdating with synchronization.

Line 54 writes u.wasUpdating in Risks, while Lines 93-104 read it from the informer callback. Those paths run on different goroutines, so this is a data race and can produce missed or duplicate notifications.

Suggested fix
 type updating struct {
 	name        string
 	cvName      string
 	cvLister    configlistersv1.ClusterVersionLister
+	mu          sync.RWMutex
 	wasUpdating *bool
 }
@@
-	u.wasUpdating = &updating
+	u.mu.Lock()
+	u.wasUpdating = &updating
+	u.mu.Unlock()
@@
-	if u.wasUpdating == nil {
+	u.mu.RLock()
+	wasUpdating := u.wasUpdating
+	u.mu.RUnlock()
+	if wasUpdating == nil {
 		if changeCallback != nil {
 			changeCallback()
 		}
 		return
 	}
@@
-	if updating != *u.wasUpdating {
-		klog.V(2).Infof("ClusterVersion updating changed from %t to %t, triggering override callback", *u.wasUpdating, updating)
+	if updating != *wasUpdating {
+		klog.V(2).Infof("ClusterVersion updating changed from %t to %t, triggering override callback", *wasUpdating, updating)
 		if changeCallback != nil {
 			changeCallback()
 		}
 	}

Also add sync to the imports.

Also applies to: 54-55, 92-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/updating/updating.go` around lines 24 - 29, The struct updating's
wasUpdating pointer is accessed from multiple goroutines (Risks and the informer
callback) causing a race; add a sync.Mutex (or sync.RWMutex) field to the
updating struct and import sync, then wrap every access and mutation of
wasUpdating (reads in the informer callback and writes in Risks and any other
goroutines) with the mutex (Lock/Unlock or RLock/RUnlock) to ensure atomic,
synchronized reads/writes and eliminate the data race.

Comment on lines +25 to +30
type upgradeable struct {
name string
currentVersion func() configv1.Release
coLister configlistersv1.ClusterOperatorLister
lastSeen []configv1.ConditionalUpdateRisk
}


⚠️ Potential issue | 🔴 Critical

Synchronize lastSeen before diffing informer updates.

Lines 52, 57, and 61 write u.lastSeen in Risks, while Line 122 reads it from the informer callback. Sharing that slice header without a lock is a race and can make change detection compare torn state.

Suggested fix
 type upgradeable struct {
 	name           string
 	currentVersion func() configv1.Release
 	coLister       configlistersv1.ClusterOperatorLister
+	mu             sync.RWMutex
 	lastSeen       []configv1.ConditionalUpdateRisk
 }
@@
-		u.lastSeen = nil
+		u.mu.Lock()
+		u.lastSeen = nil
+		u.mu.Unlock()
 		return nil, nil, nil
 	}
 
 	if len(risks) == 0 {
-		u.lastSeen = nil
+		u.mu.Lock()
+		u.lastSeen = nil
+		u.mu.Unlock()
 		return nil, nil, err
 	}
 
-	u.lastSeen = risks
+	u.mu.Lock()
+	u.lastSeen = append([]configv1.ConditionalUpdateRisk(nil), risks...)
+	u.mu.Unlock()
@@
-	if diff := cmp.Diff(u.lastSeen, risks); diff != "" {
+	u.mu.RLock()
+	lastSeen := append([]configv1.ConditionalUpdateRisk(nil), u.lastSeen...)
+	u.mu.RUnlock()
+	if diff := cmp.Diff(lastSeen, risks); diff != "" {
 		klog.V(2).Infof("ClusterOperator Upgradeable changed (-old +new):\n%s", diff)
 		if changeCallback != nil {
 			changeCallback()

Also add sync to the imports.

Also applies to: 49-61, 114-127

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/upgradeable/upgradeable.go` around lines 25 - 30, The upgradeable
struct's slice field lastSeen is being written in Risks and read in the informer
callback without synchronization, causing a data race; add a sync.Mutex (or
sync.RWMutex) to upgradeable, import sync, protect all accesses to lastSeen in
the Risks method and the informer callback (lock before mutating or reading),
and when publishing a slice to readers make a safe copy (or return a copy) so
comparisons use a stable snapshot; reference the upgradeable type, its lastSeen
field, and the Risks method/informer callback when applying the locks and
copying.

@wking wking force-pushed the upgradeable-as-update-risk branch from c0e4fd2 to cba244e on April 6, 2026 at 17:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (5)
pkg/risk/overrides/overrides_test.go (1)

36-40: ⚠️ Potential issue | 🟠 Major

Unbuffered channel can cause test flakiness.

The unbuffered changeChannel with blocking send can wedge the test if informer callbacks fire before the test reaches a receive point. Buffer the channel or use a non-blocking send pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/overrides/overrides_test.go` around lines 36 - 40, The test uses an
unbuffered changeChannel and a blocking send in changeCallback which can hang if
the informer fires before the test is ready; make the channel buffered (e.g.,
changeChannel := make(chan struct{}, 1) ) or change changeCallback to perform a
non-blocking send (use a select with a default case) so the send won't block the
informer callback; update references to changeChannel and changeCallback in
overrides_test.go accordingly.
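The buffered-channel-with-non-blocking-send shape the bot recommends can be sketched standalone (the names `changeChannel` and `changeCallback` mirror the test under review; everything else here is illustrative):

```go
package main

import "fmt"

func main() {
	// Buffered channel of size 1: one pending notification can be parked
	// even when the receiver is not ready yet.
	changeChannel := make(chan struct{}, 1)
	changeCallback := func() {
		select {
		case changeChannel <- struct{}{}: // delivered or parked
		default:
			// A notification is already pending; dropping this one is fine,
			// since the receiver will refresh once and see the latest state.
		}
	}

	// Fire the callback several times before anyone receives; with a
	// blocking send on an unbuffered channel this would deadlock.
	for i := 0; i < 3; i++ {
		changeCallback()
	}

	<-changeChannel // exactly one notification was parked
	fmt.Println("received change notification; pending:", len(changeChannel))
}
```

The design tradeoff: coalescing repeated notifications into one is acceptable here because the callback only signals "something changed", not what changed.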
pkg/cvo/status.go (2)

652-653: ⚠️ Potential issue | 🟠 Major

Raw risk name may violate Kubernetes condition-reason pattern.

Risk names can contain hyphens (e.g., ClusterOperatorUpgradeable-test-operator), which violates the K8s reason pattern ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$. The multi-risk case correctly uses a validated constant; apply similar validation for single-risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cvo/status.go` around lines 652 - 653, The single-risk branch assigns
reason := risks[0].Name which can contain hyphens and violate the Kubernetes
condition-reason regex; change the code that sets reason and message for the
single-risk case to use the same validated/sanitized constant or validation
logic used in the multi-risk path (sanitize risks[0].Name to match
^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$ or replace it with the existing
validated reason constant), ensuring the reason starts with a letter and
contains only allowed characters while keeping message := risks[0].Message as
before.
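A minimal sketch of the validate-or-fall-back pattern the comment asks for. The regex is the Kubernetes condition-reason format quoted above; the function name and the `ExposedToRisks` fallback constant are assumptions for illustration, not necessarily what `pkg/cvo/status.go` uses:

```go
package main

import (
	"fmt"
	"regexp"
)

// reasonPattern is the Kubernetes condition-reason format cited in the review.
var reasonPattern = regexp.MustCompile(`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`)

// fallbackReason is a hypothetical validated constant for the single-risk path.
const fallbackReason = "ExposedToRisks"

// conditionReason returns the risk name when it is a valid reason, and the
// validated fallback otherwise (e.g. for hyphenated operator-derived names).
func conditionReason(riskName string) string {
	if reasonPattern.MatchString(riskName) {
		return riskName
	}
	return fallbackReason
}

func main() {
	fmt.Println(conditionReason("AdminAckRequired"))                         // valid: passes through
	fmt.Println(conditionReason("ClusterOperatorUpgradeable-test-operator")) // hyphens: falls back
}
```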

620-629: ⚠️ Potential issue | 🟠 Major

Per-version applicability map is discarded.

The version map returned by upgradeable.Risks() is dropped, so risks that only apply to major/minor updates (e.g., from adminack.go or overrides.go) will still set Upgradeable=False even when all available updates are patch-only.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cvo/status.go` around lines 620 - 629, The call to upgradeable.Risks(ctx,
updateVersions) returns a per-version applicability map that is currently
discarded; capture that second return value (e.g., perVersionMap or
applicableByVersion) and use it when filtering risks so you only exclude risks
that truly don't apply to the provided updateVersions. Specifically, change the
assignment to capture the map from upgradeable.Risks, then in the loop that
currently checks meta.FindStatusCondition(risks[i].Conditions,
internal.ConditionalUpdateRiskConditionTypeApplies) also consult the captured
map for risks[i] to skip risks that are not applicable to the requested
updateVersions (so patch-only updates won’t be blocked by major/minor-only
risks).
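The filtering the comment describes can be sketched with stand-in types. The shape of the applicability map (risk name to the set of versions it applies to) is an assumption about what `upgradeable.Risks` returns; unscoped risks are conservatively treated as applying everywhere:

```go
package main

import "fmt"

// risk is a stand-in for configv1.ConditionalUpdateRisk; only Name matters here.
type risk struct{ Name string }

// filterApplicable keeps only the risks whose per-version applicability set
// intersects the available update versions. A risk with no entry in the map
// is treated as applying to every version, the conservative default.
func filterApplicable(risks []risk, appliesTo map[string]map[string]bool, updateVersions []string) []risk {
	var out []risk
	for _, r := range risks {
		versions, scoped := appliesTo[r.Name]
		if !scoped {
			out = append(out, r) // unscoped risk: assume it applies
			continue
		}
		for _, v := range updateVersions {
			if versions[v] {
				out = append(out, r)
				break
			}
		}
	}
	return out
}

func main() {
	risks := []risk{{Name: "AdminAckRequired"}, {Name: "ClusterVersionOverridesSet"}}
	appliesTo := map[string]map[string]bool{
		// AdminAckRequired only blocks the minor bump, not the patch.
		"AdminAckRequired": {"4.20.0": true},
	}
	// Patch-only update set: AdminAckRequired drops out.
	fmt.Println(filterApplicable(risks, appliesTo, []string{"4.19.9"}))
}
```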
pkg/risk/overrides/overrides.go (2)

59-62: ⚠️ Potential issue | 🟠 Major

Hard-fail on empty Status.Desired.Version prevents surfacing override risk during initial sync.

When cv.Status.Desired.Version is empty (common during initial sync), this returns an error instead of still returning the override risk. The risk should be surfaced with a fallback URL or without version-specific filtering.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/overrides/overrides.go` around lines 59 - 62, The code currently
calls semver.Parse on cv.Status.Desired.Version and returns an error if parsing
fails, which hard-fails when the version string is empty; change the logic in
the function containing this code so that if cv.Status.Desired.Version == "" you
do not call semver.Parse and instead proceed to compute and return the override
risk (using a non-version-specific path or fallback URL), but keep the existing
fmt.Errorf return for genuine parse errors when the version is non-empty;
reference cv.Status.Desired.Version and semver.Parse in your change and ensure
downstream filtering logic that used the parsed version can handle a nil/absent
version case.

24-29: ⚠️ Potential issue | 🟠 Major

Data race on sawOverrides field remains unaddressed.

The sawOverrides pointer is written at line 54 during Risks() and read at lines 97/107 from informer callbacks without synchronization. This can cause missed or duplicate refresh triggers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/overrides/overrides.go` around lines 24 - 29, The sawOverrides
pointer on the overrides struct is accessed concurrently (written in Risks() and
read in informer callbacks) causing a race; make the flag concurrency-safe by
either replacing sawOverrides *bool with an atomic boolean (e.g., sync/atomic's
atomic.Bool or an int32 used with atomic.Store/Load) or add a sync.Mutex (e.g.,
sawOverridesMu) and guard every read/write. Update all places that set or test
the flag (Risks() and the informer callback handlers that currently read
sawOverrides) to use the new atomic API or acquire the mutex before accessing
the flag so reads/writes are synchronized.
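The atomic.Bool variant the comment suggests looks like this as an isolated sketch (the struct and method names mirror the review's description; the accessor split is illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// overrides mirrors the shape of the struct under review. atomic.Bool
// replaces the unsynchronized *bool so the worker (Risks) and informer
// callbacks never race on the flag.
type overrides struct {
	sawOverrides atomic.Bool
}

func (o *overrides) risks(hasOverrides bool) {
	o.sawOverrides.Store(hasOverrides) // writer side
}

func (o *overrides) eventHandler() bool {
	return o.sawOverrides.Load() // reader side, safe from any goroutine
}

func main() {
	o := &overrides{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); o.risks(true) }()
	go func() { defer wg.Done(); _ = o.eventHandler() }() // clean under -race
	wg.Wait()
	fmt.Println("sawOverrides:", o.eventHandler())
}
```

atomic.Bool also removes the nil-pointer state the `*bool` carried; if "not yet computed" must stay distinguishable from false, the mutex variant is the better fit.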
🧹 Nitpick comments (1)
pkg/risk/adminack/adminack_test.go (1)

198-233: Avoid order-dependent shared state across table cases.

The suite mutates one shared fake client/informer and relies on delete scenarios being last (Line 198 and Line 217 comments). This creates brittle coupling between cases and makes future reordering/additions risky. Prefer per-t.Run fixture setup (or explicit reseed of both ConfigMaps at each case start).

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 285-316

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/adminack/adminack_test.go` around lines 198 - 233, The table-driven
tests mutate a shared fake client/informer and rely on delete cases being last
(uses gateCm and ackCm in pkg/risk/adminack/adminack_test.go), so change each
test case to create its own fresh fixture: inside each t.Run recreate the fake
client/informer and seed both ConfigMaps (admin-gates and admin-acks) from the
table entry (gateCm, ackCm) or explicitly reseed them before invoking the code
under test, rather than relying on global/shared state; ensure
expectedRisks/expectedError assertions run against the result of the isolated
per-case setup.
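The per-case fixture shape can be sketched without the Kubernetes fakes. `fakeStore` is a stand-in for the fake client/informer pair; the point is that each table entry seeds its own state, so case order stops mattering:

```go
package main

import "fmt"

// fakeStore is a stand-in for the shared fake client/informer in
// adminack_test.go; each case below builds its own instead of mutating one.
type fakeStore struct {
	configMaps map[string]string
}

// newFixture seeds both ConfigMaps fresh for every case.
func newFixture(gateCm, ackCm string) *fakeStore {
	return &fakeStore{configMaps: map[string]string{
		"admin-gates": gateCm,
		"admin-acks":  ackCm,
	}}
}

func main() {
	cases := []struct {
		name, gateCm, ackCm string
		deleteGates         bool
		wantGates           bool
	}{
		{name: "both present", gateCm: "gate-data", ackCm: "ack-data", wantGates: true},
		{name: "gates deleted", gateCm: "gate-data", ackCm: "ack-data", deleteGates: true},
		// Reordering cases is safe: nothing leaks between fixtures.
		{name: "both present again", gateCm: "gate-data", ackCm: "ack-data", wantGates: true},
	}
	for _, tc := range cases {
		store := newFixture(tc.gateCm, tc.ackCm) // fresh fixture per case
		if tc.deleteGates {
			delete(store.configMaps, "admin-gates")
		}
		_, got := store.configMaps["admin-gates"]
		fmt.Printf("%s: gates present=%v (want %v)\n", tc.name, got, tc.wantGates)
	}
}
```

In the real test the same move applies inside each `t.Run`: rebuild the fake clientset and informer, reseed both ConfigMaps from the table entry, then exercise the code under test.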
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/risk/adminack/adminack_test.go`:
- Around line 439-447: The test currently fails to assert two cases: when
tt.wantErr is true but err is nil, and when tt.expectedResult is true but
isApplicable is false; inside the gateApplicableToCurrentVersion() test after
calling the function, keep the existing unexpected-error check (if err != nil &&
!tt.wantErr ...) and early-return on err, then add explicit checks that if
tt.wantErr && err == nil { t.Fatalf("expected error for %s but got nil",
tt.gateName) } and if tt.expectedResult && !isApplicable {
t.Fatalf("gateApplicableToCurrentVersion() %s should apply", tt.gateName) } to
cover both missing-failure paths (use existing symbols tt.wantErr, err,
tt.expectedResult, isApplicable, tt.gateName).

In `@pkg/risk/upgradeable/upgradeable_test.go`:
- Around line 42-46: The test currently uses an unbuffered channel changeChannel
and a blocking send in changeCallback which can deadlock if the callback runs
before the test is ready; change changeChannel to a buffered channel (size 1)
and make the send in changeCallback non-blocking (use a select with a default
branch) so the callback never blocks the goroutine—update the references to
changeChannel and changeCallback in upgradeable_test.go accordingly.

---

Duplicate comments:
In `@pkg/cvo/status.go`:
- Around line 652-653: The single-risk branch assigns reason := risks[0].Name
which can contain hyphens and violate the Kubernetes condition-reason regex;
change the code that sets reason and message for the single-risk case to use the
same validated/sanitized constant or validation logic used in the multi-risk
path (sanitize risks[0].Name to match ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
or replace it with the existing validated reason constant), ensuring the reason
starts with a letter and contains only allowed characters while keeping message
:= risks[0].Message as before.
- Around line 620-629: The call to upgradeable.Risks(ctx, updateVersions)
returns a per-version applicability map that is currently discarded; capture
that second return value (e.g., perVersionMap or applicableByVersion) and use it
when filtering risks so you only exclude risks that truly don't apply to the
provided updateVersions. Specifically, change the assignment to capture the map
from upgradeable.Risks, then in the loop that currently checks
meta.FindStatusCondition(risks[i].Conditions,
internal.ConditionalUpdateRiskConditionTypeApplies) also consult the captured
map for risks[i] to skip risks that are not applicable to the requested
updateVersions (so patch-only updates won’t be blocked by major/minor-only
risks).

In `@pkg/risk/overrides/overrides_test.go`:
- Around line 36-40: The test uses an unbuffered changeChannel and a blocking
send in changeCallback which can hang if the informer fires before the test is
ready; make the channel buffered (e.g., changeChannel := make(chan struct{}, 1)
) or change changeCallback to perform a non-blocking send (use a select with a
default case) so the send won't block the informer callback; update references
to changeChannel and changeCallback in overrides_test.go accordingly.

In `@pkg/risk/overrides/overrides.go`:
- Around line 59-62: The code currently calls semver.Parse on
cv.Status.Desired.Version and returns an error if parsing fails, which
hard-fails when the version string is empty; change the logic in the function
containing this code so that if cv.Status.Desired.Version == "" you do not call
semver.Parse and instead proceed to compute and return the override risk (using
a non-version-specific path or fallback URL), but keep the existing fmt.Errorf
return for genuine parse errors when the version is non-empty; reference
cv.Status.Desired.Version and semver.Parse in your change and ensure downstream
filtering logic that used the parsed version can handle a nil/absent version
case.
- Around line 24-29: The sawOverrides pointer on the overrides struct is
accessed concurrently (written in Risks() and read in informer callbacks)
causing a race; make the flag concurrency-safe by either replacing sawOverrides
*bool with an atomic boolean (e.g., sync/atomic's atomic.Bool or an int32 used
with atomic.Store/Load) or add a sync.Mutex (e.g., sawOverridesMu) and guard
every read/write. Update all places that set or test the flag (Risks() and the
informer callback handlers that currently read sawOverrides) to use the new
atomic API or acquire the mutex before accessing the flag so reads/writes are
synchronized.

---

Nitpick comments:
In `@pkg/risk/adminack/adminack_test.go`:
- Around line 198-233: The table-driven tests mutate a shared fake
client/informer and rely on delete cases being last (uses gateCm and ackCm in
pkg/risk/adminack/adminack_test.go), so change each test case to create its own
fresh fixture: inside each t.Run recreate the fake client/informer and seed both
ConfigMaps (admin-gates and admin-acks) from the table entry (gateCm, ackCm) or
explicitly reseed them before invoking the code under test, rather than relying
on global/shared state; ensure expectedRisks/expectedError assertions run
against the result of the isolated per-case setup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3e6a88ab-f2af-4450-90c8-19563f7f5325

📥 Commits

Reviewing files that changed from the base of the PR and between c0e4fd2 and cba244e.

📒 Files selected for processing (20)
  • pkg/cvo/cvo.go
  • pkg/cvo/cvo_test.go
  • pkg/cvo/status.go
  • pkg/cvo/status_test.go
  • pkg/cvo/upgradeable.go
  • pkg/cvo/upgradeable_test.go
  • pkg/risk/adminack/adminack.go
  • pkg/risk/adminack/adminack_test.go
  • pkg/risk/aggregate/aggregate.go
  • pkg/risk/aggregate/aggregate_test.go
  • pkg/risk/deletion/deletion.go
  • pkg/risk/overrides/overrides.go
  • pkg/risk/overrides/overrides_test.go
  • pkg/risk/risk.go
  • pkg/risk/risk_test.go
  • pkg/risk/updating/updating.go
  • pkg/risk/updating/updating_test.go
  • pkg/risk/upgradeable/upgradeable.go
  • pkg/risk/upgradeable/upgradeable_test.go
  • pkg/risk/upgradeable/util.go
💤 Files with no reviewable changes (2)
  • pkg/cvo/upgradeable_test.go
  • pkg/cvo/upgradeable.go
✅ Files skipped from review due to trivial changes (4)
  • pkg/cvo/status_test.go
  • pkg/risk/upgradeable/util.go
  • pkg/risk/risk_test.go
  • pkg/risk/updating/updating_test.go
🚧 Files skipped from review as they are similar to previous changes (8)
  • pkg/risk/aggregate/aggregate.go
  • pkg/risk/deletion/deletion.go
  • pkg/risk/updating/updating.go
  • pkg/risk/upgradeable/upgradeable.go
  • pkg/risk/adminack/adminack.go
  • pkg/risk/risk.go
  • pkg/cvo/cvo_test.go
  • pkg/cvo/cvo.go

Comment on lines +42 to +46
changeChannel := make(chan struct{})
changeCallback := func() {
t.Logf("%s sending change notification", time.Now())
changeChannel <- struct{}{}
}

⚠️ Potential issue | 🟠 Major

Buffer changeChannel to prevent test flakiness.

Same issue as overrides_test.go: the unbuffered channel with a blocking send can deadlock or flake if the callback is invoked before the test reaches a receive. Use a buffered channel with non-blocking send.

Suggested fix
-	changeChannel := make(chan struct{})
+	changeChannel := make(chan struct{}, 1)
 	changeCallback := func() {
 		t.Logf("%s sending change notification", time.Now())
-		changeChannel <- struct{}{}
+		select {
+		case changeChannel <- struct{}{}:
+		default:
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/upgradeable/upgradeable_test.go` around lines 42 - 46, The test
currently uses an unbuffered channel changeChannel and a blocking send in
changeCallback which can deadlock if the callback runs before the test is ready;
change changeChannel to a buffered channel (size 1) and make the send in
changeCallback non-blocking (use a select with a default branch) so the callback
never blocks the goroutine—update the references to changeChannel and
changeCallback in upgradeable_test.go accordingly.


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
pkg/risk/adminack/adminack.go (2)

35-42: ⚠️ Potential issue | 🔴 Critical

Protect lastSeen with a mutex across worker and informer paths.

Risks() mutates a.lastSeen while eventHandler() reads it from informer goroutines, which creates a real data race (Line 77/82/86 vs Line 156).

Suggested fix
 import (
 	"context"
 	"fmt"
 	"regexp"
 	"slices"
 	"strings"
+	"sync"
@@
 type adminAck struct {
+	lastSeenMu       sync.RWMutex
 	name             string
 	currentVersion   func() configv1.Release
 	adminGatesLister listerscorev1.ConfigMapNamespaceLister
 	adminAcksLister  listerscorev1.ConfigMapNamespaceLister
 	hasSynced        func() bool
 	lastSeen         []configv1.ConditionalUpdateRisk
 }
@@
 	if meta.IsNoMatchError(err) {
-		a.lastSeen = nil
+		a.lastSeenMu.Lock()
+		a.lastSeen = nil
+		a.lastSeenMu.Unlock()
 		return nil, nil, nil
 	}
@@
 	if len(risks) == 0 {
-		a.lastSeen = nil
+		a.lastSeenMu.Lock()
+		a.lastSeen = nil
+		a.lastSeenMu.Unlock()
 		return nil, nil, err
 	}
 
-	a.lastSeen = risks
+	a.lastSeenMu.Lock()
+	a.lastSeen = append([]configv1.ConditionalUpdateRisk(nil), risks...)
+	a.lastSeenMu.Unlock()
@@
 func (a *adminAck) eventHandler(changeCallback func()) {
 	risks, _, err := a.risks()
@@
-	if diff := cmp.Diff(a.lastSeen, risks); diff != "" {
+	a.lastSeenMu.RLock()
+	lastSeen := append([]configv1.ConditionalUpdateRisk(nil), a.lastSeen...)
+	a.lastSeenMu.RUnlock()
+	if diff := cmp.Diff(lastSeen, risks); diff != "" {

Also applies to: 75-86, 148-157

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/adminack/adminack.go` around lines 35 - 42, The struct adminAck uses
lastSeen concurrently from Risks() (which mutates a.lastSeen) and eventHandler()
(which reads it), causing a data race; add a sync.Mutex (or sync.RWMutex) field
to adminAck (e.g., lastSeenMu) and guard all accesses: lock before
reading/writing lastSeen in Risks(), eventHandler(), and any other methods that
touch lastSeen (use RLock/RUnlock for readers and Lock/Unlock for writers) to
ensure safe concurrent access.

206-210: ⚠️ Potential issue | 🟡 Minor

Replace the placeholder ConditionalUpdateRisk.URL.

Line 209 still returns a placeholder https://example.com/... to end users. Use a real canonical docs link, or leave URL empty until one exists.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/adminack/adminack.go` around lines 206 - 210, The returned
ConditionalUpdateRisk currently sets URL to a placeholder string; update the URL
field in the configv1.ConditionalUpdateRisk literal (the struct returned in
adminack.go) to either a real canonical documentation/help link or an empty
string until a proper URL exists—locate the instantiation of
configv1.ConditionalUpdateRisk (the return that includes Name: riskName,
Message: gateValue) and replace the placeholder value for URL accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/risk/adminack/adminack.go`:
- Around line 35-42: The struct adminAck uses lastSeen concurrently from Risks()
(which mutates a.lastSeen) and eventHandler() (which reads it), causing a data
race; add a sync.Mutex (or sync.RWMutex) field to adminAck (e.g., lastSeenMu)
and guard all accesses: lock before reading/writing lastSeen in Risks(),
eventHandler(), and any other methods that touch lastSeen (use RLock/RUnlock for
readers and Lock/Unlock for writers) to ensure safe concurrent access.
- Around line 206-210: The returned ConditionalUpdateRisk currently sets URL to
a placeholder string; update the URL field in the configv1.ConditionalUpdateRisk
literal (the struct returned in adminack.go) to either a real canonical
documentation/help link or an empty string until a proper URL exists—locate the
instantiation of configv1.ConditionalUpdateRisk (the return that includes Name:
riskName, Message: gateValue) and replace the placeholder value for URL
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c7bfeb7f-c902-451f-97ea-29b8285a851f

📥 Commits

Reviewing files that changed from the base of the PR and between cba244e and b41f2a4.

📒 Files selected for processing (2)
  • pkg/cvo/cvo.go
  • pkg/risk/adminack/adminack.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/cvo/cvo.go

@wking wking force-pushed the upgradeable-as-update-risk branch from b41f2a4 to be896d7 Compare April 6, 2026 19:34

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (7)
pkg/risk/upgradeable/upgradeable.go (1)

25-30: ⚠️ Potential issue | 🔴 Critical

Guard lastSeen with synchronization and compare a snapshot.

Risks() writes this slice while eventHandler() diffs against it. Without a mutex and a copied snapshot, this is a data race and change detection can miss or duplicate callbacks.

Also applies to: 49-61, 114-123

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/upgradeable/upgradeable.go` around lines 25 - 30, The slice field
lastSeen on struct upgradeable is accessed concurrently by Risks() (which
writes) and eventHandler() (which reads/diffs), so add a sync.RWMutex (e.g.
lastSeenMu) to upgradeable and use it to guard all accesses: acquire write lock
in Risks() when modifying/replacing u.lastSeen (replace the slice atomically, do
not mutate in place) and acquire read lock in eventHandler() to take a copied
snapshot (e.g. snap := append([]configv1.ConditionalUpdateRisk(nil),
u.lastSeen...) ) before doing comparisons; ensure all other reads/writes to
lastSeen use the mutex to avoid races and missed/duplicate callbacks.
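The lock-plus-snapshot pattern the comment describes, sketched with a stand-in risk type (field and method names follow the review's wording; the real field holds `[]configv1.ConditionalUpdateRisk`):

```go
package main

import (
	"fmt"
	"sync"
)

// risk stands in for configv1.ConditionalUpdateRisk.
type risk struct{ Name string }

// upgradeable mirrors the struct under review: lastSeen is only touched
// while holding lastSeenMu, and readers diff against a copied snapshot.
type upgradeable struct {
	lastSeenMu sync.RWMutex
	lastSeen   []risk
}

// setLastSeen replaces the slice atomically under the write lock; storing a
// copy means later mutations of the caller's slice cannot leak in.
func (u *upgradeable) setLastSeen(risks []risk) {
	u.lastSeenMu.Lock()
	u.lastSeen = append([]risk(nil), risks...)
	u.lastSeenMu.Unlock()
}

// snapshotLastSeen returns a stable copy for diffing in informer callbacks.
func (u *upgradeable) snapshotLastSeen() []risk {
	u.lastSeenMu.RLock()
	defer u.lastSeenMu.RUnlock()
	return append([]risk(nil), u.lastSeen...)
}

func main() {
	u := &upgradeable{}
	u.setLastSeen([]risk{{Name: "ClusterOperatorNotUpgradeable"}})
	snap := u.snapshotLastSeen()
	u.setLastSeen(nil) // a concurrent-style update does not disturb the snapshot
	fmt.Println(len(snap), snap[0].Name)
}
```

Copy-on-read keeps the critical section tiny: `cmp.Diff` then runs on the snapshot outside the lock, so a slow comparison never blocks writers.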
pkg/risk/adminack/adminack.go (1)

36-42: ⚠️ Potential issue | 🔴 Critical

Protect lastSeen from concurrent reads and writes.

Risks() updates this slice while informer callbacks diff against it. Without synchronization and a stable snapshot, callback decisions become racy.

Also applies to: 68-80, 138-147

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/adminack/adminack.go` around lines 36 - 42, The adminAck.lastSeen
slice is accessed concurrently (Risks() and informer callbacks) and must be
protected: add a sync.RWMutex field to the adminAck struct (e.g., mu
sync.RWMutex) and use mu.Lock()/mu.Unlock() when mutating lastSeen and
mu.RLock()/mu.RUnlock() when reading; when returning or diffing use a protected
shallow copy (copy into a new []configv1.ConditionalUpdateRisk) to provide a
stable snapshot for informer callbacks and ensure all places that touch lastSeen
(including the Risks() method and the informer callback handlers referenced
around lines 68-80 and 138-147) are updated to acquire the appropriate lock and
operate on the copied slice.
pkg/risk/overrides/overrides.go (2)

24-29: ⚠️ Potential issue | 🔴 Critical

Synchronize sawOverrides before informer callbacks read it.

Risks() writes this field while eventHandler() reads and dereferences it from informer goroutines. That race can miss or duplicate refreshes and leave the override risk stale.

Also applies to: 54-55, 96-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/overrides/overrides.go` around lines 24 - 29, The field
overrides.sawOverrides is written in Risks() and read/dereferenced in
eventHandler() from informer goroutines causing a data race; protect all
reads/writes by adding synchronization (e.g. a sync.Mutex or sync.RWMutex) to
the overrides struct and use it when accessing sawOverrides in Risks(), in the
informer callback/eventHandler(), and any other helper that touches sawOverrides
(also at the other mentioned locations around lines 54-55 and 96-108);
alternatively you can replace the pointer-without-lock pattern with an
atomic-safe type (atomic.Value or a dedicated atomic.Bool/atomic.Pointer) but be
sure to update all places that read or write sawOverrides to use the chosen
synchronization mechanism so no goroutine accesses the bool without holding the
lock or using the atomic API.

59-78: ⚠️ Potential issue | 🟠 Major

Return the override risk even when Status.Desired.Version is empty.

Initial-sync ClusterVersion states can have only Status.Desired.Image populated. Hard-failing on semver.Parse here suppresses the override blocker instead of returning the risk and leaving versionMap unset until the version is known.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/overrides/overrides.go` around lines 59 - 78, Handle empty
Desired.Version before calling semver.Parse: if cv.Status.Desired.Version == ""
then still construct the configv1.ConditionalUpdateRisk (using o.name, Message,
and a generic or versionless URL) and return it with a nil/empty versionMap
instead of erroring; otherwise continue to call semver.Parse and follow the
existing flow (use semver.Parse result to build the URL and compute
majorAndMinorUpdates into versionMap). Update the logic around semver.Parse,
risk, majorAndMinorUpdates, and versionMap to ensure the risk is always returned
even when Status.Desired.Version is empty.
pkg/risk/deletion/deletion.go (1)

60-61: ⚠️ Potential issue | 🔴 Critical

Do not drop the deletion risk on the non-empty path.

This condition is inverted. When deletions are in progress, the function returns nils and suppresses the blocker entirely.

Suggested fix
-	if len(risks) > 0 {
-		return nil, nil, nil
+	if len(risks) == 0 {
+		return nil, nil, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/deletion/deletion.go` around lines 60 - 61, The current early-return
drops detected deletion risks by returning all nils when len(risks) > 0;
instead, preserve and return the risks. Change the branch that now does "if
len(risks) > 0 { return nil, nil, nil }" to return the risks according to the
function signature (e.g., "return nil, risks, nil" if the signature is
(blockers, risks, error)), so the detected deletion risks are propagated; keep
the same error handling otherwise and adjust to the actual named function in
pkg/risk/deletion/deletion.go where the "risks" variable is computed.
pkg/risk/updating/updating.go (1)

25-30: ⚠️ Potential issue | 🔴 Critical

Synchronize wasUpdating before callbacks dereference it.

Risks() writes wasUpdating and eventHandler() reads it from informer goroutines. That race can miss or duplicate enqueueing and leave the updating risk stale.

Also applies to: 55-56, 93-105

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/updating/updating.go` around lines 25 - 30, The code races on the
updating.wasUpdating pointer because Risks() writes it and
eventHandler()/informer goroutines read it; synchronize access by replacing the
raw *bool with a guarded value (e.g., add a sync.Mutex and a bool field or use
sync/atomic (atomic.Bool)) and expose small accessor methods (e.g.,
setWasUpdating(bool) and isWasUpdating() bool) so Risks() sets the value under
the lock/atomic store and eventHandler()/other informer callbacks read it via
the accessor under the lock/atomic load; update references in Risks(),
eventHandler(), and any other uses (lines referenced around struct updating,
Risks(), and eventHandler()) to use these accessors instead of dereferencing
wasUpdating directly.
pkg/cvo/status.go (1)

652-668: ⚠️ Potential issue | 🟠 Major

Validate the reason before using it as a Condition.Reason.

The single-risk case at line 652 uses risks[0].Name directly as the Condition.Reason. Risk names from the upgradeable source can include operator names with hyphens (e.g., ClusterOperatorUpgradeable-test-operator), which violates the Kubernetes condition-reason pattern ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$.

The multiple-risk case correctly uses a validated constant (recommendedReasonMultiple). Apply the same pattern to the single-risk case.

Suggested fix
-		reason := risks[0].Name
+		reason := risks[0].Name
+		if !reasonPattern.MatchString(reason) {
+			reason = recommendedReasonExposed
+		}
 		message := risks[0].Message

Run the following script to check the actual risk names generated by the upgradeable source:

#!/bin/bash
# Check what risk names are generated in the upgradeable risk package tests

rg -n 'Name:\s*"[^"]*-' pkg/risk/upgradeable/ -A 2 -B 2

# Also check the reasonPattern definition
rg -n 'reasonPattern' pkg/cvo/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cvo/status.go` around lines 652 - 668, The single-risk branch currently
uses risks[0].Name directly as the Condition.Reason (violates Kubernetes
reasonPattern); change it to use a validated constant like the multiple-risk
branch does (introduce or reuse a constant such as recommendedReasonSingle)
instead of risks[0].Name, and ensure the chosen constant matches the Kubernetes
reason regex; update the code around risks, recommendedReasonMultiple,
resourcemerge.SetOperatorStatusCondition and configv1.OperatorUpgradeable to set
Reason to that validated constant for the single-risk path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/risk/upgradeable/upgradeable.go`:
- Around line 36-39: The event handler registration currently only calls
source.eventHandler(changeCallback) for AddFunc and UpdateFunc, so deletions of
ClusterOperator are ignored; modify the cache.ResourceEventHandlerFuncs passed
to coInformer.Informer().AddEventHandler to include DeleteFunc: set DeleteFunc:
func(_ interface{}) { source.eventHandler(changeCallback) } so deletions trigger
recomputation (which risks() relies on) just like adds/updates.
- Around line 87-111: The risks slice returned from the method that calls
u.coLister.List is not deterministic; before returning (where risks is assembled
using resourcemerge.FindOperatorStatusCondition and appended with entries named
fmt.Sprintf("%s-%s", u.name, op.ObjectMeta.Name)), sort the risks slice by the
Name field (use sort.SliceStable or sort.Slice with a comparator comparing
risks[i].Name < risks[j].Name) so the ordering is stable (this will also
stabilize u.lastSeen comparisons in the eventHandler).

---

Duplicate comments:
In `@pkg/cvo/status.go`:
- Around line 652-668: The single-risk branch currently uses risks[0].Name
directly as the Condition.Reason (violates Kubernetes reasonPattern); change it
to use a validated constant like the multiple-risk branch does (introduce or
reuse a constant such as recommendedReasonSingle) instead of risks[0].Name, and
ensure the chosen constant matches the Kubernetes reason regex; update the code
around risks, recommendedReasonMultiple,
resourcemerge.SetOperatorStatusCondition and configv1.OperatorUpgradeable to set
Reason to that validated constant for the single-risk path.

In `@pkg/risk/adminack/adminack.go`:
- Around line 36-42: The adminAck.lastSeen slice is accessed concurrently
(Risks() and informer callbacks) and must be protected: add a sync.RWMutex field
to the adminAck struct (e.g., mu sync.RWMutex) and use mu.Lock()/mu.Unlock()
when mutating lastSeen and mu.RLock()/mu.RUnlock() when reading; when returning
or diffing use a protected shallow copy (copy into a new
[]configv1.ConditionalUpdateRisk) to provide a stable snapshot for informer
callbacks and ensure all places that touch lastSeen (including the Risks()
method and the informer callback handlers referenced around lines 68-80 and
138-147) are updated to acquire the appropriate lock and operate on the copied
slice.

In `@pkg/risk/deletion/deletion.go`:
- Around line 60-61: The current early-return drops detected deletion risks by
returning all nils when len(risks) > 0; instead, preserve and return the risks.
Change the branch that now does "if len(risks) > 0 { return nil, nil, nil }" to
return the risks according to the function signature (e.g., "return nil, risks,
nil" if the signature is (blockers, risks, error)), so the detected deletion
risks are propagated; keep the same error handling otherwise and adjust to the
actual named function in pkg/risk/deletion/deletion.go where the "risks"
variable is computed.

In `@pkg/risk/overrides/overrides.go`:
- Around line 24-29: The field overrides.sawOverrides is written in Risks() and
read/dereferenced in eventHandler() from informer goroutines causing a data
race; protect all reads/writes by adding synchronization (e.g. a sync.Mutex or
sync.RWMutex) to the overrides struct and use it when accessing sawOverrides in
Risks(), in the informer callback/eventHandler(), and any other helper that
touches sawOverrides (also at the other mentioned locations around lines 54-55
and 96-108); alternatively you can replace the pointer-without-lock pattern with
an atomic-safe type (atomic.Value or a dedicated atomic.Bool/atomic.Pointer) but
be sure to update all places that read or write sawOverrides to use the chosen
synchronization mechanism so no goroutine accesses the bool without holding the
lock or using the atomic API.
- Around line 59-78: Handle empty Desired.Version before calling semver.Parse:
if cv.Status.Desired.Version == "" then still construct the
configv1.ConditionalUpdateRisk (using o.name, Message, and a generic or
versionless URL) and return it with a nil/empty versionMap instead of erroring;
otherwise continue to call semver.Parse and follow the existing flow (use
semver.Parse result to build the URL and compute majorAndMinorUpdates into
versionMap). Update the logic around semver.Parse, risk, majorAndMinorUpdates,
and versionMap to ensure the risk is always returned even when
Status.Desired.Version is empty.

In `@pkg/risk/updating/updating.go`:
- Around line 25-30: The code races on the updating.wasUpdating pointer because
Risks() writes it and eventHandler()/informer goroutines read it; synchronize
access by replacing the raw *bool with a guarded value (e.g., add a sync.Mutex
and a bool field or use sync/atomic (atomic.Bool)) and expose small accessor
methods (e.g., setWasUpdating(bool) and isWasUpdating() bool) so Risks() sets
the value under the lock/atomic store and eventHandler()/other informer
callbacks read it via the accessor under the lock/atomic load; update references
in Risks(), eventHandler(), and any other uses (lines referenced around struct
updating, Risks(), and eventHandler()) to use these accessors instead of
dereferencing wasUpdating directly.

In `@pkg/risk/upgradeable/upgradeable.go`:
- Around line 25-30: The slice field lastSeen on struct upgradeable is accessed
concurrently by Risks() (which writes) and eventHandler() (which reads/diffs),
so add a sync.RWMutex (e.g. lastSeenMu) to upgradeable and use it to guard all
accesses: acquire write lock in Risks() when modifying/replacing u.lastSeen
(replace the slice atomically, do not mutate in place) and acquire read lock in
eventHandler() to take a copied snapshot (e.g. snap :=
append([]configv1.ConditionalUpdateRisk(nil), u.lastSeen...) ) before doing
comparisons; ensure all other reads/writes to lastSeen use the mutex to avoid
races and missed/duplicate callbacks.

📥 Commits

Reviewing files that changed from the base of the PR and between b41f2a4 and be896d7.

📒 Files selected for processing (20)
  • pkg/cvo/cvo.go
  • pkg/cvo/cvo_test.go
  • pkg/cvo/status.go
  • pkg/cvo/status_test.go
  • pkg/cvo/upgradeable.go
  • pkg/cvo/upgradeable_test.go
  • pkg/risk/adminack/adminack.go
  • pkg/risk/adminack/adminack_test.go
  • pkg/risk/aggregate/aggregate.go
  • pkg/risk/aggregate/aggregate_test.go
  • pkg/risk/deletion/deletion.go
  • pkg/risk/overrides/overrides.go
  • pkg/risk/overrides/overrides_test.go
  • pkg/risk/risk.go
  • pkg/risk/risk_test.go
  • pkg/risk/updating/updating.go
  • pkg/risk/updating/updating_test.go
  • pkg/risk/upgradeable/upgradeable.go
  • pkg/risk/upgradeable/upgradeable_test.go
  • pkg/risk/upgradeable/util.go
💤 Files with no reviewable changes (2)
  • pkg/cvo/upgradeable_test.go
  • pkg/cvo/upgradeable.go
✅ Files skipped from review due to trivial changes (7)
  • pkg/cvo/status_test.go
  • pkg/risk/aggregate/aggregate_test.go
  • pkg/risk/risk_test.go
  • pkg/risk/updating/updating_test.go
  • pkg/risk/adminack/adminack_test.go
  • pkg/risk/upgradeable/upgradeable_test.go
  • pkg/risk/overrides/overrides_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/risk/upgradeable/util.go

wking added a commit to wking/cluster-version-operator that referenced this pull request Apr 6, 2026
This code is from 519b466 (Bug 1978376: Add admin ack Upgradeable
condition gate, 2021-07-27, openshift#633).  But I'm about to move it, and
CodeRabbit complains about the unhandled "wantErr=true, but err==nil"
and "expectedResult=true but isApplicable=false" situations.  Extend
the logic to check for those.

Also move from t.Fatal* to t.Error*, because there's no harm in moving
on to check the other condition, even if the first condition we check
has a problem.  Checking both will give the developer more information
about what they need to fix to make the test happy, without them
needing to retest after fixing one property before they hear about a
problem with the other property.

[1]: openshift#1368 (comment)
@wking wking force-pushed the upgradeable-as-update-risk branch from be896d7 to c148c8c Compare April 6, 2026 19:55
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (9)
pkg/risk/upgradeable/upgradeable.go (3)

36-39: ⚠️ Potential issue | 🟠 Major

Missing DeleteFunc handler for ClusterOperator deletions.

The event handler only registers AddFunc and UpdateFunc. When a ClusterOperator is deleted, the eventHandler won't be triggered, leaving the Upgradeable source stale with outdated risks until an unrelated add/update event arrives. Add a DeleteFunc handler.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/upgradeable/upgradeable.go` around lines 36 - 39, The event handler
registration for coInformer.Informer() only wires AddFunc and UpdateFunc, so
deletions of ClusterOperator aren't triggering source.eventHandler and leave
stale state; add a DeleteFunc to the cache.ResourceEventHandlerFuncs that calls
source.eventHandler(changeCallback) (mirror the existing AddFunc/UpdateFunc
behavior) so that ClusterOperator deletions invoke the same changeCallback and
update the Upgradeable source.

25-30: ⚠️ Potential issue | 🔴 Critical

Data race on lastSeen field persists.

The lastSeen slice is written at lines 52, 57, and 61 in Risks() and read at line 122 in eventHandler(). These execute on different goroutines without synchronization, which is a data race that can cause cmp.Diff to compare torn state.

Also applies to: 52-52, 57-57, 61-61, 122-122

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/upgradeable/upgradeable.go` around lines 25 - 30, The lastSeen slice
in struct upgradeable is accessed concurrently by Risks() (writes at the three
locations) and eventHandler() (reads), causing a data race; fix by adding a
sync.RWMutex (e.g., mu) to upgradeable and use mu.Lock()/mu.Unlock() around
writes in Risks() and mu.RLock()/mu.RUnlock() around reads in eventHandler() (or
perform a safe copy under lock and use the copy outside the lock) so cmp.Diff
and other consumers never see torn state; update all references to lastSeen
accordingly to use the protected access.

100-111: ⚠️ Potential issue | 🟠 Major

Unsorted risks cause spurious change callbacks.

The coLister.List() returns ClusterOperator objects in arbitrary order from the cache. This unstable ordering causes cmp.Diff(u.lastSeen, risks) in eventHandler to detect false changes when the list order shifts, triggering unnecessary callbacks. Sort the risks slice by Name before returning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/upgradeable/upgradeable.go` around lines 100 - 111, The generated
risks slice is unstable because coLister.List() yields operators in arbitrary
order; before returning from the function that builds the risks slice (the block
that defines var risks []configv1.ConditionalUpdateRisk and iterates over ops
using resourcemerge.FindOperatorStatusCondition), sort risks by the Name field
to produce a deterministic ordering so cmp.Diff(u.lastSeen, risks) won't see
spurious changes; use sort.SliceStable (or sort.Slice) on risks comparing
risks[i].Name < risks[j].Name and then return the sorted risks, version, nil.
pkg/cvo/status.go (1)

652-666: ⚠️ Potential issue | 🟠 Major

Raw risk.Name as condition reason may violate Kubernetes pattern.

Risk names can contain hyphens (e.g., ClusterOperatorUpgradeable-operator-name from pkg/risk/upgradeable), but Kubernetes condition reasons must match ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$ (hyphens not allowed). Using risks[0].Name directly at line 652 can publish a non-conformant status reason.

Consider sanitizing the reason or using a validated constant like recommendedReasonExposed when the risk name doesn't match the pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cvo/status.go` around lines 652 - 666, The code sets the condition Reason
directly from risks[0].Name which may contain invalid characters (hyphens) for
Kubernetes condition reasons; update the logic before calling
resourcemerge.SetOperatorStatusCondition (for the OperatorUpgradeable condition)
to validate/sanitize the reason string (risks[0].Name) against the Kubernetes
reason regex and, if it fails, replace it with a validated constant such as
recommendedReasonExposed (or another allowed constant), ensuring the multi-risk
branch that sets recommendedReasonMultiple still produces a valid Reason; apply
this check where reason is assigned so all paths (single risk and multiple
risks) emit a conformant reason.
pkg/risk/overrides/overrides.go (2)

24-29: ⚠️ Potential issue | 🔴 Critical

Data race on sawOverrides field persists.

The sawOverrides pointer is written at line 54 from Risks() and read at lines 97 and 107-108 from eventHandler(). These run on different goroutines without synchronization, creating a data race.

Also applies to: 54-54, 96-112

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/overrides/overrides.go` around lines 24 - 29, The sawOverrides
pointer on struct overrides is accessed concurrently (written in Risks() and
read in eventHandler()) causing a data race; fix by making synchronization
explicit—either replace *bool with a sync/atomic boolean (e.g., atomic.Bool) or
add a mutex field to overrides and guard all accesses to sawOverrides with that
mutex. Update the writer in Risks() to use the chosen atomic store or
lock/unlock, and update the readers in eventHandler() to use atomic load or
lock/unlock respectively; ensure all references to sawOverrides in the overrides
type, Risks(), and eventHandler() are adjusted to the new synchronized access
pattern.

59-62: ⚠️ Potential issue | 🟠 Major

Hard failure on empty Status.Desired.Version drops the override risk.

When cv.Status.Desired.Version is empty (common during initial sync when only Image is set), semver.Parse fails and the function returns an error without returning the override risk. This drops the upgrade blocker entirely instead of surfacing it with a fallback URL.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/overrides/overrides.go` around lines 59 - 62, The code currently
hard-fails when cv.Status.Desired.Version is empty because semver.Parse returns
an error; instead, guard the parse: check if cv.Status.Desired.Version == "" and
treat that as a non-fatal condition (do not call semver.Parse), then continue to
construct and return the override risk (including the fallback URL) rather than
returning an error; otherwise, for non-empty values, keep using semver.Parse and
return an error only for real parse failures. Ensure the conditional logic
references cv.Status.Desired.Version and semver.Parse so the behavior branches
correctly.
pkg/risk/updating/updating.go (2)

25-30: ⚠️ Potential issue | 🔴 Critical

Data race on wasUpdating field persists.

The wasUpdating pointer is written at line 55 from Risks() (called from sync paths) and read at lines 94 and 104-105 from eventHandler() (called from informer callbacks on different goroutines). This is a data race that can cause missed or duplicate notifications and leave Upgradeable stale.

Add synchronization using a mutex or atomic operations to protect access to wasUpdating.

Also applies to: 54-55, 93-109

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/updating/updating.go` around lines 25 - 30, The wasUpdating pointer
on struct updating is accessed from multiple goroutines (written in Risks() and
read in eventHandler()), causing a data race; add synchronization by replacing
wasUpdating *bool with either an atomic boolean (e.g., atomic.Bool) or adding a
sync.Mutex (or sync.RWMutex) on updating, then update all accesses in Risks()
and eventHandler() to use the chosen synchronization: for atomic, use Load/Store
operations; for a mutex, lock/unlock around reads and writes. Ensure you update
the struct definition (updating) and all references in Risks() and
eventHandler() to use the new guarded access so reads/writes are thread-safe.

50-58: ⚠️ Potential issue | 🔴 Critical

Unhandled lister errors treated as "not updating".

When isUpdating() returns an error that is neither NotFound nor NoMatchError, the code falls through to line 55, sets wasUpdating, and if updating is false, returns nil, nil, nil. This silently treats cache/lister failures as "no risk", potentially clearing the upgrade blocker during transient problems. The error should be propagated.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/updating/updating.go` around lines 50 - 58, The current logic treats
any non-NotFound/NoMatch error from u.isUpdating() as "not updating" and
silently clears the blocker; change the control flow to propagate unexpected
errors instead: after calling u.isUpdating(), if err != nil then if
meta.IsNoMatchError(err) || apierrors.IsNotFound(err) return nil,nil,nil else
return nil,nil,err; only set u.wasUpdating = &updating and proceed when err ==
nil. This ensures u.isUpdating() errors are not treated as successful "not
updating" states.
pkg/risk/deletion/deletion.go (1)

60-72: ⚠️ Potential issue | 🔴 Critical

Fix the inverted early-return that suppresses active deletion risks.

On Line 60, the len(risks) condition is reversed: when a risk is found, the function returns nil, nil, nil, so the deletion blocker is dropped and risk-to-version mapping (Lines 69-72) is skipped.

Proposed fix
-	if len(risks) > 0 {
+	if len(risks) == 0 {
 		return nil, nil, nil
 	}
 
 	majorAndMinorUpdates := upgradeable.MajorAndMinorUpdates(version, versions)
 	if len(majorAndMinorUpdates) == 0 {
 		return risks, nil, err
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/deletion/deletion.go` around lines 60 - 72, The early-return is
inverted: currently the code returns nil,nil,nil when risks exist, suppressing
deletion blockers; change the condition to check for no risks (if len(risks) ==
0) and return nil, nil, nil only in that case so the subsequent logic that
computes majorAndMinorUpdates, builds versionMap, and returns the actual risks
executes when risks are present; update the if condition near the use of
variables risks, majorAndMinorUpdates, and versionMap accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cvo/cvo.go`:
- Line 309: The variable declaration uses an undefined type `riskSource`; change
the declaration of `risks` from `riskSource` to the imported package type
`risk.Source` (i.e., replace `risks := []riskSource{}` with `risks :=
[]risk.Source{}`) so it matches other usages like the struct fields referencing
`risk.Source` and compiles correctly.

In `@pkg/risk/updating/updating.go`:
- Line 88: Remove the debug print statement `fmt.Printf("checking ClusterVersion
conditions: %v\n", cv.Status.Conditions)` from pkg/risk/updating/updating.go
(it's a transient debug of `cv.Status.Conditions`/ClusterVersion conditions). If
that `fmt` import becomes unused after removal, delete the `fmt` import as well
to avoid unused-import compile errors. Ensure no other references rely on this
print and run `go vet`/`go build` to verify.


📥 Commits

Reviewing files that changed from the base of the PR and between be896d7 and c148c8c.

📒 Files selected for processing (20)
  • pkg/cvo/cvo.go
  • pkg/cvo/cvo_test.go
  • pkg/cvo/status.go
  • pkg/cvo/status_test.go
  • pkg/cvo/upgradeable.go
  • pkg/cvo/upgradeable_test.go
  • pkg/risk/adminack/adminack.go
  • pkg/risk/adminack/adminack_test.go
  • pkg/risk/aggregate/aggregate.go
  • pkg/risk/aggregate/aggregate_test.go
  • pkg/risk/deletion/deletion.go
  • pkg/risk/overrides/overrides.go
  • pkg/risk/overrides/overrides_test.go
  • pkg/risk/risk.go
  • pkg/risk/risk_test.go
  • pkg/risk/updating/updating.go
  • pkg/risk/updating/updating_test.go
  • pkg/risk/upgradeable/upgradeable.go
  • pkg/risk/upgradeable/upgradeable_test.go
  • pkg/risk/upgradeable/util.go
💤 Files with no reviewable changes (2)
  • pkg/cvo/upgradeable_test.go
  • pkg/cvo/upgradeable.go
✅ Files skipped from review due to trivial changes (6)
  • pkg/risk/overrides/overrides_test.go
  • pkg/cvo/status_test.go
  • pkg/risk/risk_test.go
  • pkg/risk/updating/updating_test.go
  • pkg/risk/upgradeable/upgradeable_test.go
  • pkg/risk/adminack/adminack.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/risk/upgradeable/util.go
  • pkg/risk/aggregate/aggregate_test.go
  • pkg/cvo/cvo_test.go

@wking wking force-pushed the upgradeable-as-update-risk branch 10 times, most recently from 98bc10b to 782b77f Compare April 7, 2026 04:50
wking added 4 commits April 9, 2026 21:22
This code is from 519b466 (Bug 1978376: Add admin ack Upgradeable
condition gate, 2021-07-27, openshift#633).  But I'm about to move it, and
CodeRabbit complains about the unhandled "wantErr=true, but err==nil"
and "expectedResult=true but isApplicable=false" situations.  Extend
the logic to check for those.

Also move from t.Fatal* to t.Error*, because there's no harm in moving
on to check the other condition, even if the first condition we check
has a problem.  Checking both will give the developer more information
about what they need to fix to make the test happy, without them
needing to retest after fixing one property before they hear about a
problem with the other property.

[1]: openshift#1368 (comment)
This simplifies the complex cvo package by shifting some logic into
small, narrowly scoped helper packages.  That makes for easier unit
testing of that logic, and less complication when reading the
remaining cvo package code.

Using the risk.Source framework with the new aggregate implementation
allows Upgradeable risks to feed into the conditionalUpdateRisks
framework.  This should help reduce confusion some users experience
when they see Upgradeable=False and think it applies to all updates
(when in reality it applies to major and minor bumps, but does not
apply to patch updates, especially since 6f8f984, OTA-1860: Stop
blocking patch updates when cluster version overrides are set,
2026-02-10, openshift#1314).  It will also allow structured access to
individual risks that are reported via this pipeline, while
Upgradeable had an aggregated message string only consumable by
humans.

I'm also removing internal.UpgradeableAdminAckRequired and similar
from ClusterVersion status.conditions.  While these were visible by
end users, they were mostly a mechanism for passing data from one part
of the CVO to another.  There are no known consumers outside of the
CVO.

With the refactor providing clearer abstraction layers, I could drop
the entire upgradeableCheckIntervals structure.  Major touches in the
history of that structure were bd05174 (pkg/cvo/upgradeable.go:
Sync Upgradeable status of the CVO more often, 2022-08-02, openshift#808) and
cc94c95 (pkg/cvo/upgradeable: refactor throttling, 2023-01-11, openshift#882).
The new approach is event-driven via the informer handlers for almost
all of the new risk.Source implementations.  The one source that still
relies on polling is the deletion source, where it's watching a
CVO-side global variable, and not a Kube-side informer.  It will still
sync whenever ClusterVersion status is updated though, which should be
often enough for this usually-transient situation.  If folks want
lower latency, follow up work could add informer-style handler
callbacks to the deletion global.

The changeCallback in unit tests is unbuffered, because I want to know
exactly how often that callback is being called.  If a logic change in
the packages causes additional callback calls, the unbuffered channel
will block them from sending the message, deadlock the informer for
that handler [1], and break later test-cases.

[1]: https://github.com/kubernetes/client-go/blob/v0.35.3/tools/cache/shared_informer.go#L1052-L1086
Avoid startup-time "hey, I can't find those ConfigMaps!" complaints
from pkg/risk/adminack by waiting until the informers have
synchronized.

The pkg/cvo Operator type has a cacheSynced that's populated in
Operator.New (with the ClusterVersion, ClusterOperator, and
FeatureGate informers).  cacheSynced goes all the way back to
648d27c (metrics: Report a gauge for when cluster operator
conditions change, 2019-01-31, openshift#107).  But when we added the initial
ConfigMap informer as cmLister back in c147732 (metrics: add
cluster_installer series, 2019-06-26, openshift#213) we didn't add that
informer to cacheSynced.  And when I added cmManagedInformer in
d18deee (pkg/cvo: Separate ConfigMap informer for
openshift-config-managed, 2020-08-21, openshift#441), I didn't add that
informer to cacheSynced.  None of these earlier commits explain why
the ConfigMap informers were not included in cacheSynced, so I'm
assuming that was oversight, and catching up now.

There is a risk that the ConfigMap informers fail to synchronize
quickly, but I'm having trouble imagining a situation where that would
happen but the ClusterVersion and ClusterOperator informers wouldn't
also fail to synchronize.  And either way, we don't serve metrics
while we wait on this cache synchronization, so the critical
ClusterVersionOperatorDown alert will be firing within 10 minutes, and
the responding system administrators can sort out whatever was giving
the cache-sync logic trouble.
vendor/k8s.io/client-go/tools/cache/shared_informer.go has:

  func (s *sharedIndexInformer) AddEventHandlerWithOptions(handler ResourceEventHandler, options HandlerOptions) (ResourceEventHandlerRegistration, error) {
  	s.startedLock.Lock()
  	defer s.startedLock.Unlock()

  	if s.stopped {
  		return nil, fmt.Errorf("handler %v was not added to shared informer because it has stopped already", handler)
  	}

which might be the only reason AddEventHandler could fail.  But
regardless, it is still good practice to bubble things up and fail
fast, to summon a cluster administrator to help sort out whatever the
issue is.
@wking wking force-pushed the upgradeable-as-update-risk branch from 782b77f to f5453e7 Compare April 10, 2026 04:22
@wking
Member Author

wking commented Apr 10, 2026

Rebased onto main with 782b77f -> f5453e7, now that #1367 has merged.

@openshift-ci
Contributor

openshift-ci bot commented Apr 10, 2026

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-agnostic-ovn f5453e7 link true /test e2e-agnostic-ovn
ci/prow/e2e-hypershift-conformance f5453e7 link true /test e2e-hypershift-conformance
ci/prow/e2e-hypershift f5453e7 link true /test e2e-hypershift
ci/prow/e2e-aws-ovn-techpreview f5453e7 link true /test e2e-aws-ovn-techpreview

Full PR test history. Your PR dashboard.



Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.


2 participants