OTA-1935: pkg: Use risk.Source framework to feed Upgradeable #1368

wking wants to merge 4 commits into openshift:main
Conversation
@wking: This pull request references OTA-1935 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough

Replaces ad-hoc Prometheus-alert handling with a pluggable risk.Source framework, adds multiple concrete risk sources and an aggregator, removes the legacy upgradeable subsystem and workqueue, and integrates aggregated risks into available-updates evaluation and ClusterVersion status computation.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Actionable comments posted: 14
🧹 Nitpick comments (1)
pkg/risk/adminack/adminack_test.go (1)
436-445: Assert both the error expectation and the boolean result.

Lines 437-445 only catch unexpected errors and false positives. A broken implementation returning (false, nil) for a matching gate, or (false, nil) when wantErr is true, would still pass these cases.

Suggested fix

isApplicable, err := gateApplicableToCurrentVersion(tt.gateName, version)
- if err != nil && !tt.wantErr {
- t.Fatalf("gateApplicableToCurrentVersion() unexpected error: %v", err)
- }
- if err != nil {
- return
- }
- if isApplicable && !tt.expectedResult {
- t.Fatalf("gateApplicableToCurrentVersion() %s should not apply", tt.gateName)
+ if (err != nil) != tt.wantErr {
+ t.Fatalf("gateApplicableToCurrentVersion() error = %v, wantErr %t", err, tt.wantErr)
+ }
+ if err == nil && isApplicable != tt.expectedResult {
+ t.Fatalf("gateApplicableToCurrentVersion() = %t, want %t", isApplicable, tt.expectedResult)
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/adminack/adminack_test.go` around lines 436 - 445, The test only checks unexpected errors and false positives; update the assertions around gateApplicableToCurrentVersion so you verify both the error expectation and the boolean result: after calling gateApplicableToCurrentVersion, assert that (err != nil) matches tt.wantErr (use t.Fatalf if they differ), if err != nil return, then assert that isApplicable equals tt.expectedResult (use t.Fatalf with context if they differ); reference the variables gateApplicableToCurrentVersion, tt.wantErr, tt.expectedResult and isApplicable to locate and change the test logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 164-166: The risk-source callback flow can be skipped by the
recent-fetch fast path causing stale risk data; modify the fast-path logic in
syncAvailableUpdates/riskSourceCallback so that a callback-triggered change
forces evaluation regardless of recent upstream fetch. Add a dirty bit or
timestamp on optrAvailableUpdates (e.g., a new field like risksDirtyAt or
lastRiskCallbackAt) that riskSourceCallback sets, then update the early-return
condition in syncAvailableUpdates to bypass the recent-fetch return when that
dirty bit/timestamp is newer than the last fetch; ensure you still clear/update
optrAvailableUpdates.risks and ShouldReconcileAcceptRisks and call
optrAvailableUpdates.evaluateConditionalUpdates(ctx) after bypassing the fast
path so OperatorUpgradeable and alerts are refreshed immediately.
In `@pkg/cvo/cvo.go`:
- Around line 305-317: The admin-ack informers (cmConfigManagedInformer and
cmConfigInformer) used by adminack.New are not being waited on, causing races;
update the operator's informer sync list so Run() waits for those informers
before enabling admin-ack: add cmConfigManagedInformer.HasSynced and
cmConfigInformer.HasSynced (or their informer sync funcs) into optr.cacheSynced
(the slice checked by Run()), so that optr.Run()/cache sync logic only enables
adminack.New after those caches are synced and avoids AdminAckUnknown at
startup.
In `@pkg/cvo/status.go`:
- Around line 620-629: The code currently ignores the per-version applicability
map returned by upgradeable.Risks and only filters explicit Applies=False
subconditions, which causes global risks (e.g., adminack) to wrongly mark
OperatorUpgradeable=false for patch-only clusters; restore and use the
applicability map from upgradeable.Risks (capture the second return value
instead of discarding it), then when iterating over risks skip any risk that the
applicability map shows as not applicable to the requested updateVersions (in
addition to the existing check of
internal.ConditionalUpdateRiskConditionTypeApplies on the risk's Conditions);
reference upgradeable.Risks, updateVersions, risks, risk.Source and
internal.ConditionalUpdateRiskConditionTypeApplies to locate and implement the
change.
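The filtering step described above can be sketched with simplified types; risk, its Applies field, and filterApplicable are pared-down hypothetical stand-ins for configv1.ConditionalUpdateRisk and the per-version map returned by upgradeable.Risks:

```go
package main

import "fmt"

// risk is a pared-down stand-in for configv1.ConditionalUpdateRisk.
type risk struct {
	Name    string
	Applies bool // stand-in for the Applies=True subcondition
}

// filterApplicable keeps only risks that the per-version applicability map
// (the second return value of upgradeable.Risks in the real code) marks as
// relevant to at least one of the requested update versions.
func filterApplicable(risks []risk, applicable map[string][]string, versions []string) []risk {
	var out []risk
	for _, r := range risks {
		if !r.Applies {
			continue // existing explicit Applies=False filter
		}
		for _, v := range versions {
			if contains(applicable[r.Name], v) {
				out = append(out, r)
				break
			}
		}
	}
	return out
}

func contains(list []string, s string) bool {
	for _, x := range list {
		if x == s {
			return true
		}
	}
	return false
}

func main() {
	risks := []risk{{Name: "AdminAckRequired", Applies: true}, {Name: "PodDisruption", Applies: true}}
	// An adminack-style risk only applies to the minor update, not the patch.
	applicable := map[string][]string{
		"AdminAckRequired": {"4.22.0"},
		"PodDisruption":    {"4.21.9", "4.22.0"},
	}
	fmt.Println(filterApplicable(risks, applicable, []string{"4.21.9"}))
}
```

With only the patch version requested, the minor-only risk drops out instead of flipping OperatorUpgradeable=false.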
- Around line 652-666: The code sets Condition.Reason from risks[0].Name which
can contain invalid chars; instead, do not use raw risk.Name — assign a
validated reason constant (e.g., introduce/use a constant like
recommendedReasonSingle, similar to recommendedReasonMultiple) for the
single-risk branch and use that when calling
resourcemerge.SetOperatorStatusCondition on cvStatus.Conditions (Type:
configv1.OperatorUpgradeable); if you need to preserve operator identity, map
risk.Name to a validated token or log the raw name elsewhere but do not publish
it as the Condition.Reason.
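A minimal sketch of the validated-reason idea, assuming the regex quoted later in this review is the constraint to satisfy; recommendedReasonSingle and conditionReason are hypothetical names, not existing CVO symbols:

```go
package main

import (
	"fmt"
	"regexp"
)

// reasonPattern is the Kubernetes condition-reason constraint cited in the
// review: a letter, then letters/digits/underscore/comma/colon, ending in
// a non-separator character.
var reasonPattern = regexp.MustCompile(`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`)

// recommendedReasonSingle is a hypothetical validated constant, analogous
// to the multi-risk constant the review says already exists.
const recommendedReasonSingle = "UpdateRisk"

// conditionReason returns the risk name when it is already a legal reason,
// and falls back to the validated constant otherwise (e.g. for hyphenated
// names like "ClusterOperatorUpgradeable-test-operator").
func conditionReason(riskName string) string {
	if reasonPattern.MatchString(riskName) {
		return riskName
	}
	return recommendedReasonSingle
}

func main() {
	fmt.Println(conditionReason("AdminAckRequired"))
	fmt.Println(conditionReason("ClusterOperatorUpgradeable-test-operator"))
}
```

The raw name can still be logged or carried in the Message, which has no such character restriction.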
In `@pkg/risk/adminack/adminack.go`:
- Around line 195-199: The returned ConditionalUpdateRisk currently sets URL to
a placeholder ("https://example.com/FIXME-look-for-a-URI-in-the-message");
update the code that constructs and returns the configv1.ConditionalUpdateRisk
(the struct built alongside riskName and gateValue in adminack.go) to either set
URL to the real canonical admin-ack documentation link or to an empty string
until a proper docs URL exists so users are not directed to the placeholder;
ensure the change is applied to the ConditionalUpdateRisk.URL field where the
struct is created.
- Around line 5-10: The field a.lastSeen is read and written concurrently by
Risks() and eventHandler() (and workers/informers), causing a data race; protect
it by adding a dedicated mutex (e.g., lastSeenMu sync.RWMutex) on the adminack
struct and use Write locks around all mutations in Risks() and any place that
updates a.lastSeen, and Read locks around all diffs/reads in eventHandler() and
the status/available-updates/informer call sites; ensure you reference
a.lastSeen and the new lastSeenMu in those methods (Risks, eventHandler, and the
updater/informer routines) so every access is guarded.
In `@pkg/risk/deletion/deletion.go`:
- Around line 48-74: The code currently drops deletion blocker risks by
returning nils when deletes are in progress; change the early return in the
block that checks len(risks) to return the found risks instead of nils (i.e.,
return risks, nil, err) so the deletion risk is propagated; update the return at
the len(risks) > 0 branch (around the resourcedelete.DeletesInProgress() check)
from return nil, nil, nil to return risks, nil, err.
In `@pkg/risk/overrides/overrides_test.go`:
- Around line 35-39: The test's changeChannel is unbuffered so early invocations
of changeCallback can block the informer and flake the test; change the channel
allocation (changeChannel) to a buffered channel (e.g., make(chan struct{}, 1))
and keep the changeCallback semantics (sending a struct{} into changeChannel) so
startup notifications can be delivered without blocking; apply the same buffered
pattern to any other test notification channels used in the same file (the ones
drained in the startup notification drain logic) to avoid wedging.
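The buffered-channel-plus-non-blocking-send pattern recommended for these tests can be shown standalone (tryNotify is an illustrative helper, not part of the test code):

```go
package main

import "fmt"

// tryNotify performs the non-blocking send pattern suggested for the test
// callbacks: deliver a notification if there is room, otherwise drop it
// (a pending notification already guarantees a wake-up).
func tryNotify(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	changeChannel := make(chan struct{}, 1) // buffer of 1, as the review suggests
	fmt.Println(tryNotify(changeChannel))   // delivered
	fmt.Println(tryNotify(changeChannel))   // dropped: buffer full
	<-changeChannel                         // drained, as the test's drain loop does
	fmt.Println(tryNotify(changeChannel))   // delivered again
}
```

Because the callback never blocks, an informer firing before the test reaches its receive point can no longer wedge startup.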
In `@pkg/risk/overrides/overrides.go`:
- Around line 23-27: The pointer field sawOverrides on struct overrides is
written on the sync path and read from informer callback goroutines, causing a
data race; protect accesses by replacing sawOverrides with a concurrency-safe
primitive (e.g. use an atomic.Bool or a bool guarded by a sync.Mutex) and update
both the sync-path writer and the informer callback readers to use the chosen
atomic/locked accessors; update all locations that set or test
overrides.sawOverrides so they call the new atomic/locked setter/getter to
ensure no concurrent reads/writes occur.
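One way to realize the atomic option while keeping the tri-state nature of the existing *bool (nil until the first sync) is atomic.Pointer; the overridesFlag type and its methods are a hypothetical sketch, not the actual overrides struct:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// overridesFlag is a hypothetical replacement for the racy sawOverrides
// *bool: atomic.Pointer preserves the tri-state (nil = not yet synced,
// true/false after Risks() has run) while making reads and writes safe
// across the sync path and informer-callback goroutines.
type overridesFlag struct {
	sawOverrides atomic.Pointer[bool]
}

// set records the latest sync-path observation; each call stores a fresh
// pointer, so readers never see a half-written value.
func (o *overridesFlag) set(saw bool) {
	o.sawOverrides.Store(&saw)
}

// changed reports whether an informer event should trigger a refresh:
// always refresh before the first sync, otherwise only on a transition.
func (o *overridesFlag) changed(now bool) bool {
	last := o.sawOverrides.Load()
	return last == nil || *last != now
}

func main() {
	var o overridesFlag
	fmt.Println(o.changed(true)) // true: no sync has happened yet
	o.set(true)
	fmt.Println(o.changed(true))  // false: no transition
	fmt.Println(o.changed(false)) // true: overrides were removed
}
```

A plain atomic.Bool would also work if the nil "unknown" state is not needed; a mutex-guarded field is the simplest alternative if the struct grows more shared state.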
- Around line 58-77: The code currently unconditionally parses
cv.Status.Desired.Version and errors if it's empty; change it to always
construct and return the ConditionalUpdateRisk (o.name, Message, MatchingRules)
but only attempt semver.Parse and compute majorAndMinorUpdates/versionMap when
cv.Status.Desired.Version is non-empty and parse succeeds; if the version string
is empty, skip parsing, set versionMap to nil (or empty) and use a non-versioned
fallback URL for risk.URL (or omit version formatting) so the override risk is
returned during initial-sync without failing on missing Status.Desired.Version.
In `@pkg/risk/updating/updating_test.go`:
- Around line 36-40: The test's changeChannel is unbuffered causing
changeCallback's send to block and potentially stall informer startup; make
changeChannel buffered (e.g., buffer size 1) so early notifications from
changeCallback won't block the sender and the drain loop that reads from
changeChannel in the test can consume them without blocking startup.
In `@pkg/risk/updating/updating.go`:
- Around line 24-29: The struct updating's wasUpdating pointer is accessed from
multiple goroutines (Risks and the informer callback) causing a race; add a
sync.Mutex (or sync.RWMutex) field to the updating struct and import sync, then
wrap every access and mutation of wasUpdating (reads in the informer callback
and writes in Risks and any other goroutines) with the mutex (Lock/Unlock or
RLock/RUnlock) to ensure atomic, synchronized reads/writes and eliminate the
data race.
- Around line 48-57: The Risks method is incorrectly treating any error from
isUpdating() (and thus from the underlying cvLister.Get) as a benign "not
updating" result; update Risks to only convert meta.IsNoMatchError or
apierrors.IsNotFound to a nil/no-risk return but propagate any other error from
isUpdating() back to the caller (i.e., return nil, nil, err) so lister/cache
failures are not swallowed; apply the same change where isUpdating() is checked
later in this file (the other branch around the same isUpdating() call) to
ensure errors are returned rather than silently clearing the update blocker.
In `@pkg/risk/upgradeable/upgradeable.go`:
- Around line 25-30: The upgradeable struct's slice field lastSeen is being
written in Risks and read in the informer callback without synchronization,
causing a data race; add a sync.Mutex (or sync.RWMutex) to upgradeable, import
sync, protect all accesses to lastSeen in the Risks method and the informer
callback (lock before mutating or reading), and when publishing a slice to
readers make a safe copy (or return a copy) so comparisons use a stable
snapshot; reference the upgradeable type, its lastSeen field, and the Risks
method/informer callback when applying the locks and copying.
---
Nitpick comments:
In `@pkg/risk/adminack/adminack_test.go`:
- Around line 436-445: The test only checks unexpected errors and false
positives; update the assertions around gateApplicableToCurrentVersion so you
verify both the error expectation and the boolean result: after calling
gateApplicableToCurrentVersion, assert that (err != nil) matches tt.wantErr (use
t.Fatalf if they differ), if err != nil return, then assert that isApplicable
equals tt.expectedResult (use t.Fatalf with context if they differ); reference
the variables gateApplicableToCurrentVersion, tt.wantErr, tt.expectedResult and
isApplicable to locate and change the test logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1e7bc9e7-cddd-404b-a50d-7806e8b0f3e8
📒 Files selected for processing (26)

- pkg/cvo/availableupdates.go
- pkg/cvo/availableupdates_test.go
- pkg/cvo/cvo.go
- pkg/cvo/cvo_test.go
- pkg/cvo/status.go
- pkg/cvo/status_test.go
- pkg/cvo/upgradeable.go
- pkg/cvo/upgradeable_test.go
- pkg/risk/adminack/adminack.go
- pkg/risk/adminack/adminack_test.go
- pkg/risk/aggregate/aggregate.go
- pkg/risk/aggregate/aggregate_test.go
- pkg/risk/alert/alert.go
- pkg/risk/alert/alert_test.go
- pkg/risk/alert/testdata/alerts.json
- pkg/risk/deletion/deletion.go
- pkg/risk/mock/mock.go
- pkg/risk/overrides/overrides.go
- pkg/risk/overrides/overrides_test.go
- pkg/risk/risk.go
- pkg/risk/risk_test.go
- pkg/risk/updating/updating.go
- pkg/risk/updating/updating_test.go
- pkg/risk/upgradeable/upgradeable.go
- pkg/risk/upgradeable/upgradeable_test.go
- pkg/risk/upgradeable/util.go
💤 Files with no reviewable changes (2)
- pkg/cvo/upgradeable.go
- pkg/cvo/upgradeable_test.go
import (
	"context"
	"fmt"
	"regexp"
	"slices"
	"strings"
Protect lastSeen from concurrent reads and writes.
Risks() mutates a.lastSeen, while eventHandler() diffs against it, and this source is invoked from the status worker, the available-updates worker, and informer goroutines. That is a real data race and can lead to missed or duplicated change notifications.
Suggested fix
import (
"context"
"fmt"
"regexp"
"slices"
"strings"
+ "sync"
@@
type adminAck struct {
+ mu sync.Mutex
name string
currentVersion func() configv1.Release
adminGatesLister listerscorev1.ConfigMapNamespaceLister
adminAcksLister listerscorev1.ConfigMapNamespaceLister
lastSeen []configv1.ConditionalUpdateRisk
}
@@
risks, version, err := a.risks()
if meta.IsNoMatchError(err) {
- a.lastSeen = nil
+ a.mu.Lock()
+ a.lastSeen = nil
+ a.mu.Unlock()
return nil, nil, nil
}
@@
if len(risks) == 0 {
- a.lastSeen = nil
+ a.mu.Lock()
+ a.lastSeen = nil
+ a.mu.Unlock()
return nil, nil, err
}
- a.lastSeen = risks
+ a.mu.Lock()
+ a.lastSeen = append([]configv1.ConditionalUpdateRisk(nil), risks...)
+ a.mu.Unlock()
@@
func (a *adminAck) eventHandler(changeCallback func()) {
risks, _, err := a.risks()
+ a.mu.Lock()
+ lastSeen := append([]configv1.ConditionalUpdateRisk(nil), a.lastSeen...)
+ a.mu.Unlock()
if err != nil {
if changeCallback != nil {
changeCallback()
@@
- if diff := cmp.Diff(a.lastSeen, risks); diff != "" {
+ if diff := cmp.Diff(lastSeen, risks); diff != "" {

Also applies to: 35-40, 67-90, 137-149
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/risk/adminack/adminack.go` around lines 5 - 10, The field a.lastSeen is
read and written concurrently by Risks() and eventHandler() (and
workers/informers), causing a data race; protect it by adding a dedicated mutex
(e.g., lastSeenMu sync.RWMutex) on the adminack struct and use Write locks
around all mutations in Risks() and any place that updates a.lastSeen, and Read
locks around all diffs/reads in eventHandler() and the
status/available-updates/informer call sites; ensure you reference a.lastSeen
and the new lastSeenMu in those methods (Risks, eventHandler, and the
updater/informer routines) so every access is guarded.
changeChannel := make(chan struct{})
changeCallback := func() {
	t.Logf("%s sending change notification", time.Now())
	changeChannel <- struct{}{}
}
Buffer changeChannel so early notifications can't block startup.
The drain loop on Lines 56-66 shows early callbacks are expected. With the unbuffered send on Line 39, that first callback blocks until the test starts receiving, which can stall informer progress and make this test hang or flake.
Suggested fix
- changeChannel := make(chan struct{})
+ changeChannel := make(chan struct{}, 1)
changeCallback := func() {
t.Logf("%s sending change notification", time.Now())
- changeChannel <- struct{}{}
+ select {
+ case changeChannel <- struct{}{}:
+ default:
+ }
}

Also applies to: 56-66
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/risk/updating/updating_test.go` around lines 36 - 40, The test's
changeChannel is unbuffered causing changeCallback's send to block and
potentially stall informer startup; make changeChannel buffered (e.g., buffer
size 1) so early notifications from changeCallback won't block the sender and
the drain loop that reads from changeChannel in the test can consume them
without blocking startup.
type updating struct {
	name        string
	cvName      string
	cvLister    configlistersv1.ClusterVersionLister
	wasUpdating *bool
}
Guard wasUpdating with synchronization.
Line 54 writes u.wasUpdating in Risks, while Lines 93-104 read it from the informer callback. Those paths run on different goroutines, so this is a data race and can produce missed or duplicate notifications.
Suggested fix
type updating struct {
name string
cvName string
cvLister configlistersv1.ClusterVersionLister
+ mu sync.RWMutex
wasUpdating *bool
}
@@
- u.wasUpdating = &updating
+ u.mu.Lock()
+ u.wasUpdating = &updating
+ u.mu.Unlock()
@@
- if u.wasUpdating == nil {
+ u.mu.RLock()
+ wasUpdating := u.wasUpdating
+ u.mu.RUnlock()
+ if wasUpdating == nil {
if changeCallback != nil {
changeCallback()
}
return
}
@@
- if updating != *u.wasUpdating {
- klog.V(2).Infof("ClusterVersion updating changed from %t to %t, triggering override callback", *u.wasUpdating, updating)
+ if updating != *wasUpdating {
+ klog.V(2).Infof("ClusterVersion updating changed from %t to %t, triggering override callback", *wasUpdating, updating)
if changeCallback != nil {
changeCallback()
}
}

Also add sync to the imports.
Also applies to: 54-55, 92-108
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/risk/updating/updating.go` around lines 24 - 29, The struct updating's
wasUpdating pointer is accessed from multiple goroutines (Risks and the informer
callback) causing a race; add a sync.Mutex (or sync.RWMutex) field to the
updating struct and import sync, then wrap every access and mutation of
wasUpdating (reads in the informer callback and writes in Risks and any other
goroutines) with the mutex (Lock/Unlock or RLock/RUnlock) to ensure atomic,
synchronized reads/writes and eliminate the data race.
type upgradeable struct {
	name           string
	currentVersion func() configv1.Release
	coLister       configlistersv1.ClusterOperatorLister
	lastSeen       []configv1.ConditionalUpdateRisk
}
Synchronize lastSeen before diffing informer updates.
Lines 52, 57, and 61 write u.lastSeen in Risks, while Line 122 reads it from the informer callback. Sharing that slice header without a lock is a race and can make change detection compare torn state.
Suggested fix
type upgradeable struct {
name string
currentVersion func() configv1.Release
coLister configlistersv1.ClusterOperatorLister
+ mu sync.RWMutex
lastSeen []configv1.ConditionalUpdateRisk
}
@@
- u.lastSeen = nil
+ u.mu.Lock()
+ u.lastSeen = nil
+ u.mu.Unlock()
return nil, nil, nil
}
if len(risks) == 0 {
- u.lastSeen = nil
+ u.mu.Lock()
+ u.lastSeen = nil
+ u.mu.Unlock()
return nil, nil, err
}
- u.lastSeen = risks
+ u.mu.Lock()
+ u.lastSeen = append([]configv1.ConditionalUpdateRisk(nil), risks...)
+ u.mu.Unlock()
@@
- if diff := cmp.Diff(u.lastSeen, risks); diff != "" {
+ u.mu.RLock()
+ lastSeen := append([]configv1.ConditionalUpdateRisk(nil), u.lastSeen...)
+ u.mu.RUnlock()
+ if diff := cmp.Diff(lastSeen, risks); diff != "" {
klog.V(2).Infof("ClusterOperator Upgradeable changed (-old +new):\n%s", diff)
if changeCallback != nil {
changeCallback()

Also add sync to the imports.
Also applies to: 49-61, 114-127
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/risk/upgradeable/upgradeable.go` around lines 25 - 30, The upgradeable
struct's slice field lastSeen is being written in Risks and read in the informer
callback without synchronization, causing a data race; add a sync.Mutex (or
sync.RWMutex) to upgradeable, import sync, protect all accesses to lastSeen in
the Risks method and the informer callback (lock before mutating or reading),
and when publishing a slice to readers make a safe copy (or return a copy) so
comparisons use a stable snapshot; reference the upgradeable type, its lastSeen
field, and the Risks method/informer callback when applying the locks and
copying.
Force-pushed from c0e4fd2 to cba244e.
Actionable comments posted: 2
♻️ Duplicate comments (5)
pkg/risk/overrides/overrides_test.go (1)
36-40: ⚠️ Potential issue | 🟠 Major

Unbuffered channel can cause test flakiness.

The unbuffered changeChannel with blocking send can wedge the test if informer callbacks fire before the test reaches a receive point. Buffer the channel or use a non-blocking send pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/overrides/overrides_test.go` around lines 36 - 40, The test uses an unbuffered changeChannel and a blocking send in changeCallback which can hang if the informer fires before the test is ready; make the channel buffered (e.g., changeChannel := make(chan struct{}, 1)) or change changeCallback to perform a non-blocking send (use a select with a default case) so the send won't block the informer callback; update references to changeChannel and changeCallback in overrides_test.go accordingly.

pkg/cvo/status.go (2)
652-653: ⚠️ Potential issue | 🟠 Major

Raw risk name may violate Kubernetes condition-reason pattern.

Risk names can contain hyphens (e.g., ClusterOperatorUpgradeable-test-operator), which violates the K8s reason pattern ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$. The multi-risk case correctly uses a validated constant; apply similar validation for single-risk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cvo/status.go` around lines 652 - 653, The single-risk branch assigns reason := risks[0].Name which can contain hyphens and violate the Kubernetes condition-reason regex; change the code that sets reason and message for the single-risk case to use the same validated/sanitized constant or validation logic used in the multi-risk path (sanitize risks[0].Name to match ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$ or replace it with the existing validated reason constant), ensuring the reason starts with a letter and contains only allowed characters while keeping message := risks[0].Message as before.
620-629: ⚠️ Potential issue | 🟠 Major

Per-version applicability map is discarded.

The version map returned by upgradeable.Risks() is dropped, so risks that only apply to major/minor updates (e.g., from adminack.go or overrides.go) will still set Upgradeable=False even when all available updates are patch-only.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cvo/status.go` around lines 620 - 629, The call to upgradeable.Risks(ctx, updateVersions) returns a per-version applicability map that is currently discarded; capture that second return value (e.g., perVersionMap or applicableByVersion) and use it when filtering risks so you only exclude risks that truly don't apply to the provided updateVersions. Specifically, change the assignment to capture the map from upgradeable.Risks, then in the loop that currently checks meta.FindStatusCondition(risks[i].Conditions, internal.ConditionalUpdateRiskConditionTypeApplies) also consult the captured map for risks[i] to skip risks that are not applicable to the requested updateVersions (so patch-only updates won't be blocked by major/minor-only risks).

pkg/risk/overrides/overrides.go (2)
59-62: ⚠️ Potential issue | 🟠 Major

Hard-fail on empty Status.Desired.Version prevents surfacing override risk during initial sync.

When cv.Status.Desired.Version is empty (common during initial sync), this returns an error instead of still returning the override risk. The risk should be surfaced with a fallback URL or without version-specific filtering.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/overrides/overrides.go` around lines 59 - 62, The code currently calls semver.Parse on cv.Status.Desired.Version and returns an error if parsing fails, which hard-fails when the version string is empty; change the logic in the function containing this code so that if cv.Status.Desired.Version == "" you do not call semver.Parse and instead proceed to compute and return the override risk (using a non-version-specific path or fallback URL), but keep the existing fmt.Errorf return for genuine parse errors when the version is non-empty; reference cv.Status.Desired.Version and semver.Parse in your change and ensure downstream filtering logic that used the parsed version can handle a nil/absent version case.
24-29: ⚠️ Potential issue | 🟠 Major

Data race on sawOverrides field remains unaddressed.

The sawOverrides pointer is written at line 54 during Risks() and read at lines 97/107 from informer callbacks without synchronization. This can cause missed or duplicate refresh triggers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/overrides/overrides.go` around lines 24 - 29, The sawOverrides pointer on the overrides struct is accessed concurrently (written in Risks() and read in informer callbacks) causing a race; make the flag concurrency-safe by either replacing sawOverrides *bool with an atomic boolean (e.g., sync/atomic's atomic.Bool or an int32 used with atomic.Store/Load) or add a sync.Mutex (e.g., sawOverridesMu) and guard every read/write. Update all places that set or test the flag (Risks() and the informer callback handlers that currently read sawOverrides) to use the new atomic API or acquire the mutex before accessing the flag so reads/writes are synchronized.
🧹 Nitpick comments (1)
pkg/risk/adminack/adminack_test.go (1)
198-233: Avoid order-dependent shared state across table cases.

The suite mutates one shared fake client/informer and relies on delete scenarios being last (Line 198 and Line 217 comments). This creates brittle coupling between cases and makes future reordering/additions risky. Prefer per-t.Run fixture setup (or explicit reseed of both ConfigMaps at each case start).

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 285-316
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/adminack/adminack_test.go` around lines 198 - 233, The table-driven tests mutate a shared fake client/informer and rely on delete cases being last (uses gateCm and ackCm in pkg/risk/adminack/adminack_test.go), so change each test case to create its own fresh fixture: inside each t.Run recreate the fake client/informer and seed both ConfigMaps (admin-gates and admin-acks) from the table entry (gateCm, ackCm) or explicitly reseed them before invoking the code under test, rather than relying on global/shared state; ensure expectedRisks/expectedError assertions run against the result of the isolated per-case setup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/risk/adminack/adminack_test.go`:
- Around line 439-447: The test currently fails to assert two cases: when
tt.wantErr is true but err is nil, and when tt.expectedResult is true but
isApplicable is false; inside the gateApplicableToCurrentVersion() test after
calling the function, keep the existing unexpected-error check (if err != nil &&
!tt.wantErr ...) and early-return on err, then add explicit checks that if
tt.wantErr && err == nil { t.Fatalf("expected error for %s but got nil",
tt.gateName) } and if tt.expectedResult && !isApplicable {
t.Fatalf("gateApplicableToCurrentVersion() %s should apply", tt.gateName) } to
cover both missing-failure paths (use existing symbols tt.wantErr, err,
tt.expectedResult, isApplicable, tt.gateName).
In `@pkg/risk/upgradeable/upgradeable_test.go`:
- Around line 42-46: The test currently uses an unbuffered channel changeChannel
and a blocking send in changeCallback which can deadlock if the callback runs
before the test is ready; change changeChannel to a buffered channel (size 1)
and make the send in changeCallback non-blocking (use a select with a default
branch) so the callback never blocks the goroutine—update the references to
changeChannel and changeCallback in upgradeable_test.go accordingly.
---
Duplicate comments:
In `@pkg/cvo/status.go`:
- Around line 652-653: The single-risk branch assigns reason := risks[0].Name
which can contain hyphens and violate the Kubernetes condition-reason regex;
change the code that sets reason and message for the single-risk case to use the
same validated/sanitized constant or validation logic used in the multi-risk
path (sanitize risks[0].Name to match ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
or replace it with the existing validated reason constant), ensuring the reason
starts with a letter and contains only allowed characters while keeping message
:= risks[0].Message as before.
- Around line 620-629: The call to upgradeable.Risks(ctx, updateVersions)
returns a per-version applicability map that is currently discarded; capture
that second return value (e.g., perVersionMap or applicableByVersion) and use it
when filtering risks so you only exclude risks that truly don't apply to the
provided updateVersions. Specifically, change the assignment to capture the map
from upgradeable.Risks, then in the loop that currently checks
meta.FindStatusCondition(risks[i].Conditions,
internal.ConditionalUpdateRiskConditionTypeApplies) also consult the captured
map for risks[i] to skip risks that are not applicable to the requested
updateVersions (so patch-only updates won’t be blocked by major/minor-only
risks).
In `@pkg/risk/overrides/overrides_test.go`:
- Around line 36-40: The test uses an unbuffered changeChannel and a blocking
send in changeCallback which can hang if the informer fires before the test is
ready; make the channel buffered (e.g., changeChannel := make(chan struct{}, 1)
) or change changeCallback to perform a non-blocking send (use a select with a
default case) so the send won't block the informer callback; update references
to changeChannel and changeCallback in overrides_test.go accordingly.
In `@pkg/risk/overrides/overrides.go`:
- Around line 59-62: The code currently calls semver.Parse on
cv.Status.Desired.Version and returns an error if parsing fails, which
hard-fails when the version string is empty; change the logic in the function
containing this code so that if cv.Status.Desired.Version == "" you do not call
semver.Parse and instead proceed to compute and return the override risk (using
a non-version-specific path or fallback URL), but keep the existing fmt.Errorf
return for genuine parse errors when the version is non-empty; reference
cv.Status.Desired.Version and semver.Parse in your change and ensure downstream
filtering logic that used the parsed version can handle a nil/absent version
case.
- Around line 24-29: The sawOverrides pointer on the overrides struct is
accessed concurrently (written in Risks() and read in informer callbacks)
causing a race; make the flag concurrency-safe by either replacing sawOverrides
*bool with an atomic boolean (e.g., sync/atomic's atomic.Bool or an int32 used
with atomic.Store/Load) or add a sync.Mutex (e.g., sawOverridesMu) and guard
every read/write. Update all places that set or test the flag (Risks() and the
informer callback handlers that currently read sawOverrides) to use the new
atomic API or acquire the mutex before accessing the flag so reads/writes are
synchronized.
---
Nitpick comments:
In `@pkg/risk/adminack/adminack_test.go`:
- Around line 198-233: The table-driven tests mutate a shared fake
client/informer and rely on delete cases being last (uses gateCm and ackCm in
pkg/risk/adminack/adminack_test.go), so change each test case to create its own
fresh fixture: inside each t.Run recreate the fake client/informer and seed both
ConfigMaps (admin-gates and admin-acks) from the table entry (gateCm, ackCm) or
explicitly reseed them before invoking the code under test, rather than relying
on global/shared state; ensure expectedRisks/expectedError assertions run
against the result of the isolated per-case setup.
📒 Files selected for processing (20)
- pkg/cvo/cvo.go
- pkg/cvo/cvo_test.go
- pkg/cvo/status.go
- pkg/cvo/status_test.go
- pkg/cvo/upgradeable.go
- pkg/cvo/upgradeable_test.go
- pkg/risk/adminack/adminack.go
- pkg/risk/adminack/adminack_test.go
- pkg/risk/aggregate/aggregate.go
- pkg/risk/aggregate/aggregate_test.go
- pkg/risk/deletion/deletion.go
- pkg/risk/overrides/overrides.go
- pkg/risk/overrides/overrides_test.go
- pkg/risk/risk.go
- pkg/risk/risk_test.go
- pkg/risk/updating/updating.go
- pkg/risk/updating/updating_test.go
- pkg/risk/upgradeable/upgradeable.go
- pkg/risk/upgradeable/upgradeable_test.go
- pkg/risk/upgradeable/util.go
💤 Files with no reviewable changes (2)
- pkg/cvo/upgradeable_test.go
- pkg/cvo/upgradeable.go
✅ Files skipped from review due to trivial changes (4)
- pkg/cvo/status_test.go
- pkg/risk/upgradeable/util.go
- pkg/risk/risk_test.go
- pkg/risk/updating/updating_test.go
🚧 Files skipped from review as they are similar to previous changes (8)
- pkg/risk/aggregate/aggregate.go
- pkg/risk/deletion/deletion.go
- pkg/risk/updating/updating.go
- pkg/risk/upgradeable/upgradeable.go
- pkg/risk/adminack/adminack.go
- pkg/risk/risk.go
- pkg/cvo/cvo_test.go
- pkg/cvo/cvo.go
```go
changeChannel := make(chan struct{})
changeCallback := func() {
	t.Logf("%s sending change notification", time.Now())
	changeChannel <- struct{}{}
}
```
Buffer changeChannel to prevent test flakiness.
Same issue as overrides_test.go: the unbuffered channel with a blocking send can deadlock or flake if the callback is invoked before the test reaches a receive. Use a buffered channel with non-blocking send.
Suggested fix

```diff
-	changeChannel := make(chan struct{})
+	changeChannel := make(chan struct{}, 1)
 	changeCallback := func() {
 		t.Logf("%s sending change notification", time.Now())
-		changeChannel <- struct{}{}
+		select {
+		case changeChannel <- struct{}{}:
+		default:
+		}
 	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/risk/upgradeable/upgradeable_test.go` around lines 42 - 46, The test
currently uses an unbuffered channel changeChannel and a blocking send in
changeCallback which can deadlock if the callback runs before the test is ready;
change changeChannel to a buffered channel (size 1) and make the send in
changeCallback non-blocking (use a select with a default branch) so the callback
never blocks the goroutine—update the references to changeChannel and
changeCallback in upgradeable_test.go accordingly.
♻️ Duplicate comments (2)
pkg/risk/adminack/adminack.go (2)
35-42: ⚠️ Potential issue | 🔴 Critical

Protect `lastSeen` with a mutex across worker and informer paths.
`Risks()` mutates `a.lastSeen` while `eventHandler()` reads it from informer goroutines, which creates a real data race (lines 77/82/86 vs line 156).

Suggested fix
```diff
 import (
 	"context"
 	"fmt"
 	"regexp"
 	"slices"
 	"strings"
+	"sync"
@@
 type adminAck struct {
+	lastSeenMu       sync.RWMutex
 	name             string
 	currentVersion   func() configv1.Release
 	adminGatesLister listerscorev1.ConfigMapNamespaceLister
 	adminAcksLister  listerscorev1.ConfigMapNamespaceLister
 	hasSynced        func() bool
 	lastSeen         []configv1.ConditionalUpdateRisk
 }
@@
 	if meta.IsNoMatchError(err) {
-		a.lastSeen = nil
+		a.lastSeenMu.Lock()
+		a.lastSeen = nil
+		a.lastSeenMu.Unlock()
 		return nil, nil, nil
 	}
@@
 	if len(risks) == 0 {
-		a.lastSeen = nil
+		a.lastSeenMu.Lock()
+		a.lastSeen = nil
+		a.lastSeenMu.Unlock()
 		return nil, nil, err
 	}
-	a.lastSeen = risks
+	a.lastSeenMu.Lock()
+	a.lastSeen = append([]configv1.ConditionalUpdateRisk(nil), risks...)
+	a.lastSeenMu.Unlock()
@@
 func (a *adminAck) eventHandler(changeCallback func()) {
 	risks, _, err := a.risks()
@@
-	if diff := cmp.Diff(a.lastSeen, risks); diff != "" {
+	a.lastSeenMu.RLock()
+	lastSeen := append([]configv1.ConditionalUpdateRisk(nil), a.lastSeen...)
+	a.lastSeenMu.RUnlock()
+	if diff := cmp.Diff(lastSeen, risks); diff != "" {
```

Also applies to: 75-86, 148-157
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/adminack/adminack.go` around lines 35 - 42, The struct adminAck uses lastSeen concurrently from Risks() (which mutates a.lastSeen) and eventHandler() (which reads it), causing a data race; add a sync.Mutex (or sync.RWMutex) field to adminAck (e.g., lastSeenMu) and guard all accesses: lock before reading/writing lastSeen in Risks(), eventHandler(), and any other methods that touch lastSeen (use RLock/RUnlock for readers and Lock/Unlock for writers) to ensure safe concurrent access.
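The lock-plus-snapshot pattern that prompt describes, reduced to a self-contained sketch — `source` and `[]string` stand in for the real struct and `[]configv1.ConditionalUpdateRisk`:

```go
package main

import (
	"fmt"
	"sync"
)

// source guards lastSeen with an RWMutex: writers replace it under the write
// lock, readers copy it under the read lock and diff the copy, so a later
// write can never tear a comparison in flight.
type source struct {
	mu       sync.RWMutex
	lastSeen []string
}

func (s *source) store(risks []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastSeen = append([]string(nil), risks...) // defensive copy on write
}

func (s *source) snapshot() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return append([]string(nil), s.lastSeen...) // stable copy for the reader
}

func main() {
	s := &source{}
	s.store([]string{"RiskA"})
	snap := s.snapshot()
	s.store([]string{"RiskB"}) // the later write does not mutate the snapshot
	fmt.Println(snap[0])
}
```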
206-210: ⚠️ Potential issue | 🟡 Minor

Replace the placeholder `ConditionalUpdateRisk.URL`.

Line 209 still returns a placeholder `https://example.com/...` to end users. Use a real canonical docs link, or leave URL empty until one exists.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/adminack/adminack.go` around lines 206 - 210, The returned ConditionalUpdateRisk currently sets URL to a placeholder string; update the URL field in the configv1.ConditionalUpdateRisk literal (the struct returned in adminack.go) to either a real canonical documentation/help link or an empty string until a proper URL exists—locate the instantiation of configv1.ConditionalUpdateRisk (the return that includes Name: riskName, Message: gateValue) and replace the placeholder value for URL accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/risk/adminack/adminack.go`:
- Around line 35-42: The struct adminAck uses lastSeen concurrently from Risks()
(which mutates a.lastSeen) and eventHandler() (which reads it), causing a data
race; add a sync.Mutex (or sync.RWMutex) field to adminAck (e.g., lastSeenMu)
and guard all accesses: lock before reading/writing lastSeen in Risks(),
eventHandler(), and any other methods that touch lastSeen (use RLock/RUnlock for
readers and Lock/Unlock for writers) to ensure safe concurrent access.
- Around line 206-210: The returned ConditionalUpdateRisk currently sets URL to
a placeholder string; update the URL field in the configv1.ConditionalUpdateRisk
literal (the struct returned in adminack.go) to either a real canonical
documentation/help link or an empty string until a proper URL exists—locate the
instantiation of configv1.ConditionalUpdateRisk (the return that includes Name:
riskName, Message: gateValue) and replace the placeholder value for URL
accordingly.
📒 Files selected for processing (2)
- pkg/cvo/cvo.go
- pkg/risk/adminack/adminack.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/cvo/cvo.go
b41f2a4 to be896d7
Actionable comments posted: 2
♻️ Duplicate comments (7)
pkg/risk/upgradeable/upgradeable.go (1)
25-30: ⚠️ Potential issue | 🔴 Critical

Guard `lastSeen` with synchronization and compare a snapshot.

`Risks()` writes this slice while `eventHandler()` diffs against it. Without a mutex and a copied snapshot, this is a data race and change detection can miss or duplicate callbacks.

Also applies to: 49-61, 114-123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/upgradeable/upgradeable.go` around lines 25 - 30, The slice field lastSeen on struct upgradeable is accessed concurrently by Risks() (which writes) and eventHandler() (which reads/diffs), so add a sync.RWMutex (e.g. lastSeenMu) to upgradeable and use it to guard all accesses: acquire write lock in Risks() when modifying/replacing u.lastSeen (replace the slice atomically, do not mutate in place) and acquire read lock in eventHandler() to take a copied snapshot (e.g. snap := append([]configv1.ConditionalUpdateRisk(nil), u.lastSeen...) ) before doing comparisons; ensure all other reads/writes to lastSeen use the mutex to avoid races and missed/duplicate callbacks.pkg/risk/adminack/adminack.go (1)
36-42: ⚠️ Potential issue | 🔴 Critical

Protect `lastSeen` from concurrent reads and writes.

`Risks()` updates this slice while informer callbacks diff against it. Without synchronization and a stable snapshot, callback decisions become racy.

Also applies to: 68-80, 138-147
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/adminack/adminack.go` around lines 36 - 42, The adminAck.lastSeen slice is accessed concurrently (Risks() and informer callbacks) and must be protected: add a sync.RWMutex field to the adminAck struct (e.g., mu sync.RWMutex) and use mu.Lock()/mu.Unlock() when mutating lastSeen and mu.RLock()/mu.RUnlock() when reading; when returning or diffing use a protected shallow copy (copy into a new []configv1.ConditionalUpdateRisk) to provide a stable snapshot for informer callbacks and ensure all places that touch lastSeen (including the Risks() method and the informer callback handlers referenced around lines 68-80 and 138-147) are updated to acquire the appropriate lock and operate on the copied slice.pkg/risk/overrides/overrides.go (2)
24-29: ⚠️ Potential issue | 🔴 Critical

Synchronize `sawOverrides` before informer callbacks read it.

`Risks()` writes this field while `eventHandler()` reads and dereferences it from informer goroutines. That race can miss or duplicate refreshes and leave the override risk stale.

Also applies to: 54-55, 96-108
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/overrides/overrides.go` around lines 24 - 29, The field overrides.sawOverrides is written in Risks() and read/dereferenced in eventHandler() from informer goroutines causing a data race; protect all reads/writes by adding synchronization (e.g. a sync.Mutex or sync.RWMutex) to the overrides struct and use it when accessing sawOverrides in Risks(), in the informer callback/eventHandler(), and any other helper that touches sawOverrides (also at the other mentioned locations around lines 54-55 and 96-108); alternatively you can replace the pointer-without-lock pattern with an atomic-safe type (atomic.Value or a dedicated atomic.Bool/atomic.Pointer) but be sure to update all places that read or write sawOverrides to use the chosen synchronization mechanism so no goroutine accesses the bool without holding the lock or using the atomic API.
59-78: ⚠️ Potential issue | 🟠 Major

Return the override risk even when `Status.Desired.Version` is empty.

Initial-sync `ClusterVersion` states can have only `Status.Desired.Image` populated. Hard-failing on `semver.Parse` here suppresses the override blocker instead of returning the risk and leaving `versionMap` unset until the version is known.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/overrides/overrides.go` around lines 59 - 78, Handle empty Desired.Version before calling semver.Parse: if cv.Status.Desired.Version == "" then still construct the configv1.ConditionalUpdateRisk (using o.name, Message, and a generic or versionless URL) and return it with a nil/empty versionMap instead of erroring; otherwise continue to call semver.Parse and follow the existing flow (use semver.Parse result to build the URL and compute majorAndMinorUpdates into versionMap). Update the logic around semver.Parse, risk, majorAndMinorUpdates, and versionMap to ensure the risk is always returned even when Status.Desired.Version is empty.pkg/risk/deletion/deletion.go (1)
60-61: ⚠️ Potential issue | 🔴 Critical

Do not drop the deletion risk on the non-empty path.
This condition is inverted. When deletions are in progress, the function returns nils and suppresses the blocker entirely.
Suggested fix
```diff
-	if len(risks) > 0 {
-		return nil, nil, nil
+	if len(risks) == 0 {
+		return nil, nil, nil
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/deletion/deletion.go` around lines 60 - 61, The current early-return drops detected deletion risks by returning all nils when len(risks) > 0; instead, preserve and return the risks. Change the branch that now does "if len(risks) > 0 { return nil, nil, nil }" to return the risks according to the function signature (e.g., "return nil, risks, nil" if the signature is (blockers, risks, error)), so the detected deletion risks are propagated; keep the same error handling otherwise and adjust to the actual named function in pkg/risk/deletion/deletion.go where the "risks" variable is computed.pkg/risk/updating/updating.go (1)
25-30: ⚠️ Potential issue | 🔴 Critical

Synchronize `wasUpdating` before callbacks dereference it.

`Risks()` writes `wasUpdating` and `eventHandler()` reads it from informer goroutines. That race can miss or duplicate enqueueing and leave the updating risk stale.

Also applies to: 55-56, 93-105
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/updating/updating.go` around lines 25 - 30, The code races on the updating.wasUpdating pointer because Risks() writes it and eventHandler()/informer goroutines read it; synchronize access by replacing the raw *bool with a guarded value (e.g., add a sync.Mutex and a bool field or use sync/atomic (atomic.Bool)) and expose small accessor methods (e.g., setWasUpdating(bool) and isWasUpdating() bool) so Risks() sets the value under the lock/atomic store and eventHandler()/other informer callbacks read it via the accessor under the lock/atomic load; update references in Risks(), eventHandler(), and any other uses (lines referenced around struct updating, Risks(), and eventHandler()) to use these accessors instead of dereferencing wasUpdating directly.pkg/cvo/status.go (1)
652-668: ⚠️ Potential issue | 🟠 Major

Validate the reason before using it as a Condition.Reason.
The single-risk case at line 652 uses `risks[0].Name` directly as the Condition.Reason. Risk names from the upgradeable source can include operator names with hyphens (e.g., `ClusterOperatorUpgradeable-test-operator`), which violates the Kubernetes condition-reason pattern `^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`.
recommendedReasonMultiple). Apply the same pattern to the single-risk case.Suggested fix
- reason := risks[0].Name + reason := risks[0].Name + if !reasonPattern.MatchString(reason) { + reason = recommendedReasonExposed + } message := risks[0].MessageRun the following script to check the actual risk names generated by the upgradeable source:
#!/bin/bash # Check what risk names are generated in the upgradeable risk package tests rg -n 'Name:\s*"[^"]*-' pkg/risk/upgradeable/ -A 2 -B 2 # Also check the reasonPattern definition rg -n 'reasonPattern' pkg/cvo/🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cvo/status.go` around lines 652 - 668, The single-risk branch currently uses risks[0].Name directly as the Condition.Reason (violates Kubernetes reasonPattern); change it to use a validated constant like the multiple-risk branch does (introduce or reuse a constant such as recommendedReasonSingle) instead of risks[0].Name, and ensure the chosen constant matches the Kubernetes reason regex; update the code around risks, recommendedReasonMultiple, resourcemerge.SetOperatorStatusCondition and configv1.OperatorUpgradeable to set Reason to that validated constant for the single-risk path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/risk/upgradeable/upgradeable.go`:
- Around line 36-39: The event handler registration currently only calls
source.eventHandler(changeCallback) for AddFunc and UpdateFunc, so deletions of
ClusterOperator are ignored; modify the cache.ResourceEventHandlerFuncs passed
to coInformer.Informer().AddEventHandler to include DeleteFunc: set DeleteFunc:
func(_ interface{}) { source.eventHandler(changeCallback) } so deletions trigger
recomputation (which risks() relies on) just like adds/updates.
- Around line 87-111: The risks slice returned from the method that calls
u.coLister.List is not deterministic; before returning (where risks is assembled
using resourcemerge.FindOperatorStatusCondition and appended with entries named
fmt.Sprintf("%s-%s", u.name, op.ObjectMeta.Name)), sort the risks slice by the
Name field (use sort.SliceStable or sort.Slice with a comparator comparing
risks[i].Name < risks[j].Name) so the ordering is stable (this will also
stabilize u.lastSeen comparisons in the eventHandler).
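The suggested sort, reduced to a runnable sketch — the `risk` struct here is a trimmed stand-in for `configv1.ConditionalUpdateRisk`:

```go
package main

import (
	"fmt"
	"sort"
)

type risk struct{ Name string }

// sortRisks gives the lister's arbitrary ordering a deterministic shape, as
// the prompt suggests, so later diffs against a cached copy do not flag mere
// reorderings as changes.
func sortRisks(risks []risk) {
	sort.SliceStable(risks, func(i, j int) bool { return risks[i].Name < risks[j].Name })
}

func main() {
	risks := []risk{{"Upgradeable-etcd"}, {"Upgradeable-auth"}, {"Upgradeable-dns"}}
	sortRisks(risks)
	fmt.Println(risks[0].Name, risks[2].Name)
}
```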
---
Duplicate comments:
In `@pkg/cvo/status.go`:
- Around line 652-668: The single-risk branch currently uses risks[0].Name
directly as the Condition.Reason (violates Kubernetes reasonPattern); change it
to use a validated constant like the multiple-risk branch does (introduce or
reuse a constant such as recommendedReasonSingle) instead of risks[0].Name, and
ensure the chosen constant matches the Kubernetes reason regex; update the code
around risks, recommendedReasonMultiple,
resourcemerge.SetOperatorStatusCondition and configv1.OperatorUpgradeable to set
Reason to that validated constant for the single-risk path.
In `@pkg/risk/adminack/adminack.go`:
- Around line 36-42: The adminAck.lastSeen slice is accessed concurrently
(Risks() and informer callbacks) and must be protected: add a sync.RWMutex field
to the adminAck struct (e.g., mu sync.RWMutex) and use mu.Lock()/mu.Unlock()
when mutating lastSeen and mu.RLock()/mu.RUnlock() when reading; when returning
or diffing use a protected shallow copy (copy into a new
[]configv1.ConditionalUpdateRisk) to provide a stable snapshot for informer
callbacks and ensure all places that touch lastSeen (including the Risks()
method and the informer callback handlers referenced around lines 68-80 and
138-147) are updated to acquire the appropriate lock and operate on the copied
slice.
In `@pkg/risk/deletion/deletion.go`:
- Around line 60-61: The current early-return drops detected deletion risks by
returning all nils when len(risks) > 0; instead, preserve and return the risks.
Change the branch that now does "if len(risks) > 0 { return nil, nil, nil }" to
return the risks according to the function signature (e.g., "return nil, risks,
nil" if the signature is (blockers, risks, error)), so the detected deletion
risks are propagated; keep the same error handling otherwise and adjust to the
actual named function in pkg/risk/deletion/deletion.go where the "risks"
variable is computed.
In `@pkg/risk/overrides/overrides.go`:
- Around line 24-29: The field overrides.sawOverrides is written in Risks() and
read/dereferenced in eventHandler() from informer goroutines causing a data
race; protect all reads/writes by adding synchronization (e.g. a sync.Mutex or
sync.RWMutex) to the overrides struct and use it when accessing sawOverrides in
Risks(), in the informer callback/eventHandler(), and any other helper that
touches sawOverrides (also at the other mentioned locations around lines 54-55
and 96-108); alternatively you can replace the pointer-without-lock pattern with
an atomic-safe type (atomic.Value or a dedicated atomic.Bool/atomic.Pointer) but
be sure to update all places that read or write sawOverrides to use the chosen
synchronization mechanism so no goroutine accesses the bool without holding the
lock or using the atomic API.
- Around line 59-78: Handle empty Desired.Version before calling semver.Parse:
if cv.Status.Desired.Version == "" then still construct the
configv1.ConditionalUpdateRisk (using o.name, Message, and a generic or
versionless URL) and return it with a nil/empty versionMap instead of erroring;
otherwise continue to call semver.Parse and follow the existing flow (use
semver.Parse result to build the URL and compute majorAndMinorUpdates into
versionMap). Update the logic around semver.Parse, risk, majorAndMinorUpdates,
and versionMap to ensure the risk is always returned even when
Status.Desired.Version is empty.
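A sketch of that guard, with `parseMajorMinor` standing in for `semver.Parse` and an assumed fallback URL so the block stays self-contained:

```go
package main

import (
	"fmt"
	"strings"
)

// parseMajorMinor is a stdlib stand-in for semver.Parse, kept only so the
// sketch runs without third-party imports.
func parseMajorMinor(v string) (string, error) {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return "", fmt.Errorf("cannot parse %q as major.minor", v)
	}
	return parts[0] + "." + parts[1], nil
}

// riskURL treats an empty desired version as expected during initial sync:
// it returns the override risk's versionless fallback URL instead of failing,
// and only errors on a genuinely unparsable non-empty version.
func riskURL(desiredVersion string) (string, error) {
	if desiredVersion == "" {
		return "https://docs.example.com/overrides", nil // hypothetical fallback
	}
	mm, err := parseMajorMinor(desiredVersion)
	if err != nil {
		return "", err
	}
	return "https://docs.example.com/" + mm + "/overrides", nil
}

func main() {
	u, _ := riskURL("")
	fmt.Println(u)
	u, _ = riskURL("4.22.3")
	fmt.Println(u)
}
```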
In `@pkg/risk/updating/updating.go`:
- Around line 25-30: The code races on the updating.wasUpdating pointer because
Risks() writes it and eventHandler()/informer goroutines read it; synchronize
access by replacing the raw *bool with a guarded value (e.g., add a sync.Mutex
and a bool field or use sync/atomic (atomic.Bool)) and expose small accessor
methods (e.g., setWasUpdating(bool) and isWasUpdating() bool) so Risks() sets
the value under the lock/atomic store and eventHandler()/other informer
callbacks read it via the accessor under the lock/atomic load; update references
in Risks(), eventHandler(), and any other uses (lines referenced around struct
updating, Risks(), and eventHandler()) to use these accessors instead of
dereferencing wasUpdating directly.
In `@pkg/risk/upgradeable/upgradeable.go`:
- Around line 25-30: The slice field lastSeen on struct upgradeable is accessed
concurrently by Risks() (which writes) and eventHandler() (which reads/diffs),
so add a sync.RWMutex (e.g. lastSeenMu) to upgradeable and use it to guard all
accesses: acquire write lock in Risks() when modifying/replacing u.lastSeen
(replace the slice atomically, do not mutate in place) and acquire read lock in
eventHandler() to take a copied snapshot (e.g. snap :=
append([]configv1.ConditionalUpdateRisk(nil), u.lastSeen...) ) before doing
comparisons; ensure all other reads/writes to lastSeen use the mutex to avoid
races and missed/duplicate callbacks.
📒 Files selected for processing (20)
- pkg/cvo/cvo.go
- pkg/cvo/cvo_test.go
- pkg/cvo/status.go
- pkg/cvo/status_test.go
- pkg/cvo/upgradeable.go
- pkg/cvo/upgradeable_test.go
- pkg/risk/adminack/adminack.go
- pkg/risk/adminack/adminack_test.go
- pkg/risk/aggregate/aggregate.go
- pkg/risk/aggregate/aggregate_test.go
- pkg/risk/deletion/deletion.go
- pkg/risk/overrides/overrides.go
- pkg/risk/overrides/overrides_test.go
- pkg/risk/risk.go
- pkg/risk/risk_test.go
- pkg/risk/updating/updating.go
- pkg/risk/updating/updating_test.go
- pkg/risk/upgradeable/upgradeable.go
- pkg/risk/upgradeable/upgradeable_test.go
- pkg/risk/upgradeable/util.go
💤 Files with no reviewable changes (2)
- pkg/cvo/upgradeable_test.go
- pkg/cvo/upgradeable.go
✅ Files skipped from review due to trivial changes (7)
- pkg/cvo/status_test.go
- pkg/risk/aggregate/aggregate_test.go
- pkg/risk/risk_test.go
- pkg/risk/updating/updating_test.go
- pkg/risk/adminack/adminack_test.go
- pkg/risk/upgradeable/upgradeable_test.go
- pkg/risk/overrides/overrides_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/risk/upgradeable/util.go
This code is from 519b466 (Bug 1978376: Add admin ack Upgradeable condition gate, 2021-07-27, openshift#633). But I'm about to move it, and CodeRabbit complains about the unhandled "wantErr=true, but err==nil" and "expectedResult=true but isApplicable=false" situations. Extend the logic to check for those. Also move from t.Fatal* to t.Error*, because there's no harm in moving on to check the other condition, even if the first condition we check has a problem. Checking both will give the developer more information about what they need to fix to make the test happy, without them needing to retest after fixing one property before they hear about a problem with the other property. [1]: openshift#1368 (comment)
be896d7 to c148c8c
Actionable comments posted: 2
♻️ Duplicate comments (9)
pkg/risk/upgradeable/upgradeable.go (3)
36-39: ⚠️ Potential issue | 🟠 Major

Missing `DeleteFunc` handler for `ClusterOperator` deletions.

The event handler only registers `AddFunc` and `UpdateFunc`. When a `ClusterOperator` is deleted, the `eventHandler` won't be triggered, leaving the `Upgradeable` source stale with outdated risks until an unrelated add/update event arrives. Add a `DeleteFunc` handler.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/upgradeable/upgradeable.go` around lines 36 - 39, The event handler registration for coInformer.Informer() only wires AddFunc and UpdateFunc, so deletions of ClusterOperator aren't triggering source.eventHandler and leave stale state; add a DeleteFunc to the cache.ResourceEventHandlerFuncs that calls source.eventHandler(changeCallback) (mirror the existing AddFunc/UpdateFunc behavior) so that ClusterOperator deletions invoke the same changeCallback and update the Upgradeable source.
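Routing all three informer hooks — including `DeleteFunc` — to the same recomputation can be sketched with a self-contained stand-in for client-go's `cache.ResourceEventHandlerFuncs` (the real registration would use the client-go type):

```go
package main

import "fmt"

// handlerFuncs mimics cache.ResourceEventHandlerFuncs just enough to show
// the wiring; it is not the client-go type.
type handlerFuncs struct {
	AddFunc    func(obj interface{})
	UpdateFunc func(oldObj, newObj interface{})
	DeleteFunc func(obj interface{})
}

// wire registers the same recomputation for adds, updates, and — the hook
// the review says is missing — deletes.
func wire(recompute func()) handlerFuncs {
	return handlerFuncs{
		AddFunc:    func(_ interface{}) { recompute() },
		UpdateFunc: func(_, _ interface{}) { recompute() },
		DeleteFunc: func(_ interface{}) { recompute() },
	}
}

func main() {
	events := 0
	h := wire(func() { events++ })
	h.AddFunc(nil)
	h.UpdateFunc(nil, nil)
	h.DeleteFunc(nil)
	fmt.Println(events)
}
```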
25-30: ⚠️ Potential issue | 🔴 Critical

Data race on `lastSeen` field persists.

The `lastSeen` slice is written at lines 52, 57, and 61 in `Risks()` and read at line 122 in `eventHandler()`. These execute on different goroutines without synchronization, which is a data race that can cause `cmp.Diff` to compare torn state.

Also applies to: 52-52, 57-57, 61-61, 122-122
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/upgradeable/upgradeable.go` around lines 25 - 30, The lastSeen slice in struct upgradeable is accessed concurrently by Risks() (writes at the three locations) and eventHandler() (reads), causing a data race; fix by adding a sync.RWMutex (e.g., mu) to upgradeable and use mu.Lock()/mu.Unlock() around writes in Risks() and mu.RLock()/mu.RUnlock() around reads in eventHandler() (or perform a safe copy under lock and use the copy outside the lock) so cmp.Diff and other consumers never see torn state; update all references to lastSeen accordingly to use the protected access.
100-111: ⚠️ Potential issue | 🟠 Major

Unsorted risks cause spurious change callbacks.

The `coLister.List()` returns `ClusterOperator` objects in arbitrary order from the cache. This unstable ordering causes `cmp.Diff(u.lastSeen, risks)` in `eventHandler` to detect false changes when the list order shifts, triggering unnecessary callbacks. Sort the `risks` slice by `Name` before returning.
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/upgradeable/upgradeable.go` around lines 100 - 111, The generated risks slice is unstable because coLister.List() yields operators in arbitrary order; before returning from the function that builds the risks slice (the block that defines var risks []configv1.ConditionalUpdateRisk and iterates over ops using resourcemerge.FindOperatorStatusCondition), sort risks by the Name field to produce a deterministic ordering so cmp.Diff(u.lastSeen, risks) won't see spurious changes; use sort.SliceStable (or sort.Slice) on risks comparing risks[i].Name < risks[j].Name and then return the sorted risks, version, nil.pkg/cvo/status.go (1)
652-666:⚠️ Potential issue | 🟠 MajorRaw
risk.Nameas condition reason may violate Kubernetes pattern.Risk names can contain hyphens (e.g.,
ClusterOperatorUpgradeable-operator-namefrompkg/risk/upgradeable), but Kubernetes condition reasons must match^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$(hyphens not allowed). Usingrisks[0].Namedirectly at line 652 can publish a non-conformant status reason.Consider sanitizing the reason or using a validated constant like
recommendedReasonExposedwhen the risk name doesn't match the pattern.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cvo/status.go` around lines 652 - 666, The code sets the condition Reason directly from risks[0].Name which may contain invalid characters (hyphens) for Kubernetes condition reasons; update the logic before calling resourcemerge.SetOperatorStatusCondition (for the OperatorUpgradeable condition) to validate/sanitize the reason string (risks[0].Name) against the Kubernetes reason regex and, if it fails, replace it with a validated constant such as recommendedReasonExposed (or another allowed constant), ensuring the multi-risk branch that sets recommendedReasonMultiple still produces a valid Reason; apply this check where reason is assigned so all paths (single risk and multiple risks) emit a conformant reason.pkg/risk/overrides/overrides.go (2)
24-29: ⚠️ Potential issue | 🔴 Critical

Data race on `sawOverrides` field persists.

The `sawOverrides` pointer is written at line 54 from `Risks()` and read at lines 97 and 107-108 from `eventHandler()`. These run on different goroutines without synchronization, creating a data race.

Also applies to: 54-54, 96-112
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/overrides/overrides.go` around lines 24 - 29, The sawOverrides pointer on struct overrides is accessed concurrently (written in Risks() and read in eventHandler()) causing a data race; fix by making synchronization explicit—either replace *bool with a sync/atomic boolean (e.g., atomic.Bool) or add a mutex field to overrides and guard all accesses to sawOverrides with that mutex. Update the writer in Risks() to use the chosen atomic store or lock/unlock, and update the readers in eventHandler() to use atomic load or lock/unlock respectively; ensure all references to sawOverrides in the overrides type, Risks(), and eventHandler() are adjusted to the new synchronized access pattern.
59-62: ⚠️ Potential issue | 🟠 Major

Hard failure on empty `Status.Desired.Version` drops the override risk.

When `cv.Status.Desired.Version` is empty (common during initial sync when only `Image` is set), `semver.Parse` fails and the function returns an error without returning the override risk. This drops the upgrade blocker entirely instead of surfacing it with a fallback URL.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/overrides/overrides.go` around lines 59 - 62, The code currently hard-fails when cv.Status.Desired.Version is empty because semver.Parse returns an error; instead, guard the parse: check if cv.Status.Desired.Version == "" and treat that as a non-fatal condition (do not call semver.Parse), then continue to construct and return the override risk (including the fallback URL) rather than returning an error; otherwise, for non-empty values, keep using semver.Parse and return an error only for real parse failures. Ensure the conditional logic references cv.Status.Desired.Version and semver.Parse so the behavior branches correctly.pkg/risk/updating/updating.go (2)
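A sketch of the guard the comment above asks for: treat an empty desired version as non-fatal and still return the override risk. The `Risk` type, the risk name, the fallback URL, and the `parseVersion` stub (standing in for `semver.Parse`) are all assumptions for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Risk is a hypothetical stand-in for the override-risk value the source returns.
type Risk struct {
	Name string
	URL  string
}

// parseVersion stands in for semver.Parse; real code would use the semver library.
func parseVersion(v string) error {
	if strings.Count(v, ".") != 2 {
		return errors.New("not a semantic version")
	}
	return nil
}

// overrideRisk returns the override risk even when the desired version is
// empty (common during initial sync when only Image is set), instead of
// hard-failing on the parse and dropping the blocker.
func overrideRisk(desiredVersion string) (*Risk, error) {
	risk := &Risk{Name: "ClusterVersionOverridesSet", URL: "https://example.com/fallback"}
	if desiredVersion == "" {
		return risk, nil // non-fatal: surface the risk with the fallback URL
	}
	if err := parseVersion(desiredVersion); err != nil {
		return nil, err // real parse failure on a non-empty version
	}
	// ...version-specific URL construction would go here...
	return risk, nil
}

func main() {
	r, err := overrideRisk("")
	fmt.Println(r.Name, err)
}
```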
25-30: ⚠️ Potential issue | 🔴 Critical

Data race on `wasUpdating` field persists.

The `wasUpdating` pointer is written at line 55 from `Risks()` (called from sync paths) and read at lines 94 and 104-105 from `eventHandler()` (called from informer callbacks on different goroutines). This is a data race that can cause missed or duplicate notifications and leave `Upgradeable` stale.

Add synchronization using a mutex or atomic operations to protect access to `wasUpdating`.

Also applies to: 54-55, 93-109
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/updating/updating.go` around lines 25 - 30, The wasUpdating pointer on struct updating is accessed from multiple goroutines (written in Risks() and read in eventHandler()), causing a data race; add synchronization by replacing wasUpdating *bool with either an atomic boolean (e.g., atomic.Bool) or adding a sync.Mutex (or sync.RWMutex) on updating, then update all accesses in Risks() and eventHandler() to use the chosen synchronization: for atomic, use Load/Store operations; for a mutex, lock/unlock around reads and writes. Ensure you update the struct definition (updating) and all references in Risks() and eventHandler() to use the new guarded access so reads/writes are thread-safe.
50-58: ⚠️ Potential issue | 🔴 Critical

Unhandled lister errors treated as "not updating".

When `isUpdating()` returns an error that is neither `NotFound` nor a `NoMatchError`, the code falls through to line 55, sets `wasUpdating`, and if `updating` is false, returns `nil, nil, nil`. This silently treats cache/lister failures as "no risk", potentially clearing the upgrade blocker during transient problems. The error should be propagated.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/updating/updating.go` around lines 50 - 58, The current logic treats any non-NotFound/NoMatch error from u.isUpdating() as "not updating" and silently clears the blocker; change the control flow to propagate unexpected errors instead: after calling u.isUpdating(), if err != nil then if meta.IsNoMatchError(err) || apierrors.IsNotFound(err) return nil,nil,nil else return nil,nil,err; only set u.wasUpdating = &updating and proceed when err == nil. This ensures u.isUpdating() errors are not treated as successful "not updating" states.pkg/risk/deletion/deletion.go (1)
60-72: ⚠️ Potential issue | 🔴 Critical

Fix the inverted early-return that suppresses active deletion risks.

On line 60, the `len(risks)` condition is reversed: when a risk is found, the function returns `nil, nil, nil`, so the deletion blocker is dropped and the risk-to-version mapping (lines 69-72) is skipped.

Proposed fix:

```diff
-	if len(risks) > 0 {
+	if len(risks) == 0 {
 		return nil, nil, nil
 	}
 	majorAndMinorUpdates := upgradeable.MajorAndMinorUpdates(version, versions)
 	if len(majorAndMinorUpdates) == 0 {
 		return risks, nil, err
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/deletion/deletion.go` around lines 60 - 72, The early-return is inverted: currently the code returns nil,nil,nil when risks exist, suppressing deletion blockers; change the condition to check for no risks (if len(risks) == 0) and return nil, nil, nil only in that case so the subsequent logic that computes majorAndMinorUpdates, builds versionMap, and returns the actual risks executes when risks are present; update the if condition near the use of variables risks, majorAndMinorUpdates, and versionMap accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cvo/cvo.go`:
- Line 309: The variable declaration uses an undefined type `riskSource`; change
the declaration of `risks` from `riskSource` to the imported package type
`risk.Source` (i.e., replace `risks := []riskSource{}` with `risks :=
[]risk.Source{}`) so it matches other usages like the struct fields referencing
`risk.Source` and compiles correctly.
In `@pkg/risk/updating/updating.go`:
- Line 88: Remove the debug print statement `fmt.Printf("checking ClusterVersion
conditions: %v\n", cv.Status.Conditions)` from pkg/risk/updating/updating.go
(it's a transient debug of `cv.Status.Conditions`/ClusterVersion conditions). If
that `fmt` import becomes unused after removal, delete the `fmt` import as well
to avoid unused-import compile errors. Ensure no other references rely on this
print and run `go vet`/`go build` to verify.
---
Duplicate comments:
In `@pkg/risk/upgradeable/upgradeable.go`:
- Around line 36-39: The event handler registration for coInformer.Informer()
only wires AddFunc and UpdateFunc, so deletions of ClusterOperator aren't
triggering source.eventHandler and leave stale state; add a DeleteFunc to the
cache.ResourceEventHandlerFuncs that calls source.eventHandler(changeCallback)
(mirror the existing AddFunc/UpdateFunc behavior) so that ClusterOperator
deletions invoke the same changeCallback and update the Upgradeable source.
- Around line 25-30: The lastSeen slice in struct upgradeable is accessed
concurrently by Risks() (writes at the three locations) and eventHandler()
(reads), causing a data race; fix by adding a sync.RWMutex (e.g., mu) to
upgradeable and use mu.Lock()/mu.Unlock() around writes in Risks() and
mu.RLock()/mu.RUnlock() around reads in eventHandler() (or perform a safe copy
under lock and use the copy outside the lock) so cmp.Diff and other consumers
never see torn state; update all references to lastSeen accordingly to use the
protected access.
- Around line 100-111: The generated risks slice is unstable because
coLister.List() yields operators in arbitrary order; before returning from the
function that builds the risks slice (the block that defines var risks
[]configv1.ConditionalUpdateRisk and iterates over ops using
resourcemerge.FindOperatorStatusCondition), sort risks by the Name field to
produce a deterministic ordering so cmp.Diff(u.lastSeen, risks) won't see
spurious changes; use sort.SliceStable (or sort.Slice) on risks comparing
risks[i].Name < risks[j].Name and then return the sorted risks, version, nil.
📒 Files selected for processing (20)
- pkg/cvo/cvo.go
- pkg/cvo/cvo_test.go
- pkg/cvo/status.go
- pkg/cvo/status_test.go
- pkg/cvo/upgradeable.go
- pkg/cvo/upgradeable_test.go
- pkg/risk/adminack/adminack.go
- pkg/risk/adminack/adminack_test.go
- pkg/risk/aggregate/aggregate.go
- pkg/risk/aggregate/aggregate_test.go
- pkg/risk/deletion/deletion.go
- pkg/risk/overrides/overrides.go
- pkg/risk/overrides/overrides_test.go
- pkg/risk/risk.go
- pkg/risk/risk_test.go
- pkg/risk/updating/updating.go
- pkg/risk/updating/updating_test.go
- pkg/risk/upgradeable/upgradeable.go
- pkg/risk/upgradeable/upgradeable_test.go
- pkg/risk/upgradeable/util.go
💤 Files with no reviewable changes (2)
- pkg/cvo/upgradeable_test.go
- pkg/cvo/upgradeable.go
✅ Files skipped from review due to trivial changes (6)
- pkg/risk/overrides/overrides_test.go
- pkg/cvo/status_test.go
- pkg/risk/risk_test.go
- pkg/risk/updating/updating_test.go
- pkg/risk/upgradeable/upgradeable_test.go
- pkg/risk/adminack/adminack.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/risk/upgradeable/util.go
- pkg/risk/aggregate/aggregate_test.go
- pkg/cvo/cvo_test.go
Force-pushed from 98bc10b to 782b77f (Compare).
This code is from 519b466 (Bug 1978376: Add admin ack Upgradeable condition gate, 2021-07-27, openshift#633). But I'm about to move it, and CodeRabbit complains [1] about the unhandled "wantErr=true, but err==nil" and "expectedResult=true but isApplicable=false" situations. Extend the logic to check for those. Also move from t.Fatal* to t.Error*, because there's no harm in moving on to check the other condition, even if the first condition we check has a problem. Checking both gives the developer more information about what they need to fix to make the test happy, without requiring them to fix one property and retest before hearing about a problem with the other.

[1]: openshift#1368 (comment)
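The dual-check, accumulate-don't-abort pattern described above can be sketched as a helper that collects problems the way repeated `t.Errorf` calls would (the function and field names here are assumptions modeled on the commit message, not the test's actual code):

```go
package main

import "fmt"

// checkCase mirrors the commit's test fix: verify the error expectation and
// the boolean result independently, accumulating problems (t.Error* style)
// instead of stopping at the first one (t.Fatal* style), so a developer
// hears about both failures in a single test run.
func checkCase(err error, wantErr bool, isApplicable, expectedResult bool) []string {
	var problems []string
	if wantErr && err == nil {
		problems = append(problems, "wantErr=true, but err==nil")
	}
	if !wantErr && err != nil {
		problems = append(problems, fmt.Sprintf("unexpected error: %v", err))
	}
	if isApplicable != expectedResult {
		problems = append(problems, fmt.Sprintf("isApplicable=%t, expected %t", isApplicable, expectedResult))
	}
	return problems
}

func main() {
	// Both properties are wrong; both problems are reported in one pass.
	fmt.Println(len(checkCase(nil, true, false, true))) // → 2
}
```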
This simplifies the complex cvo package by shifting some logic into small, narrowly scoped helper packages. That makes for easier unit testing of that logic, and less complication when reading the remaining cvo package code.

Using the risk.Source framework with the new aggregate implementation allows Upgradeable risks to feed into the conditionalUpdateRisks framework. This should help reduce confusion some users experience when they see Upgradeable=False and think it applies to all updates (when in reality it applies to major and minor bumps, but does not apply to patch updates, especially since 6f8f984, OTA-1860: Stop blocking patch updates when cluster version overrides are set, 2026-02-10, openshift#1314). It will also allow structured access to individual risks that are reported via this pipeline, while Upgradeable had an aggregated message string only consumable by humans.

I'm also removing internal.UpgradeableAdminAckRequired and similar from ClusterVersion status.conditions. While these were visible to end users, they were mostly a mechanism for passing data from one part of the CVO to another. There are no known consumers outside of the CVO.

With the refactor providing clearer abstraction layers, I could drop the entire upgradeableCheckIntervals structure. Major touches in the history of that structure were bd05174 (pkg/cvo/upgradeable.go: Sync Upgradeable status of the CVO more often, 2022-08-02, openshift#808) and cc94c95 (pkg/cvo/upgradeable: refactor throttling, 2023-01-11, openshift#882). The new approach is event-driven via the informer handlers for almost all of the new risk.Source implementations. The one source that still relies on polling is the deletion source, where it's watching a CVO-side global variable, and not a Kube-side informer. It will still sync whenever ClusterVersion status is updated though, which should be often enough for this usually-transient situation.
If folks want lower latency, follow-up work could add informer-style handler callbacks to the deletion global. The changeCallback in unit tests is unbuffered, because I want to know exactly how often that callback is being called. If a logic change in the packages causes additional callback calls, the unbuffered channel will block them from sending the message, deadlock the informer for that handler [1], and break later test cases. [1]: https://github.com/kubernetes/client-go/blob/v0.35.3/tools/cache/shared_informer.go#L1052-L1086
Avoid startup-time "hey, I can't find those ConfigMaps!" complaints from pkg/risk/adminack by waiting until the informers have synchronized. The pkg/cvo Operator type has a cacheSynced that's populated in Operator.New (with the ClusterVersion, ClusterOperator, and FeatureGate informers). cacheSynced goes all the way back to 648d27c (metrics: Report a gauge for when cluster operator conditions change, 2019-01-31, openshift#107). But when we added the initial ConfigMap informer as cmLister back in c147732 (metrics: add cluster_installer series, 2019-06-26, openshift#213) we didn't add that informer to cacheSynced. And when I added cmManagedInformer in d18deee (pkg/cvo: Separate ConfigMap informer for openshift-config-managed, 2020-08-21, openshift#441), I didn't add that informer to cacheSynced. None of these earlier commits explain why the ConfigMap informers were not included in cacheSynced, so I'm assuming that was an oversight, and catching up now. There is a risk that the ConfigMap informers fail to synchronize quickly, but I'm having trouble imagining a situation where that would happen but the ClusterVersion and ClusterOperator informers wouldn't also fail to synchronize. And either way, we don't serve metrics while we wait on this cache synchronization, so the critical ClusterVersionOperatorDown alert will be firing within 10 minutes, and the responding system administrators can sort out whatever was giving the cache-sync logic trouble.
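The cacheSynced mechanism described above is essentially client-go's `cache.WaitForCacheSync`: poll a set of `HasSynced` functions until all report true. A simplified stdlib-only stand-in (not the real client-go implementation) makes the point that an informer left out of the list never delays startup, even when its cache is still cold:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitForCacheSync is a simplified stand-in for client-go's
// cache.WaitForCacheSync: poll every HasSynced func until all report true,
// or return false when the stop channel closes. Omitting an informer from
// the list (as happened with the ConfigMap informers) means callers may run
// before that informer's cache is warm.
func waitForCacheSync(stopCh <-chan struct{}, cacheSynced ...func() bool) bool {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		allSynced := true
		for _, synced := range cacheSynced {
			if !synced() {
				allSynced = false
				break
			}
		}
		if allSynced {
			return true
		}
		select {
		case <-stopCh:
			return false
		case <-ticker.C:
		}
	}
}

func main() {
	var configMapSynced atomic.Bool
	go func() { time.Sleep(30 * time.Millisecond); configMapSynced.Store(true) }()
	stop := make(chan struct{})
	fmt.Println(waitForCacheSync(stop,
		func() bool { return true },                  // e.g. ClusterVersion, already synced
		func() bool { return configMapSynced.Load() }, // e.g. the late ConfigMap informer
	))
}
```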
vendor/k8s.io/client-go/tools/cache/shared_informer.go has:
func (s *sharedIndexInformer) AddEventHandlerWithOptions(handler ResourceEventHandler, options HandlerOptions) (ResourceEventHandlerRegistration, error) {
s.startedLock.Lock()
defer s.startedLock.Unlock()
if s.stopped {
return nil, fmt.Errorf("handler %v was not added to shared informer because it has stopped already", handler)
}
which might be the only reason AddEventHandler could fail. But
regardless, it is still good practice to bubble things up and fail
fast, to summon a cluster administrator to help sort out whatever the
issue is.
Force-pushed from 782b77f to f5453e7 (Compare).
@wking: The following tests failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Building on #1367, this simplifies the complex `cvo` package by shifting some logic into small, narrowly scoped helper packages. That makes for easier unit testing of that logic, and less complication when reading the remaining `cvo` package code.

Using #1367's `risk.Source` framework with the new aggregate implementation allows `Upgradeable` risks to feed into the `conditionalUpdateRisks` framework. This should help reduce confusion some users experience when they see `Upgradeable=False` and think it applies to all updates (when in reality it applies to major and minor bumps, but does not apply to patch updates, especially since 6f8f984, #1314). It will also allow structured access to individual risks that are reported via this pipeline, while `Upgradeable` had an aggregated message string only consumable by humans.

I'm also removing `internal.UpgradeableAdminAckRequired` and similar from ClusterVersion `status.conditions`. While these were visible to end users, they were mostly a mechanism for passing data from one part of the CVO to another. There are no known consumers outside of the CVO.

With the refactor providing clearer abstraction layers, I could drop the entire `upgradeableCheckIntervals` structure. Major touches in the history of that structure were bd05174 (#808) and cc94c95 (#882). The new approach is event-driven via the informer handlers for almost all of the new `risk.Source` implementations. The one source that still relies on polling is the `deletion` source, where it's watching a CVO-side global variable, and not a Kube-side informer. It will still sync whenever ClusterVersion status is updated, which should be often enough for this usually-transient situation. If folks want lower latency, follow-up work could add informer-style handler callbacks to the deletion global.