Skip to content

OTA-1933: pkg/risk: Refactor alerts into a generic update-risk interface#1367

Merged
openshift-merge-bot[bot] merged 3 commits intoopenshift:mainfrom
wking:generic-risk-interface
Apr 10, 2026
Merged

OTA-1933: pkg/risk: Refactor alerts into a generic update-risk interface#1367
openshift-merge-bot[bot] merged 3 commits intoopenshift:mainfrom
wking:generic-risk-interface

Conversation

@wking
Copy link
Copy Markdown
Member

@wking wking commented Apr 3, 2026

3b365d0 (#1329) added the alert-risk logic. But the implementation in that commit was fairly directly integrated into pkg/cvo. We want to extend update-risk detection with additional providers in the future, like preflights from the target release, and that could get complicated with logic tucked into pkg/cvo. And pkg/cvo is already plenty complicated, even without this functionality inline.

This commit refactors to remove a few hundred lines of code from pkg/cvo out to a new pkg/risk. And that allows the pkg/cvo handling to move from "I'm integrating additional risks from alerts" to "I'm integrating additional risks from a source". As we more forward from here, we can continue to extend the functionality of the source, and will only have to tweak the source in cvo.go's New(...).

Most of the code is either straightforward interface work or lifted-and-shifted pre-existing code. But the ReplaceAll for U+00A0 No-break space took some time to work out, so I'll explain that in detail here. It flies in the face of go-cmp docs:

Do not depend on this output being stable. If you need the ability to programmatically interpret the difference, consider using a custom Reporter.

I'm manually unwinding the uncertainty they inject in github.com/google/go-cmp/cmp /report_text.go's indentMode.appendIndent, because:

  • I want to test the error message content, to ensure it's sufficiently readable.
  • I'm ok manually updating the test expectations if a go-cpm bump brings in rendering improvements. That manual update will give us space to review to ensure we're still readable.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 3, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Apr 3, 2026

@wking: This pull request references OTA-1933 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

3b365d0 (#1329) added the alert-risk logic. But the implementation in that commit was fairly directly integrated into pkg/cvo. We want to extend update-risk detection with additional providers in the future, like preflights from the target release, and that could get complicated with logic tucked into pkg/cvo. And pkg/cvo is already plenty complicated, even without this functionality inline.

This commit refactors to remove a few hundred lines of code from pkg/cvo out to a new pkg/risk. And that allows the pkg/cvo handling to move from "I'm integrating additional risks from alerts" to "I'm integrating additional risks from a source". As we more forward from here, we can continue to extend the functionality of the source, and will only have to tweak the source in cvo.go's New(...).

Most of the code is either straightforward interface work or lifted-and-shifted pre-existing code. But the ReplaceAll for U+00A0 No-break space took some time to work out, so I'll explain that in detail here. It flies in the face of go-cmp docs:

Do not depend on this output being stable. If you need the ability to programmatically interpret the difference, consider using a custom Reporter.

I'm manually unwinding the uncertainty they inject in github.com/google/go-cmp/cmp /report_text.go's indentMode.appendIndent, because:

  • I want to test the error message content, to ensure it's sufficiently readable.
  • I'm ok manually updating the test expectations if a go-cpm bump brings in rendering improvements. That manual update will give us space to review to ensure we're still readable.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 3, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 664da079-953c-4513-9700-3ef5d2797b6b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Replaced alert-derived plumbing with a pluggable risk.Source: add alert-backed and mock risk providers, introduce risk.Merge to validate and merge reported risks into Releases and ConditionalUpdates, and rewire operator and tests to use risk.Source for conditional-update risk evaluation and merging.

Changes

Cohort / File(s) Summary
Operator core
pkg/cvo/cvo.go, pkg/cvo/cvo_test.go
Replaced AlertGetter on Operator with risks risk.Source; constructor now wires an alert-backed risk.Source; tests updated to use shared cmp options and ignore the risks field during diffs.
Available-updates logic & tests
pkg/cvo/availableupdates.go, pkg/cvo/availableupdates_test.go
Removed alert-specific interfaces/logic and sanityCheck; getAvailableUpdates preserves upstream slices; evaluateConditionalUpdates now gathers all versions and calls u.risks.Risks(...), then merges returned risks via risk.Merge; loadRiskConditions preserves existing matching Conditions; tests switched to risk mocks and updated expected messages/runbooks.
Risk interface & merge utility
pkg/risk/risk.go, pkg/risk/risk_test.go
Added risk.Source interface and risk.Merge(...) that validates risk names, deduplicates/compares definitions (ignoring Conditions), maps risks→versions, injects/merges into Releases/ConditionalUpdates, sorts results, aggregates errors, and has idempotency tests.
Alert-backed risk source
pkg/risk/alert/alert.go, pkg/risk/alert/alert_test.go
New alert-backed risk.Source that queries Prometheus alerts, converts alerts to configv1.ConditionalUpdateRisk (summary, description, runbook, Conditions), merges alerts mapping to the same risk name deterministically, and returns a versions map; tests include in-memory and file-backed fixtures.
Mock risk source
pkg/risk/mock/mock.go
Added mock.Mock implementing risk.Source for tests; configurable InternalRisks and InternalVersions; returns deterministic version mappings and a controllable error when InternalRisks is nil.
Tests & imports cleanup
pkg/cvo/..., pkg/risk/...
Removed alert-based test scaffolding and promql-specific imports; updated contexts from TODO to Background in tests; adjusted expected runbook URLs/messages to match new alert→risk conversion.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 3, 2026
@wking wking force-pushed the generic-risk-interface branch from 08caa80 to eacf23c Compare April 3, 2026 22:18
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 524-547: The risk evaluation in evaluateConditionalUpdates is
being skipped whenever availableUpdatesSync short-circuits for the cached graph,
which leaves live risk sources (u.risks) stale; modify the logic so u.risks is
evaluated independently of the fast-return path in availableUpdatesSync: ensure
evaluateConditionalUpdates (or at least the block that calls u.risks.Risks and
risk.Merge) runs on its own cadence or before the fast-return check so that
u.risks.Risks(ctx, allVersions) is invoked even when
ShouldReconcileAcceptRisks() would otherwise be short-circuited by
minimumUpdateCheckInterval, then merge results via risk.Merge into
u.Updates/u.ConditionalUpdates as before (keeping error logging), so live risk
sources don’t get frozen by the OSUS cache.

In `@pkg/risk/risk_test.go`:
- Around line 244-255: The test fails to clear the previous error string before
the idempotency check: reset actualError (set it to the empty string) just
before you inspect the second-pass err in the idempotent block so stale
first-pass errors don't mask a nil err; update the test around the second
Merge/assertion (the section using updates, conditionalUpdates, err,
actualError, and tt.expectedError) to assign actualError = "" immediately before
the `if err != nil { ... }` check.

In `@pkg/risk/risk.go`:
- Around line 72-126: The code iterates a map (the outer "for riskName, versions
:= range versions" loop) which yields non-deterministic order and causes
non-deterministic appends to conditionalUpdates and errs; fix by iterating the
risk keys in a stable order: collect all keys from the "versions" map into a
slice, sort that slice, and then loop over sorted riskName keys instead of
ranging the map directly; additionally, ensure you maintain deterministic
ordering when adding to conditionalUpdates by either inserting risks into
conditionalUpdates[i].Risks in a consistent spot or by sorting
conditionalUpdates[i].Risks (by Risk.Name) after appending, and sort
conditionalUpdates (by Release.Version) and errs (by their error strings) before
returning to guarantee fully deterministic output.
- Around line 137-140: The comparator currently uses semver.MustParse inside
slices.SortFunc over conditionalUpdates which will panic on malformed
Release.Version; instead call semver.Parse (or semver.NewVersion/ParseTolerant
as appropriate) and propagate parse errors into the Merge function's errs slice
before sorting (validate each configv1.ConditionalUpdate.Release.Version and
append any parse error to errs), or if you prefer, modify the comparator to
handle parse failures by returning a deterministic ordering and recording the
parse error into errs; ensure you reference slices.SortFunc, conditionalUpdates,
configv1.ConditionalUpdate, semver.Parse, and errs within Merge when
implementing the fix.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 82d3314c-38e0-4b83-ac17-73f6d6e3f278

📥 Commits

Reviewing files that changed from the base of the PR and between 26e3765 and 08caa80.

📒 Files selected for processing (10)
  • pkg/cvo/availableupdates.go
  • pkg/cvo/availableupdates_test.go
  • pkg/cvo/cvo.go
  • pkg/cvo/cvo_test.go
  • pkg/risk/alert/alert.go
  • pkg/risk/alert/alert_test.go
  • pkg/risk/alert/testdata/alerts.json
  • pkg/risk/mock/mock.go
  • pkg/risk/risk.go
  • pkg/risk/risk_test.go

@wking wking force-pushed the generic-risk-interface branch 2 times, most recently from 293959d to 95695a1 Compare April 3, 2026 22:32
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (3)
pkg/risk/risk.go (2)

138-141: ⚠️ Potential issue | 🟠 Major

Avoid semver.MustParse in the sort comparator.

Lines 139-140 can still panic the operator on a malformed Release.Version, which is inconsistent with the rest of Merge accumulating validation problems in errs. Parse with error handling and report bad versions through the returned error instead of crashing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/risk.go` around lines 138 - 141, The comparator passed to
slices.SortFunc currently calls semver.MustParse on each
configv1.ConditionalUpdate.Release.Version which can panic; instead, change the
comparator in Merge to parse versions with semver.Parse (or ParseTolerant) and
handle errors by storing parse failures into the existing errs slice (or a local
parseErrs map) and treating unparsable versions as equal/lowest for sorting to
avoid panics; ensure you reference conditionalUpdates and the comparator used in
slices.SortFunc, record any parse error into errs before returning the merged
result, and avoid using semver.MustParse anywhere in Merge.

73-128: ⚠️ Potential issue | 🟠 Major

Make Merge deterministic before appending risks.

Line 74 still ranges a Go map, so identical inputs can produce different ConditionalUpdate.Risks ordering and different errors.Join(...) text. That can churn status and tests even when nothing really changed. Iterate sorted risk names here, and sort/copy the associated version slices before merging if you want stable error ordering too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/risk.go` around lines 73 - 128, The loop in Merge that ranges over
the map "versions" is non-deterministic; change the code to iterate over a
sorted slice of risk names (derive from the map keys) instead of ranging the map
directly, and for each risk make a deterministic copy of its "versions" slice
and sort it before processing; this ensures deterministic ordering when
appending to conditionalUpdates and when building errors. Locate the loop that
currently does "for riskName, versions := range versions" and replace it with
keyed-sorting logic (use the keys from "versions" map, sort them, then for each
key copy and sort the associated versions slice) while keeping the existing
checks against invalidRiskMap, riskMap, riskMissMap, and the logic that appends
to conditionalUpdates and errs.
pkg/cvo/availableupdates.go (1)

529-545: ⚠️ Potential issue | 🟠 Major

Keep live-source risks out of the cached graph snapshot.

Lines 539-545 merge the current source result directly into u.Updates/u.ConditionalUpdates, and those mutated slices are what getAvailableUpdates() copies on the next reconcile. Since risk.Merge only adds risks, a cleared alert stays attached until the next fresh graph fetch rebuilds the base graph state. This block is also skipped by the cached fast path, so new live-source state can be missed too. Re-evaluate live sources against a pristine graph snapshot, or strip previously injected source risks before each merge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/cvo/availableupdates.go` around lines 529 - 545, The code is mutating
u.Updates/u.ConditionalUpdates with live-source risks (via risk.Merge), which
persists injected risks into the cached graph; instead, operate on a pristine
snapshot or on copies: when calling u.risks.Risks and risk.Merge, first make
local copies of the base slices (e.g., baseUpdates := append([]Update(nil),
u.Updates...), baseConditional := append([]ConditionalUpdate(nil),
u.ConditionalUpdates...)) or obtain the immutable graph snapshot used by
getAvailableUpdates(), then call risk.Merge against those copies and assign the
merge result to local variables (not back to u.Updates/u.ConditionalUpdates) or
strip previously injected source risks from the copies before merging; ensure
u.risks.Risks and risk.Merge are only used to produce transient mergedResults
for reconciliation and do not mutate the cached u.Updates/u.ConditionalUpdates
that getAvailableUpdates() will copy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 529-545: The code is mutating u.Updates/u.ConditionalUpdates with
live-source risks (via risk.Merge), which persists injected risks into the
cached graph; instead, operate on a pristine snapshot or on copies: when calling
u.risks.Risks and risk.Merge, first make local copies of the base slices (e.g.,
baseUpdates := append([]Update(nil), u.Updates...), baseConditional :=
append([]ConditionalUpdate(nil), u.ConditionalUpdates...)) or obtain the
immutable graph snapshot used by getAvailableUpdates(), then call risk.Merge
against those copies and assign the merge result to local variables (not back to
u.Updates/u.ConditionalUpdates) or strip previously injected source risks from
the copies before merging; ensure u.risks.Risks and risk.Merge are only used to
produce transient mergedResults for reconciliation and do not mutate the cached
u.Updates/u.ConditionalUpdates that getAvailableUpdates() will copy.

In `@pkg/risk/risk.go`:
- Around line 138-141: The comparator passed to slices.SortFunc currently calls
semver.MustParse on each configv1.ConditionalUpdate.Release.Version which can
panic; instead, change the comparator in Merge to parse versions with
semver.Parse (or ParseTolerant) and handle errors by storing parse failures into
the existing errs slice (or a local parseErrs map) and treating unparsable
versions as equal/lowest for sorting to avoid panics; ensure you reference
conditionalUpdates and the comparator used in slices.SortFunc, record any parse
error into errs before returning the merged result, and avoid using
semver.MustParse anywhere in Merge.
- Around line 73-128: The loop in Merge that ranges over the map "versions" is
non-deterministic; change the code to iterate over a sorted slice of risk names
(derive from the map keys) instead of ranging the map directly, and for each
risk make a deterministic copy of its "versions" slice and sort it before
processing; this ensures deterministic ordering when appending to
conditionalUpdates and when building errors. Locate the loop that currently does
"for riskName, versions := range versions" and replace it with keyed-sorting
logic (use the keys from "versions" map, sort them, then for each key copy and
sort the associated versions slice) while keeping the existing checks against
invalidRiskMap, riskMap, riskMissMap, and the logic that appends to
conditionalUpdates and errs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b1478d81-6c49-422a-b7a7-b4fe175f38a9

📥 Commits

Reviewing files that changed from the base of the PR and between 08caa80 and 293959d.

📒 Files selected for processing (10)
  • pkg/cvo/availableupdates.go
  • pkg/cvo/availableupdates_test.go
  • pkg/cvo/cvo.go
  • pkg/cvo/cvo_test.go
  • pkg/risk/alert/alert.go
  • pkg/risk/alert/alert_test.go
  • pkg/risk/alert/testdata/alerts.json
  • pkg/risk/mock/mock.go
  • pkg/risk/risk.go
  • pkg/risk/risk_test.go
✅ Files skipped from review due to trivial changes (4)
  • pkg/cvo/cvo_test.go
  • pkg/risk/mock/mock.go
  • pkg/risk/risk_test.go
  • pkg/cvo/availableupdates_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/cvo/cvo.go

@wking wking force-pushed the generic-risk-interface branch from 95695a1 to 5c28a5d Compare April 3, 2026 22:38
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
pkg/risk/risk.go (1)

74-128: ⚠️ Potential issue | 🟠 Major

Make merge iteration deterministic to avoid status churn

Line 74 ranges a Go map directly, so merge order can vary between reconciles. That can reorder ConditionalUpdate.Risks (and some error aggregation paths) for identical inputs.

Proposed fix
-	riskMissMap := make(map[string]struct{}, len(versions))
-	for riskName, versions := range versions {
+	riskMissMap := make(map[string]struct{}, len(versions))
+	riskNames := make([]string, 0, len(versions))
+	for riskName := range versions {
+		riskNames = append(riskNames, riskName)
+	}
+	sort.Strings(riskNames)
+	for _, riskName := range riskNames {
+		riskVersions := append([]string(nil), versions[riskName]...)
+		sort.Strings(riskVersions)
 		if _, ok := invalidRiskMap[riskName]; ok {
 			continue // we're already complaining about these
 		}
 		risk, ok := riskMap[riskName]
 		if !ok {
 			riskMissMap[riskName] = struct{}{}
 			continue
 		}
-		for _, version := range versions {
+		for _, version := range riskVersions {
 			found := false
 			for i, update := range updates {
 				if update.Version != version {
 					continue
@@
-				if !riskFound {
-					conditionalUpdates[i].Risks = append(conditionalUpdate.Risks, risk)
+				if !riskFound {
+					conditionalUpdates[i].Risks = append(conditionalUpdate.Risks, risk)
+					slices.SortFunc(conditionalUpdates[i].Risks, func(a, b configv1.ConditionalUpdateRisk) int {
+						return strings.Compare(a.Name, b.Name)
+					})
 				}
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/risk/risk.go` around lines 74 - 128, The merge loop currently ranges
directly over the versions map (the for riskName, versions := range versions
block) which makes the join order non-deterministic; to fix it, collect the risk
names (keys) from the versions map into a slice, sort that slice, and iterate
over the sorted risk names instead of ranging the map directly so that appending
to conditionalUpdates and the order of ConditionalUpdate.Risks is stable; also
consider sorting each versions slice for a given riskName if its order can vary
to ensure fully deterministic merging of releases and errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/risk/risk.go`:
- Around line 74-128: The merge loop currently ranges directly over the versions
map (the for riskName, versions := range versions block) which makes the join
order non-deterministic; to fix it, collect the risk names (keys) from the
versions map into a slice, sort that slice, and iterate over the sorted risk
names instead of ranging the map directly so that appending to
conditionalUpdates and the order of ConditionalUpdate.Risks is stable; also
consider sorting each versions slice for a given riskName if its order can vary
to ensure fully deterministic merging of releases and errors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f3135939-5865-4f91-984a-7b67e145f0a2

📥 Commits

Reviewing files that changed from the base of the PR and between 293959d and 5c28a5d.

📒 Files selected for processing (10)
  • pkg/cvo/availableupdates.go
  • pkg/cvo/availableupdates_test.go
  • pkg/cvo/cvo.go
  • pkg/cvo/cvo_test.go
  • pkg/risk/alert/alert.go
  • pkg/risk/alert/alert_test.go
  • pkg/risk/alert/testdata/alerts.json
  • pkg/risk/mock/mock.go
  • pkg/risk/risk.go
  • pkg/risk/risk_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/cvo/cvo_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/cvo/cvo.go
  • pkg/cvo/availableupdates.go

@wking wking force-pushed the generic-risk-interface branch from 5c28a5d to d14e513 Compare April 3, 2026 22:48
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/risk/risk.go`:
- Around line 83-123: The loop currently appends a new
configv1.ConditionalUpdate for a matching update before checking whether that
release already exists in conditionalUpdates; change the logic in the first
inner loop (the one iterating over updates and matching update.Version ==
version) so that when a matching update is found you first search
conditionalUpdates for an entry whose Release.Version == version and, if found,
merge the risk into that entry's Risks (respecting the existing
risk-duplication/diff checks used later), otherwise append a new
configv1.ConditionalUpdate as today; after merging or appending, remove the
entry from updates (updates = append(updates[:i], updates[i+1:]...) and set
updates = nil when empty) and break. This ensures the existing ConditionalUpdate
is extended instead of creating duplicates (refer to conditionalUpdates,
updates, configv1.ConditionalUpdate, configv1.ConditionalUpdateRisk, and the
risk variable).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5ef103df-419d-48d9-9005-650a5ea9319a

📥 Commits

Reviewing files that changed from the base of the PR and between 5c28a5d and d14e513.

📒 Files selected for processing (10)
  • pkg/cvo/availableupdates.go
  • pkg/cvo/availableupdates_test.go
  • pkg/cvo/cvo.go
  • pkg/cvo/cvo_test.go
  • pkg/risk/alert/alert.go
  • pkg/risk/alert/alert_test.go
  • pkg/risk/alert/testdata/alerts.json
  • pkg/risk/mock/mock.go
  • pkg/risk/risk.go
  • pkg/risk/risk_test.go
✅ Files skipped from review due to trivial changes (3)
  • pkg/risk/mock/mock.go
  • pkg/cvo/cvo_test.go
  • pkg/risk/alert/alert.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/cvo/cvo.go
  • pkg/risk/risk_test.go

@wking wking force-pushed the generic-risk-interface branch from ea87d7c to f2ac936 Compare April 4, 2026 04:05
@wking
Copy link
Copy Markdown
Member Author

wking commented Apr 4, 2026

/retest-required

@wking
Copy link
Copy Markdown
Member Author

wking commented Apr 8, 2026

openshift/release#77508 is working on recovering our e2e-hypershift-conformance CI.

Copy link
Copy Markdown
Member

@hongkailiu hongkailiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a few nits and moving code around.
Mostly my questions about my understanding of the code, thinking aloud.

Otherwise, LGTM.

The cases passed in the e2e job.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1367/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-ovn-techpreview-serial-1of3/2041744451692400640/artifacts/e2e-agnostic-ovn-techpreview-serial/openshift-e2e-test/artifacts/e2e.log | grep risks
started: 0/1/44 "[Jira:\"Cluster Version Operator\"] cluster-version-operator should work with risks from alerts"
passed: (3m1s) 2026-04-08T07:22:44 "[Jira:\"Cluster Version Operator\"] cluster-version-operator should work with risks from alerts"
started: 0/2/44 "[Jira:\"Cluster Version Operator\"] cluster-version-operator should work with accept risks"
passed: (1m1s) 2026-04-08T07:23:46 "[Jira:\"Cluster Version Operator\"] cluster-version-operator should work with accept risks"
started: 0/33/44 "[sig-cli][OCPFeatureGate:ClusterUpdateAcceptRisks] oc can operate accept risks [Serial]"
passed: (1.7s) 2026-04-08T07:41:56 "[sig-cli][OCPFeatureGate:ClusterUpdateAcceptRisks] oc can operate accept risks [Serial]"

@wking wking force-pushed the generic-risk-interface branch 2 times, most recently from 38ddd31 to d9fe47b Compare April 9, 2026 20:11
wking added 3 commits April 9, 2026 13:11
3b365d0 (OTA-1813: Populate risks from alerts, 2026-02-27, openshift#1329)
added the alert-risk logic.  But the implementation in that commit was
fairly directly integrated into pkg/cvo.  We want to extend
update-risk detection with additional providers in the future, like
preflights from the target release [1], and that could get complicated
with logic tucked into pkg/cvo.  And pkg/cvo is already plenty
complicated, even without this functionality inline.

This commit refactors to remove a few hundred lines of code from
pkg/cvo out to a new pkg/risk.  And that allows the pkg/cvo handling
to move from "I'm integrating additional risks from alerts" to "I'm
integrating additional risks from a source".  As we more forward from
here, we can continue to extend the functionality of the source, and
will only have to tweak the source in cvo.go's New(...).

Most of the code is either straightforward interface work or
lifted-and-shifted pre-existing code.  But the ReplaceAll for U+00A0
No-break space took some time to work out, so I'll explain that in
detail here.  It flies in the face of [2]:

  Do not depend on this output being stable. If you need the ability
  to programmatically interpret the difference, consider using a
  custom Reporter.

I'm manually unwinding the uncertainty they inject in
github.com/google/go-cmp/cmp /report_text.go's
indentMode.appendIndent, because:

* I want to test the error message content, to ensure it's
  sufficiently readable.
* I'm ok manually updating the test expectations if a go-cpm bump
  brings in rendering improvements.  That manual update will give us
  space to review to ensure we're still readable.

[1]: openshift/enhancements#1930
[2]: https://pkg.go.dev/github.com/google/go-cmp/cmp#Diff
We've been mixing in alerts since 3b365d0 (OTA-1813: Populate risks
from alerts, 2026-02-27, openshift#1329).  But when the alerts get resolved, we
want the cluster to notice quickly.

With this change, we keep the original upstream data isolated in new
properties, so an evaluation round will be able to clear out resolved
alerts, even if needFreshFetch is false because available updates were
recently retrieved.
…eptance

Because [1]:

1. create a cluster which has 1 recommend version and 1 non-recommend version
2. create alert PodDisruptionBudgetAtLimit with warning severity
3. oc adm upgrade accept PodDisruptionBudgetAtLimit

had been creating fairly noisy output like:

  I0408 03:58:45.443626       1 alerts.go:87] refreshed: 10 alerts
  I0408 03:58:45.443753       1 availableupdates.go:534] Found 10 alerts
  I0408 03:58:45.443818       1 availableupdates.go:631] Got 0 risks
  I0408 03:58:45.445403       1 availableupdates.go:868] Risk with name "PodDisruptionBudgetAtLimit" is accepted by the cluster admin and thus not in the evaluation of conditional update
  I0408 03:58:45.445421       1 availableupdates.go:868] Risk with name "PodDisruptionBudgetAtLimit" is accepted by the cluster admin and thus not in the evaluation of conditional update
  I0408 03:58:45.445429       1 availableupdates.go:868] Risk with name "PodDisruptionBudgetAtLimit" is accepted by the cluster admin and thus not in the evaluation of conditional update
  ...
  I0408 03:58:45.446455       1 availableupdates.go:868] Risk with name "PodDisruptionBudgetAtLimit" is accepted by the cluster admin and thus not in the evaluation of conditional update
  I0408 03:58:45.446460       1 availableupdates.go:868] Risk with name "PodDisruptionBudgetAtLimit" is accepted by the cluster admin and thus not in the evaluation of conditional update
  I0408 03:58:45.446475       1 availableupdates.go:177] Requeue available-update evaluation, because "4.22.0-ec.4" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk SomeInvokerThing (invalid PromQL result length must be one, but is 0)
    SomeInvokerThing description: On clusters on default invoker user, this imaginary bug can happen.
    SomeInvokerThing URL: https://bug.example.com/a
  I0408 03:58:45.449976       1 cvo.go:816] Finished syncing available updates "openshift-cluster-version/version" (159.68958ms)

Logging the version should help distinguish confusedly looping on a
single risk for a single target version, from looping over that same
risks for multiple different target versions.

[1]: https://redhat.atlassian.net/browse/OTA-1813?focusedCommentId=16643738
@hongkailiu
Copy link
Copy Markdown
Member

Thanks for bringing this up.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Copy link
Copy Markdown
Member Author

wking commented Apr 10, 2026

catalogd should not have pod creation failures during install is orthogonal:

/override ci/prow/e2e-agnostic-ovn-techpreview-serial-2of3

And from test history, all the others have also passed at least once since bbe69d5, and there have been minimal changes since then.

/override ci/prow/e2e-agnostic-ovn
/override ci/prow/e2e-agnostic-ovn-techpreview-serial-1of3
/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-hypershift
/override ci/prow/e2e-hypershift-conformance

Actually, techpreview-serial-1of3 is seeing something related to this pull:

: [Monitor:legacy-test-framework-invariants-alerts][sig-trt][invariant] No new alerts should be firing	0s
{  Found alerts firing which are new or less than two weeks old, which should not be firing: 

OpenShiftUpdateRiskMightApply has no test data, this alert appears new and should not be firing
OpenShiftUpdateRiskMightApply has no test data, this alert appears new and should not be firing
OpenShiftUpdateRiskMightApply has no test data, this alert appears new and should not be firing}

But that's already out in the wild, e.g. this periodic run, and I don't see how I'd be making it worse with this pull.

Also manually tested and confirmed to fix the duplicate release target issue we're seeing in the 4.22/dev main branch today.

/verified by @wking

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 10, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@wking: This PR has been marked as verified by @wking.

Details

In response to this:

catalogd should not have pod creation failures during install is orthogonal:

/override ci/prow/e2e-agnostic-ovn-techpreview-serial-2of3

And from test history, all the others have also passed at least once since bbe69d5, and there have been minimal changes since then.

/override ci/prow/e2e-agnostic-ovn
/override ci/prow/e2e-agnostic-ovn-techpreview-serial-1of3
/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-hypershift
/override ci/prow/e2e-hypershift-conformance

Actually, techpreview-serial-1of3 is seeing something related to this pull:

: [Monitor:legacy-test-framework-invariants-alerts][sig-trt][invariant] No new alerts should be firing	0s
{  Found alerts firing which are new or less than two weeks old, which should not be firing: 

OpenShiftUpdateRiskMightApply has no test data, this alert appears new and should not be firing
OpenShiftUpdateRiskMightApply has no test data, this alert appears new and should not be firing
OpenShiftUpdateRiskMightApply has no test data, this alert appears new and should not be firing}

But that's already out in the wild, e.g. this periodic run, and I don't see how I'd be making it worse with this pull.

Also manually tested and confirmed to fix the duplicate release target issue we're seeing in the 4.22/dev main branch today.

/verified by @wking

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 10, 2026

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-ovn, ci/prow/e2e-agnostic-ovn-techpreview-serial-1of3, ci/prow/e2e-agnostic-ovn-techpreview-serial-2of3, ci/prow/e2e-agnostic-ovn-upgrade-into-change, ci/prow/e2e-hypershift, ci/prow/e2e-hypershift-conformance

Details

In response to this:

catalogd should not have pod creation failures during install is orthogonal:

/override ci/prow/e2e-agnostic-ovn-techpreview-serial-2of3

And from test history, all the others have also passed at least once since bbe69d5, and there have been minimal changes since then.

/override ci/prow/e2e-agnostic-ovn
/override ci/prow/e2e-agnostic-ovn-techpreview-serial-1of3
/override ci/prow/e2e-agnostic-ovn-upgrade-into-change
/override ci/prow/e2e-hypershift
/override ci/prow/e2e-hypershift-conformance

Actually, techpreview-serial-1of3 is seeing something related to this pull:

: [Monitor:legacy-test-framework-invariants-alerts][sig-trt][invariant] No new alerts should be firing	0s
{  Found alerts firing which are new or less than two weeks old, which should not be firing: 

OpenShiftUpdateRiskMightApply has no test data, this alert appears new and should not be firing
OpenShiftUpdateRiskMightApply has no test data, this alert appears new and should not be firing
OpenShiftUpdateRiskMightApply has no test data, this alert appears new and should not be firing}

But that's already out in the wild, e.g. this periodic run, and I don't see how I'd be making it worse with this pull.

Also manually tested and confirmed to fix the duplicate release target issue we're seeing in the 4.22/dev main branch today.

/verified by @wking

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 10, 2026

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit e9c1c39 into openshift:main Apr 10, 2026
19 checks passed
@wking wking deleted the generic-risk-interface branch April 10, 2026 04:20
hongkailiu added a commit to hongkailiu/cluster-version-operator that referenced this pull request Apr 10, 2026
There was an issue of duplicated conditional risks when

* a risk whose PromQL cannot be evaluated, and

* the risk from alert was accepted.

The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
hongkailiu added a commit to hongkailiu/cluster-version-operator that referenced this pull request Apr 10, 2026
This is to follow up a recent finding [1].

Further digging shows that the related risks to our e2e tests are `TestAlertFeatureE2ETestOTA1813`, `SyntheticRiskA`.

This pull ignores the risks with prefix `TestAlertFeatureE2ETest` similar to [2].

I will fix `SyntheticRiskA` by modifying "fauxinnati" so that the risks coming from it have the prefix too.

[1]. openshift#1367 (comment)

[2]. openshift/origin#30929
hongkailiu added a commit to hongkailiu/cluster-version-operator that referenced this pull request Apr 10, 2026
This is to follow up a recent finding [1].

Further digging shows that the related risks to our e2e tests are `TestAlertFeatureE2ETestOTA1813`, `SyntheticRiskA`.

This pull ignores the risks with prefix `TestAlertFeatureE2ETest` similar to [2].

I will fix `SyntheticRiskA` by modifying "fauxinnati" so that the risks coming from it have the prefix too.

This will recover the health of the TP-enabled jobs in CI.

[1]. openshift#1367 (comment)

[2]. openshift/origin#30929
hongkailiu added a commit to hongkailiu/cluster-version-operator that referenced this pull request Apr 10, 2026
There was an issue of duplicated conditional risks when

* a risk whose PromQL cannot be evaluated, and

* the risk from alert was accepted.

The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
hongkailiu added a commit to hongkailiu/cluster-version-operator that referenced this pull request Apr 10, 2026
There was an issue of duplicated conditional risks when

* a risk whose PromQL cannot be evaluated, and

* the risk from alert was accepted.

The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
hongkailiu added a commit to hongkailiu/cluster-version-operator that referenced this pull request Apr 10, 2026
There was an issue of duplicated conditional risks when

* a risk whose PromQL cannot be evaluated, and

* the risk from alert was accepted.

The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
hongkailiu added a commit to hongkailiu/cluster-version-operator that referenced this pull request Apr 10, 2026
There was an issue of duplicated conditional risks when

* a risk whose PromQL cannot be evaluated, and

* the risk from alert was accepted.

The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
hongkailiu added a commit to hongkailiu/cluster-version-operator that referenced this pull request Apr 10, 2026
This is to follow up a recent finding [1].

Further digging shows that the related risks to our e2e tests are `TestAlertFeatureE2ETestOTA1813`, `SyntheticRiskA`.

They are in the pending state. This pull bump to a longer pending
time so that it gives more time window to get e2e finish. It could
avoid disruption from Production alerts.

We could use `max by (namespace, risk, reason) (last_over_time(cluster_version_risk_conditions{job="cluster-version-operator", condition="Applies", risk!~"TestAlertFeatureE2ETest.*"}[5m]) != 0)` to ignore testing alerts,
but it does not look good to have code handling special cases only for testing.

This will recover the health of the TP-enabled jobs in CI.

[1]. openshift#1367 (comment)

[2]. openshift/origin#30929
hongkailiu added a commit to hongkailiu/cluster-version-operator that referenced this pull request Apr 11, 2026
There was an issue of duplicated conditional risks when

* a risk whose PromQL cannot be evaluated, and

* the risk from alert was accepted.

The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
hongkailiu added a commit to hongkailiu/cluster-version-operator that referenced this pull request Apr 11, 2026
There was an issue of duplicated conditional risks when

* a risk whose PromQL cannot be evaluated, and

* the risk from alert was accepted.

The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants