OTA-1933: pkg/risk: Refactor alerts into a generic update-risk interface#1367
Conversation
|
@wking: This pull request references OTA-1933 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Important Review skippedAuto reviews are limited based on label configuration. 🚫 Review skipped — only excluded labels are configured. (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaced alert-derived plumbing with a pluggable Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
08caa80 to
eacf23c
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 524-547: The risk evaluation in evaluateConditionalUpdates is
being skipped whenever availableUpdatesSync short-circuits for the cached graph,
which leaves live risk sources (u.risks) stale; modify the logic so u.risks is
evaluated independently of the fast-return path in availableUpdatesSync: ensure
evaluateConditionalUpdates (or at least the block that calls u.risks.Risks and
risk.Merge) runs on its own cadence or before the fast-return check so that
u.risks.Risks(ctx, allVersions) is invoked even when
ShouldReconcileAcceptRisks() would otherwise be short-circuited by
minimumUpdateCheckInterval, then merge results via risk.Merge into
u.Updates/u.ConditionalUpdates as before (keeping error logging), so live risk
sources don’t get frozen by the OSUS cache.
In `@pkg/risk/risk_test.go`:
- Around line 244-255: The test fails to clear the previous error string before
the idempotency check: reset actualError (set it to the empty string) just
before you inspect the second-pass err in the idempotent block so stale
first-pass errors don't mask a nil err; update the test around the second
Merge/assertion (the section using updates, conditionalUpdates, err,
actualError, and tt.expectedError) to assign actualError = "" immediately before
the `if err != nil { ... }` check.
In `@pkg/risk/risk.go`:
- Around line 72-126: The code iterates a map (the outer "for riskName, versions
:= range versions" loop) which yields non-deterministic order and causes
non-deterministic appends to conditionalUpdates and errs; fix by iterating the
risk keys in a stable order: collect all keys from the "versions" map into a
slice, sort that slice, and then loop over sorted riskName keys instead of
ranging the map directly; additionally, ensure you maintain deterministic
ordering when adding to conditionalUpdates by either inserting risks into
conditionalUpdates[i].Risks in a consistent spot or by sorting
conditionalUpdates[i].Risks (by Risk.Name) after appending, and sort
conditionalUpdates (by Release.Version) and errs (by their error strings) before
returning to guarantee fully deterministic output.
- Around line 137-140: The comparator currently uses semver.MustParse inside
slices.SortFunc over conditionalUpdates which will panic on malformed
Release.Version; instead call semver.Parse (or semver.NewVersion/ParseTolerant
as appropriate) and propagate parse errors into the Merge function's errs slice
before sorting (validate each configv1.ConditionalUpdate.Release.Version and
append any parse error to errs), or if you prefer, modify the comparator to
handle parse failures by returning a deterministic ordering and recording the
parse error into errs; ensure you reference slices.SortFunc, conditionalUpdates,
configv1.ConditionalUpdate, semver.Parse, and errs within Merge when
implementing the fix.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 82d3314c-38e0-4b83-ac17-73f6d6e3f278
📒 Files selected for processing (10)
pkg/cvo/availableupdates.gopkg/cvo/availableupdates_test.gopkg/cvo/cvo.gopkg/cvo/cvo_test.gopkg/risk/alert/alert.gopkg/risk/alert/alert_test.gopkg/risk/alert/testdata/alerts.jsonpkg/risk/mock/mock.gopkg/risk/risk.gopkg/risk/risk_test.go
293959d to
95695a1
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (3)
pkg/risk/risk.go (2)
138-141:⚠️ Potential issue | 🟠 MajorAvoid
semver.MustParsein the sort comparator.Lines 139-140 can still panic the operator on a malformed
Release.Version, which is inconsistent with the rest ofMergeaccumulating validation problems inerrs. Parse with error handling and report bad versions through the returned error instead of crashing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/risk.go` around lines 138 - 141, The comparator passed to slices.SortFunc currently calls semver.MustParse on each configv1.ConditionalUpdate.Release.Version which can panic; instead, change the comparator in Merge to parse versions with semver.Parse (or ParseTolerant) and handle errors by storing parse failures into the existing errs slice (or a local parseErrs map) and treating unparsable versions as equal/lowest for sorting to avoid panics; ensure you reference conditionalUpdates and the comparator used in slices.SortFunc, record any parse error into errs before returning the merged result, and avoid using semver.MustParse anywhere in Merge.
73-128:⚠️ Potential issue | 🟠 MajorMake
Mergedeterministic before appending risks.Line 74 still ranges a Go map, so identical inputs can produce different
ConditionalUpdate.Risksordering and differenterrors.Join(...)text. That can churn status and tests even when nothing really changed. Iterate sorted risk names here, and sort/copy the associated version slices before merging if you want stable error ordering too.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/risk.go` around lines 73 - 128, The loop in Merge that ranges over the map "versions" is non-deterministic; change the code to iterate over a sorted slice of risk names (derive from the map keys) instead of ranging the map directly, and for each risk make a deterministic copy of its "versions" slice and sort it before processing; this ensures deterministic ordering when appending to conditionalUpdates and when building errors. Locate the loop that currently does "for riskName, versions := range versions" and replace it with keyed-sorting logic (use the keys from "versions" map, sort them, then for each key copy and sort the associated versions slice) while keeping the existing checks against invalidRiskMap, riskMap, riskMissMap, and the logic that appends to conditionalUpdates and errs.pkg/cvo/availableupdates.go (1)
529-545:⚠️ Potential issue | 🟠 MajorKeep live-source risks out of the cached graph snapshot.
Lines 539-545 merge the current source result directly into
u.Updates/u.ConditionalUpdates, and those mutated slices are whatgetAvailableUpdates()copies on the next reconcile. Sincerisk.Mergeonly adds risks, a cleared alert stays attached until the next fresh graph fetch rebuilds the base graph state. This block is also skipped by the cached fast path, so new live-source state can be missed too. Re-evaluate live sources against a pristine graph snapshot, or strip previously injected source risks before each merge.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/cvo/availableupdates.go` around lines 529 - 545, The code is mutating u.Updates/u.ConditionalUpdates with live-source risks (via risk.Merge), which persists injected risks into the cached graph; instead, operate on a pristine snapshot or on copies: when calling u.risks.Risks and risk.Merge, first make local copies of the base slices (e.g., baseUpdates := append([]Update(nil), u.Updates...), baseConditional := append([]ConditionalUpdate(nil), u.ConditionalUpdates...)) or obtain the immutable graph snapshot used by getAvailableUpdates(), then call risk.Merge against those copies and assign the merge result to local variables (not back to u.Updates/u.ConditionalUpdates) or strip previously injected source risks from the copies before merging; ensure u.risks.Risks and risk.Merge are only used to produce transient mergedResults for reconciliation and do not mutate the cached u.Updates/u.ConditionalUpdates that getAvailableUpdates() will copy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/cvo/availableupdates.go`:
- Around line 529-545: The code is mutating u.Updates/u.ConditionalUpdates with
live-source risks (via risk.Merge), which persists injected risks into the
cached graph; instead, operate on a pristine snapshot or on copies: when calling
u.risks.Risks and risk.Merge, first make local copies of the base slices (e.g.,
baseUpdates := append([]Update(nil), u.Updates...), baseConditional :=
append([]ConditionalUpdate(nil), u.ConditionalUpdates...)) or obtain the
immutable graph snapshot used by getAvailableUpdates(), then call risk.Merge
against those copies and assign the merge result to local variables (not back to
u.Updates/u.ConditionalUpdates) or strip previously injected source risks from
the copies before merging; ensure u.risks.Risks and risk.Merge are only used to
produce transient mergedResults for reconciliation and do not mutate the cached
u.Updates/u.ConditionalUpdates that getAvailableUpdates() will copy.
In `@pkg/risk/risk.go`:
- Around line 138-141: The comparator passed to slices.SortFunc currently calls
semver.MustParse on each configv1.ConditionalUpdate.Release.Version which can
panic; instead, change the comparator in Merge to parse versions with
semver.Parse (or ParseTolerant) and handle errors by storing parse failures into
the existing errs slice (or a local parseErrs map) and treating unparsable
versions as equal/lowest for sorting to avoid panics; ensure you reference
conditionalUpdates and the comparator used in slices.SortFunc, record any parse
error into errs before returning the merged result, and avoid using
semver.MustParse anywhere in Merge.
- Around line 73-128: The loop in Merge that ranges over the map "versions" is
non-deterministic; change the code to iterate over a sorted slice of risk names
(derive from the map keys) instead of ranging the map directly, and for each
risk make a deterministic copy of its "versions" slice and sort it before
processing; this ensures deterministic ordering when appending to
conditionalUpdates and when building errors. Locate the loop that currently does
"for riskName, versions := range versions" and replace it with keyed-sorting
logic (use the keys from "versions" map, sort them, then for each key copy and
sort the associated versions slice) while keeping the existing checks against
invalidRiskMap, riskMap, riskMissMap, and the logic that appends to
conditionalUpdates and errs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b1478d81-6c49-422a-b7a7-b4fe175f38a9
📒 Files selected for processing (10)
pkg/cvo/availableupdates.gopkg/cvo/availableupdates_test.gopkg/cvo/cvo.gopkg/cvo/cvo_test.gopkg/risk/alert/alert.gopkg/risk/alert/alert_test.gopkg/risk/alert/testdata/alerts.jsonpkg/risk/mock/mock.gopkg/risk/risk.gopkg/risk/risk_test.go
✅ Files skipped from review due to trivial changes (4)
- pkg/cvo/cvo_test.go
- pkg/risk/mock/mock.go
- pkg/risk/risk_test.go
- pkg/cvo/availableupdates_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/cvo/cvo.go
95695a1 to
5c28a5d
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/risk/risk.go (1)
74-128:⚠️ Potential issue | 🟠 MajorMake merge iteration deterministic to avoid status churn
Line 74 ranges a Go map directly, so merge order can vary between reconciles. That can reorder
ConditionalUpdate.Risks(and some error aggregation paths) for identical inputs.Proposed fix
- riskMissMap := make(map[string]struct{}, len(versions)) - for riskName, versions := range versions { + riskMissMap := make(map[string]struct{}, len(versions)) + riskNames := make([]string, 0, len(versions)) + for riskName := range versions { + riskNames = append(riskNames, riskName) + } + sort.Strings(riskNames) + for _, riskName := range riskNames { + riskVersions := append([]string(nil), versions[riskName]...) + sort.Strings(riskVersions) if _, ok := invalidRiskMap[riskName]; ok { continue // we're already complaining about these } risk, ok := riskMap[riskName] if !ok { riskMissMap[riskName] = struct{}{} continue } - for _, version := range versions { + for _, version := range riskVersions { found := false for i, update := range updates { if update.Version != version { continue @@ - if !riskFound { - conditionalUpdates[i].Risks = append(conditionalUpdate.Risks, risk) + if !riskFound { + conditionalUpdates[i].Risks = append(conditionalUpdate.Risks, risk) + slices.SortFunc(conditionalUpdates[i].Risks, func(a, b configv1.ConditionalUpdateRisk) int { + return strings.Compare(a.Name, b.Name) + }) } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/risk/risk.go` around lines 74 - 128, The merge loop currently ranges directly over the versions map (the for riskName, versions := range versions block) which makes the join order non-deterministic; to fix it, collect the risk names (keys) from the versions map into a slice, sort that slice, and iterate over the sorted risk names instead of ranging the map directly so that appending to conditionalUpdates and the order of ConditionalUpdate.Risks is stable; also consider sorting each versions slice for a given riskName if its order can vary to ensure fully deterministic merging of releases and errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/risk/risk.go`:
- Around line 74-128: The merge loop currently ranges directly over the versions
map (the for riskName, versions := range versions block) which makes the join
order non-deterministic; to fix it, collect the risk names (keys) from the
versions map into a slice, sort that slice, and iterate over the sorted risk
names instead of ranging the map directly so that appending to
conditionalUpdates and the order of ConditionalUpdate.Risks is stable; also
consider sorting each versions slice for a given riskName if its order can vary
to ensure fully deterministic merging of releases and errors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f3135939-5865-4f91-984a-7b67e145f0a2
📒 Files selected for processing (10)
pkg/cvo/availableupdates.gopkg/cvo/availableupdates_test.gopkg/cvo/cvo.gopkg/cvo/cvo_test.gopkg/risk/alert/alert.gopkg/risk/alert/alert_test.gopkg/risk/alert/testdata/alerts.jsonpkg/risk/mock/mock.gopkg/risk/risk.gopkg/risk/risk_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/cvo/cvo_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/cvo/cvo.go
- pkg/cvo/availableupdates.go
5c28a5d to
d14e513
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/risk/risk.go`:
- Around line 83-123: The loop currently appends a new
configv1.ConditionalUpdate for a matching update before checking whether that
release already exists in conditionalUpdates; change the logic in the first
inner loop (the one iterating over updates and matching update.Version ==
version) so that when a matching update is found you first search
conditionalUpdates for an entry whose Release.Version == version and, if found,
merge the risk into that entry's Risks (respecting the existing
risk-duplication/diff checks used later), otherwise append a new
configv1.ConditionalUpdate as today; after merging or appending, remove the
entry from updates (updates = append(updates[:i], updates[i+1:]...) and set
updates = nil when empty) and break. This ensures the existing ConditionalUpdate
is extended instead of creating duplicates (refer to conditionalUpdates,
updates, configv1.ConditionalUpdate, configv1.ConditionalUpdateRisk, and the
risk variable).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5ef103df-419d-48d9-9005-650a5ea9319a
📒 Files selected for processing (10)
pkg/cvo/availableupdates.gopkg/cvo/availableupdates_test.gopkg/cvo/cvo.gopkg/cvo/cvo_test.gopkg/risk/alert/alert.gopkg/risk/alert/alert_test.gopkg/risk/alert/testdata/alerts.jsonpkg/risk/mock/mock.gopkg/risk/risk.gopkg/risk/risk_test.go
✅ Files skipped from review due to trivial changes (3)
- pkg/risk/mock/mock.go
- pkg/cvo/cvo_test.go
- pkg/risk/alert/alert.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/cvo/cvo.go
- pkg/risk/risk_test.go
ea87d7c to
f2ac936
Compare
|
/retest-required |
|
openshift/release#77508 is working on recovering our |
hongkailiu
left a comment
There was a problem hiding this comment.
Only a few nits and moving code around.
Mostly my questions about my understanding of the code, thinking aloud.
Otherwise, LGTM.
The cases passed in the e2e job.
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1367/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-ovn-techpreview-serial-1of3/2041744451692400640/artifacts/e2e-agnostic-ovn-techpreview-serial/openshift-e2e-test/artifacts/e2e.log | grep risks
started: 0/1/44 "[Jira:\"Cluster Version Operator\"] cluster-version-operator should work with risks from alerts"
passed: (3m1s) 2026-04-08T07:22:44 "[Jira:\"Cluster Version Operator\"] cluster-version-operator should work with risks from alerts"
started: 0/2/44 "[Jira:\"Cluster Version Operator\"] cluster-version-operator should work with accept risks"
passed: (1m1s) 2026-04-08T07:23:46 "[Jira:\"Cluster Version Operator\"] cluster-version-operator should work with accept risks"
started: 0/33/44 "[sig-cli][OCPFeatureGate:ClusterUpdateAcceptRisks] oc can operate accept risks [Serial]"
passed: (1.7s) 2026-04-08T07:41:56 "[sig-cli][OCPFeatureGate:ClusterUpdateAcceptRisks] oc can operate accept risks [Serial]"38ddd31 to
d9fe47b
Compare
3b365d0 (OTA-1813: Populate risks from alerts, 2026-02-27, openshift#1329) added the alert-risk logic. But the implementation in that commit was fairly directly integrated into pkg/cvo. We want to extend update-risk detection with additional providers in the future, like preflights from the target release [1], and that could get complicated with logic tucked into pkg/cvo. And pkg/cvo is already plenty complicated, even without this functionality inline. This commit refactors to remove a few hundred lines of code from pkg/cvo out to a new pkg/risk. And that allows the pkg/cvo handling to move from "I'm integrating additional risks from alerts" to "I'm integrating additional risks from a source". As we more forward from here, we can continue to extend the functionality of the source, and will only have to tweak the source in cvo.go's New(...). Most of the code is either straightforward interface work or lifted-and-shifted pre-existing code. But the ReplaceAll for U+00A0 No-break space took some time to work out, so I'll explain that in detail here. It flies in the face of [2]: Do not depend on this output being stable. If you need the ability to programmatically interpret the difference, consider using a custom Reporter. I'm manually unwinding the uncertainty they inject in github.com/google/go-cmp/cmp /report_text.go's indentMode.appendIndent, because: * I want to test the error message content, to ensure it's sufficiently readable. * I'm ok manually updating the test expectations if a go-cpm bump brings in rendering improvements. That manual update will give us space to review to ensure we're still readable. [1]: openshift/enhancements#1930 [2]: https://pkg.go.dev/github.com/google/go-cmp/cmp#Diff
We've been mixing in alerts since 3b365d0 (OTA-1813: Populate risks from alerts, 2026-02-27, openshift#1329). But when the alerts get resolved, we want the cluster to notice quickly. With this change, we keep the original upstream data isolated in new properties, so an evaluation round will be able to clear out resolved alerts, even if needFreshFetch is false because available updates were recently retrieved.
…eptance
Because [1]:
1. create a cluster which has 1 recommend version and 1 non-recommend version
2. create alert PodDisruptionBudgetAtLimit with warning severity
3. oc adm upgrade accept PodDisruptionBudgetAtLimit
had been creating fairly noisy output like:
I0408 03:58:45.443626 1 alerts.go:87] refreshed: 10 alerts
I0408 03:58:45.443753 1 availableupdates.go:534] Found 10 alerts
I0408 03:58:45.443818 1 availableupdates.go:631] Got 0 risks
I0408 03:58:45.445403 1 availableupdates.go:868] Risk with name "PodDisruptionBudgetAtLimit" is accepted by the cluster admin and thus not in the evaluation of conditional update
I0408 03:58:45.445421 1 availableupdates.go:868] Risk with name "PodDisruptionBudgetAtLimit" is accepted by the cluster admin and thus not in the evaluation of conditional update
I0408 03:58:45.445429 1 availableupdates.go:868] Risk with name "PodDisruptionBudgetAtLimit" is accepted by the cluster admin and thus not in the evaluation of conditional update
...
I0408 03:58:45.446455 1 availableupdates.go:868] Risk with name "PodDisruptionBudgetAtLimit" is accepted by the cluster admin and thus not in the evaluation of conditional update
I0408 03:58:45.446460 1 availableupdates.go:868] Risk with name "PodDisruptionBudgetAtLimit" is accepted by the cluster admin and thus not in the evaluation of conditional update
I0408 03:58:45.446475 1 availableupdates.go:177] Requeue available-update evaluation, because "4.22.0-ec.4" is Recommended=Unknown: EvaluationFailed: Could not evaluate exposure to update risk SomeInvokerThing (invalid PromQL result length must be one, but is 0)
SomeInvokerThing description: On clusters on default invoker user, this imaginary bug can happen.
SomeInvokerThing URL: https://bug.example.com/a
I0408 03:58:45.449976 1 cvo.go:816] Finished syncing available updates "openshift-cluster-version/version" (159.68958ms)
Logging the version should help distinguish confusedly looping on a
single risk for a single target version, from looping over that same
risks for multiple different target versions.
[1]: https://redhat.atlassian.net/browse/OTA-1813?focusedCommentId=16643738
|
Thanks for bringing this up. /lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hongkailiu, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
catalogd /override ci/prow/e2e-agnostic-ovn-techpreview-serial-2of3 And from test history, all the others have also passed at least once since bbe69d5, and there have been minimal changes since then. /override ci/prow/e2e-agnostic-ovn Actually, techpreview-serial-1of3 is seeing something related to this pull: But that's already out in the wild, e.g. this periodic run, and I don't see how I'd be making it worse with this pull. Also manually tested and confirmed to fix the duplicate release target issue we're seeing in the 4.22/dev main branch today. /verified by @wking |
|
@wking: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-ovn, ci/prow/e2e-agnostic-ovn-techpreview-serial-1of3, ci/prow/e2e-agnostic-ovn-techpreview-serial-2of3, ci/prow/e2e-agnostic-ovn-upgrade-into-change, ci/prow/e2e-hypershift, ci/prow/e2e-hypershift-conformance DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@wking: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There was an issue of duplicated conditional risks when * a risk whose PromQL cannot be evaluated, and * the risk from alert was accepted. The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
This is to follow up a recent finding [1]. Further digging shows that the related risks to our e2e tests are `TestAlertFeatureE2ETestOTA1813`, `SyntheticRiskA`. This pull ignores the risks with prefix `TestAlertFeatureE2ETest` similar to [2]. I will fix `SyntheticRiskA` by modifying "fauxinnati" so that the risks coming from it have the prefix too. [1]. openshift#1367 (comment) [2]. openshift/origin#30929
This is to follow up a recent finding [1]. Further digging shows that the related risks to our e2e tests are `TestAlertFeatureE2ETestOTA1813`, `SyntheticRiskA`. This pull ignores the risks with prefix `TestAlertFeatureE2ETest` similar to [2]. I will fix `SyntheticRiskA` by modifying "fauxinnati" so that the risks coming from it have the prefix too. This will recover the health of the TP-enabled jobs in CI. [1]. openshift#1367 (comment) [2]. openshift/origin#30929
There was an issue of duplicated conditional risks when * a risk whose PromQL cannot be evaluated, and * the risk from alert was accepted. The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
There was an issue of duplicated conditional risks when * a risk whose PromQL cannot be evaluated, and * the risk from alert was accepted. The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
There was an issue of duplicated conditional risks when * a risk whose PromQL cannot be evaluated, and * the risk from alert was accepted. The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
There was an issue of duplicated conditional risks when * a risk whose PromQL cannot be evaluated, and * the risk from alert was accepted. The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
This is to follow up a recent finding [1].
Further digging shows that the related risks to our e2e tests are `TestAlertFeatureE2ETestOTA1813`, `SyntheticRiskA`.
They are in the pending state. This pull bump to a longer pending
time so that it gives more time window to get e2e finish. It could
avoid disruption from Production alerts.
We could use `max by (namespace, risk, reason) (last_over_time(cluster_version_risk_conditions{job="cluster-version-operator", condition="Applies", risk!~"TestAlertFeatureE2ETest.*"}[5m]) != 0)` to ignore testing alerts,
but it does not look good to have code handling special cases only for testing.
This will recover the health of the TP-enabled jobs in CI.
[1]. openshift#1367 (comment)
[2]. openshift/origin#30929
There was an issue of duplicated conditional risks when * a risk whose PromQL cannot be evaluated, and * the risk from alert was accepted. The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
There was an issue of duplicated conditional risks when * a risk whose PromQL cannot be evaluated, and * the risk from alert was accepted. The issue is fixed by [cvo#1367](openshift#1367) and this pull is to extend the current alert risk e2e to cover the case.
3b365d0 (#1329) added the alert-risk logic. But the implementation in that commit was fairly directly integrated into pkg/cvo. We want to extend update-risk detection with additional providers in the future, like preflights from the target release, and that could get complicated with logic tucked into
pkg/cvo. Andpkg/cvois already plenty complicated, even without this functionality inline.This commit refactors to remove a few hundred lines of code from pkg/cvo out to a new pkg/risk. And that allows the
pkg/cvohandling to move from "I'm integrating additional risks from alerts" to "I'm integrating additional risks from a source". As we more forward from here, we can continue to extend the functionality of the source, and will only have to tweak the source incvo.go'sNew(...).Most of the code is either straightforward interface work or lifted-and-shifted pre-existing code. But the
ReplaceAllfor U+00A0 No-break space took some time to work out, so I'll explain that in detail here. It flies in the face of go-cmp docs:I'm manually unwinding the uncertainty they inject in
github.com/google/go-cmp/cmp /report_text.go'sindentMode.appendIndent, because: