NO-JIRA: Remove fixed bugs on CO conditions (2) - 2nd try#31207
NO-JIRA: Remove fixed bugs on CO conditions (2) - 2nd try#31207hongkailiu wants to merge 9 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@hongkailiu: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughRemoves several legacy CVO operator exception branches (network, monitoring, image-registry) and updates issue-tracker URLs from issues.redhat.com to redhat.atlassian.net in monitor and scale test code. ChangesMonitor Test Exception Updates
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 8 | ❌ 4❌ Failed checks (4 warnings)
✅ Passed checks (8 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tcloud.google.com/go/storage@v1.56.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Azure/azure-sdk-for-go@v68.0.0+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Azure/azure-sdk-for-go/sdk/azcore@v1.18.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Azure/azure-sdk-for-go/sdk/azidentity@v1.8.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Azure/azure-sdk-for-go/sdk/monitor/query/azlogs@v1.1.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5@v5.7.0: is explicitly ... [truncated 57388 characters] ... etes: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/metrics: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/mount-utils: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/pod-security-admission: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-apiserver: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-cli-plugin: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-controller: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n" Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: hongkailiu The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go`:
- Around line 94-107: getControlPlaneTopology can fail and leave topology empty
causing downstream helpers (testUpgradeOperatorStateTransitions,
testUpgradeOperatorProgressingStateTransitions,
testStableSystemOperatorStateTransitions) to misbehave; change the
getControlPlaneTopology(err) handling so that if err != nil you either return
the error from the enclosing function or short-circuit/skip topology-dependent
tests immediately (do not proceed with empty topology). Locate the call to
getControlPlaneTopology and replace the current e2e.Logf-only branch with logic
to return fmt.Errorf(...) (or a defined skip/short-circuit path) when err != nil
so topology-dependent calls never run with an invalid topology value.
In `@test/extended/machines/scale.go`:
- Around line 286-296: The code currently logs and skips the topology-based
assertion when exutil.GetControlPlaneTopologyFromConfigClient returns an error
or nil; change this to fail fast by asserting the topology call succeeded and
returned a non-nil value. Replace the manual error log with
o.Expect(err).NotTo(o.HaveOccurred()) for the
exutil.GetControlPlaneTopologyFromConfigClient call and add
o.Expect(topo).NotTo(o.BeNil()) (and then check *topo !=
configv1.SingleReplicaTopologyMode) so the violations assertion always runs when
topology lookup fails or is missing; use the existing symbols cfg,
configv1client.NewForConfig, exutil.GetControlPlaneTopologyFromConfigClient,
topo, and violations to locate and update the logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 5f2f414c-4146-4330-b2af-1fadbfae691b
📒 Files selected for processing (3)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.gopkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.gotest/extended/machines/scale.go
| topology, err := getControlPlaneTopology(w.adminRESTConfig) | ||
| if err != nil { | ||
| e2e.Logf("failed to get control plane topology: %v", err) | ||
| } | ||
|
|
||
| if isUpgrade { | ||
| junits = append(junits, testUpgradeOperatorStateTransitions(finalIntervals, w.adminRESTConfig)...) | ||
| junits = append(junits, testUpgradeOperatorStateTransitions(finalIntervals, w.adminRESTConfig, topology)...) | ||
| level, err := getUpgradeLevel(w.adminRESTConfig) | ||
| if err != nil || level == unknownUpgradeLevel { | ||
| return nil, fmt.Errorf("failed to determine upgrade level: %w", err) | ||
| } | ||
| junits = append(junits, testUpgradeOperatorProgressingStateTransitions(finalIntervals, level == patchUpgradeLevel, w.adminRESTConfig)...) | ||
| junits = append(junits, testUpgradeOperatorProgressingStateTransitions(finalIntervals, level == patchUpgradeLevel, topology)...) | ||
| } else { | ||
| junits = append(junits, testStableSystemOperatorStateTransitions(finalIntervals, w.adminRESTConfig)...) | ||
| junits = append(junits, testStableSystemOperatorStateTransitions(finalIntervals, topology)...) |
There was a problem hiding this comment.
Fail or skip when topology lookup fails.
If getControlPlaneTopology errors here, topology stays empty and the downstream helpers behave as if the cluster is neither single-node nor two-node. That bypasses the new SNO skips and dual-replica/arbiter exception paths, so a topology read failure can turn into false monitor-test failures. Return the error or short-circuit the topology-dependent checks instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go`
around lines 94 - 107, getControlPlaneTopology can fail and leave topology empty
causing downstream helpers (testUpgradeOperatorStateTransitions,
testUpgradeOperatorProgressingStateTransitions,
testStableSystemOperatorStateTransitions) to misbehave; change the
getControlPlaneTopology(err) handling so that if err != nil you either return
the error from the enclosing function or short-circuit/skip topology-dependent
tests immediately (do not proceed with empty topology). Locate the call to
getControlPlaneTopology and replace the current e2e.Logf-only branch with logic
to return fmt.Errorf(...) (or a defined skip/short-circuit path) when err != nil
so topology-dependent calls never run with an invalid topology value.
| cfg, err := e2e.LoadConfig() | ||
| o.Expect(err).NotTo(o.HaveOccurred()) | ||
| configV1Client, err := configv1client.NewForConfig(cfg) | ||
| o.Expect(err).NotTo(o.HaveOccurred()) | ||
| topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client) | ||
| if err != nil { | ||
| e2e.Logf("failed to get control plane topology: %v", err) | ||
| } | ||
| if topo != nil && *topo != configv1.SingleReplicaTopologyMode { | ||
| o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations) | ||
| } |
There was a problem hiding this comment.
Fail fast when topology lookup fails instead of skipping the assertion.
If topology retrieval fails or returns nil, the violations check is silently skipped, which can hide real regressions. Please make topology discovery mandatory in this path.
Proposed fix
cfg, err := e2e.LoadConfig()
o.Expect(err).NotTo(o.HaveOccurred())
configV1Client, err := configv1client.NewForConfig(cfg)
o.Expect(err).NotTo(o.HaveOccurred())
topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client)
- if err != nil {
- e2e.Logf("failed to get control plane topology: %v", err)
- }
- if topo != nil && *topo != configv1.SingleReplicaTopologyMode {
+ o.Expect(err).NotTo(o.HaveOccurred(), "failed to get control plane topology")
+ o.Expect(topo).NotTo(o.BeNil(), "control plane topology must be discoverable")
+ if *topo != configv1.SingleReplicaTopologyMode {
o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cfg, err := e2e.LoadConfig() | |
| o.Expect(err).NotTo(o.HaveOccurred()) | |
| configV1Client, err := configv1client.NewForConfig(cfg) | |
| o.Expect(err).NotTo(o.HaveOccurred()) | |
| topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client) | |
| if err != nil { | |
| e2e.Logf("failed to get control plane topology: %v", err) | |
| } | |
| if topo != nil && *topo != configv1.SingleReplicaTopologyMode { | |
| o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations) | |
| } | |
| cfg, err := e2e.LoadConfig() | |
| o.Expect(err).NotTo(o.HaveOccurred()) | |
| configV1Client, err := configv1client.NewForConfig(cfg) | |
| o.Expect(err).NotTo(o.HaveOccurred()) | |
| topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client) | |
| o.Expect(err).NotTo(o.HaveOccurred(), "failed to get control plane topology") | |
| o.Expect(topo).NotTo(o.BeNil(), "control plane topology must be discoverable") | |
| if *topo != configv1.SingleReplicaTopologyMode { | |
| o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended/machines/scale.go` around lines 286 - 296, The code currently
logs and skips the topology-based assertion when
exutil.GetControlPlaneTopologyFromConfigClient returns an error or nil; change
this to fail fast by asserting the topology call succeeded and returned a
non-nil value. Replace the manual error log with
o.Expect(err).NotTo(o.HaveOccurred()) for the
exutil.GetControlPlaneTopologyFromConfigClient call and add
o.Expect(topo).NotTo(o.BeNil()) (and then check *topo !=
configv1.SingleReplicaTopologyMode) so the violations assertion always runs when
topology lookup fails or is missing; use the existing symbols cfg,
configv1client.NewForConfig, exutil.GetControlPlaneTopologyFromConfigClient,
topo, and violations to locate and update the logic.
|
TRT-2669 says https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2/2057096534654193664 was failing. /payload-aggregate periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2 5 |
|
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/64b4fca0-550b-11f1-9211-87f77cbc54ea-0 |
|
Scheduling required tests: |
|
@hongkailiu: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
The job from #31207 (comment) succeeded 4/5 runs. The failing run was due to a failing cluster installation. Let us redo. /payload-aggregate periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2 5 |
|
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2e00df00-5536-11f1-8c96-56c71015608c-0 |
ff1ae2c to
66ce89c
Compare
OCPBUGS-23745 has been fixed and shipped with 4.15. However, the symptom is still there in 4.21 and we create OCPBUGS-66230 to track the issue. OCPBUGS-66230 is closed as the fix is included in 4.21.0-0.nightly-2025-11-30-094855 [1]. [1]. https://redhat.atlassian.net/browse/OCPBUGS-66230?focusedCommentId=16861698
62633 is for co/service-ca and the correct one is 62634.
The bug is fixed [1]. [1]. https://redhat.atlassian.net/browse/OCPBUGS-62634?focusedCommentId=16981952
NTO should now correctly report ClusterOperator status conditions Available, Progressing and Degraded. See: OCPBUGS-62632
The bug is fixed [1]. [1]. https://redhat.atlassian.net/browse/OCPBUGS-62626?focusedCommentId=16983042
The bug is fixed [1]. [1]. https://redhat.atlassian.net/browse/CORENET-6605?focusedCommentId=16984425
See the details in [1]. [1]. https://redhat.atlassian.net/browse/OCPBUGS-86009
The symptom is still there after OCPBUGS-62630 is shipped.
The details are in the bug.
66ce89c to
8e592a0
Compare
|
/payload-aggregate periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2 5 |
|
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/54e51ee0-597a-11f1-9689-47d5fb5f83ad-0 |
|
Scheduling required tests: |
This is to redo #31112 which is revert by #31201 because of TRT-2669.
Now https://redhat.atlassian.net/browse/OCPBUGS-86308 is added to address TRT-2669.
I did not reuse https://issues.redhat.com/browse/OCPBUGS-62626 because it turned out to be different cases: OCPBUGS-86308 scale up vs OCPBUGS-62626 node reboot.
Comparing to #31112, OCPBUGS-62635 is missing in this pull because it has been removed by #30775 already.
I will do rebase after #30775 gets in.
Summary by CodeRabbit