Add SBR HyperShift e2e job using persistent management cluster (hypershift-aws) #80919
maximunited wants to merge 1 commit
Walkthrough
Adds two new HyperShift-based SBR e2e test jobs (
Changes
HyperShift SBR e2e test infrastructure
Sequence Diagram(s)
sequenceDiagram
participant rbac as RBAC
participant mgmt as Nested Mgmt Cluster
participant hs as HyperShift AWS
participant ais as apply-image-sources
participant sc as switch-kubeconfig
participant wn as wait-nodes
participant odf as ODF prepare/apply
participant tests as make run-tests
participant rk as restore-kubeconfig
rbac->>mgmt: install RBAC
mgmt->>hs: provision nested mgmt<br/>create HostedCluster
hs->>ais: management context active
ais->>ais: resolve FBC SHA<br/>fetch images-mirror-set.yaml
ais->>ais: oc patch HostedCluster<br/>imageContentSources
ais->>ais: wait NodePool<br/>AllNodesHealthy (20 min)
ais->>sc: image sources applied
sc->>sc: backup mgmt kubeconfig<br/>load hosted kubeconfig
sc->>wn: switch complete
wn->>wn: wait nodes Ready<br/>poll outdated-revision taint
wn->>odf: node rotation complete
odf->>odf: apply StorageCluster<br/>configurable timeout
odf->>tests: ODF ready
tests->>rk: test phase complete
rk->>rk: restore mgmt kubeconfig
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 14 passed | ❌ 1 failed (1 warning)
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: maximunited
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/step-registry/medik8s/sbr/hypershift/apply-image-sources/medik8s-sbr-hypershift-apply-image-sources-commands.sh`:
- Around line 54-57: The wait command for the NodePool AllNodesHealthy condition
is currently catching the timeout failure with an OR operator and downgrading it
to a warning, allowing the script to continue and report success even when node
rotation is incomplete. Remove the error handling block that begins with the OR
operator (||) after the 20m timeout on the AllNodesHealthy condition check,
including the log warning statement and the oc get nodepool diagnostic command,
so that the step properly fails when the NodePool does not reach AllNodesHealthy
state within the timeout period.
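One way to picture the requested behavior is a wrapper that still prints diagnostics but returns a non-zero status, so the step fails on timeout. This is a minimal sketch, not the script from the PR: `wait_nodepool_healthy`, `OC_BIN`, `NODEPOOL_NAME`, and `NODEPOOL_NS` are hypothetical names introduced for illustration, and keeping the diagnostic `oc get` is a variation on the comment above, which asks for the whole block to be removed.

```shell
# Hypothetical helper (not from the PR): wait for the NodePool condition and
# propagate a timeout as a failure instead of downgrading it to a warning.
# OC_BIN, NODEPOOL_NAME and NODEPOOL_NS are illustrative variable names.
wait_nodepool_healthy() {
  local oc_bin="${OC_BIN:-oc}"
  if ! "${oc_bin}" wait --for=condition=AllNodesHealthy \
        "nodepool/${NODEPOOL_NAME}" -n "${NODEPOOL_NS}" --timeout=20m; then
    echo "ERROR: NodePool ${NODEPOOL_NAME} did not reach AllNodesHealthy within 20m" >&2
    # Diagnostics are best-effort; their exit status must not mask the failure.
    "${oc_bin}" get nodepool "${NODEPOOL_NAME}" -n "${NODEPOOL_NS}" -o yaml >&2 || true
    return 1
  fi
}
```

Because the function returns 1 after printing diagnostics, a caller running under `set -e` stops at this point rather than reporting success.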
In
`@ci-operator/step-registry/medik8s/sbr/hypershift/wait-nodes/medik8s-sbr-hypershift-wait-nodes-commands.sh`:
- Around line 14-16: The EXPECTED_NODES variable is assigned from
HYPERSHIFT_NODE_COUNT environment variable with a default fallback of 3, but
this value is never validated to ensure it is a valid integer before being used
in numeric comparisons later in the script. Add validation logic immediately
after declaring EXPECTED_NODES to check if the value is a valid integer and exit
with a meaningful error message if it is not. This ensures that any malformed
HYPERSHIFT_NODE_COUNT value is caught early rather than causing unreliable
behavior during the threshold check.
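The validation described above could look roughly like the following. It is a sketch under assumptions: `validate_node_count` is a hypothetical helper name, and rejecting zero and leading zeros is a choice made here, not something the review comment specifies.

```shell
# Hypothetical helper (not from the PR): succeed only for a positive
# base-10 integer with no sign, whitespace, or leading zero.
validate_node_count() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # empty, or contains a non-digit character
    0*)          return 1 ;;  # zero or a leading zero
    *)           return 0 ;;
  esac
}

EXPECTED_NODES="${HYPERSHIFT_NODE_COUNT:-3}"
if ! validate_node_count "${EXPECTED_NODES}"; then
  echo "ERROR: HYPERSHIFT_NODE_COUNT must be a positive integer, got '${EXPECTED_NODES}'" >&2
  exit 1
fi
```

The `case` pattern avoids depending on bash-only regex matching, so the check behaves the same under any POSIX shell.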
In
`@ci-operator/step-registry/odf/apply-storage-cluster/odf-apply-storage-cluster-commands.sh`:
- Around line 47-51: The oc wait deploy command with the label selector
app=rook-ceph-osd will succeed as long as any deployments with that label become
Available, but does not enforce that all 3 OSD deployments are present and
ready. Add a check before the wait command to verify that exactly 3 OSD
deployments exist with the app=rook-ceph-osd label in the ODF_INSTALL_NAMESPACE,
ensuring the readiness check only passes when all expected OSD deployments are
deployed and available, not just when a subset happens to be ready.
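A count check of the kind suggested above might be sketched as follows. `check_osd_deployment_count` and `OC_BIN` are hypothetical names used only for illustration; the label selector and `ODF_INSTALL_NAMESPACE` come from the comment itself.

```shell
# Hypothetical helper (not from the PR): fail unless the number of
# rook-ceph-osd deployments equals the expected count passed as $1.
check_osd_deployment_count() {
  local expected="$1"
  local oc_bin="${OC_BIN:-oc}"
  local actual
  # "-o name" prints one resource per line, so the line count is the deployment count.
  actual="$("${oc_bin}" get deploy -n "${ODF_INSTALL_NAMESPACE}" \
      -l app=rook-ceph-osd -o name | wc -l | tr -d ' ')"
  if [ "${actual}" -ne "${expected}" ]; then
    echo "ERROR: expected ${expected} rook-ceph-osd deployments, found ${actual}" >&2
    return 1
  fi
}
```

Running `check_osd_deployment_count 3` before the `oc wait deploy` call would make the readiness check fail when only a subset of OSD deployments exists.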
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: c6da7a93-9f1b-4ec6-99fe-201d1365de94
⛔ Files ignored due to path filters (1)
- ci-operator/jobs/medik8s/system-tests/medik8s-system-tests-main-presubmits.yaml is excluded by !ci-operator/jobs/**
📒 Files selected for processing (27)
- ci-operator/config/medik8s/system-tests/medik8s-system-tests-main__4.22-konflux.yaml
- ci-operator/step-registry/medik8s/catalogsource/medik8s-catalogsource-commands.sh
- ci-operator/step-registry/medik8s/catalogsource/medik8s-catalogsource-ref.yaml
- ci-operator/step-registry/medik8s/sbr/OWNERS
- ci-operator/step-registry/medik8s/sbr/hypershift/OWNERS
- ci-operator/step-registry/medik8s/sbr/hypershift/apply-image-sources/OWNERS
- ci-operator/step-registry/medik8s/sbr/hypershift/apply-image-sources/medik8s-sbr-hypershift-apply-image-sources-commands.sh
- ci-operator/step-registry/medik8s/sbr/hypershift/apply-image-sources/medik8s-sbr-hypershift-apply-image-sources-ref.metadata.json
- ci-operator/step-registry/medik8s/sbr/hypershift/apply-image-sources/medik8s-sbr-hypershift-apply-image-sources-ref.yaml
- ci-operator/step-registry/medik8s/sbr/hypershift/medik8s-sbr-hypershift-persistent-workflow.metadata.json
- ci-operator/step-registry/medik8s/sbr/hypershift/medik8s-sbr-hypershift-persistent-workflow.yaml
- ci-operator/step-registry/medik8s/sbr/hypershift/medik8s-sbr-hypershift-workflow.metadata.json
- ci-operator/step-registry/medik8s/sbr/hypershift/medik8s-sbr-hypershift-workflow.yaml
- ci-operator/step-registry/medik8s/sbr/hypershift/restore-kubeconfig/OWNERS
- ci-operator/step-registry/medik8s/sbr/hypershift/restore-kubeconfig/medik8s-sbr-hypershift-restore-kubeconfig-commands.sh
- ci-operator/step-registry/medik8s/sbr/hypershift/restore-kubeconfig/medik8s-sbr-hypershift-restore-kubeconfig-ref.metadata.json
- ci-operator/step-registry/medik8s/sbr/hypershift/restore-kubeconfig/medik8s-sbr-hypershift-restore-kubeconfig-ref.yaml
- ci-operator/step-registry/medik8s/sbr/hypershift/switch-kubeconfig/OWNERS
- ci-operator/step-registry/medik8s/sbr/hypershift/switch-kubeconfig/medik8s-sbr-hypershift-switch-kubeconfig-commands.sh
- ci-operator/step-registry/medik8s/sbr/hypershift/switch-kubeconfig/medik8s-sbr-hypershift-switch-kubeconfig-ref.metadata.json
- ci-operator/step-registry/medik8s/sbr/hypershift/switch-kubeconfig/medik8s-sbr-hypershift-switch-kubeconfig-ref.yaml
- ci-operator/step-registry/medik8s/sbr/hypershift/wait-nodes/OWNERS
- ci-operator/step-registry/medik8s/sbr/hypershift/wait-nodes/medik8s-sbr-hypershift-wait-nodes-commands.sh
- ci-operator/step-registry/medik8s/sbr/hypershift/wait-nodes/medik8s-sbr-hypershift-wait-nodes-ref.metadata.json
- ci-operator/step-registry/medik8s/sbr/hypershift/wait-nodes/medik8s-sbr-hypershift-wait-nodes-ref.yaml
- ci-operator/step-registry/odf/apply-storage-cluster/odf-apply-storage-cluster-commands.sh
- ci-operator/step-registry/odf/apply-storage-cluster/odf-apply-storage-cluster-ref.yaml
fe68ea2 to aea7b02
/pj-rehearse pull-ci-medik8s-system-tests-main-4.22-konflux-e2e-sbr-hypershift-persistent-aws-odf
@maximunited: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
aea7b02 to 53a004a
/pj-rehearse pull-ci-medik8s-system-tests-main-4.22-konflux-e2e-sbr-hypershift-persistent-aws-odf
@maximunited: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
…(hypershift-aws)
Adds e2e-sbr-hypershift-persistent-aws-odf — an optional presubmit that runs
the SBR smoke + acceptance suite on a HyperShift hosted cluster backed by ODF
storage, using the shared persistent management cluster (cluster_profile:
hypershift-aws) instead of provisioning a new management cluster per run.
New workflow medik8s-sbr-hypershift-persistent (in
step-registry/medik8s/sbr/hypershift-persistent/):
  pre:
    ipi-install-rbac
    hypershift-setup-root-management-cluster (~2s vs ~17m for nested)
    hypershift-aws-create
    medik8s-sbr-hypershift-apply-image-sources
  test: (identical to e2e-sbr-hypershift-aws-odf)
    medik8s-sbr-hypershift-switch-kubeconfig
    medik8s-sbr-hypershift-wait-nodes
    medik8s-catalogsource / medik8s-operator-subscribe
    odf-prepare-cluster / operatorhub-subscribe-odf-operator
    odf-apply-storage-cluster
    e2e-test (make run-tests, ECO_TEST_FEATURES=sbr-operator)
  post:
    medik8s-sbr-hypershift-restore-kubeconfig
    hypershift-dump / hypershift-debug / hypershift-k8sgpt
    hypershift-aws-destroy (no destroy-management-cluster)
Job config: cluster_profile=hypershift-aws, m5.4xlarge x3,
ODF stable-4.21, optional=true, trigger=
/test 4.22-konflux-e2e-sbr-hypershift-persistent-aws-odf.
Expected savings vs nested variant: ~23 min/run (management cluster
create + hypershift-install + management cluster destroy eliminated).
Related: openshift#80372 (nested variant; introduces shared step-registry steps)
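The switch-kubeconfig and restore-kubeconfig steps listed in the commit message bracket the test phase: the management kubeconfig is backed up, the hosted cluster's kubeconfig takes its place, and the backup is put back in the post phase. A minimal sketch of that pairing follows. The file names `kubeconfig`, `nested_kubeconfig`, and `mgmt_kubeconfig` and both function names are assumptions for illustration; the actual step scripts are not shown in this conversation.

```shell
# Hypothetical helpers (not from the PR). $1 is a directory holding the
# kubeconfig files; all file names below are assumed for illustration.
switch_kubeconfig() {
  local dir="$1"
  # Keep a copy of the management kubeconfig before overwriting it.
  cp "${dir}/kubeconfig" "${dir}/mgmt_kubeconfig"
  cp "${dir}/nested_kubeconfig" "${dir}/kubeconfig"
}

restore_kubeconfig() {
  local dir="$1"
  # Nothing to restore if the switch step never ran.
  [ -f "${dir}/mgmt_kubeconfig" ] || return 0
  mv "${dir}/mgmt_kubeconfig" "${dir}/kubeconfig"
}
```

Making the restore a no-op when no backup exists lets the post phase run safely even if the job failed before the switch.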
53a004a to e33b27a
[REHEARSALNOTIFIER]
A total of 26 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.
/pj-rehearse pull-ci-medik8s-system-tests-main-4.22-konflux-e2e-sbr-hypershift-persistent-aws-odf
@maximunited: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
@maximunited: The following test failed.
/pj-rehearse pull-ci-medik8s-system-tests-main-4.22-konflux-e2e-sbr-hypershift-persistent-aws-odf
@maximunited: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
/pj-rehearse pull-ci-medik8s-system-tests-main-4.22-konflux-e2e-sbr-hypershift-persistent-aws-odf
What
Adds e2e-sbr-hypershift-persistent-aws-odf, an optional presubmit job for medik8s/system-tests that validates ODF + Storage-Based Remediation (SBR) watchdog fencing on a HyperShift hosted cluster, using the shared persistent management cluster (cluster_profile: hypershift-aws).
The job runs 9 SBR smoke + acceptance tests (non-destructive, platform:any) covering:
- /dev/watchdog*) on hosted cluster workers
- StorageBasedRemediationConfig CR validation and controller behavior
Why
HyperShift workers are real EC2 instances: CephFS, POSIX locking, and softdog (/dev/watchdog) all work normally. This job confirms it under CI before medik8s switches its primary CI topology from IPI to HyperShift.
Using the shared persistent management cluster (hypershift-aws profile) avoids the ~23 min overhead of provisioning and destroying a management cluster per run. Only the HostedCluster is created and destroyed each time.
Changes
New workflow medik8s-sbr-hypershift-persistent
Wraps the standard hypershift-aws provisioning around the medik8s ODF + SBR test sequence:
The step-registry steps under medik8s/sbr/hypershift/ are introduced in #80372 (the nested-cluster variant of this job). Both PRs share the same steps; this PR adds only the new workflow and job entry.
New job e2e-sbr-hypershift-persistent-aws-odf
- cluster_profile: hypershift-aws
- HYPERSHIFT_INSTANCE_TYPE: m5.4xlarge, HYPERSHIFT_NODE_COUNT: "3"
- ODF_OPERATOR_SUB_CHANNEL: stable-4.21 from default redhat-operators
- optional: true, always_run: false
- /test 4.22-konflux-e2e-sbr-hypershift-persistent-aws-odf
Related
Summary by CodeRabbit
This PR extends the OpenShift CI infrastructure for medik8s/system-tests by adding optional Konflux 4.22 presubmit coverage for Storage-Based Remediation (SBR) + ODF on HyperShift, including a persistent management cluster variant to reduce per-run HyperShift provisioning/destruction overhead.
New/updated Konflux 4.22 system test jobs (HyperShift + SBR + ODF)
In ci-operator/config/medik8s/system-tests/medik8s-system-tests-main__4.22-konflux.yaml, the PR adds:
- e2e-sbr-hypershift-aws-odf (optional; cluster_profile: medik8s-aws; workflow: medik8s-sbr-hypershift)
- e2e-sbr-hypershift-persistent-aws-odf (optional; cluster_profile: hypershift-aws; workflow: medik8s-sbr-hypershift-persistent)
Both jobs:
- make run-tests in the e2e-test step.
- ECO_TEST_FEATURES from src.
- ECO_TEST_FEATURES: sbr-operator
- HYPERSHIFT_INSTANCE_TYPE: m5.4xlarge
- HYPERSHIFT_NODE_COUNT: "3"
- OCP_VERSION: "422"
- ODF_OPERATOR_SUB_CHANNEL: stable-4.21
- OO_CHANNEL: stable
- OPERATORS: storage-based-remediation
- SC_WAIT_TIMEOUT: 10m
- SKIP_IDMS: "true"
- medik8s-sbr-hypershift-switch-kubeconfig
- medik8s-sbr-hypershift-wait-nodes
- medik8s-catalogsource
- medik8s-operator-subscribe
- odf-prepare-cluster
- operatorhub-subscribe-odf-operator
- odf-apply-storage-cluster
New persistent HyperShift SBR workflow
Under ci-operator/step-registry/medik8s/sbr/hypershift-persistent/, the PR introduces:
- medik8s-sbr-hypershift-persistent (cluster_profile: hypershift-aws)
Key behavior:
- ipi-install-rbac → hypershift-setup-root-management-cluster → hypershift-aws-create → medik8s-sbr-hypershift-apply-image-sources
- medik8s-sbr-hypershift-restore-kubeconfig → hypershift-dump → hypershift-aws-destroy
Supporting step-registry and ODF readiness logic
The PR also wires the workflow into supporting HyperShift SBR steps and improves related CI primitives:
- ci-operator/step-registry/medik8s/sbr/hypershift/: switch-kubeconfig, wait-nodes, apply-image-sources, restore-kubeconfig
- rhwa-fbc ImageDigestMirrorSet and node/rotation waiting logic to avoid ODF label churn during worker rotation.
- SKIP_IDMS to optionally skip IDMS application in ci-operator/step-registry/medik8s/catalogsource/medik8s-catalogsource-commands.sh (and its ref YAML), intended for HyperShift hosted cluster constraints.
- SC_WAIT_TIMEOUT to ci-operator/step-registry/odf/apply-storage-cluster/odf-apply-storage-cluster-commands.sh, with a fallback readiness check that uses OSD deployment availability when StorageCluster Available cannot be observed in time.
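The fallback readiness check described in the last item can be pictured as a two-stage wait. The sketch below is an assumption-laden illustration, not the script from the PR: `wait_storage_ready` and `OC_BIN` are hypothetical names, the 10m default mirrors the job's `SC_WAIT_TIMEOUT: 10m` setting rather than a confirmed script default, and the 5m fallback timeout is invented for the example.

```shell
# Hypothetical helper (not from the PR): wait for StorageCluster Available,
# then fall back to OSD deployment availability if that times out.
wait_storage_ready() {
  local oc_bin="${OC_BIN:-oc}"
  local timeout="${SC_WAIT_TIMEOUT:-10m}"  # assumed default, see lead-in
  if "${oc_bin}" wait storagecluster --all -n "${ODF_INSTALL_NAMESPACE}" \
        --for=condition=Available --timeout="${timeout}"; then
    return 0
  fi
  echo "WARN: StorageCluster not Available after ${timeout}; checking OSD deployments" >&2
  # The function's exit status is the status of this fallback wait.
  "${oc_bin}" wait deploy -n "${ODF_INSTALL_NAMESPACE}" -l app=rook-ceph-osd \
      --for=condition=Available --timeout=5m
}
```

As the review comment earlier in the thread points out, a label-selector wait only covers deployments that already exist, so a fallback like this is stricter when paired with a check on the expected number of OSD deployments.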