feat(cluster_healthcheck): add cluster health validation role #39

Open
stevefulme1 wants to merge 1 commit into redhat-cop:main from stevefulme1:feat/cluster-healthcheck-role
@stevefulme1
Contributor

Summary

Adds a new cluster_healthcheck role that validates the health of an OpenShift cluster for virtualization migration readiness. The role performs comprehensive checks across six categories and generates an HTML summary report with pass/fail/warning status and actionable recommendations.

Health checks included

  • OCP Node Health - Node Ready status, MemoryPressure/DiskPressure/PIDPressure conditions, allocatable vs capacity ratios, kubevirt.io/schedulable label verification
  • KubeVirt Health - HyperConverged CR conditions (Available/Degraded), virt-operator/controller/handler/api pods, CDI operator and deployment health
  • MTV Health - ForkliftController CR status, MTV operator pods, Provider readiness, failed migration Plans
  • Storage Health - StorageClass enumeration and default verification, CSI driver discovery, PV capacity, pending PVC detection
  • Network Health - Multus pods, NetworkAttachmentDefinitions, OVN-Kubernetes/OpenShiftSDN health, migration network configuration
  • Post-Migration VM - VirtualMachineInstance running state, guest agent reporting, network interface IPs, optional SSH connectivity
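
As a rough illustration of the first category, a node check along the following lines could flag nodes that are not Ready or that report a pressure condition. This is a sketch under assumptions, not the tasks shipped in this PR; the `__sketch_*` variable names are hypothetical.

```yaml
# Sketch only: variable names prefixed with __sketch_ are hypothetical.
- name: Query all cluster nodes
  kubernetes.core.k8s_info:
    api_version: v1
    kind: Node
  register: __sketch_nodes

- name: Collect nodes whose Ready condition is not True
  ansible.builtin.set_fact:
    __sketch_nodes_not_ready: "{{ __sketch_nodes_not_ready | default([]) + [item.metadata.name] }}"
  loop: "{{ __sketch_nodes.resources }}"
  loop_control:
    label: "{{ item.metadata.name }}"
  when: >-
    (item.status | default({})).conditions | default([])
    | selectattr('type', 'equalto', 'Ready')
    | selectattr('status', 'equalto', 'True')
    | list | length == 0

- name: Collect nodes reporting any pressure condition
  ansible.builtin.set_fact:
    __sketch_nodes_under_pressure: "{{ __sketch_nodes_under_pressure | default([]) + [item.metadata.name] }}"
  loop: "{{ __sketch_nodes.resources }}"
  loop_control:
    label: "{{ item.metadata.name }}"
  when: >-
    (item.status | default({})).conditions | default([])
    | selectattr('type', 'in', ['MemoryPressure', 'DiskPressure', 'PIDPressure'])
    | selectattr('status', 'equalto', 'True')
    | list | length > 0
```

Condition `status` values are strings in the Kubernetes API, so the comparison is against `'True'` rather than a boolean.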

Files added

roles/cluster_healthcheck/
├── defaults/main.yml
├── meta/main.yml
├── README.md
├── tasks/
│   ├── main.yml
│   ├── ocp_node_health.yml
│   ├── kubevirt_health.yml
│   ├── mtv_health.yml
│   ├── storage_health.yml
│   ├── network_health.yml
│   ├── post_migration_vm.yml
│   └── report.yml
├── templates/
│   └── cluster_healthcheck_report.html.j2
├── tests/
│   ├── inventory
│   └── test.yml
└── vars/main.yml
playbooks/cluster_healthcheck.yml

Design decisions

  • Follows existing validate_migration role patterns (task naming, k8s_info usage, variable prefixing)
  • All variables prefixed with cluster_healthcheck_ per collection convention
  • Private/internal variables use __cluster_healthcheck_ double-underscore prefix
  • Uses FQCNs throughout (kubernetes.core.k8s_info, ansible.builtin.*)
  • Check list is configurable via cluster_healthcheck_checks default
  • Post-migration VM checks are opt-in via cluster_healthcheck_post_migration_vms
  • HTML report includes per-category breakdown with recommendations
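
To make the configurable check list and the opt-in VM checks concrete, the wiring might look roughly like the sketch below. The variable names `cluster_healthcheck_checks` and `cluster_healthcheck_post_migration_vms` and the task file names come from this PR; the default values and the dispatch logic are assumptions for illustration, not a copy of the role.

```yaml
# defaults/main.yml (illustrative values, assumed rather than copied from the PR)
---
cluster_healthcheck_checks:
  - ocp_node_health
  - kubevirt_health
  - mtv_health
  - storage_health
  - network_health

# Post-migration VM checks stay disabled while this list is empty.
cluster_healthcheck_post_migration_vms: []

# tasks/main.yml (hypothetical dispatch logic)
---
- name: Run each enabled health check
  ansible.builtin.include_tasks: "{{ item }}.yml"
  loop: "{{ cluster_healthcheck_checks }}"

- name: Run post-migration VM checks only when VMs are listed
  ansible.builtin.include_tasks: post_migration_vm.yml
  when: cluster_healthcheck_post_migration_vms | length > 0
```

With this shape, a user could skip a category by overriding `cluster_healthcheck_checks` with a shorter list.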

Testing

  • ansible-lint --profile production passes with 0 errors on the role (playbook FQCN resolution matches existing collection behavior)

@stevefulme1 stevefulme1 requested a review from sabre1041 as a code owner May 21, 2026 21:56
@stevefulme1 stevefulme1 deployed to external-ci May 21, 2026 21:56 — with GitHub Actions Active
Contributor

@sabre1041 sabre1041 left a comment

Review the issues that are being reported.

Also, please review the conflicted files.

kind: Pod
namespace: "{{ cluster_healthcheck_kubevirt_namespace }}"
label_selectors:
  - "app=cdi-operator"
This label does not match what is deployed

kind: Pod
namespace: "{{ cluster_healthcheck_kubevirt_namespace }}"
label_selectors:
  - "app=cdi-deployment"
This label does not match what is deployed
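
One way to confirm which labels the deployed CDI pods actually carry, before settling on a selector, is to list the pods and print their labels. The sketch below assumes the CDI pod names start with `cdi-`; the `__sketch_pods` variable is hypothetical.

```yaml
# Sketch only: lists pods and prints labels so a selector can be chosen from real data.
- name: List pods in the KubeVirt namespace
  kubernetes.core.k8s_info:
    api_version: v1
    kind: Pod
    namespace: "{{ cluster_healthcheck_kubevirt_namespace }}"
  register: __sketch_pods

- name: Show the labels carried by CDI pods
  ansible.builtin.debug:
    msg: "{{ item.metadata.name }}: {{ item.metadata.labels | default({}) }}"
  loop: "{{ __sketch_pods.resources }}"
  loop_control:
    label: "{{ item.metadata.name }}"
  # Assumption: CDI pod names begin with "cdi-".
  when: item.metadata.name is match('cdi-')
```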


- name: mtv_health | Evaluate Provider readiness
  ansible.builtin.set_fact:
    __cluster_healthcheck_providers_not_ready: >-
This is not reporting correctly. Both providers are Ready in my testing environment
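
For reference, a readiness evaluation that keys off the `Ready` condition could look roughly like this. It is a sketch, not the fix applied in the PR: it assumes Provider resources are served under `forklift.konveyor.io/v1beta1` and expose a `Ready` entry in `status.conditions`, and the `__sketch_*` names are hypothetical.

```yaml
# Sketch only: treats a Provider as ready when it has a Ready condition with status "True".
- name: Query Forklift Provider resources
  kubernetes.core.k8s_info:
    api_version: forklift.konveyor.io/v1beta1
    kind: Provider
    namespace: "{{ cluster_healthcheck_mtv_namespace }}"
  register: __sketch_providers

- name: Collect Providers that lack a Ready=True condition
  ansible.builtin.set_fact:
    __sketch_providers_not_ready: "{{ __sketch_providers_not_ready | default([]) + [item.metadata.name] }}"
  loop: "{{ __sketch_providers.resources }}"
  loop_control:
    label: "{{ item.metadata.name }}"
  when: >-
    (item.status | default({})).conditions | default([])
    | selectattr('type', 'equalto', 'Ready')
    | selectattr('status', 'equalto', 'True')
    | list | length == 0
```

Comparing against the string `'True'` matters here: a boolean comparison would never match a condition status and could mark every Provider as not ready.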

| selectattr('status.phase', 'equalto', 'Running')
| list | length) }}

- name: network_health | Check migration network configuration
This should only be checked if one has been defined in the HyperConverged CR
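
A conditional along these lines might be sketched as follows. It assumes the migration network is set at `spec.liveMigrationConfig.network` in the HyperConverged CR and that the CR lives in the namespace held by `cluster_healthcheck_kubevirt_namespace`; the `__sketch_*` names are hypothetical.

```yaml
# Sketch only: reads the HyperConverged CR and gates the check on a configured network.
- name: Read the HyperConverged CR
  kubernetes.core.k8s_info:
    api_version: hco.kubevirt.io/v1beta1
    kind: HyperConverged
    namespace: "{{ cluster_healthcheck_kubevirt_namespace }}"
  register: __sketch_hco

- name: Record the configured migration network, or an empty string
  ansible.builtin.set_fact:
    __sketch_migration_network: >-
      {{ (((__sketch_hco.resources | first | default({})).spec
      | default({})).liveMigrationConfig
      | default({})).network
      | default('') }}

- name: Report the migration network only when one is configured
  ansible.builtin.debug:
    msg: "Migration network configured: {{ __sketch_migration_network }}"
  when: __sketch_migration_network | length > 0
```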

  kubernetes.core.k8s_info:
    api_version: k8s.cni.cncf.io/v1
    kind: NetworkAttachmentDefinition
    namespace: "{{ cluster_healthcheck_mtv_namespace }}"
This should check in the openshift-cnv namespace
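
A lookup scoped to that namespace could be sketched as below. It assumes `cluster_healthcheck_kubevirt_namespace` resolves to `openshift-cnv` and that a hypothetical `__sketch_migration_network` variable holds the NetworkAttachmentDefinition name taken from the HyperConverged CR (empty when none is configured).

```yaml
# Sketch only: __sketch_migration_network is a hypothetical variable holding the NAD name.
- name: Look up the migration NetworkAttachmentDefinition
  kubernetes.core.k8s_info:
    api_version: k8s.cni.cncf.io/v1
    kind: NetworkAttachmentDefinition
    namespace: "{{ cluster_healthcheck_kubevirt_namespace }}"
    name: "{{ __sketch_migration_network }}"
  register: __sketch_migration_nad
  when: __sketch_migration_network | default('') | length > 0

- name: Flag a configured migration network whose NAD is missing
  ansible.builtin.debug:
    msg: "NAD {{ __sketch_migration_network }} not found in {{ cluster_healthcheck_kubevirt_namespace }}"
  when:
    - __sketch_migration_network | default('') | length > 0
    - __sketch_migration_nad.resources | default([]) | length == 0
```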

Adds a cluster_healthcheck role that validates OpenShift cluster health
for virtualization migration readiness across six categories: OCP nodes,
KubeVirt, MTV, storage, network, and post-migration VMs.

Generates an HTML summary report with pass/fail/warning status.

Review feedback addressed:
- Fix CDI pod labels to use app.kubernetes.io/component selectors
- Fix Provider readiness to correctly detect Ready condition status
- Make migration network check conditional on HyperConverged CR config
- Check migration NAD in openshift-cnv namespace, not openshift-mtv
- Drop unrelated scaffolding file changes (CODE_OF_CONDUCT, etc.)
@stevefulme1 stevefulme1 force-pushed the feat/cluster-healthcheck-role branch from d4928cd to 51d077e Compare May 22, 2026 12:39