The Managed Upgrade Operator is a Kubernetes operator written in Go that manages automated in-place cluster upgrades for OpenShift Dedicated Platform (OSD) and Azure Red Hat OpenShift (ARO). It orchestrates the upgrade process through pre-upgrade validation, capacity management, maintenance windows, and post-upgrade verification.
- Language: Go 1.23.0 (with toolchain go1.23.9)
- Framework: Kubernetes Operator built with `operator-sdk` v1.21.0
- Controller Framework: `sigs.k8s.io/controller-runtime` v0.20.0
- Testing: Ginkgo v2 (BDD testing framework) + Gomega (matcher/assertion library)
- Mocking: GoMock for interface mocking
- Container Base: UBI9 minimal (`registry.access.redhat.com/ubi9/ubi-minimal:9.6-1755695350`)
- OpenShift APIs: `github.com/openshift/api`, `github.com/openshift/client-go`
- Cluster Version Operator: `github.com/openshift/cluster-version-operator`
- OCM SDK: `github.com/openshift-online/ocm-sdk-go` v0.1.494+ (official SDK for managed cluster integration)
  - Uses typed models from the `clustersmgmt/v1` and `servicelogs/v1` packages
  - Replaces legacy custom OCM model structs
- Prometheus: Metrics collection and alerting management
- Kubernetes APIs: `k8s.io/api`, `k8s.io/client-go`, `k8s.io/apimachinery`
managed-upgrade-operator/
├── api/v1alpha1/ # Custom Resource Definitions (CRDs)
│ └── upgradeconfig_types.go # UpgradeConfig CRD definition
├── controllers/ # Kubernetes controllers
│ ├── upgradeconfig/ # Main upgrade orchestration controller
│ ├── nodekeeper/ # Node upgrade tracking controller
│ └── machineconfigpool/ # MachineConfigPool controller
├── pkg/ # Core business logic packages
│ ├── upgraders/ # Cluster upgrader implementations (OSD, ARO)
│ ├── upgradesteps/ # Individual upgrade step implementations
│ ├── clusterversion/ # CVO interaction utilities
│ ├── drain/ # Node draining strategies
│ ├── scaler/ # Node capacity scaling
│ ├── maintenance/ # Maintenance window management
│ ├── alertmanager/ # Alert silencing management
│ ├── notifier/ # Notification systems
│ ├── validation/ # Pre/post upgrade validations
│ ├── scheduler/ # Upgrade scheduling logic
│ ├── metrics/ # Prometheus metrics
│ └── configmanager/ # Configuration management
├── test/ # Test infrastructure
│ ├── deploy/ # Test deployment manifests
│ └── e2e/ # End-to-end tests
├── deploy/ # Production deployment manifests
├── docs/ # Documentation
├── boilerplate/ # Shared build tooling
├── build/ # Container build files
│ └── Dockerfile # Multi-stage container build
└── hack/ # Build and maintenance scripts
The operator is driven by UpgradeConfig CRs that define:
- Target OpenShift version (`spec.desired.version`)
- Upgrade channel (`spec.desired.channel`)
- Upgrade start time (`spec.upgradeAt`)
- Drain timeout (`spec.PDBForceDrainTimeout`)
- Capacity reservation needs (`spec.capacityReservation`)
- Upgrade type: OSD or ARO (`spec.type`)
OCM API Integration:
- Uses official `ocm-sdk-go` with typed models
- Supports proxy environments (`HTTP_PROXY`, `HTTPS_PROXY`, `NO_PROXY`)
- Automatic retry with exponential backoff (5 retries, 2s initial delay, 30% jitter)
- Enhanced timeouts for high-latency environments (30s connection, 10s TLS handshake)
Client Implementations:
- `pkg/ocm`: External OCM API client (api.openshift.com) with proxy support
- `pkg/ocmagent`: Local OCM Agent client (cluster service) without proxy
- `pkg/dvo`: Deployment Validation Operator client with proxy support
- `pkg/maintenance`: AlertManager client with proxy support
- `pkg/metrics`: Prometheus metrics client with proxy support
- UpgradeConfig Controller: Main orchestrator that processes UpgradeConfig CRs
- NodeKeeper Controller: Monitors and remediates stuck node upgrades
- MachineConfigPool Controller: Tracks MCP upgrade progress
- Pre-upgrade: Health checks, capacity reservation, maintenance windows
- Control Plane Upgrade: Triggers CVO to upgrade control plane
- Worker Node Upgrade: Manages worker node drain and upgrade
- Post-upgrade: Cleanup, health validation, notifications
# Install tool dependencies
make tools
# Build the operator binary
make go-build
# Run tests
make go-test
# Run linting
make go-check
# Generate code (mocks, CRDs)
make generate
# Update boilerplate
make boilerplate-update
# Run operator locally (requires cluster access)
make run
# Build container image
make docker-build IMG=quay.io/<username>/managed-upgrade-operator:latest
# Install required tools
make tools
# For macOS developers (cross-compile for Linux)
make go-mac-build
# Run with boilerplate container (recommended for consistency)
./boilerplate/_lib/container-make
# Run locally with standard namespace
make run-standard
# Run locally using cluster routes (easier for development)
make run-standard-routes
- Unit tests use the Ginkgo v2 BDD framework with Gomega assertions
- Tests are located alongside the source code with a `_test.go` suffix
- Run with: `make go-test` or `go test ./...`
# Regenerate all mocks (recommended approach)
./boilerplate/_lib/container-make generate
# Manual mock generation example
mockgen -package mocks -destination=util/mocks/cr-client.go sigs.k8s.io/controller-runtime/pkg/client Client,StatusWriter,Reader,Writer
- E2E tests are located in `test/e2e/`
- Uses Ginkgo for E2E test orchestration
- Deployment manifests in `test/deploy/`
- `OPERATOR_NAMESPACE`: Namespace where the operator runs (default: "openshift-managed-upgrade-operator")
- `WATCH_NAMESPACE`: Namespaces to watch for resources (default: all namespaces)
The operator requires extensive cluster-level permissions defined in:
- `deploy/cluster_role.yaml`
- `test/deploy/managed_upgrade_role.yaml`
- Various monitoring and pull-secret reader roles
- Boilerplate Framework: Uses `app-sre/boilerplate` for standardized builds
- Tekton Pipelines: `.tekton/` directory contains pipeline definitions
- CI Operator: `.ci-operator.yaml` configures OpenShift CI
- Container Registry: Builds push to Quay.io
- golangci-lint: Static analysis (config in `.golangci.yml`)
- Unit Tests: Ginkgo test suite
- E2E Tests: Full cluster upgrade testing
- Security: FIPS-enabled builds (`FIPS_ENABLED=true`)
- Setup: Run `make tools` to install dependencies
- Code: Implement changes in the appropriate `pkg/` subdirectories
- Test: Add/update tests using the Ginkgo framework
- Validate: Run `make go-check` for linting and `make go-test` for tests
- Build: Use `make go-build` to compile
- Deploy: Use local deployment with `make run-standard-routes`
# Create UpgradeConfig CR
oc apply -f - <<EOF
apiVersion: upgrade.managed.openshift.io/v1alpha1
kind: UpgradeConfig
metadata:
  name: managed-upgrade-config
spec:
  type: "OSD"
  upgradeAt: "2024-01-01T12:00:00Z"
  PDBForceDrainTimeout: 60
  desired:
    channel: "fast-4.14"
    version: "4.14.15"
EOF
# Monitor upgrade progress
oc get upgradeconfig -w
oc describe upgradeconfig managed-upgrade-config
- Interfaces: Define interfaces in the appropriate `pkg/` subdirectory
- Implementation: Implement concrete types
- Tests: Add comprehensive unit tests
- Mocks: Generate mocks using `make generate`
- Integration: Wire into controller or upgrade step chain
- `main.go`: Operator bootstrap, controller setup, metrics configuration
- `pkg/upgraders/upgrader.go`: ClusterUpgrader interface
- `pkg/upgradesteps/runner.go`: UpgradeStep interface
- `pkg/configmanager/configmanager.go`: Configuration management
- `controllers/upgradeconfig/upgradeconfig_controller.go`: Main reconciler
- `controllers/nodekeeper/nodekeeper_controller.go`: Node management
- `controllers/machineconfigpool/machineconfigpool_controller.go`: MCP tracking
- Prometheus metrics defined in `pkg/metrics/`
- Custom metrics service and ServiceMonitor auto-created
- Exposes upgrade progress, duration, success/failure rates
- Integrates with AlertManager for alert silencing during upgrades
- Custom alerts for stuck upgrades and operator health
- FIPS Compliance: Builds with FIPS-enabled Go toolchain
- Security Context: Runs as non-root user (UID 1001)
- RBAC: Minimal required permissions with separate roles for different functions
- Container Security: Uses Red Hat UBI minimal base images
- `docs/development.md`: Detailed development setup and workflows
- `docs/testing.md`: Comprehensive testing guide
- `docs/design.md`: Architecture and design documentation
- `docs/metrics.md`: Prometheus metrics reference
- `README.md`: Project overview and basic usage
The Managed Upgrade Operator is a production-grade Kubernetes operator that combines staged upgrade orchestration, Ginkgo-based unit and E2E testing, and FIPS-enabled, non-root builds. The codebase integrates closely with OpenShift's upgrade infrastructure, including the Cluster Version Operator and MachineConfigPools.