feat(operator): support persistent storage for managed Alertmanager#1920
feat(operator): support persistent storage for managed Alertmanager#1920brtkwr wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces persistent storage support for the managed Alertmanager by adding a storage field to the ManagedAlertmanagerSpec. The operator now reconciles this configuration by provisioning a PersistentVolumeClaim (PVC) and updating the Alertmanager StatefulSet to use it, while supporting volume expansion and reverting to emptyDir if the spec is removed. Feedback focused on improving the robustness of the reconciliation logic by using Patch instead of Update to avoid concurrency conflicts and enhancing the volume source comparison to include fields like ReadOnly and SizeLimit to ensure manual changes are correctly corrected.
983af0d to
f84fce1
Compare
|
fixed the CI failures in f84fce1:
govulncheck is still expected to fail — it reports |
f84fce1 to
0f9c732
Compare
Closes GoogleCloudPlatform#685. Add a `managedAlertmanager.storage` field on OperatorConfig that opts the managed Alertmanager into a PersistentVolumeClaim-backed data directory. When set, the operator: * provisions a PVC named `alertmanager-data` in the operator namespace from a fully configurable spec (every PersistentVolumeClaim field is exposed via an embedded structure that mirrors prometheus-operator's `EmbeddedPersistentVolumeClaim`), including caller-supplied labels, annotations and finalizers for snapshot/backup tooling integration; * rewrites the StatefulSet's `alertmanager-data` volume from emptyDir to a PVC reference so silences, the notification log and inhibitions survive pod restarts; * applies storage expansion in place when the requested size grows (subject to the StorageClass allowing volume expansion); * logs and ignores shrink requests and changes to PVC fields that Kubernetes treats as immutable (access modes, storage class, selector). Writes go through controller-runtime's strategic-merge Patch (off a snapshot of the original object) rather than Update, so we send only the diff, never blast over fields owned by another controller (e.g. addon-manager-set annotations on the managed Alertmanager StatefulSet), and avoid optimistic-concurrency conflicts (HTTP 409) on races with the cache resync. `volumeSourcesEqual` compares the full set of fields the operator manages — for emptyDir that's medium and sizeLimit, for PVC sources that's ClaimName and ReadOnly — so manual edits to those fields reconcile back to the operator-desired shape rather than being silently preserved. When the field is unset the StatefulSet falls back to the manifest default (emptyDir), preserving existing behavior for clusters that have not opted in. Removing a previously-set storage field restores emptyDir on the pod template but deliberately leaves the PVC in place to avoid accidental silence loss — users wanting to reclaim storage delete the PVC manually. Tests cover: * nil spec preserving emptyDir and not creating a PVC; * setting the spec provisioning a PVC with caller metadata and swapping the volume to the PVC reference; * expanding storage requests patching the existing PVC; * shrink requests being logged and ignored; * removing the spec restoring emptyDir while keeping the PVC.
0f9c732 to
e52e99d
Compare
Closes #685.
What
Adds a
managedAlertmanager.storagefield toOperatorConfigthat opts the managed Alertmanager into a PersistentVolumeClaim-backed data directory. With this, silences, the notification log, and inhibitions survive pod restarts — addressing the long-standing pain reported in #685 where these are wiped on every restart of theemptyDir-backed managed instance.API
The embedded
volumeClaimstructure mirrors prometheus-operator'sEmbeddedPersistentVolumeClaim, so every PersistentVolumeClaim field is configurable (access modes, storage class, resources, selector, volumeMode, dataSource, dataSourceRef) along with caller-supplied labels, annotations, and finalizers for VolumeSnapshot tooling and storage-class provisioners that read PVC metadata.Behaviour
storageunset — StatefulSet falls back to the manifest default (emptyDir). Existing deployments are unaffected; the change is fully opt-in.storageset, no PVC exists — operator creates the PVCalertmanager-datain the operator namespace from the supplied spec (defaultingaccessModesto[ReadWriteOnce]when omitted), and rewrites the StatefulSet'salertmanager-datavolume to reference it. StatefulSet rolls.storage.spec.resources.requests.storagegrown — operator patches the existing PVC's storage request. Actual expansion is subject to the StorageClass allowing volume expansion (allowVolumeExpansion: true).storage.spec.resources.requests.storageshrunk — logged and ignored (Kubernetes does not support PVC shrinking).storageremoved after being set — pod template falls back toemptyDirbut the operator leaves the PVC in place so users don't lose silences from an accidental config removal. Manualkubectl delete pvc alertmanager-data -n gmp-systemto fully reset.Files
pkg/operator/apis/monitoring/v1/operator_types.go—AlertmanagerStorageSpec,EmbeddedPersistentVolumeClaim,EmbeddedObjectMetadata+ the newStoragefield onManagedAlertmanagerSpec.pkg/operator/operator_config.go—reconcileAlertmanagerStorage,ensureAlertmanagerPVC,setAlertmanagerDataVolumeand helpers.pkg/operator/operator_config_test.go— five-case test for the reconciliation matrix (nil spec, fresh provision, expansion, shrink ignore, spec removal restoring emptyDir).pkg/operator/apis/monitoring/v1/zz_generated.deepcopy.go— hand-edited DeepCopy methods for the new types (existing pattern is hand-edited rather than fully autogenerated for this file).charts/operator/crds/monitoring.googleapis.com_operatorconfigs.yaml— regenerated viacontroller-gen crd.doc/api.md— documented new fields and types.Test plan
go test ./pkg/operator/...— all existing tests still pass + 5 new sub-tests forTestEnsureAlertmanagerStatefulSet_Storage.make presubmitin-Docker; happy to fix any presubmit-only issues a reviewer surfaces. Couldn't get a hosted Docker daemon up locally.Multi-replica note
The managed Alertmanager StatefulSet ships with a single replica, so a per-claim PVC +
ReadWriteOnceis sufficient. If/when the managed instance moves to multi-replica HA, this design would need to migrate tovolumeClaimTemplatesper-replica with gossip enabled. That's deliberately out of scope here — the goal is to unblock the silence-persistence pain reported in #685 for the deployment shape that ships today.cc @bwplotka @lyanco @bernot-dev @pintohutch — happy to iterate on the API shape if you'd prefer a different pattern.