Skip to content

feat(operator): support persistent storage for managed Alertmanager#1920

Open
brtkwr wants to merge 1 commit into
GoogleCloudPlatform:mainfrom
brtkwr:managed-alertmanager-pvc
Open

feat(operator): support persistent storage for managed Alertmanager#1920
brtkwr wants to merge 1 commit into
GoogleCloudPlatform:mainfrom
brtkwr:managed-alertmanager-pvc

Conversation

@brtkwr
Copy link
Copy Markdown
Contributor

@brtkwr brtkwr commented May 14, 2026

Closes #685.

What

Adds a managedAlertmanager.storage field to OperatorConfig that opts the managed Alertmanager into a PersistentVolumeClaim-backed data directory. With this, silences, the notification log, and inhibitions survive pod restarts — addressing the long-standing pain reported in #685 where these are wiped on every restart of the emptyDir-backed managed instance.

API

apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  name: config
  namespace: gmp-public
managedAlertmanager:
  configSecret:
    key: alertmanager.yaml
    name: alertmanager
  storage:
    volumeClaim:
      metadata:
        labels:
          backup.example.com/include: "true"
        annotations:
          backup.example.com/policy: "daily"
      spec:
        accessModes: [ReadWriteOnce]      # default if omitted
        storageClassName: standard-rwo    # cluster default if omitted
        resources:
          requests:
            storage: 5Gi

The embedded volumeClaim structure mirrors prometheus-operator's EmbeddedPersistentVolumeClaim, so every PersistentVolumeClaim field is configurable (access modes, storage class, resources, selector, volumeMode, dataSource, dataSourceRef) along with caller-supplied labels, annotations, and finalizers for VolumeSnapshot tooling and storage-class provisioners that read PVC metadata.

Behaviour

  • storage unset — StatefulSet falls back to the manifest default (emptyDir). Existing deployments are unaffected; the change is fully opt-in.
  • storage set, no PVC exists — operator creates the PVC alertmanager-data in the operator namespace from the supplied spec (defaulting accessModes to [ReadWriteOnce] when omitted), and rewrites the StatefulSet's alertmanager-data volume to reference it. StatefulSet rolls.
  • storage.spec.resources.requests.storage grown — operator patches the existing PVC's storage request. Actual expansion is subject to the StorageClass allowing volume expansion (allowVolumeExpansion: true).
  • storage.spec.resources.requests.storage shrunk — logged and ignored (Kubernetes does not support PVC shrinking).
  • Immutable field changes — caller-side changes to access modes, storage class, selector etc. after the PVC is bound are silently ignored by the operator to avoid update errors. Documented as such.
  • storage removed after being set — pod template falls back to emptyDir but the operator leaves the PVC in place so users don't lose silences from an accidental config removal. Manual kubectl delete pvc alertmanager-data -n gmp-system to fully reset.

Files

  • pkg/operator/apis/monitoring/v1/operator_types.goAlertmanagerStorageSpec, EmbeddedPersistentVolumeClaim, EmbeddedObjectMetadata + the new Storage field on ManagedAlertmanagerSpec.
  • pkg/operator/operator_config.goreconcileAlertmanagerStorage, ensureAlertmanagerPVC, setAlertmanagerDataVolume and helpers.
  • pkg/operator/operator_config_test.go — five-case test for the reconciliation matrix (nil spec, fresh provision, expansion, shrink ignore, spec removal restoring emptyDir).
  • pkg/operator/apis/monitoring/v1/zz_generated.deepcopy.go — hand-edited DeepCopy methods for the new types (existing pattern is hand-edited rather than fully autogenerated for this file).
  • charts/operator/crds/monitoring.googleapis.com_operatorconfigs.yaml — regenerated via controller-gen crd.
  • doc/api.md — documented new fields and types.

Test plan

  • go test ./pkg/operator/... — all existing tests still pass + 5 new sub-tests for TestEnsureAlertmanagerStatefulSet_Storage.
  • I haven't been able to run the full make presubmit in-Docker; happy to fix any presubmit-only issues a reviewer surfaces. Couldn't get a hosted Docker daemon up locally.

Multi-replica note

The managed Alertmanager StatefulSet ships with a single replica, so a per-claim PVC + ReadWriteOnce is sufficient. If/when the managed instance moves to multi-replica HA, this design would need to migrate to volumeClaimTemplates per-replica with gossip enabled. That's deliberately out of scope here — the goal is to unblock the silence-persistence pain reported in #685 for the deployment shape that ships today.

cc @bwplotka @lyanco @bernot-dev @pintohutch — happy to iterate on the API shape if you'd prefer a different pattern.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces persistent storage support for the managed Alertmanager by adding a storage field to the ManagedAlertmanagerSpec. The operator now reconciles this configuration by provisioning a PersistentVolumeClaim (PVC) and updating the Alertmanager StatefulSet to use it, while supporting volume expansion and reverting to emptyDir if the spec is removed. Feedback focused on improving the robustness of the reconciliation logic by using Patch instead of Update to avoid concurrency conflicts and enhancing the volume source comparison to include fields like ReadOnly and SizeLimit to ensure manual changes are correctly corrected.

Comment thread pkg/operator/operator_config.go Outdated
Comment thread pkg/operator/operator_config.go Outdated
Comment thread pkg/operator/operator_config.go
@brtkwr brtkwr force-pushed the managed-alertmanager-pvc branch 2 times, most recently from 983af0d to f84fce1 Compare May 15, 2026 07:21
@brtkwr
Copy link
Copy Markdown
Contributor Author

brtkwr commented May 15, 2026

fixed the CI failures in f84fce1:

  • conform: commit message had "behaviour" — flipped to "behavior" (US spelling)
  • golangci-lint mapsloop: replaced both for k, v := range m { out[k] = v } loops in mergeLabels and copyMap with maps.Copy (the maps package was already imported)
  • validate: the diff that was triggering CHECK=1 make regen was the same mapsloop fix not being committed yet — should be clean now

govulncheck is still expected to fail — it reports GO-2026-4918 in golang.org/x/net v0.48.0 reached via e2e/kube/pod.go:44:28, which my PR doesn't touch. happy to bundle the golang.org/x/net bump (to v0.53.0) into this PR if maintainers prefer, but flagging it as out-of-scope since it would affect any PR against current main

@brtkwr
Copy link
Copy Markdown
Contributor Author

brtkwr commented May 15, 2026

stacked on #1921 (the golang.org/x/net bump). force-pushed to rebase onto that branch — should clear the govulncheck failure once #1921 lands. once #1921 merges to main, this PR's base will collapse to a single commit naturally.

Closes GoogleCloudPlatform#685.

Add a `managedAlertmanager.storage` field on OperatorConfig that opts the
managed Alertmanager into a PersistentVolumeClaim-backed data directory.
When set, the operator:

  * provisions a PVC named `alertmanager-data` in the operator namespace
    from a fully configurable spec (every PersistentVolumeClaim field is
    exposed via an embedded structure that mirrors prometheus-operator's
    `EmbeddedPersistentVolumeClaim`), including caller-supplied labels,
    annotations and finalizers for snapshot/backup tooling integration;
  * rewrites the StatefulSet's `alertmanager-data` volume from emptyDir
    to a PVC reference so silences, the notification log and
    inhibitions survive pod restarts;
  * applies storage expansion in place when the requested size grows
    (subject to the StorageClass allowing volume expansion);
  * logs and ignores shrink requests and changes to PVC fields that
    Kubernetes treats as immutable (access modes, storage class,
    selector).

Writes go through controller-runtime's strategic-merge Patch (off a
snapshot of the original object) rather than Update, so we send only
the diff, never blast over fields owned by another controller (e.g.
addon-manager-set annotations on the managed Alertmanager StatefulSet),
and avoid optimistic-concurrency conflicts (HTTP 409) on races with the
cache resync.

`volumeSourcesEqual` compares the full set of fields the operator
manages — for emptyDir that's medium and sizeLimit, for PVC sources
that's ClaimName and ReadOnly — so manual edits to those fields
reconcile back to the operator-desired shape rather than being silently
preserved.

When the field is unset the StatefulSet falls back to the manifest
default (emptyDir), preserving existing behavior for clusters that have
not opted in. Removing a previously-set storage field restores emptyDir
on the pod template but deliberately leaves the PVC in place to avoid
accidental silence loss — users wanting to reclaim storage delete the
PVC manually.

Tests cover:
  * nil spec preserving emptyDir and not creating a PVC;
  * setting the spec provisioning a PVC with caller metadata and
    swapping the volume to the PVC reference;
  * expanding storage requests patching the existing PVC;
  * shrink requests being logged and ignored;
  * removing the spec restoring emptyDir while keeping the PVC.
@brtkwr brtkwr force-pushed the managed-alertmanager-pvc branch from 0f9c732 to e52e99d Compare May 15, 2026 09:34
@brtkwr
Copy link
Copy Markdown
Contributor Author

brtkwr commented May 15, 2026

rebased back onto main and dropped the dep bump (now in #1921). this PR is now a single commit again. govulncheck will fail here until #1921 merges — unrelated to this change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Persisting silences in alertmanager

1 participant