Skip to content

Operator: add a StorageVersionMigrator controller to clean up storedVersions #4969

@ChrisJBurns

Description

@ChrisJBurns

Problem statement

PR #4849 graduated our CRDs from v1alpha1 to v1beta1 using a multi-version serving strategy:

  • Both v1alpha1 and v1beta1 are served; v1beta1 is the storage version.
  • conversion.strategy: None — safe because the schemas are identical.
  • Users can upgrade without deleting resources and migrate their manifests at their own pace.

As users re-apply their manifests at v1beta1, each CRD's status.storedVersions ends up looking like ["v1alpha1", "v1beta1"]. That state is correct and working, but it blocks a future release from dropping v1alpha1 entirely: the API server refuses to remove a version from spec.versions while it's still listed in status.storedVersions, since doing so would orphan objects in etcd that nothing can deserialise.

Kubernetes does not garbage-collect storedVersions automatically. Someone has to re-store every object at the current storage version and then explicitly patch storedVersions to drop the old entry.

Motivation

Before the release that graduates CRDs to v1beta2 (or any future storage version), every cluster's status.storedVersions on the ToolHive CRDs must contain only the current storage version. Without automation, users would have to either:

  1. Re-apply every CR manually (error-prone at scale), or
  2. Deploy kube-storage-version-migrator as a separate controller and run a migration per CRD.

Neither is a good experience for users who just want to upgrade ToolHive. Since the operator already watches every ToolHive CRD, it can do this itself — and ship the migrator one release before any version removal, so the storedVersions cleanup happens quietly in the background during the deprecation window.

Goal

Add a StorageVersionMigrator controller to the ToolHive operator that, for each opted-in toolhive.stacklok.dev CRD:

  1. Detects when status.storedVersions contains versions other than the current storage version.
  2. Re-stores every CR of that kind so the API server re-encodes it at the storage version.
  3. Once all CRs are re-stored, patches status.storedVersions down to [<currentStorageVersion>].

The controller is generic (works for any current-and-future storage-version transition) and scoped per-CRD via an opt-in label so future CRDs that shouldn't be migrated stay out of scope automatically.

Initial rollout policy: the controller ships disabled by default (ENABLE_STORAGE_VERSION_MIGRATOR=false) for one release. This gives the feature a deprecation window where it can be opted into by admins who want it, before we flip the default to on alongside the user-facing docs in a later release.

Proposed solution

A single controller (cmd/thv-operator/controllers/storageversionmigrator_controller.go) that watches CustomResourceDefinition objects, filters to a specific scope, and performs the migration per CRD.

Scoping: opt-in label on the CRD

The controller watches CRDs matching:

  • Group filter: spec.group == "toolhive.stacklok.dev".
  • Opt-in label: metadata.labels["toolhive.stacklok.dev/auto-migrate-storage-version"] == "true".

The label is set at CRD-generation time via a one-line kubebuilder marker on each type:

// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true
type MCPServer struct { ... }

task operator-manifests bakes the label into the generated CRD YAML. All 12 current v1beta1 root types get the marker. New CRDs opt in by adding the same marker; CRDs that shouldn't be migrated omit it.

To prevent the "silent forgot-the-marker" failure mode, a CI test walks cmd/thv-operator/api/v1beta1/ and fails if any +kubebuilder:object:root=true type lacks either the migrate marker or an explicit // +thv:storage-version-migrator:exclude sibling marker.

Write path: plain Get+Update on the main resource

To re-encode a CR at the current storage version, the controller issues a plain Get + Update against the main resource (no SSA, no annotation bump, no /status write).

Why this works:

  • When an Update arrives, the API server runs the request body through the storage encoder to produce the etcd-shaped bytes, then compares those bytes against what's currently in etcd. If they differ, the Update is committed and the resourceVersion bumps.
  • For a CR stored at an older apiVersion (e.g. v1alpha1), the freshly-encoded bytes from our Update use the current storage version, which differs from the etcd bytes — so the API server commits even though our request body is byte-for-byte the same as what we just Got. That's the storage-version migration.
  • For a CR already stored at the current version, the encoded bytes match what's in etcd and the Update is elided server-side. That's the noop case.

This is the same approach upstream kube-storage-version-migrator uses today. We verified the mechanics empirically in envtest via TestReconcile_ReencodesCRsStoredAtPriorVersion — RV bumps when the prior storage was v1alpha1, doesn't bump when the prior storage was already v1beta1.

Webhook surface: ToolHive's existing validating webhooks (MCPServer, MCPGroup, VirtualMCPServer) run on the Update path, but they only reject changes that break spec invariants. A round-trip Update with the same spec passes by definition because the spec hasn't changed.

Concurrent-update safety

The reconciler treats IsConflict and IsNotFound separately:

  • IsNotFound during the per-CR Get: harmless — the CR was deleted between list and re-store. Skip; nothing to migrate.
  • IsConflict on the Update: another writer raced us. We do not retry inline (that would risk lost-update semantics in a loop). Instead we count conflicts and, if any occurred, return a sentinel error so the outer reconcile does not patch status.storedVersions down. The CRD stays "dirty" and the next reconcile pass retries cleanly.

status.storedVersions is only trimmed after every CR in the pass succeeded with zero conflicts.

Migration cache

A short-lived in-memory cache keyed on (CRD name, CR UID, resourceVersion) prevents re-issuing Updates against objects already re-stored within the last hour. A manager.RunnableFunc GC runs every 10 minutes to drop entries for CRs that have been deleted.

Feature flag (operator-wide off switch)

The controller is gated by an environment variable ENABLE_STORAGE_VERSION_MIGRATOR:

  • This release (PR-A): defaults to false. The chart does not surface the flag; admins who want to opt in must inject the env var via operator.env. Keeps casual enablement out of the standard helm install --set paths during the deprecation window.
  • Follow-up release (PR-C): a operator.features.storageVersionMigrator value lands in the chart and the operator-code default flips to true. From that point onward the feature is on by default; admins running kube-storage-version-migrator externally opt out via the chart value.

How the controller decides what to migrate

The label is just the opt-in gate — it does not drive version decisions. Every version decision is read live from each CRD's own spec.versions:

Field Role
spec.versions[i].storage: true The one version per CRD that the API server currently encodes objects as. This is the target.
status.storedVersions History of every version ever used for storage. If this isn't [target], there's work to do.

Reconcile flow per CRD:

  1. Label present + group matches? Proceed; else return.
  2. Walk spec.versions, find the entry with storage: true → target version V.
  3. If status.storedVersions == [V]: noop, return.
  4. List all CRs of this kind via paginated unstructured list.
  5. For each CR: Get (live, via APIReader) + Update. Skip on IsNotFound. Count IsConflict.
  6. If step 5 had zero conflicts and zero permanent errors: patch CRD.status.storedVersions to [V] using MergeFrom + optimistic lock. Otherwise return the sentinel error and let controller-runtime requeue.

Worked example:

CRD spec.versions storage: true status.storedVersions Controller action
mcpservers.toolhive.stacklok.dev v1alpha1, v1beta1 v1beta1 [v1alpha1, v1beta1] Get+Update each CR, trim to [v1beta1]
mcpgroups.toolhive.stacklok.dev v1alpha1, v1beta1 v1beta1 [v1beta1] noop
newshinything.toolhive.stacklok.dev v1beta1, v1beta2 v1beta2 [v1beta1, v1beta2] Get+Update each CR, trim to [v1beta2]
experimental.toolhive.stacklok.dev (no label) v1alpha1 v1alpha1 [v1alpha1] ignored — label missing

Alternatives considered

Scoping: label vs. hard-coded allowlist vs. per-CRD config

Option Why rejected
Hard-coded Go allowlist Adding a migratable CRD requires a code change + operator release. Worse cadence than the label.
CAPI-style per-CRD config map (ByObjectConfig) Overkill — ToolHive CRDs are structurally uniform. Easy to layer on later.
Label on the CRD (chosen) Declarative, Git-controlled, forward-compatible. New migratable CRD = one marker line.

Write path: Get+Update vs. SSA on /status vs. annotation bump

The original design proposed metadata-only SSA on /status to dodge admission webhooks. Empirical investigation in envtest showed:

  • An empty-body SSA Apply is elided by the API server before reaching the storage encoder — no RV bump, no storage migration.
  • An annotation-bump strategy works but leaves a permanent annotation on every CR (cruft, or requires a clean-up pass later).
  • A plain Get + Update against the main resource gives identical re-encode semantics (the storage encoder runs at the same layer regardless of subresource) while having zero permanent state, no field-manager surface, and matching what kube-storage-version-migrator upstream actually does.

Get+Update was chosen for those reasons. Webhook surface is acceptable: ToolHive's validating webhooks only reject spec-invariant violations, which a same-spec round-trip Update cannot trigger.

Deployment model: in-operator vs. external kube-storage-version-migrator

  • It's a separate deployment with its own RBAC, lifecycle, and observability. More surface for users to understand.
  • It's driven by StorageVersionMigration CRs that someone has to create per migration.
  • For a targeted, single-product scope (our ~12 CRDs), having the operator self-migrate is simpler end-to-end.

Admins who already run kube-storage-version-migrator can still opt out via the feature flag (once it's surfaced in the chart in PR-C).

Precedent and prior art

Label-scoped CRD watches — widely used: Knative Duck typing (duck.knative.dev/<type>=true), OLM (operators.coreos.com/<operator>.<namespace>), Crossplane (pkg.crossplane.io/package), Gatekeeper (admission.gatekeeper.sh/ignore).

Storage-version migration via Get+Update: the upstream pattern used by kube-storage-version-migrator. The CAPI controllers/crdmigrator does the same with its own retry/conflict handling.

The specific combination — in-operator storedVersions migrator scoped by a label on the CRD — is our synthesis for the ergonomics (one-line marker to add a new migratable CRD, admin escape hatch, no compile-time coupling).

Known tradeoffs and caveats

  1. Silent "forgot the marker" failure — adding a new CRD without the marker produces no build error; the migrator just skips it. Mitigated by the PR-B CI test that requires every v1beta1 root type to carry either the migrate marker or an explicit exclude marker.
  2. Label is user-modifiable in-cluster — admin escape hatch, but can fight with GitOps reconciliation. Documented as emergency only; long-term opt-out is the feature flag.
  3. Third-party CRDs in the group — wildcard RBAC toolhive.stacklok.dev/* means any third-party CRD that adopted our group + our label would be picked up. The group is Stacklok-owned by convention; risk is theoretical.
  4. Label contract is public — once shipped and users have written Kyverno rules or GitOps filters against it, renaming becomes a breaking change. Treat the label key as a stable API surface.
  5. Feature-flag footgun (post-PR-C) — once the default flips to on, an admin who disables the flag and forgets to re-enable it before the v1alpha1-removal release will have objects orphaned in etcd. The chart-value docs will call this out.

Design details

Interface contracts

package controllers

// StorageVersionMigratorReconciler reconciles CustomResourceDefinition objects
// in the toolhive.stacklok.dev group that carry the opt-in label
// toolhive.stacklok.dev/auto-migrate-storage-version=true. For each such CRD
// it re-stores every CR at the current storage version via a plain Get+Update
// and cleans status.storedVersions so a future release can drop deprecated
// versions from spec.versions without orphaning etcd objects.
//
// Disabled by default in the initial release; admins opt in via the
// ENABLE_STORAGE_VERSION_MIGRATOR environment variable. A follow-up release
// flips the default to enabled and surfaces a chart value for opt-out.
type StorageVersionMigratorReconciler struct {
    client.Client                  // cached reads/writes for CRs
    APIReader     client.Reader    // live reads (bypasses informer)
    Scheme        *runtime.Scheme
    Recorder      record.EventRecorder

    // cache prevents re-Update of objects whose (UID, resourceVersion) was
    // already successfully migrated within the TTL window. Periodically
    // GC'd for deleted CRs by a manager.RunnableFunc.
    cache *migrationCache
}

func (r *StorageVersionMigratorReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error)
func (r *StorageVersionMigratorReconciler) SetupWithManager(mgr ctrl.Manager) error

const (
    AutoMigrateLabel = "toolhive.stacklok.dev/auto-migrate-storage-version"
    AutoMigrateValue = "true"
    ToolhiveGroup    = "toolhive.stacklok.dev"
)

In cmd/thv-operator/app/app.go:

const envEnableStorageVersionMigrator = "ENABLE_STORAGE_VERSION_MIGRATOR"

func isStorageVersionMigratorEnabled() bool { /* defaults to false */ }
func setupStorageVersionMigrator(mgr ctrl.Manager) error

RBAC markers (regenerated into deploy/charts/operator/templates/clusterrole/role.yaml):

// +kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions,verbs=get;list;watch
// +kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions/status,verbs=patch;update
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*,verbs=get;list;update
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*/status,verbs=update

Tests (envtest)

Test Verifies
TestReconcile_NoopWhenClean storedVersions already [v1beta1]: no writes issued
TestReconcile_HappyPathTrimsStoredVersions storedVersions trimmed after Get+Update on every CR
TestReconcile_ReencodesCRsStoredAtPriorVersion empirical proof RV bumps when etcd holds bytes at an older apiVersion
TestReconcile_PartialFailureLeavesStoredVersionsUntouched permanent error leaves storedVersions intact for retry
TestReconcile_ConflictDuringRestoreLeavesStoredVersionsUntouched IsConflict leaves storedVersions intact, retries cleanly
TestReconcile_SkipsForeignCRDs non-toolhive CRDs are ignored
TestReconcile_SkipsUnlabelledToolhiveCRD toolhive CRDs without the opt-in label are ignored
TestReconcile_Pagination the continue-token list loop works

PR strategy

Split into three sequential PRs to keep each one independently reviewable and to maintain default-off safety throughout the rollout.

PR-A — Controller behind default-off flag (open: #5362)

  • cmd/thv-operator/controllers/storageversionmigrator_controller.go (reconciler + cache + RBAC markers)
  • cmd/thv-operator/app/app.go (scheme registration, setup, env-var gating, defaults to false)
  • cmd/thv-operator/test-integration/storageversionmigrator/ (8 envtest scenarios — CRDs created at runtime, independent of opt-in labels)
  • deploy/charts/operator/templates/clusterrole/role.yaml (regenerated)
  • Not in PR-A: chart values.yaml flag entry, deployment.yaml env var, opt-in labels, user docs

PR-B — Opt-in labels + marker-coverage CI

  • +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true marker on all 12 v1beta1 root types
  • Regenerated CRD YAML (deploy/charts/operator-crds/files/crds/*.yaml, deploy/charts/operator-crds/templates/*.yaml)
  • cmd/thv-operator/api/v1beta1/marker_coverage_test.go CI test enforcing every root type carries either the migrate marker or an explicit +thv:storage-version-migrator:exclude marker

PR-C — Flip default on + chart surface + docs

  • Operator-code default flipped to true
  • operator.features.storageVersionMigrator: true value in values.yaml with helm-docs comment
  • ENABLE_STORAGE_VERSION_MIGRATOR env var wired in deployment.yaml
  • docs/operator/storage-version-migration.md — user-facing reference (auto-migration behaviour, label contract, opt-out via feature flag, interaction with future v1alpha1 removal)
  • docs/operator/upgrade-guide/ — kind-cluster walkthrough + sample CR fixtures verified end-to-end

Acceptance criteria

Mapped across the three PRs:

AC PR
New StorageVersionMigrator controller reconciling on CustomResourceDefinition changes, scoped to toolhive.stacklok.dev + opt-in label PR-A
Env-var feature flag (ENABLE_STORAGE_VERSION_MIGRATOR) gates registration; defaults to false PR-A
8 envtest scenarios pass PR-A
RBAC for the controller's verbs ships with the controller PR-A
All 12 current v1beta1 root types carry the opt-in label via kubebuilder markers PR-B
CI marker-coverage test fails when a v1beta1 root type lacks either the migrate marker or an explicit exclude marker PR-B
Operator-code default flips to true; chart surfaces operator.features.storageVersionMigrator PR-C
User-facing docs explain auto-migration, the label contract, the feature-flag opt-out, and interaction with future v1alpha1 removal PR-C
End-to-end kind walkthrough exercises the full upgrade path with sample CR fixtures PR-C

Risks

Risk Severity Mitigation
Large cluster: many CRs per kind, reconcile timeout LOW Paginated list + per-CR Update is O(n); migration cache prevents re-work
Webhook added later on the main resource breaks migrator LOW A same-spec round-trip Update cannot trigger spec-invariant rejections
Label forgotten on a newly-added CRD that should be migrated LOW CI marker-coverage test (PR-B) catches this at PR time
Admin disables feature flag (post-PR-C) and forgets to re-enable before v1alpha1 removal MED Chart-value docs explicitly warn; the flag is intended for admins running kube-storage-version-migrator externally
GitOps reconciliation fights with admin escape-hatch label removal LOW Docs: label is emergency only, feature flag is the supported long-term opt-out
Cache entries leak for deleted CRs LOW Periodic GC every 10 minutes via a manager.RunnableFunc

When this is needed

Before the release that drops v1alpha1 from the ToolHive CRDs. PR-A ships the controller (default off) so admins can opt in during the deprecation window. PR-B lands the opt-in labels. PR-C flips the default on and surfaces the chart value. By the time the v1alpha1-removal release lands, every cluster's storedVersions is already clean and the version removal is a zero-risk CRD update.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestgoPull requests that update go codekubernetesItems related to Kubernetesoperatorp2Low

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions