Operator: add a StorageVersionMigrator controller to clean up storedVersions

## Problem statement

PR [#4849](https://github.com/stacklok/toolhive/pull/4849) graduated our CRDs from `v1alpha1` to `v1beta1` using a **multi-version serving** strategy:

- Both `v1alpha1` and `v1beta1` are served; `v1beta1` is the storage version.
- `conversion.strategy: None` — safe because the schemas are identical.
- Users can upgrade without deleting resources and migrate their manifests at their own pace.

As users re-apply their manifests at `v1beta1`, each CRD's `status.storedVersions` ends up looking like `["v1alpha1", "v1beta1"]`. That state is correct and working, but it **blocks a future release from dropping `v1alpha1` entirely**: the API server refuses to remove a version from `spec.versions` while it's still listed in `status.storedVersions`, since doing so would orphan objects in etcd that nothing can deserialise.

Kubernetes does **not** garbage-collect `storedVersions` automatically. Someone has to re-store every object at the current storage version and then explicitly patch `storedVersions` to drop the old entry.

## Motivation

Before the release that graduates CRDs to `v1beta2` (or any future storage version), every cluster's `status.storedVersions` on the ToolHive CRDs must contain only the current storage version. Without automation, users would have to either:

1. Re-apply every CR manually (error-prone at scale), or
2. Deploy [kube-storage-version-migrator](https://github.com/kubernetes-sigs/kube-storage-version-migrator) as a separate controller and run a migration per CRD.

Neither is a good experience for users who just want to upgrade ToolHive. Since the operator already watches every ToolHive CRD, it can do this itself — and ship the migrator one release before any version removal, so the `storedVersions` cleanup happens quietly in the background during the deprecation window.

## Goal

Add a `StorageVersionMigrator` controller to the ToolHive operator that, for each opted-in `toolhive.stacklok.dev` CRD:

1. Detects when `status.storedVersions` contains versions other than the current storage version.
2. Re-stores every CR of that kind so the API server re-encodes it at the storage version.
3. Once all CRs are re-stored, patches `status.storedVersions` down to `[<currentStorageVersion>]`.

The controller is generic (works for any current-and-future storage-version transition) and scoped per-CRD via an opt-in label so future CRDs that shouldn't be migrated stay out of scope automatically.

**Initial rollout policy**: the controller ships **disabled by default** (`ENABLE_STORAGE_VERSION_MIGRATOR=false`) for one release. This gives the feature a deprecation window where it can be opted into by admins who want it, before we flip the default to on alongside the user-facing docs in a later release.

## Proposed solution

A single controller (`cmd/thv-operator/controllers/storageversionmigrator_controller.go`) that watches `CustomResourceDefinition` objects, filters to a specific scope, and performs the migration per CRD.

### Scoping: opt-in label on the CRD

The controller watches CRDs matching:

- **Group filter**: `spec.group == "toolhive.stacklok.dev"`.
- **Opt-in label**: `metadata.labels["toolhive.stacklok.dev/auto-migrate-storage-version"] == "true"`.

The label is set at CRD-generation time via a one-line kubebuilder marker on each type:

```go
// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true
type MCPServer struct { ... }
```

`task operator-manifests` bakes the label into the generated CRD YAML. All 12 current `v1beta1` root types get the marker. New CRDs opt in by adding the same marker; CRDs that shouldn't be migrated omit it.

To prevent the "silent forgot-the-marker" failure mode, a CI test walks `cmd/thv-operator/api/v1beta1/` and fails if any `+kubebuilder:object:root=true` type lacks either the migrate marker or an explicit `// +thv:storage-version-migrator:exclude` sibling marker.

### Write path: plain Get+Update on the main resource

To re-encode a CR at the current storage version, the controller issues a **plain `Get` + `Update` against the main resource** (no SSA, no annotation bump, no `/status` write).

Why this works:

- When an `Update` arrives, the API server runs the request body through the **storage encoder** to produce the etcd-shaped bytes, then compares those bytes against what's currently in etcd. If they differ, the Update is committed and the resourceVersion bumps.
- For a CR stored at an older `apiVersion` (e.g. `v1alpha1`), the freshly-encoded bytes from our Update use the current storage version, which differs from the etcd bytes — so the API server commits even though our request body is byte-for-byte the same as what we just Got. That's the storage-version migration.
- For a CR already stored at the current version, the encoded bytes match what's in etcd and the Update is elided server-side. That's the noop case.

This is the same approach upstream [`kube-storage-version-migrator`](https://github.com/kubernetes-sigs/kube-storage-version-migrator) uses today. We verified the mechanics empirically in envtest via `TestReconcile_ReencodesCRsStoredAtPriorVersion` — RV bumps when the prior storage was `v1alpha1`, doesn't bump when the prior storage was already `v1beta1`.

Webhook surface: ToolHive's existing validating webhooks (MCPServer, MCPGroup, VirtualMCPServer) run on the Update path, but they only reject changes that break spec invariants. A round-trip Update with the same spec passes by definition because the spec hasn't changed.

### Concurrent-update safety

The reconciler treats `IsConflict` and `IsNotFound` separately:

- **`IsNotFound`** during the per-CR `Get`: harmless — the CR was deleted between list and re-store. Skip; nothing to migrate.
- **`IsConflict`** on the `Update`: another writer raced us. We do **not** retry inline (that would risk lost-update semantics in a loop). Instead we count conflicts and, if any occurred, return a sentinel error so the **outer reconcile does not patch `status.storedVersions` down**. The CRD stays "dirty" and the next reconcile pass retries cleanly.

`status.storedVersions` is only trimmed after every CR in the pass succeeded with zero conflicts.

### Migration cache

A short-lived in-memory cache keyed on `(CRD name, CR UID, resourceVersion)` prevents re-issuing Updates against objects already re-stored within the last hour. A `manager.RunnableFunc` GC runs every 10 minutes to drop entries for CRs that have been deleted.

### Feature flag (operator-wide off switch)

The controller is gated by an environment variable `ENABLE_STORAGE_VERSION_MIGRATOR`:

- **This release (PR-A)**: defaults to `false`. The chart does **not** surface the flag; admins who want to opt in must inject the env var via `operator.env`. Keeps casual enablement out of the standard `helm install --set` paths during the deprecation window.
- **Follow-up release (PR-C)**: a `operator.features.storageVersionMigrator` value lands in the chart and the operator-code default flips to `true`. From that point onward the feature is on by default; admins running `kube-storage-version-migrator` externally opt out via the chart value.

## How the controller decides what to migrate

The label is just the opt-in gate — it does **not** drive version decisions. Every version decision is read live from each CRD's own `spec.versions`:

| Field | Role |
|---|---|
| `spec.versions[i].storage: true` | The one version per CRD that the API server currently encodes objects as. This is the **target**. |
| `status.storedVersions` | History of every version ever used for storage. If this isn't `[target]`, there's work to do. |

Reconcile flow per CRD:

1. Label present + group matches? Proceed; else return.
2. Walk `spec.versions`, find the entry with `storage: true` → target version `V`.
3. If `status.storedVersions == [V]`: noop, return.
4. List all CRs of this kind via paginated unstructured list.
5. For each CR: `Get` (live, via `APIReader`) + `Update`. Skip on `IsNotFound`. Count `IsConflict`.
6. If step 5 had zero conflicts and zero permanent errors: patch `CRD.status.storedVersions` to `[V]` using `MergeFrom` + optimistic lock. Otherwise return the sentinel error and let controller-runtime requeue.

Worked example:

| CRD | `spec.versions` | `storage: true` | `status.storedVersions` | Controller action |
|---|---|---|---|---|
| `mcpservers.toolhive.stacklok.dev` | v1alpha1, v1beta1 | v1beta1 | `[v1alpha1, v1beta1]` | Get+Update each CR, trim to `[v1beta1]` |
| `mcpgroups.toolhive.stacklok.dev` | v1alpha1, v1beta1 | v1beta1 | `[v1beta1]` | noop |
| `newshinything.toolhive.stacklok.dev` | v1beta1, v1beta2 | v1beta2 | `[v1beta1, v1beta2]` | Get+Update each CR, trim to `[v1beta2]` |
| `experimental.toolhive.stacklok.dev` (no label) | v1alpha1 | v1alpha1 | `[v1alpha1]` | ignored — label missing |

## Alternatives considered

### Scoping: label vs. hard-coded allowlist vs. per-CRD config

| Option | Why rejected |
|---|---|
| Hard-coded Go allowlist | Adding a migratable CRD requires a code change + operator release. Worse cadence than the label. |
| CAPI-style per-CRD config map (`ByObjectConfig`) | Overkill — ToolHive CRDs are structurally uniform. Easy to layer on later. |
| Label on the CRD *(chosen)* | Declarative, Git-controlled, forward-compatible. New migratable CRD = one marker line. |

### Write path: Get+Update vs. SSA on /status vs. annotation bump

The original design proposed metadata-only SSA on `/status` to dodge admission webhooks. Empirical investigation in envtest showed:

- An empty-body SSA Apply is elided by the API server **before** reaching the storage encoder — no RV bump, no storage migration.
- An annotation-bump strategy works but leaves a permanent annotation on every CR (cruft, or requires a clean-up pass later).
- A plain `Get` + `Update` against the main resource gives identical re-encode semantics (the storage encoder runs at the same layer regardless of subresource) while having zero permanent state, no field-manager surface, and matching what `kube-storage-version-migrator` upstream actually does.

Get+Update was chosen for those reasons. Webhook surface is acceptable: ToolHive's validating webhooks only reject spec-invariant violations, which a same-spec round-trip Update cannot trigger.

### Deployment model: in-operator vs. external kube-storage-version-migrator

- It's a separate deployment with its own RBAC, lifecycle, and observability. More surface for users to understand.
- It's driven by `StorageVersionMigration` CRs that someone has to create per migration.
- For a targeted, single-product scope (our ~12 CRDs), having the operator self-migrate is simpler end-to-end.

Admins who already run `kube-storage-version-migrator` can still opt out via the feature flag (once it's surfaced in the chart in PR-C).

## Precedent and prior art

**Label-scoped CRD watches** — widely used: Knative Duck typing (`duck.knative.dev/<type>=true`), OLM (`operators.coreos.com/<operator>.<namespace>`), Crossplane (`pkg.crossplane.io/package`), Gatekeeper (`admission.gatekeeper.sh/ignore`).

**Storage-version migration via Get+Update**: the upstream pattern used by [`kube-storage-version-migrator`](https://github.com/kubernetes-sigs/kube-storage-version-migrator). The CAPI [`controllers/crdmigrator`](https://github.com/kubernetes-sigs/cluster-api/tree/main/controllers/crdmigrator) does the same with its own retry/conflict handling.

**The specific combination — in-operator `storedVersions` migrator scoped by a label on the CRD — is our synthesis** for the ergonomics (one-line marker to add a new migratable CRD, admin escape hatch, no compile-time coupling).

## Known tradeoffs and caveats

1. **Silent "forgot the marker" failure** — adding a new CRD without the marker produces no build error; the migrator just skips it. Mitigated by the PR-B CI test that requires every `v1beta1` root type to carry either the migrate marker or an explicit exclude marker.
2. **Label is user-modifiable in-cluster** — admin escape hatch, but can fight with GitOps reconciliation. Documented as emergency only; long-term opt-out is the feature flag.
3. **Third-party CRDs in the group** — wildcard RBAC `toolhive.stacklok.dev/*` means any third-party CRD that adopted our group + our label would be picked up. The group is Stacklok-owned by convention; risk is theoretical.
4. **Label contract is public** — once shipped and users have written Kyverno rules or GitOps filters against it, renaming becomes a breaking change. Treat the label key as a stable API surface.
5. **Feature-flag footgun (post-PR-C)** — once the default flips to on, an admin who disables the flag and forgets to re-enable it before the `v1alpha1`-removal release will have objects orphaned in etcd. The chart-value docs will call this out.

## Design details

### Interface contracts

```go
package controllers

// StorageVersionMigratorReconciler reconciles CustomResourceDefinition objects
// in the toolhive.stacklok.dev group that carry the opt-in label
// toolhive.stacklok.dev/auto-migrate-storage-version=true. For each such CRD
// it re-stores every CR at the current storage version via a plain Get+Update
// and cleans status.storedVersions so a future release can drop deprecated
// versions from spec.versions without orphaning etcd objects.
//
// Disabled by default in the initial release; admins opt in via the
// ENABLE_STORAGE_VERSION_MIGRATOR environment variable. A follow-up release
// flips the default to enabled and surfaces a chart value for opt-out.
type StorageVersionMigratorReconciler struct {
    client.Client                  // cached reads/writes for CRs
    APIReader     client.Reader    // live reads (bypasses informer)
    Scheme        *runtime.Scheme
    Recorder      record.EventRecorder

    // cache prevents re-Update of objects whose (UID, resourceVersion) was
    // already successfully migrated within the TTL window. Periodically
    // GC'd for deleted CRs by a manager.RunnableFunc.
    cache *migrationCache
}

func (r *StorageVersionMigratorReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error)
func (r *StorageVersionMigratorReconciler) SetupWithManager(mgr ctrl.Manager) error

const (
    AutoMigrateLabel = "toolhive.stacklok.dev/auto-migrate-storage-version"
    AutoMigrateValue = "true"
    ToolhiveGroup    = "toolhive.stacklok.dev"
)
```

In `cmd/thv-operator/app/app.go`:

```go
const envEnableStorageVersionMigrator = "ENABLE_STORAGE_VERSION_MIGRATOR"

func isStorageVersionMigratorEnabled() bool { /* defaults to false */ }
func setupStorageVersionMigrator(mgr ctrl.Manager) error
```

RBAC markers (regenerated into `deploy/charts/operator/templates/clusterrole/role.yaml`):

```go
// +kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions,verbs=get;list;watch
// +kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions/status,verbs=patch;update
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*,verbs=get;list;update
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*/status,verbs=update
```

### Tests (envtest)

| Test | Verifies |
|---|---|
| `TestReconcile_NoopWhenClean` | storedVersions already `[v1beta1]`: no writes issued |
| `TestReconcile_HappyPathTrimsStoredVersions` | storedVersions trimmed after Get+Update on every CR |
| `TestReconcile_ReencodesCRsStoredAtPriorVersion` | empirical proof RV bumps when etcd holds bytes at an older apiVersion |
| `TestReconcile_PartialFailureLeavesStoredVersionsUntouched` | permanent error leaves storedVersions intact for retry |
| `TestReconcile_ConflictDuringRestoreLeavesStoredVersionsUntouched` | IsConflict leaves storedVersions intact, retries cleanly |
| `TestReconcile_SkipsForeignCRDs` | non-toolhive CRDs are ignored |
| `TestReconcile_SkipsUnlabelledToolhiveCRD` | toolhive CRDs without the opt-in label are ignored |
| `TestReconcile_Pagination` | the continue-token list loop works |

## PR strategy

Split into three sequential PRs to keep each one independently reviewable and to maintain default-off safety throughout the rollout.

### PR-A — Controller behind default-off flag (open: [#5362](https://github.com/stacklok/toolhive/pull/5362))

- `cmd/thv-operator/controllers/storageversionmigrator_controller.go` (reconciler + cache + RBAC markers)
- `cmd/thv-operator/app/app.go` (scheme registration, setup, env-var gating, **defaults to false**)
- `cmd/thv-operator/test-integration/storageversionmigrator/` (8 envtest scenarios — CRDs created at runtime, independent of opt-in labels)
- `deploy/charts/operator/templates/clusterrole/role.yaml` (regenerated)
- **Not in PR-A**: chart `values.yaml` flag entry, `deployment.yaml` env var, opt-in labels, user docs

### PR-B — Opt-in labels + marker-coverage CI

- `+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true` marker on all 12 `v1beta1` root types
- Regenerated CRD YAML (`deploy/charts/operator-crds/files/crds/*.yaml`, `deploy/charts/operator-crds/templates/*.yaml`)
- `cmd/thv-operator/api/v1beta1/marker_coverage_test.go` CI test enforcing every root type carries either the migrate marker or an explicit `+thv:storage-version-migrator:exclude` marker

### PR-C — Flip default on + chart surface + docs

- Operator-code default flipped to `true`
- `operator.features.storageVersionMigrator: true` value in `values.yaml` with helm-docs comment
- `ENABLE_STORAGE_VERSION_MIGRATOR` env var wired in `deployment.yaml`
- `docs/operator/storage-version-migration.md` — user-facing reference (auto-migration behaviour, label contract, opt-out via feature flag, interaction with future `v1alpha1` removal)
- `docs/operator/upgrade-guide/` — kind-cluster walkthrough + sample CR fixtures verified end-to-end

## Acceptance criteria

Mapped across the three PRs:

| AC | PR |
|---|---|
| New `StorageVersionMigrator` controller reconciling on `CustomResourceDefinition` changes, scoped to `toolhive.stacklok.dev` + opt-in label | PR-A |
| Env-var feature flag (`ENABLE_STORAGE_VERSION_MIGRATOR`) gates registration; defaults to false | PR-A |
| 8 envtest scenarios pass | PR-A |
| RBAC for the controller's verbs ships with the controller | PR-A |
| All 12 current `v1beta1` root types carry the opt-in label via kubebuilder markers | PR-B |
| CI marker-coverage test fails when a `v1beta1` root type lacks either the migrate marker or an explicit exclude marker | PR-B |
| Operator-code default flips to `true`; chart surfaces `operator.features.storageVersionMigrator` | PR-C |
| User-facing docs explain auto-migration, the label contract, the feature-flag opt-out, and interaction with future `v1alpha1` removal | PR-C |
| End-to-end kind walkthrough exercises the full upgrade path with sample CR fixtures | PR-C |

## Risks

| Risk | Severity | Mitigation |
|---|---|---|
| Large cluster: many CRs per kind, reconcile timeout | LOW | Paginated list + per-CR Update is O(n); migration cache prevents re-work |
| Webhook added later on the main resource breaks migrator | LOW | A same-spec round-trip Update cannot trigger spec-invariant rejections |
| Label forgotten on a newly-added CRD that *should* be migrated | LOW | CI marker-coverage test (PR-B) catches this at PR time |
| Admin disables feature flag (post-PR-C) and forgets to re-enable before `v1alpha1` removal | MED | Chart-value docs explicitly warn; the flag is intended for admins running `kube-storage-version-migrator` externally |
| GitOps reconciliation fights with admin escape-hatch label removal | LOW | Docs: label is emergency only, feature flag is the supported long-term opt-out |
| Cache entries leak for deleted CRs | LOW | Periodic GC every 10 minutes via a `manager.RunnableFunc` |

## When this is needed

Before the release that drops `v1alpha1` from the ToolHive CRDs. PR-A ships the controller (default off) so admins can opt in during the deprecation window. PR-B lands the opt-in labels. PR-C flips the default on and surfaces the chart value. By the time the `v1alpha1`-removal release lands, every cluster's `storedVersions` is already clean and the version removal is a zero-risk CRD update.

## References

- PR [#4849](https://github.com/stacklok/toolhive/pull/4849) — CRD graduation from `v1alpha1` to `v1beta1` (the reason this is needed)
- PR [#5011](https://github.com/stacklok/toolhive/pull/5011) — original combined-PR reference (kept open during the split)
- PR [#5362](https://github.com/stacklok/toolhive/pull/5362) — PR-A (controller, default-off)
- Upstream reference: [`kube-storage-version-migrator`](https://github.com/kubernetes-sigs/kube-storage-version-migrator)
- CAPI reference: [`cluster-api/controllers/crdmigrator`](https://github.com/kubernetes-sigs/cluster-api/tree/main/controllers/crdmigrator)
- [Kubernetes: serve multiple versions](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/)
- [Kubernetes: Migrate Kubernetes Objects Using Storage Version Migration](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/storage-version-migration/)


Field	Role
`spec.versions[i].storage: true`	The one version per CRD that the API server currently encodes objects as. This is the target.
`status.storedVersions`	History of every version ever used for storage. If this isn't `[target]`, there's work to do.

CRD	`spec.versions`	`storage: true`	`status.storedVersions`	Controller action
`mcpservers.toolhive.stacklok.dev`	v1alpha1, v1beta1	v1beta1	`[v1alpha1, v1beta1]`	Get+Update each CR, trim to `[v1beta1]`
`mcpgroups.toolhive.stacklok.dev`	v1alpha1, v1beta1	v1beta1	`[v1beta1]`	noop
`newshinything.toolhive.stacklok.dev`	v1beta1, v1beta2	v1beta2	`[v1beta1, v1beta2]`	Get+Update each CR, trim to `[v1beta2]`
`experimental.toolhive.stacklok.dev` (no label)	v1alpha1	v1alpha1	`[v1alpha1]`	ignored — label missing

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Operator: add a StorageVersionMigrator controller to clean up storedVersions #4969

Problem statement

Motivation

Goal

Proposed solution

Scoping: opt-in label on the CRD

Write path: plain Get+Update on the main resource

Concurrent-update safety

Migration cache

Feature flag (operator-wide off switch)

How the controller decides what to migrate

Alternatives considered

Scoping: label vs. hard-coded allowlist vs. per-CRD config

Write path: Get+Update vs. SSA on /status vs. annotation bump

Deployment model: in-operator vs. external kube-storage-version-migrator

Precedent and prior art

Known tradeoffs and caveats

Design details

Interface contracts

Tests (envtest)

PR strategy

PR-A — Controller behind default-off flag (open: #5362)

PR-B — Opt-in labels + marker-coverage CI

PR-C — Flip default on + chart surface + docs

Acceptance criteria

Risks

When this is needed

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Option	Why rejected
Hard-coded Go allowlist	Adding a migratable CRD requires a code change + operator release. Worse cadence than the label.
CAPI-style per-CRD config map (`ByObjectConfig`)	Overkill — ToolHive CRDs are structurally uniform. Easy to layer on later.
Label on the CRD (chosen)	Declarative, Git-controlled, forward-compatible. New migratable CRD = one marker line.

Test	Verifies
`TestReconcile_NoopWhenClean`	storedVersions already `[v1beta1]`: no writes issued
`TestReconcile_HappyPathTrimsStoredVersions`	storedVersions trimmed after Get+Update on every CR
`TestReconcile_ReencodesCRsStoredAtPriorVersion`	empirical proof RV bumps when etcd holds bytes at an older apiVersion
`TestReconcile_PartialFailureLeavesStoredVersionsUntouched`	permanent error leaves storedVersions intact for retry
`TestReconcile_ConflictDuringRestoreLeavesStoredVersionsUntouched`	IsConflict leaves storedVersions intact, retries cleanly
`TestReconcile_SkipsForeignCRDs`	non-toolhive CRDs are ignored
`TestReconcile_SkipsUnlabelledToolhiveCRD`	toolhive CRDs without the opt-in label are ignored
`TestReconcile_Pagination`	the continue-token list loop works

AC	PR
New `StorageVersionMigrator` controller reconciling on `CustomResourceDefinition` changes, scoped to `toolhive.stacklok.dev` + opt-in label	PR-A
Env-var feature flag (`ENABLE_STORAGE_VERSION_MIGRATOR`) gates registration; defaults to false	PR-A
8 envtest scenarios pass	PR-A
RBAC for the controller's verbs ships with the controller	PR-A
All 12 current `v1beta1` root types carry the opt-in label via kubebuilder markers	PR-B
CI marker-coverage test fails when a `v1beta1` root type lacks either the migrate marker or an explicit exclude marker	PR-B
Operator-code default flips to `true`; chart surfaces `operator.features.storageVersionMigrator`	PR-C
User-facing docs explain auto-migration, the label contract, the feature-flag opt-out, and interaction with future `v1alpha1` removal	PR-C
End-to-end kind walkthrough exercises the full upgrade path with sample CR fixtures	PR-C

Risk	Severity	Mitigation
Large cluster: many CRs per kind, reconcile timeout	LOW	Paginated list + per-CR Update is O(n); migration cache prevents re-work
Webhook added later on the main resource breaks migrator	LOW	A same-spec round-trip Update cannot trigger spec-invariant rejections
Label forgotten on a newly-added CRD that should be migrated	LOW	CI marker-coverage test (PR-B) catches this at PR time
Admin disables feature flag (post-PR-C) and forgets to re-enable before `v1alpha1` removal	MED	Chart-value docs explicitly warn; the flag is intended for admins running `kube-storage-version-migrator` externally
GitOps reconciliation fights with admin escape-hatch label removal	LOW	Docs: label is emergency only, feature flag is the supported long-term opt-out
Cache entries leak for deleted CRs	LOW	Periodic GC every 10 minutes via a `manager.RunnableFunc`

Operator: add a StorageVersionMigrator controller to clean up storedVersions #4969

Description

Problem statement

Motivation

Goal

Proposed solution

Scoping: opt-in label on the CRD

Write path: plain Get+Update on the main resource

Concurrent-update safety

Migration cache

Feature flag (operator-wide off switch)

How the controller decides what to migrate

Alternatives considered

Scoping: label vs. hard-coded allowlist vs. per-CRD config

Write path: Get+Update vs. SSA on /status vs. annotation bump

Deployment model: in-operator vs. external kube-storage-version-migrator

Precedent and prior art

Known tradeoffs and caveats

Design details

Interface contracts

Tests (envtest)

PR strategy

PR-A — Controller behind default-off flag (open: #5362)

PR-B — Opt-in labels + marker-coverage CI

PR-C — Flip default on + chart surface + docs

Acceptance criteria

Risks

When this is needed

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions