Problem statement
PR #4849 graduated our CRDs from v1alpha1 to v1beta1 using a multi-version serving strategy:
- Both
v1alpha1 and v1beta1 are served; v1beta1 is the storage version.
conversion.strategy: None — safe because the schemas are identical.
- Users can upgrade without deleting resources and migrate their manifests at their own pace.
As users re-apply their manifests at v1beta1, each CRD's status.storedVersions ends up looking like ["v1alpha1", "v1beta1"]. That state is correct and working, but it blocks a future release from dropping v1alpha1 entirely: the API server refuses to remove a version from spec.versions while it's still listed in status.storedVersions, since doing so would orphan objects in etcd that nothing can deserialise.
Kubernetes does not garbage-collect storedVersions automatically. Someone has to re-store every object at the current storage version and then explicitly patch storedVersions to drop the old entry.
Motivation
Before the release that graduates CRDs to v1beta2 (or any future storage version), every cluster's status.storedVersions on the ToolHive CRDs must contain only the current storage version. Without automation, users would have to either:
- Re-apply every CR manually (error-prone at scale), or
- Deploy kube-storage-version-migrator as a separate controller and run a migration per CRD.
Neither is a good experience for users who just want to upgrade ToolHive. Since the operator already watches every ToolHive CRD, it can do this itself — and ship the migrator one release before any version removal, so the storedVersions cleanup happens quietly in the background during the deprecation window.
Goal
Add a StorageVersionMigrator controller to the ToolHive operator that, for each opted-in toolhive.stacklok.dev CRD:
- Detects when
status.storedVersions contains versions other than the current storage version.
- Re-stores every CR of that kind so the API server re-encodes it at the storage version.
- Once all CRs are re-stored, patches
status.storedVersions down to [<currentStorageVersion>].
The controller is generic (works for any current-and-future storage-version transition) and scoped per-CRD via an opt-in label so future CRDs that shouldn't be migrated stay out of scope automatically.
Initial rollout policy: the controller ships disabled by default (ENABLE_STORAGE_VERSION_MIGRATOR=false) for one release. This gives the feature a deprecation window where it can be opted into by admins who want it, before we flip the default to on alongside the user-facing docs in a later release.
Proposed solution
A single controller (cmd/thv-operator/controllers/storageversionmigrator_controller.go) that watches CustomResourceDefinition objects, filters to a specific scope, and performs the migration per CRD.
Scoping: opt-in label on the CRD
The controller watches CRDs matching:
- Group filter:
spec.group == "toolhive.stacklok.dev".
- Opt-in label:
metadata.labels["toolhive.stacklok.dev/auto-migrate-storage-version"] == "true".
The label is set at CRD-generation time via a one-line kubebuilder marker on each type:
// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true
type MCPServer struct { ... }
task operator-manifests bakes the label into the generated CRD YAML. All 12 current v1beta1 root types get the marker. New CRDs opt in by adding the same marker; CRDs that shouldn't be migrated omit it.
To prevent the "silent forgot-the-marker" failure mode, a CI test walks cmd/thv-operator/api/v1beta1/ and fails if any +kubebuilder:object:root=true type lacks either the migrate marker or an explicit // +thv:storage-version-migrator:exclude sibling marker.
Write path: plain Get+Update on the main resource
To re-encode a CR at the current storage version, the controller issues a plain Get + Update against the main resource (no SSA, no annotation bump, no /status write).
Why this works:
- When an
Update arrives, the API server runs the request body through the storage encoder to produce the etcd-shaped bytes, then compares those bytes against what's currently in etcd. If they differ, the Update is committed and the resourceVersion bumps.
- For a CR stored at an older
apiVersion (e.g. v1alpha1), the freshly-encoded bytes from our Update use the current storage version, which differs from the etcd bytes — so the API server commits even though our request body is byte-for-byte the same as what we just Got. That's the storage-version migration.
- For a CR already stored at the current version, the encoded bytes match what's in etcd and the Update is elided server-side. That's the noop case.
This is the same approach upstream kube-storage-version-migrator uses today. We verified the mechanics empirically in envtest via TestReconcile_ReencodesCRsStoredAtPriorVersion — RV bumps when the prior storage was v1alpha1, doesn't bump when the prior storage was already v1beta1.
Webhook surface: ToolHive's existing validating webhooks (MCPServer, MCPGroup, VirtualMCPServer) run on the Update path, but they only reject changes that break spec invariants. A round-trip Update with the same spec passes by definition because the spec hasn't changed.
Concurrent-update safety
The reconciler treats IsConflict and IsNotFound separately:
IsNotFound during the per-CR Get: harmless — the CR was deleted between list and re-store. Skip; nothing to migrate.
IsConflict on the Update: another writer raced us. We do not retry inline (that would risk lost-update semantics in a loop). Instead we count conflicts and, if any occurred, return a sentinel error so the outer reconcile does not patch status.storedVersions down. The CRD stays "dirty" and the next reconcile pass retries cleanly.
status.storedVersions is only trimmed after every CR in the pass succeeded with zero conflicts.
Migration cache
A short-lived in-memory cache keyed on (CRD name, CR UID, resourceVersion) prevents re-issuing Updates against objects already re-stored within the last hour. A manager.RunnableFunc GC runs every 10 minutes to drop entries for CRs that have been deleted.
Feature flag (operator-wide off switch)
The controller is gated by an environment variable ENABLE_STORAGE_VERSION_MIGRATOR:
- This release (PR-A): defaults to
false. The chart does not surface the flag; admins who want to opt in must inject the env var via operator.env. Keeps casual enablement out of the standard helm install --set paths during the deprecation window.
- Follow-up release (PR-C): a
operator.features.storageVersionMigrator value lands in the chart and the operator-code default flips to true. From that point onward the feature is on by default; admins running kube-storage-version-migrator externally opt out via the chart value.
How the controller decides what to migrate
The label is just the opt-in gate — it does not drive version decisions. Every version decision is read live from each CRD's own spec.versions:
| Field |
Role |
spec.versions[i].storage: true |
The one version per CRD that the API server currently encodes objects as. This is the target. |
status.storedVersions |
History of every version ever used for storage. If this isn't [target], there's work to do. |
Reconcile flow per CRD:
- Label present + group matches? Proceed; else return.
- Walk
spec.versions, find the entry with storage: true → target version V.
- If
status.storedVersions == [V]: noop, return.
- List all CRs of this kind via paginated unstructured list.
- For each CR:
Get (live, via APIReader) + Update. Skip on IsNotFound. Count IsConflict.
- If step 5 had zero conflicts and zero permanent errors: patch
CRD.status.storedVersions to [V] using MergeFrom + optimistic lock. Otherwise return the sentinel error and let controller-runtime requeue.
Worked example:
| CRD |
spec.versions |
storage: true |
status.storedVersions |
Controller action |
mcpservers.toolhive.stacklok.dev |
v1alpha1, v1beta1 |
v1beta1 |
[v1alpha1, v1beta1] |
Get+Update each CR, trim to [v1beta1] |
mcpgroups.toolhive.stacklok.dev |
v1alpha1, v1beta1 |
v1beta1 |
[v1beta1] |
noop |
newshinything.toolhive.stacklok.dev |
v1beta1, v1beta2 |
v1beta2 |
[v1beta1, v1beta2] |
Get+Update each CR, trim to [v1beta2] |
experimental.toolhive.stacklok.dev (no label) |
v1alpha1 |
v1alpha1 |
[v1alpha1] |
ignored — label missing |
Alternatives considered
Scoping: label vs. hard-coded allowlist vs. per-CRD config
| Option |
Why rejected |
| Hard-coded Go allowlist |
Adding a migratable CRD requires a code change + operator release. Worse cadence than the label. |
CAPI-style per-CRD config map (ByObjectConfig) |
Overkill — ToolHive CRDs are structurally uniform. Easy to layer on later. |
| Label on the CRD (chosen) |
Declarative, Git-controlled, forward-compatible. New migratable CRD = one marker line. |
Write path: Get+Update vs. SSA on /status vs. annotation bump
The original design proposed metadata-only SSA on /status to dodge admission webhooks. Empirical investigation in envtest showed:
- An empty-body SSA Apply is elided by the API server before reaching the storage encoder — no RV bump, no storage migration.
- An annotation-bump strategy works but leaves a permanent annotation on every CR (cruft, or requires a clean-up pass later).
- A plain
Get + Update against the main resource gives identical re-encode semantics (the storage encoder runs at the same layer regardless of subresource) while having zero permanent state, no field-manager surface, and matching what kube-storage-version-migrator upstream actually does.
Get+Update was chosen for those reasons. Webhook surface is acceptable: ToolHive's validating webhooks only reject spec-invariant violations, which a same-spec round-trip Update cannot trigger.
Deployment model: in-operator vs. external kube-storage-version-migrator
- It's a separate deployment with its own RBAC, lifecycle, and observability. More surface for users to understand.
- It's driven by
StorageVersionMigration CRs that someone has to create per migration.
- For a targeted, single-product scope (our ~12 CRDs), having the operator self-migrate is simpler end-to-end.
Admins who already run kube-storage-version-migrator can still opt out via the feature flag (once it's surfaced in the chart in PR-C).
Precedent and prior art
Label-scoped CRD watches — widely used: Knative Duck typing (duck.knative.dev/<type>=true), OLM (operators.coreos.com/<operator>.<namespace>), Crossplane (pkg.crossplane.io/package), Gatekeeper (admission.gatekeeper.sh/ignore).
Storage-version migration via Get+Update: the upstream pattern used by kube-storage-version-migrator. The CAPI controllers/crdmigrator does the same with its own retry/conflict handling.
The specific combination — in-operator storedVersions migrator scoped by a label on the CRD — is our synthesis for the ergonomics (one-line marker to add a new migratable CRD, admin escape hatch, no compile-time coupling).
Known tradeoffs and caveats
- Silent "forgot the marker" failure — adding a new CRD without the marker produces no build error; the migrator just skips it. Mitigated by the PR-B CI test that requires every
v1beta1 root type to carry either the migrate marker or an explicit exclude marker.
- Label is user-modifiable in-cluster — admin escape hatch, but can fight with GitOps reconciliation. Documented as emergency only; long-term opt-out is the feature flag.
- Third-party CRDs in the group — wildcard RBAC
toolhive.stacklok.dev/* means any third-party CRD that adopted our group + our label would be picked up. The group is Stacklok-owned by convention; risk is theoretical.
- Label contract is public — once shipped and users have written Kyverno rules or GitOps filters against it, renaming becomes a breaking change. Treat the label key as a stable API surface.
- Feature-flag footgun (post-PR-C) — once the default flips to on, an admin who disables the flag and forgets to re-enable it before the
v1alpha1-removal release will have objects orphaned in etcd. The chart-value docs will call this out.
Design details
Interface contracts
package controllers
// StorageVersionMigratorReconciler reconciles CustomResourceDefinition objects
// in the toolhive.stacklok.dev group that carry the opt-in label
// toolhive.stacklok.dev/auto-migrate-storage-version=true. For each such CRD
// it re-stores every CR at the current storage version via a plain Get+Update
// and cleans status.storedVersions so a future release can drop deprecated
// versions from spec.versions without orphaning etcd objects.
//
// Disabled by default in the initial release; admins opt in via the
// ENABLE_STORAGE_VERSION_MIGRATOR environment variable. A follow-up release
// flips the default to enabled and surfaces a chart value for opt-out.
type StorageVersionMigratorReconciler struct {
client.Client // cached reads/writes for CRs
APIReader client.Reader // live reads (bypasses informer)
Scheme *runtime.Scheme
Recorder record.EventRecorder
// cache prevents re-Update of objects whose (UID, resourceVersion) was
// already successfully migrated within the TTL window. Periodically
// GC'd for deleted CRs by a manager.RunnableFunc.
cache *migrationCache
}
func (r *StorageVersionMigratorReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error)
func (r *StorageVersionMigratorReconciler) SetupWithManager(mgr ctrl.Manager) error
const (
AutoMigrateLabel = "toolhive.stacklok.dev/auto-migrate-storage-version"
AutoMigrateValue = "true"
ToolhiveGroup = "toolhive.stacklok.dev"
)
In cmd/thv-operator/app/app.go:
const envEnableStorageVersionMigrator = "ENABLE_STORAGE_VERSION_MIGRATOR"
func isStorageVersionMigratorEnabled() bool { /* defaults to false */ }
func setupStorageVersionMigrator(mgr ctrl.Manager) error
RBAC markers (regenerated into deploy/charts/operator/templates/clusterrole/role.yaml):
// +kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions,verbs=get;list;watch
// +kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions/status,verbs=patch;update
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*,verbs=get;list;update
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*/status,verbs=update
Tests (envtest)
| Test |
Verifies |
TestReconcile_NoopWhenClean |
storedVersions already [v1beta1]: no writes issued |
TestReconcile_HappyPathTrimsStoredVersions |
storedVersions trimmed after Get+Update on every CR |
TestReconcile_ReencodesCRsStoredAtPriorVersion |
empirical proof RV bumps when etcd holds bytes at an older apiVersion |
TestReconcile_PartialFailureLeavesStoredVersionsUntouched |
permanent error leaves storedVersions intact for retry |
TestReconcile_ConflictDuringRestoreLeavesStoredVersionsUntouched |
IsConflict leaves storedVersions intact, retries cleanly |
TestReconcile_SkipsForeignCRDs |
non-toolhive CRDs are ignored |
TestReconcile_SkipsUnlabelledToolhiveCRD |
toolhive CRDs without the opt-in label are ignored |
TestReconcile_Pagination |
the continue-token list loop works |
PR strategy
Split into three sequential PRs to keep each one independently reviewable and to maintain default-off safety throughout the rollout.
PR-A — Controller behind default-off flag (open: #5362)
cmd/thv-operator/controllers/storageversionmigrator_controller.go (reconciler + cache + RBAC markers)
cmd/thv-operator/app/app.go (scheme registration, setup, env-var gating, defaults to false)
cmd/thv-operator/test-integration/storageversionmigrator/ (8 envtest scenarios — CRDs created at runtime, independent of opt-in labels)
deploy/charts/operator/templates/clusterrole/role.yaml (regenerated)
- Not in PR-A: chart
values.yaml flag entry, deployment.yaml env var, opt-in labels, user docs
PR-B — Opt-in labels + marker-coverage CI
+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true marker on all 12 v1beta1 root types
- Regenerated CRD YAML (
deploy/charts/operator-crds/files/crds/*.yaml, deploy/charts/operator-crds/templates/*.yaml)
cmd/thv-operator/api/v1beta1/marker_coverage_test.go CI test enforcing every root type carries either the migrate marker or an explicit +thv:storage-version-migrator:exclude marker
PR-C — Flip default on + chart surface + docs
- Operator-code default flipped to
true
operator.features.storageVersionMigrator: true value in values.yaml with helm-docs comment
ENABLE_STORAGE_VERSION_MIGRATOR env var wired in deployment.yaml
docs/operator/storage-version-migration.md — user-facing reference (auto-migration behaviour, label contract, opt-out via feature flag, interaction with future v1alpha1 removal)
docs/operator/upgrade-guide/ — kind-cluster walkthrough + sample CR fixtures verified end-to-end
Acceptance criteria
Mapped across the three PRs:
| AC |
PR |
New StorageVersionMigrator controller reconciling on CustomResourceDefinition changes, scoped to toolhive.stacklok.dev + opt-in label |
PR-A |
Env-var feature flag (ENABLE_STORAGE_VERSION_MIGRATOR) gates registration; defaults to false |
PR-A |
| 8 envtest scenarios pass |
PR-A |
| RBAC for the controller's verbs ships with the controller |
PR-A |
All 12 current v1beta1 root types carry the opt-in label via kubebuilder markers |
PR-B |
CI marker-coverage test fails when a v1beta1 root type lacks either the migrate marker or an explicit exclude marker |
PR-B |
Operator-code default flips to true; chart surfaces operator.features.storageVersionMigrator |
PR-C |
User-facing docs explain auto-migration, the label contract, the feature-flag opt-out, and interaction with future v1alpha1 removal |
PR-C |
| End-to-end kind walkthrough exercises the full upgrade path with sample CR fixtures |
PR-C |
Risks
| Risk |
Severity |
Mitigation |
| Large cluster: many CRs per kind, reconcile timeout |
LOW |
Paginated list + per-CR Update is O(n); migration cache prevents re-work |
| Webhook added later on the main resource breaks migrator |
LOW |
A same-spec round-trip Update cannot trigger spec-invariant rejections |
| Label forgotten on a newly-added CRD that should be migrated |
LOW |
CI marker-coverage test (PR-B) catches this at PR time |
Admin disables feature flag (post-PR-C) and forgets to re-enable before v1alpha1 removal |
MED |
Chart-value docs explicitly warn; the flag is intended for admins running kube-storage-version-migrator externally |
| GitOps reconciliation fights with admin escape-hatch label removal |
LOW |
Docs: label is emergency only, feature flag is the supported long-term opt-out |
| Cache entries leak for deleted CRs |
LOW |
Periodic GC every 10 minutes via a manager.RunnableFunc |
When this is needed
Before the release that drops v1alpha1 from the ToolHive CRDs. PR-A ships the controller (default off) so admins can opt in during the deprecation window. PR-B lands the opt-in labels. PR-C flips the default on and surfaces the chart value. By the time the v1alpha1-removal release lands, every cluster's storedVersions is already clean and the version removal is a zero-risk CRD update.
References
Problem statement
PR #4849 graduated our CRDs from
v1alpha1tov1beta1using a multi-version serving strategy:v1alpha1andv1beta1are served;v1beta1is the storage version.conversion.strategy: None— safe because the schemas are identical.As users re-apply their manifests at
v1beta1, each CRD'sstatus.storedVersionsends up looking like["v1alpha1", "v1beta1"]. That state is correct and working, but it blocks a future release from droppingv1alpha1entirely: the API server refuses to remove a version fromspec.versionswhile it's still listed instatus.storedVersions, since doing so would orphan objects in etcd that nothing can deserialise.Kubernetes does not garbage-collect
storedVersionsautomatically. Someone has to re-store every object at the current storage version and then explicitly patchstoredVersionsto drop the old entry.Motivation
Before the release that graduates CRDs to
v1beta2(or any future storage version), every cluster'sstatus.storedVersionson the ToolHive CRDs must contain only the current storage version. Without automation, users would have to either:Neither is a good experience for users who just want to upgrade ToolHive. Since the operator already watches every ToolHive CRD, it can do this itself — and ship the migrator one release before any version removal, so the
storedVersionscleanup happens quietly in the background during the deprecation window.Goal
Add a
StorageVersionMigratorcontroller to the ToolHive operator that, for each opted-intoolhive.stacklok.devCRD:status.storedVersionscontains versions other than the current storage version.status.storedVersionsdown to[<currentStorageVersion>].The controller is generic (works for any current-and-future storage-version transition) and scoped per-CRD via an opt-in label so future CRDs that shouldn't be migrated stay out of scope automatically.
Initial rollout policy: the controller ships disabled by default (
ENABLE_STORAGE_VERSION_MIGRATOR=false) for one release. This gives the feature a deprecation window where it can be opted into by admins who want it, before we flip the default to on alongside the user-facing docs in a later release.Proposed solution
A single controller (
cmd/thv-operator/controllers/storageversionmigrator_controller.go) that watchesCustomResourceDefinitionobjects, filters to a specific scope, and performs the migration per CRD.Scoping: opt-in label on the CRD
The controller watches CRDs matching:
spec.group == "toolhive.stacklok.dev".metadata.labels["toolhive.stacklok.dev/auto-migrate-storage-version"] == "true".The label is set at CRD-generation time via a one-line kubebuilder marker on each type:
task operator-manifestsbakes the label into the generated CRD YAML. All 12 currentv1beta1root types get the marker. New CRDs opt in by adding the same marker; CRDs that shouldn't be migrated omit it.To prevent the "silent forgot-the-marker" failure mode, a CI test walks
cmd/thv-operator/api/v1beta1/and fails if any+kubebuilder:object:root=truetype lacks either the migrate marker or an explicit// +thv:storage-version-migrator:excludesibling marker.Write path: plain Get+Update on the main resource
To re-encode a CR at the current storage version, the controller issues a plain
Get+Updateagainst the main resource (no SSA, no annotation bump, no/statuswrite).Why this works:
Updatearrives, the API server runs the request body through the storage encoder to produce the etcd-shaped bytes, then compares those bytes against what's currently in etcd. If they differ, the Update is committed and the resourceVersion bumps.apiVersion(e.g.v1alpha1), the freshly-encoded bytes from our Update use the current storage version, which differs from the etcd bytes — so the API server commits even though our request body is byte-for-byte the same as what we just Got. That's the storage-version migration.This is the same approach upstream
kube-storage-version-migratoruses today. We verified the mechanics empirically in envtest viaTestReconcile_ReencodesCRsStoredAtPriorVersion— RV bumps when the prior storage wasv1alpha1, doesn't bump when the prior storage was alreadyv1beta1.Webhook surface: ToolHive's existing validating webhooks (MCPServer, MCPGroup, VirtualMCPServer) run on the Update path, but they only reject changes that break spec invariants. A round-trip Update with the same spec passes by definition because the spec hasn't changed.
Concurrent-update safety
The reconciler treats
IsConflictandIsNotFoundseparately:IsNotFoundduring the per-CRGet: harmless — the CR was deleted between list and re-store. Skip; nothing to migrate.IsConflicton theUpdate: another writer raced us. We do not retry inline (that would risk lost-update semantics in a loop). Instead we count conflicts and, if any occurred, return a sentinel error so the outer reconcile does not patchstatus.storedVersionsdown. The CRD stays "dirty" and the next reconcile pass retries cleanly.status.storedVersionsis only trimmed after every CR in the pass succeeded with zero conflicts.Migration cache
A short-lived in-memory cache keyed on
(CRD name, CR UID, resourceVersion)prevents re-issuing Updates against objects already re-stored within the last hour. Amanager.RunnableFuncGC runs every 10 minutes to drop entries for CRs that have been deleted.Feature flag (operator-wide off switch)
The controller is gated by an environment variable
ENABLE_STORAGE_VERSION_MIGRATOR:false. The chart does not surface the flag; admins who want to opt in must inject the env var viaoperator.env. Keeps casual enablement out of the standardhelm install --setpaths during the deprecation window.operator.features.storageVersionMigratorvalue lands in the chart and the operator-code default flips totrue. From that point onward the feature is on by default; admins runningkube-storage-version-migratorexternally opt out via the chart value.How the controller decides what to migrate
The label is just the opt-in gate — it does not drive version decisions. Every version decision is read live from each CRD's own
spec.versions:spec.versions[i].storage: truestatus.storedVersions[target], there's work to do.Reconcile flow per CRD:
spec.versions, find the entry withstorage: true→ target versionV.status.storedVersions == [V]: noop, return.Get(live, viaAPIReader) +Update. Skip onIsNotFound. CountIsConflict.CRD.status.storedVersionsto[V]usingMergeFrom+ optimistic lock. Otherwise return the sentinel error and let controller-runtime requeue.Worked example:
spec.versionsstorage: truestatus.storedVersionsmcpservers.toolhive.stacklok.dev[v1alpha1, v1beta1][v1beta1]mcpgroups.toolhive.stacklok.dev[v1beta1]newshinything.toolhive.stacklok.dev[v1beta1, v1beta2][v1beta2]experimental.toolhive.stacklok.dev(no label)[v1alpha1]Alternatives considered
Scoping: label vs. hard-coded allowlist vs. per-CRD config
ByObjectConfig)Write path: Get+Update vs. SSA on /status vs. annotation bump
The original design proposed metadata-only SSA on
/statusto dodge admission webhooks. Empirical investigation in envtest showed:Get+Updateagainst the main resource gives identical re-encode semantics (the storage encoder runs at the same layer regardless of subresource) while having zero permanent state, no field-manager surface, and matching whatkube-storage-version-migratorupstream actually does.Get+Update was chosen for those reasons. Webhook surface is acceptable: ToolHive's validating webhooks only reject spec-invariant violations, which a same-spec round-trip Update cannot trigger.
Deployment model: in-operator vs. external kube-storage-version-migrator
StorageVersionMigrationCRs that someone has to create per migration.Admins who already run
kube-storage-version-migratorcan still opt out via the feature flag (once it's surfaced in the chart in PR-C).Precedent and prior art
Label-scoped CRD watches — widely used: Knative Duck typing (
duck.knative.dev/<type>=true), OLM (operators.coreos.com/<operator>.<namespace>), Crossplane (pkg.crossplane.io/package), Gatekeeper (admission.gatekeeper.sh/ignore).Storage-version migration via Get+Update: the upstream pattern used by
kube-storage-version-migrator. The CAPIcontrollers/crdmigratordoes the same with its own retry/conflict handling.The specific combination — in-operator
storedVersionsmigrator scoped by a label on the CRD — is our synthesis for the ergonomics (one-line marker to add a new migratable CRD, admin escape hatch, no compile-time coupling).Known tradeoffs and caveats
v1beta1root type to carry either the migrate marker or an explicit exclude marker.toolhive.stacklok.dev/*means any third-party CRD that adopted our group + our label would be picked up. The group is Stacklok-owned by convention; risk is theoretical.v1alpha1-removal release will have objects orphaned in etcd. The chart-value docs will call this out.Design details
Interface contracts
In
cmd/thv-operator/app/app.go:RBAC markers (regenerated into
deploy/charts/operator/templates/clusterrole/role.yaml):Tests (envtest)
TestReconcile_NoopWhenClean[v1beta1]: no writes issuedTestReconcile_HappyPathTrimsStoredVersionsTestReconcile_ReencodesCRsStoredAtPriorVersionTestReconcile_PartialFailureLeavesStoredVersionsUntouchedTestReconcile_ConflictDuringRestoreLeavesStoredVersionsUntouchedTestReconcile_SkipsForeignCRDsTestReconcile_SkipsUnlabelledToolhiveCRDTestReconcile_PaginationPR strategy
Split into three sequential PRs to keep each one independently reviewable and to maintain default-off safety throughout the rollout.
PR-A — Controller behind default-off flag (open: #5362)
cmd/thv-operator/controllers/storageversionmigrator_controller.go(reconciler + cache + RBAC markers)cmd/thv-operator/app/app.go(scheme registration, setup, env-var gating, defaults to false)cmd/thv-operator/test-integration/storageversionmigrator/(8 envtest scenarios — CRDs created at runtime, independent of opt-in labels)deploy/charts/operator/templates/clusterrole/role.yaml(regenerated)values.yamlflag entry,deployment.yamlenv var, opt-in labels, user docsPR-B — Opt-in labels + marker-coverage CI
+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=truemarker on all 12v1beta1root typesdeploy/charts/operator-crds/files/crds/*.yaml,deploy/charts/operator-crds/templates/*.yaml)cmd/thv-operator/api/v1beta1/marker_coverage_test.goCI test enforcing every root type carries either the migrate marker or an explicit+thv:storage-version-migrator:excludemarkerPR-C — Flip default on + chart surface + docs
trueoperator.features.storageVersionMigrator: truevalue invalues.yamlwith helm-docs commentENABLE_STORAGE_VERSION_MIGRATORenv var wired indeployment.yamldocs/operator/storage-version-migration.md— user-facing reference (auto-migration behaviour, label contract, opt-out via feature flag, interaction with futurev1alpha1removal)docs/operator/upgrade-guide/— kind-cluster walkthrough + sample CR fixtures verified end-to-endAcceptance criteria
Mapped across the three PRs:
StorageVersionMigratorcontroller reconciling onCustomResourceDefinitionchanges, scoped totoolhive.stacklok.dev+ opt-in labelENABLE_STORAGE_VERSION_MIGRATOR) gates registration; defaults to falsev1beta1root types carry the opt-in label via kubebuilder markersv1beta1root type lacks either the migrate marker or an explicit exclude markertrue; chart surfacesoperator.features.storageVersionMigratorv1alpha1removalRisks
v1alpha1removalkube-storage-version-migratorexternallymanager.RunnableFuncWhen this is needed
Before the release that drops
v1alpha1from the ToolHive CRDs. PR-A ships the controller (default off) so admins can opt in during the deprecation window. PR-B lands the opt-in labels. PR-C flips the default on and surfaces the chart value. By the time thev1alpha1-removal release lands, every cluster'sstoredVersionsis already clean and the version removal is a zero-risk CRD update.References
v1alpha1tov1beta1(the reason this is needed)kube-storage-version-migratorcluster-api/controllers/crdmigrator