Add StorageVersionMigrator controller (opt-in, default off)#5362
Add StorageVersionMigrator controller (opt-in, default off)#5362ChrisJBurns wants to merge 10 commits into
Conversation
Adds a StorageVersionMigrator controller to the operator behind a default-off feature flag (ENABLE_STORAGE_VERSION_MIGRATOR). The controller reconciles status.storedVersions on opted-in toolhive.stacklok.dev CRDs by issuing a Get+Update against each CR, which re-encodes etcd objects to the current storage version. This is required before a future release can drop a deprecated apiVersion (e.g. v1alpha1) from spec.versions without orphaning rows. The flag defaults to false, so this PR has no functional impact for any user until they explicitly opt in via operator.env. Helm chart exposure (values.yaml flag entry and deployment env-var wiring) is deferred to a follow-up PR to reduce accidental enablement during the deprecation window. RBAC for the controller's verbs ships in this PR so the flag can be flipped on at any time without permission-denied errors. Opt-in label markers on the 12 v1beta1 root types and the marker-coverage CI test are deferred to a follow-up PR; the envtest suite creates its own CRDs at runtime and is independent of those changes. Part of #4969. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5362 +/- ##
==========================================
+ Coverage 68.70% 68.75% +0.04%
==========================================
Files 625 627 +2
Lines 63422 63700 +278
==========================================
+ Hits 43575 43795 +220
- Misses 16604 16656 +52
- Partials 3243 3249 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
ChrisJBurns
left a comment
There was a problem hiding this comment.
Multi-Agent Consensus Review
Agents consulted: kubernetes-go-expert, code-reviewer, go-architect, toolhive-expert (go-security-reviewer's sandbox couldn't read the diff — its threat-model checklist was independently covered by the others)
Consensus Summary
| # | Finding | Consensus | Severity | Action |
|---|---|---|---|---|
| F1 | Doc comment "Enabled by default" contradicts default-OFF | 10/10 | HIGH | Fix |
| F2 | Silent env-var parse fallback (fail-loudly rule) | 9/10 | MEDIUM | Fix |
| F3 | Wildcard RBAC widens group-wide permissions | 7/10 | MEDIUM | Discuss |
| F6 | Status().Patch exception to MutateAndPatchStatus needs in-code note |
7/10 | INFO | Polish |
| F7 | Disabled-state INFO log should be DEBUG | 5/10 | LOW | Optional |
| F8 | Env var name lacks TOOLHIVE_ prefix |
5/10 | LOW | Optional |
Overall
The mechanism is sound and the test design is genuinely strong. The cross-version RV-bump test ("re-encodes CRs that are stored at a prior storage version") is the load-bearing empirical proof that the apiserver actually re-encodes etcd bytes on cross-version Update — that's the only thing reviewers really need to see to trust the design, and it's there with explicit pre/post resourceVersion assertions. The pagination test wraps APIReader with a list-call counter for direct continue-token proof; the partial-failure and conflict-retry tests pin the storedVersions invariant; the use of APIReader (informer bypass) plus runtime isManagedCRD re-verification correctly handles the TOCTOU window between watch-time and reconcile-time. The defense-in-depth around the wildcard RBAC (RBAC bounds the API group; the label gate narrows within it) is well-reasoned and the errMigrationRetriedDueToConflicts sentinel correctly prevents trimming storedVersions when any per-CR write is unverified.
The findings here are mostly polish, not architectural. One HIGH and two MEDIUMs:
- F1 (HIGH) is straightforward doc/code drift — the type doc says one thing, the env-var gate does another, and the referenced
operator.features.storageVersionMigratorhelm path doesn't exist yet. Fixable in a single comment block. - F2 (MEDIUM) is the only operability concern:
ENABLE_STORAGE_VERSION_MIGRATOR=turesilently disables the feature with an INFO log. Per.claude/rules/go-style.md"Constructor Validation: Fail Loudly", that misconfiguration should fail startup or at minimum surface at WARN. - F3 (MEDIUM) is the wildcard RBAC. The runtime gate (
isManagedCRD) defends against the broad RBAC at execution time, but the regenerated chart now shipsgroups=toolhive.stacklok.dev,resources=*on every install regardless of whether the migrator is on. Worth surfacing in the PR's user-facing-change section even if the design is intentional.
Nothing here blocks the merge in a strict sense; all findings have unambiguous, contained fixes. The HIGH is HIGH only because doc/code drift on the activation mechanism is the kind of thing that causes operator escalations during a deprecation window.
Documentation
cmd/thv-operator/controllers/storageversionmigrator_controller.go:97-101 carries the contradicting doc comment (F1). No other documentation files in this PR need updating, but I'd suggest noting the operator SA permission widening (F3) in the PR's "Does this introduce a user-facing change?" section so chart consumers see it on review.
Generated with Claude Code
- F1: fix doc comment claiming "Enabled by default" — the controller is default-off in this release, opt-in via env var only; chart surface lands in a follow-up PR - F2: fail loudly on unparsable env-var value instead of silently defaulting to disabled (per go-style "Fail Loudly") - F6: document why patchStoredVersions cannot use the operator-wide MutateAndPatchStatus helper (CRD storedVersions is co-owned with kube-apiserver, optimistic lock is load-bearing) - F7: demote disabled-state log from INFO to V(1) (silent success) - F8: prefix env var with TOOLHIVE_ to match sibling operator env vars and avoid cross-operator collisions - Codespell: unparseable→unparsable, overrideable→overridable - Chainsaw RBAC fixture: include the migrator's apiextensions verbs and the wildcard toolhive.stacklok.dev rules so the multi-tenancy setup assertion matches the regenerated ClusterRole Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
Large PR justification has been provided. Thank you!
Same update as the multi-tenancy fixture in the previous commit — single-tenancy/setup also snapshots the operator ClusterRole and the migrator's RBAC rules need to appear in its expected YAML too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the parts of storageversionmigrator_controller.go that the envtest suite can't reach or only exercises transitively: migrationCache: TTL expiry with lazy eviction, RV mismatch as a miss, gc() evicting only expired entries, concurrent has/add/gc under the race detector, key uniqueness across CRDs and across UIDs, add() refreshing the expiry on existing entries. Pure predicate functions (isManagedCRD, isToolhiveCRDName, findStorageVersion, isMigrationNeeded): table-driven coverage of every input shape including edge cases the envtest suite would have to spin up an apiserver to exercise (nil labels, no storage version, empty storedVersions, multi-element storedVersions, etc.). Uses the migrationCache's existing `now func() time.Time` seam for frozen-clock TTL tests. No production code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds unit tests for the cheaper of the previously-uncovered controller paths: ensureInitialized: defaults applied on zero values, explicit values preserved, idempotent across repeated calls. Reconcile early returns via the controller-runtime fake client: CRD IsNotFound, foreign group skip, missing opt-in label skip, no-storage-version skip, and already-clean storedVersions short-circuit. Brings file coverage from 25.4% to 41.3% (52/126 statements). The remaining uncovered functions (restoreCRs, restoreOne, patchStoredVersions, SetupWithManager) depend on apiserver-side behaviour that the fake client doesn't simulate — storage-encoder elision, optimistic-lock 409s, manager-driven watch wiring — so they stay in envtest territory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the orchestration code paths in storageversionmigrator_controller.go that the envtest suite also tests, but at code-path level rather than apiserver-semantic level. The fake client can't simulate storage-encoder elision or optimistic-lock 409s, but it can verify: restoreOne: Get + Update propagation, NotFound on Get, Conflict on Update, generic error on Update. restoreCRs: happy path with cache population, empty list, per-CR NotFound silent skip, Conflict counted + sentinel returned, generic error aggregated, mixed conflicts+errors returns aggregate (not sentinel), first List failure wraps and returns immediately, cache hits skip Update, paginating reader confirms Continue token threads through ListOptions across three synthesized pages. patchStoredVersions: storedVersions trimmed in fake store on success, goes through /status subresource (not main resource), patch body carries resourceVersion (proof MergeFromWithOptimisticLock is in effect, not plain MergeFrom), patch errors propagate. File coverage: 41.3% → 77.8% (98/126 statements). - restoreOne, patchStoredVersions: 100% - restoreCRs: 92.3% - Reconcile: 56% (deep paths still envtest-only) - SetupWithManager: 0% (manager wiring not testable without a manager) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidates several test groups into table-driven form and trims redundant table cases without dropping any meaningful assertion or coverage. File shrinks 1261 → 1028 lines (~18%); coverage stays at 77.8% (98/126 statements on storageversionmigrator_controller.go). Notable consolidations: - 5 TestReconcile_* early-return tests → 1 table-driven TestReconcile_EarlyReturns - 3 TestRestoreOne_Propagates* tests → 1 table-driven TestRestoreOne_PropagatesErrors - 4 TestRestoreCRs_* error-classification tests → 1 table-driven TestRestoreCRs_ErrorClassification - 3 TestPatchStoredVersions_* success tests → 1 TestPatchStoredVersions_SuccessAssertsAllProperties - 2 cache key-collision tests → 1 TestMigrationCache_KeyIsolation - Dropped low-signal TestMigrationCache_GCNoopOnEmpty - Dropped duplicate helpers (newReconcileTestScheme, reconcileWithFake) - Trimmed redundant table cases in pure-function tests Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a
StorageVersionMigratorcontroller to the operator behind an opt-in feature flag (TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR). The flag defaults to false — this PR has no functional impact until an admin explicitly opts in. Required before a future release can drop a deprecatedapiVersion(e.g.v1alpha1) fromspec.versionswithout orphaning rows in etcd.This is the first of three sequential PRs that together deliver the migrator described in #4969. The full work lives in #5011 as a reference; this PR extracts just the controller code.
Part of #4969.
Medium level
StorageVersionMigratorReconcileratcmd/thv-operator/controllers/storageversionmigrator_controller.go— issues a Get+Update against each CR on opted-in CRDs, which re-encodes etcd bytes at the current storage version, then trimsstatus.storedVersionscmd/thv-operator/app/app.go:apiextensions/v1scheme registration,setupStorageVersionMigrator()setup,isStorageVersionMigratorEnabled()helper defaulting tofalseand failing loudly on an unparsable env-var valuecmd/thv-operator/test-integration/storageversionmigrator/with 8 scenarios including a cross-version re-encode testdeployment.yamlenv-var wiring — deferred to PR-C alongside the default-on flip. Reduces the chance of accidental enablement during the deprecation window. Early adopters who know about the feature can opt in viaoperator.env.Low level
cmd/thv-operator/controllers/storageversionmigrator_controller.gocmd/thv-operator/app/app.goapiextensions/v1in scheme; addsetupStorageVersionMigrator()+isStorageVersionMigratorEnabled()helper (defaultsfalse, fails loudly on unparsable values); gate registration on the flagcmd/thv-operator/test-integration/storageversionmigrator/suite_test.gocmd/thv-operator/test-integration/storageversionmigrator/controller_test.godeploy/charts/operator/templates/clusterrole/role.yamltask operator-manifests— picks up the controller's+kubebuilder:rbac:markerstest/e2e/chainsaw/operator/multi-tenancy/setup/assert-rbac-clusterrole.yamlType of change
Does this introduce a user-facing change?
Yes — operator ServiceAccount permissions widen on this PR. The regenerated
toolhive-operator-manager-roleClusterRole now grants:get/list/watchonapiextensions.k8s.io/customresourcedefinitionspatch/updateoncustomresourcedefinitions/statusget/list/updateontoolhive.stacklok.dev/*(wildcard within the operator's own API group)updateontoolhive.stacklok.dev/*/status(wildcard within the operator's own API group)These ship with the chart on every install regardless of whether the migrator is opted in. The runtime gate (
isManagedCRDrequiring the opt-in label) prevents the controller from touching anything other than labeled CRDs, but the RBAC widening is unconditional — this is intentional so admins can flip the env var on at any time without hitting permission-denied. Templating the RBAC behind a chart value is deferred to PR-C alongside the chart surface for the feature flag.No CR schema changes. No behavior change for the operator until an admin sets
TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR=true.Test plan
task buildpassesgo test ./cmd/thv-operator/... | grep -v /test-integrationgo test ./cmd/thv-operator/test-integration/storageversionmigrator/...TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR=turereturns a startup error rather than silently disablingtask lint-fix— blocked by a pre-existing tooling issue (golangci-lintbuilt with go1.25 vs. go1.26 target); not introduced by this PRLarge PR Justification
This PR exceeds 1000 lines, but the bulk is non-production code that cannot reasonably be split from the production code it covers:
controller_test.go+suite_test.go) — 8 ginkgo specs including a cross-version re-encode probe that empirically verifies the apiserver's storage-encoder behavior. Shipping the controller without its integration tests would be poor practice.role.yaml) — driven by+kubebuilder:rbac:markers in the controller; cannot be authored separately.The PR ships behind a default-off feature flag, so the production code is dormant unless explicitly opted in. PR-B (opt-in labels + marker-coverage CI) and PR-C (default-on flip + chart surface + docs) follow as separate, smaller PRs.
Special notes for reviewers
helm install --setwithout understanding the semantics.TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR(with prefix) matches sibling operator env vars likeTOOLHIVE_DEFAULT_IMAGE_PULL_SECRETSand avoids collisions with cluster-wide variables set by other operators or CI tooling.v1beta1root types + marker-coverage CI testvalues.yamlflag +deployment.yamlenv var) + user docs + upgrade-guide walkthroughGenerated with Claude Code