fix: manage Karpenter CRDs in lockstep with the controller #325
Install the karpenter-crd chart as a second helm.cattle.io/v1 HelmChart pinned to the same karpenter_version as the controller, so a controller bump can never run against stale, hand-applied CRDs. Pre-stamp the existing CRDs with Helm ownership metadata so the chart can adopt them on existing clusters; the stamp is a no-op on greenfield. Order CR creation with dependsOn (CRD chart first). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the manual CustomResourcePatch ownership pre-stamp with the helm-controller's native spec.takeOwnership (helm --take-ownership). Bump helm-controller v0.16.10 -> v0.16.14 (adds takeOwnership) on both AWS and Azure clusters, and add the takeOwnership boolean to PTD's embedded HelmChart CRD schema so the field is not pruned. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Condense the inline comment in awsHelmKarpenter to the essentials and move the full rationale (CRD staleness, takeOwnership adoption, version requirements, limitations, rollout) to docs/infrastructure/karpenter-crds.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
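The commits above describe two HelmChart objects that share a single version pin. A minimal Go sketch of that shape follows, using plain maps. The helper names (`helmChart`, `karpenterCharts`), the `kube-system` namespace, the object names, and the `karpenter-crd` chart URL are assumptions for illustration and are not taken from the PR's code.

```go
package main

import "fmt"

// helmChart builds an unstructured helm.cattle.io/v1 HelmChart object.
// Hypothetical helper: the names, namespace, and chart URLs used with it
// below are illustrative, not copied from the PR.
func helmChart(name, chart, version string, takeOwnership bool) map[string]interface{} {
	spec := map[string]interface{}{
		"chart":           chart,
		"version":         version,
		"targetNamespace": "kube-system", // assumed namespace
	}
	if takeOwnership {
		// Lets Helm adopt CRDs that already exist on brownfield clusters.
		spec["takeOwnership"] = true
	}
	return map[string]interface{}{
		"apiVersion": "helm.cattle.io/v1",
		"kind":       "HelmChart",
		"metadata":   map[string]interface{}{"name": name},
		"spec":       spec,
	}
}

// karpenterCharts returns the CRD chart and the controller chart,
// both pinned to the same karpenterVersion.
func karpenterCharts(karpenterVersion string) []map[string]interface{} {
	return []map[string]interface{}{
		helmChart("karpenter-crd", "oci://public.ecr.aws/karpenter/karpenter-crd", karpenterVersion, true),
		helmChart("karpenter", "oci://public.ecr.aws/karpenter/karpenter", karpenterVersion, false),
	}
}

func main() {
	for _, c := range karpenterCharts("1.9.0") {
		spec := c["spec"].(map[string]interface{})
		fmt.Println(spec["chart"], spec["version"], spec["takeOwnership"])
	}
}
```

Deriving both objects from one `karpenterVersion` argument is what keeps the CRDs and the controller from drifting apart on a bump.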
@claude review please
Claude finished @amdove's task in 2m 32s
PR Review: fix: manage Karpenter CRDs in lockstep with the controller
Overall this is a solid, well-reasoned fix. The approach follows Karpenter's own recommendation and the code is clean. A few things are worth discussing before merge:
Key observations
Both the new
Ordering: Both NodePool and EC2NodeClass CRs carry
This is unconditional: it applies on every reconcile, not just once on brownfield migration. From the helm-controller source,
Helm-controller fleet-wide bump: v0.16.10 → v0.16.14 ( Four patch releases. Worth confirming that the v0.16.11–v0.16.14 changelogs contain no breaking changes, specifically anything that would affect Azure clusters that have no Karpenter. The PR notes this is fleet-wide and recommends staging validation, which is the right call.
"takeOwnership": map[string]interface{}{"type": "boolean"},
Other boolean fields in the spec schema (e.g.,
Documentation ( Well-organized. The doc covers the problem, fix, version requirements, limitations, and rollout clearly. The link back to
Suggestions
Minor
Overall the implementation is correct and well-tested locally. The fleet-wide helm-controller bump is the main risk surface, so staging validation is essential before broad rollout.
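The schema entry quoted in the review matters because the API server drops fields that a structural CRD schema does not declare. The Go sketch below imitates that pruning for top-level keys only. The `prune` function and the two toy schemas are assumptions for illustration; real pruning is recursive and is performed by the API server, not by code like this.

```go
package main

import "fmt"

// schemaWithout and schemaWith are toy stand-ins for the HelmChart spec
// properties before and after the takeOwnership entry is added.
var schemaWithout = map[string]interface{}{
	"chart":   map[string]interface{}{"type": "string"},
	"version": map[string]interface{}{"type": "string"},
}

var schemaWith = map[string]interface{}{
	"chart":         map[string]interface{}{"type": "string"},
	"version":       map[string]interface{}{"type": "string"},
	"takeOwnership": map[string]interface{}{"type": "boolean"},
}

// prune returns a copy of obj that keeps only the keys declared in
// properties. Simplified illustration of structural-schema pruning.
func prune(obj, properties map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(obj))
	for k, v := range obj {
		if _, declared := properties[k]; declared {
			out[k] = v
		}
	}
	return out
}

func main() {
	spec := map[string]interface{}{
		"chart":         "karpenter-crd",
		"version":       "1.9.0",
		"takeOwnership": true,
	}
	fmt.Println(prune(spec, schemaWithout)) // takeOwnership is dropped
	fmt.Println(prune(spec, schemaWith))    // takeOwnership survives
}
```

Without the schema entry the field would silently disappear from the stored object, so the helm-controller would never see it.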
Summary
Manage Karpenter's CRDs in lockstep with the controller, so a karpenter_version bump can never run against stale CRDs again.

Today the controller chart (oci://public.ecr.aws/karpenter/karpenter) bundles its CRDs in crds/. Helm installs crds/ once on first install and never upgrades them, so every karpenter_version bump updates the controller while leaving the CRDs frozen at the originally installed version. When a controller ≥1.9 then emits a Gte/Lte requirement (normalized from a NodePool Gt/Lt floor) against a pre-1.9 CRD, NodeClaim creation is rejected and no nodes provision.

What changed

- Install the karpenter-crd chart as a second helm.cattle.io/v1 HelmChart, pinned to the same karpenter_version as the controller (single source of truth). Its CRDs are templated, so Helm upgrades them on every bump; this is Karpenter's officially recommended way to manage CRD lifecycle.
- crds/) via the helm-controller's native spec.takeOwnership (→ helm --take-ownership). Scoped to exactly the CRDs the karpenter-crd chart renders; a no-op on greenfield.
- Bump helm-controller v0.16.10 → v0.16.14 (AWS + Azure), the release that adds spec.takeOwnership, and add the takeOwnership field to PTD's embedded HelmChart CRD schema so the API server doesn't prune it.

Notes / known limitations

- crds/ are intentionally left in place. skipCRDs/crds is not supported by the k3s helm-controller (verified against the upstream HelmChartSpec and doc/helmchart.md), so the controller chart's CRDs can't be disabled. Per the Helm and Karpenter docs this coexistence is benign on existing clusters (the crds/ install is a no-op when CRDs already exist).
- dependsOn orders CR creation (CRD chart before the controller chart and the NodePool/EC2NodeClass CRs), but the helm-controller reconciles HelmChart CRs asynchronously, so it doesn't strictly guarantee that the CRD helm job finishes first. The CRD changes are additive and backward-compatible, and the dependent CRs retry until the CRDs exist.

Rollout / risk

- clusters step, not just Karpenter clusters. Validate on a staging cluster (one already in the stale-CRD state) before broad rollout; confirm both the brownfield adoption path and a greenfield install.
- takeOwnership is PTD's first use of this helm-controller field.

Testing

- just check-go, just cli, and just test-lib all pass.

🤖 Generated with Claude Code
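The stale-CRD failure described in the Summary comes down to enum validation: a CRD that predates Gte/Lte rejects any requirement using those operators. The Go sketch below illustrates the idea. The two operator lists and the `validateOperator` function are assumptions for illustration; the actual enum in Karpenter's CRDs is not quoted in this PR.

```go
package main

import "fmt"

// staleOperators models the operators an older CRD accepts (assumed list).
var staleOperators = []string{"In", "NotIn", "Exists", "DoesNotExist", "Gt", "Lt"}

// currentOperators models a newer CRD that also accepts Gte and Lte.
var currentOperators = append(append([]string{}, staleOperators...), "Gte", "Lte")

// validateOperator mimics an enum check: it returns an error when op is
// not in the allowed list, which is how a stale CRD rejects new values.
func validateOperator(allowed []string, op string) error {
	for _, a := range allowed {
		if a == op {
			return nil
		}
	}
	return fmt.Errorf("unsupported operator %q, supported values: %v", op, allowed)
}

func main() {
	// A newer controller emits Gte; the stale schema rejects it.
	fmt.Println(validateOperator(staleOperators, "Gte"))
	// After the CRD chart upgrades the schema, the same value passes.
	fmt.Println(validateOperator(currentOperators, "Gte"))
}
```

Upgrading the CRDs through the karpenter-crd chart on every bump is what moves a cluster from the first case to the second.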