Skip to content

fix: manage Karpenter CRDs in lockstep with the controller#325

Draft
amdove wants to merge 3 commits into
mainfrom
manage-karpenter-crds
Draft

fix: manage Karpenter CRDs in lockstep with the controller#325
amdove wants to merge 3 commits into
mainfrom
manage-karpenter-crds

Conversation

@amdove

@amdove amdove commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Manage Karpenter's CRDs in lockstep with the controller, so a karpenter_version bump can never run against stale CRDs again.

Today the controller chart (oci://public.ecr.aws/karpenter/karpenter) bundles its CRDs in crds/. Helm installs crds/ once on first install and never upgrades them, so every karpenter_version bump updates the controller while leaving the CRDs frozen at the originally-installed version. When a controller ≥1.9 then emits a Gte/Lte requirement (normalized from a NodePool Gt/Lt floor) against a pre-1.9 CRD, NodeClaim creation is rejected and no nodes provision.

What changed

  • Install the dedicated karpenter-crd chart as a second helm.cattle.io/v1 HelmChart, pinned to the same karpenter_version as the controller (single source of truth). Its CRDs are templated, so Helm upgrades them on every bump — this is Karpenter's officially recommended way to manage CRD lifecycle.
  • Adopt the pre-existing, Helm-unowned CRDs (created by the controller chart's crds/) via the helm-controller's native spec.takeOwnership (→ helm --take-ownership). Scoped to exactly the CRDs the karpenter-crd chart renders; a no-op on greenfield.
  • Bump the in-cluster helm-controller v0.16.10v0.16.14 (AWS + Azure), the release that adds spec.takeOwnership, and add the takeOwnership field to PTD's embedded HelmChart CRD schema so the API server doesn't prune it.

Notes / known limitations

  • The controller chart's bundled crds/ are intentionally left in place. skipCRDs/crds is not supported by the k3s helm-controller (verified against the upstream HelmChartSpec and doc/helmchart.md), so the controller chart's CRDs can't be disabled. Per Helm + Karpenter docs this coexistence is benign on existing clusters (the crds/ install is a no-op when CRDs already exist).
  • dependsOn orders CR creation (CRD chart before the controller chart and NodePool/EC2NodeClass CRs), but the helm-controller reconciles HelmChart CRs asynchronously, so it doesn't strictly guarantee the CRD helm job finishes first. The CRD changes are additive/backward-compatible and the dependent CRs retry until the CRDs exist.

Rollout / risk

  • Fleet-wide: the helm-controller bump and embedded-schema change apply to every cluster via the clusters step, not just Karpenter clusters. Validate on a staging cluster (one already in the stale-CRD state) before broad rollout — confirm both the brownfield adoption path and a greenfield install.
  • takeOwnership is PTD's first use of this helm-controller field.

Testing

  • just check-go, just cli, just test-lib all pass.

🤖 Generated with Claude Code

amdove and others added 2 commits June 22, 2026 08:47
Install the karpenter-crd chart as a second helm.cattle.io/v1 HelmChart
pinned to the same karpenter_version as the controller, so a controller
bump can never run against stale, hand-applied CRDs. Pre-stamp the
existing CRDs with Helm ownership metadata so the chart can adopt them on
existing clusters; the stamp is a no-op on greenfield. Order CR creation
with dependsOn (CRD chart first).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the manual CustomResourcePatch ownership pre-stamp with the
helm-controller's native spec.takeOwnership (helm --take-ownership).
Bump helm-controller v0.16.10 -> v0.16.14 (adds takeOwnership) on both
AWS and Azure clusters, and add the takeOwnership boolean to PTD's
embedded HelmChart CRD schema so the field is not pruned.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@amdove amdove changed the title Manage Karpenter CRDs in lockstep with the controller fix: manage Karpenter CRDs in lockstep with the controller Jun 23, 2026
Condense the inline comment in awsHelmKarpenter to the essentials and move
the full rationale (CRD staleness, takeOwnership adoption, version
requirements, limitations, rollout) to docs/infrastructure/karpenter-crds.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@amdove

amdove commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

@claude review please

@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown

Claude finished @amdove's task in 2m 32s —— View job


PR Review: fix: manage Karpenter CRDs in lockstep with the controller

  • Gather context and read changed files
  • Review lib/steps/helm_aws.go changes
  • Review lib/steps/clusters_aws.go and clusters_azure.go changes
  • Review lib/steps/clusters_helpers.go changes
  • Review docs/infrastructure/karpenter-crds.md
  • Post final review

Overall this is a solid, well-reasoned fix. The approach follows Karpenter's own recommendation and the code is clean. A few things worth discussing before merge:

Key observations

awsHelmChartCR not used for the CRD or controller chart (lib/steps/helm_aws.go:1375–1428)

Both the new karpenter-crd HelmChart and the existing controller HelmChart are created inline via apiextensions.NewCustomResource rather than the awsHelmChartCR helper. The CRD chart needs takeOwnership (unsupported by the helper), which is a good reason to go inline there. The controller chart appears to have moved inline too so pulumi.DependsOn([]pulumi.Resource{crdChart}) could be passed as a resource option — but awsHelmChartCR's variadic aliases ...pulumi.ResourceOption could actually accept pulumi.DependsOn(...) as-is (it's a pulumi.ResourceOption). Not a blocker, just noting the inconsistency exists.

Ordering: dependsOn reaches NodePool/EC2NodeClass CRs (helm_aws.go:1539, 1587)

Both NodePool and EC2NodeClass CRs carry pulumi.DependsOn([]pulumi.Resource{crdChart}). This is correct and good — it ensures the CRD HelmChart CR exists before Karpenter CRs are applied. The known limitation (Pulumi's readiness = "CR created", not "helm job completed") is accurately documented. Worth making sure existing clusters don't have a window where the controller is running against freshly-created CRD CRs during the initial ensure run, but the PR notes the CRD changes are backward-compatible so retries handle it.

takeOwnership: true is always set (helm_aws.go:1388)

This is unconditional — it applies on every reconcile, not just once on brownfield migration. From the helm-controller source, --take-ownership is a no-op once the Helm release already owns the CRDs, so this is safe. But it does mean every helm job run passes --take-ownership permanently. That's how the helm-controller spec.takeOwnership field works (boolean, not a one-time flag), so this is correct behavior given the API.

Helm-controller fleet-wide bump: v0.16.10 → v0.16.14 (clusters_aws.go:702, clusters_azure.go:432)

Four patch releases. Worth confirming the v0.16.11–v0.16.14 changelogs contain no breaking changes — specifically, anything that would affect Azure clusters that have no Karpenter. The PR notes this is fleet-wide and recommends staging validation, which is the right call.

takeOwnership schema entry (clusters_helpers.go:161)

"takeOwnership": map[string]interface{}{"type": "boolean"},

Other boolean fields in the spec schema (e.g., bootstrap, authPassCredentials) use the same pattern without nullable: true. This is consistent and correct — takeOwnership defaults to false when absent.

Documentation (docs/infrastructure/karpenter-crds.md)

Well-organized. The doc covers the problem, fix, version requirements, limitations, and rollout clearly. The link back to awsHelmKarpenter in lib/steps/helm_aws.go at line 4 is accurate.

Suggestions

  1. Consider extending awsHelmChartCR to accept takeOwnership via a variadic option or a small options struct, if more charts are expected to need it. For now inline is fine given it's isolated to one function, but the inconsistency will be noticeable as the file grows.

  2. Staging validation checklist — The PR recommends validating on a staging cluster before broad rollout. Consider adding that as a merge gate (manual approval or a label workflow) so it's tracked explicitly.

Minor

  • The valuesContent field is absent from the CRD chart spec (no values to set). Confirm the karpenter-crd chart has no required values — it shouldn't since it only renders CRDs, but worth a quick check against oci://public.ecr.aws/karpenter/karpenter-crd.

Overall the implementation is correct and well-tested locally. The fleet-wide helm-controller bump is the main risk surface — staging validation is essential before broad rollout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant