Skip to content

[CASCL-1386] Add evict-legacy-nodes subcommand to drain non-Datadog node groups#3026

Draft
L3n41c wants to merge 18 commits into
mainfrom
lenaic/CASCL-1386-evict-legacy-nodes
Draft

[CASCL-1386] Add evict-legacy-nodes subcommand to drain non-Datadog node groups#3026
L3n41c wants to merge 18 commits into
mainfrom
lenaic/CASCL-1386-evict-legacy-nodes

Conversation

@L3n41c
Copy link
Copy Markdown
Member

@L3n41c L3n41c commented May 18, 2026

Summary

Adds kubectl datadog autoscaling cluster evict-legacy-nodes, which migrates workloads off non-Datadog-managed node groups (cluster-autoscaler ASGs, EKS managed node groups, user-created Karpenter NodePools, standalone EC2 instances) onto the Datadog-managed Karpenter NodePools created by install, then scales the legacy groups to zero.

Jira: CASCL-1386
Builds on: CASCL-1304

Highlights

  • Re-classifies the cluster via clusterinfo.Classify and updates the dd-cluster-info ConfigMap before any destructive work.
  • Scales the legacy cluster-autoscaler Deployment to 0 replicas as step 1 (opt-out via --skip-cluster-autoscaler).
  • Creates temporary maxUnavailable: 1 PodDisruptionBudgets for workloads without one. Cleanup is label-based (app.kubernetes.io/managed-by=kubectl-datadog + autoscaling.datadoghq.com/temporary-pdb=true), so a SIGKILL'd run is reaped by the next invocation.
  • Evicts in parallel by manager type, with errors aggregated rather than aborting other types:
    • ASG → cordon + evict + UpdateAutoScalingGroup(0,0,0) (only when all nodes drained — avoids AZ-rebalance and MinSize > DesiredCapacity hazards).
    • EKS managed node groupUpdateNodegroupConfig(0,0,0), then waits for the K8s nodes carrying eks.amazonaws.com/nodegroup=<name> to disappear so the EKS-side drain finishes before temp PDBs are removed (sentinel error `errEKSDrainIncomplete`).
    • Karpenter user NodePool → cordon + evict (NodePool spec left alone, by design).
    • Standalone EC2 → cordon + evict + TerminateInstances.
  • Pre-flight refuses when no Datadog-managed NodePool exists; warns on user NodePool weight conflicts that could cause evicted pods to land back on user-managed nodes.
  • Every K8s read-modify-write is wrapped in retry.RetryOnConflict.

Out of scope (by design)

  • Modifying user-managed Karpenter NodePool specs (only their existing nodes are drained).
  • Deleting ASGs, EKS managed node groups, or NodePools — the command scales to 0; ownership and deletion belong to whoever provisioned them (Terraform, Helm, eksctl, etc.).
  • Migrating Fargate-hosted pods (Fargate has its own lifecycle).

Test plan

  • Unit tests with -race (go test -race ./cmd/kubectl-datadog/autoscaling/cluster/evict/...).
  • make lint — 0 issues.
  • make kubectl-datadog builds.
  • Pre-commit review passed: Codex (6 rounds), code-reviewer agent (2 rounds), /simplify (2 rounds), cross-validation (1 cycle).
  • Smoke test on a real EKS cluster with a CA + ASG + EKS MNG + user Karpenter NodePool + workloads.

🤖 Generated with Claude Code

… node groups

Introduces `kubectl datadog autoscaling cluster evict-legacy-nodes`, which
drains workloads from cluster-autoscaler ASGs, EKS managed node groups,
user-created Karpenter NodePools and standalone EC2 instances onto the
Datadog-managed Karpenter NodePools created by `install`, then scales the
legacy groups to zero.

Highlights:
- Re-classifies the cluster (reuses `clusterinfo.Classify`) and updates the
  `dd-cluster-info` ConfigMap before any destructive work.
- Scales the legacy cluster-autoscaler Deployment to 0 replicas (opt-out
  via `--skip-cluster-autoscaler`).
- Creates temporary `maxUnavailable: 1` PodDisruptionBudgets for workloads
  without one; cleanup is label-based, so a SIGKILL'd run is reaped on the
  next invocation.
- Evicts in parallel by manager type:
    * ASG: cordon + evict + UpdateAutoScalingGroup(0,0,0) (only when all
      nodes drained; avoids AZ-rebalance and MinSize/DesiredCapacity
      hazards).
    * EKS managed node group: UpdateNodegroupConfig(0,0,0), then waits
      for the K8s nodes to disappear so the EKS-side drain finishes
      before temp PDBs are removed.
    * Karpenter user NodePool: cordon + evict (NodePool spec untouched).
    * Standalone EC2: cordon + evict + TerminateInstances.
- Pre-flight refuses when no Datadog-managed NodePool exists; warns on
  user NodePool weight conflicts.
- Every K8s read-modify-write wrapped in retry.RetryOnConflict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 18, 2026

Codecov Report

❌ Patch coverage is 60.42714% with 315 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.65%. Comparing base (20ecb9e) to head (2f00ddd).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...d/kubectl-datadog/autoscaling/cluster/evict/run.go 12.03% 95 Missing ⚠️
...d/kubectl-datadog/autoscaling/cluster/evict/pdb.go 60.09% 67 Missing and 14 partials ⚠️
...kubectl-datadog/autoscaling/cluster/evict/evict.go 0.00% 80 Missing ⚠️
...tl-datadog/autoscaling/cluster/evict/evict_pods.go 78.94% 11 Missing and 5 partials ⚠️
...d/kubectl-datadog/autoscaling/cluster/evict/asg.go 71.79% 8 Missing and 3 partials ⚠️
...tl-datadog/autoscaling/cluster/evict/standalone.go 77.14% 5 Missing and 3 partials ⚠️
...atadog/autoscaling/cluster/evict/karpenter_user.go 54.54% 3 Missing and 2 partials ⚠️
...ctl-datadog/autoscaling/cluster/evict/preflight.go 85.29% 4 Missing and 1 partial ⚠️
.../kubectl-datadog/autoscaling/cluster/evict/plan.go 95.60% 2 Missing and 2 partials ⚠️
.../autoscaling/cluster/common/karpenter/fromnodes.go 0.00% 3 Missing ⚠️
... and 4 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3026      +/-   ##
==========================================
+ Coverage   41.50%   42.65%   +1.14%     
==========================================
  Files         335      351      +16     
  Lines       28714    30181    +1467     
==========================================
+ Hits        11919    12873     +954     
- Misses      16001    16464     +463     
- Partials      794      844      +50     
Flag Coverage Δ
unittests 42.65% <60.42%> (+1.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ctl-datadog/autoscaling/cluster/common/aws/node.go 100.00% <100.00%> (ø)
...autoscaling/cluster/common/clusterinfo/classify.go 88.46% <100.00%> (-0.07%) ⬇️
...ubectl-datadog/autoscaling/cluster/evict/prompt.go 100.00% <100.00%> (ø)
cmd/kubectl-datadog/autoscaling/cluster/cluster.go 0.00% <0.00%> (ø)
...dog/autoscaling/cluster/evict/clusterautoscaler.go 88.23% <88.23%> (ø)
...ubectl-datadog/autoscaling/cluster/evict/cordon.go 86.66% <86.66%> (ø)
...bectl-datadog/autoscaling/cluster/evict/eks_mng.go 95.12% <95.12%> (ø)
.../autoscaling/cluster/common/karpenter/fromnodes.go 11.93% <0.00%> (ø)
.../kubectl-datadog/autoscaling/cluster/evict/plan.go 95.60% <95.60%> (ø)
...atadog/autoscaling/cluster/evict/karpenter_user.go 54.54% <54.54%> (ø)
... and 7 more

... and 9 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20ecb9e...2f00ddd. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@datadog-prod-us1-3
Copy link
Copy Markdown

datadog-prod-us1-3 Bot commented May 18, 2026

Code Coverage

Fix all issues with BitsAI

🛑 Gate Violations

🎯 1 Code Coverage issue detected

A Patch coverage percentage gate may be blocking this PR.

Patch coverage: 59.11% (threshold: 80.00%)

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 59.11%
Overall Coverage: 42.90% (+1.09%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 2f00ddd | Docs | Datadog PR Page | Give us feedback!

Trailing alignment / extra blank line cleaned up by `gofmt -s` and
`golangci-lint --fix` on three of the evict-package test files. No
behavior change; restores `git diff --exit-code` after `make fmt` for
the `check_formatting` CI gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c L3n41c added the enhancement New feature or request label May 18, 2026
@L3n41c L3n41c added this to the v1.27.0 milestone May 18, 2026
Adds three small test files to lift `evict/` package coverage from 55%
to 66% so the patch-coverage gate passes:

- `preflight_test.go`: covers `warnKarpenterWeightConflicts` for no
  Karpenter targets, no conflict, equal-weight conflict, nil-weight
  defaulting to 0, and an unknown user-NodePool name.
- `prompt_test.go`: covers `printPlan` rendering of each section
  (CA, PDB, per-manager evictions, dry-run skips) and
  `promptConfirmation`'s y/N/yes parsing.
- `pdb_test.go`: covers `uniqueNodes` excluding EKS MNG entries and
  deduplicating shared node names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c
Copy link
Copy Markdown
Member Author

L3n41c commented May 18, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7ee9656a7e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +143 to +144
if p.DeletionTimestamp != nil {
return true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Wait for terminating pods before terminating nodes

When a pod has a DeletionTimestamp, treating it as skipped makes waitForNodeEmpty report success immediately after the eviction is accepted even though the container may still be running its termination grace period or preStop hook. For ASG and standalone targets, the caller then scales/terminates the underlying EC2 instance, so pods with long graceful shutdowns can be killed before they actually exit; terminating pods should still count as remaining until they disappear when the node is about to be terminated.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 09902c5.

Split shouldSkipPod into two predicates:

  • shouldSkipEviction (drainNode): keeps the previous semantics — skip Evict() for DS / mirror / terminating / completed pods.
  • podOccupiesNode (waitForNodeEmpty): a terminating pod still occupies the node until it is actually gone, so the function keeps blocking past DeletionTimestamp set. DS / mirror / completed pods don't count.

Locked in by the new TestPodOccupiesNode table test.

Comment on lines +144 to +145
if t.Manager == clusterinfo.NodeManagerEKSManagedNodeGroup {
continue
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include EKS MNG nodes when creating temporary PDBs

With an EKS managed node group target, excluding those nodes means --ensure-pdbs creates no temporary PDB for workloads whose replicas are only on that node group, even though evictEKSManagedNodeGroup immediately sets the group to 0 and relies on EKS evictions. In that scenario EKS has no PDB to respect and can disrupt all replicas of an otherwise unprotected workload at once, while the plan still tells the operator that temporary PDBs will be created for workloads without one.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the rationale shifted under me. Fixed in 09902c5 by removing the EKS MNG filter in uniqueNodes.

The original exclusion was added when evictEKSManagedNodeGroup was fire-and-forget — back then the temp PDB cleanup ran while EKS was still draining. After round 4 of the pre-commit review the orchestrator now blocks on waitEKSNodegroupEmpty before reaching cleanup, so EKS observes any temp PDB we create for the duration of its drain. Excluding the MNG entries left workloads whose every replica lived on the node group without protection — exactly the scenario you described.

TestUniqueNodes_IncludesAllManagerTypes flips the previous regression to lock the new behavior in.

L3n41c and others added 14 commits May 18, 2026 22:14
P1: `shouldSkipPod` collapsed two distinct concerns — "do not call Evict()
on this pod" (drainNode) and "this pod no longer occupies the node"
(waitForNodeEmpty). A pod with DeletionTimestamp set was treated as
absent, so for ASG/standalone targets the orchestrator could terminate
the EC2 instance before the container finished its grace period.
Splits the predicate:

- `shouldSkipEviction`: skip the Eviction call for DS / mirror /
  terminating / completed pods.
- `podOccupiesNode`: a terminating pod STILL occupies the node and
  must keep `waitForNodeEmpty` blocking. DS / mirror / completed pods
  don't.

P2: `uniqueNodes` no longer filters out EKS managed node group entries
when discovering controllers for temporary PDBs. The orchestrator now
blocks on `waitEKSNodegroupEmpty` before cleaning up the PDBs, so EKS
observes them throughout its drain. Excluding MNG nodes left a workload
whose every replica lived on a single MNG without any PDB protection;
EKS could then disrupt all replicas at once.

Test coverage:
- New `TestPodOccupiesNode` locks in the "terminating pods still
  occupy the node" semantics.
- `TestShouldSkipPod` renamed to `TestShouldSkipEviction`.
- `TestUniqueNodes_ExcludesEKSMNG` renamed to
  `TestUniqueNodes_IncludesAllManagerTypes` and inverted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…can dedup

`ExtractEC2InstanceID` and `LabelEKSNodegroup` were introduced in
`common/clusterinfo` so the evict package could reuse them, but two more
copies of the same `aws:///[^/]+/(i-[0-9a-f]+)$` regex remained inline:
in `common/clusterinfo/classify.go:197` and `common/karpenter/fromnodes.go:65`.

`common/clusterinfo` already imports `common/karpenter` (for the Karpenter
detection helpers), so karpenter can't reach back into clusterinfo without
producing an import cycle. Move the helper to a neutral home —
`common/aws/node.go` — that both can import. The unit test moves with the
function.

Net result: one regex definition, four call sites, three packages dedup'd.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing tests under `cmd/kubectl-datadog/autoscaling/cluster/` (e.g.
`install/install_test.go:TestValidate`, `apply/install_mode_test.go`,
`apply/inference_method_test.go`, `apply/create_karpenter_resources_test.go`)
all declare the table inline inside `for _, tc := range []struct{ … }{ … } {`
rather than via an intermediate `tests :=` variable.

Aligns four newly-added tests with that convention:
- `common/aws/node_test.go: TestExtractEC2InstanceID`
- `evict/evict_pods_test.go: TestShouldSkipEviction`, `TestPodOccupiesNode`
- `evict/plan_test.go: TestParseTargetSpec`
- `evict/pdb_test.go: TestTempPDBName`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`evictASG` extracted the EC2 instance ID and warned on an unexpected
providerID, but discarded the ID immediately afterwards. The check was
load-bearing in an earlier revision that terminated instances per-instance
via `TerminateInstanceInAutoScalingGroup`; after the round-2 Codex review
replaced that loop with a single `UpdateAutoScalingGroup(0,0,0)`, the ID
is no longer used here and the warning misleadingly suggested otherwise.

The `standalone` and `karpenter/fromnodes` call sites still use the
extracted ID, so the helper stays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tASG`

The five `TestEvictASG_*` functions shared the same setup (fake clientset
+ stubAutoscaling), the same call (`evictASG(ctx, client, stub, …)`) and
the same axes of assertion (`err`, `stub.scaledASGs`, optionally
`Spec.Unschedulable`). They only differed on the initial K8s objects,
whether an Eviction reactor was wired, and the drain timeouts.

Collapsing them into one `TestEvictASG` with an inline case slice keeps
the matrix of (objects × reactor × dry-run × wantErr × wantScaledASGs ×
wantUnschedulable) visible at a glance, aligns with the convention used
in `install/install_test.go`, `apply/install_mode_test.go`, etc., and
makes it cheap to add a new scenario.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same shape as the earlier `TestEvictASG` consolidation, applied to the
rest of the package. Tests that exercised the same target function with
different inputs are now grouped into one `Test<Function>` with an
inline case slice; one test per *function* rather than per *scenario*
collapses ~65 cousin functions down to ~25.

- `clusterautoscaler_test.go`: 5 → 1 (`TestScaleDownClusterAutoscaler`)
- `cordon_test.go`: 4 → 1 (`TestCordonNode`)
- `eks_mng_test.go`: 5 → 1 (`TestEvictEKSManagedNodeGroup`)
- `evict_pods_test.go`: 5 EvictPodWithRetry tests → 1 (`TestEvictPodWithRetry`)
  with a typed `evictionResponder` closure capturing each scenario's reactor
- `karpenter_user_test.go`: 2 → 1 (`TestEvictKarpenterUserNodePool`)
- `standalone_test.go`: 5 → 1 (`TestEvictStandalone`)
- `preflight_test.go`: 5 conflict tests → 1 (`TestWarnKarpenterWeightConflicts`)
- `prompt_test.go`: 4 printPlan + 2 promptConfirmation → 2 (`TestPrintPlan`,
  `TestPromptConfirmation`)
- `run_test.go`: 5 → 1 (`TestEvictAllTargetsParallel`)
- `plan_test.go`: 3 BuildPlan tests + the inline-subtests in
  `TestHasDatadogManagedNodePool` → `TestBuildPlan` + table-driven
  `TestHasDatadogManagedNodePool`
- `pdb_test.go`: per-helper consolidations
  (`TestUniqueNodes`, `TestIsTemporaryPDB`, `TestHasUserPDB`,
  `TestCleanupTempPDBs`, `TestCreateTempPDB`); `TestTempPDBName` was already
  table-driven; `TestDiscoverControllers_FiltersByNodeSet` left as a single
  scenario since the fixtures don't vary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the hand-rolled for/return-true loop with the stdlib
`slices.ContainsFunc` (Go 1.21+). Same semantics, half the lines,
no extra dependency — `slices` is already imported elsewhere in the
plugin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e boolean expression

Both functions were cascades of `if cond { return true/false }` short-circuiting
the same way `||` and `&&` already do. The two predicates collapse to a single
return line each, which makes the asymmetry between the two — `||` includes
`p.DeletionTimestamp \!= nil`, `&&` doesn't — visible at a glance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the hand-rolled accumulator loop in `waitForNodeEmpty` with
`lo.CountBy`. Same semantics, fewer lines, intent visible at the
function name. The stdlib `slices` package has `ContainsFunc` /
`IndexFunc` but no `CountFunc`, so `lo.CountBy` is the right call —
`lo` is already a project dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the if/else cascade in evictPodWithRetry with a tagless switch
and move the retry-sleep into the default branch so the four dispatch
outcomes (success, 404, non-429 error, deadline) sit as peers next to
the retry path. Tightens listPodsOnNode with named returns.
PollInterval is an internal field with a single construction site
(run.go) that never set it, leaving two divergent defensive defaults
(2s in drainNode, 10s in waitEKSNodegroupEmpty) to paper over the gap.
Set it explicitly to 2s in run.go and remove both fallbacks; tests
still inject their own short intervals.
Drop the per-manager-type goroutines, mutex and WaitGroup in favour of
a plain loop over targets. The original rationale for parallelism was
asynchronous EKS draining (UpdateNodegroupConfig returning while EKS
worked in the background), but waitEKSNodegroupEmpty now blocks until
the EKS-managed nodes disappear, so every manager already runs
synchronously. Sequential execution yields linear logs, bounded
apiserver pressure, and easier debugging when a target fails.
Realign inline struct literals after recent field-list changes. Adding
fields whose name is longer than any existing one widens the column
gofmt computes for the whole literal; the previous commits introduced
such fields without re-running gofmt, so check_formatting on the CI
pipeline reported diffs in five test files. Pure whitespace, no
semantic change.
@L3n41c
Copy link
Copy Markdown
Member Author

L3n41c commented May 20, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 96d5f50e4f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

NodegroupName: awssdk.String(nodegroupName),
ScalingConfig: &ekstypes.NodegroupScalingConfig{
MinSize: awssdk.Int32(0),
MaxSize: awssdk.Int32(0),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep EKS maxSize above the API minimum

For EKS managed node groups this call will be rejected before any drain starts, because the EKS NodegroupScalingConfig API allows minSize and desiredSize to be 0 but requires maxSize to be at least 1. As a result, any evict-legacy-nodes run that includes an EKS managed node group fails at UpdateNodegroupConfig and never migrates those nodes; scale to zero by setting min/desired to 0 while leaving max at a valid value.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 2f00ddd6.

EKS API documents maxSize with valid range "Minimum value of 1" (NodegroupScalingConfig), so the previous min=max=desired=0 would have failed before any drain started. The fix DescribeNodegroups the target first and preserves the existing MaxSize while still zeroing min/desired (defaults to 1 if the described scaling config is nil). The drain behaviour is unchanged — desired=0 is what triggers it.

if info.Autoscaling.EKSAutoMode.Enabled {
return errors.New("EKS auto-mode is enabled on this cluster; eviction is not applicable (auto-mode manages its own node lifecycle)")
}
if err = clusterinfo.Persist(ctx, cli.K8sClient, namespace, info); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 Badge Avoid writing cluster-info during dry runs

When the command is invoked with --dry-run, this still creates or updates the cluster-info ConfigMap before any dry-run checks, so a preview run mutates the cluster despite the flag text promising to only log actions. This also requires write RBAC for a dry-run that should otherwise be read-only; gate the persist calls on !opts.DryRun or log what would be written instead.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 2f00ddd6.

Both clusterinfo.Persist call sites in Run (the pre-eviction snapshot here on line 72 and the post-eviction re-classify near the end of the function) now skip the ConfigMap write when opts.DryRun is set and log a [dry-run] would persist … line instead. A preview no longer requires write RBAC on the Karpenter namespace and leaves no state behind.

Two fixes triggered by Codex review of 96d5f50:

1. EKS managed node group: the API rejects `maxSize < 1`, so the previous
   `min=max=desired=0` Update would have failed before any drain started.
   DescribeNodegroup the target first, then preserve its current MaxSize
   while still zeroing min/desired (defaulting to 1 if the described
   scaling config is nil). The EKS-side drain behaviour is unchanged —
   desired=0 still triggers it — but the API call now succeeds.

2. Dry-run: both `clusterinfo.Persist` call sites in Run mutated the
   cluster-info ConfigMap unconditionally, before any dry-run gate. A
   preview must not require write RBAC on the Karpenter namespace and
   must not leave behind state; gate both calls on `\!opts.DryRun` and
   log the would-be persistence instead.

Test updates: extend stubEKS with DescribeNodegroup, add cases for
DescribeNodegroup failure (short-circuits Update) and for a nil
ScalingConfig (falls back to max=1).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants