[CASCL-1386] Add evict-legacy-nodes subcommand to drain non-Datadog node groups#3026
[CASCL-1386] Add evict-legacy-nodes subcommand to drain non-Datadog node groups#3026L3n41c wants to merge 18 commits into
Conversation
… node groups
Introduces `kubectl datadog autoscaling cluster evict-legacy-nodes`, which
drains workloads from cluster-autoscaler ASGs, EKS managed node groups,
user-created Karpenter NodePools and standalone EC2 instances onto the
Datadog-managed Karpenter NodePools created by `install`, then scales the
legacy groups to zero.
Highlights:
- Re-classifies the cluster (reuses `clusterinfo.Classify`) and updates the
`dd-cluster-info` ConfigMap before any destructive work.
- Scales the legacy cluster-autoscaler Deployment to 0 replicas (opt-out
via `--skip-cluster-autoscaler`).
- Creates temporary `maxUnavailable: 1` PodDisruptionBudgets for workloads
without one; cleanup is label-based, so a SIGKILL'd run is reaped on the
next invocation.
- Evicts in parallel by manager type:
* ASG: cordon + evict + UpdateAutoScalingGroup(0,0,0) (only when all
nodes drained; avoids AZ-rebalance and MinSize/DesiredCapacity
hazards).
* EKS managed node group: UpdateNodegroupConfig(0,0,0), then waits
for the K8s nodes to disappear so the EKS-side drain finishes
before temp PDBs are removed.
* Karpenter user NodePool: cordon + evict (NodePool spec untouched).
* Standalone EC2: cordon + evict + TerminateInstances.
- Pre-flight refuses when no Datadog-managed NodePool exists; warns on
user NodePool weight conflicts.
- Every K8s read-modify-write wrapped in retry.RetryOnConflict.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🛑 Gate Violations
ℹ️ Info🎯 Code Coverage (details) Useful? React with 👍 / 👎 This comment will be updated automatically if new data arrives.🔗 Commit SHA: 2f00ddd | Docs | Datadog PR Page | Give us feedback! |
Trailing alignment / extra blank line cleaned up by `gofmt -s` and `golangci-lint --fix` on three of the evict-package test files. No behavior change; restores `git diff --exit-code` after `make fmt` for the `check_formatting` CI gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three small test files to lift `evict/` package coverage from 55% to 66% so the patch-coverage gate passes: - `preflight_test.go`: covers `warnKarpenterWeightConflicts` for no Karpenter targets, no conflict, equal-weight conflict, nil-weight defaulting to 0, and an unknown user-NodePool name. - `prompt_test.go`: covers `printPlan` rendering of each section (CA, PDB, per-manager evictions, dry-run skips) and `promptConfirmation`'s y/N/yes parsing. - `pdb_test.go`: covers `uniqueNodes` excluding EKS MNG entries and deduplicating shared node names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7ee9656a7e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if p.DeletionTimestamp != nil { | ||
| return true |
There was a problem hiding this comment.
Wait for terminating pods before terminating nodes
When a pod has a DeletionTimestamp, treating it as skipped makes waitForNodeEmpty report success immediately after the eviction is accepted even though the container may still be running its termination grace period or preStop hook. For ASG and standalone targets, the caller then scales/terminates the underlying EC2 instance, so pods with long graceful shutdowns can be killed before they actually exit; terminating pods should still count as remaining until they disappear when the node is about to be terminated.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in 09902c5.
Split shouldSkipPod into two predicates:
shouldSkipEviction(drainNode): keeps the previous semantics — skip Evict() for DS / mirror / terminating / completed pods.podOccupiesNode(waitForNodeEmpty): a terminating pod still occupies the node until it is actually gone, so the function keeps blocking pastDeletionTimestampset. DS / mirror / completed pods don't count.
Locked in by the new TestPodOccupiesNode table test.
| if t.Manager == clusterinfo.NodeManagerEKSManagedNodeGroup { | ||
| continue |
There was a problem hiding this comment.
Include EKS MNG nodes when creating temporary PDBs
With an EKS managed node group target, excluding those nodes means --ensure-pdbs creates no temporary PDB for workloads whose replicas are only on that node group, even though evictEKSManagedNodeGroup immediately sets the group to 0 and relies on EKS evictions. In that scenario EKS has no PDB to respect and can disrupt all replicas of an otherwise unprotected workload at once, while the plan still tells the operator that temporary PDBs will be created for workloads without one.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Right, the rationale shifted under me. Fixed in 09902c5 by removing the EKS MNG filter in uniqueNodes.
The original exclusion was added when evictEKSManagedNodeGroup was fire-and-forget — back then the temp PDB cleanup ran while EKS was still draining. After round 4 of the pre-commit review the orchestrator now blocks on waitEKSNodegroupEmpty before reaching cleanup, so EKS observes any temp PDB we create for the duration of its drain. Excluding the MNG entries left workloads whose every replica lived on the node group without protection — exactly the scenario you described.
TestUniqueNodes_IncludesAllManagerTypes flips the previous regression to lock the new behavior in.
P1: `shouldSkipPod` collapsed two distinct concerns — "do not call Evict() on this pod" (drainNode) and "this pod no longer occupies the node" (waitForNodeEmpty). A pod with DeletionTimestamp set was treated as absent, so for ASG/standalone targets the orchestrator could terminate the EC2 instance before the container finished its grace period. Splits the predicate: - `shouldSkipEviction`: skip the Eviction call for DS / mirror / terminating / completed pods. - `podOccupiesNode`: a terminating pod STILL occupies the node and must keep `waitForNodeEmpty` blocking. DS / mirror / completed pods don't. P2: `uniqueNodes` no longer filters out EKS managed node group entries when discovering controllers for temporary PDBs. The orchestrator now blocks on `waitEKSNodegroupEmpty` before cleaning up the PDBs, so EKS observes them throughout its drain. Excluding MNG nodes left a workload whose every replica lived on a single MNG without any PDB protection; EKS could then disrupt all replicas at once. Test coverage: - New `TestPodOccupiesNode` locks in the "terminating pods still occupy the node" semantics. - `TestShouldSkipPod` renamed to `TestShouldSkipEviction`. - `TestUniqueNodes_ExcludesEKSMNG` renamed to `TestUniqueNodes_IncludesAllManagerTypes` and inverted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…can dedup `ExtractEC2InstanceID` and `LabelEKSNodegroup` were introduced in `common/clusterinfo` so the evict package could reuse them, but two more copies of the same `aws:///[^/]+/(i-[0-9a-f]+)$` regex remained inline: in `common/clusterinfo/classify.go:197` and `common/karpenter/fromnodes.go:65`. `common/clusterinfo` already imports `common/karpenter` (for the Karpenter detection helpers), so karpenter can't reach back into clusterinfo without producing an import cycle. Move the helper to a neutral home — `common/aws/node.go` — that both can import. The unit test moves with the function. Net result: one regex definition, four call sites, three packages dedup'd. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing tests under `cmd/kubectl-datadog/autoscaling/cluster/` (e.g.
`install/install_test.go:TestValidate`, `apply/install_mode_test.go`,
`apply/inference_method_test.go`, `apply/create_karpenter_resources_test.go`)
all declare the table inline inside `for _, tc := range []struct{ … }{ … } {`
rather than via an intermediate `tests :=` variable.
Aligns four newly-added tests with that convention:
- `common/aws/node_test.go: TestExtractEC2InstanceID`
- `evict/evict_pods_test.go: TestShouldSkipEviction`, `TestPodOccupiesNode`
- `evict/plan_test.go: TestParseTargetSpec`
- `evict/pdb_test.go: TestTempPDBName`
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`evictASG` extracted the EC2 instance ID and warned on an unexpected providerID, but discarded the ID immediately afterwards. The check was load-bearing in an earlier revision that terminated instances per-instance via `TerminateInstanceInAutoScalingGroup`; after the round-2 Codex review replaced that loop with a single `UpdateAutoScalingGroup(0,0,0)`, the ID is no longer used here and the warning misleadingly suggested otherwise. The `standalone` and `karpenter/fromnodes` call sites still use the extracted ID, so the helper stays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tASG` The five `TestEvictASG_*` functions shared the same setup (fake clientset + stubAutoscaling), the same call (`evictASG(ctx, client, stub, …)`) and the same axes of assertion (`err`, `stub.scaledASGs`, optionally `Spec.Unschedulable`). They only differed on the initial K8s objects, whether an Eviction reactor was wired, and the drain timeouts. Collapsing them into one `TestEvictASG` with an inline case slice keeps the matrix of (objects × reactor × dry-run × wantErr × wantScaledASGs × wantUnschedulable) visible at a glance, aligns with the convention used in `install/install_test.go`, `apply/install_mode_test.go`, etc., and makes it cheap to add a new scenario. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same shape as the earlier `TestEvictASG` consolidation, applied to the rest of the package. Tests that exercised the same target function with different inputs are now grouped into one `Test<Function>` with an inline case slice; one test per *function* rather than per *scenario* collapses ~65 cousin functions down to ~25. - `clusterautoscaler_test.go`: 5 → 1 (`TestScaleDownClusterAutoscaler`) - `cordon_test.go`: 4 → 1 (`TestCordonNode`) - `eks_mng_test.go`: 5 → 1 (`TestEvictEKSManagedNodeGroup`) - `evict_pods_test.go`: 5 EvictPodWithRetry tests → 1 (`TestEvictPodWithRetry`) with a typed `evictionResponder` closure capturing each scenario's reactor - `karpenter_user_test.go`: 2 → 1 (`TestEvictKarpenterUserNodePool`) - `standalone_test.go`: 5 → 1 (`TestEvictStandalone`) - `preflight_test.go`: 5 conflict tests → 1 (`TestWarnKarpenterWeightConflicts`) - `prompt_test.go`: 4 printPlan + 2 promptConfirmation → 2 (`TestPrintPlan`, `TestPromptConfirmation`) - `run_test.go`: 5 → 1 (`TestEvictAllTargetsParallel`) - `plan_test.go`: 3 BuildPlan tests + the inline-subtests in `TestHasDatadogManagedNodePool` → `TestBuildPlan` + table-driven `TestHasDatadogManagedNodePool` - `pdb_test.go`: per-helper consolidations (`TestUniqueNodes`, `TestIsTemporaryPDB`, `TestHasUserPDB`, `TestCleanupTempPDBs`, `TestCreateTempPDB`); `TestTempPDBName` was already table-driven; `TestDiscoverControllers_FiltersByNodeSet` left as a single scenario since the fixtures don't vary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the hand-rolled for/return-true loop with the stdlib `slices.ContainsFunc` (Go 1.21+). Same semantics, half the lines, no extra dependency — `slices` is already imported elsewhere in the plugin. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e boolean expression
Both functions were cascades of `if cond { return true/false }` short-circuiting
the same way `||` and `&&` already do. The two predicates collapse to a single
return line each, which makes the asymmetry between the two — `||` includes
`p.DeletionTimestamp \!= nil`, `&&` doesn't — visible at a glance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the hand-rolled accumulator loop in `waitForNodeEmpty` with `lo.CountBy`. Same semantics, fewer lines, intent visible at the function name. The stdlib `slices` package has `ContainsFunc` / `IndexFunc` but no `CountFunc`, so `lo.CountBy` is the right call — `lo` is already a project dependency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the if/else cascade in evictPodWithRetry with a tagless switch and move the retry-sleep into the default branch so the four dispatch outcomes (success, 404, non-429 error, deadline) sit as peers next to the retry path. Tightens listPodsOnNode with named returns.
PollInterval is an internal field with a single construction site (run.go) that never set it, leaving two divergent defensive defaults (2s in drainNode, 10s in waitEKSNodegroupEmpty) to paper over the gap. Set it explicitly to 2s in run.go and remove both fallbacks; tests still inject their own short intervals.
Drop the per-manager-type goroutines, mutex and WaitGroup in favour of a plain loop over targets. The original rationale for parallelism was asynchronous EKS draining (UpdateNodegroupConfig returning while EKS worked in the background), but waitEKSNodegroupEmpty now blocks until the EKS-managed nodes disappear, so every manager already runs synchronously. Sequential execution yields linear logs, bounded apiserver pressure, and easier debugging when a target fails.
Realign inline struct literals after recent field-list changes. Adding fields whose name is longer than any existing one widens the column gofmt computes for the whole literal; the previous commits introduced such fields without re-running gofmt, so check_formatting on the CI pipeline reported diffs in five test files. Pure whitespace, no semantic change.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 96d5f50e4f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| NodegroupName: awssdk.String(nodegroupName), | ||
| ScalingConfig: &ekstypes.NodegroupScalingConfig{ | ||
| MinSize: awssdk.Int32(0), | ||
| MaxSize: awssdk.Int32(0), |
There was a problem hiding this comment.
Keep EKS maxSize above the API minimum
For EKS managed node groups this call will be rejected before any drain starts, because the EKS NodegroupScalingConfig API allows minSize and desiredSize to be 0 but requires maxSize to be at least 1. As a result, any evict-legacy-nodes run that includes an EKS managed node group fails at UpdateNodegroupConfig and never migrates those nodes; scale to zero by setting min/desired to 0 while leaving max at a valid value.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in 2f00ddd6.
EKS API documents maxSize with valid range "Minimum value of 1" (NodegroupScalingConfig), so the previous min=max=desired=0 would have failed before any drain started. The fix DescribeNodegroups the target first and preserves the existing MaxSize while still zeroing min/desired (defaults to 1 if the described scaling config is nil). The drain behaviour is unchanged — desired=0 is what triggers it.
| if info.Autoscaling.EKSAutoMode.Enabled { | ||
| return errors.New("EKS auto-mode is enabled on this cluster; eviction is not applicable (auto-mode manages its own node lifecycle)") | ||
| } | ||
| if err = clusterinfo.Persist(ctx, cli.K8sClient, namespace, info); err != nil { |
There was a problem hiding this comment.
Avoid writing cluster-info during dry runs
When the command is invoked with --dry-run, this still creates or updates the cluster-info ConfigMap before any dry-run checks, so a preview run mutates the cluster despite the flag text promising to only log actions. This also requires write RBAC for a dry-run that should otherwise be read-only; gate the persist calls on !opts.DryRun or log what would be written instead.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in 2f00ddd6.
Both clusterinfo.Persist call sites in Run (the pre-eviction snapshot here on line 72 and the post-eviction re-classify near the end of the function) now skip the ConfigMap write when opts.DryRun is set and log a [dry-run] would persist … line instead. A preview no longer requires write RBAC on the Karpenter namespace and leaves no state behind.
Two fixes triggered by Codex review of 96d5f50: 1. EKS managed node group: the API rejects `maxSize < 1`, so the previous `min=max=desired=0` Update would have failed before any drain started. DescribeNodegroup the target first, then preserve its current MaxSize while still zeroing min/desired (defaulting to 1 if the described scaling config is nil). The EKS-side drain behaviour is unchanged — desired=0 still triggers it — but the API call now succeeds. 2. Dry-run: both `clusterinfo.Persist` call sites in Run mutated the cluster-info ConfigMap unconditionally, before any dry-run gate. A preview must not require write RBAC on the Karpenter namespace and must not leave behind state; gate both calls on `\!opts.DryRun` and log the would-be persistence instead. Test updates: extend stubEKS with DescribeNodegroup, add cases for DescribeNodegroup failure (short-circuits Update) and for a nil ScalingConfig (falls back to max=1).
Summary
Adds
kubectl datadog autoscaling cluster evict-legacy-nodes, which migrates workloads off non-Datadog-managed node groups (cluster-autoscaler ASGs, EKS managed node groups, user-created Karpenter NodePools, standalone EC2 instances) onto the Datadog-managed Karpenter NodePools created byinstall, then scales the legacy groups to zero.Jira: CASCL-1386
Builds on: CASCL-1304
Highlights
clusterinfo.Classifyand updates thedd-cluster-infoConfigMap before any destructive work.--skip-cluster-autoscaler).maxUnavailable: 1PodDisruptionBudgets for workloads without one. Cleanup is label-based (app.kubernetes.io/managed-by=kubectl-datadog+autoscaling.datadoghq.com/temporary-pdb=true), so a SIGKILL'd run is reaped by the next invocation.UpdateAutoScalingGroup(0,0,0)(only when all nodes drained — avoids AZ-rebalance andMinSize > DesiredCapacityhazards).UpdateNodegroupConfig(0,0,0), then waits for the K8s nodes carryingeks.amazonaws.com/nodegroup=<name>to disappear so the EKS-side drain finishes before temp PDBs are removed (sentinel error `errEKSDrainIncomplete`).TerminateInstances.retry.RetryOnConflict.Out of scope (by design)
Test plan
-race(go test -race ./cmd/kubectl-datadog/autoscaling/cluster/evict/...).make lint— 0 issues.make kubectl-datadogbuilds.🤖 Generated with Claude Code