KCP does not remove the etcd member for a machine that failed to join the control plane #13221

@dlipovetsky

Description

What steps did you take and what happened?

  1. Create a cluster with a 3-node control plane.
  2. Create a machine image that is misconfigured, so that the kubelet cannot start.
  3. Update KubeadmControlPlane to use this image.
  4. When kubeadm join --control-plane runs, it adds an etcd member as a learner and tries to promote it. The promotion eventually fails because etcd never starts on the new machine, since the kubelet is not running. The etcd member remains unstarted.
  5. At some point, KCP remediation deletes the machine that failed to join the control plane. However, KCP does not remove the etcd member for this machine, because the machine never joined the cluster and therefore has no NodeRef.
  6. KCP creates a new machine. When kubeadm join --control-plane runs on it, it tries to add a member as a learner, and that fails, because no etcd members can be added while there is an unstarted etcd member.
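
To illustrate the cleanup that steps 5 and 6 suggest is missing, here is a minimal Go sketch of how the leftover member could be identified without a NodeRef. It is not KCP or etcd client code: the Member type and the orphanedMembers function are hypothetical, and the sketch assumes that an unstarted member reports an empty name and that it can be matched to the deleted machine by the host in its peer URL.

```go
package main

import (
	"fmt"
	"net/url"
)

// Member is a hypothetical, simplified stand-in for the per-member data that
// an etcd member list returns. It is not the real etcd client type.
type Member struct {
	ID        uint64
	Name      string // assumed to be empty until the member has started
	PeerURLs  []string
	IsLearner bool
}

// orphanedMembers returns the members that never started (empty Name) and
// whose peer URL host matches an address of a machine that was deleted.
// Matching by peer URL host is an assumption made for this sketch.
func orphanedMembers(members []Member, deletedMachineAddrs []string) []Member {
	addrs := make(map[string]bool, len(deletedMachineAddrs))
	for _, a := range deletedMachineAddrs {
		addrs[a] = true
	}

	var out []Member
	for _, m := range members {
		if m.Name != "" {
			continue // the member has started, so leave it alone
		}
		for _, raw := range m.PeerURLs {
			u, err := url.Parse(raw)
			if err != nil {
				continue // skip peer URLs that cannot be parsed
			}
			if addrs[u.Hostname()] {
				out = append(out, m)
				break // one matching peer URL is enough
			}
		}
	}
	return out
}

func main() {
	// Three healthy members plus the unstarted learner left behind in step 4.
	members := []Member{
		{ID: 1, Name: "cp-0", PeerURLs: []string{"https://10.0.0.1:2380"}},
		{ID: 2, Name: "cp-1", PeerURLs: []string{"https://10.0.0.2:2380"}},
		{ID: 3, Name: "cp-2", PeerURLs: []string{"https://10.0.0.3:2380"}},
		{ID: 4, Name: "", PeerURLs: []string{"https://10.0.0.4:2380"}, IsLearner: true},
	}
	for _, m := range orphanedMembers(members, []string{"10.0.0.4"}) {
		fmt.Printf("would remove member %d (learner=%t)\n", m.ID, m.IsLearner)
	}
}
```

The sketch only selects candidates. An actual fix would still need to issue the member removal against etcd and decide at which point in remediation that is safe.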

What did you expect to happen?

KCP should remove the unstarted etcd member that was added for the machine that failed to join the cluster.

Cluster API version

v1.10.7

Kubernetes version

v1.34.2

Anything else you would like to add?

This issue came out of a discussion in kubernetes/kubeadm#3269 (comment)

Label(s) to be applied

/kind bug
/area provider/control-plane-kubeadm

Metadata

Labels

area/provider/control-plane-kubeadm: Issues or PRs related to KCP
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
