
KEP-5278: Specify to keep NNN in case of error during binding cycle #5771

Closed
brejman wants to merge 1 commit into kubernetes:master from brejman:issue-135771-kep-update


Conversation

@brejman commented Jan 7, 2026

  • One-line PR description: Specifying behavior for NNN on binding cycle error
  • Other comments:

Right now, the implementation intends to clear NNN on a binding error: https://github.com/kubernetes/kubernetes/blob/21f7c3ff68b388b565331724d5a46920854431a8/pkg/scheduler/schedule_one.go#L392.
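
For context, the clearing goes through the scheduler framework's NominatingInfo type. A condensed sketch of the pattern at the linked line (paraphrased from schedule_one.go, not a verbatim copy):

```go
// Condensed sketch of the clearing pattern around the linked line; paraphrased,
// not verbatim k/k code. clearNominatedNode tells the failure handler to
// override the pod's NominatedNodeName with "", i.e. to clear NNN.
var clearNominatedNode = &framework.NominatingInfo{
	NominatingMode:    framework.ModeOverride,
	NominatedNodeName: "",
}

func (sched *Scheduler) handleBindingCycleError(ctx context.Context, state *framework.CycleState,
	fwk framework.Framework, podInfo *framework.QueuedPodInfo, start time.Time,
	scheduleResult ScheduleResult, status *framework.Status) {
	// ... cache cleanup and requeue bookkeeping elided ...
	sched.FailureHandler(ctx, fwk, podInfo, status, clearNominatedNode, start)
}
```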

There are two problems with this:

  1. There's a bug that causes NNN to be cleared only on the 2nd binding failure in a row (a sketch of one plausible mechanism follows this list)
  2. In case PreBind succeeded, the pod may be "locked into" a node, and clearing NNN may let another pod take its place, e.g. due to a dynamically provisioned PV (#135771). Side note: even with NNN set, a higher-priority pod may take its place, so it's not completely robust
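
On the first problem, one plausible mechanism (my reading, stated as an assumption rather than a verified root cause): the status patch is skipped whenever the scheduler's in-memory pod snapshot already appears to match the cleared value, and with KEP-5278 that snapshot predates the point where NNN was written during the binding cycle. A hypothetical sketch:

```go
// Hypothetical illustration of the "clears only on the 2nd failure" behavior;
// the helper name and guard are assumptions loosely mirroring the scheduler's
// status-update path, not the actual k/k code.
func updatePodStatus(pod *v1.Pod, nominatingInfo *framework.NominatingInfo) error {
	// pod is the snapshot taken at the start of the scheduling cycle. NNN is
	// only written to the API server later, during the binding cycle, so this
	// snapshot can still have NominatedNodeName == "".
	nnnNeedsUpdate := nominatingInfo.Mode() == framework.ModeOverride &&
		pod.Status.NominatedNodeName != nominatingInfo.NominatedNodeName
	if !nnnNeedsUpdate {
		// Clearing uses NominatedNodeName == "". Against a stale "" snapshot
		// the guard sees no change, so the first clear is silently skipped;
		// only the retry, which re-reads the pod with NNN set, clears it.
		return nil
	}
	// ... patch pod.Status.NominatedNodeName through the API server ...
	return nil
}
```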

The KEP is a bit unclear on the intended behavior in this case. It mentions:

> scheduler clears the NominatedNodeName field at the end of failed scheduling cycle if it found the nominated node unschedulable for the pod

There's a comment in the code that suggests a pod that fails to bind isn't considered unschedulable: https://github.com/kubernetes/kubernetes/blob/21f7c3ff68b388b565331724d5a46920854431a8/pkg/scheduler/schedule_one.go#L318. I wasn't able to find any docs that mention this, though.

I suggest not clearing NNN if there was an error during the binding cycle, and updating the KEP wording to make this explicit. This is based on the assumption that a binding cycle failure is either

  • transient or
  • consistent for all pods or
  • a bug in the scheduler code or one of the plugins

Even without NNN, the scheduler may keep picking the same node. The main change is that the pod will continue to take up space that lower-priority pods could otherwise use, which shouldn't matter if the assumption above holds.
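
Concretely, the suggestion would amount to passing a no-op NominatingInfo instead of the clearing one on binding-cycle errors. A minimal sketch, assuming the current handler shape (keepNominatedNode is a hypothetical name):

```go
// Minimal sketch of the suggested change, under the assumption that the
// binding-failure path keeps its current shape. ModeNoop leaves the pod's
// existing NominatedNodeName untouched instead of overriding it with "".
var keepNominatedNode = &framework.NominatingInfo{
	NominatingMode: framework.ModeNoop,
}

// Inside handleBindingCycleError, instead of:
//     sched.FailureHandler(ctx, fwk, podInfo, status, clearNominatedNode, start)
// pass:
//     sched.FailureHandler(ctx, fwk, podInfo, status, keepNominatedNode, start)
```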

@k8s-ci-robot (Contributor)

Welcome @brejman!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @brejman. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Jan 7, 2026
@k8s-ci-robot requested a review from dom4ha on Jan 7, 2026 at 14:45
@k8s-ci-robot added the kind/kep label (categorizes KEP tracking issues and PRs modifying the KEP directory) on Jan 7, 2026
@k8s-ci-robot requested a review from macsko on Jan 7, 2026 at 14:45

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: brejman
Once this PR has been reviewed and has the lgtm label, please assign dom4ha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the sig/scheduling label (categorizes an issue or PR as relevant to SIG Scheduling) on Jan 7, 2026
@k8s-ci-robot added the size/XS (denotes a PR that changes 0-9 lines, ignoring generated files) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels on Jan 7, 2026

@brejman (Author) commented Jan 7, 2026

/cc @ania-borowiec @sanposhiho

@sanposhiho (Member)

> There's a bug that causes it to only clear on the 2nd binding failure in a row

Why not just fix it?

> In case PreBind succeeded, the pod may be "locked into" a node, and clearing NNN may let another pod take its place, e.g. due to dynamically provisioned PV (#135771). Side note: even with NNN, a higher priority pod may take its place, so it's not completely robust

In that case, a plugin should implement Unreserve to unlock.
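
For reference, Unreserve is the second half of the scheduler framework's ReservePlugin interface. A minimal sketch of a plugin that undoes its side effect there (ExamplePlugin and its claim/release helpers are hypothetical; the method signatures follow the framework interface):

```go
package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// ExamplePlugin is a hypothetical plugin that claims some external resource
// in Reserve and releases it in Unreserve.
type ExamplePlugin struct{}

func (p *ExamplePlugin) Name() string { return "ExamplePlugin" }

func (p *ExamplePlugin) Reserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {
	if err := p.claim(pod, nodeName); err != nil {
		return framework.AsStatus(err)
	}
	return nil
}

// Unreserve runs when a later phase (Permit, PreBind, Bind) fails, and is the
// hook where a plugin should undo external side effects so the pod is not
// left "locked into" the node.
func (p *ExamplePlugin) Unreserve(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) {
	p.release(pod, nodeName)
}

// claim and release stand in for whatever external locking the plugin does.
func (p *ExamplePlugin) claim(pod *v1.Pod, nodeName string) error { return nil }
func (p *ExamplePlugin) release(pod *v1.Pod, nodeName string)     {}
```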

@brejman (Author) commented Jan 8, 2026

> Why not just fix it?

Fixing that bug would make the other issue more frequent, so we need to agree on how to address that one first.

> In that case, a plugin should implement Unreserve to unlock.

Right now, the volumebinding plugin only clears the assumed (in-memory) state in Unreserve. We could change the implementation to also unbind the PV from the PVC, but I have a couple of concerns (a sketch of what unbinding would involve follows this list):

  1. Before unbinding, we would probably need to make sure that only a single pod is using the PVC, to avoid pulling the PV out from under another pod; that check could be susceptible to race conditions
  2. We probably wouldn't want to delete the PV, to avoid data loss, so it would continue taking up storage. I suspect the same can already happen with preemption, but that case is a bit different because the PVC would remain bound (?)
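
"Unbinding the PV from the PVC" would roughly mean reverting the PVC to its pre-bind state. A hedged sketch of the idea using client-go (treat the exact revert steps as an assumption; the annotations are the standard binding annotations):

```go
package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// unbindPVC is a hypothetical illustration of the idea under discussion, not
// an implemented or endorsed approach; it deliberately ignores the
// single-user check and race conditions from concern 1.
func unbindPVC(ctx context.Context, client kubernetes.Interface, pvc *v1.PersistentVolumeClaim) error {
	pvc = pvc.DeepCopy()
	// Revert the binding: drop the bound volume and the binding annotations.
	// Note: the API server treats most of a PVC's spec as immutable once set,
	// so a plain update like this may well be rejected, which is part of why
	// unbinding is nontrivial.
	pvc.Spec.VolumeName = ""
	delete(pvc.Annotations, "pv.kubernetes.io/bind-completed")
	delete(pvc.Annotations, "pv.kubernetes.io/bound-by-controller")
	_, err := client.CoreV1().PersistentVolumeClaims(pvc.Namespace).Update(ctx, pvc, metav1.UpdateOptions{})
	// The PV itself is left intact (concern 2): deleting it risks data loss,
	// so the provisioned storage keeps existing.
	return err
}
```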

@sanposhiho (Member)

Can you open an issue on the kubernetes/kubernetes repo with all of your arguments, and tag @/kubernetes/sig-scheduling-leads and @/kubernetes/sig-storage-misc there?
Currently KEP-5278 doesn't change the previous behavior, which is that binding cycle failures could result in pods being scheduled elsewhere in subsequent scheduling cycles. So I don't think it's appropriate to discuss this as part of this KEP, here in the k/enhancements repo.

@brejman (Author) commented Jan 8, 2026

Let's move the discussion to kubernetes/kubernetes#135771
