Skip to content

[slice] Evict if Slice is DEGRADED (and not tolerated)#1123

Open
pajakd wants to merge 8 commits intoAI-Hypercomputer:slice-mainfrom
pajakd:evict_if_not_tolerated
Open

[slice] Evict if Slice is DEGRADED (and not tolerated)#1123
pajakd wants to merge 8 commits intoAI-Hypercomputer:slice-mainfrom
pajakd:evict_if_not_tolerated

Conversation

@pajakd
Copy link
Copy Markdown
Collaborator

@pajakd pajakd commented Mar 13, 2026

Description

Evict the workload, when a slice enters DEGRADED state, while the workload requested only HEALTHY slices.

Issue

Testing

@pajakd pajakd force-pushed the evict_if_not_tolerated branch from 6c45282 to 73057ad Compare March 13, 2026 14:58
Comment on lines +144 to +155
if expr.Operator == corev1.NodeSelectorOpIn {
valueAllowedInExpr := false
for _, val := range expr.Values {
if val == value {
valueAllowedInExpr = true
break
}
}
if !valueAllowedInExpr {
termAllowsValue = false
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also handle Operator: corev1.NodeSelectorOpNotIn, Values: ["Degraded"]

  • isn't there some predefined method for checking this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, all the existing functions (like this one https://github.com/kubernetes/component-helpers/blob/master/scheduling/corev1/nodeaffinity/nodeaffinity.go#L321-L323C31) require node as argument. We actually use the existing func in Kueue https://github.com/kubernetes-sigs/kueue/blob/63d93ada52f1a9fa62deea59b65683fc9b6d99ea/pkg/cache/scheduler/tas_flavor_snapshot.go#L1567-L1569 but we don't have nodes in our context, we want to check if a given pods set allows DEGRADED slices or not. So, I think applying the existing one would actually introduce more complexity.

termAllowsValue := true
for _, expr := range term.MatchExpressions {
if expr.Key == key {
if expr.Operator == corev1.NodeSelectorOpIn {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic only evaluates corev1.NodeSelectorOpIn. If a user explicitly specifies NotIn: [Degraded], this function will skip evaluating that expression, and it will
erroneously return true (meaning Degraded is allowed). We should add support for corev1.NodeSelectorOpNotIn to ensure inverse conditions are respected.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

return false
}
for _, term := range spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms {
if len(term.MatchExpressions) > 0 {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about MatchingFields?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The health of the cube is exposed only via label (cloud.google.com/gke-tpu-partition-4x4x4-state).

Since MatchFields is used to match against some fields of the nodes (not labels) its seem to be irrelevant in this context. So I'm just completely ignoring it.
(see eg https://github.com/kubernetes/kubernetes/blob/16c5a6be07bb2eab06b6e7732eb1cb759a7fe61c/pkg/apis/core/types.go#L3365-L3370) Let me know if I'm wrong.

@j-skiba
Copy link
Copy Markdown

j-skiba commented Mar 27, 2026

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants