Description
What happened:
When I created an MPIJob in runLauncherAsWorker mode, Kueue failed to assign a topology to the Pods based on the rank ordering (training.kubeflow.org/replica-index), as the following kueue-controller-manager logs show:
{"level":"error","ts":"2026-01-08T14:45:32.36847947Z","caller":"tas/topology_ungater.go:415","msg":"failed to read rank information from Pods","controller":"tas_topology_ungater","namespace":"default","name":"mpijob-pi-8e2f6","reconcileID":"8d2976c6-f349-4783-a4ec-780f6435cca7","error":"incorrect label value \"2\" for Pod \"default/pi-worker-1\": validation error: value should be less than 2","stacktrace":"sigs.k8s.io/kueue/pkg/controller/tas.readRanksIfAvailable\n\t/workspace/pkg/controller/tas/topology_ungater.go:415\nsigs.k8s.io/kueue/pkg/controller/tas.assignGatedPodsToDomains\n\t/workspace/pkg/controller/tas/topology_ungater.go:330\nsigs.k8s.io/kueue/pkg/controller/tas.(*topologyUngater).Reconcile\n\t/workspace/pkg/controller/tas/topology_ungater.go:222\nsigs.k8s.io/kueue/pkg/controller/core.(*leaderAwareReconciler).Reconcile\n\t/workspace/pkg/controller/core/leader_aware_reconciler.go:77\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:461\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:421\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func1.1\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:296"}
The root cause is that TopologyUngater expects the max rank index to be less than the number of ranks, as enforced in
Lines 106 to 115 in e355ea8
func readUIntFromStringBelowBound(value string, bound int) (*int, error) {
	uintValue, err := strconv.ParseUint(value, 10, 0)
	if err != nil {
		return nil, fmt.Errorf("%w: %s", ErrInvalidUInt, err.Error())
	}
	if uintValue >= uint64(bound) {
		return nil, fmt.Errorf("%w: value should be less than %d", ErrValidation, bound)
	}
	return ptr.To(int(uintValue)), nil
}
However, with runLauncherAsWorker, the MPIJob worker replica index starts from 1 (training.kubeflow.org/replica-index: 1) instead of 0, because index 0 (training.kubeflow.org/replica-index: 0) is taken by the launcher replica.
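The mismatch can be reproduced in isolation with a small sketch. This is a simplified re-implementation of the helper excerpted above (error wrapping and the ptr helper are dropped), not Kueue's actual code: with two workers shifted to indices 1 and 2 by runLauncherAsWorker, a bound of 2 (the worker replica count) rejects the highest rank.

```go
package main

import (
	"fmt"
	"strconv"
)

// Simplified sketch of Kueue's readUIntFromStringBelowBound: parse a
// non-negative integer label value and require it to be strictly below bound.
func readUIntFromStringBelowBound(value string, bound int) (int, error) {
	uintValue, err := strconv.ParseUint(value, 10, 0)
	if err != nil {
		return 0, fmt.Errorf("invalid uint: %s", err)
	}
	if uintValue >= uint64(bound) {
		return 0, fmt.Errorf("validation error: value should be less than %d", bound)
	}
	return int(uintValue), nil
}

func main() {
	// With runLauncherAsWorker, two workers carry replica-index "1" and "2",
	// but the ungater passes bound = worker replica count = 2, so the
	// highest rank fails validation exactly as in the log above.
	for _, idx := range []string{"1", "2"} {
		if _, err := readUIntFromStringBelowBound(idx, 2); err != nil {
			fmt.Printf("index %s: %v\n", idx, err)
		} else {
			fmt.Printf("index %s: ok\n", idx)
		}
	}
}
```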
What you expected to happen:
TAS succeeds in assigning topologies to Pods based on the rank ordering (training.kubeflow.org/replica-index).
How to reproduce it (as minimally and precisely as possible):
The following is a step-by-step reproducible flow:
- Set up the cluster, Kueue, and the MPI Operator
$ kind create cluster
$ kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.15.2/manifests.yaml
$ cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta2
kind: Topology
metadata:
name: "default"
spec:
levels:
- nodeLabel: "kubernetes.io/hostname"
---
kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta2
metadata:
name: "tas-flavor"
spec:
nodeLabels:
kubernetes.io/os: linux
topologyName: "default"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
name: "tas-cluster-queue"
spec:
namespaceSelector: {} # match all.
resourceGroups:
- coveredResources: ["cpu", "memory"]
flavors:
- name: "tas-flavor"
resources:
- name: "cpu"
nominalQuota: 100
- name: "memory"
nominalQuota: 100Gi
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
namespace: "default"
name: "tas-user-queue"
spec:
clusterQueue: "tas-cluster-queue"
EOF
$ kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.7.0/deploy/v2beta1/mpi-operator.yaml
- Create an MPIJob with runLauncherAsWorker
$ cat <<EOF | kubectl apply -f -
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
name: pi
labels:
kueue.x-k8s.io/queue-name: "tas-user-queue"
spec:
slotsPerWorker: 1
runLauncherAsWorker: true
runPolicy:
cleanPodPolicy: Running
ttlSecondsAfterFinished: 60
sshAuthMountPath: /home/mpiuser/.ssh
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
spec:
containers:
- image: mpioperator/mpi-pi:openmpi
name: mpi-launcher
securityContext:
runAsUser: 1000
command:
- mpirun
args:
- -n
- "2"
- /home/mpiuser/pi
resources:
limits:
cpu: 1
memory: 1Gi
Worker:
replicas: 2
template:
spec:
containers:
- image: mpioperator/mpi-pi:openmpi
name: mpi-worker
securityContext:
runAsUser: 1000
command:
- /usr/sbin/sshd
args:
- -De
- -f
- /home/mpiuser/.sshd_config
resources:
limits:
cpu: 1
memory: 1Gi
EOF
Anything else we need to know?:
The created Pods are the following:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2026-01-08T14:46:41Z"
finalizers:
- batch.kubernetes.io/job-tracking
generateName: pi-launcher-
labels:
batch.kubernetes.io/controller-uid: 23c06567-c814-4e70-8679-69353e459c23
batch.kubernetes.io/job-name: pi-launcher
controller-uid: 23c06567-c814-4e70-8679-69353e459c23
job-name: pi-launcher
training.kubeflow.org/job-name: pi
training.kubeflow.org/job-role: launcher
training.kubeflow.org/operator-name: mpi-operator
training.kubeflow.org/replica-index: "0"
name: pi-launcher-tntsp
namespace: default
---
apiVersion: v1
kind: Pod
metadata:
annotations:
kueue.x-k8s.io/podset-unconstrained-topology: "true"
kueue.x-k8s.io/workload: mpijob-pi-1584c
creationTimestamp: "2026-01-08T14:46:41Z"
labels:
kueue.x-k8s.io/podset: worker
training.kubeflow.org/job-name: pi
training.kubeflow.org/job-role: worker
training.kubeflow.org/operator-name: mpi-operator
training.kubeflow.org/replica-index: "1"
name: pi-worker-0
  namespace: default
Environment:
- Kubernetes version (use kubectl version):
- Kueue version (use git describe --tags --dirty --always):
- Cloud provider or hardware configuration:
- OS (e.g: cat /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others: