TAS: TopologyUngater can not recognize rank-based ordering for MPIJob with runLauncherAsWorker #8471

@tenzen-y

Description

What happened:
When I created an MPIJob with runLauncherAsWorker enabled, Kueue failed to assign topology to Pods based on the rank ordering (training.kubeflow.org/replica-index), as shown in the following kueue-controller-manager log:

"level":"error","ts":"2026-01-08T14:45:32.36847947Z","caller":"tas/topology_ungater.go:415","msg":"failed to read rank information from Pods","controller":"tas_topology_ungater","namespace":"default","name":"mpijob-pi-8e2f6","reconcileID":"8d2976c6-f349-4783-a4ec-780f6435cca7","error":"incorrect label value \"2\" for Pod \"default/pi-worker-1\": validation error: value should be less than 2","stacktrace":"sigs.k8s.io/kueue/pkg/controller/tas.readRanksIfAvailable\n\t/workspace/pkg/controller/tas/topology_ungater.go:415\nsigs.k8s.io/kueue/pkg/controller/tas.assignGatedPodsToDomains\n\t/workspace/pkg/controller/tas/topology_ungater.go:330\nsigs.k8s.io/kueue/pkg/controller/tas.(*topologyUngater).Reconcile\n\t/workspace/pkg/controller/tas/topology_ungater.go:222\nsigs.k8s.io/kueue/pkg/controller/core.(*leaderAwareReconciler).Reconcile\n\t/workspace/pkg/controller/core/leader_aware_reconciler.go:77\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:461\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:421\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func1.1\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:296"}

The root cause is that TopologyUngater expects every rank index to be strictly less than the number of ranks, as enforced in

kueue/pkg/util/pod/pod.go

Lines 106 to 115 in e355ea8

func readUIntFromStringBelowBound(value string, bound int) (*int, error) {
	uintValue, err := strconv.ParseUint(value, 10, 0)
	if err != nil {
		return nil, fmt.Errorf("%w: %s", ErrInvalidUInt, err.Error())
	}
	if uintValue >= uint64(bound) {
		return nil, fmt.Errorf("%w: value should be less than %d", ErrValidation, bound)
	}
	return ptr.To(int(uintValue)), nil
}

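However, with runLauncherAsWorker, the MPIJob worker replica indices start from 1 (training.kubeflow.org/replica-index: 1) instead of 0, because index 0 (training.kubeflow.org/replica-index: 0) belongs to the launcher replica. With 2 worker replicas the worker indices are 1 and 2, so the Pod labeled "2" fails the "less than 2" check shown in the log above.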

What you expected to happen:
TAS succeeds in assigning topology to Pods based on rank ordering (training.kubeflow.org/replica-index).
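To illustrate the expected outcome, here is a rough sketch of one way 1-based worker indices could be mapped to dense ranks. normalizeRanks is a hypothetical helper invented for this example; it is not part of Kueue and is not a proposal for how the fix should be implemented.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// normalizeRanks is a hypothetical helper (not part of Kueue). It maps raw
// replica-index label values to dense ranks 0..n-1 by subtracting the
// smallest index, and fails if the shifted indices are not unique values
// within that range.
func normalizeRanks(labels map[string]string) (map[string]int, error) {
	raw := make(map[string]int, len(labels))
	minIdx := -1
	for pod, value := range labels {
		idx, err := strconv.Atoi(value)
		if err != nil || idx < 0 {
			return nil, fmt.Errorf("invalid index %q for Pod %q", value, pod)
		}
		raw[pod] = idx
		if minIdx < 0 || idx < minIdx {
			minIdx = idx
		}
	}

	ranks := make(map[string]int, len(raw))
	seen := make(map[int]bool, len(raw))
	for pod, idx := range raw {
		rank := idx - minIdx
		if rank >= len(raw) || seen[rank] {
			return nil, fmt.Errorf("indices are not a contiguous range (Pod %q, index %d)", pod, idx)
		}
		seen[rank] = true
		ranks[pod] = rank
	}
	return ranks, nil
}

func main() {
	// Worker labels from the runLauncherAsWorker case with 2 workers.
	labels := map[string]string{"pi-worker-0": "1", "pi-worker-1": "2"}
	ranks, err := normalizeRanks(labels)
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	// Sort Pod names so the output order is deterministic.
	pods := make([]string, 0, len(ranks))
	for pod := range ranks {
		pods = append(pods, pod)
	}
	sort.Strings(pods)
	for _, pod := range pods {
		fmt.Printf("%s -> rank %d\n", pod, ranks[pod])
	}
}
```

Subtracting the minimum index is only one option, and it would also accept any other contiguous offset. An actual fix might instead account for the launcher replica explicitly.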

How to reproduce it (as minimally and precisely as possible):

The following is a step-by-step reproducible flow:

  1. Set up the cluster, Kueue, and the MPI Operator
$ kind create cluster

$ kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.15.2/manifests.yaml

$ cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta2
kind: Topology
metadata:
  name: "default"
spec:
  levels:
  - nodeLabel: "kubernetes.io/hostname"
---
kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta2
metadata:
  name: "tas-flavor"
spec:
  nodeLabels:
    kubernetes.io/os: linux
  topologyName: "default"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: "tas-cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "tas-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 100
      - name: "memory"
        nominalQuota: 100Gi
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  namespace: "default"
  name: "tas-user-queue"
spec:
  clusterQueue: "tas-cluster-queue"
EOF

$ kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.7.0/deploy/v2beta1/mpi-operator.yaml
  2. Create an MPIJob with runLauncherAsWorker
$ cat <<EOF | kubectl apply -f -
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
  labels:
    kueue.x-k8s.io/queue-name: "tas-user-queue"
spec:
  slotsPerWorker: 1
  runLauncherAsWorker: true  
  runPolicy:
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 60
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - -n
            - "2"
            - /home/mpiuser/pi
            resources:
              limits:
                cpu: 1
                memory: 1Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi:openmpi
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                cpu: 1
                memory: 1Gi
EOF

Anything else we need to know?:

The created Pods are as follows:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2026-01-08T14:46:41Z"
  finalizers:
  - batch.kubernetes.io/job-tracking
  generateName: pi-launcher-
  labels:
    batch.kubernetes.io/controller-uid: 23c06567-c814-4e70-8679-69353e459c23
    batch.kubernetes.io/job-name: pi-launcher
    controller-uid: 23c06567-c814-4e70-8679-69353e459c23
    job-name: pi-launcher
    training.kubeflow.org/job-name: pi
    training.kubeflow.org/job-role: launcher
    training.kubeflow.org/operator-name: mpi-operator
    training.kubeflow.org/replica-index: "0"
  name: pi-launcher-tntsp
  namespace: default
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kueue.x-k8s.io/podset-unconstrained-topology: "true"
    kueue.x-k8s.io/workload: mpijob-pi-1584c
  creationTimestamp: "2026-01-08T14:46:41Z"
  labels:
    kueue.x-k8s.io/podset: worker
    training.kubeflow.org/job-name: pi
    training.kubeflow.org/job-role: worker
    training.kubeflow.org/operator-name: mpi-operator
    training.kubeflow.org/replica-index: "1"
  name: pi-worker-0
  namespace: default

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Labels

kind/bug: Categorizes issue or PR as related to a bug.
