Open
Labels
kind/bug: Categorizes issue or PR as related to a bug.
Description
What happened:
I enabled the plain pod integration in Kueue, following the instructions at:
https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/
However, a plain pod was killed immediately after it was created, and its events showed this message:
message: Missing Workload; unable to restore pod templates
The kueue-controller-manager also logged this error:
Operation cannot be fulfilled on workloads.kueue.x-k8s.io, the object has been modified
What you expected to happen:
The plain pod's state was expected to be Running rather than Terminating.
How to reproduce it (as minimally and precisely as possible):
- Deploying Kueue with helm template, with managedJobsNamespaceSelector and integrations modified:
controllerManagerConfigYaml: |-
  apiVersion: config.kueue.x-k8s.io/v1beta2
  kind: Configuration
  managedJobsNamespaceSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: NotIn
        values: [ kube-system, kueue-system ]
  integrations:
    frameworks:
      - "pod"
      - "deployment"
      - "statefulset"
      - "leaderworkerset.x-k8s.io/leaderworkerset"
- Applying a plain pod to the Kubernetes cluster, with the label kueue.x-k8s.io/queue-name: user-queue:
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-kueue-pod-2
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  restartPolicy: Always
  containers:
    - name: test
      image: docker.inspur.com:5000/library/cke/kubernetes/pause:3.8
      resources:
        limits:
          cpu: 10m
          memory: 256Mi
# kubectl apply -f pod-example.yaml
pod/gpu-kueue-pod-1 created
- Watching pods and workloads: the pod and its workload were created, and the pod went to Terminating immediately:
# kubectl get po -w
NAME READY STATUS RESTARTS AGE
vllm-0 1/1 Running 0 17h
vllm-0-0 1/1 Running 0 17h
gpu-kueue-pod-2 0/1 Pending 0 0s
gpu-kueue-pod-2 0/1 Pending 0 0s
gpu-kueue-pod-2 0/1 Pending 0 0s
gpu-kueue-pod-2 0/1 Terminating 0 0s
gpu-kueue-pod-2 0/1 Terminating 0 0s
gpu-kueue-pod-2 0/1 Terminating 0 0s
gpu-kueue-pod-2 1/1 Terminating 0 14s
gpu-kueue-pod-2 0/1 Terminating 0 15s
gpu-kueue-pod-2 0/1 Terminating 0 16s
gpu-kueue-pod-2 0/1 Terminating 0 16s
# kubectl get workloads -w
NAME QUEUE RESERVED IN ADMITTED FINISHED AGE
job-sample-job-l88rg-9a1c1 user-queue 12h
leaderworkerset-vllm-0-68909 user-queue-gpu 17h
pod-gpu-kueue-pod-2-09669 user-queue 0s
pod-gpu-kueue-pod-2-09669 user-queue 0s
pod-gpu-kueue-pod-2-09669 user-queue 0s
pod-gpu-kueue-pod-2-09669 user-queue 16s
- Watching Kubernetes events: the pod was stopped with the message
Missing Workload; unable to restore pod templates:
# kubectl get events --sort-by {.lastTimestamp} -w
0s Normal Pulled pod/gpu-kueue-pod-2 Container image "docker.inspur.com:5000/library/cke/kubernetes/pause:3.8" already present on machine
0s Normal Created pod/gpu-kueue-pod-2 Created container test
0s Normal Started pod/gpu-kueue-pod-2 Started container test
0s Normal Killing pod/gpu-kueue-pod-2 Stopping container test
0s Normal Scheduled pod/gpu-kueue-pod-2 Successfully assigned default/gpu-kueue-pod-2 to master01
0s Normal Stopped pod/gpu-kueue-pod-2 Missing Workload; unable to restore pod templates
0s Normal CreatedWorkload pod/gpu-kueue-pod-2 Created Workload: default/pod-gpu-kueue-pod-2-09669
0s Normal Pulled pod/gpu-kueue-pod-2 Container image "docker.inspur.com:5000/library/cke/kubernetes/pause:3.8" already present on machine
0s Normal Created pod/gpu-kueue-pod-2 Created container test
0s Normal Started pod/gpu-kueue-pod-2 Started container test
61m Normal Starting node/worker02-h20
0s Normal Killing pod/gpu-kueue-pod-2 Stopping container test
- Collecting logs from kueue-controller-manager:
{"level":"Level(-2)","ts":"2026-01-09T01:36:04.554856435Z","caller":"core/workload_controller.go:151","msg":"Reconcile Workload","controller":"workload_controller","namespace":"default","name":"pod-gpu-kueue-pod-2-09669","reconcileID":"bc794311-0113-437a-865a-18d2b58cc46e"}
{"level":"error","ts":"2026-01-09T01:36:04.562144631Z","caller":"jobframework/reconciler.go:278","msg":"Removing finalizer","controller":"v1_pod","namespace":"default","name":"gpu-kueue-pod-2","reconcileID":"d0fa15d9-3ee2-4ac0-8014-8a647fb45054","job":"default/gpu-kueue-pod-2","gvk":"/v1, Kind=Pod","error":"Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"pod-gpu-kueue-pod-2-09669\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/kueue/pkg/controller/jobframework.(*JobReconciler).ReconcileGenericJob\n\t/workspace/pkg/controller/jobframework/reconciler.go:278\nsigs.k8s.io/kueue/pkg/controller/jobs/pod.(*Reconciler).Reconcile\n\t/workspace/pkg/controller/jobs/pod/pod_controller.go:123\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"}
{"level":"error","ts":"2026-01-09T01:36:04.562188092Z","caller":"controller/controller.go:316","msg":"Reconciler error","controller":"v1_pod","namespace":"default","name":"gpu-kueue-pod-2","reconcileID":"d0fa15d9-3ee2-4ac0-8014-8a647fb45054","error":"Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"pod-gpu-kueue-pod-2-09669\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"}
{"level":"Level(-2)","ts":"2026-01-09T01:36:05.513467937Z","caller":"core/localqueue_controller.go:115","msg":"Reconcile LocalQueue","controller":"localqueue_controller","namespace":"default","name":"user-queue","reconcileID":"0a2abe5d-e764-4433-8f97-e4e1da4da3e0"}
{"level":"Level(-2)","ts":"2026-01-09T01:36:05.513530132Z","caller":"core/clusterqueue_controller.go:178","msg":"Reconcile ClusterQueue","controller":"clusterqueue_controller","namespace":"","name":"cluster-queue","reconcileID":"4af41005-39f8-4019-878c-64dabf18c021"}
{"level":"error","ts":"2026-01-09T01:36:05.523468606Z","caller":"controller/controller.go:316","msg":"Reconciler error","controller":"clusterqueue_controller","namespace":"","name":"cluster-queue","reconcileID":"4af41005-39f8-4019-878c-64dabf18c021","error":"ClusterQueue.kueue.x-k8s.io \"cluster-queue\" is invalid: spec.flavorFungibility.whenCanBorrow: Unsupported value: \"MayStopSearch\": supported values: \"Borrow\", \"TryNextFlavor\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"}
{"level":"Level(-2)","ts":"2026-01-09T01:36:05.525527897Z","logger":"localqueue-reconciler","caller":"core/localqueue_controller.go:173","msg":"Queue update event","localQueue":{"name":"user-queue","namespace":"default"}}
{"level":"Level(-2)","ts":"2026-01-09T01:36:05.526243535Z","caller":"core/localqueue_controller.go:115","msg":"Reconcile LocalQueue","controller":"localqueue_controller","namespace":"default","name":"user-queue","reconcileID":"bec511c7-a594-4152-aecf-55ea9d55b645"}
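The controller log above mixes routine reconcile messages with error entries from more than one controller. As a minimal sketch (not part of the original report), the following Python helper groups error-level entries by controller and error text. The names `summarize_errors` and `SAMPLE` are hypothetical, and the sample lines are abbreviated copies of the entries above with timestamps, reconcile IDs and stack traces omitted.

```python
import json
from collections import Counter


def summarize_errors(lines):
    """Count error-level JSON log entries per (controller, error message)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict) or entry.get("level") != "error":
            continue
        key = (entry.get("controller", ""), entry.get("error", ""))
        counts[key] += 1
    return counts


# Abbreviated copies of log entries from this report (assumption: fields
# such as "ts", "reconcileID" and "stacktrace" are dropped for brevity).
SAMPLE = [
    r'{"level":"Level(-2)","msg":"Reconcile Workload","controller":"workload_controller"}',
    r'{"level":"error","msg":"Removing finalizer","controller":"v1_pod","error":"Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"pod-gpu-kueue-pod-2-09669\": the object has been modified; please apply your changes to the latest version and try again"}',
    r'{"level":"error","msg":"Reconciler error","controller":"v1_pod","error":"Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"pod-gpu-kueue-pod-2-09669\": the object has been modified; please apply your changes to the latest version and try again"}',
    r'{"level":"error","msg":"Reconciler error","controller":"clusterqueue_controller","error":"ClusterQueue.kueue.x-k8s.io \"cluster-queue\" is invalid: spec.flavorFungibility.whenCanBorrow: Unsupported value: \"MayStopSearch\": supported values: \"Borrow\", \"TryNextFlavor\""}',
]

if __name__ == "__main__":
    for (controller, error), count in summarize_errors(SAMPLE).items():
        print(f"{count}x [{controller}] {error}")
```

On the sample, the two v1_pod conflict entries are grouped together and the ClusterQueue validation error is listed separately, which makes it easier to see that the log contains two distinct failures.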
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
Major:"1", Minor:"25", GitVersion:"v1.25.4-1"
- Kueue version (use git describe --tags --dirty --always):
Chart Release: kueue-0.11.9
- Cloud provider or hardware configuration:
- OS (e.g: cat /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
cmd: hostnamectl | egrep 'Operating System|Kernel'
--------------------------------------------------
Operating System: Ubuntu 20.04.6 LTS
Kernel: Linux 5.4.0-216-generic
cmd: systemctl --version | head -n1
--------------------------------------------------
systemd 245 (245.4-4ubuntu3.20)
cmd: runc --version
--------------------------------------------------
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev
cmd: containerd -v
--------------------------------------------------
containerd github.com/containerd/containerd v1.4.3 269548fa27e0089a8b8278fc4fc781d7f65a939b
# helm list -n kueue-system
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
kueue kueue-system 1 2026-01-08 03:19:11.096477477 +0000 UTC deployed kueue-0.11.9 v0.11.9