How to use it?
What happened?
Noticed that with a large number of nodes (4k), KWOK sometimes fails to update the node lease, which causes the node to become unready. This resolves automatically after a while, but there is a side effect on pods.
When a node becomes unready, all of its pods are marked unready too. KWOK never updates the pods back to ready, so some pods are stuck in an unready state forever.
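The missed renewals can be seen on the node's Lease object, for example (assuming the default kube-node-lease namespace):
$ kubectl -n kube-node-lease get lease kwok-node-1 -o jsonpath='{.spec.renewTime}'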
Example:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2025-10-09T08:21:21Z"
  generateName: nginx-
  generation: 1
  labels:
    app: nginx
    apps.kubernetes.io/pod-index: "3"
    controller-revision-hash: nginx-d6df65d5b
    statefulset.kubernetes.io/pod-name: nginx-3
  name: nginx-3
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: nginx
    uid: 14bac6df-c3ed-43db-b636-4a8587aa75e3
  resourceVersion: "496007"
  uid: ea786112-54bf-4761-a1da-dce13ab4c265
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: type
            operator: In
            values:
            - kwok
  containers:
  - image: registry.k8s.io/nginx-slim:0.21
    imagePullPolicy: IfNotPresent
    name: nginx
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-jr9h9
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: nginx-3
  nodeName: kwok-node-1
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: kwok.x-k8s.io/node
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-jr9h9
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-10-09T08:22:16Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-10-09T08:27:40Z"
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-10-09T08:22:16Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-10-09T08:22:14Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: registry.k8s.io/nginx-slim:0.21
    imageID: ""
    lastState: {}
    name: nginx
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2025-10-09T08:22:16Z"
  hostIP: 10.0.0.1
  phase: Running
  podIP: 10.0.16.123
  podIPs:
  - ip: 10.0.16.123
  qosClass: BestEffort
  startTime: "2025-10-09T08:22:16Z"
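Note that containerStatuses reports ready: true while the pod's Ready condition is stuck at "False". A one-liner to list pods stuck in this state (assuming jq is available; the filter is just illustrative):
$ kubectl get pods -o json | jq -r '.items[] | select(any(.status.conditions[]?; .type == "Ready" and .status == "False")) | .metadata.name'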
What did you expect to happen?
KWOK should ensure that all pods it is responsible for have their Ready condition set back to "True".
How can we reproduce it (as minimally and precisely as possible)?
- Create a cluster with KWOK (in my case a kind cluster with KWOK running outside the cluster). Ensure all nodes are Ready:
$ kubectl get nodes
NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   45m   v1.34.1-dirty
kwok-node-1          Ready    agent           42m   fake
- Schedule pods (in my case via a StatefulSet) and wait for them to become ready. This can be confirmed with kubectl get statefulset:
$ kubectl get statefulset nginx
NAME    READY   AGE
nginx   4/4     31m
- Stop KWOK for a few seconds, long enough for the node lease to expire, then start it again and notice the node flip between NotReady and Ready (a rough sketch of this step follows the list):
$ kubectl get nodes
NAME                 STATUS     ROLES           AGE   VERSION
kind-control-plane   Ready      control-plane   45m   v1.34.1-dirty
kwok-node-1          NotReady   agent           44m   fake
- After the node becomes Ready again, notice that the pods never become ready:
$ kubectl get statefulset nginx
NAME    READY   AGE
nginx   0/4     36m
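For reference, a rough sketch of the stop/start step (the commands below are illustrative, not the exact ones I ran; adjust the kubeconfig and any KWOK flags to your setup):
$ pkill -x kwok                        # stop the out-of-cluster kwok process; node lease renewals stop
$ sleep 60                             # wait long enough for the lease to expire and kwok-node-1 to go NotReady
$ kwok --kubeconfig ~/.kube/config &   # start kwok again; kwok-node-1 returns to Ready, but the pods stay unready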
Anything else we need to know?
No response
Kwok version
$ kwok --version
kwok version v0.7.0 go1.24.1 (linux/amd64)
OS version
N/A