- Installed Ubuntu 24.04.2 LTS from scratch
- Installed Kubernetes v1.32.2 (Kubespray via Ansible)
- Installed NVIDIA GPU Operator (via Helm)
Problem
- Pod nvidia-operator-validator-lrjkw is stuck in Status CrashLoopBackOff
- Container toolkit-validation fails with "Error: error validating toolkit installation: exec: "nvidia-smi": executable file not found in $PATH"
Notes
- I did NOT select "install third-party drivers like NVIDIA" during the Ubuntu install, since I assumed it would be better to leave it to the GPU Operator to pick the right drivers.
Similar Issues
System Summary
OS: Ubuntu 24.04.2 LTS
Kernel: 6.11.0-21-generic
Computer: Dell Precision 7680
GPU: NVIDIA RTX 4090 (Laptop)
Container Runtime Type: containerd 2.0.3
K8s: Kubernetes v1.32.2
GPU Operator Installation
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v25.3.0
GPU Operator Namespace Pods Status
kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-88m4r 0/1 Init:0/1 0 75m
gpu-operator-1743284421-node-feature-discovery-gc-848dd4fb28gnd 1/1 Running 0 76m
gpu-operator-1743284421-node-feature-discovery-master-5579j6m7k 1/1 Running 0 76m
gpu-operator-1743284421-node-feature-discovery-worker-zwxkm 1/1 Running 0 76m
gpu-operator-6c54bfb6d5-p7bxb 1/1 Running 0 76m
nvidia-container-toolkit-daemonset-rgh2q 1/1 Running 0 75m
nvidia-dcgm-exporter-n6fzb 0/1 Init:0/1 0 75m
nvidia-device-plugin-daemonset-fqtxs 0/1 Init:0/1 0 75m
nvidia-driver-daemonset-4p7qm 1/1 Running 0 76m
nvidia-operator-validator-lrjkw 0/1 Init:CrashLoopBackOff 19 (98s ago) 75m
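To make the listing above easier to scan, here is a minimal sketch that keeps only the pods whose READY column is not N/N. The function name `not_ready_pods` is hypothetical, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, ...).

```shell
# Print the name and status of every pod whose READY column is not N/N.
# Reads `kubectl get pods` table output on stdin; the first line is
# assumed to be the header and is skipped.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, r, "/")              # READY looks like "0/1"
    if (r[1] != r[2]) print $1, $3 # name and status of unready pods
  }'
}

# Usage against the live cluster:
#   kubectl get pods -n gpu-operator | not_ready_pods
```

Run against the listing above, this should leave the four pods that are waiting on init containers, including the validator itself.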
Log of Container "toolkit-validation"
kubectl logs -n gpu-operator nvidia-operator-validator-lrjkw -c toolkit-validation
time="2025-03-29T22:55:39Z" level=info msg="version: b5479aaa-amd64, commit: b5479aa"
toolkit is not ready
time="2025-03-29T22:55:39Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
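Since the error is about `nvidia-smi` not being on `$PATH` inside the container, one thing worth checking on the node is whether the binary exists at all. The sketch below is only a diagnostic helper under assumptions: the function name `find_nvidia_smi` is hypothetical, `/run/nvidia/driver` is the driver-install-dir host path from the pod description, and I am assuming the binary would sit under `usr/bin` in each root.

```shell
# Report which candidate root directories contain an executable nvidia-smi.
# Arguments: one or more root directories to probe.
# Returns 0 if at least one root has the binary, 1 otherwise.
find_nvidia_smi() {
  found=1
  for root in "$@"; do
    # Strip a trailing slash so "/" becomes "" and joins cleanly.
    candidate="${root%/}/usr/bin/nvidia-smi"
    if [ -x "$candidate" ]; then
      echo "found: $candidate"
      found=0
    else
      echo "missing: $candidate"
    fi
  done
  return "$found"
}

# Usage on the GPU node (both paths are assumptions, see above):
#   find_nvidia_smi / /run/nvidia/driver
```

If the binary is missing under both roots, that would point at the driver install itself rather than at the toolkit; if it is present under `/run/nvidia/driver` only, the toolkit is probably not injecting it into the container.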
Description for pod nvidia-operator-validator-lrjkw
kubectl describe pod -n gpu-operator nvidia-operator-validator-lrjkw
Name: nvidia-operator-validator-lrjkw
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Runtime Class Name: nvidia
Service Account: nvidia-operator-validator
Node: tensor/192.168.1.111
Start Time: Sat, 29 Mar 2025 21:41:31 +0000
Labels: app=nvidia-operator-validator
app.kubernetes.io/managed-by=gpu-operator
app.kubernetes.io/part-of=gpu-operator
controller-revision-hash=7fcd444d54
helm.sh/chart=gpu-operator-v25.3.0
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: 62e5554f1d12b7abc6a447b23682285922cbd6626a442d0912c67ccf4b5ef240
cni.projectcalico.org/podIP: 10.233.106.205/32
cni.projectcalico.org/podIPs: 10.233.106.205/32
Status: Pending
IP: 10.233.106.205
IPs:
IP: 10.233.106.205
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
driver-validation:
Container ID: containerd://9d9b53912922d5bbe3d768d8d5305d7cb8988819f33763e5db7195eaa82ab2fa
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:07b93914425148f936157ad295649ce100b91b29394669031a585d2458c9f39f
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 29 Mar 2025 21:43:27 +0000
Finished: Sat, 29 Mar 2025 21:43:27 +0000
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-dir (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xbqr5 (ro)
toolkit-validation:
Container ID: containerd://3e2b201f61a786725407f0ca66426fabc50734ea2b2acac8f1164298cdba6669
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:07b93914425148f936157ad295649ce100b91b29394669031a585d2458c9f39f
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Error
Exit Code: 1
Started: Sat, 29 Mar 2025 23:00:42 +0000
Finished: Sat, 29 Mar 2025 23:00:42 +0000
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sat, 29 Mar 2025 22:55:39 +0000
Finished: Sat, 29 Mar 2025 22:55:39 +0000
Ready: False
Restart Count: 20
Environment:
NVIDIA_VISIBLE_DEVICES: all
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xbqr5 (ro)
cuda-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
WITH_WAIT: false
COMPONENT: cuda
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
VALIDATOR_RUNTIME_CLASS: nvidia
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xbqr5 (ro)
plugin-validation:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
COMPONENT: plugin
WITH_WAIT: false
WITH_WORKLOAD: false
MIG_STRATEGY: single
NODE_NAME: (v1:spec.nodeName)
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
VALIDATOR_IMAGE: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
VALIDATOR_IMAGE_PULL_POLICY: IfNotPresent
VALIDATOR_RUNTIME_CLASS: nvidia
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xbqr5 (ro)
Containers:
nvidia-operator-validator:
Container ID:
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
Image ID:
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
echo all validations are successful; sleep infinity
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xbqr5 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-dir:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType:
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
kube-api-access-xbqr5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.operator-validator=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 4m7s (x342 over 77m) kubelet Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-lrjkw_gpu-operator(e08c4646-dd57-4a5a-8df9-c1a4191e5fb2)
Normal Pulled 5s (x21 over 77m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0" already present on machine