toolkit-validation container fails with "nvidia-smi": executable file not found in $PATH (after clean installation on Ubuntu 24.04.2 + Kubespray 1.32.2) #1374

@botterweck

Description

  • Installed Ubuntu 24.04.2 LTS from scratch
  • Installed Kubernetes 1.32.2 (via Kubespray/Ansible)
  • Installed NVIDIA GPU Operator (via Helm)

Problem

  • Pod nvidia-operator-validator-lrjkw is stuck in status Init:CrashLoopBackOff
  • Init container toolkit-validation fails with: Error: error validating toolkit installation: exec: "nvidia-smi": executable file not found in $PATH

Notes

  • I did NOT select "install third-party drivers like NVIDIA" during the Ubuntu install, since I assumed it would be better to leave it to the GPU Operator to pick the right drivers.

Similar Issues

System Summary

OS: Ubuntu 24.04.2 LTS
Kernel: 6.11.0-21-generic
Computer: Dell Precision 7680
GPU: NVIDIA RTX 4090 (Laptop)
Container Runtime Type: containerd 2.0.3
K8s: Kubernetes v1.32.2 (installed via Kubespray)

GPU Operator Installation

helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v25.3.0

GPU Operator Namespace Pods Status

kubectl get pods -n gpu-operator

NAME                                                              READY   STATUS                  RESTARTS       AGE
gpu-feature-discovery-88m4r                                       0/1     Init:0/1                0              75m
gpu-operator-1743284421-node-feature-discovery-gc-848dd4fb28gnd   1/1     Running                 0              76m
gpu-operator-1743284421-node-feature-discovery-master-5579j6m7k   1/1     Running                 0              76m
gpu-operator-1743284421-node-feature-discovery-worker-zwxkm       1/1     Running                 0              76m
gpu-operator-6c54bfb6d5-p7bxb                                     1/1     Running                 0              76m
nvidia-container-toolkit-daemonset-rgh2q                          1/1     Running                 0              75m
nvidia-dcgm-exporter-n6fzb                                        0/1     Init:0/1                0              75m
nvidia-device-plugin-daemonset-fqtxs                              0/1     Init:0/1                0              75m
nvidia-driver-daemonset-4p7qm                                     1/1     Running                 0              76m
nvidia-operator-validator-lrjkw                                   0/1     Init:CrashLoopBackOff   19 (98s ago)   75m
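
To make the failing pods easier to spot in a listing like the one above, here is a minimal sketch of a filter that keeps only pods whose READY count is incomplete. The function name list_unready_pods is hypothetical; it only parses the default column layout of kubectl get pods (NAME, READY, STATUS, ...) and does not call the cluster itself.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the name and status of every pod whose READY column is not n/n.
list_unready_pods() {
    awk 'NR > 1 {
        # READY looks like "0/1"; split into ready and total counts.
        split($2, ready, "/")
        if (ready[1] != ready[2]) print $1, $3
    }'
}
```

It would be used as: kubectl get pods -n gpu-operator | list_unready_pods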

Log of Container "toolkit-validation"

kubectl logs -n gpu-operator nvidia-operator-validator-lrjkw -c toolkit-validation

time="2025-03-29T22:55:39Z" level=info msg="version: b5479aaa-amd64, commit: b5479aa"
toolkit is not ready
time="2025-03-29T22:55:39Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
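
Since the error says nvidia-smi cannot be found, one thing that could be checked on the node is whether the binary exists at all, either on the host or under the driver container's root. The sketch below is only an assumption about where to look: /run/nvidia/driver is the driver-install-dir host path shown in the pod description, but the usr/bin/nvidia-smi location beneath it is assumed, and find_nvidia_smi is a hypothetical helper name.

```shell
# Hypothetical helper: print the first executable nvidia-smi found under
# any of the given root directories; return non-zero if none is found.
find_nvidia_smi() {
    for root in "$@"; do
        # Strip a trailing slash so that "/" becomes "/usr/bin/nvidia-smi".
        candidate="${root%/}/usr/bin/nvidia-smi"
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

On the GPU node this could be run as: find_nvidia_smi /run/nvidia/driver / || echo "nvidia-smi not found". If nothing is found under either root, that might suggest the driver container has not finished installing the driver, even though its pod reports Running.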

Description for pod nvidia-operator-validator-lrjkw

kubectl describe pod -n gpu-operator nvidia-operator-validator-lrjkw

Name:                 nvidia-operator-validator-lrjkw
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      nvidia-operator-validator
Node:                 tensor/192.168.1.111
Start Time:           Sat, 29 Mar 2025 21:41:31 +0000
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/managed-by=gpu-operator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=7fcd444d54
                      helm.sh/chart=gpu-operator-v25.3.0
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: 62e5554f1d12b7abc6a447b23682285922cbd6626a442d0912c67ccf4b5ef240
                      cni.projectcalico.org/podIP: 10.233.106.205/32
                      cni.projectcalico.org/podIPs: 10.233.106.205/32
Status:               Pending
IP:                   10.233.106.205
IPs:
  IP:           10.233.106.205
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
  driver-validation:
    Container ID:  containerd://9d9b53912922d5bbe3d768d8d5305d7cb8988819f33763e5db7195eaa82ab2fa
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:07b93914425148f936157ad295649ce100b91b29394669031a585d2458c9f39f
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 29 Mar 2025 21:43:27 +0000
      Finished:     Sat, 29 Mar 2025 21:43:27 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:           true
      COMPONENT:           driver
      OPERATOR_NAMESPACE:  gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /host-dev-char from host-dev-char (rw)
      /run/nvidia/driver from driver-install-dir (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xbqr5 (ro)
  toolkit-validation:
    Container ID:  containerd://3e2b201f61a786725407f0ca66426fabc50734ea2b2acac8f1164298cdba6669
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:07b93914425148f936157ad295649ce100b91b29394669031a585d2458c9f39f
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 29 Mar 2025 23:00:42 +0000
      Finished:     Sat, 29 Mar 2025 23:00:42 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 29 Mar 2025 22:55:39 +0000
      Finished:     Sat, 29 Mar 2025 22:55:39 +0000
    Ready:          False
    Restart Count:  20
    Environment:
      NVIDIA_VISIBLE_DEVICES:  all
      WITH_WAIT:               false
      COMPONENT:               toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xbqr5 (ro)
  cuda-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xbqr5 (ro)
  plugin-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                false
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xbqr5 (ro)
Containers:
  nvidia-operator-validator:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xbqr5 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 False
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-dev-char:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/char
    HostPathType:
  kube-api-access-xbqr5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  4m7s (x342 over 77m)  kubelet  Back-off restarting failed container toolkit-validation in pod nvidia-operator-validator-lrjkw_gpu-operator(e08c4646-dd57-4a5a-8df9-c1a4191e5fb2)
  Normal   Pulled   5s (x21 over 77m)     kubelet  Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.0" already present on machine
