After a node restarts, pods are not guaranteed to start in any particular order. As a result, pods that start before nvidia-container-toolkit, including the device plugin, cannot access the GPU devices.
Could you please provide some guidance on how to handle this situation? Is there any configuration on the gpu-operator side to ensure that nvidia-container-toolkit starts first?
Thanks.