runsc: bind-mount host /proc/driver/nvidia for CDI createContainer hooks#13284
runsc: bind-mount host /proc/driver/nvidia for CDI createContainer hooks#13284a7i wants to merge 1 commit into
Conversation
After google#13034 enabled createContainer hooks in CDI specs, three of four NVIDIA CDI hooks emitted by k8s-device-plugin succeed inside the gofer (create-symlinks, enable-cuda-compat, update-ldcache). The fourth -- disable-device-node-modification -- still fails because it opens /proc/driver/nvidia/params inside the container's rootfs and bind-mounts a modified copy over it; that path does not exist in containerRootFs at hook time because procfs is not mounted (the sentry serves /proc itself later): nvidia-ctk hook disable-device-node-modification stderr: failed to mount modified params file: open o_path procfd: open /run/containerd/.../rootfs/proc/driver/nvidia/params: no such file or directory FATAL ERROR: error executing CreateContainer hooks Under runc, /proc is mounted into the container's mount namespace before createContainer hooks run, so the hook just works. Mirror that here for the gofer: when nvproxy is enabled, bind-mount the host's /proc/driver/nvidia directory onto containerRootFs/proc/driver/nvidia (read-only) before invoking hooks. The hook's semantic effect (set ModifyDeviceFiles=0 to prevent libnvidia-ml from auto-creating extra /dev/nvidiaN nodes) does not apply under gVisor -- nvproxy mediates all device access and the sentry owns /dev -- but the hook needs to be able to *complete* for sandbox creation to proceed. Fixes google#13283
|
I only tested running with cdi-cri mode which is probably why I didn't catch this. I'll take a look tomorrow to better understand what this mode is doing. |
|
Thank you! I'll give cdi-cri a try now |
|
@LandonTClipp confirmed The CDI spec at So this fix is needed regardless of strategy. |
|
I will have to get to the bottom of why my system did not encounter this. There is probably an nvcdi difference between us that is accounting for it but I do not know for certain. |
|
Hmm here is a hint:
So it has to do with the container image. |
Note that gVisor exposes gvisor/pkg/sentry/devices/nvproxy/nvproxy.go Lines 82 to 84 in cd64409 So we can basically special case |
|
@ayushr2 implemented your suggestion in #13308 as an alternative to this PR. It filters |
|
This hook was added in nvidia-container-toolkit v1.18.0-rc.1. k8s-device-plugin v1.17.0 uses v1.17.0 of nvidia-container-toolkit. It uses the nvcdi library directly. This is what I'm running on my system. So I am confused why you said you also received this hook if you're running k8s-device-plugin 1.17.0. That shouldn't happen. Anyway, I did ask NVIDIA if they could help us with this and their solution is a CLI flag/env var you can add to k8s-device-plugin: NVIDIA/k8s-device-plugin#1818 I think it makes sense if gvisor still explicitly ignores this, but I also feel that it's more theoretically correct to leave the onus on cluster operators to disable hooks they don't need. Well, I could go either way. |
ah! nice find. I am using nvidia-container-toolkit 1.19.1 with device plugin v0.19.1 |
|
@a7i I am sending a fix for this. Feel free to close your PRs. Side note: LLM agents have bad habit of generating useless unit tests to satiate testing requirements from their users. But in practice, a lot of these unit tests are useless. For instance, the unit test generated in this PR will never run because we don't run unit tests on GPU VMs in BuildKite. Unit tests can also be equally buggy; i.e. you can have a buggy implementation and generate an equally buggy unit test to accept the incorrect output. It is not a good "proof of correctness". e2e tests are harder to fool like this. LLMs will happily generate unit tests that are skipped or pass the current implementation because they overfit their own implementation. So in effect we just add a bunch of lines of code to the project that don't provide value. Please be mindful of this! |
Fixes #13283.
What this does
After #13034 enabled CDI
createContainerhooks in the gofer, three of the four NVIDIA hooks emitted byk8s-device-plugin(create-symlinks,enable-cuda-compat,update-ldcache) now succeed inside the gofer. The fourth —disable-device-node-modification— still fails because it bind-mounts a modifiedparamsfile over<containerRootFs>/proc/driver/nvidia/params, and that path doesn't exist incontainerRootFsat hook time (procfs is not mounted there; the sentry serves/procitself later).This change bind-mounts the host's
/proc/driver/nvidiadirectory (read-only) ontocontainerRootFs/proc/driver/nvidiawhen nvproxy is enabled, right afterSetupDevand before hooks execute. That mirrors what runc gets for free from mounting procfs into the container beforecreateContainerhooks run.The hook's semantic effect (set
ModifyDeviceFiles=0so in-containerlibnvidia-mlwon't auto-create extra/dev/nvidiaNnodes) doesn't apply under gVisor because nvproxy mediates all device access and the sentry owns/dev. The hook just needs to be able to complete so sandbox creation proceeds — which this fix achieves.Sequence
Tested
Verified on a Tesla T4 host (Ubuntu 22.04, kernel 6.8, containerd 2.2.2, kubelet 1.35) with
k8s-device-plugininDEVICE_LIST_STRATEGY=cdi-annotationsmode andrunsc release-20260520.0(pre-patch baseline). Pre-patch the gofer log shows:With this patch applied, all four hooks log
Execute hook success!, the sandbox starts, and a PyTorch CUDA pod reportscuda available: Trueend-to-end.Tests
TestSetupNvidiaProcDriverNoHostDrivercovers the graceful-skip path (host with no NVIDIA driver loaded) using a tmpdir as the rootfs.Risks
/proc/driver/nvidia/(kernel-provided NVIDIA driver metadata). Same surface the host driver already exposes to userspace; nothing privileged is added.specutils.NVProxyEnabled(spec, conf).os.Statof/proc/driver/nvidiareturnsENOENT→ function returns nil cleanly.if rootfsConf.ShouldUseLisafs()only for the hook execution itself, butSetupNvidiaProcDriveris called outside that branch (alongsideSetupDev). EROFS rootfs is still covered by theos.MkdirAllfailing → fatal, with a clear error; that matches the existingSetupDevbehavior and the EROFS hook caveat already documented by feat: Support running createContainer hooks in CDI spec #13034.