
runsc: bind-mount host /proc/driver/nvidia for CDI createContainer hooks#13284

Closed
a7i wants to merge 1 commit into google:master from a7i:a7i/nvidia-procfs-for-cdi-hooks

@a7i
Contributor

@a7i a7i commented May 26, 2026

Fixes #13283.

What this does

After #13034 enabled CDI createContainer hooks in the gofer, three of the four NVIDIA hooks emitted by k8s-device-plugin (create-symlinks, enable-cuda-compat, update-ldcache) now succeed inside the gofer. The fourth — disable-device-node-modification — still fails because it bind-mounts a modified params file over <containerRootFs>/proc/driver/nvidia/params, and that path doesn't exist in containerRootFs at hook time (procfs is not mounted there; the sentry serves /proc itself later).

This change bind-mounts the host's /proc/driver/nvidia directory (read-only) onto containerRootFs/proc/driver/nvidia when nvproxy is enabled, right after SetupDev and before hooks execute. That mirrors what runc gets for free from mounting procfs into the container before createContainer hooks run.

The hook's semantic effect (set ModifyDeviceFiles=0 so in-container libnvidia-ml won't auto-create extra /dev/nvidiaN nodes) doesn't apply under gVisor because nvproxy mediates all device access and the sentry owns /dev. The hook just needs to be able to complete so sandbox creation proceeds — which this fix achieves.

Sequence

gofer setup (containerRootFs)
  SetupMounts          → libcuda / library bind-mounts
  SetupDev             → /dev/nvidia* cdevs
  SetupNvidiaProcDriver  ← NEW: bind-mount /proc/driver/nvidia (ro)
  ExecuteHooks         → all four CDI createContainer hooks now succeed
bind-mount containerRootFs → goferRootFs (existing behavior)
pivot_root into goferRootFs
sentry boots; serves its own /proc

Tested

Verified on a Tesla T4 host (Ubuntu 22.04, kernel 6.8, containerd 2.2.2, kubelet 1.35) with k8s-device-plugin in DEVICE_LIST_STRATEGY=cdi-annotations mode and runsc release-20260520.0 (pre-patch baseline). Pre-patch the gofer log shows:

hooks.go:63] Executing hook nvidia-ctk hook disable-device-node-modification
util.go:107] FATAL ERROR: error executing CreateContainer hooks
stderr: failed to mount modified params file: open o_path procfd:
  open /run/containerd/.../rootfs/proc/driver/nvidia/params: no such file or directory

With this patch applied, all four hooks log Execute hook success!, the sandbox starts, and a PyTorch CUDA pod reports cuda available: True end-to-end.

Tests

  • TestSetupNvidiaProcDriverNoHostDriver covers the graceful-skip path (host with no NVIDIA driver loaded) using a tmpdir as the rootfs.
  • Real bind-mount behavior is covered by the existing GPU-host integration tests; this change drops into the same code path they exercise after feat: Support running createContainer hooks in CDI spec #13034.

Risks

  • Scope of the bind-mount: read-only, only the contents of /proc/driver/nvidia/ (kernel-provided NVIDIA driver metadata). Same surface the host driver already exposes to userspace; nothing privileged is added.
  • No-op when nvproxy is disabled: gated on specutils.NVProxyEnabled(spec, conf).
  • No-op when the host driver is not loaded: os.Stat of /proc/driver/nvidia returns ENOENT → function returns nil cleanly.
  • Compat with EROFS rootfs: only the hook execution itself runs inside the if rootfsConf.ShouldUseLisafs() branch; SetupNvidiaProcDriver is called outside that branch (alongside SetupDev). With an EROFS rootfs, os.MkdirAll fails and the gofer exits with a clear fatal error; that matches the existing SetupDev behavior and the EROFS hook caveat already documented by feat: Support running createContainer hooks in CDI spec #13034.

After google#13034 enabled createContainer hooks in CDI specs, three of four
NVIDIA CDI hooks emitted by k8s-device-plugin succeed inside the gofer
(create-symlinks, enable-cuda-compat, update-ldcache). The fourth --
disable-device-node-modification -- still fails because it opens
/proc/driver/nvidia/params inside the container's rootfs and bind-mounts a
modified copy over it; that path does not exist in containerRootFs at hook
time because procfs is not mounted (the sentry serves /proc itself later):

  nvidia-ctk hook disable-device-node-modification
  stderr: failed to mount modified params file: open o_path procfd:
    open /run/containerd/.../rootfs/proc/driver/nvidia/params:
    no such file or directory
  FATAL ERROR: error executing CreateContainer hooks

Under runc, /proc is mounted into the container's mount namespace before
createContainer hooks run, so the hook just works. Mirror that here for
the gofer: when nvproxy is enabled, bind-mount the host's
/proc/driver/nvidia directory onto containerRootFs/proc/driver/nvidia
(read-only) before invoking hooks.

The hook's semantic effect (set ModifyDeviceFiles=0 to prevent libnvidia-ml
from auto-creating extra /dev/nvidiaN nodes) does not apply under gVisor --
nvproxy mediates all device access and the sentry owns /dev -- but the
hook needs to be able to *complete* for sandbox creation to proceed.

Fixes google#13283
@LandonTClipp
Contributor

LandonTClipp commented May 28, 2026

I only tested running with cdi-cri mode which is probably why I didn't catch this. I'll take a look tomorrow to better understand what this mode is doing.

@a7i
Contributor Author

a7i commented May 28, 2026

Thank you! I'll give cdi-cri a try now

@a7i
Contributor Author

a7i commented May 28, 2026

@LandonTClipp confirmed cdi-cri hits the same failure end-to-end (k8s-device-plugin v0.17.0, containerd 2.2.2, runsc release-20260520.0, Tesla T4):

hooks.go:63] Executing hook nvidia-ctk hook disable-device-node-modification
W util.go:107] FATAL ERROR: error executing CreateContainer hooks: failure executing hook "/usr/bin/nvidia-ctk", err: exit status 1
stderr: failed to mount modified params file: open o_path procfd: open .../rootfs/proc/driver/nvidia/params: no such file or directory

The CDI spec at /var/run/cdi/k8s.device-plugin.nvidia.com-gpu.json is identical between cdi-cri and cdi-annotations. The device plugin writes the same 4 hooks regardless of DEVICE_LIST_STRATEGY; that flag only changes how the plugin signals the desired devices to containerd (pod annotations vs CRI ContainerConfig.CDIDevices), while the hook list passed to runsc stays the same. With disable-device-node-modification stripped from the CDI spec (chattr +i to keep the plugin from regenerating it), a PyTorch CUDA pod completes cleanly under cdi-cri:

torch: 2.2.2
cuda available: True
device: Tesla T4
ok, sum: -54.53

So this fix is needed regardless of strategy.
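
One way to check which createContainer hooks a CDI spec carries is to parse the JSON and print each hook's argv. The sketch below models only the small subset of the CDI schema it needs (`containerEdits.hooks` with `hookName`, `path`, `args`, at the top level and per device). The embedded `sampleSpec` is a made-up excerpt for illustration, not the file from the host above, and `createContainerHooks` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// cdiHook, containerEdits, cdiDevice and cdiSpec cover only the fields this
// sketch reads; a real CDI spec has many more.
type cdiHook struct {
	HookName string   `json:"hookName"`
	Path     string   `json:"path"`
	Args     []string `json:"args"`
}

type containerEdits struct {
	Hooks []cdiHook `json:"hooks"`
}

type cdiDevice struct {
	Name           string         `json:"name"`
	ContainerEdits containerEdits `json:"containerEdits"`
}

type cdiSpec struct {
	ContainerEdits containerEdits `json:"containerEdits"`
	Devices        []cdiDevice    `json:"devices"`
}

// createContainerHooks returns the argv (joined by spaces) of every
// createContainer hook in the spec, global edits first, then per-device edits.
func createContainerHooks(data []byte) ([]string, error) {
	var spec cdiSpec
	if err := json.Unmarshal(data, &spec); err != nil {
		return nil, fmt.Errorf("parsing CDI spec: %w", err)
	}
	var out []string
	collect := func(e containerEdits) {
		for _, h := range e.Hooks {
			if h.HookName == "createContainer" {
				out = append(out, strings.Join(h.Args, " "))
			}
		}
	}
	collect(spec.ContainerEdits)
	for _, d := range spec.Devices {
		collect(d.ContainerEdits)
	}
	return out, nil
}

// sampleSpec is an illustrative, made-up excerpt (assumption), not real output
// from k8s-device-plugin.
const sampleSpec = `{
  "cdiVersion": "0.5.0",
  "kind": "k8s.device-plugin.nvidia.com/gpu",
  "containerEdits": {
    "hooks": [
      {"hookName": "createContainer", "path": "/usr/bin/nvidia-ctk",
       "args": ["nvidia-ctk", "hook", "update-ldcache"]},
      {"hookName": "createContainer", "path": "/usr/bin/nvidia-ctk",
       "args": ["nvidia-ctk", "hook", "disable-device-node-modification"]}
    ]
  },
  "devices": []
}`

func main() {
	hooks, err := createContainerHooks([]byte(sampleSpec))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, h := range hooks {
		fmt.Println(h)
	}
}
```

To inspect a real spec, the bytes from `os.ReadFile` on the spec path could be passed to `createContainerHooks` in place of `sampleSpec`.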

@LandonTClipp
Contributor

I will have to get to the bottom of why my system did not encounter this. There is probably an nvcdi difference between us that is accounting for it but I do not know for certain.

@ayushr2
Collaborator

ayushr2 commented May 28, 2026

Hmm here is a hint:

  • This works: docker run --runtime=runsc --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 echo hello
  • This fails: docker run --runtime=runsc --rm --gpus all ubuntu:22.04 echo hello

So it has to do with the container image.

@ayushr2
Collaborator

ayushr2 commented May 28, 2026

> The hook's semantic effect (set ModifyDeviceFiles=0 so in-container libnvidia-ml won't auto-create extra /dev/nvidiaN nodes)

Note that gVisor exposes /proc/driver/nvidia/params inside the sandbox as well and we do the same thing, we force ModifyDeviceFiles=0:

// Force ModifyDeviceFiles in /proc/driver/nvidia/params to 0. This is
// consistent with libnvidia-container's src/nvc_mount.c:mount_procfs().
nvp.procDriverNvidiaParams = strings.Replace(nvp.procDriverNvidiaParams, "ModifyDeviceFiles: 1", "ModifyDeviceFiles: 0", 1)

So we can basically special case disable-device-node-modification and skip it in gVisor.
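
For illustration, the effect of that `strings.Replace` call can be seen on a small input. The wrapper name `forceNoModifyDeviceFiles` and the sample contents are assumptions made for this sketch; a real params file has more entries, and only the `ModifyDeviceFiles` line matters here.

```go
package main

import (
	"fmt"
	"strings"
)

// forceNoModifyDeviceFiles rewrites the first "ModifyDeviceFiles: 1" entry to
// "ModifyDeviceFiles: 0" and leaves everything else untouched. It wraps the
// same strings.Replace call as the snippet quoted above.
func forceNoModifyDeviceFiles(params string) string {
	return strings.Replace(params, "ModifyDeviceFiles: 1", "ModifyDeviceFiles: 0", 1)
}

func main() {
	// Illustrative excerpt only (assumption), not a full params file.
	params := "ModifyDeviceFiles: 1\nDeviceFileMode: 438\n"
	fmt.Print(forceNoModifyDeviceFiles(params))
}
```

Because the sandbox already presents the value as 0, a hook whose only job is to set the same value adds nothing under gVisor, which is the reasoning behind skipping it.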

@a7i
Contributor Author

a7i commented May 28, 2026

@ayushr2 implemented your suggestion in #13308 as an alternative to this PR. It filters nvidia-ctk hook disable-device-node-modification out of spec.Hooks.CreateContainer when nvproxy is enabled (matched by argv shape), rather than bind-mounting host /proc/driver/nvidia into containerRootFs. PTAL when you have a moment, happy to close this PR in favor of #13308 if that is the preferred approach.

cc @LandonTClipp
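
A rough sketch of filtering by argv shape, under stated assumptions: the local `hook` type mirrors only the `Path` and `Args` fields of an OCI runtime-spec hook, `Args[0]` is taken to be the program name (execv-style argv), and the names `isDisableDeviceNodeModification` and `filterHooks` are hypothetical. This is not the code from #13308.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// hook mirrors the Path and Args fields of an OCI runtime-spec hook.
type hook struct {
	Path string
	Args []string
}

// isDisableDeviceNodeModification reports whether h has the shape
// `nvidia-ctk hook disable-device-node-modification [...]`.
func isDisableDeviceNodeModification(h hook) bool {
	if filepath.Base(h.Path) != "nvidia-ctk" || len(h.Args) < 3 {
		return false
	}
	return h.Args[1] == "hook" && h.Args[2] == "disable-device-node-modification"
}

// filterHooks drops the disable-device-node-modification hook when nvproxy is
// enabled and returns the input unchanged otherwise.
func filterHooks(hooks []hook, nvproxyEnabled bool) []hook {
	if !nvproxyEnabled {
		return hooks
	}
	kept := make([]hook, 0, len(hooks))
	for _, h := range hooks {
		if !isDisableDeviceNodeModification(h) {
			kept = append(kept, h)
		}
	}
	return kept
}

func main() {
	hooks := []hook{
		{Path: "/usr/bin/nvidia-ctk", Args: []string{"nvidia-ctk", "hook", "update-ldcache"}},
		{Path: "/usr/bin/nvidia-ctk", Args: []string{"nvidia-ctk", "hook", "disable-device-node-modification"}},
	}
	for _, h := range filterHooks(hooks, true) {
		fmt.Println(h.Path, h.Args[1:])
	}
}
```

Matching on the binary's base name plus the two subcommand tokens keeps the filter narrow: other nvidia-ctk hooks and unrelated hooks pass through untouched.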

@LandonTClipp
Contributor

This hook was added in nvidia-container-toolkit v1.18.0-rc.1. k8s-device-plugin v0.17.0 uses v1.17.0 of nvidia-container-toolkit. It uses the nvcdi library directly. This is what I'm running on my system. So I am confused why you said you also received this hook if you're running k8s-device-plugin v0.17.0. That shouldn't happen.

Anyway, I did ask NVIDIA if they could help us with this and their solution is a CLI flag/env var you can add to k8s-device-plugin: NVIDIA/k8s-device-plugin#1818

I think it makes sense if gvisor still explicitly ignores this, but I also feel that it's more theoretically correct to leave the onus on cluster operators to disable hooks they don't need. Well, I could go either way.

@a7i
Contributor Author

a7i commented May 28, 2026

> This hook was added in nvidia-container-toolkit v1.18.0-rc.1. k8s-device-plugin v0.17.0 uses v1.17.0 of nvidia-container-toolkit. It uses the nvcdi library directly. This is what I'm running on my system. So I am confused why you said you also received this hook if you're running k8s-device-plugin v0.17.0. That shouldn't happen.
>
> Anyway, I did ask NVIDIA if they could help us with this and their solution is a CLI flag/env var you can add to k8s-device-plugin: NVIDIA/k8s-device-plugin#1818
>
> I think it makes sense if gvisor still explicitly ignores this, but I also feel that it's more theoretically correct to leave the onus on cluster operators to disable hooks they don't need. Well, I could go either way.

ah! nice find. I am using nvidia-container-toolkit 1.19.1 with device plugin v0.19.1
apologies for the incorrect info earlier

@ayushr2
Collaborator

ayushr2 commented May 28, 2026

@a7i I am sending a fix for this. Feel free to close your PRs.

Side note: LLM agents have a bad habit of generating useless unit tests to satisfy testing requirements from their users. In practice, a lot of these unit tests provide no value. For instance, the unit test generated in this PR will never run because we don't run unit tests on GPU VMs in BuildKite. Unit tests can also be just as buggy as the code they cover; i.e. you can have a buggy implementation and generate an equally buggy unit test that accepts the incorrect output. It is not a good "proof of correctness". e2e tests are harder to fool like this. LLMs will happily generate unit tests that are skipped or that pass the current implementation because they overfit to their own implementation. So in effect we just add a bunch of lines of code to the project that don't provide value. Please be mindful of this!

@a7i a7i closed this May 28, 2026
@a7i a7i deleted the a7i/nvidia-procfs-for-cdi-hooks branch May 28, 2026 20:02

Development

Successfully merging this pull request may close these issues.

createContainer hook disable-device-node-modification fails: /proc/driver/nvidia/params not available in containerRootFs (gvisor#13034 follow-up)

3 participants