feat: Support running createContainer hooks in CDI spec by LandonTClipp · Pull Request #13034 · google/gvisor

LandonTClipp · 2026-04-28T19:35:26Z

Description

This commit adds the ability for gVisor to run createContainer hooks in the CDI spec. This is needed to support NVIDIA's k8s-device-plugin running in DEVICE_LIST_STRATEGY=cdi-cri. In this mode, the plugin creates a CDI spec file at /var/run/cdi/[...].json that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which nvidia-ctk hooks need to be run.

While the device cdev and client library injection mechanism already worked with gVisor, the createContainer hooks that created the library symlinks (e.g. /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1) and updated the ldconfig cache (nvidia-ctk hook update-ldcache) were missing. This meant that processes inside the container could not resolve the client libraries and thus did not know how to communicate with the /dev/nvidiactl and /dev/nvidia${n} cdevs. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.

gVisor previously solved this problem by using the nvidia-container-cli configure command. This did largely the same things that the CDI spec file instructs us to do, but it is a legacy path that does not use CDI at all.

How it Works

In gofer_mount.go, the code now distinguishes explicitly between the
containerRootFs (usually under /var/lib/.../root) and the goferRootFs
(/proc/fs). The issue with nvidia-ctk hooks is that they pivot_root(2)
into the containerRootFs while gVisor operated under the goferRootFs.
This meant that nvidia-ctk did not see any of the CDI mounts, because
they had not been attached under the containerRootFs.

This commit changes gVisor so that all device and mount setup is done
under the containerRootFs. We then bind-mount containerRootFs into
goferRootFs after running the createContainer hooks. The gofer
pivot_roots into the goferRootFs as before.

Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed. Solving that is left for
future work.
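
The ordering described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the rootfsOps interface, prepareRootfs, the recorder fake, and the paths passed in main are all hypothetical names chosen so that the sequence can be exercised without root privileges.

```go
package main

import "fmt"

// rootfsOps abstracts the privileged steps so that the ordering can be
// checked with a fake. All names in this sketch are hypothetical.
type rootfsOps interface {
	SetupMountsAndDevices(root string) error
	RunCreateContainerHooks() error
	BindMount(src, dst string) error
}

// prepareRootfs performs all setup under containerRootFs, runs the hooks only
// when the rootfs is writable, and then exposes the result at goferRootFs.
func prepareRootfs(ops rootfsOps, containerRootFs, goferRootFs string, writable bool) error {
	if err := ops.SetupMountsAndDevices(containerRootFs); err != nil {
		return fmt.Errorf("setting up %q: %w", containerRootFs, err)
	}
	if writable {
		if err := ops.RunCreateContainerHooks(); err != nil {
			return fmt.Errorf("running createContainer hooks: %w", err)
		}
	}
	if err := ops.BindMount(containerRootFs, goferRootFs); err != nil {
		return fmt.Errorf("binding %q onto %q: %w", containerRootFs, goferRootFs, err)
	}
	return nil
}

// recorder is a fake rootfsOps that only records the order of calls.
type recorder struct{ calls []string }

func (r *recorder) SetupMountsAndDevices(root string) error {
	r.calls = append(r.calls, "setup "+root)
	return nil
}

func (r *recorder) RunCreateContainerHooks() error {
	r.calls = append(r.calls, "hooks")
	return nil
}

func (r *recorder) BindMount(src, dst string) error {
	r.calls = append(r.calls, "bind "+src+" -> "+dst)
	return nil
}

func main() {
	r := &recorder{}
	// Placeholder paths, used only for illustration.
	if err := prepareRootfs(r, "/var/lib/example/root", "/proc/fs/root", true); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range r.calls {
		fmt.Println(c)
	}
}
```

The point of the sketch is that the hooks run after the mounts exist under the container rootfs and before that tree is made visible to the gofer.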

Result

I ran this on an H200 system and confirmed both nvidia-smi:

root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   27C    P0             76W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/#

And CUDA vectoradd:

lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

both work.

This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_VISIBLE_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.

LandonTClipp · 2026-04-28T19:38:02Z

@ayushr2 FYI. This should be better than my last attempt because this doesn't touch the old NVIDIA_VISIBLE_DEVICES codepath and relies on CDI spec files defining the hooks to run.

I still suspect that a config parameter that enables this behavior would be desired so you'll just need to let me know how you want to proceed so that this does not have backwards-incompatible changes with your system.

ayushr2 · 2026-05-02T20:21:55Z

> At this point in the codebase, we're already in the gofer's mount namespace, so not only would nvidia-ctk update-ldcache be doing something redundant, but it messes up paths inside of the new namespace.

I don't understand this part, or why the nvidia-ctk hook update-ldcache hook cannot be run normally in the gofer. This is what runc does as well: it runs the CreateContainer hooks from inside the container namespace but before pivot_rooting.

So running it normally and not special casing it like it is done now should work...

ayushr2 · 2026-05-02T20:23:08Z

> I still suspect that a config parameter that enables this behavior would be desired so you'll just need to let me know how you want to proceed so that this does not have backwards-incompatible changes with your system.

I think that if we take the approach of adding general support for running CreateContainer hooks in the gofer before pivot_root, your fix will be backwards compatible and will not need any flag gating. I don't think the GKE device plugin relies on CreateContainer hooks.

LandonTClipp · 2026-05-04T17:30:02Z

Here is some more context as to why the nvidia-ctk hook update-ldcache is breaking. nvidia-ctk does a pivot_root into the rootfs. This requires the rootfs to be a bind mount. Currently, when I try running the update-ldcache hook in the gofer using something like this in gofer_mount.go:

	// Set up /dev directory if needed.
	if devIoFD >= 0 {
		if err := SetupDev(spec, conf, root, procPath); err != nil {
			util.Fatalf("error setting up /dev: %v", err)
		}
	}

	if spec.Hooks != nil {
		state := specs.State{
			Version:     specs.Version,
			ID:          containerID,
			Status:      specs.StateCreating,
			Pid:         0,
			Bundle:      bundleDir,
			Annotations: spec.Annotations,
		}
		if err := container.ExecuteHooks(spec.Hooks.CreateContainer, state); err != nil {
			util.Fatalf("error executing CreateContainer hooks: %v", err)
		}
	}

This fails to run and the gofer emits this error:

W0504 16:35:52.912118       1 util.go:107] FATAL ERROR: error executing CreateContainer hooks: failure executing hook "/usr/bin/nvidia-ctk", err: exit status 1
stdout: 
stderr: 2026/05/04 16:35:52 Error updating ldcache: error running pivot_root: pivot_root .: invalid argument
exit status 1

error executing CreateContainer hooks: failure executing hook "/usr/bin/nvidia-ctk", err: exit status 1
stdout: 
stderr: 2026/05/04 16:35:52 Error updating ldcache: error running pivot_root: pivot_root .: invalid argument
exit status 1

This is because . (which is the container rootfs) is not a bind mount. Runc does the following:

1. Create a new container namespace with clone(CLONE_NEWNS | ...).
2. In prepareRootfs():
   - Self-bind-mount spec.Root.Path onto itself (necessary for pivot_root).
   - Apply all spec mounts.
   - Run CreateContainer hooks.
   - Call pivot_root.

The gVisor gofer, on the other hand, does not bind-mount spec.Root.Path in SetupMounts(). It attaches all CDI mounts as children of /proc/fs/root/<dest>. So even if you make spec.Root.Path a bind mount so that pivot_root succeeds, the CDI mounts are in /proc/fs/root, not in spec.Root.Path, and nvidia-ctk/ldconfig won't see the libraries.
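
For background, pivot_root(2) documents EINVAL for the case where new_root is not a mount point, which is consistent with the error above. A minimal sketch of how one might check whether a path is a mount point, by scanning text in the /proc/self/mountinfo format, is shown below. The isMountPoint helper and the sample mountinfo lines are hypothetical and are not part of gVisor or runc.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// isMountPoint reports whether path appears as a mount point in text that
// follows the /proc/self/mountinfo format, where the fifth whitespace-separated
// field of each line is the mount point. Escaped characters (such as \040 for
// a space) are not decoded in this sketch.
func isMountPoint(mountinfo, path string) bool {
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 5 && fields[4] == path {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical mountinfo contents before and after a self-bind-mount of
	// the container rootfs.
	before := "22 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw\n"
	after := before +
		"45 22 8:1 /var/lib/example/root /var/lib/example/root rw,relatime - ext4 /dev/sda1 rw\n"

	root := "/var/lib/example/root"
	fmt.Println("mount point before self-bind:", isMountPoint(before, root))
	fmt.Println("mount point after self-bind: ", isMountPoint(after, root))
}
```

In a real process the first argument would be the contents of /proc/self/mountinfo read from inside the relevant mount namespace.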

I ran an experiment to prove this. I self-bind-mounted spec.Root.Path just so that pivot_root succeeds:

	if spec.Hooks != nil {
		origRoot := spec.Root.Path
		if err := unix.Mount(origRoot, origRoot, "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
			log.Warningf("PoC: failed to self-bind-mount rootfs %q: %v", origRoot, err)
		} else {
			log.Infof("PoC: self-bind-mounted %q; pivot_root in hooks should now succeed", origRoot)
			defer unix.Unmount(origRoot, unix.MNT_DETACH)
		}

		state := specs.State{
			Version:     specs.Version,
			ID:          containerID,
			Status:      specs.StateCreating,
			Pid:         0,
			Bundle:      bundleDir,
			Annotations: spec.Annotations,
		}
		if err := container.ExecuteHooks(spec.Hooks.CreateContainer, state); err != nil {
			util.Fatalf("error executing CreateContainer hooks: %v", err)
		}
	}

My pod now starts successfully:

  Normal  Started    0s    kubelet            spec.containers{ubuntu}: Started container ubuntu
lclipp@CW-HP216DG9DT-L gvisor %

But nvidia-smi doesn't work:

root@debug-pod:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

And you can see the ldconfig step didn't produce the libcuda.so.1 -> libcuda.so.580.126.20 symlink:

root@debug-pod:/# ls -lah /usr/lib/x86_64-linux-gnu/libcuda*
lrwxrwxrwx 1 root root  12 May  4 19:13 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
-rw-r--r-- 1 root root 92M Apr 29 20:54 /usr/lib/x86_64-linux-gnu/libcuda.so.580.126.20
-rw-r--r-- 1 root root 10M Apr 29 20:54 /usr/lib/x86_64-linux-gnu/libcudadebugger.so.580.126.20

This is why my hacky method of extracting the --folder arguments and running ldconfig ourselves works: it removes the need for the pivot_root that nvidia-ctk performs, and we can point ldconfig at the /proc/fs/root path instead of spec.Root.Path.
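
To illustrate the kind of argument extraction this workaround involves, here is a minimal sketch. It is not the code from my earlier attempt: extractFolders is a hypothetical helper, and the argument list in main is made up for illustration, so a real CDI spec may pass different arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// extractFolders collects the value of every --folder flag in a hook's
// argument list, accepting both "--folder DIR" and "--folder=DIR".
func extractFolders(args []string) []string {
	var folders []string
	for i := 0; i < len(args); i++ {
		switch {
		case args[i] == "--folder" && i+1 < len(args):
			// The value is the next argument; consume it.
			i++
			folders = append(folders, args[i])
		case strings.HasPrefix(args[i], "--folder="):
			folders = append(folders, strings.TrimPrefix(args[i], "--folder="))
		}
	}
	return folders
}

func main() {
	// Hypothetical hook arguments, for illustration only.
	args := []string{
		"nvidia-ctk", "hook", "update-ldcache",
		"--folder", "/usr/lib/x86_64-linux-gnu",
		"--folder=/usr/lib64",
	}
	fmt.Println(extractFolders(args))
}
```

The downside of this approach is visible in the sketch itself: the runtime has to understand the argument format of one specific vendor hook.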

TL;DR

I think the core problem can be distilled down to the fact that gVisor bind-mounts CDI files at /proc/fs/root instead of spec.Root.Path when TestOnlyAllowRunAsCurrentUserWithoutChroot = false. Then when nvidia-ctk hook update-ldcache goes to pivot_root into spec.Root.Path, it fails because gVisor does not self-bind-mount, and even if you do this, spec.Root.Path does not have the client libraries bind mounted (they're under /proc/fs/root instead).

The root of the problem is a mismatch between what nvidia-ctk expects (a rootfs at spec.Root.Path with CDI mounts attached) and what gVisor provides (CDI mounts attached to /proc/fs/root, not to spec.Root.Path).

I hope that long explanation makes some sort of sense.

Options

There are a few routes we can take to fix this:

1. Run ldconfig manually like what I was doing before.
2. Modify the spec.State bundle being passed as stdin to the hooks to use /proc/fs/root for the root. This is messy and weird but I think it would work.
3. Change gVisor to do all CDI mounting on spec.Root.Path instead of /proc/fs/root.

Options 2 and 3 would preserve the OCI semantics and let nvidia-ctk run as normal. I'm not sure which of those two would be easier to implement. I do not understand why gVisor ever mounted devices onto /proc/fs/root.

ayushr2 · 2026-05-04T20:26:33Z

> I do not understand why gVisor ever mounted devices onto /proc/fs/root

The startup sequence of the gofer is fairly complex.

1. We want to run the gofer with minimal capabilities. This capability set does not include CAP_SYS_ADMIN.
2. After most of the gofer set-up work is done, we re-execute the gofer and drop all capabilities except for the ones linked above. See this.
3. pivot_root(2) requires CAP_SYS_ADMIN, so it needs to happen before we drop capabilities during re-exec. Hence we pivot_root in sandboxsetup.SetupRootFS() before re-exec and set --setup-root=false during the re-exec.
4. To re-exec, we need access to /proc/self/exe. We can't bind-mount the host /proc into the container rootfs; that would be a security risk. This is why the pivot_root(2) is done in /proc/fs. Inside that directory, /root is the rootfs and /proc is the host procfs. Later, we unmount the host procfs and chroot into /root.

But I wonder if what we can do is:

1. Do all work in the container rootfs.
2. Create 2 bind mounts in the container rootfs: /proc/self/fd and /proc/self/exe. This should be done after the other bind mounts are created.
3. pivot_root into the container rootfs.
4. Re-exec without capabilities.
5. Open /proc/self/fd as we do now.
6. Unmount both of these bind mounts.

RE option (2): The hook takes in the spec.State bundle. Do you know how the nvidia hook figures out the container root from there? The ldconfig command expects a container-root flag: https://github.com/NVIDIA/nvidia-container-toolkit/blob/3cfea27c9a7fb47af2d9607e2f661fefd67c0ab3/internal/ldconfig/ldconfig.go#L99. Who is setting this? Is this already present in the OCI spec's hook arguments? And does it point to spec.Root.Path?

shayonj · 2026-05-04T20:40:53Z

Not to add too much noise here, but I was wondering if there’s a way to avoid special-casing ldconfig. Given that nvidia-cdi-hook gets the container root by reading state.Bundle/config.json and using root.path, could we instead make the assembled rootfs visible at spec.Root.Path before running CreateContainer hooks?

If we self-bind spec.Root.Path, apply the CDI mounts and /dev setup there, run the hooks, and then recursively bind that prepared tree to /proc/fs/root before the existing gVisor pivot_root, I think the NVIDIA hooks would see the same root they expect from runc while preserving the current gofer reexec flow. That might let this stay as generic CreateContainer hook support without parsing NVIDIA hook args or running ldconfig ourselves. Def feel free to lmk if I missed something :D

ayushr2 · 2026-05-04T20:46:23Z

> Not to add too much noise here, but I was wondering if there’s a way to avoid special-casing ldconfig. Given that nvidia-cdi-hook gets the container root by reading state.Bundle/config.json and using root.path, could we instead make the assembled rootfs visible at spec.Root.Path before running CreateContainer hooks?

Yes, this is what Landon is proposing in option (2).

> If we self-bind spec.Root.Path, apply the CDI mounts and /dev setup there, run the hooks, and then recursively bind that prepared tree to /proc/fs/root before the existing gVisor pivot_root, I think the NVIDIA hooks would see the same root they expect from runc while preserving the current gofer reexec flow. That might let this stay as generic CreateContainer hook support without parsing NVIDIA hook args or running ldconfig ourselves. Def feel free to lmk if I missed something :D

Yes, this is possible too.

LandonTClipp · 2026-05-05T19:19:14Z

> Do you know how the nvidia hook figures out the container root from there?

Yes. The specs.State has a Bundle attribute, which is the path to a directory containing further configuration, including the rootfs path. That rootfs path is what nvidia-ctk pivot_roots into.

I will look into the route you described. I think that's the correct way of doing it since runc also operates on the rootfs according to the bundle.
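
As a rough illustration of that lookup (a sketch based on the OCI runtime spec, not the actual nvidia-ctk code), a hook can read the state JSON from stdin, take its bundle field, and then resolve root.path from the bundle's config.json, where a relative root.path is interpreted relative to the bundle directory. The type and function names below are hypothetical, and the JSON documents in main are made-up examples.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"path/filepath"
)

// hookState holds the only field of the OCI state that this sketch needs.
type hookState struct {
	Bundle string `json:"bundle"`
}

// bundleConfig holds the only field of config.json that this sketch needs.
type bundleConfig struct {
	Root struct {
		Path string `json:"path"`
	} `json:"root"`
}

// bundleFromState extracts the bundle directory from the state JSON that a
// runtime passes to a hook on stdin.
func bundleFromState(stateJSON []byte) (string, error) {
	var st hookState
	if err := json.Unmarshal(stateJSON, &st); err != nil {
		return "", fmt.Errorf("parsing state: %w", err)
	}
	if st.Bundle == "" {
		return "", errors.New("state has no bundle")
	}
	return st.Bundle, nil
}

// containerRoot resolves root.path from the contents of config.json. A
// relative path is joined onto the bundle directory.
func containerRoot(bundle string, configJSON []byte) (string, error) {
	var cfg bundleConfig
	if err := json.Unmarshal(configJSON, &cfg); err != nil {
		return "", fmt.Errorf("parsing config.json: %w", err)
	}
	if cfg.Root.Path == "" {
		return "", errors.New("config.json has no root.path")
	}
	if filepath.IsAbs(cfg.Root.Path) {
		return filepath.Clean(cfg.Root.Path), nil
	}
	return filepath.Join(bundle, cfg.Root.Path), nil
}

func main() {
	// Made-up inputs; a real hook would read the state from os.Stdin and
	// config.json from the bundle directory on disk.
	state := []byte(`{"ociVersion":"1.0.2","id":"example","status":"creating","bundle":"/run/example/bundle"}`)
	config := []byte(`{"root":{"path":"rootfs"}}`)

	bundle, err := bundleFromState(state)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	root, err := containerRoot(bundle, config)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("hook would operate on:", root)
}
```

This is why the mounts need to be visible under spec.Root.Path: a hook that resolves the root this way never looks at /proc/fs/root.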

ayushr2 · 2026-05-05T19:34:17Z

I discussed this further with @avagin. He pointed out a legitimate use case: the container rootfs provided in the OCI spec is read-only from the beginning and the spec doesn't contain any mounts. In this case, the current implementation does not create any new directories or files inside the container rootfs, but my proposal would fail while creating the /proc/self/fd and /proc/self/exe bind mounts inside it.

What @shayonj described in the second paragraph of his comment seems more correct than my proposal. Maybe that is the path to pursue. @avagin thoughts?

@shayonj

This implements the idea @shayonj suggested in google#13034 (comment). We do all of the CDI mounts and createContainer hooks inside of spec.Root.Path instead of /proc/fs/root. This should allow hooks which need to pivot_root(2) to successfully run AFTER the CDI mounts have been performed.

LandonTClipp · 2026-05-05T21:07:21Z

I implemented @shayonj's idea but I'm on a plane and transferring binaries around is hard, so I'll do some final testing maybe tomorrow. I think it will work.

LandonTClipp · 2026-05-06T18:12:47Z

@ayushr2 @shayonj I confirmed the current changes work:

root@debug-pod:/# nvidia-smi
Wed May  6 20:09:03 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   30C    P0             77W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

@shayonj's idea turned out to be way simpler than I expected. Please take a look, I think this might be the one!

shayonj

Looking good, just some minor comments from my review if it's useful to you.

LandonTClipp · 2026-05-06T21:16:56Z

@shayonj please see the most recent set of changes. Highlights:

- I split out two variables: containerRootFs and goferRootFs. In the LisaFS case, containerRootFs = spec.Root.Path and goferRootFs = /proc/fs/root. In the non-LisaFS case, both are set to /proc/fs/root.
- Most/all rootfs preparation is done on the containerRootFs variable.
- In non-LisaFS cases, createContainer hooks are not run and a warning is logged instead.

I don't yet know how we can solve createContainer hooks on a read-only rootfs, and I don't have a system on which I can test this, so I'm happy with implementing it only for LisaFS.
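
The selection described in the highlights can be summarized with a small sketch. The rootfsLayout type and chooseLayout function are hypothetical names used only for illustration. They are not the identifiers in gofer_mount.go.

```go
package main

import "fmt"

// goferRoot is the path the gofer pivots around, as described in this thread.
const goferRoot = "/proc/fs/root"

// rootfsLayout captures which paths are used for preparation and whether
// createContainer hooks are run. Hypothetical type for illustration.
type rootfsLayout struct {
	ContainerRootFs string
	GoferRootFs     string
	RunHooks        bool
}

// chooseLayout mirrors the rule above: with LisaFS, preparation happens on
// spec.Root.Path and hooks run; otherwise both paths are the gofer root and
// hooks are skipped.
func chooseLayout(specRootPath string, lisafs bool) rootfsLayout {
	if lisafs {
		return rootfsLayout{
			ContainerRootFs: specRootPath,
			GoferRootFs:     goferRoot,
			RunHooks:        true,
		}
	}
	return rootfsLayout{
		ContainerRootFs: goferRoot,
		GoferRootFs:     goferRoot,
		RunHooks:        false,
	}
}

func main() {
	// Placeholder rootfs path, for illustration only.
	for _, lisafs := range []bool{true, false} {
		l := chooseLayout("/var/lib/example/root", lisafs)
		if !l.RunHooks {
			fmt.Println("warning: createContainer hooks are skipped in this mode")
		}
		fmt.Printf("lisafs=%v containerRootFs=%s goferRootFs=%s\n", lisafs, l.ContainerRootFs, l.GoferRootFs)
	}
}
```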

shayonj

Looking good to me, just some minor comments. Perhaps the gVisor team can help with spotting anything else and seeing it through 🙏🏾

great work!

LandonTClipp · 2026-05-11T21:08:20Z

@ayushr2 please take another look at your earliest convenience.

ayushr2 · 2026-05-11T21:26:50Z

Reviewing! Could you please squash the commits? Copybara doesn't have the ability to squash and merge yet, so all commits from the PR are applied. We want to keep the master branch clean.

ayushr2 · 2026-05-13T06:13:07Z

Pulling this in and running all tests.

LandonTClipp · 2026-05-13T15:39:53Z

With the latest changes, I'm able to confirm this works on our systems:

lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
lclipp@CW-HP216DG9DT-L gvisor % k exec -it debug-pod -- /bin/bash 
root@debug-pod:/# nvidia-smi
Wed May 13 17:39:16 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   31C    P0             78W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/#

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support f3bd6c0 PiperOrigin-RevId: 914666567

Signed-off-by: LandonTClipp <lclipp@coreweave.com>

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support ae18a84 PiperOrigin-RevId: 914666567

ayushr2 · 2026-05-17T05:17:32Z

FYI, I am submitting this as #13202. @relkochta pointed out an issue: we were using unix.Mount() instead of specutils.SafeMount() for mounting the container filesystem. Other than that, I made some cosmetic changes (wrapping comments to 80 chars).

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support ae18a84 PiperOrigin-RevId: 916649544
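The ordering described under "How it Works" can be sketched roughly as follows. This is a minimal model, not the code in gofer_mount.go: the `rootOps` interface, `prepareGoferRoot`, `recorder`, and `trace` are hypothetical names, the paths are placeholders, and the privileged operations are replaced by a fake that only records call order so the sketch runs without root.

```go
package main

import (
	"fmt"
	"strings"
)

// rootOps abstracts the privileged operations so the ordering can be
// exercised without root. Every name here is hypothetical, not part of runsc.
type rootOps interface {
	SetupDevices(containerRootFs string) error
	RunCreateContainerHooks(containerRootFs string) error
	BindMount(src, dst string) error
	PivotRoot(newRoot string) error
}

// prepareGoferRoot applies the ordering described in the commit message.
func prepareGoferRoot(ops rootOps, containerRootFs, goferRootFs string, rootfsWritable bool) error {
	// 1. All device and library setup targets containerRootFs.
	if err := ops.SetupDevices(containerRootFs); err != nil {
		return fmt.Errorf("setting up devices: %w", err)
	}
	// 2. Hooks run only when the rootfs is writable (so not on EROFS, say).
	if rootfsWritable {
		if err := ops.RunCreateContainerHooks(containerRootFs); err != nil {
			return fmt.Errorf("running createContainer hooks: %w", err)
		}
	}
	// 3. Only now is the prepared tree exposed under goferRootFs.
	if err := ops.BindMount(containerRootFs, goferRootFs); err != nil {
		return fmt.Errorf("bind-mounting rootfs: %w", err)
	}
	// 4. The gofer pivots into goferRootFs as before.
	if err := ops.PivotRoot(goferRootFs); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	return nil
}

// recorder is a fake rootOps that only records the order of calls.
type recorder struct{ steps []string }

func (r *recorder) SetupDevices(string) error {
	r.steps = append(r.steps, "devices")
	return nil
}

func (r *recorder) RunCreateContainerHooks(string) error {
	r.steps = append(r.steps, "hooks")
	return nil
}

func (r *recorder) BindMount(string, string) error {
	r.steps = append(r.steps, "bind")
	return nil
}

func (r *recorder) PivotRoot(string) error {
	r.steps = append(r.steps, "pivot")
	return nil
}

// trace returns the recorded call order for a given rootfs writability.
func trace(rootfsWritable bool) string {
	r := &recorder{}
	// Both paths are placeholders chosen for illustration.
	err := prepareGoferRoot(r, "/path/to/container/root", "/path/to/gofer/root", rootfsWritable)
	if err != nil {
		return "error: " + err.Error()
	}
	return strings.Join(r.steps, ",")
}

func main() {
	fmt.Println("writable rootfs:", trace(true))
	fmt.Println("read-only rootfs:", trace(false))
}
```

The point of the model is that hooks and the gofer see the same tree: the hooks run against containerRootFs before it is bind-mounted, so whatever they create is already present when the gofer pivots.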

LandonTClipp · 2026-05-19T13:56:02Z

Thanks for working with me on this!

Add a section that describes support for CDI compatibility as it relates to NVIDIA's k8s-device-plugin.

Refs: #13034
FUTURE_COPYBARA_INTEGRATE_REVIEW=#13233 from LandonTClipp:k8s-device-plugin-docs 4ce7849
PiperOrigin-RevId: 919264193

a7i · 2026-05-28T00:05:25Z

Hey @LandonTClipp Thanks for adding this feature.

Did you observe createContainer hook disable-device-node-modification failing?
See the PR for background: #13284

LandonTClipp · 2026-05-28T00:08:14Z

I have not observed this. I don't recall this hook existing in my CDI spec. I'll take a look at your PR tomorrow.
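One way to check whether a given hook is present is to list the createContainer hooks in the generated CDI spec file. The sketch below parses a small illustrative spec, assuming the usual CDI JSON layout (`containerEdits.hooks[]` with `hookName`, `path`, and `args`, at both the top level and per device). The sample document, the `example.com/gpu` kind, the hook paths, and the helper names are made up for illustration and are not real plugin output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Minimal subset of the CDI spec JSON layout needed to list hooks.
type hook struct {
	HookName string   `json:"hookName"`
	Path     string   `json:"path"`
	Args     []string `json:"args"`
}

type containerEdits struct {
	Hooks []hook `json:"hooks"`
}

type device struct {
	Name           string         `json:"name"`
	ContainerEdits containerEdits `json:"containerEdits"`
}

type cdiSpec struct {
	Kind           string         `json:"kind"`
	Devices        []device       `json:"devices"`
	ContainerEdits containerEdits `json:"containerEdits"`
}

// sampleSpec is an illustrative fragment, not output from a real plugin.
const sampleSpec = `{
  "cdiVersion": "0.5.0",
  "kind": "example.com/gpu",
  "containerEdits": {
    "hooks": [
      {"hookName": "createContainer", "path": "/usr/bin/nvidia-ctk",
       "args": ["nvidia-ctk", "hook", "update-ldcache"]},
      {"hookName": "createRuntime", "path": "/usr/bin/example-hook",
       "args": ["example-hook"]}
    ]
  },
  "devices": [
    {"name": "0", "containerEdits": {"hooks": [
      {"hookName": "createContainer", "path": "/usr/bin/nvidia-ctk",
       "args": ["nvidia-ctk", "hook", "disable-device-node-modification"]}
    ]}}
  ]
}`

// createContainerHooks returns the command line of every createContainer
// hook, spec-wide edits first, then per-device edits.
func createContainerHooks(data []byte) ([]string, error) {
	var spec cdiSpec
	if err := json.Unmarshal(data, &spec); err != nil {
		return nil, fmt.Errorf("parsing CDI spec: %w", err)
	}
	edits := []containerEdits{spec.ContainerEdits}
	for _, d := range spec.Devices {
		edits = append(edits, d.ContainerEdits)
	}
	var cmds []string
	for _, e := range edits {
		for _, h := range e.Hooks {
			if h.HookName == "createContainer" {
				cmds = append(cmds, strings.Join(h.Args, " "))
			}
		}
	}
	return cmds, nil
}

// hasSubcommand reports whether any command line contains name as a word.
func hasSubcommand(cmds []string, name string) bool {
	for _, c := range cmds {
		for _, word := range strings.Fields(c) {
			if word == name {
				return true
			}
		}
	}
	return false
}

func main() {
	cmds, err := createContainerHooks([]byte(sampleSpec))
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, c := range cmds {
		fmt.Println(c)
	}
	fmt.Println("has disable-device-node-modification:",
		hasSubcommand(cmds, "disable-device-node-modification"))
}
```

To inspect a real system, the same function could be pointed at the contents of the spec file the plugin writes instead of `sampleSpec`.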

LandonTClipp force-pushed the k8s-device-plugin-support branch from 3bbfaaa to 86b7d45 Compare April 28, 2026 19:36

LandonTClipp force-pushed the k8s-device-plugin-support branch from 86b7d45 to f01a0fb Compare April 28, 2026 19:43

LandonTClipp mentioned this pull request Apr 29, 2026

feat: Support NVIDIA k8s-device-plugin with CDI #13024

Closed

ayushr2 mentioned this pull request May 2, 2026

runsc: NVIDIA CDI / libcuda not applied for io.containerd.runsc.v1 (direct shim) #13060

Closed

ayushr2 reviewed May 2, 2026

Comment thread runsc/container/hook.go Outdated

Comment thread runsc/container/hook.go Outdated

Comment thread runsc/container/hook.go Outdated

Comment thread runsc/container/container.go Outdated

Comment thread runsc/container/hook.go Outdated

ayushr2 reviewed May 2, 2026

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated

LandonTClipp force-pushed the k8s-device-plugin-support branch from 10e0830 to 4c3968d Compare May 4, 2026 16:33

shayonj reviewed May 6, 2026

Comment thread runsc/cmd/sandboxsetup/BUILD Outdated

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated

LandonTClipp force-pushed the k8s-device-plugin-support branch from aeec302 to c262c0b Compare May 6, 2026 21:08

LandonTClipp requested review from ayushr2 and shayonj May 6, 2026 21:21

shayonj reviewed May 10, 2026

Comment thread runsc/cmd/sandboxsetup/BUILD Outdated

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated

LandonTClipp force-pushed the k8s-device-plugin-support branch from e2ae571 to 64680ff Compare May 11, 2026 21:37

ayushr2 reviewed May 11, 2026

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated

Comment thread runsc/cmd/gofer.go Outdated

ayushr2 reviewed May 12, 2026

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go

copybara-service Bot mentioned this pull request May 13, 2026

feat: Support running createContainer hooks in CDI spec #13162

Closed

ayushr2 reviewed May 13, 2026

Comment thread runsc/container/container.go Outdated

LandonTClipp force-pushed the k8s-device-plugin-support branch 3 times, most recently from a489149 to f3bd6c0 Compare May 13, 2026 15:36

LandonTClipp force-pushed the k8s-device-plugin-support branch from f3bd6c0 to ae18a84 Compare May 13, 2026 19:00

copybara-service Bot mentioned this pull request May 17, 2026

feat: Support running createContainer hooks in CDI spec #13202

Merged

copybara-service Bot merged commit a7924c4 into google:master May 18, 2026
2 of 3 checks passed

LandonTClipp mentioned this pull request May 21, 2026

chore(docs): Update docs about CDI compatibility #13233

Merged

copybara-service Bot mentioned this pull request May 21, 2026

chore(docs): Update docs about CDI compatibility #13236

Merged

ayushr2 mentioned this pull request May 21, 2026

release-20260520.0: runsc gofer fails rootfs self-bind when bundle path uses /var/run symlink to /run #13463 #13238

Closed

This was referenced May 26, 2026

createContainer hook disable-device-node-modification fails: /proc/driver/nvidia/params not available in containerRootFs (gvisor#13034 follow-up) #13283

Closed

runsc: bind-mount host /proc/driver/nvidia for CDI createContainer hooks #13284

Closed
