Describe the bug
Commit 54befe3 ("Add sysfs mount to enable auto-onlining of memory") adds an unconditional hostPath volume for /sys/devices/system/memory/auto_online_blocks to the driver daemonset (0500_daemonset.yaml).
This path only exists on kernels with CONFIG_MEMORY_HOTPLUG=y. Since it lives on sysfs, it cannot be created by userspace. Consequently the driver pod fails to start on kernels that don't set CONFIG_MEMORY_HOTPLUG.
Additionally, the driver pod runs as privileged and already has full access to the host's /sys/devices/system/.
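Because the privileged pod already sees the host's /sys, detecting the feature at runtime is trivial. A minimal sketch (the helper name `has_auto_online` is mine, not gpu-operator code; the sysfs path is the one from this report):

```shell
# Sketch: detect whether the host kernel exposes auto_online_blocks.
# The sysfs file only exists when CONFIG_MEMORY_HOTPLUG=y; a privileged
# entrypoint could test for it instead of relying on a hostPath mount.
has_auto_online() {
  # $1 (optional): path override, useful for exercising the helper itself
  [ -e "${1:-/sys/devices/system/memory/auto_online_blocks}" ]
}

if has_auto_online; then
  echo "memory auto-onlining available"
else
  echo "CONFIG_MEMORY_HOTPLUG not set; auto-onlining unavailable"
fi
```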
Further, the NVIDIA driver treats CONFIG_MEMORY_HOTPLUG as optional. The build system (kernel/conftest.sh) probes for add_memory_driver_managed() at compile time:
#if defined(CONFIG_MEMORY_HOTPLUG)
add_memory_driver_managed();
#endif
On kernels without CONFIG_MEMORY_HOTPLUG, the conftest produces:
#undef NV_ADD_MEMORY_DRIVER_MANAGED_PRESENT
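The conftest mechanism boils down to "compile a probe; success emits a #define, failure emits an #undef". A self-contained sketch of that pattern (probing a libc function rather than a kernel symbol so it runs anywhere with a C compiler; the `probe` helper name is mine, the real kernel/conftest.sh compiles against kernel headers):

```shell
# Conftest-style probe: try to compile a tiny C snippet and emit a
# #define or #undef depending on the result.
probe() {
  name="$1"; snippet="$2"
  dir=$(mktemp -d)
  printf '%s\n' "$snippet" > "$dir/probe.c"
  if cc -c "$dir/probe.c" -o "$dir/probe.o" 2>/dev/null; then
    echo "#define NV_${name}_PRESENT"
  else
    echo "#undef NV_${name}_PRESENT"
  fi
  rm -rf "$dir"
}

# A snippet that compiles -> the macro is defined
probe STRLEN '#include <string.h>
unsigned long f(void) { return strlen("x"); }'

# A snippet with a syntax error -> the macro is undefined
probe BROKEN 'int f(void) { return 1 +; }'
```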
To Reproduce
- Deploy gpu-operator on a node with a kernel built without CONFIG_MEMORY_HOTPLUG=y
- Observe that the driver pod fails to start with a mount error for /sys/devices/system/memory/auto_online_blocks
Expected behavior
The driver pod should start successfully. The driver itself handles the absence of CONFIG_MEMORY_HOTPLUG gracefully; the operator should not impose a stricter requirement.
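One way the operator could mirror the driver's behavior, sketched under the assumption that the privileged driver pod keeps full access to the host's /sys (the `enable_auto_online` function name and messages are mine, not gpu-operator code):

```shell
# Guarded enable: write "online" only when the kernel exposes the knob,
# and skip quietly otherwise - mirroring how the driver itself treats
# CONFIG_MEMORY_HOTPLUG as optional.
enable_auto_online() {
  aob="${1:-/sys/devices/system/memory/auto_online_blocks}"
  if [ -w "$aob" ]; then
    echo online > "$aob"
    echo "auto-onlining of memory enabled"
  else
    echo "memory hotplug not supported by kernel; skipping"
  fi
}

# Demo against a path that does not exist (the real sysfs knob would
# only be written on a CONFIG_MEMORY_HOTPLUG=y kernel):
enable_auto_online /nonexistent/auto_online_blocks
# -> memory hotplug not supported by kernel; skipping
```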
Environment (please provide the following information):
- GPU Operator Version: any version containing commit 54befe3
- OS: any
- Kernel Version: any kernel with '# CONFIG_MEMORY_HOTPLUG is not set'
- Container Runtime Version: any
- Kubernetes Distro and Version: starlingx
Information to attach (optional if deemed irrelevant)
kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE

[sysadmin@controller-0 4(keystone_admin)]$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-65kwb 0/1 Init:0/1 0 11m
gpu-operator-d4589d596-7b6fw 1/1 Running 0 12m
nvidia-dcgm-exporter-5j8fd 0/1 Init:0/1 0 11m
nvidia-dcgm-pv7gz 0/1 Init:0/1 0 11m
nvidia-device-plugin-daemonset-tp9rw 0/1 Init:0/1 0 11m
nvidia-driver-daemonset-9ckgb 0/1 CreateContainerError 0 12m
nvidia-operator-validator-mxb96 0/1 Init:0/4 0 11m
kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE

[sysadmin@controller-0 4(keystone_admin)]$ kubectl get ds -n gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 13m
nvidia-dcgm 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm=true 13m
nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 13m
nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 13m
nvidia-device-plugin-mps-control-daemon 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true 13m
nvidia-driver-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.driver=true 13m
nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 13m
nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 13m
If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME

[sysadmin@controller-0 4(keystone_admin)]$ kubectl describe pods -n gpu-operator nvidia-driver-daemonset-9ckgb
Name: nvidia-driver-daemonset-9ckgb
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-driver
Node: controller-0/192.168.206.1
Start Time: Thu, 14 May 2026 20:01:38 +0000
Labels: app=nvidia-driver-daemonset
app.kubernetes.io/component=nvidia-driver
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=6bf9c6f588
helm.sh/chart=gpu-operator-v25.3.2
nvidia.com/precompiled=false
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: c8fd2149915fe58452a6cc1f468f56150f8c70a33859b1fe50a7ef38bb13c342
cni.projectcalico.org/podIP: 172.19.192.121/32
cni.projectcalico.org/podIPs: 172.19.192.121/32
k8s.v1.cni.cncf.io/network-status:
[{
"name": "chain",
"interface": "eth0",
"ips": [
"172.19.192.121"
],
"mac": "52:a3:19:67:f6:4f",
"default": true,
"dns": {}
}]
kubectl.kubernetes.io/default-container: nvidia-driver-ctr
Status: Pending
IP: 172.19.192.121
IPs:
IP: 172.19.192.121
Controlled By: DaemonSet/nvidia-driver-daemonset
Init Containers:
k8s-driver-manager:
Container ID: containerd://961a22a434296a67d4cc1cab6c585f95922166875885d2ba6fc2bc0ec12be285
Image: registry.local:9001/nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.8.0
Image ID: registry.local:9001/nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:71f2dde3155a2b4e0cde6bed433fe51ceea476f21fc8d80da3bb51a1956803c9
Port:
Host Port:
Command:
driver-manager
Args:
uninstall_driver
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 14 May 2026 20:01:39 +0000
Finished: Thu, 14 May 2026 20:02:12 +0000
Ready: True
Restart Count: 0
Environment:
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
ENABLE_GPU_POD_EVICTION: true
ENABLE_AUTO_DRAIN: false
DRAIN_USE_FORCE: false
DRAIN_POD_SELECTOR_LABEL:
DRAIN_TIMEOUT_SECONDS: 0s
DRAIN_DELETE_EMPTYDIR_DATA: false
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/nvidia from run-nvidia (rw)
/sys from host-sys (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7r6k5 (ro)
Containers:
nvidia-driver-ctr:
Container ID:
Image: registry.local:9001/wrcp/nvidia-driver:v570.172.08_6.12.0-1-amd64-wrcp2603_0000
Image ID:
Port:
Host Port:
Command:
nvidia-driver
Args:
init
State: Waiting
Reason: CreateContainerError
Ready: False
Restart Count: 0
Startup: exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
Environment:
NODE_NAME: (v1:spec.nodeName)
NODE_IP: (v1:status.hostIP)
KERNEL_MODULE_TYPE: auto
Mounts:
/dev/log from dev-log (rw)
/host-etc/os-release from host-os-release (ro)
/lib/firmware from nv-firmware (rw)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/run/nvidia from run-nvidia (rw)
/run/nvidia-fabricmanager from run-nvidia-fabricmanager (rw)
/run/nvidia-topologyd from run-nvidia-topologyd (rw)
/sys/devices/system/memory/auto_online_blocks from sysfs-memory-online (rw)
/sys/module/firmware_class/parameters/path from firmware-search-path (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7r6k5 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
var-log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
dev-log:
Type: HostPath (bare host directory volume)
Path: /dev/log
HostPathType:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
run-nvidia-fabricmanager:
Type: HostPath (bare host directory volume)
Path: /run/nvidia-fabricmanager
HostPathType: DirectoryOrCreate
run-nvidia-topologyd:
Type: HostPath (bare host directory volume)
Path: /run/nvidia-topologyd
HostPathType: DirectoryOrCreate
mlnx-ofed-usr-src:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers/usr/src
HostPathType: DirectoryOrCreate
run-mellanox-drivers:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
firmware-search-path:
Type: HostPath (bare host directory volume)
Path: /sys/module/firmware_class/parameters/path
HostPathType:
sysfs-memory-online:
Type: HostPath (bare host directory volume)
Path: /sys/devices/system/memory/auto_online_blocks
HostPathType:
nv-firmware:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver/lib/firmware
HostPathType: DirectoryOrCreate
kube-api-access-7r6k5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.driver=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
Normal Scheduled 14m default-scheduler Successfully assigned gpu-operator/nvidia-driver-daemonset-9ckgb to controller-0
Normal AddedInterface 14m multus Add eth0 [172.19.192.121/32] from chain
Normal Pulled 14m kubelet Container image "registry.local:9001/nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.8.0" already present on machine
Normal Created 14m kubelet Created container: k8s-driver-manager
Normal Started 14m kubelet Started container k8s-driver-manager
Warning Failed 14m kubelet Error: failed to generate container "6fe5f76bc265bd470d6eafd6f37940800b44c362940c217347e6448bb83db489" spec: failed to generate spec: failed to mkdir "/sys/devices/system/memory/auto_online_blocks": mkdir /sys/devices/system/memory: operation not permitted
Warning Failed 14m kubelet Error: failed to generate container "0d65c59c4adc8cab7407265bb9a28b561c44f79f1292bbeeab67d98146818ec9" spec: failed to generate spec: failed to mkdir "/sys/devices/system/memory/auto_online_blocks": mkdir /sys/devices/system/memory: operation not permitted
Warning Failed 13m kubelet Error: failed to generate container "2485089e13e9bfb4f75694a62364b7d430211b1ed6a7d692207924781a08682b" spec: failed to generate spec: failed to mkdir "/sys/devices/system/memory/auto_online_blocks": mkdir /sys/devices/system/memory: operation not permitted
Warning Failed 13m kubelet Error: failed to generate container "484fb6fda5d12112c8fa81bf02fabebeef5c242c57fdde08f9865bb962ce0cf4" spec: failed to generate spec: failed to mkdir "/sys/devices/system/memory/auto_online_blocks": mkdir /sys/devices/system/memory: operation not permitted
Warning Failed 13m kubelet Error: failed to generate container "f290f75d4466ce14f3037db3b478a507a4d731bceb31d4ff38cbfd8586744869" spec: failed to generate spec: failed to mkdir "/sys/devices/system/memory/auto_online_blocks": mkdir /sys/devices/system/memory: operation not permitted
Warning Failed 13m kubelet Error: failed to generate container "a363d3af9497e65419c54d36fa4ab943499c3f989b55fb0a145fa73a0f662499" spec: failed to generate spec: failed to mkdir "/sys/devices/system/memory/auto_online_blocks": mkdir /sys/devices/system/memory: operation not permitted
Warning Failed 12m kubelet Error: failed to generate container "c982ea74d7737966fe26e964c6008f1f50f28dc5facb5acf88aa68866377f012" spec: failed to generate spec: failed to mkdir "/sys/devices/system/memory/auto_online_blocks": mkdir /sys/devices/system/memory: operation not permitted
Warning Failed 12m kubelet Error: failed to generate container "1a0845dbeef54981148347cd4af01f3dd4aae1b5d7c2262f867b102e43708e4b" spec: failed to generate spec: failed to mkdir "/sys/devices/system/memory/auto_online_blocks": mkdir /sys/devices/system/memory: operation not permitted
Warning Failed 12m kubelet Error: failed to generate container "a1f241e1fefc6ffebded0d69b7eb5eea7e5cb1e31756bae0465dbf33804e35b3" spec: failed to generate spec: failed to mkdir "/sys/devices/system/memory/auto_online_blocks": mkdir /sys/devices/system/memory: operation not permitted
Normal Pulled 4m47s (x44 over 14m) kubelet Container image "registry.local:9001/wrcp/nvidia-driver:v570.172.08_6.12.0-1-amd64-wrcp2603_0000" already present on machine
Warning Failed 4m11s (x38 over 12m) kubelet (combined from similar events): Error: failed to generate container "9df66c4944d2d7a7e14df2902e95bdf3eaf3a43a5ff20c8ecf6ec7b73e9e9643" spec: failed to generate spec: failed to mkdir "/sys/devices/system/memory/auto_online_blocks": mkdir /sys/devices/system/memory: operation not permitted