Place inner container cgroups under the pod's cgroup tree#169
Place inner container cgroups under the pod's cgroup tree#169jamestoyer wants to merge 1 commit intocoder:mainfrom
Conversation
|
Hey there, I've tagged another coder engineer to do the approval, but I did read through and have a agent help review. OverviewThis PR wraps The approach is sound — it mirrors moby's canonical CI Status
Issues Found1. CRITICAL:
|
|
I believe this issue with |
Closes #168
Summary
Wrap the dockerd invocation with
unshare --cgroup, a freshcgroup2remount, and the cgroup-delegation logic from moby'shack/dindscript. After the wrapper runs, dockerd's view of/sys/fs/cgroupis rooted at the envbox container's own cgroup, so all inner container cgroups are created as descendants of that scope on the host. Host-level cgroup-aware observability tools (Tetragon, Falco, custom eBPF agents) can then attribute processes inside inner containers to the parent pod.This is the approach Isovalent (Cilium/Tetragon vendor) recommended after testing it themselves — see moby/moby#45378 (comment).
Changes
cli/docker.gowrapDockerdCmd(dargs)helper that returns theunshare-prefixed command + args. The shell snippet does:umount /sys/fs/cgroup+mount -t cgroup2 cgroup /sys/fs/cgroup— fresh cgroup2 mount rooted at the new cgroup namespace's root (the envbox container's cgroup on the host)mkdir /sys/fs/cgroup/init+ retry loop that moves all processes viaxargs -rn1 < cgroup.procs > init/cgroup.procsand enables every available controller incgroup.subtree_control— needed because cgroupv2 forbids a cgroup from having both processes andsubtree_controlsetexecinto dockerd with the original argsdockerd(initial start and the twoIsNoSpaceErrrecovery paths) to go through the wrapper.xunix/sys.goreadCPUQuotaCGroupV2: if/sys/fs/cgroup/<self>/cpu.maxdoesn't exist (because the cgroup remount above has reparented the mount root), read/sys/fs/cgroup/cpu.maxdirectly. The mount root IS the current cgroup in that case so the values are equivalent.Why no
--mounton unshare?unshare --cgroupalone leaves the mount namespace shared with the parent envbox process, so theumount+mount -t cgroup2operations leak back to envbox. We tried isolating with--mount:--propagation private→ breaks sysbox-fs: its per-container mounts under/var/lib/sysboxfs/<id>/stop being visible tosysbox-runcin the dockerd namespace and inner-container creation fails.--propagation slave→ same failure: the parent mount points areprivateby default, so nothing propagates to the slave child.Making
/var/lib/sysboxfs/(or/)rsharedin envbox's mount namespace before the unshare would work but is more invasive. The pragmatic compromise: accept the mount-namespace leak and handle the one observable side effect (the CPU quota read) via thexunix/sys.gofallback. CPU enforcement is unaffected — the inner workspace container's cgroup is now a descendant of the pod's cgroup tree, so the pod's resource limits apply transitively.Verification
Tested against:
--enable-cgtrackerid=trueEnd-to-end check: after a workspace start, running
whoamifrom a Coder terminal (inside the inner workspace container) now produces a Tetragonprocess_execevent with a fullpodfield — namespace, name, UID, container ID, image, security context, and pod labels. Before this change the same event had nopodfield at all.Workspace startup is otherwise unchanged. Sysbox-fs continues to provide its
/sysand/procvirtualization correctly. The "no internal processes" cgroupv2 rule that previously blocked nested container creation under a privileged pod's.scopecgroup is now satisfied because all envbox/sysbox processes are first migrated into the/initsibling cgroup.Backwards compatibility
The wrapper is a no-op for cgroup configurations the delegation block doesn't apply to:
if [ -f /sys/fs/cgroup/cgroup.controllers ]guard skips the delegation block; only the umount/mount happens.unshare --cgroup(< 4.6, very rare today): would surface as a startup error.For cgroup v2 hosts the only behavioural change is the cgroup placement of inner containers — which is what we want.
References
hack/dindscript (cgroup delegation logic borrowed from there)cgtracker(consumes the new hierarchy on the agent side)