Place inner container cgroups under the pod's cgroup tree #169

Open
jamestoyer wants to merge 1 commit into coder:main from jamestoyer:feat/dind-cgroup-attribution

Conversation

@jamestoyer

Closes #168

Summary

Wrap the dockerd invocation with unshare --cgroup, a fresh cgroup2 remount, and the cgroup-delegation logic from moby's hack/dind script. After the wrapper runs, dockerd's view of /sys/fs/cgroup is rooted at the envbox container's own cgroup, so all inner container cgroups are created as descendants of that scope on the host. Host-level cgroup-aware observability tools (Tetragon, Falco, custom eBPF agents) can then attribute processes inside inner containers to the parent pod.

This is the approach Isovalent (Cilium/Tetragon vendor) recommended after testing it themselves — see moby/moby#45378 (comment).

Changes

cli/docker.go

  • Adds wrapDockerdCmd(dargs) helper that returns the unshare-prefixed command + args (a rough sketch follows this list). The shell snippet does:
    • umount /sys/fs/cgroup + mount -t cgroup2 cgroup /sys/fs/cgroup — fresh cgroup2 mount rooted at the new cgroup namespace's root (the envbox container's cgroup on the host)
    • mkdir /sys/fs/cgroup/init + a retry loop that moves all processes via xargs -rn1 < cgroup.procs > init/cgroup.procs and enables every available controller by rewriting cgroup.controllers into cgroup.subtree_control (e.g. "cpu io memory" becomes "+cpu +io +memory") — needed because cgroup v2's "no internal processes" rule forbids a cgroup from having member processes while controllers are enabled in its subtree_control
    • exec into dockerd with the original args
  • Updates the three sites that launch/restart dockerd (initial start and the two IsNoSpaceErr recovery paths) to go through the wrapper.
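For concreteness, here is a rough sketch of what such a wrapper could look like. The constant name, script text, and exact argument layout are assumptions reconstructed from this description, not the PR's literal diff:

// Hypothetical reconstruction of the wrapper; not the PR's literal code.
const dockerdWrapScript = `
set -e
umount /sys/fs/cgroup
mount -t cgroup2 cgroup /sys/fs/cgroup
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
	mkdir -p /sys/fs/cgroup/init
	# Move every process out of the root cgroup, then enable all controllers.
	# Retry because cgroup v2 rejects subtree_control writes while the root
	# cgroup still has member processes ("no internal processes" rule).
	while ! {
		xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
		sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers > /sys/fs/cgroup/cgroup.subtree_control
	}; do :; done
fi
# $0 is "dockerd" and "$@" are the original args; exec preserves the PID.
exec "$0" "$@"
`

// wrapDockerdCmd prefixes a dockerd invocation with a cgroup-namespace
// unshare plus the delegation script, returning (command, args).
func wrapDockerdCmd(dargs []string) (string, []string) {
	args := []string{"--cgroup", "/bin/sh", "-c", dockerdWrapScript, "dockerd"}
	return "unshare", append(args, dargs...)
}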

xunix/sys.go

  • Adds a fallback in readCPUQuotaCGroupV2: if /sys/fs/cgroup/<self>/cpu.max doesn't exist (because the cgroup remount above has reparented the mount root), read /sys/fs/cgroup/cpu.max directly. The mount root IS the current cgroup in that case so the values are equivalent.
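For reference, the fallback might be shaped like this. It is a sketch assuming an afero-backed filesystem and envbox's CPUQuota type; the signature and the parseCPUMax helper are assumptions, not the PR's literal diff:

import (
	"path/filepath"

	"github.com/spf13/afero"
	"golang.org/x/xerrors"
)

func readCPUQuotaCGroupV2(fs afero.Fs, selfCGroup string) (CPUQuota, error) {
	data, err := afero.ReadFile(fs, filepath.Join("/sys/fs/cgroup", selfCGroup, "cpu.max"))
	if err != nil {
		// After the cgroup2 remount, the mount root *is* this container's
		// cgroup, so the same quota/period pair lives at the hierarchy root.
		var rootErr error
		data, rootErr = afero.ReadFile(fs, "/sys/fs/cgroup/cpu.max")
		if rootErr != nil {
			return CPUQuota{}, xerrors.Errorf("read cpu.max outside container: %w", rootErr)
		}
	}
	return parseCPUMax(data) // parseCPUMax: hypothetical parser for "max 100000"-style content
}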

Why no --mount on unshare?

unshare --cgroup alone leaves the mount namespace shared with the parent envbox process, so the umount + mount -t cgroup2 operations leak back to envbox. We tried isolating with --mount:

  • --propagation private → breaks sysbox-fs: its per-container mounts under /var/lib/sysboxfs/<id>/ stop being visible to sysbox-runc in the dockerd namespace and inner-container creation fails.
  • --propagation slave → same failure: the parent mount points are private by default, so nothing propagates to the slave child.

Making /var/lib/sysboxfs/ (or /) rshared in envbox's mount namespace before the unshare would work but is more invasive. The pragmatic compromise: accept the mount-namespace leak and handle the one observable side effect (the CPU quota read) via the xunix/sys.go fallback. CPU enforcement is unaffected — the inner workspace container's cgroup is now a descendant of the pod's cgroup tree, so the pod's resource limits apply transitively.

Verification

Tested against:

  • containerd 2.2.1
  • Linux 6.8 / Ubuntu 22.04 with cgroup v2
  • Tetragon v1.18.1 with --enable-cgtrackerid=true

End-to-end check: after a workspace start, running whoami from a Coder terminal (inside the inner workspace container) now produces a Tetragon process_exec event with a full pod field — namespace, name, UID, container ID, image, security context, and pod labels. Before this change the same event had no pod field at all.

Workspace startup is otherwise unchanged. Sysbox-fs continues to provide its /sys and /proc virtualization correctly. The "no internal processes" cgroup v2 rule that previously blocked nested container creation under a privileged pod's .scope cgroup is now satisfied because all envbox/sysbox processes are first migrated into the init child cgroup, leaving the parent with no member processes of its own.

Backwards compatibility

The wrapper degrades gracefully on cgroup configurations the delegation block doesn't apply to:

  • cgroup v1 hosts: the if [ -f /sys/fs/cgroup/cgroup.controllers ] guard skips the delegation block; only the umount/mount happens.
  • Kernels without unshare --cgroup (< 4.6, very rare today): would surface as a startup error.

For cgroup v2 hosts the only behavioural change is the cgroup placement of inner containers — which is what we want.


@f0ssel f0ssel requested a review from johnstcn May 4, 2026 17:17
@f0ssel
Member

f0ssel commented May 4, 2026

Hey there, I've tagged another Coder engineer to do the approval, but I did read through and had an agent help with the review.

Overview

This PR wraps the dockerd invocation with unshare --cgroup + cgroup2 remount + cgroup v2 delegation so inner container cgroups become descendants of the pod's cgroup tree. This enables host-level observability tools (Tetragon, Falco, eBPF agents) to attribute inner-container processes to the parent pod.

The approach is sound — it mirrors moby's canonical hack/dind pattern and was recommended by Isovalent.

CI Status

Check              Result  Related to PR?
build              FAIL    No — aquasecurity/trivy-action@0.34.2 version resolution failure
unit-tests         FAIL    Yes — TestDocker/DockerdConfigured expects dockerd but now gets unshare
integration-tests  FAIL    Yes — same DockerdConfigured test failure
lint               PASS
fmt                PASS
codeql             PASS

Issues Found

1. CRITICAL: background.Process.kill() race after exec chain

This is the most significant correctness concern. The background.Process tracks processes by binName set from the cmd argument to New(). After wrapDockerdCmd, binName = "unshare". However, due to the exec chain (unshare → sh → dockerd), the actual process's /proc/<pid>/cmdline becomes "dockerd".

In background/process.go, the kill() function sends SIGTERM then polls isProcExited():

func isProcExited(fs afero.Fs, pid int, cmd string) (bool, error) {
    cmdline, err := afero.ReadFile(fs, fmt.Sprintf("/proc/%d/cmdline", pid))
    // ...
    args := bytes.Split(cmdline, []byte{0}) // /proc/<pid>/cmdline fields are NUL-separated
    return cmd != string(args[0]), nil      // "unshare" != "dockerd" → true (thinks it exited!)
}

Since "unshare" != "dockerd", isProcExited returns true immediately after SIGTERM is sent — before dockerd has actually exited. Restart() then starts a new instance while the old one may still be binding ports/sockets. This could cause startup failures or data corruption on the two no-space recovery restart paths.

Before this PR: binName was "dockerd", and the process cmdline was dockerd, so the PID exit check worked correctly.

Fix: Either:

  • Pass "dockerd" as binName separately (requires refactoring Process)
  • Have wrapDockerdCmd return a binName hint and use it
  • Or simplest: after the kill() call in Restart, also wait for the process channel to drain before starting (sketched below)
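To illustrate that third option, a rough sketch follows. The waitCh field and start method are guesses at background.Process internals, not its actual API:

// Hypothetical sketch only; field/method names are assumptions.
func (p *Process) Restart(ctx context.Context) error {
	if err := p.kill(ctx); err != nil {
		return err
	}
	// Don't trust the cmdline heuristic: wait until the old dockerd is
	// actually reaped before starting a replacement, so its ports and
	// sockets are released.
	select {
	case <-p.waitCh:
	case <-ctx.Done():
		return ctx.Err()
	}
	return p.start(ctx)
}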

2. Must-fix: Existing test DockerdConfigured is broken

The test at docker_test.go:182-191 registers a fake command matching dockerd --debug --log-level=debug .... After this PR the binary is unshare, not dockerd, so the fake is never matched. The test needs to expect unshare --cgroup /bin/sh -c <script> dockerd <args> instead.

3. Missing test coverage for new code

  • wrapDockerdCmd: No unit test. Should verify the returned command/args structure, the exec "$0" "$@" pattern, and argument passthrough (see the sketch after this list).
  • readCPUQuotaCGroupV2 fallback: The existing sys_test.go has no test case where the self-relative path fails but /sys/fs/cgroup/cpu.max succeeds. Should add a CGroupV2_RemountedRoot test case.
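A test along these lines would cover the first gap, assuming the wrapDockerdCmd shape sketched earlier in this thread (command string plus args slice); all names here are illustrative:

import (
	"testing"

	"github.com/stretchr/testify/require"
)

func TestWrapDockerdCmd(t *testing.T) {
	t.Parallel()

	cmd, args := wrapDockerdCmd([]string{"--debug", "--log-level=debug"})
	require.Equal(t, "unshare", cmd)
	require.Equal(t, "--cgroup", args[0])
	// The embedded script must exec "$0" "$@" so dockerd replaces the shell.
	require.Contains(t, args[3], `exec "$0" "$@"`)
	// Original dockerd arguments pass through untouched at the tail.
	require.Equal(t, []string{"--debug", "--log-level=debug"}, args[len(args)-2:])
}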

4. Minor: Error wrapping in fallback path (xunix/sys.go:78)

if rootErr != nil {
    return CPUQuota{}, xerrors.Errorf("read cpu.max outside container: %w", err) // wraps `err`, not `rootErr`
}

When both reads fail, the error message says "read cpu.max outside container" but wraps err (the self-relative path error) rather than rootErr (the fallback error). A developer debugging this might be confused since the error path and the wrapped error don't match. Consider wrapping both:

return CPUQuota{}, xerrors.Errorf("read cpu.max: self-path: %v, root-path: %w", err, rootErr)

5. Shell script: infinite retry loop with no bound

while ! {
    xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
    sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers > /sys/fs/cgroup/cgroup.subtree_control
}; do :; done

This loop has no timeout or retry limit. If the subtree_control write persistently fails (e.g., a controller that can't be enabled), this blocks dockerd startup forever with no logs. Moby's hack/dind has the same pattern, but moby runs as a developer tool, not a long-lived production component.

Consider adding a max-retry counter with a warning message, or at minimum a sleep to avoid CPU-spinning on persistent failures.
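For example, a bounded variant of the delegation loop could look like this (sketch only; the retry limit and sleep interval are arbitrary choices):

# Hypothetical bounded replacement for the unbounded retry loop.
i=0
until {
	xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
	sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers > /sys/fs/cgroup/cgroup.subtree_control
}; do
	i=$((i + 1))
	if [ "$i" -ge 50 ]; then
		echo "envbox: cgroup delegation did not converge after $i attempts" >&2
		break
	fi
	sleep 0.1
done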

6. umount /sys/fs/cgroup without guard

The script runs umount /sys/fs/cgroup unconditionally before the cgroupv2 guard. If cgroup is not mounted at that path (unusual but possible), set -e will abort the entire script before reaching exec dockerd. This would prevent workspace startup entirely with a cryptic error. Consider umount /sys/fs/cgroup 2>/dev/null || true or a mountpoint -q check.

Things Done Well

  • Excellent PR description: The tradeoffs (no --mount, mount namespace leak) are clearly documented with reasoning.
  • Minimal footprint: Only 2 files changed, well-scoped.
  • cgroup v1 guard: The if [ -f /sys/fs/cgroup/cgroup.controllers ] correctly skips delegation on v1.
  • exec "$0" "$@" pattern: Properly replaces the shell with dockerd, preserving the PID for signal handling.
  • CPU quota fallback: The fallback to reading /sys/fs/cgroup/cpu.max at mount root is the right approach for the mount-namespace leak side effect.
  • All three dockerd launch sites updated consistently.

Recommendation

Request changes. The isProcExited / binName mismatch (issue #1) is a real correctness regression for the restart paths. The broken test (issue #2) must also be fixed. Issues #3-6 are worth discussing but less critical.

@f0ssel
Member

f0ssel commented May 4, 2026

I believe the build error with aquasecurity/trivy-action@0.34.2 has been fixed on main, so rebasing fresh should clear it. Thanks
