Place inner container cgroups under the pod's cgroup tree #169

Open
jamestoyer wants to merge 1 commit into coder:main from jamestoyer:feat/dind-cgroup-attribution

Conversation

@jamestoyer

Closes #168

Summary

Wrap the dockerd invocation with unshare --cgroup, a fresh cgroup2 remount, and the cgroup-delegation logic from moby's hack/dind script. After the wrapper runs, dockerd's view of /sys/fs/cgroup is rooted at the envbox container's own cgroup, so all inner container cgroups are created as descendants of that scope on the host. Host-level cgroup-aware observability tools (Tetragon, Falco, custom eBPF agents) can then attribute processes inside inner containers to the parent pod.

This is the approach Isovalent (Cilium/Tetragon vendor) recommended after testing it themselves — see moby/moby#45378 (comment).

Changes

cli/docker.go

  • Adds wrapDockerdCmd(dargs) helper that returns the unshare-prefixed command + args (a rough sketch follows this list). The shell snippet does:
    • umount /sys/fs/cgroup + mount -t cgroup2 cgroup /sys/fs/cgroup — fresh cgroup2 mount rooted at the new cgroup namespace's root (the envbox container's cgroup on the host)
    • mkdir /sys/fs/cgroup/init + a retry loop that moves all processes via xargs -rn1 < cgroup.procs > init/cgroup.procs and enables every available controller by rewriting cgroup.controllers into cgroup.subtree_control (e.g. "cpu io memory" becomes "+cpu +io +memory") — needed because cgroup v2's "no internal processes" rule forbids a cgroup from having member processes while controllers are enabled in its subtree_control
    • exec into dockerd with the original args
  • Updates the three sites that launch/restart dockerd (initial start and the two IsNoSpaceErr recovery paths) to go through the wrapper.
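For concreteness, here is a rough sketch of what such a wrapper could look like. The constant name, script text, and exact argument layout are assumptions reconstructed from this description, not the PR's literal diff:

// Hypothetical reconstruction of the wrapper; not the PR's literal code.
const dockerdWrapScript = `
set -e
umount /sys/fs/cgroup
mount -t cgroup2 cgroup /sys/fs/cgroup
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
	mkdir -p /sys/fs/cgroup/init
	# Move every process out of the root cgroup, then enable all controllers.
	# Retry because cgroup v2 rejects subtree_control writes while the root
	# cgroup still has member processes ("no internal processes" rule).
	while ! {
		xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
		sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers > /sys/fs/cgroup/cgroup.subtree_control
	}; do :; done
fi
# $0 is "dockerd" and "$@" are the original args; exec preserves the PID.
exec "$0" "$@"
`

// wrapDockerdCmd prefixes a dockerd invocation with a cgroup-namespace
// unshare plus the delegation script, returning (command, args).
func wrapDockerdCmd(dargs []string) (string, []string) {
	args := []string{"--cgroup", "/bin/sh", "-c", dockerdWrapScript, "dockerd"}
	return "unshare", append(args, dargs...)
}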

xunix/sys.go

  • Adds a fallback in readCPUQuotaCGroupV2: if /sys/fs/cgroup/<self>/cpu.max doesn't exist (because the cgroup remount above has reparented the mount root), read /sys/fs/cgroup/cpu.max directly. The mount root IS the current cgroup in that case so the values are equivalent.
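For reference, the fallback might be shaped like this. It is a sketch assuming an afero-backed filesystem and envbox's CPUQuota type; the signature and the parseCPUMax helper are assumptions, not the PR's literal diff:

import (
	"path/filepath"

	"github.com/spf13/afero"
	"golang.org/x/xerrors"
)

func readCPUQuotaCGroupV2(fs afero.Fs, selfCGroup string) (CPUQuota, error) {
	data, err := afero.ReadFile(fs, filepath.Join("/sys/fs/cgroup", selfCGroup, "cpu.max"))
	if err != nil {
		// After the cgroup2 remount, the mount root *is* this container's
		// cgroup, so the same quota/period pair lives at the hierarchy root.
		var rootErr error
		data, rootErr = afero.ReadFile(fs, "/sys/fs/cgroup/cpu.max")
		if rootErr != nil {
			return CPUQuota{}, xerrors.Errorf("read cpu.max outside container: %w", rootErr)
		}
	}
	return parseCPUMax(data) // parseCPUMax: hypothetical parser for "max 100000"-style content
}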

Why no --mount on unshare?

unshare --cgroup alone leaves the mount namespace shared with the parent envbox process, so the umount + mount -t cgroup2 operations leak back to envbox. We tried isolating with --mount:

  • --propagation private → breaks sysbox-fs: its per-container mounts under /var/lib/sysboxfs/<id>/ stop being visible to sysbox-runc in the dockerd namespace and inner-container creation fails.
  • --propagation slave → same failure: the parent mount points are private by default, so nothing propagates to the slave child.

Making /var/lib/sysboxfs/ (or /) rshared in envbox's mount namespace before the unshare would work but is more invasive. The pragmatic compromise: accept the mount-namespace leak and handle the one observable side effect (the CPU quota read) via the xunix/sys.go fallback. CPU enforcement is unaffected — the inner workspace container's cgroup is now a descendant of the pod's cgroup tree, so the pod's resource limits apply transitively.

Verification

Tested against:

  • containerd 2.2.1
  • Linux 6.8 / Ubuntu 22.04 with cgroup v2
  • Tetragon v1.18.1 with --enable-cgtrackerid=true

End-to-end check: after a workspace start, running whoami from a Coder terminal (inside the inner workspace container) now produces a Tetragon process_exec event with a full pod field — namespace, name, UID, container ID, image, security context, and pod labels. Before this change the same event had no pod field at all.

Workspace startup is otherwise unchanged. Sysbox-fs continues to provide its /sys and /proc virtualization correctly. The "no internal processes" cgroup v2 rule that previously blocked nested container creation under a privileged pod's .scope cgroup is now satisfied because all envbox/sysbox processes are first migrated into the init child cgroup, leaving the parent with no member processes of its own.

Backwards compatibility

The wrapper degrades gracefully on cgroup configurations the delegation block doesn't apply to:

  • cgroup v1 hosts: the if [ -f /sys/fs/cgroup/cgroup.controllers ] guard skips the delegation block; only the umount/mount happens.
  • Kernels without unshare --cgroup (< 4.6, very rare today): would surface as a startup error.

For cgroup v2 hosts the only behavioural change is the cgroup placement of inner containers — which is what we want.


@f0ssel f0ssel requested a review from johnstcn May 4, 2026 17:17
@f0ssel
Member

f0ssel commented May 4, 2026

Hey there, I've tagged another Coder engineer to do the approval, but I did read through and had an agent help with the review.

Overview

This PR wraps the dockerd invocation with unshare --cgroup + cgroup2 remount + cgroup v2 delegation so inner container cgroups become descendants of the pod's cgroup tree. This enables host-level observability tools (Tetragon, Falco, eBPF agents) to attribute inner-container processes to the parent pod.

The approach is sound — it mirrors moby's canonical hack/dind pattern and was recommended by Isovalent.

CI Status

Check              Result  Related to PR?
build              FAIL    No — aquasecurity/trivy-action@0.34.2 version resolution failure
unit-tests         FAIL    Yes — TestDocker/DockerdConfigured expects dockerd but now gets unshare
integration-tests  FAIL    Yes — same DockerdConfigured test failure
lint               PASS
fmt                PASS
codeql             PASS

Issues Found

1. CRITICAL: background.Process.kill() race after exec chain

This is the most significant correctness concern. The background.Process tracks processes by binName set from the cmd argument to New(). After wrapDockerdCmd, binName = "unshare". However, due to the exec chain (unshare → sh → dockerd), the actual process's /proc/<pid>/cmdline becomes "dockerd".

In background/process.go, the kill() function sends SIGTERM then polls isProcExited():

func isProcExited(fs afero.Fs, pid int, cmd string) (bool, error) {
    cmdline, err := afero.ReadFile(fs, fmt.Sprintf("/proc/%d/cmdline", pid))
    // ...
    args := bytes.Split(cmdline, []byte{0}) // /proc/<pid>/cmdline fields are NUL-separated
    return cmd != string(args[0]), nil      // "unshare" != "dockerd" → true (thinks it exited!)
}

Since "unshare" != "dockerd", isProcExited returns true immediately after SIGTERM is sent — before dockerd has actually exited. Restart() then starts a new instance while the old one may still be binding ports/sockets. This could cause startup failures or data corruption on the two no-space recovery restart paths.

Before this PR: binName was "dockerd", and the process cmdline was dockerd, so the PID exit check worked correctly.

Fix: Either:

  • Pass "dockerd" as binName separately (requires refactoring Process)
  • Have wrapDockerdCmd return a binName hint and use it
  • Or simplest: after the kill() call in Restart, also wait for the process channel to drain before starting (sketched below)
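To illustrate that third option, a rough sketch follows. The waitCh field and start method are guesses at background.Process internals, not its actual API:

// Hypothetical sketch only; field/method names are assumptions.
func (p *Process) Restart(ctx context.Context) error {
	if err := p.kill(ctx); err != nil {
		return err
	}
	// Don't trust the cmdline heuristic: wait until the old dockerd is
	// actually reaped before starting a replacement, so its ports and
	// sockets are released.
	select {
	case <-p.waitCh:
	case <-ctx.Done():
		return ctx.Err()
	}
	return p.start(ctx)
}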

2. Must-fix: Existing test DockerdConfigured is broken

The test at docker_test.go:182-191 registers a fake command matching dockerd --debug --log-level=debug .... After this PR the binary is unshare, not dockerd, so the fake is never matched. The test needs to expect unshare --cgroup /bin/sh -c <script> dockerd <args> instead.

3. Missing test coverage for new code

  • wrapDockerdCmd: No unit test. Should verify the returned command/args structure, the exec "$0" "$@" pattern, and argument passthrough (see the sketch after this list).
  • readCPUQuotaCGroupV2 fallback: The existing sys_test.go has no test case where the self-relative path fails but /sys/fs/cgroup/cpu.max succeeds. Should add a CGroupV2_RemountedRoot test case.
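A test along these lines would cover the first gap, assuming the wrapDockerdCmd shape sketched earlier in this thread (command string plus args slice); all names here are illustrative:

import (
	"testing"

	"github.com/stretchr/testify/require"
)

func TestWrapDockerdCmd(t *testing.T) {
	t.Parallel()

	cmd, args := wrapDockerdCmd([]string{"--debug", "--log-level=debug"})
	require.Equal(t, "unshare", cmd)
	require.Equal(t, "--cgroup", args[0])
	// The embedded script must exec "$0" "$@" so dockerd replaces the shell.
	require.Contains(t, args[3], `exec "$0" "$@"`)
	// Original dockerd arguments pass through untouched at the tail.
	require.Equal(t, []string{"--debug", "--log-level=debug"}, args[len(args)-2:])
}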

4. Minor: Error wrapping in fallback path (xunix/sys.go:78)

if rootErr != nil {
    return CPUQuota{}, xerrors.Errorf("read cpu.max outside container: %w", err) // wraps `err`, not `rootErr`
}

When both reads fail, the error message says "read cpu.max outside container" but wraps err (the self-relative path error) rather than rootErr (the fallback error). A developer debugging this might be confused since the error path and the wrapped error don't match. Consider wrapping both:

return CPUQuota{}, xerrors.Errorf("read cpu.max: self-path: %v, root-path: %w", err, rootErr)

5. Shell script: infinite retry loop with no bound

while ! {
    xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
    sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers > /sys/fs/cgroup/cgroup.subtree_control
}; do :; done

This loop has no timeout or retry limit. If the subtree_control write persistently fails (e.g., a controller that can't be enabled), this blocks dockerd startup forever with no logs. Moby's hack/dind has the same pattern, but moby runs as a developer tool, not a long-lived production component.

Consider adding a max-retry counter with a warning message, or at minimum a sleep to avoid CPU-spinning on persistent failures.
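For example, a bounded variant of the delegation loop could look like this (sketch only; the retry limit and sleep interval are arbitrary choices):

# Hypothetical bounded replacement for the unbounded retry loop.
i=0
until {
	xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
	sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers > /sys/fs/cgroup/cgroup.subtree_control
}; do
	i=$((i + 1))
	if [ "$i" -ge 50 ]; then
		echo "envbox: cgroup delegation did not converge after $i attempts" >&2
		break
	fi
	sleep 0.1
done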

6. umount /sys/fs/cgroup without guard

The script runs umount /sys/fs/cgroup unconditionally before the cgroupv2 guard. If cgroup is not mounted at that path (unusual but possible), set -e will abort the entire script before reaching exec dockerd. This would prevent workspace startup entirely with a cryptic error. Consider umount /sys/fs/cgroup 2>/dev/null || true or a mountpoint -q check.

Things Done Well

  • Excellent PR description: The tradeoffs (no --mount, mount namespace leak) are clearly documented with reasoning.
  • Minimal footprint: Only 2 files changed, well-scoped.
  • cgroup v1 guard: The if [ -f /sys/fs/cgroup/cgroup.controllers ] correctly skips delegation on v1.
  • exec "$0" "$@" pattern: Properly replaces the shell with dockerd, preserving the PID for signal handling.
  • CPU quota fallback: The fallback to reading /sys/fs/cgroup/cpu.max at mount root is the right approach for the mount-namespace leak side effect.
  • All three dockerd launch sites updated consistently.

Recommendation

Request changes. The isProcExited / binName mismatch (issue #1) is a real correctness regression for the restart paths. The broken test (issue #2) must also be fixed. Issues #3-6 are worth discussing but less critical.

@f0ssel
Member

f0ssel commented May 4, 2026

I believe the build error with aquasecurity/trivy-action@0.34.2 has been fixed on main, so rebasing fresh should clear it. Thanks
