feat(shim): add per-pod user namespace injection via annotation #13304
Draft
a7i wants to merge 1 commit into
When the runtime opts in (`enable_user_namespace_annotation = true` in `runsc.toml`) and a pod sets `dev.gvisor.spec.user-namespace = "true"` in its `metadata.annotations`, the runsc containerd shim injects a Linux user namespace and a contiguous, non-overlapping uid/gid block into the sandbox container's OCI spec before invoking runsc. Application/exec containers in the same pod inherit the sandbox's user namespace from runsc; only the sandbox spec is modified. Caller-provided mappings (e.g. from kubelet `pod.spec.hostUsers: false` plumbing) take precedence.

Two gates are required so a misconfigured pod cannot unilaterally enable a userns on a runtime that is not provisioned for one:
1. operator opt-in: `enable_user_namespace_annotation = true`.
2. pod opt-in: `dev.gvisor.spec.user-namespace: "true"` annotation.

This exists to let runsc workloads run inside a user namespace on Kubernetes nodes whose kubelet+containerd stack does not yet plumb `pod.spec.hostUsers` (KEP-127) through to runsc. The shim never claims CRI `RuntimeFeatures.UserNamespaces`, so kubelet's KEP-127 admission is unaffected; this annotation is the per-pod opt-in until the upstream path lands. When that happens, drop the annotation and use `hostUsers: false` on the pod spec instead.

Per-sandbox uniqueness is provided by a directory-based allocator under `user_namespace_state_dir` (default `/run/runsc/userns-pool`). `os.Mkdir` is the synchronization primitive: the kernel guarantees mkdir(2) is atomic, so two shim invocations racing on the same slot resolve correctly. Allocations survive shim restarts and clear on reboot (`/run` is tmpfs).

Defaults: `range_size`=65536 UIDs per sandbox, `pool_size`=1000 concurrent sandboxes. `host_uid_base` / `host_gid_base` must be configured explicitly.

Refs: google#13303
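The directory-based allocation described above can be illustrated with a minimal sketch. It is not the code from this PR: the names `allocateSlot`, `releaseSlot`, `hostBase`, and `demo` are hypothetical, and the sketch names each slot directory after its index, whereas the real allocator is keyed by sandbox ID in a way the description does not spell out. It only shows why `os.Mkdir` is enough to serialize racing callers.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// errPoolExhausted borrows the error name mentioned in the test plan.
var errPoolExhausted = errors.New("user namespace pool exhausted")

// allocateSlot claims the first free slot in [0, poolSize) by creating a
// directory named after the slot index (an assumption of this sketch).
// mkdir(2) either creates the directory or fails with EEXIST, so at most
// one caller can win a given slot.
func allocateSlot(stateDir string, poolSize int) (int, error) {
	if err := os.MkdirAll(stateDir, 0o700); err != nil {
		return 0, err
	}
	for slot := 0; slot < poolSize; slot++ {
		err := os.Mkdir(filepath.Join(stateDir, strconv.Itoa(slot)), 0o700)
		if err == nil {
			return slot, nil
		}
		if !errors.Is(err, os.ErrExist) {
			return 0, err
		}
	}
	return 0, errPoolExhausted
}

// releaseSlot frees a slot by removing its directory. Releasing a slot
// that is not held is treated as a no-op.
func releaseSlot(stateDir string, slot int) error {
	err := os.Remove(filepath.Join(stateDir, strconv.Itoa(slot)))
	if errors.Is(err, os.ErrNotExist) {
		return nil
	}
	return err
}

// hostBase returns the first host ID of a slot's block:
// base + slot*rangeSize.
func hostBase(base, rangeSize uint32, slot int) uint32 {
	return base + uint32(slot)*rangeSize
}

// demo fills a two-slot pool in a temporary directory, checks that a third
// request is rejected, then releases slot 0 and allocates again.
func demo() (slots []int, exhausted bool, err error) {
	dir, err := os.MkdirTemp("", "userns-pool-")
	if err != nil {
		return nil, false, err
	}
	defer os.RemoveAll(dir)
	for i := 0; i < 2; i++ {
		s, err := allocateSlot(dir, 2)
		if err != nil {
			return nil, false, err
		}
		slots = append(slots, s)
	}
	_, err = allocateSlot(dir, 2)
	exhausted = errors.Is(err, errPoolExhausted)
	if err := releaseSlot(dir, 0); err != nil {
		return nil, exhausted, err
	}
	s, err := allocateSlot(dir, 2)
	if err != nil {
		return nil, exhausted, err
	}
	return append(slots, s), exhausted, nil
}

func main() {
	slots, exhausted, err := demo()
	fmt.Println(slots, exhausted, err)
	fmt.Println(hostBase(100000, 65536, 1))
}
```

Because the claim is a single `mkdir` call, there is no separate lock file to clean up after a crash; the cost is a linear scan over the pool on each allocation.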
Summary
Adds `enable_user_namespace_annotation` option to `containerd-shim-runsc-v1`. When the runtime opts in and a pod sets `dev.gvisor.spec.user-namespace: "true"` in its `metadata.annotations`, the shim injects a Linux user namespace and a contiguous, non-overlapping uid/gid block into the sandbox container's OCI spec before invoking runsc. Application/exec containers in the same pod inherit the sandbox's user namespace from runsc.

The injection is a no-op when the caller already declared a user namespace or uid/gid mappings, so kubelet's `pod.spec.hostUsers: false` plumbing (KEP-127) takes precedence when it lands for runsc.

Refs: #13303.
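The no-op rule can be sketched as follows. This is an illustration under stated assumptions, not the PR's `InjectUserNamespace`: the `spec` and `idMapping` types are simplified stand-ins for the OCI runtime-spec structures, the function name `injectUserNamespace` is hypothetical, and mapping the block to container ID 0 is an assumption.

```go
package main

import "fmt"

// idMapping and spec are simplified stand-ins for the OCI runtime-spec
// types; they are not the real definitions.
type idMapping struct {
	ContainerID, HostID, Size uint32
}

type spec struct {
	Namespaces  []string // namespace types, e.g. "pid", "user"
	UIDMappings []idMapping
	GIDMappings []idMapping
}

// injectUserNamespace adds a user namespace and one contiguous uid/gid
// block unless the caller already declared either. It reports whether the
// spec was mutated.
func injectUserNamespace(s *spec, hostUID, hostGID, size uint32) bool {
	// Caller-supplied mappings take precedence: leave the spec untouched.
	if len(s.UIDMappings) > 0 || len(s.GIDMappings) > 0 {
		return false
	}
	for _, ns := range s.Namespaces {
		if ns == "user" {
			return false
		}
	}
	s.Namespaces = append(s.Namespaces, "user")
	// Assumption: the block starts at container ID 0.
	s.UIDMappings = []idMapping{{ContainerID: 0, HostID: hostUID, Size: size}}
	s.GIDMappings = []idMapping{{ContainerID: 0, HostID: hostGID, Size: size}}
	return true
}

func main() {
	fresh := &spec{Namespaces: []string{"pid"}}
	fmt.Println(injectUserNamespace(fresh, 100000, 100000, 65536), fresh.UIDMappings)

	preset := &spec{UIDMappings: []idMapping{{ContainerID: 0, HostID: 5000, Size: 10}}}
	fmt.Println(injectUserNamespace(preset, 100000, 100000, 65536), preset.UIDMappings)
}
```

Checking for existing mappings before touching the allocator is what lets the shim yield without consuming a slot once kubelet starts supplying mappings itself.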
Why
Per #13303, `runtimeClassName: gvisor` workloads cannot opt into user namespaces via `pod.spec.hostUsers: false` today because:
1. `introspectRuntimeFeatures` (`internal/cri/server/service.go`) hardcodes `r.Type == plugins.RuntimeRuncV2`, so the runsc shim's `manager.Info()` features are never consulted.
2. `supportsCRIUserns` requires `Linux.MountExtensions.IDMap.Enabled = true`, which `runsc/specutils/specutils.go` returns `false` for since runsc does not implement OCI `mount.idmap`.

This patch sidesteps both issues at the runtime layer with a per-pod opt-in: a pod annotation tells the shim to mutate the sandbox's OCI spec the same way kubelet would, without requiring kubelet/CRI cooperation. A separate operator-side gate ensures a misconfigured workload cannot unilaterally request a userns on a runtime that is not provisioned for one.
This is intended as a stopgap, not a long-term replacement for KEP-127. When the upstream path lands, operators set `enable_user_namespace_annotation = false` and pods drop the annotation in favor of `hostUsers: false`. The change is forward-compatible: the shim respects caller-supplied mappings and yields without allocating a slot.
How
- `pkg/shim/v1/runsc/options.go`: `enable_user_namespace_annotation`, `user_namespace_host_uid_base`, `user_namespace_host_gid_base`, `user_namespace_range_size` (default 65536), `user_namespace_pool_size` (default 1000), `user_namespace_state_dir` (default `/run/runsc/userns-pool`).
- `pkg/shim/v1/utils/userns.go`: `UserNamespaceConfig`, `UserNamespaceRequestAnnotation` (`dev.gvisor.spec.user-namespace`), `HasUserNamespaceRequest`, `AllocateUserNamespaceSlot`, `ReleaseUserNamespaceSlot`, `InjectUserNamespace`. Uses `os.Mkdir` as the per-slot synchronization primitive (kernel-atomic), so concurrent shim invocations don't collide. State is persisted under `user_namespace_state_dir` so allocations survive shim restarts; it is cleared on reboot via tmpfs.
- `pkg/shim/v1/runsc/service.go`: `newInit` registers a third spec mutator alongside the existing `UpdateVolumeAnnotations` / `setPodCgroup` chain. It only runs for sandbox containers (`utils.IsSandbox(spec)`) and only when both gates are satisfied; app containers inherit.
- `pkg/shim/v1/runsc/container.go`: the `Container.userNS` field threads the allocator config through to `Container.Delete`, which releases the slot when the init task is torn down. Failure paths in `NewContainer` register a `cleanup.Add` hook so a partially-created sandbox releases its slot.
- `g3doc/user_guide/containerd/configuration.md` linking back to #13303 (Support pod.spec.hostUsers: false (KEP-127): runsc not advertised as supporting user namespaces in CRI).

Each sandbox's UID range starts at `host_uid_base + slot * range_size`, where `slot` is allocated from `[0, pool_size)`. With the default `range_size` and `pool_size` and a `host_uid_base` of 100000, the pool occupies `[100000, 100000 + 1000*65536)`.
Pod-side UX
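From the pod's side the only input is the annotation, and the shim acts on it only when the operator gate is also on. A minimal sketch of that decision follows. The function names are hypothetical, and treating only the exact string `"true"` as a request (so `"1"` and `"false"` are ignored) is an assumption inferred from the cases listed for `TestHasUserNamespaceRequest`, not confirmed behavior.

```go
package main

import "fmt"

// userNamespaceRequestAnnotation is the annotation key named in this PR.
const userNamespaceRequestAnnotation = "dev.gvisor.spec.user-namespace"

// hasUserNamespaceRequest reports whether the pod opted in. Assumption:
// only the exact value "true" counts. Lookups on a nil map are safe in Go.
func hasUserNamespaceRequest(annotations map[string]string) bool {
	return annotations[userNamespaceRequestAnnotation] == "true"
}

// shouldInject combines the operator gate, the sandbox check, and the pod
// gate. All three must hold before the spec is mutated.
func shouldInject(operatorEnabled, isSandbox bool, annotations map[string]string) bool {
	return operatorEnabled && isSandbox && hasUserNamespaceRequest(annotations)
}

func main() {
	optIn := map[string]string{userNamespaceRequestAnnotation: "true"}
	fmt.Println(shouldInject(true, true, optIn))  // both gates on, sandbox container
	fmt.Println(shouldInject(false, true, optIn)) // operator gate off
	fmt.Println(shouldInject(true, false, optIn)) // app container, inherits instead
	fmt.Println(shouldInject(true, true, nil))    // no annotations at all
}
```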
Test Plan
Unit tests in `pkg/shim/v1/utils/userns_test.go`:
- `TestUserNamespaceConfigValidate` -- defaults populated, missing UID/GID base fails, uint32 overflow fails.
- `TestAllocateUserNamespaceSlotUnique` -- 5 sandboxes get 5 distinct slots.
- `TestAllocateUserNamespaceSlotIdempotent` -- same sandbox-id returns same slot.
- `TestAllocateUserNamespaceSlotPoolExhausted` -- fills pool, returns `errPoolExhausted`.
- `TestReleaseUserNamespaceSlotFreesSlot` -- release then re-allocate works; releasing unknown sandbox is a no-op.
- `TestAllocateUserNamespaceSlotConcurrent` -- 20 goroutines, 64-slot pool, no collisions.
- `TestInjectUserNamespaceMutatesSpec` -- spec gets userns + correct mappings + slot annotation.
- `TestInjectUserNamespaceRespectsCallerMappings` -- caller-supplied uid/gid mappings preserved (no-op).
- `TestInjectUserNamespaceRejectsOutOfRangeSlot` -- slot >= PoolSize fails.
- `TestHasUserNamespaceRequest` -- annotation absent / "false" / "1" / "true" / nil-spec / nil-annotations.

End-to-end validated against `kubernetes 1.35.2`, `containerd 2.2.2`, runsc release-20260520.0 by enabling `enable_user_namespace_annotation = true` on the runsc runtime, applying a pod with the opt-in annotation, and confirming that `cat /proc/self/uid_map` shows a non-zero, per-pod-unique host UID range.
Open Questions for Maintainers
- `introspectRuntimeFeatures`, plus whatever is needed gVisor-side to satisfy `supportsCRIUserns`.
- `/etc/subuid` / `/etc/subgid` and have the shim consume them. The current design is more self-contained but does its own bookkeeping. Would maintainers prefer the subuid path?
- `dev.gvisor.spec.user-namespace` (matches the existing `dev.gvisor.spec.mount.{name}` pattern) and `enable_user_namespace_annotation` (snake_case to match other shim TOML options). Alternatives welcome.

Happy to iterate or split into smaller commits.