feat(shim): add per-pod user namespace injection via annotation #13304

Draft
a7i wants to merge 1 commit into google:master from a7i:feat/shim-userns

a7i commented May 27, 2026

Summary

Adds enable_user_namespace_annotation option to containerd-shim-runsc-v1. When the runtime opts in and a pod sets dev.gvisor.spec.user-namespace: "true" in its metadata.annotations, the shim injects a Linux user namespace and a contiguous, non-overlapping uid/gid block into the sandbox container's OCI spec before invoking runsc. Application/exec containers in the same pod inherit the sandbox's user namespace from runsc.

The injection is a no-op when the caller already declared a user namespace or uid/gid mappings, so kubelet's pod.spec.hostUsers: false plumbing (KEP-127) takes precedence when it lands for runsc.
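The precedence rule above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the struct types are simplified stand-ins for the OCI runtime-spec types, and `injectUserNamespace` and its parameters are hypothetical names chosen for the example.

```go
package main

// Simplified stand-ins for the OCI runtime-spec types (assumption: the real
// shim operates on github.com/opencontainers/runtime-spec/specs-go types).
type IDMapping struct {
	ContainerID uint32
	HostID      uint32
	Size        uint32
}

type Namespace struct {
	Type string // e.g. "user", "pid", "network"
}

type LinuxSpec struct {
	Namespaces  []Namespace
	UIDMappings []IDMapping
	GIDMappings []IDMapping
}

// injectUserNamespace is a hypothetical helper. It returns true when it
// mutated the spec and false when it yielded to caller-supplied settings.
func injectUserNamespace(l *LinuxSpec, hostUID, hostGID, size uint32) bool {
	if l == nil {
		return false
	}
	// Caller-supplied mappings take precedence: leave the spec untouched.
	if len(l.UIDMappings) > 0 || len(l.GIDMappings) > 0 {
		return false
	}
	// A caller-declared user namespace also takes precedence.
	for _, ns := range l.Namespaces {
		if ns.Type == "user" {
			return false
		}
	}
	// Map container IDs [0, size) onto the host block starting at the base.
	l.Namespaces = append(l.Namespaces, Namespace{Type: "user"})
	l.UIDMappings = []IDMapping{{ContainerID: 0, HostID: hostUID, Size: size}}
	l.GIDMappings = []IDMapping{{ContainerID: 0, HostID: hostGID, Size: size}}
	return true
}
```

Returning a boolean lets the caller skip slot allocation entirely when the function yields, which matches the "no-op" behavior described above.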

Refs: #13303.

Why

Per #13303, runtimeClassName: gvisor workloads cannot opt into user namespaces via pod.spec.hostUsers: false today because:

  1. containerd's introspectRuntimeFeatures (internal/cri/server/service.go) hardcodes r.Type == plugins.RuntimeRuncV2, so the runsc shim's manager.Info() features are never consulted.
  2. Even if (1) were lifted, supportsCRIUserns requires Linux.MountExtensions.IDMap.Enabled = true, but runsc/specutils/specutils.go reports false for that field because runsc does not implement OCI mount.idmap.

This patch sidesteps both issues at the runtime layer with a per-pod opt-in: a pod annotation tells the shim to mutate the sandbox's OCI spec the same way kubelet would, without requiring kubelet/CRI cooperation. A separate operator-side gate ensures a misconfigured workload cannot unilaterally request a userns on a runtime that is not provisioned for one.

This is intended as a stopgap, not a long-term replacement for KEP-127. When the upstream path lands, operators set enable_user_namespace_annotation = false and pods drop the annotation in favor of hostUsers: false. The change is forward-compatible: the shim respects caller-supplied mappings and yields without allocating a slot.

How

  • New shim options in pkg/shim/v1/runsc/options.go: enable_user_namespace_annotation, user_namespace_host_uid_base, user_namespace_host_gid_base, user_namespace_range_size (default 65536), user_namespace_pool_size (default 1000), user_namespace_state_dir (default /run/runsc/userns-pool).
  • New pkg/shim/v1/utils/userns.go: UserNamespaceConfig, UserNamespaceRequestAnnotation (dev.gvisor.spec.user-namespace), HasUserNamespaceRequest, AllocateUserNamespaceSlot, ReleaseUserNamespaceSlot, InjectUserNamespace. Uses os.Mkdir as the per-slot synchronization primitive (kernel-atomic), so concurrent shim invocations don't collide. State persisted under user_namespace_state_dir so allocations survive shim restarts; cleared on reboot via tmpfs.
  • pkg/shim/v1/runsc/service.go:newInit registers a third spec mutator alongside the existing UpdateVolumeAnnotations / setPodCgroup chain. The mutator runs only for sandbox containers (utils.IsSandbox(spec)) and only when both gates are satisfied; app containers inherit the sandbox's user namespace.
  • pkg/shim/v1/runsc/container.go: Container.userNS field threads the allocator config through to Container.Delete, which releases the slot when the init task is torn down. Failure paths in NewContainer register a cleanup.Add hook so a partially-created sandbox releases its slot.
  • Doc: new "User Namespace Injection" section in g3doc/user_guide/containerd/configuration.md linking back to #13303 ("Support pod.spec.hostUsers: false (KEP-127): runsc not advertised as supporting user namespaces in CRI").

Per-sandbox UID range starts at host_uid_base + slot * range_size, where slot is allocated from [0, pool_size). For example, with host_uid_base = 100000 and the default range_size and pool_size, the pool occupies [100000, 100000 + 1000*65536) = [100000, 65636000).
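As a rough illustration of the range arithmetic, the sketch below computes the half-open host ID interval for a slot and rejects configurations that would overflow uint32. The function name `slotRange` and its error messages are made up for this example and are not taken from the patch.

```go
package main

import (
	"errors"
	"math"
)

// slotRange returns the half-open host ID interval [start, end) for a slot.
// It is a hypothetical helper; the arithmetic is done in uint64 so that the
// overflow check itself cannot wrap around.
func slotRange(base, rangeSize, poolSize, slot uint32) (start, end uint64, err error) {
	if rangeSize == 0 || poolSize == 0 {
		return 0, 0, errors.New("range size and pool size must be non-zero")
	}
	if slot >= poolSize {
		return 0, 0, errors.New("slot is outside the pool")
	}
	// The whole pool must fit in the 32-bit ID space.
	poolEnd := uint64(base) + uint64(poolSize)*uint64(rangeSize)
	if poolEnd > math.MaxUint32+1 {
		return 0, 0, errors.New("pool overflows uint32")
	}
	start = uint64(base) + uint64(slot)*uint64(rangeSize)
	return start, start + uint64(rangeSize), nil
}
```

With a base of 100000 and the default sizes, slot 0 covers [100000, 165536) and the last slot (999) ends at 65636000.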

Pod-side UX

apiVersion: v1
kind: Pod
metadata:
  annotations:
    dev.gvisor.spec.user-namespace: "true"
spec:
  runtimeClassName: gvisor   # no new RuntimeClass required
  containers:
    - name: app
      image: ...

Test Plan

Unit tests in pkg/shim/v1/utils/userns_test.go:

  • TestUserNamespaceConfigValidate -- defaults populated, missing UID/GID base fails, uint32 overflow fails.
  • TestAllocateUserNamespaceSlotUnique -- 5 sandboxes get 5 distinct slots.
  • TestAllocateUserNamespaceSlotIdempotent -- same sandbox-id returns same slot.
  • TestAllocateUserNamespaceSlotPoolExhausted -- fills pool, returns errPoolExhausted.
  • TestReleaseUserNamespaceSlotFreesSlot -- release then re-allocate works; releasing unknown sandbox is a no-op.
  • TestAllocateUserNamespaceSlotConcurrent -- 20 goroutines, 64-slot pool, no collisions.
  • TestInjectUserNamespaceMutatesSpec -- spec gets userns + correct mappings + slot annotation.
  • TestInjectUserNamespaceRespectsCallerMappings -- caller-supplied uid/gid mappings preserved (no-op).
  • TestInjectUserNamespaceRejectsOutOfRangeSlot -- slot >= PoolSize fails.
  • TestHasUserNamespaceRequest -- annotation absent / "false" / "1" / "true" / nil-spec / nil-annotations.

End-to-end validated against kubernetes 1.35.2, containerd 2.2.2, runsc release-20260520.0 by enabling enable_user_namespace_annotation = true on the runsc runtime, applying a pod with the opt-in annotation, and confirming that cat /proc/self/uid_map shows a non-zero, per-pod-unique host UID range.
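The end-to-end check can also be automated. The sketch below parses the contents of a uid_map file (three whitespace-separated numbers per line: ID inside the namespace, ID outside, and length) and reports whether container root maps to a non-zero outside ID. The names `parseIDMap` and `isRemapped` are illustrative and not part of this PR.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// mapEntry is one line of a uid_map or gid_map file.
type mapEntry struct {
	Inside, Outside, Length uint32
}

// parseIDMap parses uid_map-style content passed in as a string.
func parseIDMap(content string) ([]mapEntry, error) {
	var entries []mapEntry
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 3 {
			return nil, fmt.Errorf("malformed line %q", sc.Text())
		}
		var vals [3]uint32
		for i, f := range fields {
			v, err := strconv.ParseUint(f, 10, 32)
			if err != nil {
				return nil, fmt.Errorf("bad number %q: %w", f, err)
			}
			vals[i] = uint32(v)
		}
		entries = append(entries, mapEntry{vals[0], vals[1], vals[2]})
	}
	return entries, sc.Err()
}

// isRemapped reports whether ID 0 inside the namespace maps to a non-zero
// ID outside it, which is what the manual check above looks for.
func isRemapped(entries []mapEntry) bool {
	for _, e := range entries {
		if e.Inside == 0 && e.Length > 0 {
			return e.Outside != 0
		}
	}
	return false
}
```

A test harness could feed it the output of cat /proc/self/uid_map from two pods and additionally compare the Outside values to confirm the ranges differ.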

Open Questions for Maintainers

  1. Is this an acceptable shape for a stopgap, or would you prefer to wait for KEP-127 / CRI feature plumbing? Happy to abandon this in favor of an upstream containerd PR that lifts the runc-only restriction in introspectRuntimeFeatures, plus whatever is needed gVisor-side to satisfy supportsCRIUserns.
  2. The allocator lives in the shim. An alternative is to require operators to pre-allocate ranges in /etc/subuid and /etc/subgid and have the shim consume them. The current design is more self-contained but does its own bookkeeping. Would maintainers prefer the subuid path?
  3. Naming conventions: dev.gvisor.spec.user-namespace (matches the existing dev.gvisor.spec.mount.{name} pattern) and enable_user_namespace_annotation (snake_case to match other shim TOML options). Alternatives welcome.

Happy to iterate or split into smaller commits.

When the runtime opts in (`enable_user_namespace_annotation = true` in
`runsc.toml`) and a pod sets `dev.gvisor.spec.user-namespace: "true"` in
its `metadata.annotations`, the runsc containerd shim injects a Linux
user namespace and a contiguous, non-overlapping uid/gid block into the
sandbox container's OCI spec before invoking runsc. Application/exec
containers in the same pod inherit the sandbox's user namespace from
runsc; only the sandbox spec is modified. Caller-provided mappings (e.g.
from kubelet pod.spec.hostUsers: false plumbing) take precedence.

Two gates are required so a misconfigured pod cannot unilaterally enable
a userns on a runtime that is not provisioned for one:
  1. operator opt-in: `enable_user_namespace_annotation = true`.
  2. pod opt-in: `dev.gvisor.spec.user-namespace: "true"` annotation.
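
The two-gate decision can be sketched as a single predicate. This is an illustration only: `shouldInjectUserNamespace` is a hypothetical name, and the sketch assumes that only the exact string "true" counts as a pod opt-in.

```go
package main

// Annotation key from this change.
const userNamespaceRequestAnnotation = "dev.gvisor.spec.user-namespace"

// shouldInjectUserNamespace is a hypothetical helper combining both gates.
// Assumption: any annotation value other than the exact string "true" is
// treated as "not requested".
func shouldInjectUserNamespace(operatorEnabled bool, annotations map[string]string) bool {
	// Gate 1: the operator must have enabled the feature on the runtime.
	if !operatorEnabled {
		return false
	}
	// Gate 2: the pod must carry the opt-in annotation. Reading from a nil
	// map is safe in Go and yields the empty string.
	return annotations[userNamespaceRequestAnnotation] == "true"
}
```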

This exists to let runsc workloads run inside a user namespace on
Kubernetes nodes whose kubelet+containerd stack does not yet plumb
pod.spec.hostUsers (KEP-127) through to runsc. The shim never claims CRI
RuntimeFeatures.UserNamespaces, so kubelet's KEP-127 admission is
unaffected; this annotation is the per-pod opt-in until the upstream
path lands. When that happens, drop the annotation and use
`hostUsers: false` on the pod spec instead.

Per-sandbox uniqueness is provided by a directory-based allocator under
`user_namespace_state_dir` (default `/run/runsc/userns-pool`). os.Mkdir
is the synchronization primitive: the kernel guarantees mkdir(2) is
atomic, so two shim invocations racing on the same slot resolve
correctly. Allocations survive shim restarts and clear on reboot
(`/run` is tmpfs).
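
A minimal sketch of a mkdir-based allocator, under stated assumptions: the on-disk layout (one directory per slot with an `owner` file inside) and the function names `allocateSlot` and `releaseSlot` are invented for illustration and may differ from the layout used in this patch.

```go
package main

import (
	"errors"
	"os"
	"path/filepath"
	"strconv"
)

var errPoolExhausted = errors.New("user namespace pool exhausted")

// ownerPath is the (assumed) file recording which sandbox holds a slot.
func ownerPath(stateDir string, slot int) string {
	return filepath.Join(stateDir, strconv.Itoa(slot), "owner")
}

// allocateSlot claims the lowest free slot for sandboxID, or returns the
// slot the sandbox already holds.
func allocateSlot(stateDir, sandboxID string, poolSize int) (int, error) {
	if err := os.MkdirAll(stateDir, 0o700); err != nil {
		return 0, err
	}
	// Idempotency: reuse a slot this sandbox already owns.
	for slot := 0; slot < poolSize; slot++ {
		if b, err := os.ReadFile(ownerPath(stateDir, slot)); err == nil && string(b) == sandboxID {
			return slot, nil
		}
	}
	for slot := 0; slot < poolSize; slot++ {
		dir := filepath.Join(stateDir, strconv.Itoa(slot))
		// os.Mkdir fails if the directory exists, so at most one
		// concurrent caller can claim a given slot.
		err := os.Mkdir(dir, 0o700)
		if errors.Is(err, os.ErrExist) {
			continue // slot taken, try the next one
		}
		if err != nil {
			return 0, err
		}
		if err := os.WriteFile(ownerPath(stateDir, slot), []byte(sandboxID), 0o600); err != nil {
			os.RemoveAll(dir) // do not leak a half-claimed slot
			return 0, err
		}
		return slot, nil
	}
	return 0, errPoolExhausted
}

// releaseSlot frees the slot held by sandboxID; unknown IDs are a no-op.
func releaseSlot(stateDir, sandboxID string, poolSize int) error {
	for slot := 0; slot < poolSize; slot++ {
		if b, err := os.ReadFile(ownerPath(stateDir, slot)); err == nil && string(b) == sandboxID {
			return os.RemoveAll(filepath.Join(stateDir, strconv.Itoa(slot)))
		}
	}
	return nil
}
```

One simplification worth noting: there is a short window between the mkdir and the owner write in which a slot is claimed but unlabeled, so a production allocator would need to handle that case.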

Defaults: range_size=65536 UIDs per sandbox, pool_size=1000 concurrent
sandboxes. host_uid_base / host_gid_base must be configured explicitly.

Refs: google#13303
a7i force-pushed the feat/shim-userns branch from d68e671 to 14c1f2b May 28, 2026 00:04
a7i changed the title from "feat(shim): add force_user_namespace option for sandbox containers" to "feat(shim): add per-pod user namespace injection via annotation" May 28, 2026