From 23b3bb923a52bab340df2fa24fd77ee53707c769 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 09:08:47 +0300 Subject: [PATCH 1/3] spec: refuse to boot when Linux effective capabilities exceed allowlist (#79) Co-Authored-By: Claude Opus 4.7 --- .../79-capability-allowlist-boot-check.md | 417 ++++++++++++++++++ 1 file changed, 417 insertions(+) create mode 100644 docs/specs/architecture/79-capability-allowlist-boot-check.md diff --git a/docs/specs/architecture/79-capability-allowlist-boot-check.md b/docs/specs/architecture/79-capability-allowlist-boot-check.md new file mode 100644 index 0000000..84a1b23 --- /dev/null +++ b/docs/specs/architecture/79-capability-allowlist-boot-check.md @@ -0,0 +1,417 @@ +# Spec: refuse to boot when Linux effective capabilities exceed allowlist + +Ticket: [#79](https://github.com/pyrycode/pyrycode-relay/issues/79). Size S. Split from #42. Sibling of #77. + +## Files to read first + +- `cmd/pyrycode-relay/main.go` (whole file, 136 lines) — the only call site. Lines 45–51 hold the just-landed `CheckInsecureListenInProduction` wiring from #77; the new capability check slots in immediately after, before `startedAt := time.Now()` (line 53). Lines 62–127 show the listener-start branches the check must run *before*. +- `internal/relay/production.go` (whole file, 47 lines) — the canonical "boot-time refusal helper" pattern in this package, just merged via #77. The shape this spec commits to (single exported `Check…` returning `error`, sentinel branchable via `errors.Is`, injected test seam instead of mutating process state) mirrors it line-for-line; deviate only where Linux/non-Linux split forces it. +- `internal/relay/production_test.go` (whole file, 119 lines) — the table-driven, `t.Parallel()`-safe, `errors.Is`-against-sentinel test style for boot-check helpers. The fake-getter helper (`fakeGetenv` closure over `map[string]string`) is the exact pattern this spec uses for its `readStatus` seam. 
+- `internal/relay/tls.go:15-19` — the canonical sentinel-error declaration shape (`var ErrCacheDirInsecure = errors.New("relay: …")` plus a Go doc comment that names the contract). Mirror it. +- `internal/relay/tls.go:29-49` — the canonical wrapped-sentinel return shape (`fmt.Errorf("%w: %s …", sentinel, detail)`). The capability check uses this when including the offending bit list. +- `docs/specs/architecture/77-refuse-insecure-in-production.md` — the immediate precedent (sibling ticket, same parent #42). Same call-site, same sentinel-error contract, same testing approach. The "Why not a `Config` struct" section at line 111 covers the rationale for keeping each boot-check as its own exported `Check…` function rather than consolidating; that rationale carries over here. +- `docs/specs/architecture/9-autocert-tls.md` — the original `ErrCacheDirInsecure` rationale. The "fail loud, before any listener starts" framing is identical to this ticket's intent. +- `docs/specs/architecture/32-dockerfile-base-hardening.md:213-214` — the prior architect's note that "Binding `:80` and `:443` from uid 65532 is the host's problem (port mapping, `CAP_NET_BIND_SERVICE`, or a host-side proxy); the portable artifact stays uid-neutral." This determines the allowlist: `CAP_NET_BIND_SERVICE` is the *only* capability the relay's autocert mode legitimately needs. +- `Dockerfile:31-39` — the runtime image (distroless `:nonroot`, uid 65532, no `setcap` on the binary). Combined with the fly.toml `internal_port = 80 / 443` lines, this establishes that the running container needs `CAP_NET_BIND_SERVICE` (or equivalent host-side kernel sysctl) to bind privileged ports. +- `docs/PROJECT-MEMORY.md` § "Project-level conventions" — the "Sentinel errors + `errors.Is` branching", "Loud failure over silent correction", and "Tests live in the same package" rules apply directly. 
+ +## Context + +Container runtimes can grant Linux capabilities via `--cap-add`, capability bounding sets, or default Docker profiles. A misconfigured deploy that grants the relay extra capabilities (CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_DAC_OVERRIDE, …) escapes CI build scans and runs with more privilege than intended. + +The failure mode this ticket prevents is the silent-elevation class: a `--cap-add SYS_ADMIN` slipped into a deploy manifest does not fail the build; the container starts; `/healthz` returns 200; the relay processes traffic with capabilities it never needed. The defence must run *at boot, before any listener accepts a connection*, so a misconfigured deploy fails the health-check rather than serving traffic. + +The implementation is a Linux-only parse of `/proc/self/status`'s `CapEff:` hex mask, compared against an explicit allowlist. On non-Linux platforms (darwin dev runs, future Windows/BSD CI) the check is a no-op with a single startup log line — the build-tag split keeps both the production path (Linux container) and the developer path (darwin) honest without runtime conditionals. + +The allowlist is `{CAP_NET_BIND_SERVICE}` — the single capability the autocert mode legitimately needs to bind `:80` and `:443` from uid 65532 inside the distroless image. All other capabilities are unexpected. + +This ticket establishes the `_linux.go` / `_other.go` build-tag convention for the repo. No prior file uses that split (`grep -r '//go:build linux' internal/ cmd/` returns nothing as of `main` at 15dc00f). + +Capability bit positions are stable kernel ABI defined in `include/uapi/linux/capability.h`. The bit positions and symbolic names are frozen for the kernels the relay supports (Linux ≥ 4.x); future capability additions only ever extend the table, never renumber. 
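To make the mask arithmetic concrete, here is a small sketch of how a `CapEff` hex value decodes into bit positions. `setBits` is a hypothetical helper used only for illustration; the spec does not define it.

```go
package main

import (
	"fmt"
	"math/bits"
	"strconv"
)

// setBits returns the positions of the set bits in mask, lowest first.
// Hypothetical helper, for illustration only.
func setBits(mask uint64) []uint {
	var out []uint
	for mask != 0 {
		b := uint(bits.TrailingZeros64(mask))
		out = append(out, b)
		mask &^= 1 << b // clear the bit just recorded
	}
	return out
}

func main() {
	// The value as it would appear after "CapEff:" in /proc/self/status.
	mask, err := strconv.ParseUint("0000000000000400", 16, 64)
	if err != nil {
		panic(err)
	}
	fmt.Println(setBits(mask))         // only bit 10 (CAP_NET_BIND_SERVICE)
	fmt.Println(setBits(mask | 1<<21)) // bit 10 plus bit 21 (CAP_SYS_ADMIN)
}
```

A mask of `0x400` therefore passes the allowlist, while `0x200400` carries one extra bit that the check must reject.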
+ +Related: #9 (`ErrCacheDirInsecure` set the boot-time-refusal precedent); #77 (the immediate sibling: production-mode env contract and `--insecure-listen` boot refusal, just merged); #32 (distroless `:nonroot` hardening that motivates this check); #42 (parent, split into #77 / #78 / #79). + +## Design + +Three new files in `internal/relay/`, one new test file (cross-platform), one Linux-only test file, one ~7-line addition to `cmd/pyrycode-relay/main.go`. No existing code is refactored. + +### File layout + +| Path | Purpose | Build tag | +|---|---|---| +| `internal/relay/caps.go` | Sentinel, `Capability` type, `AllowedCapabilities`, name table, `parseCapEff`, `checkCapEffMask` (pure functions) | none (compiled on every GOOS) | +| `internal/relay/caps_linux.go` | `CheckCapabilities` Linux entry + `readProcSelfStatus` + `checkCapabilitiesWithReader` seam | `_linux.go` suffix (auto-applied for GOOS=linux) | +| `internal/relay/caps_other.go` | `CheckCapabilities` no-op + skip log | explicit `//go:build !linux` | +| `internal/relay/caps_test.go` | Tests for `parseCapEff` + `checkCapEffMask` (pure-function cases) | none | +| `internal/relay/caps_linux_test.go` | Tests for `checkCapabilitiesWithReader` (the seam) | `_linux_test.go` suffix | + +**Convention chosen:** `_linux.go` (filename suffix that Go's build system auto-applies for GOOS=linux) and `_other.go` (with explicit `//go:build !linux`). The `_other` suffix is not a recognised GOOS, so the explicit build tag is required; the `_linux` half is tag-free because the go tool derives the constraint from the filename. (Stdlib's `os/file_unix.go` is not a precedent for this: `_unix` is not a GOOS suffix, so that file carries an explicit build tag.) Future per-GOOS splits in this repo should follow the same `_<goos>.go` / `_other.go` pairing. + +### Sentinel error and allowlist (in `caps.go`, exported) + +```go +// ErrUnexpectedCapability is returned by CheckCapabilities when the +// process's effective Linux capability set (CapEff) contains a bit +// outside AllowedCapabilities. Stray capabilities (CAP_SYS_ADMIN, +// CAP_NET_ADMIN, etc.)
usually mean a misconfigured container runtime: +// --cap-add in a Docker run, an over-broad bounding set, or a +// default profile granting more than the relay needs. The relay +// refuses to start so the misconfiguration fails the deploy's +// health check rather than running with elevated privilege. +// +// Branchable via errors.Is. The wrapped error message identifies +// each unexpected bit (numeric position plus symbolic name where +// known) and the current allowlist. +var ErrUnexpectedCapability = errors.New("relay: process has unexpected effective Linux capabilities") + +// Capability is a Linux capability bit and its symbolic name. +type Capability struct { + Bit uint // 0-based bit position as defined in <linux/capability.h>. + Name string // e.g. "CAP_NET_BIND_SERVICE". Empty for unknown bits. +} + +// AllowedCapabilities is the explicit allowlist of effective Linux +// capabilities the relay legitimately needs. +// +// CAP_NET_BIND_SERVICE (bit 10) is included because the autocert mode +// binds :80 and :443 from uid 65532 inside the distroless image +// (Dockerfile, fly.toml). Hosts that drop this cap and instead lower +// net.ipv4.ip_unprivileged_port_start are fine: CapEff will be 0 and +// the check passes; hosts that grant it via Docker's default profile +// are also fine: CapEff has the bit set and it matches the allowlist. +// +// All other Linux capabilities are unexpected. To extend, add an entry +// and document the deployment shape that requires it. +var AllowedCapabilities = []Capability{ + {Bit: 10, Name: "CAP_NET_BIND_SERVICE"}, +} +``` + +**Decision: allowlist as a slice of typed records, not a bitmask constant.** Two reasons: (a) the public allowlist is operator-facing, so slog'ing it in the failure log needs `Name`s, not a hex int; (b) appending a new entry is a one-line diff with the symbolic name visible in code review, vs. a `1<<…` expression. + +```go +// … + if … >= len(capabilityNames) { + return "" + } + return capabilityNames[bit] +} + +// Indexed by bit position.
The exact names below are the kernel's +// CAP_* macros, with the CAP_ prefix kept. +var capabilityNames = []string{ + 0: "CAP_CHOWN", + 1: "CAP_DAC_OVERRIDE", + 2: "CAP_DAC_READ_SEARCH", + 3: "CAP_FOWNER", + 4: "CAP_FSETID", + 5: "CAP_KILL", + 6: "CAP_SETGID", + 7: "CAP_SETUID", + 8: "CAP_SETPCAP", + 9: "CAP_LINUX_IMMUTABLE", + 10: "CAP_NET_BIND_SERVICE", + 11: "CAP_NET_BROADCAST", + 12: "CAP_NET_ADMIN", + 13: "CAP_NET_RAW", + 14: "CAP_IPC_LOCK", + 15: "CAP_IPC_OWNER", + 16: "CAP_SYS_MODULE", + 17: "CAP_SYS_RAWIO", + 18: "CAP_SYS_CHROOT", + 19: "CAP_SYS_PTRACE", + 20: "CAP_SYS_PACCT", + 21: "CAP_SYS_ADMIN", + 22: "CAP_SYS_BOOT", + 23: "CAP_SYS_NICE", + 24: "CAP_SYS_RESOURCE", + 25: "CAP_SYS_TIME", + 26: "CAP_SYS_TTY_CONFIG", + 27: "CAP_MKNOD", + 28: "CAP_LEASE", + 29: "CAP_AUDIT_WRITE", + 30: "CAP_AUDIT_CONTROL", + 31: "CAP_SETFCAP", + 32: "CAP_MAC_OVERRIDE", + 33: "CAP_MAC_ADMIN", + 34: "CAP_SYSLOG", + 35: "CAP_WAKE_ALARM", + 36: "CAP_BLOCK_SUSPEND", + 37: "CAP_AUDIT_READ", + 38: "CAP_PERFMON", + 39: "CAP_BPF", + 40: "CAP_CHECKPOINT_RESTORE", +} +``` + +The table is unexported because callers only ever consume it via the error message produced by `checkCapEffMask`. If a future ticket needs programmatic lookup, exporting is a one-line change. + +### Pure parsing and check functions (in `caps.go`, package-private) + +```go +// parseCapEff extracts the CapEff: hex mask from /proc/self/status +// content. The expected line shape is: +// CapEff: 0000000000000400 +// (whitespace separator may be tab or spaces; the value is 1–16 hex +// digits with no 0x prefix). Returns a wrapped error if the line is +// missing or the value does not parse as hex. Does NOT wrap +// ErrUnexpectedCapability: malformed input is a separate failure mode +// from "capabilities exceed allowlist". +func parseCapEff(procStatus string) (uint64, error) { + // Iterate lines, look for "CapEff:" prefix, strconv.ParseUint(value, 16, 64).
+ // Return fmt.Errorf("relay: parsing /proc/self/status CapEff: %w", err) on failure. + // Return fmt.Errorf("relay: /proc/self/status missing CapEff line") if not found. +} + +// checkCapEffMask returns ErrUnexpectedCapability (wrapped with the +// offending bit list and the allowlist contents) if the mask has any +// bit set outside AllowedCapabilities. Returns nil otherwise. +// +// The wrapped error message format is: +// relay: process has unexpected effective Linux capabilities: \ +// CAP_SYS_ADMIN (bit 21), bit 63; allowlist: [CAP_NET_BIND_SERVICE (bit 10)]; \ +// drop with --cap-drop=ALL --cap-add=NET_BIND_SERVICE or equivalent +// +// Unknown bits (no entry in capabilityNames) are reported as "bit N". +func checkCapEffMask(mask uint64) error { + allowed := allowedMask() // bitwise OR of AllowedCapabilities[*].Bit + unexpected := mask &^ allowed // bits set in mask but not in allowed + if unexpected == 0 { + return nil + } + // Build the bit list, format the message, wrap ErrUnexpectedCapability. + return fmt.Errorf("%w: %s; allowlist: %s; drop with --cap-drop=ALL --cap-add=NET_BIND_SERVICE or equivalent", + ErrUnexpectedCapability, formatBits(unexpected), formatAllowlist()) +} +``` + +**Decision: report all unexpected bits, not just the first.** A misconfigured manifest often grants several caps in one breath (`--cap-add ALL` minus a handful); listing one bit at a time would make an operator restart the deploy loop N times. The error message lists every offending bit. + +**Decision: include the operator fix string in the error message itself.** Mirrors `ErrCacheDirInsecure`'s wrapped form (`"%w: %s (mode %o)"`) and the AC's "structured log line that names the unexpected capability and the operator fix." The fix is fixed text — `--cap-drop=ALL --cap-add=NET_BIND_SERVICE or equivalent` — because the operator's runtime might be Docker, Kubernetes (`securityContext.capabilities.drop`), or Podman; "or equivalent" covers them all without naming each. 
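The spec leaves the body of `parseCapEff` as comments. One way to fill it in is sketched below, keeping the signature and error strings from the spec. The body itself is an assumption rather than the merged implementation, and `strings.CutPrefix` needs Go 1.20 or newer.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseCapEff follows the contract described in the spec; the body is a
// sketch, not the code that was merged.
func parseCapEff(procStatus string) (uint64, error) {
	for _, line := range strings.Split(procStatus, "\n") {
		value, ok := strings.CutPrefix(line, "CapEff:")
		if !ok {
			continue
		}
		// TrimSpace drops the tab or spaces after the colon. Anything left
		// that is not pure hex (including an empty value) fails ParseUint.
		mask, err := strconv.ParseUint(strings.TrimSpace(value), 16, 64)
		if err != nil {
			return 0, fmt.Errorf("relay: parsing /proc/self/status CapEff: %w", err)
		}
		return mask, nil
	}
	return 0, errors.New("relay: /proc/self/status missing CapEff line")
}

func main() {
	status := "Name:\trelay\nCapEff:\t0000000000000400\nCapBnd:\tffffffffffffffff\n"
	mask, err := parseCapEff(status)
	fmt.Printf("mask=%#x err=%v\n", mask, err)

	_, err = parseCapEff("Name:\trelay\n")
	fmt.Println("missing line:", err)
}
```

Matching on the exact `CapEff:` prefix means the neighbouring `CapBnd:` and `CapPrm:` lines are never mistaken for it.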
+ +### Linux entry point (in `caps_linux.go`) + +```go +// CheckCapabilities reads /proc/self/status's CapEff line and returns +// ErrUnexpectedCapability (wrapped) if the effective capability set +// contains any bit outside AllowedCapabilities. Returns nil otherwise. +// +// Intended to be called from main after flag parse, before any listener +// is started. Read errors on /proc/self/status are returned wrapped +// (not as ErrUnexpectedCapability) so callers can distinguish "kernel +// /proc gone weird" from "operator handed us extra caps." +func CheckCapabilities() error { + return checkCapabilitiesWithReader(readProcSelfStatus) +} + +func readProcSelfStatus() (string, error) { + data, err := os.ReadFile("/proc/self/status") + if err != nil { + return "", fmt.Errorf("relay: reading /proc/self/status: %w", err) + } + return string(data), nil +} + +// checkCapabilitiesWithReader is the test seam. Tests pass a closure +// returning canned /proc/self/status contents; production passes +// readProcSelfStatus. +func checkCapabilitiesWithReader(readStatus func() (string, error)) error { + status, err := readStatus() + if err != nil { + return err + } + mask, err := parseCapEff(status) + if err != nil { + return err + } + return checkCapEffMask(mask) +} +``` + +**Decision: inject the *full status reader*, not just the parsed mask.** The seam at the `readStatus func() (string, error)` boundary lets the malformed-input test (case d) exercise `parseCapEff` end-to-end without a separate seam at the mask-injection layer. Injecting `mask uint64` directly would skip `parseCapEff` and leave AC case (d) uncovered by the seam. + +`checkCapabilitiesWithReader` is package-private (lowercase) — tests live in the same package per the PROJECT-MEMORY "Tests live in the same package" rule, so they can call it directly. External callers go through `CheckCapabilities`. 
+ +### Non-Linux no-op (in `caps_other.go`) + +```go +//go:build !linux + +package relay + +import ( + "log/slog" + "runtime" +) + +// CheckCapabilities is a no-op on non-Linux platforms. The /proc/self/status +// path the Linux variant reads exists only on Linux; on darwin/Windows/BSD +// there is no capability model to check, so the function returns nil and +// logs a one-line note that the check was skipped. The log emits via +// slog.Default(), so the caller's slog handler captures it alongside the +// rest of startup output. +func CheckCapabilities() error { + slog.Default().Info("skipping linux-only capability check", "goos", runtime.GOOS) + return nil +} +``` + +The Linux variant does *not* log a "running capability check" line on success — the absence of a log line is the success signal, mirroring `CheckInsecureListenInProduction` (which logs only on failure). The non-Linux variant logs because the "exactly once" skip note is the only signal the check ran at all. + +### Wiring in `cmd/pyrycode-relay/main.go` + +Insert immediately after the existing `CheckInsecureListenInProduction` block (lines 45–51 on `main` at 15dc00f), before `startedAt := time.Now()`: + +```go +if err := relay.CheckCapabilities(); err != nil { + logger.Error("refusing to start: unexpected Linux capabilities", + "err", err, + "fix", "drop extra capabilities (e.g. --cap-drop=ALL --cap-add=NET_BIND_SERVICE on docker, or securityContext.capabilities on kubernetes)") + os.Exit(2) +} +``` + +**Three details:** + +- **Exit code 2**, matching the existing flag-validation and production-mode guards. Exit 1 is for runtime listener failures; exit 2 is for boot-time configuration refusal. Distinct codes let ops dashboards split "deploy never started" from "deploy started and crashed." +- **No production-mode gating.** The AC is explicit: this check runs in every environment, including dev. 
The reasoning is that stray capabilities are *never* legitimate — a dev macOS box doesn't have Linux caps, a dev Linux container with extra caps is a misconfigured rehearsal of the production deploy. The check is uniformly on. +- **Placement: after `CheckInsecureListenInProduction`, before `relay.NewRegistry()`.** No side effects between flag-parse and `NewRegistry` on `main` as of 15dc00f, so the relative order between the two `Check…` calls is purely stylistic; lining them up adjacent makes a future "extract a `runStartupChecks` helper" refactor a one-block move. + +### Why not also check CapPrm / CapBnd / CapInh + +The PO body invites the architect to choose. The decision is **CapEff only**, for three reasons: + +1. **CapEff is the load-bearing set.** The kernel consults CapEff (not CapPrm) when authorising a privileged syscall. A capability in CapPrm but not CapEff is held in reserve — it can be raised into CapEff via `capset(2)` but cannot be used until then. The relay binary never calls `capset`, never `setuid`s, never `exec`s into a setuid binary; CapPrm bits that aren't in CapEff are inert. +2. **CapBnd / CapInh broaden the false-positive surface.** CapBnd is the bounding set inherited from the container runtime; it can legitimately be a superset of CapEff (the runtime says "you *could* hold these if you raise them"). Refusing to boot on a wide CapBnd would reject Kubernetes pods running under default policy where CapBnd is broader than CapEff by design. +3. **Minimum surface = minimum noise.** Adding the other three masks now would flag legitimate deployments and force operators to over-grant. If a future incident shows CapPrm/CapBnd matters (e.g. a Go runtime change makes `capset(2)` reachable from a goroutine), follow-up ticket adds the additional checks with a clear motivating failure. + +This decision is documented in the Go doc comment on `CheckCapabilities` so a future contributor extending the check has the rationale on hand. 
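The CapEff-only decision can be seen with a small sketch that reads every `Cap*` line from a status document and applies the allowlist to each. The fixture values are illustrative, not captured from a real deployment, and `capMasks` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// capMasks collects every "Cap*:" line of a status document into a map
// keyed by field name (CapInh, CapPrm, CapEff, CapBnd, ...).
// Hypothetical helper, for illustration only.
func capMasks(status string) map[string]uint64 {
	out := make(map[string]uint64)
	for _, line := range strings.Split(status, "\n") {
		name, value, ok := strings.Cut(line, ":")
		if !ok || !strings.HasPrefix(name, "Cap") {
			continue
		}
		if mask, err := strconv.ParseUint(strings.TrimSpace(value), 16, 64); err == nil {
			out[name] = mask
		}
	}
	return out
}

func main() {
	// Illustrative fixture: a narrow effective set under a wider bounding set.
	status := "Name:\trelay\n" +
		"CapInh:\t0000000000000000\n" +
		"CapPrm:\t0000000000000400\n" +
		"CapEff:\t0000000000000400\n" +
		"CapBnd:\t00000000a80425fb\n"

	const allowed uint64 = 1 << 10 // CAP_NET_BIND_SERVICE
	masks := capMasks(status)
	for _, field := range []string{"CapInh", "CapPrm", "CapEff", "CapBnd"} {
		extra := masks[field] &^ allowed
		fmt.Printf("%s extra bits outside allowlist: %#x\n", field, extra)
	}
}
```

In this fixture only `CapBnd` carries bits outside the allowlist, so a check that also inspected the bounding set would refuse a deployment whose effective set is exactly what the spec wants.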
+ +### Why not gate behind PYRYCODE_RELAY_PRODUCTION + +The PO body's last AC is explicit: this check has no production-mode predicate. Repeated here for the spec record: extra capabilities are never legitimate. A dev Linux box that picks up `--cap-add SYS_ADMIN` from a copy-pasted Docker Compose stanza fails fast the same way prod would, which is the whole point of running the dev loop and prod from the same binary. The check is unconditional. + +## Concurrency model + +None. `CheckCapabilities` runs on the main goroutine before any listener is started. No locks, no channels, no goroutines. The file read (`os.ReadFile("/proc/self/status")`) is a short synchronous open/read/close sequence rather than a single syscall; procfs renders the `Cap*` fields when the file is read, and nothing in the relay changes its own capability set in between. + +## Error handling + +Three failure modes, each with a distinct return shape: + +| Mode | Cause | Return | Branchable via | +|---|---|---|---| +| Allowlist violation | CapEff has bit outside `AllowedCapabilities` | `fmt.Errorf("%w: …", ErrUnexpectedCapability)` | `errors.Is(err, ErrUnexpectedCapability)` | +| Malformed `/proc/self/status` | Missing `CapEff:` line, non-hex value | `fmt.Errorf("relay: parsing /proc/self/status CapEff: …", …)` | none (treated as fatal by main) | +| `os.ReadFile` failed | `/proc` unmounted, permission denied | `fmt.Errorf("relay: reading /proc/self/status: %w", err)` | underlying `*os.PathError` (via `errors.As`) | + +All three lead to `os.Exit(2)` in main with the same log line; operators don't need to distinguish them (the `err` field carries the wrapped detail). The branchable sentinel exists so future callers (e.g. an integration test that wants to assert "this misconfiguration produces *this specific* refusal") can match on `ErrUnexpectedCapability` without string-matching the log. + +No retry, fallback, or partial-state recovery. Boot-time refusal is total. + +## Testing strategy + +All tests live alongside the implementation and are `t.Parallel()`-safe.
No `os.Setenv`, no `t.Setenv`, no touching real `/proc`. + +### `caps_test.go` — cross-platform pure-function tests + +These run on every GOOS the project builds for (linux, darwin, …). They exercise `parseCapEff` and `checkCapEffMask` directly. Helper: + +```go +// Build a /proc/self/status fixture with a custom CapEff line. +func statusFixture(capEffHex string) string { + return "Name:\trelay\nState:\tR (running)\nCapEff:\t" + capEffHex + "\nCapBnd:\tffffffffffffffff\n" +} +``` + +#### Test: `parseCapEff` value matrix + +Table-driven over `(input string, want uint64, wantErr bool)`: + +| input | want | wantErr | +|---|---|---| +| `statusFixture("0000000000000000")` | `0` | false | +| `statusFixture("0000000000000400")` | `0x400` (bit 10) | false | +| `statusFixture("00000000003fffffffff")` | `0x3fffffffff` (bits 0–37) | false | +| `statusFixture("ffffffffffffffff")` | `^uint64(0)` | false | +| `statusFixture("0")` (short value) | `0` | false | +| `"Name:\trelay\n"` (no CapEff line) | — | true | +| `statusFixture("not-hex")` | — | true | +| `""` (empty file) | — | true | +| `statusFixture("400 trailing junk")` | — | true (ParseUint rejects spaces in the field) | + +The malformed cases cover AC (d). The "short value" case documents that the kernel does not zero-pad to 16 digits in older procfs versions. + +#### Test: `checkCapEffMask` allowlist matrix + +| mask | want | +|---|---| +| `0` (empty CapEff) | `nil` | +| `0x400` (only CAP_NET_BIND_SERVICE) | `nil` | +| `0x200000` (only CAP_SYS_ADMIN, bit 21) | `errors.Is(err, ErrUnexpectedCapability)` | +| `0x400 \| 0x200000` (allowed + disallowed) | sentinel; error message names `CAP_SYS_ADMIN` | +| `1 << 63` (unknown bit) | sentinel; error message contains `"bit 63"` (no symbolic name) | +| `^uint64(0)` (all bits) | sentinel; error message lists allowlist contents | + +These cover AC (a), (b), (c). 
The unknown-bit row protects against a future kernel adding `CAP_CHECKPOINT_RESTORE+1` without an entry in `capabilityNames`. + +#### Test: `ErrUnexpectedCapability` is branchable + +A one-line test that `errors.Is(checkCapEffMask(0x200000), ErrUnexpectedCapability)` returns true. Locks in the `errors.Is` contract so a future "let me return `fmt.Errorf("relay: %s", ...)` instead" refactor breaks the test, not downstream callers. + +### `caps_linux_test.go` — Linux-only seam test + +Has the `_linux_test.go` suffix so it compiles only when building on Linux (CI on Ubuntu, plus any contributor on a Linux dev box). Exercises `checkCapabilitiesWithReader` with fake `readStatus` closures: + +| `readStatus` returns | want | +|---|---| +| `(statusFixture("0000000000000400"), nil)` (only allowed) | `nil` | +| `(statusFixture("0000000000200000"), nil)` (CAP_SYS_ADMIN) | `errors.Is(err, ErrUnexpectedCapability)` | +| `("Name:\trelay\n", nil)` (missing CapEff line) | wrapped error (not sentinel), no panic | +| `("", errors.New("io: synthetic"))` | wrapped read error, no panic | + +The seam test confirms (a) the parse + mask check is correctly threaded through `checkCapabilitiesWithReader`, and (b) read errors from the injected reader propagate without crashing — which is the production-relevant case for "what if /proc is unmounted in a stripped container." + +### What is NOT tested + +- The `main.go` wiring. Same rationale as #77's spec: `main` is a coordination function, and the unit tests on `CheckCapabilities` cover the behaviour. A fork-exec integration test for one `if err != nil { os.Exit(2) }` block is over-engineering. Code review of the diff is the gate. +- The non-Linux `CheckCapabilities` no-op. It is two lines (slog + return nil); the log message is operator-facing prose, not a downstream-machine-parsed format. Asserting on slog output requires capturing the handler, which is friction the line count does not earn. +- The real `/proc/self/status` on the CI runner. 
The seam test runs on Linux but with injected input; reading the *actual* `/proc/self/status` would couple the test to the CI runner's capability configuration (which varies between GitHub Actions standard runners and our distroless image). Production behaviour is verified by the failure-path's loud refusal — a CI environment whose CapEff contains an unexpected bit would cause `make test` to … pass (these are unit tests, not the binary running). The intended deployment-level check is "deploy fails health check," which is the explicit design. +- The exact format of the error message string. We test the *content* (sentinel branchable, contains "CAP_SYS_ADMIN", contains "bit 63") but not the exact whitespace or punctuation. The format is operator-facing prose; cosmetic changes should not break tests. + +## Open questions + +None. The allowlist contents (`CAP_NET_BIND_SERVICE` only), the build-tag file naming convention (`_linux.go` / `_other.go`), the choice to inject at the reader boundary rather than the mask boundary, the choice to check CapEff only, and the choice to keep the check unconditional (no production-mode gating) are all settled above. + +The capability-bit name table tracks `include/uapi/linux/capability.h` through `CAP_CHECKPOINT_RESTORE` (bit 40, kernel 5.9, ~2020). The relay's deployment kernel (Fly's Firecracker-on-Linux fleet) is currently 5.x+. Future kernel additions (none queued as of 2026-05) only ever extend the table — appending new entries is a one-line follow-up. + +## Security review + +**Verdict:** PASS + +This review was conducted per `architect/security-review.md` for the security-sensitive ticket #79. The spec was walked adversarially against each category. Findings: + +- **[Trust boundaries]** The single input crossing into `CheckCapabilities` is `/proc/self/status`, sourced from the kernel — not from a network attacker, not from operator-supplied input. 
The kernel's procfs is part of the relay's TCB; if it lies about CapEff, the kernel itself is compromised and the relay's other defences (TLS, header validation, …) are also moot. No untrusted parser surface. No allocation driven by attacker-controlled length (the status file is bounded by kernel formatting). No findings. +- **[Tokens, secrets, credentials]** N/A. The ticket reads no credential material. The CapEff value is a state observation, not a secret. The log line carries the capability *name* (`CAP_SYS_ADMIN`) and the bit position, not anything operator-sensitive. No findings. +- **[File operations]** Single read of `/proc/self/status`, hard-coded path. No path concatenation, no TOCTOU window (the file is read once, parsed in memory, the result is consumed in the same syscall sequence). No write, no unlink, no permission change. No findings. +- **[Subprocess / external command execution]** None. No `exec.Command`, no `os.StartProcess`, no shell-out. The "operator fix" string in the error message is data, not a command we run. No findings. +- **[Cryptographic primitives]** N/A. No RNG, no hash, no comparison of attacker-controlled values. The bit-mask comparison is a bitwise `&^` on two `uint64`s; no timing side-channel risk because neither input is attacker-controlled (one is the kernel's view of the process, the other is a compile-time constant). No findings. +- **[Network & I/O]** The whole purpose of the ticket is to refuse to open listeners when the capability state is wrong. The check runs before any `http.Server` is started. Listener timeouts, frame caps, header gates — all unchanged. No findings. +- **[Error messages, logs, telemetry]** The wrapped error message contains: (a) the sentinel prose, (b) the unexpected capability *names* (or `bit N` for unknown), (c) the allowlist contents, (d) a fixed-text operator fix. None of these are user-controlled — they are constants drawn from `capabilityNames` and the literal allowlist slice. 
The main-side log line emits `err` (the wrapped detail) and `fix` (a fixed string). The CapEff hex value itself is not logged separately; it lives only inside the formatted error message via the symbolic names that derive from it. **Worth flagging to the developer:** do not extend the log line to add a `cap_eff` field with the raw hex value. The symbolic names already convey what an operator needs; the raw mask is forensic detail that, if logged, would clutter centralised logging for every misconfigured boot and would also leak (via the bit positions) any future capabilities we add to the allowlist before we've documented them. No findings against the spec as written. +- **[Concurrency]** N/A. `CheckCapabilities` runs on the main goroutine before any goroutine is spawned. No shared mutable state. The `capabilityNames` slice and `AllowedCapabilities` slice are package-level vars; both are initialised once at program start and only read thereafter. No data race surface. No findings. +- **[Build-tag correctness]** The Linux-only entry point lives in `caps_linux.go` (auto-applied filename suffix); the no-op lives in `caps_other.go` with explicit `//go:build !linux`. The two are mutually exclusive: no GOOS triggers both, no GOOS triggers neither. A future contributor who adds a `caps_freebsd.go` (matching the `_freebsd.go` suffix) must also tighten the tag in `caps_other.go` to `//go:build !linux && !freebsd`; otherwise both files define `CheckCapabilities` on FreeBSD and the build fails with a duplicate declaration, which is a loud compile-time failure rather than a silent gap. The build-tag boundary itself is not a security boundary (it's a compile-time selector), but it is worth confirming the no-op is genuinely a no-op: it does nothing beyond logging and returns nil unconditionally. No findings. +- **[Threat model alignment]** `docs/threat-model.md` § Deploy treats "operator misconfiguration" as the dominant failure class for a single-machine, internet-exposed relay. This ticket adds an in-binary defence against one specific shape of that class (over-broad capability grant).
It complements `ErrCacheDirInsecure` (mode-bit drift), `ErrInsecureListenInProduction` (transport drift, just merged via #77), and a future #78 (uid-0 in production). All four share the "refuse to start; fail the health check; never serve traffic in this configuration" shape. No findings. + +**One SHOULD FIX flagged inline** (Error messages, logs, telemetry): the developer must not extend the log line to include the raw CapEff hex value. The spec is written that way; code review must double-check. + +**Reviewer:** architect (self-review per `architect/security-review.md`) +**Date:** 2026-05-13 From ccd45d7ae400fc3060d3c44893c591c672e0f0c4 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 09:12:20 +0300 Subject: [PATCH 2/3] feat(relay): refuse to boot when Linux effective capabilities exceed allowlist (#79) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds a Linux-only boot-time check that parses /proc/self/status's CapEff hex mask and aborts if any bit outside an explicit allowlist (CAP_NET_BIND_SERVICE only) is set. On non-Linux platforms the check is a no-op that logs a single skip line at startup. Build-tag split via the new _linux.go / _other.go convention. Wired into cmd/pyrycode-relay/main.go after flag-parse, before any listener starts; exit code 2 matches existing boot-time configuration refusals. No production-mode gating — stray capabilities are never legitimate. 
Co-Authored-By: Claude Opus 4.7 --- cmd/pyrycode-relay/main.go | 7 ++ internal/relay/caps.go | 184 ++++++++++++++++++++++++++++++ internal/relay/caps_linux.go | 46 ++++++++ internal/relay/caps_linux_test.go | 62 ++++++++++ internal/relay/caps_other.go | 26 +++++ internal/relay/caps_test.go | 183 +++++++++++++++++++++++++++++ 6 files changed, 508 insertions(+) create mode 100644 internal/relay/caps.go create mode 100644 internal/relay/caps_linux.go create mode 100644 internal/relay/caps_linux_test.go create mode 100644 internal/relay/caps_other.go create mode 100644 internal/relay/caps_test.go diff --git a/cmd/pyrycode-relay/main.go b/cmd/pyrycode-relay/main.go index c9464cd..675a7d8 100644 --- a/cmd/pyrycode-relay/main.go +++ b/cmd/pyrycode-relay/main.go @@ -50,6 +50,13 @@ func main() { os.Exit(2) } + if err := relay.CheckCapabilities(); err != nil { + logger.Error("refusing to start: unexpected Linux capabilities", + "err", err, + "fix", "drop extra capabilities (e.g. --cap-drop=ALL --cap-add=NET_BIND_SERVICE on docker, or securityContext.capabilities on kubernetes)") + os.Exit(2) + } + startedAt := time.Now() reg := relay.NewRegistry() diff --git a/internal/relay/caps.go b/internal/relay/caps.go new file mode 100644 index 0000000..be9a827 --- /dev/null +++ b/internal/relay/caps.go @@ -0,0 +1,184 @@ +package relay + +import ( + "errors" + "fmt" + "strconv" + "strings" +) + +// ErrUnexpectedCapability is returned by CheckCapabilities when the +// process's effective Linux capability set (CapEff) contains a bit +// outside AllowedCapabilities. Stray capabilities (CAP_SYS_ADMIN, +// CAP_NET_ADMIN, etc.) usually mean a misconfigured container runtime +// — --cap-add in a Docker run, an over-broad bounding set, or a +// default profile granting more than the relay needs. The relay +// refuses to start so the misconfiguration fails the deploy's +// health check rather than running with elevated privilege. +// +// Branchable via errors.Is. 
The wrapped error message identifies +// each unexpected bit (numeric position plus symbolic name where +// known) and the current allowlist. +var ErrUnexpectedCapability = errors.New("relay: process has unexpected effective Linux capabilities") + +// Capability is a Linux capability bit and its symbolic name. +type Capability struct { + // Bit is the 0-based bit position as defined in . + Bit uint + // Name is the kernel CAP_* macro (e.g. "CAP_NET_BIND_SERVICE"). + // Empty for unknown bits. + Name string +} + +// AllowedCapabilities is the explicit allowlist of effective Linux +// capabilities the relay legitimately needs. +// +// CAP_NET_BIND_SERVICE (bit 10) is included because the autocert mode +// binds :80 and :443 from uid 65532 inside the distroless image +// (Dockerfile, fly.toml). Hosts that drop this cap and instead lower +// net.ipv4.ip_unprivileged_port_start are fine — CapEff will be 0 and +// the check passes; hosts that grant it via Docker's default profile +// are also fine — CapEff has the bit set and it matches the allowlist. +// +// All other Linux capabilities are unexpected. To extend, add an entry +// and document the deployment shape that requires it. +var AllowedCapabilities = []Capability{ + {Bit: 10, Name: "CAP_NET_BIND_SERVICE"}, +} + +// capabilityNames maps a Linux capability bit position to its kernel +// CAP_* macro name. Indexed by bit position; tracks +// include/uapi/linux/capability.h through CAP_CHECKPOINT_RESTORE +// (bit 40, kernel 5.9). Bit positions are stable kernel ABI; new +// capabilities only ever append. 
+var capabilityNames = []string{ + 0: "CAP_CHOWN", + 1: "CAP_DAC_OVERRIDE", + 2: "CAP_DAC_READ_SEARCH", + 3: "CAP_FOWNER", + 4: "CAP_FSETID", + 5: "CAP_KILL", + 6: "CAP_SETGID", + 7: "CAP_SETUID", + 8: "CAP_SETPCAP", + 9: "CAP_LINUX_IMMUTABLE", + 10: "CAP_NET_BIND_SERVICE", + 11: "CAP_NET_BROADCAST", + 12: "CAP_NET_ADMIN", + 13: "CAP_NET_RAW", + 14: "CAP_IPC_LOCK", + 15: "CAP_IPC_OWNER", + 16: "CAP_SYS_MODULE", + 17: "CAP_SYS_RAWIO", + 18: "CAP_SYS_CHROOT", + 19: "CAP_SYS_PTRACE", + 20: "CAP_SYS_PACCT", + 21: "CAP_SYS_ADMIN", + 22: "CAP_SYS_BOOT", + 23: "CAP_SYS_NICE", + 24: "CAP_SYS_RESOURCE", + 25: "CAP_SYS_TIME", + 26: "CAP_SYS_TTY_CONFIG", + 27: "CAP_MKNOD", + 28: "CAP_LEASE", + 29: "CAP_AUDIT_WRITE", + 30: "CAP_AUDIT_CONTROL", + 31: "CAP_SETFCAP", + 32: "CAP_MAC_OVERRIDE", + 33: "CAP_MAC_ADMIN", + 34: "CAP_SYSLOG", + 35: "CAP_WAKE_ALARM", + 36: "CAP_BLOCK_SUSPEND", + 37: "CAP_AUDIT_READ", + 38: "CAP_PERFMON", + 39: "CAP_BPF", + 40: "CAP_CHECKPOINT_RESTORE", +} + +// capabilityName returns the symbolic name for a Linux capability bit +// (e.g. "CAP_SYS_ADMIN" for bit 21), or the empty string if unknown. +func capabilityName(bit uint) string { + if int(bit) >= len(capabilityNames) { + return "" + } + return capabilityNames[bit] +} + +// allowedMask returns the bitwise OR of every Bit in AllowedCapabilities. +func allowedMask() uint64 { + var m uint64 + for _, c := range AllowedCapabilities { + m |= uint64(1) << c.Bit + } + return m +} + +// parseCapEff extracts the CapEff: hex mask from /proc/self/status +// content. The expected line shape is: +// +// CapEff:\t0000000000000400 +// +// The whitespace separator may be tab or spaces; the value is 0–16 hex +// digits with no 0x prefix. Returns a wrapped error if the line is +// missing or the value does not parse as hex. Does NOT wrap +// ErrUnexpectedCapability — malformed input is a separate failure mode +// from "capabilities exceed allowlist". 
+func parseCapEff(procStatus string) (uint64, error) { + for _, line := range strings.Split(procStatus, "\n") { + rest, ok := strings.CutPrefix(line, "CapEff:") + if !ok { + continue + } + value := strings.TrimSpace(rest) + mask, err := strconv.ParseUint(value, 16, 64) + if err != nil { + return 0, fmt.Errorf("relay: parsing /proc/self/status CapEff %q: %w", value, err) + } + return mask, nil + } + return 0, fmt.Errorf("relay: /proc/self/status missing CapEff line") +} + +// checkCapEffMask returns ErrUnexpectedCapability (wrapped with the +// offending bit list and the allowlist contents) if mask has any bit +// set outside AllowedCapabilities. Returns nil otherwise. +// +// All unexpected bits are reported, not just the first — a misconfigured +// manifest often grants several caps in one breath. Unknown bits (no +// entry in capabilityNames) are reported as "bit N". +func checkCapEffMask(mask uint64) error { + unexpected := mask &^ allowedMask() + if unexpected == 0 { + return nil + } + return fmt.Errorf("%w: %s; allowlist: %s; drop with --cap-drop=ALL --cap-add=NET_BIND_SERVICE or equivalent", + ErrUnexpectedCapability, formatBits(unexpected), formatAllowlist()) +} + +// formatBits renders the set bits of mask as a comma-separated list of +// "CAP_NAME (bit N)" or "bit N" if the bit has no entry in capabilityNames. +// Bits are listed in ascending order. +func formatBits(mask uint64) string { + var parts []string + for bit := uint(0); bit < 64; bit++ { + if mask&(uint64(1)<") keeps the operator-visible string identical without growing +// the allowlist for a startup-only diagnostic. 
+func CheckCapabilities() error { + slog.Default().Info(fmt.Sprintf("skipping linux-only capability check on %s", runtime.GOOS)) + return nil +} diff --git a/internal/relay/caps_test.go b/internal/relay/caps_test.go new file mode 100644 index 0000000..56aa687 --- /dev/null +++ b/internal/relay/caps_test.go @@ -0,0 +1,183 @@ +package relay + +import ( + "errors" + "strings" + "testing" +) + +// statusFixture builds a /proc/self/status fixture with a custom CapEff line. +func statusFixture(capEffHex string) string { + return "Name:\trelay\nState:\tR (running)\nCapEff:\t" + capEffHex + "\nCapBnd:\tffffffffffffffff\n" +} + +func TestParseCapEff_ValueMatrix(t *testing.T) { + t.Parallel() + + cases := []struct { + name string + input string + want uint64 + wantErr bool + }{ + {name: "zero mask", input: statusFixture("0000000000000000"), want: 0}, + {name: "only CAP_NET_BIND_SERVICE", input: statusFixture("0000000000000400"), want: 0x400}, + {name: "bits 0-37", input: statusFixture("0000003fffffffff"), want: 0x3fffffffff}, + {name: "all bits", input: statusFixture("ffffffffffffffff"), want: ^uint64(0)}, + {name: "short value", input: statusFixture("0"), want: 0}, + {name: "missing CapEff line", input: "Name:\trelay\n", wantErr: true}, + {name: "non-hex value", input: statusFixture("not-hex"), wantErr: true}, + {name: "empty file", input: "", wantErr: true}, + {name: "trailing junk after hex", input: statusFixture("400 trailing junk"), wantErr: true}, + } + + for _, tc := range cases { + tc := tc + t.Run(tc.name, func(t *testing.T) { + t.Parallel() + got, err := parseCapEff(tc.input) + if tc.wantErr { + if err == nil { + t.Fatalf("parseCapEff(%q) err=nil, want error", tc.input) + } + return + } + if err != nil { + t.Fatalf("parseCapEff(%q) unexpected err: %v", tc.input, err) + } + if got != tc.want { + t.Errorf("parseCapEff(%q) = %#x, want %#x", tc.input, got, tc.want) + } + }) + } +} + +func TestCheckCapEffMask_AllowlistMatrix(t *testing.T) { + t.Parallel() + + cases := 
[]struct { + name string + mask uint64 + wantSentinel bool + wantSubstrings []string + }{ + { + name: "empty CapEff is nil", + mask: 0, + wantSentinel: false, + }, + { + name: "only CAP_NET_BIND_SERVICE is nil", + mask: 0x400, + wantSentinel: false, + }, + { + name: "CAP_SYS_ADMIN is sentinel", + mask: uint64(1) << 21, + wantSentinel: true, + wantSubstrings: []string{"CAP_SYS_ADMIN", "bit 21"}, + }, + { + name: "allowed plus disallowed reports only disallowed", + mask: 0x400 | (uint64(1) << 21), + wantSentinel: true, + wantSubstrings: []string{"CAP_SYS_ADMIN"}, + }, + { + name: "unknown bit 63 is sentinel", + mask: uint64(1) << 63, + wantSentinel: true, + wantSubstrings: []string{"bit 63"}, + }, + { + name: "all bits names allowlist contents", + mask: ^uint64(0), + wantSentinel: true, + wantSubstrings: []string{"CAP_NET_BIND_SERVICE"}, + }, + } + + for _, tc := range cases { + tc := tc + t.Run(tc.name, func(t *testing.T) { + t.Parallel() + err := checkCapEffMask(tc.mask) + if !tc.wantSentinel { + if err != nil { + t.Fatalf("checkCapEffMask(%#x) = %v, want nil", tc.mask, err) + } + return + } + if !errors.Is(err, ErrUnexpectedCapability) { + t.Fatalf("checkCapEffMask(%#x) = %v, want errors.Is(err, ErrUnexpectedCapability)", tc.mask, err) + } + msg := err.Error() + for _, want := range tc.wantSubstrings { + if !strings.Contains(msg, want) { + t.Errorf("checkCapEffMask(%#x) error %q missing substring %q", tc.mask, msg, want) + } + } + }) + } +} + +func TestCheckCapEffMask_DisallowedReportsOnlyOffendingBits(t *testing.T) { + t.Parallel() + + // CAP_NET_BIND_SERVICE is allowed; ensure its name does not appear in + // the unexpected-bits portion when combined with a disallowed bit. 
+ err := checkCapEffMask(0x400 | (uint64(1) << 21)) + if !errors.Is(err, ErrUnexpectedCapability) { + t.Fatalf("got %v, want sentinel", err) + } + msg := err.Error() + // "CAP_NET_BIND_SERVICE" appears in the allowlist portion; the + // unexpected-bits portion (before "; allowlist:") must not contain it. + idx := strings.Index(msg, "; allowlist:") + if idx < 0 { + t.Fatalf("error %q missing allowlist section", msg) + } + if strings.Contains(msg[:idx], "CAP_NET_BIND_SERVICE") { + t.Errorf("unexpected-bits section %q must not name an allowed cap", msg[:idx]) + } +} + +func TestErrUnexpectedCapability_IsBranchable(t *testing.T) { + t.Parallel() + + if !errors.Is(ErrUnexpectedCapability, ErrUnexpectedCapability) { + t.Fatal("ErrUnexpectedCapability should errors.Is itself") + } + + err := checkCapEffMask(uint64(1) << 21) + if !errors.Is(err, ErrUnexpectedCapability) { + t.Errorf("returned error %v should satisfy errors.Is(err, ErrUnexpectedCapability)", err) + } +} + +func TestCapabilityName(t *testing.T) { + t.Parallel() + + if got := capabilityName(10); got != "CAP_NET_BIND_SERVICE" { + t.Errorf("capabilityName(10) = %q, want CAP_NET_BIND_SERVICE", got) + } + if got := capabilityName(21); got != "CAP_SYS_ADMIN" { + t.Errorf("capabilityName(21) = %q, want CAP_SYS_ADMIN", got) + } + if got := capabilityName(63); got != "" { + t.Errorf("capabilityName(63) = %q, want empty (unknown bit)", got) + } + if got := capabilityName(999); got != "" { + t.Errorf("capabilityName(999) = %q, want empty (out of range)", got) + } +} + +func TestAllowedMask(t *testing.T) { + t.Parallel() + + // AllowedCapabilities contains CAP_NET_BIND_SERVICE (bit 10). 
+ want := uint64(1) << 10 + if got := allowedMask(); got != want { + t.Errorf("allowedMask() = %#x, want %#x", got, want) + } +} From ebc011fab58807e87637aa19dd80722b13482a50 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 09:18:34 +0300 Subject: [PATCH 3/3] docs: Linux capability allowlist & build-tag split convention (#79) Adds feature doc for the boot-time CapEff check, ADR-0009 capturing the new _.go / _other.go build-tag convention, and a per-ticket codebase note. Co-Authored-By: Claude Opus 4.7 --- docs/knowledge/INDEX.md | 2 + docs/knowledge/codebase/79.md | 49 ++++++++++++ ...009-build-tag-platform-split-convention.md | 45 +++++++++++ .../features/capability-allowlist.md | 75 +++++++++++++++++++ 4 files changed, 171 insertions(+) create mode 100644 docs/knowledge/codebase/79.md create mode 100644 docs/knowledge/decisions/0009-build-tag-platform-split-convention.md create mode 100644 docs/knowledge/features/capability-allowlist.md diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index e7e15f1..e94d057 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Features +- [Linux capability allowlist (boot-time refusal)](features/capability-allowlist.md) — relay parses `/proc/self/status`'s `CapEff:` hex mask at boot and refuses to start (exit 2) if any bit is set outside `AllowedCapabilities` (currently `{CAP_NET_BIND_SERVICE}` only, motivated by autocert binding `:80`/`:443` from uid 65532 in the distroless image). Exported sentinel `ErrUnexpectedCapability` is branchable via `errors.Is`; the wrapped error names every offending bit symbolically (`CAP_SYS_ADMIN (bit 21)` or `bit 63` for unknown), lists the allowlist contents, and embeds the operator fix string. 
`CapEff` only — `CapPrm/CapBnd/CapInh` would broaden false-positives (legitimate K8s default policy grants wide CapBnd) without adding load-bearing protection (relay never `capset(2)`s). Linux/non-Linux split at compile time via the new `_.go` / `_other.go` build-tag convention (see ADR-0009); non-Linux GOOS logs one skip line and returns nil. Unconditional — no production-mode gating, no env-var bypass, because stray capabilities are never legitimate. Reader-boundary test seam (`func() (string, error)`) exercises the parse + mask check end-to-end without touching real `/proc`. Joins the boot-time-refusal sentinel family (#9, #77, #79; future #78) (#79). - [Production-mode contract & `--insecure-listen` startup refusal](features/production-mode.md) — `PYRYCODE_RELAY_PRODUCTION=1` env-var contract (exact-string match, lazy read via injected getter, mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` shape from #64/#65) plus the first boot-time check that consumes it: `relay.CheckInsecureListenInProduction` returns the exported `ErrInsecureListenInProduction` sentinel (branchable via `errors.Is`) when production mode is on AND `--insecure-listen` is set, wired into `cmd/pyrycode-relay/main.go` after flag-parse with `os.Exit(2)` (config-rejected-at-boot, distinct from runtime-failure exit 1) and structured log fields naming the env var (name only, never value) and a one-line `fix` listing both valid resolutions. `IsProductionMode` exported so sibling startup checks (#78 = uid-0) compose on the same predicate without re-reading the env var. Test seam is a `func(string) string` getter (smallest possible — no interface, no struct, no package-level var) so the 2×2 AC matrix and the value-space matrix run under `t.Parallel()` + `-race` without mutating process env (#77). 
- [Fly.io deploy](features/fly-deploy.md) — production host wiring: `fly.toml` declares TCP-passthrough on `:80`/`:443` (no Fly HTTP proxy, no Fly-managed certs) so TLS keeps terminating in the relay via autocert (#9), persistent Fly volume `relay_autocert` mounted at `/var/lib/relay/autocert`, and a single-machine hard cap encoded via `min_machines_running=1` + `auto_start_machines=false` + `auto_stop_machines="off"` + `[deploy] strategy="immediate"` (Fly Apps v2 has no `max_machines` key; the in-binary `PYRYCODE_RELAY_SINGLE_INSTANCE` self-check from #65 is the backstop). CI `deploy` job in `.github/workflows/ci.yml` runs `flyctl deploy --remote-only` on push to `main`, gated by branch-condition + `needs: [test, security, image-scan]` + `permissions: contents: read` so `FLY_API_TOKEN` is structurally unreachable from PR code; `superfly/flyctl-actions/setup-flyctl` pinned by commit SHA with `# Tracks:` comment (same convention as #68 / #41). Dedicated IPv4 is required (not optional) for autocert's HTTP-01 challenge; TCP passthrough preserves the real socket peer IP that #34's rate limiter reads. `__REGION__` / `__DOMAIN__` ship as placeholders that fail loud on first deploy (#38). - [Connection-count gauges](features/connection-count-gauges.md) — `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones` exposed via a pull-based `prometheus.Collector` reading `Registry.Counts()` on each scrape; zero edits to `registry.go`; scalar (no labels) by design — `{server="..."}` would carry the attacker-influenced `x-pyrycode-server` header onto the metrics surface, which threat-model § Log hygiene forbids; stale grace-expiry fires can't move the gauge because the pointer-identity guard (ADR-0006) keeps the maps unchanged and the gauge IS the map size; race-tested against 16 mutator goroutines + a tight-loop scraper under `-race`. First collector wired into the #59 seam (#61). @@ -22,6 +23,7 @@ One-line pointers into the evergreen knowledge base. 
Newest entries at the top o ## Decisions +- [ADR-0009: Build-tag platform-split convention — `_.go` / `_other.go`](decisions/0009-build-tag-platform-split-convention.md) — first per-GOOS split in the repo (introduced by #79's `caps_linux.go` / `caps_other.go`); locks in the asymmetric pair (tag-free `_.go` rides on Go's filename-suffix auto-constraint; explicit `//go:build !` on `_other.go` because `other` is not a recognised GOOS); test files follow the same suffix rules; rejects single-file `runtime.GOOS` switching (compiles dead `/proc` code into darwin builds, adds untaken branch on every Linux startup) and `_unix.go` (would route darwin dev runs into the `/proc` reader); extension shape: a future `_freebsd.go` claims FreeBSD via its filename and the `_other.go` tag is tightened to `!linux && !freebsd` in the same change; mirrors stdlib (`os/file_plan9.go` tag-free, `os/file_posix.go` tag-explicit) so the convention is self-documenting to a Go-literate reader. - [ADR-0008: Adopt `github.com/prometheus/client_golang` for relay metrics](decisions/0008-prometheus-client-adoption.md) — first non-stdlib direct dep since #15 (`nhooyr.io/websocket`); alternatives (hand-rolled text format, `VictoriaMetrics/metrics`, `expvar`+sidecar) rejected; transitive set enumerated and pinned via `go.sum`; scope of use bans `DefaultRegisterer`, process/runtime collectors, `slog` adapter, Prometheus router/config helpers; pattern established: ADR-before-import for any next direct dep. - [ADR-0007: `WSConn.CloseWithCode` for active-conn application close codes](decisions/0007-wsconn-closewithcode-for-active-conn.md) — extends ADR-0005 for the post-claim window: heartbeat (#7) needs `1011 "heartbeat timeout"` on a live WSConn; `Close()` delegates to `CloseWithCode(StatusNormalClosure, "")`, both share `closeOnce`. 
- [ADR-0006: Grace window IS the reclaim path](decisions/0006-grace-period-as-reclaim-path.md) — `ClaimServer` during a pending grace timer succeeds (not `4409`); pointer-identity wrapper defends against stale `time.AfterFunc` fires after `Stop()`. diff --git a/docs/knowledge/codebase/79.md b/docs/knowledge/codebase/79.md new file mode 100644 index 0000000..f8c37c8 --- /dev/null +++ b/docs/knowledge/codebase/79.md @@ -0,0 +1,49 @@ +# Ticket #79 — refuse to boot when Linux effective capabilities exceed allowlist + +Adds a Linux-only boot-time check that parses `/proc/self/status`'s `CapEff:` hex mask and refuses to start (exit 2) when any bit is set outside an explicit allowlist (`CAP_NET_BIND_SERVICE` only). A misconfigured container runtime (`--cap-add SYS_ADMIN`, default Docker profile, over-broad bounding set) fails the deploy's health check rather than serving traffic with elevated privilege. On non-Linux GOOS the check is a no-op with a single startup log line. Split from #42; sibling of #77 (production-mode `--insecure-listen`) and #78 (queued, uid-0-in-production). + +## Implementation + +- **`internal/relay/caps.go` (new, 184 lines)** — cross-platform code, no build tag: + - `ErrUnexpectedCapability` sentinel — branchable via `errors.Is`. Wrapped message names each offending bit (`CAP_NAME (bit N)` or `bit N` if unknown), lists the allowlist contents, and embeds the operator fix string. + - `Capability{Bit uint; Name string}` exported record type; `AllowedCapabilities []Capability` exported allowlist (currently one entry: `CAP_NET_BIND_SERVICE`). + - `capabilityNames []string` (package-private) — index-by-bit table covering `CAP_CHOWN` through `CAP_CHECKPOINT_RESTORE` (bits 0–40, kernel 5.9+). Bit positions are stable kernel ABI; new caps only append. + - Pure helpers: `parseCapEff`, `checkCapEffMask`, `capabilityName`, `allowedMask`, `formatBits`, `formatAllowlist`. +- **`internal/relay/caps_linux.go` (new, 46 lines, no build tag)** — Linux entry point. 
`CheckCapabilities` delegates to `checkCapabilitiesWithReader(readProcSelfStatus)`; the seam takes `func() (string, error)` so tests inject canned `/proc/self/status` content without touching real `/proc`. Reader-boundary injection (not mask-boundary) so the malformed-input test still exercises `parseCapEff` end-to-end. +- **`internal/relay/caps_other.go` (new, 26 lines, `//go:build !linux`)** — no-op variant. Logs `"skipping linux-only capability check on "` via `slog.Default().Info`, returns nil. GOOS lives in the message string rather than as a structured field because the log-key allowlist (`log_allowlist.go`, #36) gates structured keys and a startup-only diagnostic doesn't earn a new key. +- **`internal/relay/caps_test.go` (new, 183 lines)** — cross-platform pure-function tests. Covers AC (a)–(d) on `parseCapEff` (9-row value matrix) and `checkCapEffMask` (6-row allowlist matrix, plus a regression test that the unexpected-bits section of the message does not name an allowlisted cap, plus an `errors.Is` branchability assertion). +- **`internal/relay/caps_linux_test.go` (new, 62 lines, `_linux_test.go` suffix)** — Linux-only seam tests on `checkCapabilitiesWithReader`. Four-row matrix: allowed-only → nil; CAP_SYS_ADMIN → sentinel; missing `CapEff:` → wrapped parse error (not sentinel); reader error → propagated wrapped (not sentinel). +- **`cmd/pyrycode-relay/main.go` (+7 lines)** — call inserted immediately after the #77 `CheckInsecureListenInProduction` block, before `relay.NewRegistry()`. Exit code 2; structured log fields are `err` and `fix` (Docker + Kubernetes resolutions). No `cap_eff` raw-hex field — the wrapped error names every bit symbolically already. + +## Acceptance criteria — verification map + +- **AC-1** (sentinel + allowlist exported; message names unexpected cap + allowlist): `ErrUnexpectedCapability` on `caps.go:22`, `AllowedCapabilities` on `caps.go:45-47`, message format in `checkCapEffMask` on `caps.go:149-156`. 
+- **AC-2** (Linux check parses `CapEff:`, returns sentinel; non-Linux no-op + skip log; build-tag split): `caps_linux.go` (no tag, auto-applies linux) and `caps_other.go` (`//go:build !linux`). Skip log format in `caps_other.go:24`. +- **AC-3** (wired in `main.go` after flag parse, before listener, with fail-fast + structured log naming fix): `main.go:53-58`. +- **AC-4** (unit tests cover empty / allowlisted / out-of-allowlist / malformed; test seam, no real /proc): `caps_test.go` (cross-platform) + `caps_linux_test.go` (seam). +- **AC-5** (no production-mode gating): no reference to `IsProductionMode` or `PYRYCODE_RELAY_PRODUCTION` in the new files; `main.go:53` is unconditional. + +## Patterns established + +- **`_.go` / `_other.go` build-tag split convention.** First per-GOOS split in this repo. Captured as [ADR-0009](../decisions/0009-build-tag-platform-split-convention.md). Future per-GOOS implementations in this repo follow the same shape (tag-free `_.go`, explicit `//go:build !` on `_other.go`). Tests follow the same suffix rules (`*__test.go` auto-constrains). +- **Reader-boundary test seam for parse-and-check pipelines.** The seam takes `func() (string, error)` returning raw file content, not the parsed mask. This keeps the malformed-input test exercising `parseCapEff` end-to-end rather than skipping the parser. Reach for the *coarsest* seam that still avoids the system dependency (here: real `/proc`); finer-grained seams reduce coverage. +- **Allowlist as a slice of typed records, not a bitmask constant.** The allowlist is operator-facing (its contents are formatted by name into the failure log) and append-friendly in code review. The bitmask form (`allowedMask()`) is derived state, computed inside the check. Apply the same shape to any future allowlist that surfaces in operator-visible output. 
+- **Boot-time-refusal sentinel family.** With #79 the family is four: `ErrCacheDirInsecure` (#9), `ErrInsecureListenInProduction` (#77), `ErrUnexpectedCapability` (#79), future #78. All four share the same wiring shape: `if err := relay.CheckXxx(...); err != nil { logger.Error("refusing to start: ...", "err", err, "fix", "..."); os.Exit(2) }`. When the count reaches ~5 and the boilerplate becomes a real cost, a follow-up consolidates them into a single `Config.Validate()` returning a multi-error — not before (per #77's "do not pre-build that abstraction"). + +## Lessons learned + +- **Inject the seam at the coarsest boundary that still avoids the system dependency.** The first design instinct is to inject `mask uint64` directly — it's the smallest API. But that skips `parseCapEff` and leaves AC (d) (malformed `/proc/self/status`) uncovered by the seam. Pushing the seam out to the reader (`func() (string, error)`) means one fake exercises parse + mask-check together. Rule for parse-and-check pipelines: the seam goes at the I/O edge, not at the inter-stage boundary. +- **Compile-time platform splits beat runtime `runtime.GOOS` branches.** A single-file `if runtime.GOOS == "linux" { ... }` would have worked, but it forces every darwin build to compile the `/proc` reader (dead code on darwin) and adds an untaken runtime branch to every Linux startup. The build-tag split keeps each binary lean and, more importantly, makes "this code path only exists on Linux" a property the build system enforces at compile time. Apply the same instinct to any future "this only makes sense on container hosts" gate. +- **Pick the minimum capability set; expand only on a motivating failure.** The spec walked through CapPrm / CapBnd / CapInh and rejected each because (a) the relay never `capset(2)`s, so CapPrm is inert, and (b) Kubernetes default policy legitimately grants a wide CapBnd. 
Checking all four would cause false-positive boot refusals on legitimate K8s deployments and force operators to over-grant, which would erode the check's value. Evidence-based fix selection from CLAUDE.md: defer until an observed failure shows the additional check is load-bearing. +- **Do not log the raw CapEff hex value alongside the wrapped error.** The wrapped error message already names every offending bit symbolically (`CAP_SYS_ADMIN (bit 21)`). Adding a `cap_eff` structured field with the raw mask would (a) clutter centralised logging for every misconfigured boot, (b) potentially leak future-allowlisted bits before they're documented, and (c) duplicate information the prose already conveys. Architect's security-review SHOULD-FIX from the #79 spec; preserve the rule going forward for any future log line that touches capability data. +- **Log GOOS in the message string when the structured-key allowlist is closed.** The non-Linux `CheckCapabilities` wants to record which GOOS skipped the check. The log-key allowlist (#36) gates structured keys — adding `"goos"` for a one-line startup diagnostic would widen the allowlist for low value. Embedding GOOS into the message string (`"skipping linux-only capability check on darwin"`) keeps the operator-visible signal intact without touching the closed-set defence. General principle: prefer prose-in-message over new structured keys for low-frequency, operator-visible-only signals. + +## Cross-links + +- [Capability allowlist (feature)](../features/capability-allowlist.md) — operator-facing doc covering contract, API, wiring, threat-model alignment. +- [ADR-0009: Build-tag platform-split convention](../decisions/0009-build-tag-platform-split-convention.md) — first-instance pattern this ticket established. +- [Codebase note #77](77.md) — sibling boot-time refusal (production-mode `--insecure-listen`); same wiring shape. 
+- [Production-mode contract & startup refusal (feature)](../features/production-mode.md) — `PYRYCODE_RELAY_PRODUCTION` env contract; #79 intentionally does NOT consume it. +- [Autocert TLS (feature)](../features/autocert-tls.md) — `ErrCacheDirInsecure` is the original boot-time refusal sentinel (#9). +- [Docker image (feature)](../features/docker-image.md) — distroless `:nonroot` (uid 65532) shape that motivates `CAP_NET_BIND_SERVICE` on the allowlist. +- [#42 — parent ticket](https://github.com/pyrycode/pyrycode-relay/issues/42) — split into #77 / #78 / #79. diff --git a/docs/knowledge/decisions/0009-build-tag-platform-split-convention.md b/docs/knowledge/decisions/0009-build-tag-platform-split-convention.md new file mode 100644 index 0000000..a173fb0 --- /dev/null +++ b/docs/knowledge/decisions/0009-build-tag-platform-split-convention.md @@ -0,0 +1,45 @@ +# ADR-0009: Build-tag platform-split convention — `_.go` / `_other.go` + +**Status:** Accepted (#79, 2026-05-13) + +## Context + +Ticket #79 needed a Linux-only implementation of `CheckCapabilities` (parses `/proc/self/status`) and a no-op fallback on every other GOOS (darwin dev runs, future Windows/BSD CI). Prior to #79 no file in `internal/relay/` or `cmd/` used Go's per-GOOS build tags — `grep -r '//go:build' internal/ cmd/` returned nothing on `main` at 15dc00f. The PO body explicitly delegated the file-naming convention to the architect; locking it in now sets the shape for every future per-GOOS split in this repo. + +## Decision + +Per-GOOS files in this repo use the pair `_.go` + `_other.go`: + +- `_.go` — implementation for one specific GOOS. **No build tag.** Go's build system auto-applies a `//go:build ` constraint based on the filename suffix when the suffix matches a recognised GOOS (`_linux.go`, `_darwin.go`, `_windows.go`, `_freebsd.go`, …). This mirrors stdlib (`os/file_plan9.go`). +- `_other.go` — implementation for every other GOOS. 
**Explicit `//go:build !` tag required** because `other` is not a recognised GOOS suffix, so the build system does not auto-constrain it. + +First instance: `caps.go` (cross-platform pure code), `caps_linux.go` (Linux entry point, no tag), `caps_other.go` (`//go:build !linux`). Tests follow the same suffix rules: `caps_test.go` (cross-platform), `caps_linux_test.go` (Linux-only, suffix auto-applies). + +## Rationale + +**Why not a single file with runtime `runtime.GOOS` switching.** The Linux branch reads `/proc/self/status`; on darwin that path does not exist, so a runtime branch would still compile the reader and its imports (`os`, `strings`, `strconv`) into a binary where the read can never succeed. Compile-time elimination keeps the darwin binary free of dead /proc-handling code and the Linux binary free of "skip this, we're on darwin" runtime branches. It also keeps darwin builds honest — a future maintainer who breaks the Linux path won't accidentally hide it behind an untaken runtime branch on a darwin dev box. + +**Why `_.go` tag-free, `_other.go` tag-explicit (asymmetric).** Symmetric explicit tags on both halves would work but duplicate information the filename already carries — `_linux.go` with `//go:build linux` is redundant noise. The asymmetry is forced by the language: `_other` is not a GOOS suffix and won't auto-constrain, so the tag has to be explicit there. This is the same shape stdlib uses (`os/file_plan9.go` is tag-free because `plan9` is a GOOS suffix; `os/file_unix.go` and the fallback layer `os/file_posix.go` carry explicit build tags because `unix` and `posix` are not). + +**Why `_other.go` rather than `_default.go` or `_stub.go`.** "Other" reads correctly at the call site: a future contributor reading `caps_other.go` understands "this is the non-Linux branch" without needing to map `_default`/`_stub` jargon to "anything not covered by a sibling file." If/when this repo grows a `_freebsd.go` sibling, the name `_other.go` still reads correctly; only its tag needs tightening to `!linux && !freebsd`, because a filename suffix constrains its own file and does not exclude FreeBSD from a sibling tagged `!linux`. 
+
+**Why establish this as a *convention* and not just decide ad-hoc per ticket.** Capability checks, uid-0 checks (queued #78), and any future "this only makes sense on Linux containers" gate will all face the same split. Standardising on one shape now means code review can assert against a single rule rather than re-litigating the file layout per ticket; it also means cross-linking from CLAUDE.md / spec docs is stable.
+
+## Alternatives considered
+
+- **Symmetric explicit tags on both files** (`//go:build linux` on `caps_linux.go`, `//go:build !linux` on `caps_other.go`). Rejected — redundant with the filename suffix on the Linux half. The implicit constraint is a deliberate Go convention; opting out adds noise without adding clarity.
+- **Single file, `runtime.GOOS` branch.** Rejected — see above; couples darwin builds to /proc-shaped code and adds an untaken runtime branch on every Linux invocation.
+- **`_unix.go` + `_other.go`.** Rejected — `unix` is recognised by Go (1.19+) as a synthetic build constraint covering Linux, darwin, and the BSDs, but only in `//go:build` lines, not as a filename suffix, and the relay's deployment target is specifically Linux containers, not "any Unix." Conflating Linux with darwin here would mean darwin dev runs hit the `/proc` reader and fail; the dev-run experience would be objectively worse than the no-op log.
+
+## Consequences
+
+- **Going forward:** any per-GOOS implementation in this repo follows `_<goos>.go` + `_other.go`. Tests follow the same suffix rules. The architect spec for any such ticket should name the convention by reference to this ADR instead of re-deriving it.
+- **Extension path:** if a second-tier GOOS needs a non-stub implementation (e.g. FreeBSD CI grows a `/compat/linux/proc` reader), add `_freebsd.go` (tag-free, filename suffix auto-constrains) and extend the `_other.go` tag to `//go:build !linux && !freebsd` so it narrows to "neither linux nor freebsd"; without that edit both files compile on FreeBSD and the build fails on duplicate declarations. If a third GOOS joins, the same pattern repeats.
+- **Cost paid:** a contributor must remember the asymmetry (`_other.go` needs an explicit tag, `_linux.go` does not). The two new files in #79 carry doc comments that make the asymmetry visible; future PRs are held to the same rule in review.
+- **What this does NOT change:** the cross-platform pure functions (`parseCapEff`, `checkCapEffMask`, `capabilityName`) live in `caps.go` *without* a build tag — they compile on every GOOS and are tested on every GOOS. The split applies only to the platform-specific entry point.
+
+## Cross-links
+
+- [Capability allowlist (feature)](../features/capability-allowlist.md) — first consumer.
+- [Codebase note #79](../codebase/79.md) — per-ticket implementation detail.
+- Go reference: [Build constraints — `cmd/go` docs](https://pkg.go.dev/cmd/go#hdr-Build_constraints) — the `_<goos>.go` / `_<goos>_<goarch>.go` filename suffix rules this ADR rides on.
diff --git a/docs/knowledge/features/capability-allowlist.md b/docs/knowledge/features/capability-allowlist.md
new file mode 100644
index 0000000..e52407e
--- /dev/null
+++ b/docs/knowledge/features/capability-allowlist.md
@@ -0,0 +1,75 @@
+# Linux capability allowlist (boot-time refusal)
+
+The relay refuses to start when its process's *effective* Linux capability set (`CapEff`) contains any bit outside an explicit allowlist. A container runtime that grants stray capabilities (`CAP_SYS_ADMIN`, `CAP_NET_ADMIN`, `--cap-add ALL`, an over-broad bounding set, …) fails the deploy's health check rather than serving traffic with elevated privilege. Added by #79.
+
+## The contract
+
+- **Source of truth:** `/proc/self/status`'s `CapEff:` hex mask. CapEff is what the kernel consults when authorising a privileged syscall; CapPrm/CapBnd/CapInh are intentionally not checked (see "Why CapEff only" below).
+- **Allowlist:** `AllowedCapabilities` — currently `{CAP_NET_BIND_SERVICE (bit 10)}` only. That single capability is needed because autocert mode binds `:80` and `:443` from uid 65532 inside the distroless image (Dockerfile, fly.toml). Hosts that drop the cap and lower `net.ipv4.ip_unprivileged_port_start` instead also pass (CapEff = 0 satisfies the allowlist).
+- **Unconditional.** No production-mode gating, no env-var bypass. Stray capabilities are never legitimate; a dev container with `--cap-add SYS_ADMIN` slipped in from a copy-pasted Compose stanza fails the same way prod would.
+- **Linux only.** Non-Linux GOOS (darwin dev runs, future Windows/BSD CI) skips the check with a one-line `slog` info log. Split at compile time via build tags (see ADR-0009).
+
+## API (`internal/relay/caps.go` + `caps_linux.go` + `caps_other.go`)
+
+- `ErrUnexpectedCapability` — exported sentinel, branchable via `errors.Is`. Wrapped error message names each unexpected bit (`CAP_NAME (bit N)` or `bit N` for unknown), lists the current allowlist contents, and embeds the operator fix string.
+- `Capability{Bit uint; Name string}` — exported record type used by the allowlist.
+- `AllowedCapabilities []Capability` — exported package-level slice. Operator-facing (formatted by name into the failure message), one-line diff to extend.
+- `CheckCapabilities() error` — Linux entry point reads `/proc/self/status`, parses the `CapEff:` line, returns `ErrUnexpectedCapability` (wrapped) on violation or a separately-wrapped error on read / parse failure. On non-Linux GOOS the function is a no-op that returns nil and logs once at info.
+
+Internal helpers (`parseCapEff`, `checkCapEffMask`, `capabilityName`, `allowedMask`, `formatBits`, `formatAllowlist`) live in `caps.go` (cross-platform, pure functions). The `checkCapabilitiesWithReader(readStatus func() (string, error))` seam lives in `caps_linux.go` so tests can inject canned `/proc/self/status` content without touching real `/proc`.
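To make the parse-then-mask flow described in the API list concrete, here is a minimal sketch. Only the helper names `parseCapEff`, `checkCapEffMask`, and `allowedMask` come from the text above; their bodies, the unexported `errUnexpectedCapability` stand-in, the error wording, and the reduced "bit N" formatting are assumptions made for illustration and may differ from the real `caps.go`.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// errUnexpectedCapability stands in for the exported sentinel (assumed text).
var errUnexpectedCapability = errors.New("relay: unexpected Linux capability")

// capNetBindService is CAP_NET_BIND_SERVICE (bit 10), the only allowlisted bit.
const capNetBindService = 10

// allowedMask has one bit set per allowlisted capability (assumed to be a
// constant here; the real helper may compute it from AllowedCapabilities).
const allowedMask uint64 = 1 << capNetBindService

// parseCapEff extracts the CapEff hex mask from /proc/self/status-style text.
func parseCapEff(status string) (uint64, error) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "CapEff:") {
			continue
		}
		value := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
		mask, err := strconv.ParseUint(value, 16, 64)
		if err != nil {
			return 0, fmt.Errorf("relay: parsing /proc/self/status CapEff: %w", err)
		}
		return mask, nil
	}
	return 0, errors.New("relay: parsing /proc/self/status CapEff: line not found")
}

// checkCapEffMask returns nil when every set bit is allowlisted; otherwise it
// wraps the sentinel and lists the offending bit numbers.
func checkCapEffMask(mask uint64) error {
	extra := mask &^ allowedMask // bits set in mask but absent from the allowlist
	if extra == 0 {
		return nil
	}
	var bits []string
	for bit := 0; bit < 64; bit++ {
		if extra&(uint64(1)<<bit) != 0 {
			bits = append(bits, fmt.Sprintf("bit %d", bit))
		}
	}
	return fmt.Errorf("%w: %s", errUnexpectedCapability, strings.Join(bits, ", "))
}

func main() {
	// Canned status text: bit 10 (allowlisted) plus bit 21 (CAP_SYS_ADMIN).
	status := "Name:\trelay\nCapEff:\t0000000000200400\n"
	mask, err := parseCapEff(status)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	err = checkCapEffMask(mask)
	fmt.Println(errors.Is(err, errUnexpectedCapability), err)
}
```

Because both helpers are pure functions over a string and a `uint64`, a table-driven test can feed them canned masks on any GOOS, which is the property the `caps.go` / `caps_linux.go` split relies on.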
+
+## Wiring (`cmd/pyrycode-relay/main.go`)
+
+Immediately after the `CheckInsecureListenInProduction` block (#77), before `relay.NewRegistry()`:
+
+```go
+if err := relay.CheckCapabilities(); err != nil {
+	logger.Error("refusing to start: unexpected Linux capabilities",
+		"err", err,
+		"fix", "drop extra capabilities (e.g. --cap-drop=ALL --cap-add=NET_BIND_SERVICE on docker, or securityContext.capabilities on kubernetes)")
+	os.Exit(2)
+}
+```
+
+- **Exit code 2** = config-rejected-at-boot, matching the boot-time-refusal convention (#77).
+- **No `cap_eff` raw-hex log field.** The wrapped error already names every offending bit symbolically; logging the raw mask alongside would clutter centralised logging and could leak future-allowlisted bits before they're documented. Security review SHOULD-FIX from #79's spec — carry forward to any future extension.
+- **`fix` field** lists Docker and Kubernetes resolutions explicitly; "or equivalent" inside the wrapped error covers Podman / nerdctl / Fly Machines.
+
+## Failure modes (three distinct return shapes)
+
+| Cause | Return | Branchable via |
+|---|---|---|
+| CapEff has bit outside allowlist | `fmt.Errorf("%w: …", ErrUnexpectedCapability)` | `errors.Is(err, ErrUnexpectedCapability)` |
+| `/proc/self/status` missing `CapEff:` or non-hex value | `fmt.Errorf("relay: parsing /proc/self/status CapEff …", …)` | — |
+| `os.ReadFile("/proc/self/status")` failed | `fmt.Errorf("relay: reading /proc/self/status: %w", err)` | underlying `os.PathError` |
+
+All three lead to `os.Exit(2)` in main. The sentinel exists so future callers (integration tests asserting "this misconfiguration produces *this* refusal") can match without string-matching the log.
+
+## Why CapEff only
+
+CapEff is the load-bearing set: the kernel consults it (not CapPrm) when authorising privileged syscalls. The relay never calls `capset(2)`, never `setuid`s, never exec's into a setuid binary — CapPrm bits not in CapEff are inert. CapBnd is the bounding set inherited from the runtime; refusing on a wide CapBnd would reject Kubernetes pods running under default policy (CapBnd legitimately broader than CapEff there). Minimum surface = minimum noise. If a future incident shows CapPrm/CapBnd matters, a follow-up ticket extends the check with a clear motivating failure — rationale frozen in the `CheckCapabilities` doc comment so the next contributor has it on hand.
+
+## Threat model alignment
+
+`docs/threat-model.md` § Deploy treats operator misconfiguration as the dominant failure class for a single-machine internet-exposed relay. This check joins three siblings in the "refuse-to-boot on operator drift" family:
+
+- `ErrCacheDirInsecure` (#9) — autocert cache dir mode drift
+- `ErrInsecureListenInProduction` (#77) — plaintext-in-prod transport drift
+- `ErrUnexpectedCapability` (#79) — over-broad capability grant
+- *future #78* — uid-0 in production
+
+All four share the same shape: detect at boot, refuse loudly, fail the health check, never serve traffic in the misconfigured state.
+
+## Out of scope (deferred)
+
+- **CapPrm / CapBnd / CapInh checks.** See "Why CapEff only" above; revisit on a motivating failure.
+- **Real-`/proc` integration test.** Would couple test outcome to the CI runner's capability configuration (varies by GitHub Actions image vs. distroless). The seam test on the injected reader is the production-relevant assertion.
+- **Programmatic export of `capabilityName`.** Package-private until a caller actually needs it; one-line change when one does.
+
+## Cross-links
+
+- [ADR-0009: Build-tag platform-split convention (`_linux.go` / `_other.go`)](../decisions/0009-build-tag-platform-split-convention.md) — convention this ticket established.
+- [Codebase note #79](../codebase/79.md) — per-ticket implementation detail.
+- [Production-mode contract](production-mode.md) — sibling boot-time refusal (#77).
+- [Autocert TLS](autocert-tls.md) — `ErrCacheDirInsecure` is the original boot-time-refusal sentinel (#9).
+- [Docker image](docker-image.md) — distroless `:nonroot` (uid 65532) deployment shape that motivates `CAP_NET_BIND_SERVICE` on the allowlist.