Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions cmd/pyrycode-relay/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,13 @@ func main() {
os.Exit(2)
}

if err := relay.CheckCapabilities(); err != nil {
logger.Error("refusing to start: unexpected Linux capabilities",
"err", err,
"fix", "drop extra capabilities (e.g. --cap-drop=ALL --cap-add=NET_BIND_SERVICE on docker, or securityContext.capabilities on kubernetes)")
os.Exit(2)
}

startedAt := time.Now()
reg := relay.NewRegistry()

Expand Down
2 changes: 2 additions & 0 deletions docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [Linux capability allowlist (boot-time refusal)](features/capability-allowlist.md) — relay parses `/proc/self/status`'s `CapEff:` hex mask at boot and refuses to start (exit 2) if any bit is set outside `AllowedCapabilities` (currently `{CAP_NET_BIND_SERVICE}` only, motivated by autocert binding `:80`/`:443` from uid 65532 in the distroless image). Exported sentinel `ErrUnexpectedCapability` is branchable via `errors.Is`; the wrapped error names every offending bit symbolically (`CAP_SYS_ADMIN (bit 21)` or `bit 63` for unknown), lists the allowlist contents, and embeds the operator fix string. `CapEff` only — `CapPrm/CapBnd/CapInh` would broaden false-positives (legitimate K8s default policy grants wide CapBnd) without adding load-bearing protection (relay never `capset(2)`s). Linux/non-Linux split at compile time via the new `_<goos>.go` / `_other.go` build-tag convention (see ADR-0009); non-Linux GOOS logs one skip line and returns nil. Unconditional — no production-mode gating, no env-var bypass, because stray capabilities are never legitimate. Reader-boundary test seam (`func() (string, error)`) exercises the parse + mask check end-to-end without touching real `/proc`. Joins the boot-time-refusal sentinel family (#9, #77, #79; future #78) (#79).
- [Production-mode contract & `--insecure-listen` startup refusal](features/production-mode.md) — `PYRYCODE_RELAY_PRODUCTION=1` env-var contract (exact-string match, lazy read via injected getter, mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` shape from #64/#65) plus the first boot-time check that consumes it: `relay.CheckInsecureListenInProduction` returns the exported `ErrInsecureListenInProduction` sentinel (branchable via `errors.Is`) when production mode is on AND `--insecure-listen` is set, wired into `cmd/pyrycode-relay/main.go` after flag-parse with `os.Exit(2)` (config-rejected-at-boot, distinct from runtime-failure exit 1) and structured log fields naming the env var (name only, never value) and a one-line `fix` listing both valid resolutions. `IsProductionMode` exported so sibling startup checks (#78 = uid-0) compose on the same predicate without re-reading the env var. Test seam is a `func(string) string` getter (smallest possible — no interface, no struct, no package-level var) so the 2×2 AC matrix and the value-space matrix run under `t.Parallel()` + `-race` without mutating process env (#77).
- [Fly.io deploy](features/fly-deploy.md) — production host wiring: `fly.toml` declares TCP-passthrough on `:80`/`:443` (no Fly HTTP proxy, no Fly-managed certs) so TLS keeps terminating in the relay via autocert (#9), persistent Fly volume `relay_autocert` mounted at `/var/lib/relay/autocert`, and a single-machine hard cap encoded via `min_machines_running=1` + `auto_start_machines=false` + `auto_stop_machines="off"` + `[deploy] strategy="immediate"` (Fly Apps v2 has no `max_machines` key; the in-binary `PYRYCODE_RELAY_SINGLE_INSTANCE` self-check from #65 is the backstop). CI `deploy` job in `.github/workflows/ci.yml` runs `flyctl deploy --remote-only` on push to `main`, gated by branch-condition + `needs: [test, security, image-scan]` + `permissions: contents: read` so `FLY_API_TOKEN` is structurally unreachable from PR code; `superfly/flyctl-actions/setup-flyctl` pinned by commit SHA with `# Tracks:` comment (same convention as #68 / #41). Dedicated IPv4 is required (not optional) for autocert's HTTP-01 challenge; TCP passthrough preserves the real socket peer IP that #34's rate limiter reads. `__REGION__` / `__DOMAIN__` ship as placeholders that fail loud on first deploy (#38).
- [Connection-count gauges](features/connection-count-gauges.md) — `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones` exposed via a pull-based `prometheus.Collector` reading `Registry.Counts()` on each scrape; zero edits to `registry.go`; scalar (no labels) by design — `{server="..."}` would carry the attacker-influenced `x-pyrycode-server` header onto the metrics surface, which threat-model § Log hygiene forbids; stale grace-expiry fires can't move the gauge because the pointer-identity guard (ADR-0006) keeps the maps unchanged and the gauge IS the map size; race-tested against 16 mutator goroutines + a tight-loop scraper under `-race`. First collector wired into the #59 seam (#61).
Expand All @@ -22,6 +23,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Decisions

- [ADR-0009: Build-tag platform-split convention — `_<goos>.go` / `_other.go`](decisions/0009-build-tag-platform-split-convention.md) — first per-GOOS split in the repo (introduced by #79's `caps_linux.go` / `caps_other.go`); locks in the asymmetric pair (tag-free `_<goos>.go` rides on Go's filename-suffix auto-constraint; explicit `//go:build !<goos>` on `_other.go` because `other` is not a recognised GOOS); test files follow the same suffix rules; rejects single-file `runtime.GOOS` switching (compiles dead `/proc` code into darwin builds, adds untaken branch on every Linux startup) and `_unix.go` (would route darwin dev runs into the `/proc` reader); extension shape: a future `_freebsd.go` claims FreeBSD via its filename and the `_other.go` half auto-narrows to `!linux && !freebsd` without edits; mirrors stdlib (`os/file_unix.go` tag-free, `os/file_posix.go` tag-explicit) so the convention is self-documenting to a Go-literate reader.
- [ADR-0008: Adopt `github.com/prometheus/client_golang` for relay metrics](decisions/0008-prometheus-client-adoption.md) — first non-stdlib direct dep since #15 (`nhooyr.io/websocket`); alternatives (hand-rolled text format, `VictoriaMetrics/metrics`, `expvar`+sidecar) rejected; transitive set enumerated and pinned via `go.sum`; scope of use bans `DefaultRegisterer`, process/runtime collectors, `slog` adapter, Prometheus router/config helpers; pattern established: ADR-before-import for any next direct dep.
- [ADR-0007: `WSConn.CloseWithCode` for active-conn application close codes](decisions/0007-wsconn-closewithcode-for-active-conn.md) — extends ADR-0005 for the post-claim window: heartbeat (#7) needs `1011 "heartbeat timeout"` on a live WSConn; `Close()` delegates to `CloseWithCode(StatusNormalClosure, "")`, both share `closeOnce`.
- [ADR-0006: Grace window IS the reclaim path](decisions/0006-grace-period-as-reclaim-path.md) — `ClaimServer` during a pending grace timer succeeds (not `4409`); pointer-identity wrapper defends against stale `time.AfterFunc` fires after `Stop()`.
Expand Down
49 changes: 49 additions & 0 deletions docs/knowledge/codebase/79.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# Ticket #79 — refuse to boot when Linux effective capabilities exceed allowlist

Adds a Linux-only boot-time check that parses `/proc/self/status`'s `CapEff:` hex mask and refuses to start (exit 2) when any bit is set outside an explicit allowlist (`CAP_NET_BIND_SERVICE` only). A misconfigured container runtime (`--cap-add SYS_ADMIN`, default Docker profile, over-broad bounding set) fails the deploy's health check rather than serving traffic with elevated privilege. On non-Linux GOOS the check is a no-op with a single startup log line. Split from #42; sibling of #77 (production-mode `--insecure-listen`) and #78 (queued, uid-0-in-production).

## Implementation

- **`internal/relay/caps.go` (new, 184 lines)** — cross-platform code, no build tag:
- `ErrUnexpectedCapability` sentinel — branchable via `errors.Is`. Wrapped message names each offending bit (`CAP_NAME (bit N)` or `bit N` if unknown), lists the allowlist contents, and embeds the operator fix string.
- `Capability{Bit uint; Name string}` exported record type; `AllowedCapabilities []Capability` exported allowlist (currently one entry: `CAP_NET_BIND_SERVICE`).
- `capabilityNames []string` (package-private) — index-by-bit table covering `CAP_CHOWN` through `CAP_CHECKPOINT_RESTORE` (bits 0–40, kernel 5.9+). Bit positions are stable kernel ABI; new caps only append.
- Pure helpers: `parseCapEff`, `checkCapEffMask`, `capabilityName`, `allowedMask`, `formatBits`, `formatAllowlist`.
- **`internal/relay/caps_linux.go` (new, 46 lines, no build tag)** — Linux entry point. `CheckCapabilities` delegates to `checkCapabilitiesWithReader(readProcSelfStatus)`; the seam takes `func() (string, error)` so tests inject canned `/proc/self/status` content without touching real `/proc`. Reader-boundary injection (not mask-boundary) so the malformed-input test still exercises `parseCapEff` end-to-end.
- **`internal/relay/caps_other.go` (new, 26 lines, `//go:build !linux`)** — no-op variant. Logs `"skipping linux-only capability check on <GOOS>"` via `slog.Default().Info`, returns nil. GOOS lives in the message string rather than as a structured field because the log-key allowlist (`log_allowlist.go`, #36) gates structured keys and a startup-only diagnostic doesn't earn a new key.
- **`internal/relay/caps_test.go` (new, 183 lines)** — cross-platform pure-function tests. Covers AC (a)–(d) on `parseCapEff` (9-row value matrix) and `checkCapEffMask` (6-row allowlist matrix, plus a regression test that the unexpected-bits section of the message does not name an allowlisted cap, plus an `errors.Is` branchability assertion).
- **`internal/relay/caps_linux_test.go` (new, 62 lines, `_linux_test.go` suffix)** — Linux-only seam tests on `checkCapabilitiesWithReader`. Four-row matrix: allowed-only → nil; CAP_SYS_ADMIN → sentinel; missing `CapEff:` → wrapped parse error (not sentinel); reader error → propagated wrapped (not sentinel).
- **`cmd/pyrycode-relay/main.go` (+7 lines)** — call inserted immediately after the #77 `CheckInsecureListenInProduction` block, before `relay.NewRegistry()`. Exit code 2; structured log fields are `err` and `fix` (Docker + Kubernetes resolutions). No `cap_eff` raw-hex field — the wrapped error names every bit symbolically already.

## Acceptance criteria — verification map

- **AC-1** (sentinel + allowlist exported; message names unexpected cap + allowlist): `ErrUnexpectedCapability` on `caps.go:22`, `AllowedCapabilities` on `caps.go:45-47`, message format in `checkCapEffMask` on `caps.go:149-156`.
- **AC-2** (Linux check parses `CapEff:`, returns sentinel; non-Linux no-op + skip log; build-tag split): `caps_linux.go` (no tag, auto-applies linux) and `caps_other.go` (`//go:build !linux`). Skip log format in `caps_other.go:24`.
- **AC-3** (wired in `main.go` after flag parse, before listener, with fail-fast + structured log naming fix): `main.go:53-58`.
- **AC-4** (unit tests cover empty / allowlisted / out-of-allowlist / malformed; test seam, no real /proc): `caps_test.go` (cross-platform) + `caps_linux_test.go` (seam).
- **AC-5** (no production-mode gating): no reference to `IsProductionMode` or `PYRYCODE_RELAY_PRODUCTION` in the new files; `main.go:53` is unconditional.

## Patterns established

- **`_<goos>.go` / `_other.go` build-tag split convention.** First per-GOOS split in this repo. Captured as [ADR-0009](../decisions/0009-build-tag-platform-split-convention.md). Future per-GOOS implementations in this repo follow the same shape (tag-free `_<goos>.go`, explicit `//go:build !<goos>` on `_other.go`). Tests follow the same suffix rules (`*_<goos>_test.go` auto-constrains).
- **Reader-boundary test seam for parse-and-check pipelines.** The seam takes `func() (string, error)` returning raw file content, not the parsed mask. This keeps the malformed-input test exercising `parseCapEff` end-to-end rather than skipping the parser. Reach for the *coarsest* seam that still avoids the system dependency (here: real `/proc`); finer-grained seams reduce coverage.
- **Allowlist as a slice of typed records, not a bitmask constant.** The allowlist is operator-facing (its contents are formatted by name into the failure log) and append-friendly in code review. The bitmask form (`allowedMask()`) is derived state, computed inside the check. Apply the same shape to any future allowlist that surfaces in operator-visible output.
- **Boot-time-refusal sentinel family.** With #79 the family is four: `ErrCacheDirInsecure` (#9), `ErrInsecureListenInProduction` (#77), `ErrUnexpectedCapability` (#79), future #78. All four share the same wiring shape: `if err := relay.CheckXxx(...); err != nil { logger.Error("refusing to start: ...", "err", err, "fix", "..."); os.Exit(2) }`. When the count reaches ~5 and the boilerplate becomes a real cost, a follow-up consolidates them into a single `Config.Validate()` returning a multi-error — not before (per #77's "do not pre-build that abstraction").

## Lessons learned

- **Inject the seam at the coarsest boundary that still avoids the system dependency.** The first design instinct is to inject `mask uint64` directly — it's the smallest API. But that skips `parseCapEff` and leaves AC (d) (malformed `/proc/self/status`) uncovered by the seam. Pushing the seam out to the reader (`func() (string, error)`) means one fake exercises parse + mask-check together. Rule for parse-and-check pipelines: the seam goes at the I/O edge, not at the inter-stage boundary.
- **Compile-time platform splits beat runtime `runtime.GOOS` branches.** A single-file `if runtime.GOOS == "linux" { ... }` would have worked, but it forces every darwin build to compile the `/proc` reader (dead code on darwin) and adds an untaken runtime branch to every Linux startup. The build-tag split keeps each binary lean and, more importantly, makes "this code path only exists on Linux" a property the type system enforces. Apply the same instinct to any future "this only makes sense on container hosts" gate.
- **Pick the minimum capability set; expand only on a motivating failure.** The spec walked through CapPrm / CapBnd / CapInh and rejected each because (a) the relay never `capset(2)`s, so CapPrm is inert, and (b) Kubernetes default policy legitimately grants a wide CapBnd. Checking all four would cause false-positive boot refusals on legitimate K8s deployments and force operators to over-grant, which would erode the check's value. Evidence-based fix selection from CLAUDE.md: defer until an observed failure shows the additional check is load-bearing.
- **Do not log the raw CapEff hex value alongside the wrapped error.** The wrapped error message already names every offending bit symbolically (`CAP_SYS_ADMIN (bit 21)`). Adding a `cap_eff` structured field with the raw mask would (a) clutter centralised logging for every misconfigured boot, (b) potentially leak future-allowlisted bits before they're documented, and (c) duplicate information the prose already conveys. Architect's security-review SHOULD-FIX from the #79 spec; preserve the rule going forward for any future log line that touches capability data.
- **Log GOOS in the message string when the structured-key allowlist is closed.** The non-Linux `CheckCapabilities` wants to record which GOOS skipped the check. The log-key allowlist (#36) gates structured keys — adding `"goos"` for a one-line startup diagnostic would widen the allowlist for low value. Embedding GOOS into the message string (`"skipping linux-only capability check on darwin"`) keeps the operator-visible signal intact without touching the closed-set defence. General principle: prefer prose-in-message over new structured keys for low-frequency, operator-visible-only signals.

## Cross-links

- [Capability allowlist (feature)](../features/capability-allowlist.md) — operator-facing doc covering contract, API, wiring, threat-model alignment.
- [ADR-0009: Build-tag platform-split convention](../decisions/0009-build-tag-platform-split-convention.md) — first-instance pattern this ticket established.
- [Codebase note #77](77.md) — sibling boot-time refusal (production-mode `--insecure-listen`); same wiring shape.
- [Production-mode contract & startup refusal (feature)](../features/production-mode.md) — `PYRYCODE_RELAY_PRODUCTION` env contract; #79 intentionally does NOT consume it.
- [Autocert TLS (feature)](../features/autocert-tls.md) — `ErrCacheDirInsecure` is the original boot-time refusal sentinel (#9).
- [Docker image (feature)](../features/docker-image.md) — distroless `:nonroot` (uid 65532) shape that motivates `CAP_NET_BIND_SERVICE` on the allowlist.
- [#42 — parent ticket](https://github.com/pyrycode/pyrycode-relay/issues/42) — split into #77 / #78 / #79.
Loading