From 8601725653474efdc2bb519a477b3e46e6a05518 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 09:39:48 +0300 Subject: [PATCH 1/3] spec: refuse to boot as effective uid 0 in production mode (#78) --- .../78-refuse-boot-as-root-in-production.md | 135 ++++++++++++++++++ 1 file changed, 135 insertions(+) create mode 100644 docs/specs/architecture/78-refuse-boot-as-root-in-production.md diff --git a/docs/specs/architecture/78-refuse-boot-as-root-in-production.md b/docs/specs/architecture/78-refuse-boot-as-root-in-production.md new file mode 100644 index 0000000..6c3fb44 --- /dev/null +++ b/docs/specs/architecture/78-refuse-boot-as-root-in-production.md @@ -0,0 +1,135 @@ +# Spec: refuse to boot as effective uid 0 in production mode + +Ticket: [#78](https://github.com/pyrycode/pyrycode-relay/issues/78). Size S. Split from #42. + +## Files to read first + +- `internal/relay/production.go` (whole file, 47 lines) — the sibling helper this ticket extends in place. Mirror the shapes exactly: + - `envProductionMode` const (line 9) — reuse, do not redeclare. + - `IsProductionMode` (line 30) — call from the new check; do not re-read the env var. + - `ErrInsecureListenInProduction` (line 17) and `CheckInsecureListenInProduction` (line 41) — the new sentinel + check function are literal siblings of these, same package, same file, same style. +- `internal/relay/production_test.go` (whole file, 119 lines) — the test style for this package. Reuse `fakeGetenv` (line 8); mirror the table shape of `TestCheckInsecureListenInProduction_Matrix` (line 47); add an `Is*Branchable` test for the new sentinel paralleling line 107. +- `cmd/pyrycode-relay/main.go:46-70` — the wiring slot. The new check installs immediately after the existing `CheckInsecureListenInProduction` block (line 64–70) and immediately before `relay.CheckCapabilities()` (line 72). 
Read those lines to see the structured-log shape (`logger.Error(..., "err", err, "env_var", "...", "fix", "...")`) and the `os.Exit(2)` convention for production-mode misconfigurations. +- `docs/specs/architecture/77-refuse-insecure-in-production.md` — the sibling architect spec. Sections "Sentinel error", "Wiring in `cmd/pyrycode-relay/main.go`", and "Testing strategy" set the precedent this ticket follows verbatim. +- `docs/specs/architecture/9-autocert-tls.md` (for the `ErrCacheDirInsecure` precedent narrative; skim only if context on "fail-loud, before any listener starts" is needed) — the boot-time refusal pattern. +- `syscall` stdlib — `syscall.Geteuid() int`. Returns the effective uid on Linux and darwin. No build tag required. On Windows it returns -1, but the relay binary is built for linux/amd64 (Dockerfile, #32) and darwin for local dev; either way the function is in scope without conditional compilation. + +## Context + +Ticket #77 closed the "production-mode + plaintext listener" silent-misconfiguration class. This ticket closes the parallel silent-misconfiguration class: the relay process running as root. + +CI already verifies the *build* image runs as a non-root user (Dockerfile USER directive, verified by #32 and the Trivy scan in #68). But the build is only one half of the contract. At *deploy* time, an operator (or an AI agent generating a manifest) can: + +- `docker run --user 0 …` — overrides the image's USER and runs as root. +- `kubectl apply` a pod spec with `securityContext: { runAsUser: 0 }` — same outcome, different cause. +- Modify the Dockerfile to drop the USER directive — slips past code review if the diff is small enough. + +None of these are caught by CI: the build is green, the image scans clean, the deploy succeeds, and the internet-exposed process runs as root. The blast radius is "any RCE in the relay is now a root RCE on the host"; in containers without strict capability drops, that escalates further. 
The defence must run *at boot, before any listener accepts a connection*, so a misconfigured deploy fails the deploy's health check rather than serving traffic. + +This ticket adds the deterministic in-process backstop: when `PYRYCODE_RELAY_PRODUCTION=1` AND `syscall.Geteuid() == 0`, the process refuses to start. + +The `PYRYCODE_RELAY_PRODUCTION` contract is defined by #77 (`internal/relay/production.go`). This ticket consumes `IsProductionMode` and adds a sibling check; it does not redefine the env-var contract or introduce a second read path. + +Related: #9 (the `ErrCacheDirInsecure` boot-time-refusal precedent); #32 (Dockerfile non-root build); #68 (Trivy image scan); #77 (sibling — introduced `ErrInsecureListenInProduction` and `IsProductionMode`); #42 (parent — was split into #77 / #78). + +## Design + +No new files. Append to `internal/relay/production.go` (new sentinel + new check), append to `internal/relay/production_test.go` (new test cases + new branchability test), insert ~7 lines of wiring in `cmd/pyrycode-relay/main.go`. + +### Sentinel error (exported) + +A new package-level `var ErrRunningAsRoot = errors.New("relay: …")` declared alongside `ErrInsecureListenInProduction` in `internal/relay/production.go`. Message names both the observed condition (effective uid 0) and the env-var contract (`PYRYCODE_RELAY_PRODUCTION=1`), mirroring how `ErrInsecureListenInProduction` names the flag and the env var. Doc comment follows the same Godoc shape as `ErrInsecureListenInProduction` (lines 11–17): names the function that returns it, explains the production-mode condition, justifies fail-fast over runtime degradation, and says the contract that triggers it is internet-exposure of a root-uid process. 
+ +### Check function (exported) + +```go +func CheckRunningAsRoot(geteuid func() int, getenv func(string) string) error +``` + +Behaviour, in one sentence: returns `ErrRunningAsRoot` when `IsProductionMode(getenv)` is true AND `geteuid()` returns 0; returns nil otherwise. No wrapping, no formatting with caller-supplied fields — the structured log fields in `main` carry the operator-facing context. The test that asserts this contract is the 3-row matrix below; if the implementation deviates from "production AND uid 0", the matrix breaks. + +Two design decisions worth naming: + +1. **`geteuid` is an injected `func() int` parameter, not a package var or interface.** Mirrors the `getenv func(string) string` seam pattern that `CheckInsecureListenInProduction` already establishes (line 41). The call site in `main.go` passes `syscall.Geteuid` directly (function values, no closure needed); tests pass a closure returning a fixed int. No `t.Setenv`-equivalent for uid exists in stdlib (you cannot change a process's uid mid-test without re-exec), so an injected seam is the *only* way to exercise the uid-0 branch in a unit test without re-execing as root. The AC names this requirement explicitly. +2. **Single check covers both prongs.** An alternative is two helpers — `IsRunningAsRoot() bool` and `CheckRunningAsRoot(...)` — paralleling `IsProductionMode` / `CheckInsecureListenInProduction`. **Rejected.** `IsRunningAsRoot` has no second consumer (no logging needs to branch on it, no metric labels it); exporting it is dead surface area today and a temptation to call it without the production guard tomorrow. Keep it inline inside the check; export `IsRunningAsRoot` later if and when a second consumer appears. 
+ +### Wiring in `cmd/pyrycode-relay/main.go` + +Insert immediately after the existing `CheckInsecureListenInProduction` block (currently lines 64–70) and immediately before `CheckCapabilities` (line 72): + +- Call signature: `relay.CheckRunningAsRoot(syscall.Geteuid, os.Getenv)`. +- Error branch: structured log at `Error` level with fields `err`, `env_var="PYRYCODE_RELAY_PRODUCTION"`, `effective_uid=syscall.Geteuid()`, and a `fix` field naming the two valid resolutions (drop privileges before exec; unset `PYRYCODE_RELAY_PRODUCTION` if the deploy is truly dev). Then `os.Exit(2)`. +- The AC explicitly requires the log line to include both the observed effective uid and `env_var=PYRYCODE_RELAY_PRODUCTION`. The check returns *only* the sentinel — `main.go` is where the operator-facing context (the uid value, the remediation hint) is composed. Capturing the uid for the log line means calling `syscall.Geteuid()` a second time at the log site; this is fine — it is the same uid (the process cannot change its own euid between two adjacent syscalls without an intervening `setuid` call, which the relay never makes) and the cost is negligible. +- Exit code 2 matches the sibling block immediately above (line 69) and the env-config validator block above that (line 61). Exit 1 stays reserved for runtime failures (listener died, autocert failed); exit 2 is configuration-rejected-at-boot. The split lets ops dashboards distinguish "deploy never started" from "deploy started and crashed." + +The wiring order — `CheckEnvConfig` → `CheckInsecureListenInProduction` → `CheckRunningAsRoot` → `CheckCapabilities` — is intentional and worth a one-line comment at the new block: + +- `CheckEnvConfig` (#80) validates the env-var shapes including `PYRYCODE_RELAY_PRODUCTION`'s "exact `1` or unset" contract. Running it first means downstream production-mode checks cannot be fooled by `PYRYCODE_RELAY_PRODUCTION=true` slipping through as "non-production." 
This is the wiring-order invariant #80's spec calls out. +- `CheckInsecureListenInProduction` (#77) and `CheckRunningAsRoot` (#78) are siblings on the production-mode axis; the order between them is not load-bearing. Place #78 second to preserve `git blame` legibility (the new block lands adjacent to the new function it calls). +- `CheckCapabilities` (#79) runs after both — capability-allowlist failures are a Linux-specific runtime concern; production-mode misconfiguration is a deploy-shape concern that should be reported first if both are wrong. + +### Why a `Config` struct is still out of scope + +Same reasoning as #77 § "Why not a Config struct": `main.go` now has four boot-time checks. When the count reaches ~5 and the wiring boilerplate becomes a real cost, a follow-up ticket consolidates them into `Config.Validate() error`. Doing it now is premature abstraction and would balloon this S ticket past its red lines. + +## Concurrency model + +None. The new check runs on the main goroutine before any listener is started, before any goroutine is spawned. No locks, no channels, no shared mutable state. `syscall.Geteuid` is a stateless syscall — concurrent callers would see the same value, but there are no concurrent callers. + +## Error handling + +Single sentinel, no wrapping, no formatting with caller-supplied fields. Same shape as `CheckInsecureListenInProduction`. The error message is the same on every failure; the structured log fields in `main` (`effective_uid`, `env_var`, `fix`) provide the operator-facing context. Downstream code that wants to branch on this failure mode uses `errors.Is(err, ErrRunningAsRoot)`. + +No error case requires retry, fallback, or partial-state recovery. Boot-time refusal is total: the process exits 2 immediately. + +## Testing strategy + +All new tests live in `internal/relay/production_test.go`, are `t.Parallel()`-safe, and never mutate process env or call `syscall.Setuid` (which the test binary cannot do as non-root anyway). 
Reuse the existing `fakeGetenv` helper. Define a tiny helper `fakeGeteuid(n int) func() int { return func() int { return n } }` if the indirection helps readability, or inline `func() int { return n }` at each row. + +### Test 1 — `CheckRunningAsRoot` matrix (the AC verbatim) + +Table-driven, three rows minimum (the AC's cases) plus one optional sanity row: + +| `PYRYCODE_RELAY_PRODUCTION` | `geteuid()` returns | want | +|---|---|---| +| unset | `0` | `nil` (non-production overrides uid-0; we don't refuse to boot a dev relay running as root in a sandbox) | +| `"1"` | `1000` | `nil` (production but not root — the happy path) | +| `"1"` | `0` | `errors.Is(err, ErrRunningAsRoot)` (production + root — refuse) | +| `"1"` | `65534` | `nil` (nobody-uid, sanity check that non-zero non-1000 also returns nil) | + +The fourth row is a small over-add to lock in "uid 0 is the *only* refused uid"; a future refactor that ranges over a "privileged uids" list would break it. Drop the row if it feels gratuitous — the AC names only the three primary cases. + +### Test 2 — sentinel is branchable + +Short test paralleling `TestErrInsecureListenInProduction_IsBranchable` (line 107): assert `errors.Is(ErrRunningAsRoot, ErrRunningAsRoot)` and assert the error returned from `CheckRunningAsRoot(func() int { return 0 }, fakeGetenv(map[string]string{envProductionMode: "1"}))` satisfies `errors.Is(err, ErrRunningAsRoot)`. The point is to lock in the `errors.Is` contract so a future refactor that drops the sentinel from the wrap chain (for example returning `fmt.Errorf("uid %d: %v", ...)` or a fresh `errors.New`) breaks the test, not downstream callers. A `%w` wrap would still satisfy `errors.Is` and would not trip this test. + +### What is NOT tested + +- The `main.go` wiring (same justification as #77): adding a fork-exec integration test for one `if err != nil { os.Exit(2) }` block is over-engineering. Code review of the diff is the gate. +- The structured log line's exact field names (same justification as #77): operator-facing prose, not a machine-parsed format. +- Actually re-execing as root to verify `syscall.Geteuid()` returns 0 in that case. 
The standard-library function is trusted; the test seam is precisely so we do not need to. +- Per-OS behaviour (linux vs darwin). `syscall.Geteuid` is implemented on both with identical semantics; no build tag, no platform fork. If the Windows port ever materialises (it will not — the relay targets linux for prod and darwin for dev), `syscall.Geteuid` returns -1 there, which trivially does not equal 0, so the check is benignly inert. No test required. + +## Open questions + +None. The sentinel name (`ErrRunningAsRoot`), check function name (`CheckRunningAsRoot`), uid source (injected `func() int`), env-var reuse (`IsProductionMode` from #77), wiring slot (after `CheckInsecureListenInProduction`, before `CheckCapabilities`), and exit code (2) are all settled by the ticket body, the sibling spec (#77), and the existing wiring in `main.go`. + +## Security review + +**Verdict:** PASS + +**Findings:** + +- **[Trust boundaries]** No findings. Two inputs cross into the check: the `PYRYCODE_RELAY_PRODUCTION` env var (operator-controlled, not network-attacker-controlled; reused via `IsProductionMode`) and the effective uid (kernel-supplied, not user-supplied). Neither flows from a network-facing path. The check is a fail-closed gate on two known shapes; there is no parser, no allocator, no value-dependent code path beyond the boolean conjunction. +- **[Tokens, secrets, credentials]** N/A. No tokens, no secrets, no credential material is read, written, or compared. +- **[File operations]** N/A. No file is read, written, statted, or unlinked. No path concatenation. +- **[Subprocess / external command execution]** N/A. No `exec.Command`, no `os.StartProcess`, no `syscall.ForkExec`. The check observes the existing process; it does not spawn another. +- **[Cryptographic primitives]** N/A. No RNG, no hash, no comparison of attacker-controlled values. +- **[Network & I/O]** No findings. 
The check runs *before* any listener is opened (`mux := http.NewServeMux()` is on line 88 of the current `main.go`, several lines after where this check inserts). The whole purpose of the ticket is to *prevent* a listener from starting in a misconfigured state. The autocert path (#9) and the existing `http.Server` timeout configuration are unchanged. +- **[Error messages, logs, telemetry]** One **SHOULD FIX** for the developer (called out inline in the Wiring section above). The log line includes `effective_uid` as a structured field. The value comes from `syscall.Geteuid()` — a kernel-supplied integer, not a user-supplied string, so log-injection is structurally impossible. The `env_var` field is the *name* of the env var, not its value (same convention as #77's wiring); do not extend the log to include `os.Getenv("PYRYCODE_RELAY_PRODUCTION")` for the same reason #77 calls out: a confused operator might write anything there and we do not want it ending up in centralised logs. The `fix` field is a static string constant. No PII, no token, no path. +- **[Concurrency]** N/A. The check runs on the main goroutine before any goroutine is spawned; no shared mutable state. `syscall.Geteuid` is reentrant and stateless. +- **[Threat model alignment]** No findings. `pyrycode/pyrycode/docs/protocol-mobile.md` § Security model assumes the relay is internet-exposed and untrusted by the binary; the host the relay runs on is assumed to enforce least-privilege so that a relay RCE does not escalate to host-root RCE. This ticket is the in-process enforcement of the "non-root execution" half of that assumption when production-mode is explicitly tagged. The complement (a CI / deploy-manifest check that prod manifests do not set `runAsUser: 0` or `--user 0`) is out of scope and could be a follow-up ticket against the deploy manifest (#38). +- **[Adversarial framing]** What if an attacker controls `PYRYCODE_RELAY_PRODUCTION`? 
They cannot — the env var is set by the operator/orchestrator before exec; a network attacker has no path to mutate it. What if an attacker controls the effective uid? Same answer — uid is set by the kernel based on the exec context; a network attacker has no path to flip it. What if an attacker exploits an RCE *during* boot, before the check runs? The check is the third in a sequence (`CheckEnvConfig` → `CheckInsecureListenInProduction` → `CheckRunningAsRoot` → `CheckCapabilities`); none of those open a listener or accept untrusted input, so there is no pre-check RCE surface. What if the check itself has a logic bug that lets uid 0 through in production? The matrix test (3 of the 4 rows) is the lock; a regression would break the third row. + +**Reviewer:** architect (self-review per `architect/security-review.md`) +**Date:** 2026-05-13 From 772018186dfc9858c3a71460b1ba53be3594a7c8 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 09:41:34 +0300 Subject: [PATCH 2/3] feat(relay): refuse to boot as effective uid 0 in production mode (#78) Adds a deterministic in-process backstop for the CI non-root-build contract: when PYRYCODE_RELAY_PRODUCTION=1 AND syscall.Geteuid() == 0, the relay refuses to start with exit 2 before any listener is opened. A `docker run --user 0` or a missing/overridden USER directive at deploy time would otherwise silently run the internet-facing process as root. Mirrors the sibling pattern from #77: exported sentinel ErrRunningAsRoot, CheckRunningAsRoot(geteuid, getenv) with injected seams for tests, and structured log fields (effective_uid, env_var, fix) at the call site. 
Co-Authored-By: Claude Opus 4.7 --- cmd/pyrycode-relay/main.go | 17 +++++++ internal/relay/production.go | 31 ++++++++++++ internal/relay/production_test.go | 79 +++++++++++++++++++++++++++++++ 3 files changed, 127 insertions(+) diff --git a/cmd/pyrycode-relay/main.go b/cmd/pyrycode-relay/main.go index 47cea7e..0c6932d 100644 --- a/cmd/pyrycode-relay/main.go +++ b/cmd/pyrycode-relay/main.go @@ -13,6 +13,7 @@ import ( "log/slog" "net/http" "os" + "syscall" "time" "github.com/pyrycode/pyrycode-relay/internal/relay" @@ -69,6 +70,22 @@ func main() { os.Exit(2) } + // CheckRunningAsRoot is the in-process backstop for the CI non-root-build + // contract: docker run --user 0 or a missing/overridden USER directive at + // deploy time escapes CI and would otherwise silently run the + // internet-facing process as root. Runs before CheckCapabilities because + // production-mode misconfiguration is a deploy-shape concern that should + // be reported before Linux-specific runtime concerns. See + // docs/specs/architecture/78-refuse-boot-as-root-in-production.md § Wiring. + if err := relay.CheckRunningAsRoot(syscall.Geteuid, os.Getenv); err != nil { + logger.Error("refusing to start: production-mode misconfiguration", + "err", err, + "env_var", "PYRYCODE_RELAY_PRODUCTION", + "effective_uid", syscall.Geteuid(), + "fix", "drop privileges before exec (e.g. 
Dockerfile USER directive or --user , kubernetes securityContext.runAsUser), or unset PYRYCODE_RELAY_PRODUCTION if the deploy is truly dev") + os.Exit(2) + } + if err := relay.CheckCapabilities(); err != nil { logger.Error("refusing to start: unexpected Linux capabilities", "err", err, diff --git a/internal/relay/production.go b/internal/relay/production.go index e23d37c..8c3759f 100644 --- a/internal/relay/production.go +++ b/internal/relay/production.go @@ -44,3 +44,34 @@ func CheckInsecureListenInProduction(insecureListen string, getenv func(string) } return nil } + +// ErrRunningAsRoot is returned by CheckRunningAsRoot when the relay is +// configured for production mode (PYRYCODE_RELAY_PRODUCTION=1) AND the +// effective uid is 0. Running an internet-exposed process as root in +// production is a fail-fast misconfiguration, not a runtime degradation: +// a `docker run --user 0` at deploy time or a missing/overridden USER +// directive escapes the CI non-root-build check, and any RCE in the relay +// would then escalate to a root RCE on the host. The relay refuses to +// start so the misconfigured deploy fails its health check rather than +// serving traffic. +var ErrRunningAsRoot = errors.New("relay: effective uid is 0 with PYRYCODE_RELAY_PRODUCTION=1; refusing to start") + +// CheckRunningAsRoot returns ErrRunningAsRoot when production mode is on +// (per IsProductionMode) AND geteuid returns 0. Returns nil otherwise. +// Intended to be called from main after flag parse, before any listener +// is started. +// +// geteuid is the effective-uid lookup function; pass syscall.Geteuid at +// the call site, an injected func in tests. The seam exists because no +// stdlib equivalent of t.Setenv exists for uid (a process cannot change +// its own euid mid-test without re-exec), so the uid-0 branch can only +// be exercised in a unit test via an injected function. 
+// +// getenv is the env-var lookup function; pass os.Getenv at the call +// site, an injected func in tests. +func CheckRunningAsRoot(geteuid func() int, getenv func(string) string) error { + if IsProductionMode(getenv) && geteuid() == 0 { + return ErrRunningAsRoot + } + return nil +} diff --git a/internal/relay/production_test.go b/internal/relay/production_test.go index b06388f..d305b15 100644 --- a/internal/relay/production_test.go +++ b/internal/relay/production_test.go @@ -117,3 +117,82 @@ func TestErrInsecureListenInProduction_IsBranchable(t *testing.T) { t.Errorf("returned error %v should satisfy errors.Is(err, ErrInsecureListenInProduction)", err) } } + +func fakeGeteuid(n int) func() int { + return func() int { return n } +} + +func TestCheckRunningAsRoot_Matrix(t *testing.T) { + t.Parallel() + + cases := []struct { + name string + productionMode string // "" with setEnv=false means unset + setEnv bool + uid int + wantSentinel bool + }{ + { + name: "non-production + uid 0 returns nil", + setEnv: false, + uid: 0, + wantSentinel: false, + }, + { + name: "production + uid 1000 returns nil", + productionMode: "1", + setEnv: true, + uid: 1000, + wantSentinel: false, + }, + { + name: "production + uid 0 returns sentinel", + productionMode: "1", + setEnv: true, + uid: 0, + wantSentinel: true, + }, + { + name: "production + nobody uid 65534 returns nil", + productionMode: "1", + setEnv: true, + uid: 65534, + wantSentinel: false, + }, + } + + for _, tc := range cases { + tc := tc + t.Run(tc.name, func(t *testing.T) { + t.Parallel() + env := map[string]string{} + if tc.setEnv { + env[envProductionMode] = tc.productionMode + } + err := CheckRunningAsRoot(fakeGeteuid(tc.uid), fakeGetenv(env)) + if tc.wantSentinel { + if !errors.Is(err, ErrRunningAsRoot) { + t.Errorf("got err %v, want errors.Is(err, ErrRunningAsRoot)", err) + } + } else { + if err != nil { + t.Errorf("got err %v, want nil", err) + } + } + }) + } +} + +func TestErrRunningAsRoot_IsBranchable(t 
*testing.T) { + t.Parallel() + + if !errors.Is(ErrRunningAsRoot, ErrRunningAsRoot) { + t.Fatal("ErrRunningAsRoot should be errors.Is itself") + } + + env := map[string]string{envProductionMode: "1"} + err := CheckRunningAsRoot(fakeGeteuid(0), fakeGetenv(env)) + if !errors.Is(err, ErrRunningAsRoot) { + t.Errorf("returned error %v should satisfy errors.Is(err, ErrRunningAsRoot)", err) + } +} From c01e076902ae037bdf8e01ebcbfc4575a262ff88 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 09:46:09 +0300 Subject: [PATCH 3/3] docs: refuse-to-boot-as-root-in-production (#78) Folds the second consumer of the PYRYCODE_RELAY_PRODUCTION contract into the production-mode feature doc (CheckRunningAsRoot + ErrRunningAsRoot), adds per-ticket codebase note for #78, and refreshes the INDEX entry. Co-Authored-By: Claude Opus 4.7 --- docs/knowledge/INDEX.md | 2 +- docs/knowledge/codebase/78.md | 46 ++++++++++++++++++++ docs/knowledge/features/production-mode.md | 49 +++++++++++++++++----- 3 files changed, 86 insertions(+), 11 deletions(-) create mode 100644 docs/knowledge/codebase/78.md diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index 66b87ae..abcd26c 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -6,7 +6,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o - [Env-var config validator (boot-time refusal)](features/env-config-validator.md) — table-driven validation of every env var the relay reads at boot. Single source of truth is the unexported `envContracts []envContract` registry in `internal/relay/env_config.go`; each row carries `name`, `required` bool, and an inline `validate func(string) error`. 
`CheckEnvConfig(lookup func(string) (string, bool)) error` walks the registry and returns the structured `*ErrInvalidConfig{Key, Reason}` on the first failure (`Reason` is `"missing"` or `"malformed-value: "`); the package-level sentinel `ErrInvalidConfigSentinel` is matched via a custom `Is` method (not `Unwrap`, which would double-print the message prefix) so `errors.Is(err, ErrInvalidConfigSentinel)` and `errors.As(err, &cfgErr)` form a dual contract. The `func(string) (string, bool)` (= `os.LookupEnv` shape) getter coexists with #77's `func(string) string` getter — the presence bit is necessary here to distinguish "missing-but-required" from "present-but-empty", semantically inert for `IsProductionMode`'s exact-`"1"` match. **Ordering is load-bearing**: wired in `main.go` BEFORE `CheckInsecureListenInProduction` so a typo like `PYRYCODE_RELAY_PRODUCTION=true` cannot slip through `IsProductionMode`'s silent-non-production fallback and reach the insecure-listen guard with an unvalidated value. Today's registry has one row (`PYRYCODE_RELAY_PRODUCTION`, optional-but-format-validated); future env-var reads register here at code-review time. `checkEnvConfigWith(lookup, contracts)` is the parameterised inner used by the `required: true` test case (today's production table has no required entries). Exit 2 = config-rejected-at-boot, matching the sibling refusals (#9, #77, #79) (#80). - [Linux capability allowlist (boot-time refusal)](features/capability-allowlist.md) — relay parses `/proc/self/status`'s `CapEff:` hex mask at boot and refuses to start (exit 2) if any bit is set outside `AllowedCapabilities` (currently `{CAP_NET_BIND_SERVICE}` only, motivated by autocert binding `:80`/`:443` from uid 65532 in the distroless image). 
Exported sentinel `ErrUnexpectedCapability` is branchable via `errors.Is`; the wrapped error names every offending bit symbolically (`CAP_SYS_ADMIN (bit 21)` or `bit 63` for unknown), lists the allowlist contents, and embeds the operator fix string. `CapEff` only — `CapPrm/CapBnd/CapInh` would broaden false-positives (legitimate K8s default policy grants wide CapBnd) without adding load-bearing protection (relay never `capset(2)`s). Linux/non-Linux split at compile time via the new `_.go` / `_other.go` build-tag convention (see ADR-0009); non-Linux GOOS logs one skip line and returns nil. Unconditional — no production-mode gating, no env-var bypass, because stray capabilities are never legitimate. Reader-boundary test seam (`func() (string, error)`) exercises the parse + mask check end-to-end without touching real `/proc`. Joins the boot-time-refusal sentinel family (#9, #77, #79; future #78) (#79). -- [Production-mode contract & `--insecure-listen` startup refusal](features/production-mode.md) — `PYRYCODE_RELAY_PRODUCTION=1` env-var contract (exact-string match, lazy read via injected getter, mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` shape from #64/#65) plus the first boot-time check that consumes it: `relay.CheckInsecureListenInProduction` returns the exported `ErrInsecureListenInProduction` sentinel (branchable via `errors.Is`) when production mode is on AND `--insecure-listen` is set, wired into `cmd/pyrycode-relay/main.go` after flag-parse with `os.Exit(2)` (config-rejected-at-boot, distinct from runtime-failure exit 1) and structured log fields naming the env var (name only, never value) and a one-line `fix` listing both valid resolutions. `IsProductionMode` exported so sibling startup checks (#78 = uid-0) compose on the same predicate without re-reading the env var. 
Test seam is a `func(string) string` getter (smallest possible — no interface, no struct, no package-level var) so the 2×2 AC matrix and the value-space matrix run under `t.Parallel()` + `-race` without mutating process env (#77). +- [Production-mode contract & startup refusals](features/production-mode.md) — `PYRYCODE_RELAY_PRODUCTION=1` env-var contract (exact-string match, lazy read via injected getter, mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` shape from #64/#65) plus the boot-time checks that consume it. **#77** introduced `relay.CheckInsecureListenInProduction` + exported `ErrInsecureListenInProduction` sentinel (branchable via `errors.Is`) firing when production mode is on AND `--insecure-listen` is set. **#78** added the second consumer: `relay.CheckRunningAsRoot(geteuid, getenv)` + exported `ErrRunningAsRoot` sentinel firing when production mode is on AND `syscall.Geteuid() == 0`, closing the deploy-time gap (`docker run --user 0`, `securityContext.runAsUser: 0`, hand-edited Dockerfile dropping `USER`) that escapes the CI non-root-build contract (#32 Dockerfile, #68 Trivy). Both wired in `cmd/pyrycode-relay/main.go` after flag-parse with `os.Exit(2)` (config-rejected-at-boot, distinct from runtime-failure exit 1) and structured log fields: `env_var` carries the name only (never the value, even though `effective_uid` carries the kernel-supplied int — log-injection structurally impossible), one-line `fix` listing valid resolutions. `IsProductionMode` exported so siblings compose on the same predicate without re-reading the env var. Test seams: `func(string) string` for env, `func() int` for euid — both the smallest possible (no interface, no struct, no package-level var) and the only way to exercise the uid-0 branch in a unit test without re-execing the test binary as root. Two instances of the shape (#77, #78) now codify the "sibling boot-check" pattern; `Config.Validate()` consolidation deferred until ~5 checks exist (#77, #78). 
- [Fly.io deploy](features/fly-deploy.md) — production host wiring: `fly.toml` declares TCP-passthrough on `:80`/`:443` (no Fly HTTP proxy, no Fly-managed certs) so TLS keeps terminating in the relay via autocert (#9), persistent Fly volume `relay_autocert` mounted at `/var/lib/relay/autocert`, and a single-machine hard cap encoded via `min_machines_running=1` + `auto_start_machines=false` + `auto_stop_machines="off"` + `[deploy] strategy="immediate"` (Fly Apps v2 has no `max_machines` key; the in-binary `PYRYCODE_RELAY_SINGLE_INSTANCE` self-check from #65 is the backstop). CI `deploy` job in `.github/workflows/ci.yml` runs `flyctl deploy --remote-only` on push to `main`, gated by branch-condition + `needs: [test, security, image-scan]` + `permissions: contents: read` so `FLY_API_TOKEN` is structurally unreachable from PR code; `superfly/flyctl-actions/setup-flyctl` pinned by commit SHA with `# Tracks:` comment (same convention as #68 / #41). Dedicated IPv4 is required (not optional) for autocert's HTTP-01 challenge; TCP passthrough preserves the real socket peer IP that #34's rate limiter reads. `__REGION__` / `__DOMAIN__` ship as placeholders that fail loud on first deploy (#38). - [Connection-count gauges](features/connection-count-gauges.md) — `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones` exposed via a pull-based `prometheus.Collector` reading `Registry.Counts()` on each scrape; zero edits to `registry.go`; scalar (no labels) by design — `{server="..."}` would carry the attacker-influenced `x-pyrycode-server` header onto the metrics surface, which threat-model § Log hygiene forbids; stale grace-expiry fires can't move the gauge because the pointer-identity guard (ADR-0006) keeps the maps unchanged and the gauge IS the map size; race-tested against 16 mutator goroutines + a tight-loop scraper under `-race`. First collector wired into the #59 seam (#61). 
- [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars) — first instantiated by #61's `connectionsCollector`. Listener still pending (#60). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59). diff --git a/docs/knowledge/codebase/78.md b/docs/knowledge/codebase/78.md new file mode 100644 index 0000000..6fb246c --- /dev/null +++ b/docs/knowledge/codebase/78.md @@ -0,0 +1,46 @@ +# Ticket #78 — refuse to boot as effective uid 0 in production mode + +Adds the second consumer of the `PYRYCODE_RELAY_PRODUCTION=1` contract introduced by #77: when production mode is on AND `syscall.Geteuid() == 0`, the relay refuses to start with exit 2 before any listener is opened. Closes the deploy-time half of the non-root contract that CI verifies for the build image (Dockerfile USER directive, #32; Trivy scan, #68) — `docker run --user 0`, `securityContext.runAsUser: 0`, or a hand-edited Dockerfile that drops the USER directive all escape CI today and would otherwise silently run the internet-facing process as root. Split from #42; sibling of #77. + +## Implementation + +- **`internal/relay/production.go` (+31 lines)** — appended to the file #77 created. Two new exported names; the unexported `envProductionMode` const and the `IsProductionMode` predicate are reused unchanged. + - `ErrRunningAsRoot` — sentinel; message names both prongs (`"relay: effective uid is 0 with PYRYCODE_RELAY_PRODUCTION=1; refusing to start"`) so a log line is self-documenting. Branchable via `errors.Is`. 
+ - `CheckRunningAsRoot(geteuid func() int, getenv func(string) string) error` — single-conjunction body: `if IsProductionMode(getenv) && geteuid() == 0 { return ErrRunningAsRoot }`. No wrapping, no formatting; structured-log fields in `main.go` carry the operator-facing context. +- **`internal/relay/production_test.go` (+79 lines)** — appended alongside #77's tests: + - `fakeGeteuid(n int) func() int` — companion to the existing `fakeGetenv` closure helper. + - `TestCheckRunningAsRoot_Matrix` — four rows (the AC's three plus a `uid=65534` "nobody" row that locks in "uid 0 is the *only* refused uid"). All `t.Parallel()`-safe; process env and process uid are never mutated. + - `TestErrRunningAsRoot_IsBranchable` — locks the `errors.Is` contract so a future `fmt.Errorf(...: %w, ...)` refactor breaks the test, not downstream callers. +- **`cmd/pyrycode-relay/main.go` (+17 lines)** — `syscall` import added; check inserted between `CheckInsecureListenInProduction` (#77, lines 62-68) and `CheckCapabilities` (#79, lines 72+). `syscall.Geteuid` is passed directly as a function value (no closure). Structured fields: `err`, `env_var="PYRYCODE_RELAY_PRODUCTION"` (name only, never value), `effective_uid=syscall.Geteuid()` (kernel-supplied int — log-injection structurally impossible), and a `fix` string naming the two valid resolutions (drop privileges before exec; unset `PYRYCODE_RELAY_PRODUCTION` if the deploy is truly dev). Exit code 2. + +The whole check is one boolean conjunction over two stateless inputs — no allocator, no parser, no value-dependent path. The `syscall.Geteuid` call at the log site is a second syscall, deliberate: the relay never calls `setuid`, so the value cannot change between the check and the log line. + +## Acceptance criteria — verification map + +- AC-1 (`ErrRunningAsRoot` exported sentinel, branchable via `errors.Is`): `production.go:48-57`, locked by `TestErrRunningAsRoot_IsBranchable`. 
+- AC-2 (check function returns sentinel iff production + uid 0, uid source injected): `production.go:59-77`. +- AC-3 (wired in `main.go` after flag parse, before any listener, log includes effective uid + `env_var=PYRYCODE_RELAY_PRODUCTION`, exits 2): `main.go:72-86`. +- AC-4 (unit tests cover the three primary rows via injected uid seam, not re-exec): `production_test.go:122-194`. +- AC-5 (no duplicate read of `PYRYCODE_RELAY_PRODUCTION`): `CheckRunningAsRoot` calls `IsProductionMode(getenv)`, so there is a single read path and no `os.Getenv` call inside `internal/relay/production.go` outside the predicate. + +## Patterns established + +- **Sibling boot-check shape: `ErrXxx` + `CheckXxxInProduction(injected_seams...)`.** Each new production-only startup check follows the two-name shape and wires into `main.go` as `if err := relay.CheckXxx(...); err != nil { logger.Error(...); os.Exit(2) }`. Once is precedent, twice is the pattern: #77 and #78 together codify the shape for any future production-only check. When the wiring boilerplate becomes a real cost (~5 checks; we are at 4 today including the unconditional `CheckCapabilities`), a follow-up consolidates into `relay.Config` + `Config.Validate()`. +- **Injected `func() int` for syscall-backed inputs, paralleling injected `func(string) string` for env.** When a check depends on process-global state that has no stdlib `t.Set*` (uid, gid, hostname, pid), the smallest test seam is a typed function value: no interface, no struct, no package-level var. Call site passes the stdlib function directly (`syscall.Geteuid`); tests pass a closure returning a fixed value. The matrix test exercises every branch under `t.Parallel()` + `-race` without re-execing the test binary as root. +- **Log kernel-supplied integers as structured fields without sanitisation.** `effective_uid` is `int`; log-injection requires a string-formatting path. 
The `env_var` companion field still carries the *name*, never the env var's value, for the same reason #77 calls out (a confused operator might write a secret into the env var and we don't want it in centralised logs). + +## Lessons learned + +- **Once is precedent, twice is the pattern; don't file an ADR for the second instance.** #77 established the "production-mode sibling check" shape; #78 is the second instance of that shape. Filing an ADR for "we did the same thing again" would be process noise (same reasoning as #77's own "skip the ADR when the design follows precedent" lesson). The shape is now documented as a pattern in `features/production-mode.md` and `codebase/77.md`; subsequent instances are codebase notes only. +- **Inject the syscall, don't `t.Setenv` it.** The AC explicitly forbids exercising the uid-0 branch by re-execing as root, and there is no stdlib equivalent of `t.Setenv` for uid in the first place: an unprivileged test process cannot raise its own euid to 0 mid-test. The `func() int` seam is the only shape that makes the branch testable without privileged re-exec. Reach for the smallest such seam whenever a check depends on a process-global the test cannot mutate; do *not* reach for an interface or a package-level var. +- **Resist the "two helpers per concern" symmetry.** The natural shape would mirror `IsProductionMode` + `CheckInsecureListenInProduction` → `IsRunningAsRoot` + `CheckRunningAsRoot`; this was rejected during architect review. A bool helper with no second consumer is dead surface area today and a temptation tomorrow to call it without the production-mode guard. Inline the predicate inside the check until a second consumer materialises; export the helper at that point if it ever does. +- **Call the syscall a second time at the log site rather than threading the value back through the error.** The check returns the bare sentinel (no `fmt.Errorf("uid %d: %w", ...)`), so `errors.Is` works without unwrapping rules. 
`main.go` does the second `syscall.Geteuid()` for the structured-log field. The two syscalls observe the same value because the relay never calls `setuid`; the alternative (threading the int through the error) would have forced every downstream `errors.Is` caller to also unwrap. + +## Cross-links + +- [Production-mode contract & startup refusals](../features/production-mode.md) — feature doc; now documents both consumers (`CheckInsecureListenInProduction`, `CheckRunningAsRoot`) of the contract. +- [Codebase note #77](77.md) — sibling check (`--insecure-listen`); same wiring slot, same exit-code split, same getter-seam pattern. #78 is the second instance of the shape #77 established. +- [Docker image](../features/docker-image.md) — #32; build-time non-root contract this ticket complements at deploy time. +- [Linux capability allowlist](../features/capability-allowlist.md) — #79; the next boot-time refusal in the wiring sequence (unconditional, not production-mode-gated). +- [`internal/relay/tls.go`](../../../internal/relay/tls.go) — `ErrCacheDirInsecure`, the canonical boot-time-refusal sentinel that started this family. +- [#42 — parent ticket](https://github.com/pyrycode/pyrycode-relay/issues/42) — split into #77 / #78. diff --git a/docs/knowledge/features/production-mode.md b/docs/knowledge/features/production-mode.md index c8a4a90..958abd0 100644 --- a/docs/knowledge/features/production-mode.md +++ b/docs/knowledge/features/production-mode.md @@ -1,6 +1,6 @@ -# Production-mode contract & `--insecure-listen` startup refusal +# Production-mode contract & startup refusals -Single-env-var production-mode signal for the relay, plus the first boot-time check that consumes it. Defined by #77; sibling startup checks (#78 = refuse-to-run-as-uid-0; future others) reuse the contract rather than re-reading the env var. +Single-env-var production-mode signal for the relay, plus the boot-time checks that consume it. 
Defined by #77; #78 added a second consumer (refuse-to-run-as-uid-0) that reuses the contract rather than re-reading the env var. Future production-only startup checks compose on `IsProductionMode` the same way. ## The contract @@ -12,9 +12,11 @@ Strict-equality (not `strconv.ParseBool` "truthy" parsing) is intentional and mi ## API (`internal/relay/production.go`) -- `IsProductionMode(getenv func(string) string) bool` — reports whether `getenv("PYRYCODE_RELAY_PRODUCTION") == "1"`. Exported so sibling checks (#78+) compose on the same predicate. +- `IsProductionMode(getenv func(string) string) bool` — reports whether `getenv("PYRYCODE_RELAY_PRODUCTION") == "1"`. Exported so sibling checks compose on the same predicate (today: `CheckInsecureListenInProduction`, `CheckRunningAsRoot`). - `CheckInsecureListenInProduction(insecureListen string, getenv func(string) string) error` — returns `ErrInsecureListenInProduction` when production mode is on AND `insecureListen != ""`, nil otherwise. Intended to run after `flag.Parse()`, before any listener is started. - `ErrInsecureListenInProduction` — exported sentinel, branchable via `errors.Is`. Message names both inputs (`"relay: --insecure-listen is set with PYRYCODE_RELAY_PRODUCTION=1; refusing to start"`) so a log line is self-documenting. +- `CheckRunningAsRoot(geteuid func() int, getenv func(string) string) error` — returns `ErrRunningAsRoot` when production mode is on AND `geteuid() == 0`, nil otherwise. Intended to run after `flag.Parse()`, before any listener is started. `geteuid` exists as an injected `func() int` for the same reason `getenv` does: there is no stdlib equivalent of `t.Setenv` for the uid (a process cannot change its own euid mid-test without re-exec), so the uid-0 branch is only reachable in a unit test via the seam. +- `ErrRunningAsRoot` — exported sentinel, branchable via `errors.Is`. Message names both inputs (`"relay: effective uid is 0 with PYRYCODE_RELAY_PRODUCTION=1; refusing to start"`). 
- `envProductionMode` (unexported) — the env-var-name constant. In-package siblings reuse it; out-of-package callers go through `IsProductionMode`. ## Why this shape (test seam) @@ -23,7 +25,7 @@ The check takes a `func(string) string` rather than calling `os.Getenv` directly ## Wiring (`cmd/pyrycode-relay/main.go`) -After flag-parse and after the existing "either `--domain` or `--insecure-listen`" guard, before `relay.NewRegistry()`: +Boot-time check ordering, top to bottom: `CheckEnvConfig` (#80) → `CheckInsecureListenInProduction` (#77) → `CheckRunningAsRoot` (#78) → `CheckCapabilities` (#79). Each `if err != nil` branch logs at error level and `os.Exit(2)`s; no listener has been opened yet at any point in the sequence. ```go if err := relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv); err != nil { @@ -33,14 +35,27 @@ if err := relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv); err "fix", "remove --insecure-listen and set --domain, or unset PYRYCODE_RELAY_PRODUCTION") os.Exit(2) } + +if err := relay.CheckRunningAsRoot(syscall.Geteuid, os.Getenv); err != nil { + logger.Error("refusing to start: production-mode misconfiguration", + "err", err, + "env_var", "PYRYCODE_RELAY_PRODUCTION", + "effective_uid", syscall.Geteuid(), + "fix", "drop privileges before exec (e.g. Dockerfile USER directive or --user , kubernetes securityContext.runAsUser), or unset PYRYCODE_RELAY_PRODUCTION if the deploy is truly dev") + os.Exit(2) +} ``` -- **Exit code 2** matches the flag-validation guard immediately above. Exit 1 is reserved for runtime failures in this binary (listener died, autocert failed); exit 2 = configuration-rejected-at-boot. Splitting the codes lets ops dashboards distinguish "deploy never started" from "deploy started and crashed." -- **`fix` field** lists both valid resolutions (remove the flag OR unset the env var). The operator picks whichever input was wrong. 
+- **Exit code 2** matches the flag-validation guard and every other config-rejected-at-boot block. Exit 1 is reserved for runtime failures in this binary (listener died, autocert failed); exit 2 = configuration-rejected-at-boot. Splitting the codes lets ops dashboards distinguish "deploy never started" from "deploy started and crashed." +- **`fix` field** lists both valid resolutions (remove the flag / drop privileges OR unset the env var). The operator picks whichever input was wrong. - **`env_var` field carries the name, never the value.** Even if a confused operator put a secret in `PYRYCODE_RELAY_PRODUCTION`, it would not be logged. Do not extend this log line to include the env var value. +- **`effective_uid` field is the kernel-supplied integer**, not a user-supplied string — log-injection is structurally impossible. The value is read a second time at the log site (after the check returns `ErrRunningAsRoot`); the process cannot change its own euid between two adjacent syscalls without an intervening `setuid` call, which the relay never makes. +- **Order between `CheckInsecureListenInProduction` and `CheckRunningAsRoot` is not load-bearing** (they are siblings on the production-mode axis). They both run before `CheckCapabilities` because production-mode misconfigurations are deploy-shape concerns that should surface before Linux-specific runtime concerns. ## Behaviour matrix +### `CheckInsecureListenInProduction` (#77) + | `PYRYCODE_RELAY_PRODUCTION` | `--insecure-listen` | Result | |---|---|---| | unset / not `"1"` | any | nil (no check fires) | @@ -49,23 +64,37 @@ if err := relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv); err Autocert (`--domain`) is not inspected — the contract is purely about plaintext-in-prod. Setting `--domain` with `PYRYCODE_RELAY_PRODUCTION=1` is the happy path. 
+### `CheckRunningAsRoot` (#78) + +| `PYRYCODE_RELAY_PRODUCTION` | `geteuid()` | Result | +|---|---|---| +| unset / not `"1"` | `0` | nil (dev-mode root is allowed; sandboxed dev runs as root are unremarkable) | +| `"1"` | `1000` (any non-zero) | nil | +| `"1"` | `0` | `ErrRunningAsRoot` → `os.Exit(2)` | + +uid `0` is the *only* refused uid — the check has no notion of a "privileged uids" set. A non-zero uid that nominally owns sensitive resources (e.g. a service-account uid mapped onto host root via user namespaces) is the deploy layer's problem, not the relay's. The fourth test row (`uid=65534`) exists to lock in this scope. + Since #80 the matrix's "not `"1"`" row only describes the *post-validation* state: a malformed `PYRYCODE_RELAY_PRODUCTION` value (`"true"`, `"yes"`, `" 1"`, `"PRODUCTION"`, etc.) is now caught by [`CheckEnvConfig`](env-config-validator.md) *before* `CheckInsecureListenInProduction` is consulted, so the relay refuses to boot at the env-validation stage rather than silently treating the typo as "not production". `IsProductionMode`'s strict-`"1"` contract is unchanged; the validator simply ensures no other value ever reaches it. ## Threat model alignment -`pyrycode/pyrycode/docs/protocol-mobile.md` § Security model assumes TLS for all production traffic. This is the in-binary enforcement of that assumption *when production mode is explicitly tagged*. The complement — a CI / deploy-manifest check that prod manifests actually set `PYRYCODE_RELAY_PRODUCTION=1` — is out of scope and the responsibility of the deploy layer. +`pyrycode/pyrycode/docs/protocol-mobile.md` § Security model assumes TLS for all production traffic and that the host running the relay enforces least-privilege so a relay RCE does not escalate to host-root RCE. These checks are the in-binary enforcement of both assumptions *when production mode is explicitly tagged*. 
CI already verifies the *build* image is non-root (#32 Dockerfile USER directive, Trivy scan in #68); `CheckRunningAsRoot` closes the deploy-time gap (`docker run --user 0`, `securityContext.runAsUser: 0`, hand-edited Dockerfile dropping `USER`). The complements — CI / deploy-manifest checks that prod manifests actually set `PYRYCODE_RELAY_PRODUCTION=1` and never set `runAsUser: 0` — are out of scope and the responsibility of the deploy layer. -The check is fail-closed: if the env var is on AND plaintext is requested, the relay refuses to boot. There is no degradation path, no fallback, no retry. Boot-time refusal is total. +The checks are fail-closed: if any precondition trips, the relay refuses to boot. There is no degradation path, no fallback, no retry. Boot-time refusal is total. ## Out of scope (deferred) -- **No `Config` struct.** A bundled `relay.Config` with `Validate()` returning a multi-error is a natural extension once 3+ startup checks exist. With one check today (this one) and one more queued (#78), the consolidation is premature. +- **No `Config` struct.** A bundled `relay.Config` with `Validate()` returning a multi-error is a natural extension once ~5 startup checks exist. With four today (`CheckEnvConfig`, `CheckInsecureListenInProduction`, `CheckRunningAsRoot`, `CheckCapabilities`), the wiring boilerplate is approaching the cost threshold; a follow-up ticket will consolidate when it crosses. - **No fork-exec integration test on `main.go`.** The `main` wiring is observable via package-level unit tests; one `if err != nil { os.Exit(2) }` block does not warrant a binary-spawning test. +- **No `IsRunningAsRoot` bool helper.** A second consumer hasn't appeared (no log line branches on it, no metric labels it); exporting it today would be dead surface area and a temptation to call it without the production-mode guard tomorrow. Inline the predicate inside `CheckRunningAsRoot` until a second consumer materialises. 
## Cross-links - ADR: none filed; the shape is precedent-following (mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE`), not a new architectural choice. - [`internal/relay/tls.go`](../../../internal/relay/tls.go) — `ErrCacheDirInsecure` is the canonical boot-time-refusal sentinel this one models. - [Single-instance constraint (v1)](../../architecture.md#single-instance-constraint-v1) — sibling env-var contract (`PYRYCODE_RELAY_SINGLE_INSTANCE`), shape precedent. -- [Codebase ticket note #77](../codebase/77.md) — per-ticket implementation detail. +- [Codebase ticket note #77](../codebase/77.md) — per-ticket implementation detail for the `--insecure-listen` check. +- [Codebase ticket note #78](../codebase/78.md) — per-ticket implementation detail for the uid-0 check. +- [Docker image](docker-image.md) — #32; the build-time non-root contract this ticket complements at deploy time. +- [Linux capability allowlist](capability-allowlist.md) — #79; the next boot-time refusal in the wiring sequence, unconditional rather than production-mode-gated. - [Env-var config validator](env-config-validator.md) — #80; the boot-time validator that polices the malformed-value cases listed in the contract, running before `CheckInsecureListenInProduction` so a typo can never reach `IsProductionMode`.