From f53c1275dca6f9ab8c772d630d2af511121c65e8 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 08:55:47 +0300 Subject: [PATCH 1/3] spec: refuse to boot with --insecure-listen in production (#77) --- .../77-refuse-insecure-in-production.md | 195 ++++++++++++++++++ 1 file changed, 195 insertions(+) create mode 100644 docs/specs/architecture/77-refuse-insecure-in-production.md diff --git a/docs/specs/architecture/77-refuse-insecure-in-production.md b/docs/specs/architecture/77-refuse-insecure-in-production.md new file mode 100644 index 0000000..f4a8299 --- /dev/null +++ b/docs/specs/architecture/77-refuse-insecure-in-production.md @@ -0,0 +1,195 @@ +# Spec: refuse to boot with `--insecure-listen` in production mode + +Ticket: [#77](https://github.com/pyrycode/pyrycode-relay/issues/77). Size S. Split from #42. + +## Files to read first + +- `cmd/pyrycode-relay/main.go` (whole file, 128 lines) — the only call site. Lines 23–43 (flag declarations, `flag.Parse`, the existing "either `--domain` or `--insecure-listen`" guard) are where the new check slots in. Lines 61–119 show the two listener-start branches the check must run *before*. +- `internal/relay/tls.go:15-19` — the canonical sentinel-error pattern in this package: `var ErrCacheDirInsecure = errors.New("relay: …")` plus a Go doc comment that names the contract. Mirror that shape exactly. +- `internal/relay/tls.go:29-49` — the canonical "boot-time refusal" return shape (`return nil, fmt.Errorf("%w: %s …", ErrCacheDirInsecure, …)`). The new check returns `error`, not `(T, error)`, but the wrapping convention is the same. +- `internal/relay/tls_test.go:1-80` — the test style for this package: `t.Parallel()`, table-driven where it helps, `errors.Is` for sentinel assertions, no shared globals. Mirror it. +- `docs/specs/architecture/9-autocert-tls.md:55-100` — the prior architect's reasoning for the `ErrCacheDirInsecure` sentinel and its boot-time use. The "fail loud, before any listener starts" framing is identical to this ticket's intent. +- `docs/specs/architecture/64-single-instance-architecture-doc.md:57-76` — the env-var-contract precedent (`PYRYCODE_RELAY_SINGLE_INSTANCE`). The shape this ticket commits to (`=1` means on, anything else means off, read lazily) is intentionally identical. + +## Context + +Today the relay binary accepts two mutually-exclusive transport flags — `--domain` (autocert TLS on :443/:80) and `--insecure-listen` (plain HTTP on a custom address). The plaintext mode exists for dev loops and behind-a-proxy deployments where TLS is terminated upstream. Nothing distinguishes a dev environment from a production environment; an operator (or an AI agent generating a deploy manifest) who copy-pastes a dev config into prod will boot a plaintext listener facing the internet. + +The bug this ticket prevents is the silent-misconfiguration class: the relay starts, `/healthz` returns 200, traffic flows in plaintext, and the operator only notices when a passive observer (or `tcpdump`) shows tokens or message frames on the wire. The defence must run *at boot, before any listener accepts a connection*, so a misconfigured deploy fails the health-check rather than serving traffic. + +The contract is a single env var, `PYRYCODE_RELAY_PRODUCTION=1`. The shape — name `PYRYCODE_RELAY_*`, value `1` means on, anything else means off — mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` (#39 / #64) so operators only have to remember one shape and can grep `PYRYCODE_RELAY_*` in their manifests to audit production gating. + +This ticket is the canonical place where the `PYRYCODE_RELAY_PRODUCTION` contract is defined. Sibling startup checks (#78 = refuse to run as uid 0 in production; potentially more) will import the exported `IsProductionMode` helper introduced here rather than re-reading the env var. + +Related: #9 (`ErrCacheDirInsecure` set the boot-time-refusal precedent); #39 / #64 (env-var-shape precedent); #42 (parent — was split into #77 / #78). + +## Design + +One new file: `internal/relay/production.go`. One new test file: `internal/relay/production_test.go`. One ~6-line addition to `cmd/pyrycode-relay/main.go`. No existing code is refactored. + +### Package-level constant (unexported) + +```go +const envProductionMode = "PYRYCODE_RELAY_PRODUCTION" +``` + +Unexported because callers should not read the env var directly — they should call `IsProductionMode`. Sibling startup checks in this package (e.g. #78) reuse the constant inside the package; callers in `cmd/` go through `IsProductionMode`. + +### Sentinel error (exported) + +```go +// ErrInsecureListenInProduction is returned by CheckInsecureListenInProduction +// when the relay is configured for production mode (PYRYCODE_RELAY_PRODUCTION=1) +// AND the --insecure-listen flag is set. Serving plaintext traffic from a +// production-tagged process is a fail-fast misconfiguration, not a runtime +// degradation: the relay refuses to start so a dev manifest accidentally +// promoted to prod fails the deploy's health check rather than serving traffic. +var ErrInsecureListenInProduction = errors.New("relay: --insecure-listen is set with PYRYCODE_RELAY_PRODUCTION=1; refusing to start") +``` + +The error message names both the flag and the env var so a developer reading the log line knows the two inputs to flip without consulting docs. Downstream code that wants to branch on this failure mode uses `errors.Is(err, ErrInsecureListenInProduction)`. + +### Helpers (exported) + +```go +// IsProductionMode reports whether the relay is in production mode. +// +// The contract is: PYRYCODE_RELAY_PRODUCTION="1" means production; any other +// value (including unset, "0", "true", "yes", "PRODUCTION", or whitespace) +// means non-production. Only the exact string "1" enables production mode. +// The strictness is intentional: anything fuzzier ("truthy" parsing) creates +// a class of subtle misconfigurations where an operator believes production +// mode is on but the relay disagrees. +// +// The env var is read on every call via getenv. Pass os.Getenv at call sites; +// tests inject a func to exercise the matrix without mutating process env. +func IsProductionMode(getenv func(string) string) bool { + return getenv(envProductionMode) == "1" +} + +// CheckInsecureListenInProduction returns ErrInsecureListenInProduction when +// production mode is on (per IsProductionMode) AND insecureListen is non-empty. +// Returns nil otherwise. Intended to be called from main after flag parse, +// before any listener is started. +// +// getenv is the env-var lookup function; pass os.Getenv at the call site, +// an injected func in tests. +func CheckInsecureListenInProduction(insecureListen string, getenv func(string) string) error { + if IsProductionMode(getenv) && insecureListen != "" { + return ErrInsecureListenInProduction + } + return nil +} +``` + +Three design decisions worth naming: + +1. **Injected-getter test seam, not `t.Setenv`.** The AC requires "the env var is read through a test seam (injected getter or equivalent), not by mutating process env." A `func(string) string` parameter is the smallest seam: no interface, no struct, no package-level mutable variable. `os.Getenv` satisfies the signature directly at the call site (`relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv)`). Tests pass a closure built from a `map[string]string`. Process env is never mutated, so the test is safe under `t.Parallel()` and under `go test -count=N -race`. +2. **Lazy read, not cached at init.** The AC says "read the env var lazily on each call." This matters for two reasons: (a) tests inject different values per case without re-importing the package; (b) sibling checks (#78 and future) call `IsProductionMode` at their own boot moment without taking a hidden dependency on initialisation order. Cost is one `os.Getenv` per boot-check call — negligible. +3. **Exact `"1"` match, not truthy parsing.** Stricter than `strconv.ParseBool` (`"t"`, `"T"`, `"TRUE"`, `"1"`, `"True"`, `"true"` all true). The strictness is the same one the precedent doc for `PYRYCODE_RELAY_SINGLE_INSTANCE` (#64) bakes in — an exact-string contract is harder to misread in a deploy manifest than a "what does this parser accept?" question. The Go doc comment on `IsProductionMode` documents this explicitly. + +### Wiring in `cmd/pyrycode-relay/main.go` + +Insert after the existing "either `--domain` or `--insecure-listen`" guard (line 43), before `startedAt := time.Now()` (line 45): + +```go +if err := relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv); err != nil { + logger.Error("refusing to start: production-mode misconfiguration", + "err", err, + "env_var", "PYRYCODE_RELAY_PRODUCTION", + "fix", "remove --insecure-listen and set --domain, or unset PYRYCODE_RELAY_PRODUCTION") + os.Exit(2) +} +``` + +Three details: + +- **Exit code 2**, matching the existing flag-validation guard immediately above (line 42). Exit 1 is reserved in this file for runtime failures (listener died, autocert failed); exit 2 is configuration-rejected-at-boot. Keeping the two codes distinct lets ops dashboards split "deploy never started" from "deploy started and crashed." +- **Structured log fields named in the AC** — `env_var` and `fix`. The `fix` field is the one-line remediation: it tells the operator the two flips that resolve the boot failure without making them read the source. The phrasing intentionally lists *both* exits (remove the flag OR unset the env var) because either is a valid resolution depending on which input was wrong. +- **Placement before `relay.NewRegistry()`** ensures the check runs before any per-boot side effects (goroutine spawn, port bind, file creation). The current code does not have side effects between flag-parse and `NewRegistry`, so this is purely defensive against future drift. + +### Why not a `Config` struct + +A natural extension is to bundle production-mode + insecure-listen + future flags into a `relay.Config` struct and have `Config.Validate()` return a multi-error. **Out of scope for this ticket.** Each sibling startup check (#78 = uid-0, future = others) will introduce one new exported `CheckXxxInProduction` function. When the count reaches ~3 and the wiring boilerplate in `main.go` becomes a real cost, a follow-up ticket consolidates them. Doing it now is premature abstraction. + +## Concurrency model + +None. All three helpers (`IsProductionMode`, `CheckInsecureListenInProduction`, the `main` wiring) run on the main goroutine before any listener is started. No locks, no channels, no goroutines. + +## Error handling + +The check returns a single sentinel — no wrapping, no formatting with caller-supplied fields. The error message is the same on every failure (`"relay: --insecure-listen is set with PYRYCODE_RELAY_PRODUCTION=1; refusing to start"`); the structured log fields in `main` provide the operator-facing remediation context. + +No error case requires retry, fallback, or partial-state recovery. Boot-time refusal is total: the process exits 2 immediately. + +## Testing strategy + +All tests live in `internal/relay/production_test.go`, are `t.Parallel()`-safe, and never call `os.Setenv` / `t.Setenv`. Helper for building a fake getter: + +```go +func fakeGetenv(env map[string]string) func(string) string { + return func(k string) string { return env[k] } +} +``` + +### Test 1 — `IsProductionMode` value matrix + +Table-driven over the env-var value space. Cases: + +| `PYRYCODE_RELAY_PRODUCTION` value | want `IsProductionMode` | +|---|---| +| `"1"` | `true` | +| `""` (unset) | `false` | +| `"0"` | `false` | +| `"true"` | `false` | +| `"yes"` | `false` | +| `" 1"` (leading space) | `false` | +| `"1 "` (trailing space) | `false` | +| `"PRODUCTION"` | `false` | + +The non-`"1"` rows are not over-coverage: they document the "exact match, not truthy" contract under attack. If a future refactor introduces `strconv.ParseBool`, the `"true"` / `"yes"` rows fail and the contract violation is caught. + +### Test 2 — `CheckInsecureListenInProduction` 2×2 matrix (the AC verbatim) + +| production-mode env | `insecureListen` | want | +|---|---|---| +| unset | `":8080"` | `nil` | +| `"1"` | `":8080"` | `errors.Is(err, ErrInsecureListenInProduction)` | +| `"1"` | `""` (autocert path; `--domain` is set elsewhere) | `nil` | +| unset | `""` | `nil` | + +The third row models the "production + `--domain` (autocert)" case — `insecureListen == ""` is the signal that autocert is in play; the check has no business inspecting `--domain` because the contract is purely about plaintext-in-prod. + +### Test 3 — sentinel is branchable + +A one-line test that `errors.Is(ErrInsecureListenInProduction, ErrInsecureListenInProduction)` is true, plus a check that the returned error from `CheckInsecureListenInProduction("...", getenv)` (production+insecure case) satisfies `errors.Is`. The point is to lock in the `errors.Is` contract so a future "let me return `fmt.Errorf("foo: %s", ...)` instead" refactor breaks the test, not downstream callers. + +### What is NOT tested + +- The `main.go` wiring. `main` is a coordination function; its behaviour is observable via the unit tests on the package it calls. Adding a fork-exec integration test for one `if err != nil { os.Exit(2) }` block is over-engineering. Code review of the diff is the gate. +- The structured log line's exact field names. The AC says "structured log line naming the env var and the fix"; the spec's wiring satisfies that literally. Asserting on slog output requires capturing the handler, which is friction for a contract that is operator-facing prose, not a downstream-machine-parsed format. + +## Open questions + +None. The env var name, the helper-name freedom, and the sentinel name are all settled by the ticket body. The injected-getter seam is the smallest design that satisfies "test seam without env mutation"; no other axis of choice is load-bearing. + +## Security review + +**Verdict:** PASS + +**Findings:** + +- [Trust boundaries] No findings — the only input crossing into the check is the `PYRYCODE_RELAY_PRODUCTION` env var, which is operator-controlled (not network-attacker-controlled). The check itself is a fail-closed gate on a known shape; there is no parser, no allocator, no path that data flows into beyond a string equality. +- [Tokens, secrets, credentials] N/A — the ticket creates no tokens, reads no secrets, and writes no credential material. The env var carries a configuration signal, not a credential. +- [File operations] N/A — no file is read, written, statted, or unlinked. No path concatenation. +- [Subprocess / external command execution] N/A — no `exec.Command`, no `os.StartProcess`. +- [Cryptographic primitives] N/A — no RNG, no hash, no comparison of attacker-controlled values. +- [Network & I/O] No findings — the check runs before any listener is opened. The whole purpose of the ticket is to *prevent* a listener from starting in a misconfigured state. The autocert path (#9) and the existing `http.Server` timeout configuration in `main.go:84-104` are unchanged. +- [Error messages, logs, telemetry] No findings — the sentinel error message contains the flag name and the env var name (both operator-facing constants, no user data); the `main` log line names the env var and a fix string (no PII, no token, no path). The `env_var` field is the *name* of the env var, not its value — even if the operator put a secret in `PYRYCODE_RELAY_PRODUCTION` (they shouldn't; the contract is `"1"`), it would not be logged. Worth flagging to the developer: do not add a `value` field that logs `os.Getenv("PYRYCODE_RELAY_PRODUCTION")`; the contract value is `"1"` but a confused operator might write anything there and we don't want it ending up in centralised logs. +- [Concurrency] N/A — the check runs on the main goroutine before any goroutine is spawned; no shared mutable state. +- [Threat model alignment] No findings — `pyrycode/pyrycode/docs/protocol-mobile.md` § Security model assumes TLS is in place for all production traffic. This ticket is the in-binary enforcement of that assumption when production-mode is explicitly tagged. The complement (a CI / deploy-manifest check that prod manifests set `PYRYCODE_RELAY_PRODUCTION=1`) is out of scope and tracked separately. + +One **SHOULD FIX** noted inline above (Error messages, logs, telemetry category): the developer must not extend the log line to include the env var *value*. The spec already says so explicitly; code-review must double-check. + +**Reviewer:** architect (self-review per `architect/security-review.md`) +**Date:** 2026-05-13 From 4338bea960b810546b4d47100f1eb9f8b2a6c3ac Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 08:57:30 +0300 Subject: [PATCH 2/3] feat(relay): refuse to boot with --insecure-listen in production (#77) Adds the PYRYCODE_RELAY_PRODUCTION=1 contract and a boot-time check that returns ErrInsecureListenInProduction when production mode is on AND --insecure-listen is set. Wired into cmd/pyrycode-relay/main.go after flag-parse with structured remediation logging and exit 2. The env var is read via an injected getter so tests cover the matrix without mutating process env. --- cmd/pyrycode-relay/main.go | 8 ++ internal/relay/production.go | 46 ++++++++++++ internal/relay/production_test.go | 119 ++++++++++++++++++++++++++++++ 3 files changed, 173 insertions(+) create mode 100644 internal/relay/production.go create mode 100644 internal/relay/production_test.go diff --git a/cmd/pyrycode-relay/main.go b/cmd/pyrycode-relay/main.go index 54cadf9..c9464cd 100644 --- a/cmd/pyrycode-relay/main.go +++ b/cmd/pyrycode-relay/main.go @@ -42,6 +42,14 @@ func main() { os.Exit(2) } + if err := relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv); err != nil { + logger.Error("refusing to start: production-mode misconfiguration", + "err", err, + "env_var", "PYRYCODE_RELAY_PRODUCTION", + "fix", "remove --insecure-listen and set --domain, or unset PYRYCODE_RELAY_PRODUCTION") + os.Exit(2) + } + startedAt := time.Now() reg := relay.NewRegistry() diff --git a/internal/relay/production.go b/internal/relay/production.go new file mode 100644 index 0000000..e23d37c --- /dev/null +++ b/internal/relay/production.go @@ -0,0 +1,46 @@ +package relay + +import "errors" + +// envProductionMode is the env var that signals production mode. The +// contract is "1" means on; anything else means off. Mirrors the shape +// of PYRYCODE_RELAY_SINGLE_INSTANCE so operators only have to remember +// one pattern. +const envProductionMode = "PYRYCODE_RELAY_PRODUCTION" + +// ErrInsecureListenInProduction is returned by CheckInsecureListenInProduction +// when the relay is configured for production mode (PYRYCODE_RELAY_PRODUCTION=1) +// AND the --insecure-listen flag is set. Serving plaintext traffic from a +// production-tagged process is a fail-fast misconfiguration, not a runtime +// degradation: the relay refuses to start so a dev manifest accidentally +// promoted to prod fails the deploy's health check rather than serving traffic. +var ErrInsecureListenInProduction = errors.New("relay: --insecure-listen is set with PYRYCODE_RELAY_PRODUCTION=1; refusing to start") + +// IsProductionMode reports whether the relay is in production mode. +// +// The contract is: PYRYCODE_RELAY_PRODUCTION="1" means production; any other +// value (including unset, "0", "true", "yes", "PRODUCTION", or whitespace) +// means non-production. Only the exact string "1" enables production mode. +// The strictness is intentional: anything fuzzier ("truthy" parsing) creates +// a class of subtle misconfigurations where an operator believes production +// mode is on but the relay disagrees. +// +// The env var is read on every call via getenv. Pass os.Getenv at call sites; +// tests inject a func to exercise the matrix without mutating process env. +func IsProductionMode(getenv func(string) string) bool { + return getenv(envProductionMode) == "1" +} + +// CheckInsecureListenInProduction returns ErrInsecureListenInProduction when +// production mode is on (per IsProductionMode) AND insecureListen is non-empty. +// Returns nil otherwise. Intended to be called from main after flag parse, +// before any listener is started. +// +// getenv is the env-var lookup function; pass os.Getenv at the call site, +// an injected func in tests. +func CheckInsecureListenInProduction(insecureListen string, getenv func(string) string) error { + if IsProductionMode(getenv) && insecureListen != "" { + return ErrInsecureListenInProduction + } + return nil +} diff --git a/internal/relay/production_test.go b/internal/relay/production_test.go new file mode 100644 index 0000000..b06388f --- /dev/null +++ b/internal/relay/production_test.go @@ -0,0 +1,119 @@ +package relay + +import ( + "errors" + "testing" +) + +func fakeGetenv(env map[string]string) func(string) string { + return func(k string) string { return env[k] } +} + +func TestIsProductionMode_ValueMatrix(t *testing.T) { + t.Parallel() + + cases := []struct { + name string + value string + set bool + want bool + }{ + {name: "exact 1 is production", value: "1", set: true, want: true}, + {name: "unset is non-production", set: false, want: false}, + {name: "empty string is non-production", value: "", set: true, want: false}, + {name: "zero is non-production", value: "0", set: true, want: false}, + {name: "true is non-production", value: "true", set: true, want: false}, + {name: "yes is non-production", value: "yes", set: true, want: false}, + {name: "leading space is non-production", value: " 1", set: true, want: false}, + {name: "trailing space is non-production", value: "1 ", set: true, want: false}, + {name: "uppercase PRODUCTION is non-production", value: "PRODUCTION", set: true, want: false}, + } + + for _, tc := range cases { + tc := tc + t.Run(tc.name, func(t *testing.T) { + t.Parallel() + env := map[string]string{} + if tc.set { + env[envProductionMode] = tc.value + } + if got := IsProductionMode(fakeGetenv(env)); got != tc.want { + t.Errorf("IsProductionMode(%q set=%v) = %v, want %v", tc.value, tc.set, got, tc.want) + } + }) + } +} + +func TestCheckInsecureListenInProduction_Matrix(t *testing.T) { + t.Parallel() + + cases := []struct { + name string + productionMode string // "" means unset + setEnv bool + insecureListen string + wantSentinel bool + }{ + { + name: "non-production + insecure-listen returns nil", + setEnv: false, + insecureListen: ":8080", + wantSentinel: false, + }, + { + name: "production + insecure-listen returns sentinel", + productionMode: "1", + setEnv: true, + insecureListen: ":8080", + wantSentinel: true, + }, + { + name: "production + autocert (insecureListen empty) returns nil", + productionMode: "1", + setEnv: true, + insecureListen: "", + wantSentinel: false, + }, + { + name: "non-production + autocert returns nil", + setEnv: false, + insecureListen: "", + wantSentinel: false, + }, + } + + for _, tc := range cases { + tc := tc + t.Run(tc.name, func(t *testing.T) { + t.Parallel() + env := map[string]string{} + if tc.setEnv { + env[envProductionMode] = tc.productionMode + } + err := CheckInsecureListenInProduction(tc.insecureListen, fakeGetenv(env)) + if tc.wantSentinel { + if !errors.Is(err, ErrInsecureListenInProduction) { + t.Errorf("got err %v, want errors.Is(err, ErrInsecureListenInProduction)", err) + } + } else { + if err != nil { + t.Errorf("got err %v, want nil", err) + } + } + }) + } +} + +func TestErrInsecureListenInProduction_IsBranchable(t *testing.T) { + t.Parallel() + + if !errors.Is(ErrInsecureListenInProduction, ErrInsecureListenInProduction) { + t.Fatal("ErrInsecureListenInProduction should be errors.Is itself") + } + + env := map[string]string{envProductionMode: "1"} + err := CheckInsecureListenInProduction(":8080", fakeGetenv(env)) + if !errors.Is(err, ErrInsecureListenInProduction) { + t.Errorf("returned error %v should satisfy errors.Is(err, ErrInsecureListenInProduction)", err) + } +} From 51a6486c526eb041391a7a79624e6767c349104f Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 09:01:47 +0300 Subject: [PATCH 3/3] docs: production-mode contract & insecure-listen startup refusal (#77) --- docs/knowledge/INDEX.md | 1 + docs/knowledge/codebase/77.md | 45 ++++++++++++++ docs/knowledge/features/production-mode.md | 68 ++++++++++++++++++++++ 3 files changed, 114 insertions(+) create mode 100644 docs/knowledge/codebase/77.md create mode 100644 docs/knowledge/features/production-mode.md diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index e0b90e9..e7e15f1 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Features +- [Production-mode contract & `--insecure-listen` startup refusal](features/production-mode.md) — `PYRYCODE_RELAY_PRODUCTION=1` env-var contract (exact-string match, lazy read via injected getter, mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` shape from #64/#65) plus the first boot-time check that consumes it: `relay.CheckInsecureListenInProduction` returns the exported `ErrInsecureListenInProduction` sentinel (branchable via `errors.Is`) when production mode is on AND `--insecure-listen` is set, wired into `cmd/pyrycode-relay/main.go` after flag-parse with `os.Exit(2)` (config-rejected-at-boot, distinct from runtime-failure exit 1) and structured log fields naming the env var (name only, never value) and a one-line `fix` listing both valid resolutions. `IsProductionMode` exported so sibling startup checks (#78 = uid-0) compose on the same predicate without re-reading the env var. Test seam is a `func(string) string` getter (smallest possible — no interface, no struct, no package-level var) so the 2×2 AC matrix and the value-space matrix run under `t.Parallel()` + `-race` without mutating process env (#77). - [Fly.io deploy](features/fly-deploy.md) — production host wiring: `fly.toml` declares TCP-passthrough on `:80`/`:443` (no Fly HTTP proxy, no Fly-managed certs) so TLS keeps terminating in the relay via autocert (#9), persistent Fly volume `relay_autocert` mounted at `/var/lib/relay/autocert`, and a single-machine hard cap encoded via `min_machines_running=1` + `auto_start_machines=false` + `auto_stop_machines="off"` + `[deploy] strategy="immediate"` (Fly Apps v2 has no `max_machines` key; the in-binary `PYRYCODE_RELAY_SINGLE_INSTANCE` self-check from #65 is the backstop). CI `deploy` job in `.github/workflows/ci.yml` runs `flyctl deploy --remote-only` on push to `main`, gated by branch-condition + `needs: [test, security, image-scan]` + `permissions: contents: read` so `FLY_API_TOKEN` is structurally unreachable from PR code; `superfly/flyctl-actions/setup-flyctl` pinned by commit SHA with `# Tracks:` comment (same convention as #68 / #41). Dedicated IPv4 is required (not optional) for autocert's HTTP-01 challenge; TCP passthrough preserves the real socket peer IP that #34's rate limiter reads. `__REGION__` / `__DOMAIN__` ship as placeholders that fail loud on first deploy (#38). - [Connection-count gauges](features/connection-count-gauges.md) — `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones` exposed via a pull-based `prometheus.Collector` reading `Registry.Counts()` on each scrape; zero edits to `registry.go`; scalar (no labels) by design — `{server="..."}` would carry the attacker-influenced `x-pyrycode-server` header onto the metrics surface, which threat-model § Log hygiene forbids; stale grace-expiry fires can't move the gauge because the pointer-identity guard (ADR-0006) keeps the maps unchanged and the gauge IS the map size; race-tested against 16 mutator goroutines + a tight-loop scraper under `-race`. First collector wired into the #59 seam (#61). - [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars) — first instantiated by #61's `connectionsCollector`. Listener still pending (#60). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59). diff --git a/docs/knowledge/codebase/77.md b/docs/knowledge/codebase/77.md new file mode 100644 index 0000000..2ed2ffa --- /dev/null +++ b/docs/knowledge/codebase/77.md @@ -0,0 +1,45 @@ +# Ticket #77 — refuse to boot with `--insecure-listen` in production mode + +Introduces the `PYRYCODE_RELAY_PRODUCTION=1` env-var contract and the first boot-time check that consumes it: the relay refuses to start (exit 2) when production mode is on AND `--insecure-listen` is set, so a dev manifest accidentally promoted to prod fails the deploy's health-check rather than serving plaintext traffic. Split from #42; sibling #78 (refuse-to-run-as-uid-0) consumes the same contract. + +## Implementation + +- **`internal/relay/production.go` (new, ~46 lines)** — three exported names plus one unexported package constant: + - `envProductionMode = "PYRYCODE_RELAY_PRODUCTION"` (unexported; reused by in-package siblings). + - `ErrInsecureListenInProduction` — sentinel; message names both flag and env var so log lines are self-documenting. + - `IsProductionMode(getenv func(string) string) bool` — exact-string `"1"` match. Strict by design (not `strconv.ParseBool`); rationale documented inline and matches the `PYRYCODE_RELAY_SINGLE_INSTANCE` precedent (#64/#65). + - `CheckInsecureListenInProduction(insecureListen string, getenv func(string) string) error` — composes `IsProductionMode` with the flag value; returns the sentinel only when both inputs trigger. +- **`internal/relay/production_test.go` (new, ~119 lines)** — three table-driven tests: + - `TestIsProductionMode_ValueMatrix` — 9 rows over the env-var value space, including `"true"`/`"yes"`/whitespace/`"PRODUCTION"`. The non-`"1"` rows are not over-coverage; they lock the strict-equality contract. A future refactor to `strconv.ParseBool` fails on the `"true"`/`"yes"` rows. + - `TestCheckInsecureListenInProduction_Matrix` — the AC's 2×2 verbatim. `insecureListen == ""` models the autocert path (the check has no business inspecting `--domain`). + - `TestErrInsecureListenInProduction_IsBranchable` — locks the `errors.Is` contract so a future `fmt.Errorf("foo: %s", ...)` refactor breaks the test, not downstream callers. + - All `t.Parallel()`-safe; `fakeGetenv(map[string]string)` closure replaces process-env mutation. +- **`cmd/pyrycode-relay/main.go` (+8 lines)** — call inserted between the existing flag-mutex guard (line 40-43) and `startedAt`/`NewRegistry` (line 53-54). Exit code 2 matches the sibling guard above; structured fields are `err`, `env_var` (name only, never value), and `fix` (one-line operator remediation listing both valid resolutions). + +## Acceptance criteria — verification map + +- AC-1 (env-var helper documented, exact-`"1"` semantics): `IsProductionMode` + Go doc comment on `production.go:19-32`. +- AC-2 (sentinel + check function, branchable via `errors.Is`): `ErrInsecureListenInProduction` + `CheckInsecureListenInProduction` on `production.go:11-46`. +- AC-3 (wired in `main.go` after flag-parse, before listener, structured log naming env var + fix): `main.go:45-51`. +- AC-4 (2×2 matrix unit tests, env read via injected getter): `production_test.go:47-105` plus the value-space matrix in `TestIsProductionMode_ValueMatrix`. + +## Patterns established + +- **Env-var contract shape: `PYRYCODE_RELAY_*`, `=1` means on, exact-string match, lazy read via injected getter.** Now applies to two startup-gating env vars (`PYRYCODE_RELAY_SINGLE_INSTANCE`, `PYRYCODE_RELAY_PRODUCTION`) — once is precedent, twice is the pattern. Future startup-gating env vars in this binary should follow the same shape; operators audit production gating with `grep PYRYCODE_RELAY_ deploy/`. +- **Injected `func(string) string` as the env-var test seam.** Smallest possible seam — no interface, no struct, no package-level var. `os.Getenv` satisfies the signature directly at the call site; tests pass a closure built from a map. Process env is never mutated, so tests are safe under `t.Parallel()` and `-race`. Use this shape for any future "behaviour depends on one env var" check; reach for an interface only when the dependency surface grows. +- **Boot-time refusal sentinel pair: `ErrXxx` + `CheckXxxInProduction`.** Each new production-only startup check follows this two-name shape and wires into `main.go` as `if err := relay.CheckXxx(...); err != nil { logger.Error(...); os.Exit(2) }`. When the count reaches ~3 and the wiring boilerplate becomes a real cost, a follow-up ticket consolidates them into `relay.Config` + `Config.Validate()` returning a multi-error. Do not pre-build that abstraction. +- **Exit-code split: 2 = config-rejected-at-boot, 1 = runtime failure.** Every boot-time refusal in this binary should exit 2; runtime failures (listener died, autocert failed) keep exit 1. Ops dashboards can split "deploy never started" from "deploy started and crashed" on the exit code alone. + +## Lessons learned + +- **Document the strict-`"1"` contract in the helper's Go doc, not just the spec.** The doc comment on `IsProductionMode` enumerates the values that look-like-on-but-are-not (`"true"`, `"yes"`, `"PRODUCTION"`, `" 1"`). Without this, a future maintainer would reasonably reach for `strconv.ParseBool` and silently widen the contract. The test matrix is the deterministic backstop; the prose is the advisory layer that prevents the change from being attempted in the first place. (Same belt-and-suspenders shape as the #64 + #65 doc/code split.) +- **Log the env-var name, never its value, in the failure log line.** The check's contract value is `"1"`, but a confused operator might write anything there (including a secret they thought was a flag). The structured-log `env_var` field intentionally carries `"PYRYCODE_RELAY_PRODUCTION"` — the literal name — not `os.Getenv("PYRYCODE_RELAY_PRODUCTION")`. The architect's security-review SHOULD-FIX called this out; carrying it forward as a project-wide rule for any future env-var-driven boot check. +- **Skip the ADR when the design follows precedent.** The shape (env-var contract, exact-`"1"` match, injected-getter test seam, sentinel-pair wiring) was already established by #9 (sentinel pattern) and #64/#65 (env-var contract). Filing an ADR for "we did the same thing again" would be process noise. ADRs document the *first* instance of a pattern; subsequent instances are codebase notes. + +## Cross-links + +- [Production-mode contract & `--insecure-listen` startup refusal](../features/production-mode.md) — feature doc covering the contract, API, wiring, and behaviour matrix. +- [`internal/relay/tls.go:15-19`](../../../internal/relay/tls.go) — `ErrCacheDirInsecure`, the canonical boot-time-refusal sentinel this ticket models. +- [Codebase note #64](64.md) / [#65 startup self-check](https://github.com/pyrycode/pyrycode-relay/issues/65) — env-var-shape precedent (`PYRYCODE_RELAY_SINGLE_INSTANCE`). +- [#78 — refuse to run as uid 0 in production](https://github.com/pyrycode/pyrycode-relay/issues/78) — sibling consumer of `IsProductionMode`; exercises the contract this ticket defined. +- [#42 — parent ticket](https://github.com/pyrycode/pyrycode-relay/issues/42) — split into #77 / #78. diff --git a/docs/knowledge/features/production-mode.md b/docs/knowledge/features/production-mode.md new file mode 100644 index 0000000..b63a371 --- /dev/null +++ b/docs/knowledge/features/production-mode.md @@ -0,0 +1,68 @@ +# Production-mode contract & `--insecure-listen` startup refusal + +Single-env-var production-mode signal for the relay, plus the first boot-time check that consumes it. Defined by #77; sibling startup checks (#78 = refuse-to-run-as-uid-0; future others) reuse the contract rather than re-reading the env var. + +## The contract + +- **Env var:** `PYRYCODE_RELAY_PRODUCTION` +- **On:** the exact string `"1"`. Anything else — unset, `"0"`, `"true"`, `"yes"`, `" 1"`, `"1 "`, `"PRODUCTION"` — means non-production. +- **Read lazily** on every call. No init-time cache. Tests inject a getter; production wires `os.Getenv`. + +Strict-equality (not `strconv.ParseBool` "truthy" parsing) is intentional and mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` (#64/#65). An exact-string contract is harder to misread in a deploy manifest than a "what does this parser accept?" question, and an operator who believes production mode is on but the relay disagrees is exactly the silent-misconfiguration class the contract exists to prevent. + +## API (`internal/relay/production.go`) + +- `IsProductionMode(getenv func(string) string) bool` — reports whether `getenv("PYRYCODE_RELAY_PRODUCTION") == "1"`. Exported so sibling checks (#78+) compose on the same predicate. +- `CheckInsecureListenInProduction(insecureListen string, getenv func(string) string) error` — returns `ErrInsecureListenInProduction` when production mode is on AND `insecureListen != ""`, nil otherwise. Intended to run after `flag.Parse()`, before any listener is started. +- `ErrInsecureListenInProduction` — exported sentinel, branchable via `errors.Is`. Message names both inputs (`"relay: --insecure-listen is set with PYRYCODE_RELAY_PRODUCTION=1; refusing to start"`) so a log line is self-documenting. +- `envProductionMode` (unexported) — the env-var-name constant. In-package siblings reuse it; out-of-package callers go through `IsProductionMode`. + +## Why this shape (test seam) + +The check takes a `func(string) string` rather than calling `os.Getenv` directly. The seam is the smallest design that satisfies the AC's "do not mutate process env" — no interface, no struct, no package-level mutable variable. `os.Getenv` satisfies the signature literally at the call site; tests build a closure over a `map[string]string`. Process env is never touched, so the tests are safe under `t.Parallel()` and `go test -race -count=N`. + +## Wiring (`cmd/pyrycode-relay/main.go`) + +After flag-parse and after the existing "either `--domain` or `--insecure-listen`" guard, before `relay.NewRegistry()`: + +```go +if err := relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv); err != nil { + logger.Error("refusing to start: production-mode misconfiguration", + "err", err, + "env_var", "PYRYCODE_RELAY_PRODUCTION", + "fix", "remove --insecure-listen and set --domain, or unset PYRYCODE_RELAY_PRODUCTION") + os.Exit(2) +} +``` + +- **Exit code 2** matches the flag-validation guard immediately above. Exit 1 is reserved for runtime failures in this binary (listener died, autocert failed); exit 2 = configuration-rejected-at-boot. Splitting the codes lets ops dashboards distinguish "deploy never started" from "deploy started and crashed." +- **`fix` field** lists both valid resolutions (remove the flag OR unset the env var). The operator picks whichever input was wrong. +- **`env_var` field carries the name, never the value.** Even if a confused operator put a secret in `PYRYCODE_RELAY_PRODUCTION`, it would not be logged. Do not extend this log line to include the env var value. + +## Behaviour matrix + +| `PYRYCODE_RELAY_PRODUCTION` | `--insecure-listen` | Result | +|---|---|---| +| unset / not `"1"` | any | nil (no check fires) | +| `"1"` | empty (autocert path) | nil | +| `"1"` | non-empty | `ErrInsecureListenInProduction` → `os.Exit(2)` | + +Autocert (`--domain`) is not inspected — the contract is purely about plaintext-in-prod. Setting `--domain` with `PYRYCODE_RELAY_PRODUCTION=1` is the happy path. + +## Threat model alignment + +`pyrycode/pyrycode/docs/protocol-mobile.md` § Security model assumes TLS for all production traffic. This is the in-binary enforcement of that assumption *when production mode is explicitly tagged*. The complement — a CI / deploy-manifest check that prod manifests actually set `PYRYCODE_RELAY_PRODUCTION=1` — is out of scope and the responsibility of the deploy layer. + +The check is fail-closed: if the env var is on AND plaintext is requested, the relay refuses to boot. There is no degradation path, no fallback, no retry. Boot-time refusal is total. + +## Out of scope (deferred) + +- **No `Config` struct.** A bundled `relay.Config` with `Validate()` returning a multi-error is a natural extension once 3+ startup checks exist. With one check today (this one) and one more queued (#78), the consolidation is premature. +- **No fork-exec integration test on `main.go`.** The `main` wiring is observable via package-level unit tests; one `if err != nil { os.Exit(2) }` block does not warrant a binary-spawning test. + +## Cross-links + +- ADR: none filed; the shape is precedent-following (mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE`), not a new architectural choice. +- [`internal/relay/tls.go`](../../../internal/relay/tls.go) — `ErrCacheDirInsecure` is the canonical boot-time-refusal sentinel this one models. +- [Single-instance constraint (v1)](../../architecture.md#single-instance-constraint-v1) — sibling env-var contract (`PYRYCODE_RELAY_SINGLE_INSTANCE`), shape precedent. +- [Codebase ticket note #77](../codebase/77.md) — per-ticket implementation detail.