Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions cmd/pyrycode-relay/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,14 @@ func main() {
os.Exit(2)
}

if err := relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv); err != nil {
logger.Error("refusing to start: production-mode misconfiguration",
"err", err,
"env_var", "PYRYCODE_RELAY_PRODUCTION",
"fix", "remove --insecure-listen and set --domain, or unset PYRYCODE_RELAY_PRODUCTION")
os.Exit(2)
}

startedAt := time.Now()
reg := relay.NewRegistry()

Expand Down
1 change: 1 addition & 0 deletions docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [Production-mode contract & `--insecure-listen` startup refusal](features/production-mode.md) — `PYRYCODE_RELAY_PRODUCTION=1` env-var contract (exact-string match, lazy read via injected getter, mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` shape from #64/#65) plus the first boot-time check that consumes it: `relay.CheckInsecureListenInProduction` returns the exported `ErrInsecureListenInProduction` sentinel (branchable via `errors.Is`) when production mode is on AND `--insecure-listen` is set, wired into `cmd/pyrycode-relay/main.go` after flag-parse with `os.Exit(2)` (config-rejected-at-boot, distinct from runtime-failure exit 1) and structured log fields naming the env var (name only, never value) and a one-line `fix` listing both valid resolutions. `IsProductionMode` exported so sibling startup checks (#78 = uid-0) compose on the same predicate without re-reading the env var. Test seam is a `func(string) string` getter (smallest possible — no interface, no struct, no package-level var) so the 2×2 AC matrix and the value-space matrix run under `t.Parallel()` + `-race` without mutating process env (#77).
- [Fly.io deploy](features/fly-deploy.md) — production host wiring: `fly.toml` declares TCP-passthrough on `:80`/`:443` (no Fly HTTP proxy, no Fly-managed certs) so TLS keeps terminating in the relay via autocert (#9), persistent Fly volume `relay_autocert` mounted at `/var/lib/relay/autocert`, and a single-machine hard cap encoded via `min_machines_running=1` + `auto_start_machines=false` + `auto_stop_machines="off"` + `[deploy] strategy="immediate"` (Fly Apps v2 has no `max_machines` key; the in-binary `PYRYCODE_RELAY_SINGLE_INSTANCE` self-check from #65 is the backstop). CI `deploy` job in `.github/workflows/ci.yml` runs `flyctl deploy --remote-only` on push to `main`, gated by branch-condition + `needs: [test, security, image-scan]` + `permissions: contents: read` so `FLY_API_TOKEN` is structurally unreachable from PR code; `superfly/flyctl-actions/setup-flyctl` pinned by commit SHA with `# Tracks:` comment (same convention as #68 / #41). Dedicated IPv4 is required (not optional) for autocert's HTTP-01 challenge; TCP passthrough preserves the real socket peer IP that #34's rate limiter reads. `__REGION__` / `__DOMAIN__` ship as placeholders that fail loud on first deploy (#38).
- [Connection-count gauges](features/connection-count-gauges.md) — `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones` exposed via a pull-based `prometheus.Collector` reading `Registry.Counts()` on each scrape; zero edits to `registry.go`; scalar (no labels) by design — `{server="..."}` would carry the attacker-influenced `x-pyrycode-server` header onto the metrics surface, which threat-model § Log hygiene forbids; stale grace-expiry fires can't move the gauge because the pointer-identity guard (ADR-0006) keeps the maps unchanged and the gauge IS the map size; race-tested against 16 mutator goroutines + a tight-loop scraper under `-race`. First collector wired into the #59 seam (#61).
- [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars) — first instantiated by #61's `connectionsCollector`. Listener still pending (#60). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59).
Expand Down
45 changes: 45 additions & 0 deletions docs/knowledge/codebase/77.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
# Ticket #77 — refuse to boot with `--insecure-listen` in production mode

Introduces the `PYRYCODE_RELAY_PRODUCTION=1` env-var contract and the first boot-time check that consumes it: the relay refuses to start (exit 2) when production mode is on AND `--insecure-listen` is set, so a dev manifest accidentally promoted to prod fails the deploy's health-check rather than serving plaintext traffic. Split from #42; sibling #78 (refuse-to-run-as-uid-0) consumes the same contract.

## Implementation

- **`internal/relay/production.go` (new, ~46 lines)** — three exported names plus one unexported package constant:
- `envProductionMode = "PYRYCODE_RELAY_PRODUCTION"` (unexported; reused by in-package siblings).
- `ErrInsecureListenInProduction` — sentinel; message names both flag and env var so log lines are self-documenting.
- `IsProductionMode(getenv func(string) string) bool` — exact-string `"1"` match. Strict by design (not `strconv.ParseBool`); rationale documented inline and matches the `PYRYCODE_RELAY_SINGLE_INSTANCE` precedent (#64/#65).
- `CheckInsecureListenInProduction(insecureListen string, getenv func(string) string) error` — composes `IsProductionMode` with the flag value; returns the sentinel only when both inputs trigger.
- **`internal/relay/production_test.go` (new, ~119 lines)** — three table-driven tests:
- `TestIsProductionMode_ValueMatrix` — 9 rows over the env-var value space, including `"true"`/`"yes"`/whitespace/`"PRODUCTION"`. The non-`"1"` rows are not over-coverage; they lock the strict-equality contract. A future refactor to `strconv.ParseBool` fails on the `"true"`/`"yes"` rows.
- `TestCheckInsecureListenInProduction_Matrix` — the AC's 2×2 verbatim. `insecureListen == ""` models the autocert path (the check has no business inspecting `--domain`).
- `TestErrInsecureListenInProduction_IsBranchable` — locks the `errors.Is` contract so a future `fmt.Errorf("foo: %s", ...)` refactor breaks the test, not downstream callers.
- All `t.Parallel()`-safe; `fakeGetenv(map[string]string)` closure replaces process-env mutation.
- **`cmd/pyrycode-relay/main.go` (+8 lines)** — call inserted between the existing flag-mutex guard (line 40-43) and `startedAt`/`NewRegistry` (line 53-54). Exit code 2 matches the sibling guard above; structured fields are `err`, `env_var` (name only, never value), and `fix` (one-line operator remediation listing both valid resolutions).

## Acceptance criteria — verification map

- AC-1 (env-var helper documented, exact-`"1"` semantics): `IsProductionMode` + Go doc comment on `production.go:19-32`.
- AC-2 (sentinel + check function, branchable via `errors.Is`): `ErrInsecureListenInProduction` + `CheckInsecureListenInProduction` on `production.go:11-46`.
- AC-3 (wired in `main.go` after flag-parse, before listener, structured log naming env var + fix): `main.go:45-51`.
- AC-4 (2×2 matrix unit tests, env read via injected getter): `production_test.go:47-105` plus the value-space matrix in `TestIsProductionMode_ValueMatrix`.

## Patterns established

- **Env-var contract shape: `PYRYCODE_RELAY_*`, `=1` means on, exact-string match, lazy read via injected getter.** Now applies to two startup-gating env vars (`PYRYCODE_RELAY_SINGLE_INSTANCE`, `PYRYCODE_RELAY_PRODUCTION`) — once is precedent, twice is the pattern. Future startup-gating env vars in this binary should follow the same shape; operators audit production gating with `grep PYRYCODE_RELAY_ deploy/`.
- **Injected `func(string) string` as the env-var test seam.** Smallest possible seam — no interface, no struct, no package-level var. `os.Getenv` satisfies the signature directly at the call site; tests pass a closure built from a map. Process env is never mutated, so tests are safe under `t.Parallel()` and `-race`. Use this shape for any future "behaviour depends on one env var" check; reach for an interface only when the dependency surface grows.
- **Boot-time refusal sentinel pair: `ErrXxx` + `CheckXxxInProduction`.** Each new production-only startup check follows this two-name shape and wires into `main.go` as `if err := relay.CheckXxx(...); err != nil { logger.Error(...); os.Exit(2) }`. When the count reaches ~3 and the wiring boilerplate becomes a real cost, a follow-up ticket consolidates them into `relay.Config` + `Config.Validate()` returning a multi-error. Do not pre-build that abstraction.
- **Exit-code split: 2 = config-rejected-at-boot, 1 = runtime failure.** Every boot-time refusal in this binary should exit 2; runtime failures (listener died, autocert failed) keep exit 1. Ops dashboards can split "deploy never started" from "deploy started and crashed" on the exit code alone.

## Lessons learned

- **Document the strict-`"1"` contract in the helper's Go doc, not just the spec.** The doc comment on `IsProductionMode` enumerates the values that look-like-on-but-are-not (`"true"`, `"yes"`, `"PRODUCTION"`, `" 1"`). Without this, a future maintainer would reasonably reach for `strconv.ParseBool` and silently widen the contract. The test matrix is the deterministic backstop; the prose is the advisory layer that prevents the change from being attempted in the first place. (Same belt-and-suspenders shape as the #64 + #65 doc/code split.)
- **Log the env-var name, never its value, in the failure log line.** The check's contract value is `"1"`, but a confused operator might write anything there (including a secret they thought was a flag). The structured-log `env_var` field intentionally carries `"PYRYCODE_RELAY_PRODUCTION"` — the literal name — not `os.Getenv("PYRYCODE_RELAY_PRODUCTION")`. The architect's security-review SHOULD-FIX called this out; carrying it forward as a project-wide rule for any future env-var-driven boot check.
- **Skip the ADR when the design follows precedent.** The shape (env-var contract, exact-`"1"` match, injected-getter test seam, sentinel-pair wiring) was already established by #9 (sentinel pattern) and #64/#65 (env-var contract). Filing an ADR for "we did the same thing again" would be process noise. ADRs document the *first* instance of a pattern; subsequent instances are codebase notes.

## Cross-links

- [Production-mode contract & `--insecure-listen` startup refusal](../features/production-mode.md) — feature doc covering the contract, API, wiring, and behaviour matrix.
- [`internal/relay/tls.go:15-19`](../../../internal/relay/tls.go) — `ErrCacheDirInsecure`, the canonical boot-time-refusal sentinel this ticket models.
- [Codebase note #64](64.md) / [#65 startup self-check](https://github.com/pyrycode/pyrycode-relay/issues/65) — env-var-shape precedent (`PYRYCODE_RELAY_SINGLE_INSTANCE`).
- [#78 — refuse to run as uid 0 in production](https://github.com/pyrycode/pyrycode-relay/issues/78) — sibling consumer of `IsProductionMode`; exercises the contract this ticket defined.
- [#42 — parent ticket](https://github.com/pyrycode/pyrycode-relay/issues/42) — split into #77 / #78.
68 changes: 68 additions & 0 deletions docs/knowledge/features/production-mode.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
# Production-mode contract & `--insecure-listen` startup refusal

Single-env-var production-mode signal for the relay, plus the first boot-time check that consumes it. Defined by #77; sibling startup checks (#78 = refuse-to-run-as-uid-0; future others) reuse the contract rather than re-reading the env var.

## The contract

- **Env var:** `PYRYCODE_RELAY_PRODUCTION`
- **On:** the exact string `"1"`. Anything else — unset, `"0"`, `"true"`, `"yes"`, `" 1"`, `"1 "`, `"PRODUCTION"` — means non-production.
- **Read lazily** on every call. No init-time cache. Tests inject a getter; production wires `os.Getenv`.

Strict-equality (not `strconv.ParseBool` "truthy" parsing) is intentional and mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` (#64/#65). An exact-string contract is harder to misread in a deploy manifest than a "what does this parser accept?" question, and an operator who believes production mode is on but the relay disagrees is exactly the silent-misconfiguration class the contract exists to prevent.

## API (`internal/relay/production.go`)

- `IsProductionMode(getenv func(string) string) bool` — reports whether `getenv("PYRYCODE_RELAY_PRODUCTION") == "1"`. Exported so sibling checks (#78+) compose on the same predicate.
- `CheckInsecureListenInProduction(insecureListen string, getenv func(string) string) error` — returns `ErrInsecureListenInProduction` when production mode is on AND `insecureListen != ""`, nil otherwise. Intended to run after `flag.Parse()`, before any listener is started.
- `ErrInsecureListenInProduction` — exported sentinel, branchable via `errors.Is`. Message names both inputs (`"relay: --insecure-listen is set with PYRYCODE_RELAY_PRODUCTION=1; refusing to start"`) so a log line is self-documenting.
- `envProductionMode` (unexported) — the env-var-name constant. In-package siblings reuse it; out-of-package callers go through `IsProductionMode`.

## Why this shape (test seam)

The check takes a `func(string) string` rather than calling `os.Getenv` directly. The seam is the smallest design that satisfies the AC's "do not mutate process env" — no interface, no struct, no package-level mutable variable. `os.Getenv` satisfies the signature literally at the call site; tests build a closure over a `map[string]string`. Process env is never touched, so the tests are safe under `t.Parallel()` and `go test -race -count=N`.

## Wiring (`cmd/pyrycode-relay/main.go`)

After flag-parse and after the existing "either `--domain` or `--insecure-listen`" guard, before `relay.NewRegistry()`:

```go
if err := relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv); err != nil {
logger.Error("refusing to start: production-mode misconfiguration",
"err", err,
"env_var", "PYRYCODE_RELAY_PRODUCTION",
"fix", "remove --insecure-listen and set --domain, or unset PYRYCODE_RELAY_PRODUCTION")
os.Exit(2)
}
```

- **Exit code 2** matches the flag-validation guard immediately above. Exit 1 is reserved for runtime failures in this binary (listener died, autocert failed); exit 2 = configuration-rejected-at-boot. Splitting the codes lets ops dashboards distinguish "deploy never started" from "deploy started and crashed."
- **`fix` field** lists both valid resolutions (remove the flag OR unset the env var). The operator picks whichever input was wrong.
- **`env_var` field carries the name, never the value.** Even if a confused operator put a secret in `PYRYCODE_RELAY_PRODUCTION`, it would not be logged. Do not extend this log line to include the env var value.

## Behaviour matrix

| `PYRYCODE_RELAY_PRODUCTION` | `--insecure-listen` | Result |
|---|---|---|
| unset / not `"1"` | any | nil (no check fires) |
| `"1"` | empty (autocert path) | nil |
| `"1"` | non-empty | `ErrInsecureListenInProduction` → `os.Exit(2)` |

Autocert (`--domain`) is not inspected — the contract is purely about plaintext-in-prod. Setting `--domain` with `PYRYCODE_RELAY_PRODUCTION=1` is the happy path.

## Threat model alignment

`pyrycode/pyrycode/docs/protocol-mobile.md` § Security model assumes TLS for all production traffic. This is the in-binary enforcement of that assumption *when production mode is explicitly tagged*. The complement — a CI / deploy-manifest check that prod manifests actually set `PYRYCODE_RELAY_PRODUCTION=1` — is out of scope and the responsibility of the deploy layer.

The check is fail-closed: if the env var is on AND plaintext is requested, the relay refuses to boot. There is no degradation path, no fallback, no retry. Boot-time refusal is total.

## Out of scope (deferred)

- **No `Config` struct.** A bundled `relay.Config` with `Validate()` returning a multi-error is a natural extension once 3+ startup checks exist. With one check today (this one) and one more queued (#78), the consolidation is premature.
- **No fork-exec integration test on `main.go`.** The `main` wiring is observable via package-level unit tests; one `if err != nil { os.Exit(2) }` block does not warrant a binary-spawning test.

## Cross-links

- ADR: none filed; the shape is precedent-following (mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE`), not a new architectural choice.
- [`internal/relay/tls.go`](../../../internal/relay/tls.go) — `ErrCacheDirInsecure` is the canonical boot-time-refusal sentinel this one models.
- [Single-instance constraint (v1)](../../architecture.md#single-instance-constraint-v1) — sibling env-var contract (`PYRYCODE_RELAY_SINGLE_INSTANCE`), shape precedent.
- [Codebase ticket note #77](../codebase/77.md) — per-ticket implementation detail.
Loading