Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 19 additions & 0 deletions cmd/pyrycode-relay/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
package main

import (
"errors"
"flag"
"fmt"
"log/slog"
Expand Down Expand Up @@ -42,6 +43,24 @@ func main() {
os.Exit(2)
}

// Boot-time env-var validation runs BEFORE CheckInsecureListenInProduction
// so that a typo like PYRYCODE_RELAY_PRODUCTION=true cannot slip through
// IsProductionMode's silent-non-production fallback and reach the
// insecure-listen guard with an unvalidated value. See
// docs/specs/architecture/80-validate-env-vars-at-boot.md § Wiring order.
if err := relay.CheckEnvConfig(os.LookupEnv); err != nil {
var cfgErr *relay.ErrInvalidConfig
if errors.As(err, &cfgErr) {
logger.Error("refusing to start: invalid env-var config",
"err", err,
"env_var", cfgErr.Key,
"reason", cfgErr.Reason)
} else {
logger.Error("refusing to start: invalid env-var config", "err", err)
}
os.Exit(2)
}

if err := relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv); err != nil {
logger.Error("refusing to start: production-mode misconfiguration",
"err", err,
Expand Down
1 change: 1 addition & 0 deletions docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [Env-var config validator (boot-time refusal)](features/env-config-validator.md) — table-driven validation of every env var the relay reads at boot. Single source of truth is the unexported `envContracts []envContract` registry in `internal/relay/env_config.go`; each row carries `name`, `required` bool, and an inline `validate func(string) error`. `CheckEnvConfig(lookup func(string) (string, bool)) error` walks the registry and returns the structured `*ErrInvalidConfig{Key, Reason}` on the first failure (`Reason` is `"missing"` or `"malformed-value: <err>"`); the package-level sentinel `ErrInvalidConfigSentinel` is matched via a custom `Is` method (not `Unwrap`, which would double-print the message prefix) so `errors.Is(err, ErrInvalidConfigSentinel)` and `errors.As(err, &cfgErr)` form a dual contract. The `func(string) (string, bool)` (= `os.LookupEnv` shape) getter coexists with #77's `func(string) string` getter — the presence bit is necessary here to distinguish "missing-but-required" from "present-but-empty", semantically inert for `IsProductionMode`'s exact-`"1"` match. **Ordering is load-bearing**: wired in `main.go` BEFORE `CheckInsecureListenInProduction` so a typo like `PYRYCODE_RELAY_PRODUCTION=true` cannot slip through `IsProductionMode`'s silent-non-production fallback and reach the insecure-listen guard with an unvalidated value. Today's registry has one row (`PYRYCODE_RELAY_PRODUCTION`, optional-but-format-validated); future env-var reads register here at code-review time. `checkEnvConfigWith(lookup, contracts)` is the parameterised inner used by the `required: true` test case (today's production table has no required entries). Exit 2 = config-rejected-at-boot, matching the sibling refusals (#9, #77, #79) (#80).
- [Linux capability allowlist (boot-time refusal)](features/capability-allowlist.md) — relay parses `/proc/self/status`'s `CapEff:` hex mask at boot and refuses to start (exit 2) if any bit is set outside `AllowedCapabilities` (currently `{CAP_NET_BIND_SERVICE}` only, motivated by autocert binding `:80`/`:443` from uid 65532 in the distroless image). Exported sentinel `ErrUnexpectedCapability` is branchable via `errors.Is`; the wrapped error names every offending bit symbolically (`CAP_SYS_ADMIN (bit 21)` or `bit 63` for unknown), lists the allowlist contents, and embeds the operator fix string. `CapEff` only — `CapPrm/CapBnd/CapInh` would broaden false-positives (legitimate K8s default policy grants wide CapBnd) without adding load-bearing protection (relay never `capset(2)`s). Linux/non-Linux split at compile time via the new `_<goos>.go` / `_other.go` build-tag convention (see ADR-0009); non-Linux GOOS logs one skip line and returns nil. Unconditional — no production-mode gating, no env-var bypass, because stray capabilities are never legitimate. Reader-boundary test seam (`func() (string, error)`) exercises the parse + mask check end-to-end without touching real `/proc`. Joins the boot-time-refusal sentinel family (#9, #77, #79; future #78) (#79).
- [Production-mode contract & `--insecure-listen` startup refusal](features/production-mode.md) — `PYRYCODE_RELAY_PRODUCTION=1` env-var contract (exact-string match, lazy read via injected getter, mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` shape from #64/#65) plus the first boot-time check that consumes it: `relay.CheckInsecureListenInProduction` returns the exported `ErrInsecureListenInProduction` sentinel (branchable via `errors.Is`) when production mode is on AND `--insecure-listen` is set, wired into `cmd/pyrycode-relay/main.go` after flag-parse with `os.Exit(2)` (config-rejected-at-boot, distinct from runtime-failure exit 1) and structured log fields naming the env var (name only, never value) and a one-line `fix` listing both valid resolutions. `IsProductionMode` exported so sibling startup checks (#78 = uid-0) compose on the same predicate without re-reading the env var. Test seam is a `func(string) string` getter (smallest possible — no interface, no struct, no package-level var) so the 2×2 AC matrix and the value-space matrix run under `t.Parallel()` + `-race` without mutating process env (#77).
- [Fly.io deploy](features/fly-deploy.md) — production host wiring: `fly.toml` declares TCP-passthrough on `:80`/`:443` (no Fly HTTP proxy, no Fly-managed certs) so TLS keeps terminating in the relay via autocert (#9), persistent Fly volume `relay_autocert` mounted at `/var/lib/relay/autocert`, and a single-machine hard cap encoded via `min_machines_running=1` + `auto_start_machines=false` + `auto_stop_machines="off"` + `[deploy] strategy="immediate"` (Fly Apps v2 has no `max_machines` key; the in-binary `PYRYCODE_RELAY_SINGLE_INSTANCE` self-check from #65 is the backstop). CI `deploy` job in `.github/workflows/ci.yml` runs `flyctl deploy --remote-only` on push to `main`, gated by branch-condition + `needs: [test, security, image-scan]` + `permissions: contents: read` so `FLY_API_TOKEN` is structurally unreachable from PR code; `superfly/flyctl-actions/setup-flyctl` pinned by commit SHA with `# Tracks:` comment (same convention as #68 / #41). Dedicated IPv4 is required (not optional) for autocert's HTTP-01 challenge; TCP passthrough preserves the real socket peer IP that #34's rate limiter reads. `__REGION__` / `__DOMAIN__` ship as placeholders that fail loud on first deploy (#38).
Expand Down
52 changes: 52 additions & 0 deletions docs/knowledge/codebase/80.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# Ticket #80 — validate required env vars at boot, fail-fast on missing/malformed

Adds a boot-time, table-driven validator over an explicit contract registry of env vars the relay reads. A misconfigured deploy manifest with a typo like `PYRYCODE_RELAY_PRODUCTION=true` now fails the deploy's health check at boot (`os.Exit(2)`) with a structured log line naming the key and shape violation, rather than silently being treated as "not production" and reaching `CheckInsecureListenInProduction` (#77) with an unvalidated value. Split from #42; sibling of #77 (`--insecure-listen` guard) and #79 (Linux capability allowlist).

## Implementation

- **`internal/relay/env_config.go` (new, 92 lines)** — four exported names plus the unexported registry:
- `ErrInvalidConfigSentinel` — package-level sentinel; matched by `*ErrInvalidConfig.Is`.
- `ErrInvalidConfig{Key, Reason string}` — structured error. The AC's literal type name; `Sentinel` suffix on the variable disambiguates the two without departing from the AC. `Is(target)` method matches the sentinel without `Unwrap` (which would double-print the message prefix).
- `CheckEnvConfig(lookup func(string) (string, bool)) error` — entry point; walks `envContracts` and returns the first failure. Calls `checkEnvConfigWith(lookup, envContracts)`.
- `checkEnvConfigWith(lookup, contracts)` (unexported) — parameterised version so tests exercise the `required: true` branch with a synthetic table.
- `envContract{name, required, validate}` + `envContracts []envContract` — single-source-of-truth registry. Today one row: `PYRYCODE_RELAY_PRODUCTION` with `required: false` and an inline `func(v string) error` that returns nil only on exact-`"1"` match.
- **`internal/relay/env_config_test.go` (new, 138 lines)** — five `t.Parallel()`-safe tests; never mutate process env:
- `TestCheckEnvConfig_ValidEnvReturnsNil` — two rows (unset, exact `"1"`).
- `TestCheckEnvConfig_MalformedValueReturnsStructuredError` — seven rows (`"true"`, `"0"`, `"yes"`, `" 1"`, `"1 "`, `"PRODUCTION"`, `""`). Asserts `errors.Is`, `errors.As`, `Key`, and `Reason` prefix + substring. Mirrors the `IsProductionMode` value matrix from #77 so a future refactor to `strconv.ParseBool` fails on both tests at once.
- `TestCheckEnvConfig_MissingRequiredKey` — uses `checkEnvConfigWith` with a synthetic `required: true` row (today's `envContracts` has none).
- `TestCheckEnvConfig_IgnoresUnregisteredKeys` — locks in "polices declared contract, not arbitrary env".
- `TestErrInvalidConfigSentinel_IsBranchable` — `errors.Is` + `errors.As` dual contract.
- **`cmd/pyrycode-relay/main.go` (+19 lines, +1 import `errors`)** — call inserted between the flag-mutex guard (line 41-44) and the existing `CheckInsecureListenInProduction` (line 64). `errors.As` extraction at the log site populates `env_var` and `reason` structured fields; the no-`As` fallback branch is defensive (unreachable under the current design).

## Acceptance criteria — verification map

- **AC-1** (exported `ErrInvalidConfig` exposes Key + Reason; `errors.As` for fields, `errors.Is` against sentinel): `env_config.go:13-36`, locked in by `TestErrInvalidConfigSentinel_IsBranchable`.
- **AC-2** (check function iterates an explicit in-package contract table, returns `ErrInvalidConfig` on first failure): `env_config.go:74-92`.
- **AC-3** (contract table covers every env var read at merge time): `envContracts` on `env_config.go:52-63`. Grep verification: `grep -rn "os\.Getenv\|os\.LookupEnv" cmd/ internal/` returns `PYRYCODE_RELAY_PRODUCTION` only (via the `IsProductionMode` getter argument). One row, complete.
- **AC-4** (wired in `main.go` after flag parse, before listener, fail-fast with structured log naming key + reason): `main.go:46-62`. Runs before `CheckInsecureListenInProduction` and `CheckCapabilities`.
- **AC-5** (unit tests for valid / missing-required / malformed / unregistered; test seam not process env): `env_config_test.go:16-138`.

## Patterns established

- **Contract-table-as-registry shape.** A package-level `[]envContract` slice where each row carries the env-var name, a `required` bool, and an inline shape `validate`. The registry is read-only at runtime, append-only at code-review time. Use this shape any time a binary reads >1 env var; for one var, the #77 single-helper-per-var shape is still fine. The natural next step (a typed `Config` struct populated from the registry) is deferred until env-var count + downstream-consumer pressure motivates it.
- **`os.LookupEnv`-shape seam for presence-aware checks.** Distinguishing "missing-but-required" from "present-but-empty" requires the presence bit. The `func(string) (string, bool)` shape carries it. Coexists with #77's `func(string) string` shape — pick the seam by whether the check needs the presence bit, not by repo-wide consistency. Mixing the two intentionally is correct.
- **Sentinel + structured error pair, dual-contract.** `ErrXxxSentinel` (the matchable identity) + `ErrXxx struct` (the field-carrier) with an `Is(target) bool` method on the struct. `errors.Is(err, ErrXxxSentinel)` matches any instance regardless of fields; `errors.As(err, &xxxErr)` extracts the fields. The `Is`-method-not-`Unwrap` choice avoids double-printing the sentinel's message prefix in the wrapped form. Apply for any future field-carrying boot-time-refusal error.
- **`checkXxxWith(lookup, table)` test seam for table-driven checks.** When the production table is a package-level var, expose an unexported parameterised variant so tests inject a synthetic table without mutating package state. Same shape as `caps_test.go`'s reader-boundary seam (#79).
- **Boot-time-refusal sentinel family grows.** Now five: `ErrCacheDirInsecure` (#9), `ErrInsecureListenInProduction` (#77), `ErrUnexpectedCapability` (#79), `ErrInvalidConfigSentinel` (#80), future `#78`. Same wiring shape: `if err := relay.CheckXxx(...); err != nil { logger.Error(...); os.Exit(2) }`. The boilerplate is starting to repeat; a follow-up ticket may consolidate into `relay.Config.Validate()` returning a multi-error once #78 lands and the count is six — not before.

## Lessons learned

- **The validator runs BEFORE the consumer of the value.** This is the architectural decision of the ticket. `CheckInsecureListenInProduction` reads `PYRYCODE_RELAY_PRODUCTION` via `IsProductionMode`, which silently accepts every non-`"1"` value as "not production". If `CheckEnvConfig` runs *after* `CheckInsecureListenInProduction`, the typo `PYRYCODE_RELAY_PRODUCTION=true` causes the insecure-listen guard to pass (because `IsProductionMode` returned false), the plaintext listener is allowed to boot, and only then does `CheckEnvConfig` fire — but by that time the dangerous code path has already been validated as safe. The ordering is load-bearing; the wiring comment in `main.go:46-50` documents it; code review enforces it. General rule for boot-time validators: validate the input *before* any consumer reads it, otherwise the consumer's failure mode is the very thing the validator exists to prevent.
- **Departing from Go naming convention is sometimes the right call.** Go convention is `XxxError` for struct types and `ErrXxx` for sentinel variables (cf. `os.PathError` vs `os.ErrNotExist`). The AC explicitly named the struct `ErrInvalidConfig`. Renaming it to `InvalidConfigError` would have been more idiomatic but would have caused grep-confusion between ticket discussion and code for every future reader. Resolution: keep the struct as `ErrInvalidConfig`, name the sentinel `ErrInvalidConfigSentinel` to disambiguate. Naming conflicts are best resolved by adding a suffix to the less-load-bearing name, not by renaming the AC-fixed one. (The deviation cost is one extra word in the sentinel; the alternative cost is permanent ticket/code drift.)
- **First-failure-only is sufficient for boot-time refusal.** A multi-error variant (collect all failures, surface together) is a natural design instinct but adds no value for the current threat model: the operator fixes one misconfiguration, redeploys, sees the next failure, fixes it. The fail-fast loop is "fix, redeploy, fix, redeploy" — the multi-error path would be "fix all, redeploy", which saves at most a few redeploy cycles per onboarding. Defer the multi-error consolidation until env-var count grows enough that the redeploy cost is real (per #77's "do not pre-build that abstraction" guidance).
- **`errors.Is` method instead of `Unwrap` for sentinel pairs.** If `*ErrInvalidConfig.Unwrap()` returned `ErrInvalidConfigSentinel`, then `fmt.Errorf("…: %w", cfgErr)` would print `…: relay: invalid env-var config: KEY: REASON: relay: invalid env-var config` — the sentinel's message appears twice. The custom `Is(target) bool` method gives the same `errors.Is` semantics without the message duplication. Use this shape any time a struct error needs to match a sentinel without being a wrapper.
- **Echoing raw env-var values in error messages is safe today, but a forward-looking trap.** The `PYRYCODE_RELAY_PRODUCTION` validator's reason includes `got %q` because the legitimate value is the literal `"1"` and any other operator-supplied bytes are misconfiguration data, not secret data. For future env vars that could carry tokens, keys, or other secrets, the validator's reason string must describe the shape violation in operator-facing terms without echoing the raw value (the architect security-review noted this; codifying it here for future entries). The structured log line in `main.go` already keeps the raw value out via the `env_var` (name only) and `reason` (validator's prose) fields — do not add a `value` field.

## Cross-links

- [Env-var config validator (feature)](../features/env-config-validator.md) — operator/contributor-facing doc covering contract, API, wiring, and "how to add a new env var".
- [Production-mode contract & `--insecure-listen` startup refusal](../features/production-mode.md) — #77, the first env-var contract; `IsProductionMode` is the silent-non-production consumer this ticket defends.
- [Codebase note #77](77.md) — env-var contract precedent (`PYRYCODE_RELAY_PRODUCTION`); same single-helper shape that scaled into this ticket's registry.
- [Codebase note #79](79.md) — sibling boot-time refusal (Linux capability allowlist); same wiring shape, runs after env-config validation.
- [Architecture spec](../../specs/architecture/80-env-config-validator.md) — the design that landed.
- [#42 — parent ticket](https://github.com/pyrycode/pyrycode-relay/issues/42) — split into #77 / #78 / #79 / #80.
Loading