Skip to content

relay: startup security posture self-check — refuse to boot in misconfigured environments #42

@ilmoniemi

Description

@ilmoniemi

Why

Dockerfile + CI scans + threat model are upstream guards (stochastic — depend on every contributor remembering them). A runtime startup self-check is the deterministic backstop that fires every single boot, regardless of how the deploy pipeline was assembled. Belt-and-suspenders per [[PROJECT-MEMORY#Behavioral / Instruction-Design]].

The pattern already exists: ErrCacheDirInsecure (#9) refuses to start if autocert cache is world/group-readable. This ticket extends that pattern into a coherent boot-time posture check.

What

Add internal/relay/startup_check.go exposing one function:

// CheckSecurityPosture runs all startup environment checks and returns
// a multi-error describing every failed posture invariant. Fail-fast
// on first error in production mode; in dev mode (--insecure-listen)
// log warnings and continue.
func CheckSecurityPosture(cfg Config) error

Invariants checked:

Check Failure mode
Effective uid != 0 (production mode) ErrRunningAsRoot
Effective capabilities ⊆ allowlist (Linux only via /proc/self/status) ErrUnexpectedCapability
Autocert cache dir perms == 0700 already as ErrCacheDirInsecure (#9) — call from this layer
--insecure-listen flag is unset in production mode (env-var contract: PYRYCODE_RELAY_PRODUCTION=1) ErrInsecureListenInProduction
Required env vars present + valid format ErrInvalidConfig{key, reason}
Listener will bind only to expected ports (no stray :6060 pprof, no leaked debug ports) ErrUnexpectedListener
Multi-instance check covered by #39CheckSecurityPosture calls into that

Wired in cmd/pyrycode-relay/main.go after flag parse, before http.ListenAndServe — fail-fast with structured log line per invariant.

Implementation notes

  • Sentinel errors per the established pattern (Err... naming, errors.Is branchable)
  • Test the matrix: each invariant gets a unit test for both pass + fail; the public function gets an integration-shape test that asserts ALL invariants run (so a future invariant added to one but not the other surfaces immediately)
  • Cross-platform: capability + uid checks are Linux-specific; gate with build tags or runtime OS checks. macOS dev runs skip them with a single "running on darwin, skipping linux-only checks" warning.
  • Production-mode signal: explicit env var (PYRYCODE_RELAY_PRODUCTION=1) is cleaner than autodetecting. Dev / test runs default to non-production. Deploy manifest sets the flag explicitly.
  • Loud failure log: each failed invariant logs structured error + the env-var or fix to set. Don't make the operator grep — the log line IS the runbook.

Why this exists alongside CI scans

CI scans verify the build is clean. The startup self-check verifies the runtime environment is clean. A correct image deployed wrong (root user via docker run --user 0, leaked debug port via env override, etc.) escapes CI but gets caught at boot.

Out of scope

  • Periodic re-scans during runtime (most posture invariants are boot-time decisions; runtime drift is rare). Single boot-time check is enough for v1.
  • Auto-remediation (don't try to fix; loud-fail per established "loud failure over silent correction" pattern)
  • Telemetry of failed boot attempts (production posture doesn't change between boots; if it fails once, ops sees it; no need for time-series)

Metadata

Metadata

Assignees

No one assigned

    Labels

    security-sensitiveTouches auth, crypto, or internet-exposed input paths

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions