Why
Dockerfile + CI scans + threat model are upstream guards (stochastic — depend on every contributor remembering them). A runtime startup self-check is the deterministic backstop that fires every single boot, regardless of how the deploy pipeline was assembled. Belt-and-suspenders per [[PROJECT-MEMORY#Behavioral / Instruction-Design]].
The pattern already exists: ErrCacheDirInsecure (#9) refuses to start if the autocert cache directory is group- or world-readable. This ticket extends that pattern into a coherent boot-time posture check.
What
Add internal/relay/startup_check.go exposing one function:
// CheckSecurityPosture runs all startup environment checks and returns
// a multi-error describing every failed posture invariant. In production
// mode any failure aborts startup; in dev mode (--insecure-listen)
// failures are logged as warnings and startup continues.
func CheckSecurityPosture(cfg Config) error
Invariants checked:
| Check | Failure mode |
| --- | --- |
| Effective uid != 0 (production mode) | ErrRunningAsRoot |
| Effective capabilities ⊆ allowlist (Linux only, via /proc/self/status) | ErrUnexpectedCapability |
| Autocert cache dir perms == 0700 | already exists as ErrCacheDirInsecure (#9); call it from this layer |
| --insecure-listen flag is unset in production mode (env-var contract: PYRYCODE_RELAY_PRODUCTION=1) | ErrInsecureListenInProduction |
| Required env vars present + valid format | ErrInvalidConfig{key, reason} |
| Listener will bind only to expected ports (no stray :6060 pprof, no leaked debug ports) | ErrUnexpectedListener |
| Multi-instance check | covered by #39; CheckSecurityPosture calls into that |
Wired in cmd/pyrycode-relay/main.go after flag parse, before http.ListenAndServe — fail-fast with structured log line per invariant.
Implementation notes
- Sentinel errors per the established pattern (Err... naming, errors.Is branchable)
- Test the matrix: each invariant gets a unit test for both pass + fail; the public function gets an integration-shape test that asserts ALL invariants run (so a future invariant that has a unit test but is never wired into CheckSecurityPosture, or the reverse, surfaces immediately)
- Cross-platform: capability + uid checks are Linux-specific; gate with build tags or runtime OS checks. macOS dev runs skip them with a single "running on darwin, skipping linux-only checks" warning.
- Production-mode signal: explicit env var (PYRYCODE_RELAY_PRODUCTION=1) is cleaner than autodetecting. Dev / test runs default to non-production. Deploy manifest sets the flag explicitly.
- Loud failure log: each failed invariant logs structured error + the env-var or fix to set. Don't make the operator grep — the log line IS the runbook.
Why this exists alongside CI scans
CI scans verify the build is clean. The startup self-check verifies the runtime environment is clean. A correct image deployed wrong (root user via docker run --user 0, leaked debug port via env override, etc.) escapes CI but gets caught at boot.
Out of scope
- Periodic re-scans during runtime (most posture invariants are boot-time decisions; runtime drift is rare). Single boot-time check is enough for v1.
- Auto-remediation (don't try to fix; loud-fail per established "loud failure over silent correction" pattern)
- Telemetry of failed boot attempts (production posture doesn't change between boots; if it fails once, ops sees it; no need for time-series)