Merged
18 changes: 18 additions & 0 deletions cmd/pyrycode-relay/main.go
@@ -92,6 +92,24 @@ func run(args []string, sigCtx context.Context) int {
return 2
}

// CheckSingleInstance is the deterministic backstop to the prose half in
// docs/architecture.md § Single-instance constraint (#64).
// Refuses to start on multi-instance-capable platforms (today: Fly.io,
// signalled by FLY_APP_NAME) unless the operator asserts single-instance
// intent via PYRYCODE_RELAY_SINGLE_INSTANCE=1. Runs after CheckEnvConfig
// so a typo'd bypass fails with the structured config error rather than
// being silently treated as unset; runs before CheckInsecureListenInProduction
// because a deploy-shape misconfiguration is more fundamental than a
// production-mode flag misconfiguration. See
// docs/specs/architecture/65-startup-multi-instance-check.md § Wiring.
if err := relay.CheckSingleInstance(os.Getenv); err != nil {
logger.Error("refusing to start: multi-instance deploy detected",
"err", err,
"bypass_env_var", "PYRYCODE_RELAY_SINGLE_INSTANCE",
"fix", "set PYRYCODE_RELAY_SINGLE_INSTANCE=1 in the deploy manifest (e.g. fly.toml [env]) to assert single-instance intent; see docs/architecture.md § Single-instance constraint")
return 2
}

if err := relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv); err != nil {
logger.Error("refusing to start: production-mode misconfiguration",
"err", err,
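As a reading aid (not part of the diff): the INDEX entry below spells out `CheckSingleInstance`'s decision order, and a minimal sketch of it follows. The constant name and the exact sentinel wording here are assumptions; the env-var names, the `ErrMultiInstanceDeployDetected` identifier, and the strict-equality ordering come from this PR.

```go
package relay

import "errors"

// envSingleInstance is the operator's single-instance assertion; exactly
// "1" counts, anything else is treated as unset here (CheckEnvConfig has
// already rejected malformed values with a structured per-key error).
// #nosec G101 -- env var name, not a credential
const envSingleInstance = "PYRYCODE_RELAY_SINGLE_INSTANCE"

// ErrMultiInstanceDeployDetected carries both AC-mandated substrings so a
// `grep -i` over boot logs hits regardless of structured-field rendering.
var ErrMultiInstanceDeployDetected = errors.New(
	"multi-instance deploy detected; set PYRYCODE_RELAY_SINGLE_INSTANCE=1 to assert single-instance intent")

// CheckSingleInstance applies the documented decision order: an exact "1"
// bypass wins unconditionally; else a non-empty FLY_APP_NAME (Fly.io's
// platform signal) returns the sentinel; else nil.
func CheckSingleInstance(getenv func(string) string) error {
	if getenv(envSingleInstance) == "1" {
		return nil
	}
	if getenv("FLY_APP_NAME") != "" {
		return ErrMultiInstanceDeployDetected
	}
	return nil
}
```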
1 change: 1 addition & 0 deletions docs/knowledge/INDEX.md
@@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [Single-instance startup self-check](features/single-instance-check.md) — deterministic backstop to `docs/architecture.md § Single-instance constraint` (#64). `relay.CheckSingleInstance(getenv)` in `internal/relay/single_instance.go` refuses to start (exit 2, `ErrMultiInstanceDeployDetected`) when the relay detects it is running on a multi-instance-capable platform (today: Fly.io, signalled by `FLY_APP_NAME` non-empty) AND the operator has not asserted single-instance intent via `PYRYCODE_RELAY_SINGLE_INSTANCE=1`. Decision order: bypass exact `"1"` wins unconditionally; else `FLY_APP_NAME` non-empty returns the sentinel; else nil. Sentinel message contains both AC-mandated substrings (`multi-instance deploy detected` + `PYRYCODE_RELAY_SINGLE_INSTANCE`) so `grep -i` over boot logs hits regardless of structured-field rendering. Wired in `cmd/pyrycode-relay/main.go` between `CheckEnvConfig` (so a typo'd bypass like `=true` fails with the structured per-key config error before reaching here) and `CheckInsecureListenInProduction` (because deploy-shape misconfiguration is more fundamental than production-mode flag misconfiguration). `envContracts` in `env_config.go` gains a row for `PYRYCODE_RELAY_SINGLE_INSTANCE` with the same exact-`"1"`-or-unset shape as `envProductionMode`; the env-name constant lives in `single_instance.go` and is referenced by the registry row (single-source-of-truth invariant). The bypass constant carries `// #nosec G101 -- env var name, not a credential` to silence gosec's literal-credential false positive. **Heuristic, not adversarial**: catches "operator deployed to Fly and never read `architecture.md`"; does NOT catch "operator ran `fly scale count 3` after configuring the bypass in `fly.toml [env]`" — once the bypass propagates via the manifest, all replicas inherit it. The ticket Technical Notes precluded a Fly Machines API client or DNS-against-`<app>.internal` mechanism that could catch runtime scale-out; honest framing is "explicit deploy-time gate, not a runtime instance count". Fly.io only today; K8s/ECS/Lambda signals deferred until a non-Fly substrate is considered (speculative signals would false-positive in local-dev shells). Deterministic half of the belt-and-suspenders pair shipped jointly with #64's prose; both halves agree on the literal env var name (`PYRYCODE_RELAY_SINGLE_INSTANCE`) as the shared contract surface. Out of scope: updating `fly.toml [env]` to set the bypass (rides with `docs/deploy.md` bootstrap procedure); reconciling `architecture.md`'s "NOT recommended for production" framing with the deployed reality that the bypass must be set permanently in the Fly manifest (separate XS doc-only follow-up — code is the contract source). Tests: 7-row table matrix in `single_instance_test.go` covering AC #5(a)/(b)/(c) plus bypass-set-no-signal no-op plus three non-`"1"` bypass values (`"true"`, `"0"`, `" 1"`) to lock that `CheckSingleInstance` is strict-equality independently of `CheckEnvConfig`; AC-substring assertion as its own canary test; new row in the `env_config_test.go` malformed-value matrix locks the registry entry (#65).
- [Graceful shutdown on SIGTERM](features/graceful-shutdown.md) — `SIGTERM` / `SIGINT` triggers a bounded drain: stop accepting new connections, emit `1001 StatusGoingAway` on every live WS conn, give peers a fixed window (10s in v1) to ack, then force-close and exit. Closes the "graceful-shutdown signal handler" gap that #47, #57, and #60 deferred to this ticket. Three pieces: (a) `func (r *Registry) Snapshot() []Conn` — freshly-allocated slice of every registered binary + phone conn, RLock-scoped, nil on empty, mirrors `PhonesFor`'s shape; preserves ADR-0003's passive-registry contract (slow work happens outside the lock). (b) `func relay.Shutdown(ctx, logger, reg, servers...) error` in `internal/relay/shutdown.go` — concurrently invokes `srv.Shutdown(ctx)` on each `*http.Server` (closes listeners synchronously, waits for non-hijacked handlers) AND fans `CloseWithCode(StatusGoingAway, "shutting down")` out across `reg.Snapshot()` (one goroutine per conn; concurrent fan-out is load-bearing because `nhooyr.io/websocket.Conn.Close` blocks up to 5s per conn for the peer's reciprocal handshake — serial iteration would scale linearly in conn count and blow past the deadline at the second stuck peer). A type-assertion seam against the private `gracefulCloser` interface (`CloseWithCode(websocket.StatusCode, string)`) lets `*WSConn` (#7, ADR-0007) take the code-aware path while test fakes and any future narrow-`Conn` impls fall back to plain `Close()`. A minimal sketch of this fan-out appears after this list. Returns `nil` on clean drain, `ctx.Err()` on deadline expiry — and on the deadline path explicitly iterates `srv.Close()` per server because `http.Server.Shutdown` does NOT internally force-close its listeners after ctx expiry (the unit test asserts wall-clock < 300ms after a 50ms deadline specifically to pin this). Errors from `srv.Shutdown` are logged but never abort the drain — every server gets a Shutdown call, every conn gets a close call. (c) `cmd/pyrycode-relay/main.go` refactor splitting `main` into a testable `run(args []string, sigCtx context.Context) int` entry point invoked from a one-line `main` (`os.Exit(run(os.Args[1:], signalContextFor(SIGTERM, SIGINT)))`); the split exists so `main_e2e_test.go` can drive the full boot+drain sequence with a synthetic `context.Context` instead of forking a subprocess and sending real signals. `runServers(sigCtx, logger, reg, servers, listen)` is the single sink for both `--insecure-listen` and `--domain` modes — launches one goroutine per `*http.Server` running the `listen` closure (`ListenAndServe` or `ListenAndServeTLS("", "")` for the autocert HTTPS server), forwards the first non-`ErrServerClosed` listener error to a 1-slot buffered channel (first error wins, additional non-blocking sends drop), `select`s on `<-sigCtx.Done()` vs `<-listenerErr`, then builds `context.WithTimeout(_, drainDeadline)` and calls `relay.Shutdown(...)`. Exit codes: 0 for signal-triggered drain (clean operator action), 1 for listener-error-triggered drain (process supervisor restart) — preserves today's `os.Exit(1)` semantics on listener errors while enabling clean shutdown on signals. `drainDeadline = 10 * time.Second` declared as a `const` at the top of `main.go` per the established policy-values-at-wiring-site convention (#21, #60, #79); leaves ~5s of headroom beyond a single close-handshake. The five pre-existing `os.Exit(1)` calls inside listener goroutines collapse into the buffered-error channel — and `defer limiter.Close()` (introduced by #47) now actually runs on every shutdown path.
Shutdown ordering is structural, not enforced by a barrier: the listener-close is in flight before the snapshot is taken, so no new WS handshake can complete between snapshot and close. Handler goroutines are NOT joined by the helper — `CloseWithCode` unblocks the handler's `Read`, the handler unwinds through its existing `defer reg.ScheduleReleaseServer / wsconn.Close` chain, and anything still alive at process exit is reaped by the kernel. A second SIGTERM during drain is ignored (`NotifyContext` cancels its context once); SIGKILL bypasses Go entirely. Empty registry is a no-op (`Snapshot()` returns nil, fan-out has zero goroutines, helper returns nil promptly). Tests: `shutdown_test.go` (empty-registry, all-conns-get-StatusGoingAway, deadline-expiry-with-wall-clock-bound, real-`*WSConn`-idempotency under race), `registry_test.go` (Snapshot empty/membership/isolation/race-freedom), `main_e2e_test.go` (full boot + WS dial + cancel-sigCtx + assert `websocket.CloseStatus(err) == StatusGoingAway` + assert exit 0 within drainDeadline+2s). Out of scope: protocol-level reconnect/resume coordination (lives in `protocol-mobile.md`), in-flight frame persistence (relay is stateless), `--drain-deadline` CLI flag (10s constant for v1; promote when ops experience demands), SIGHUP config reload (no runtime-reloadable config exists). (#31).
- [Upgrade-attempt and register-failure counters](features/upgrade-and-register-failure-counters.md) — two CounterVecs at the `/v1/server` + `/v1/client` upgrade call sites so operators can distinguish "flood of malformed headers" from "flood of 4409 conflicts" from "flood of 4429 phone-cap rejections" from "flood of rate-limited clients" without log-grepping. `pyrycode_relay_ws_upgrade_attempts_total{endpoint, outcome}` (`endpoint` ∈ `{server, client}`, `outcome` ∈ `{accept, reject_headers, reject_409, reject_404, reject_429, reject_rate_limit}`) is incremented exactly once per terminal decision; `pyrycode_relay_register_failures_total{kind}` (`kind` ∈ `{no_server, server_in_use, phones_at_cap, ratelimit}`) co-increments in lockstep with the matching `outcome=reject_*` cell. All 16 series are pre-bound via `WithLabelValues` in `NewUpgradeMetrics(reg prometheus.Registerer)` — including the three structurally unreachable cells (`server×reject_404`, `server×reject_429`, `client×reject_409`) which stay at 0 by construction — so ops dashboards see a stable scrape shape rather than series flickering into existence on first occurrence. Nine event-named, nil-receiver-safe methods (`ServerAccept`, `ServerHeaderReject`, `ServerIDConflict`, `ServerRateLimitDeny`, `ClientAccept`, `ClientHeaderReject`, `ClientNoServer`, `ClientPhonesAtCap`, `ClientRateLimitDeny`); the five composite methods are the lockstep contract — a single call bumps both the endpoint×outcome and kind cells, so tests cannot accidentally advance one without the other. **Two load-bearing seam choices**: (a) handler-parameter threading (`*UpgradeMetrics` as the final arg to `ServerHandler` / `ClientHandler`) instead of #58's hooks-on-`Registry`, because three of the five terminal sites (header-reject, success, rate-limit-deny) have NO Registry call to hook through — fan-out is 7 mechanical `, nil` test appends, under the 10-call-site red line; (b) `WrapServerRateLimitDeny` / `WrapClientRateLimitDeny` as a status-code observer wrapper stacked OUTSIDE the rate-limit middleware (#47), because adding a callback parameter to `NewRateLimitMiddleware` would cascade across 11 call sites — HTTP 429 in either pipe is unambiguously the rate-limit deny signal (header-gate writes 400, success writes 101, WS close codes 4404/4409/4429 travel inside the upgraded socket). The wrapper's `statusObserverResponseWriter` **must implement `http.Hijacker`** because the wrapped pipe terminates in `websocket.Accept` which type-asserts the writer for hijacking; without forwarding, the upgrade fails with HTTP 500 — a constraint that applies to any future middleware wrapping a WS upgrade. Increments fire structurally tight to the deciding branch (between `http.Error`/`logger.Info` and `return`) so "did this code path execute?" is one statement from "did the counter increment?". All three label values (`endpoint`, `outcome`, `kind`) are hard-coded compile-time constants set per event-named method — none derived from request headers, request paths, or any other attacker-influenced input; cardinality is exactly 16 regardless of traffic. Registered against the private `metricsReg` per ADR-0008 § Scope of use; package-level `TestMetricsRegistry_NoGlobalRegistrarLeak` plus a local sibling `TestUpgradeMetrics_NoGlobalRegistrarLeak` enforce no `DefaultRegisterer` leak structurally. The defensive `wsconn.Close()` non-`Is`-matched branches in both handlers stay MUST-NOT-increment — a future error variant would be a new outcome and belongs in a follow-up ticket. 
Out of scope: latency histograms for the upgrade handshake, a second HTTP-429 source on the upgrade pipes (would invalidate the observer's filter), source-IP-labelled attempt counters (would carry attacker-influenced label values). The Hijacker-forwarding status observer is sketched after this list. (#57)
- [Per-IP rate-limit middleware (`/v1/server` + `/v1/client`)](features/rate-limit-middleware.md) — HTTP middleware that throttles WS upgrade attempts per source IP **before** `websocket.Accept` runs, before the registry is touched, before the mobile/binary token header is read. One new exported symbol: `relay.NewRateLimitMiddleware(limiter *IPRateLimiter, logger *slog.Logger, trustForwardedFor bool) func(http.Handler) http.Handler` — returns the conventional composition shape, matching `EnforceHost` (sketched after this list). Per request: extract `ip := ClientIP(r, trustForwardedFor)`, deny on `ip == ""` (empty-IP guard runs **before** `Allow` because the limiter's `Allow("")` is a normal map key, not a deny — loud-failure rule: refuse un-attributable traffic rather than admit it un-throttled), deny on `!limiter.Allow(ip)`, otherwise `next.ServeHTTP`. Both deny paths emit one `logger.Warn("rate_limited", "remote", strconv.Quote(ip))` line — single allowlisted key, `strconv.Quote` defends against control-byte log injection, the empty-IP case renders `remote="\"\""` so operators can distinguish the two deny causes by the value. 429 body is empty (`http.Error(w, "", http.StatusTooManyRequests)` — matches the existing `BadRequest` shape in `server_endpoint.go` / `client_endpoint.go`, no internal state leaked). New `--trust-x-forwarded-for` CLI flag (default `false`, Usage carries the explicit spoofing warning) is threaded once into the factory. Policy lives at the wiring site in `cmd/pyrycode-relay/main.go`: `rateLimitRefillEvery = 6s`, `rateLimitBurst = 20`, `rateLimitEvictionInterval = 5min` → ~10 sustained attempts/IP/min with 20-attempt burst headroom; `5min` sweep is well above `burst*refillEvery = 120s`. One shared `*IPRateLimiter` is applied to both `mux.Handle` registrations — a misbehaving IP retrying against either endpoint shares one bucket; `/healthz` is intentionally NOT wrapped (must remain pollable from monitoring). `defer limiter.Close()` is best-effort (current `os.Exit` listener path skips defers; graceful shutdown out of scope). Concurrency safety fully delegated to the limiter; middleware is stateless, no new locks, no new goroutines. Mobile/binary `X-Pyrycode-Token` structurally not read on this code path. Why middleware not constructor injection: keeps frame-routing handlers ignorant of admission control, makes `/healthz`'s exemption fall out for free, mirrors `EnforceHost`. Closes the third leg of `docs/threat-model.md` § *DoS resistance* alongside #29 (per-frame size cap) and #30 (per-server-id phone cap); the "token-bucket rate limit on /v1/server and /v1/client" line moves from *Future hardening* to v1 mitigation. Out of scope: connection-count cap (different threat surface — *attempt rate* vs *resident count*), CIDR-aware trusted-proxy chain (today's flag is flat all-or-nothing), multi-instance shared-state (would need Redis), rate-limit metrics counter (future ticket parallel to #58), adaptive policy, graceful-shutdown signal handler (#47).
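The three sketches below are reading aids, not part of the diff. First, the graceful-shutdown fan-out from the #31 entry: a minimal sketch assuming stand-in `Conn` and `snapshotter` shapes (the real registry type is wider); the `gracefulCloser` seam, the one-goroutine-per-conn fan-out, and the explicit `srv.Close()` on the deadline path follow the entry's description.

```go
package relay

import (
	"context"
	"log/slog"
	"net/http"
	"sync"

	"nhooyr.io/websocket"
)

// Conn and snapshotter are minimal stand-ins for the registry's types.
type Conn interface{ Close() error }

type snapshotter interface{ Snapshot() []Conn }

// gracefulCloser is the type-assertion seam: *WSConn takes the code-aware
// path; narrow test fakes fall back to plain Close().
type gracefulCloser interface {
	CloseWithCode(code websocket.StatusCode, reason string)
}

// Shutdown drains concurrently: every server gets a Shutdown call and
// every conn gets a close call, regardless of individual errors. One
// goroutine per conn is load-bearing: a stuck peer can block its close
// handshake for ~5s, so serial iteration would blow the deadline.
func Shutdown(ctx context.Context, logger *slog.Logger, reg snapshotter, servers ...*http.Server) error {
	var wg sync.WaitGroup
	for _, srv := range servers {
		wg.Add(1)
		go func(s *http.Server) {
			defer wg.Done()
			if err := s.Shutdown(ctx); err != nil {
				logger.Warn("server shutdown", "err", err) // logged, never aborts the drain
			}
		}(srv)
	}
	for _, c := range reg.Snapshot() {
		wg.Add(1)
		go func(c Conn) {
			defer wg.Done()
			if gc, ok := c.(gracefulCloser); ok {
				gc.CloseWithCode(websocket.StatusGoingAway, "shutting down")
				return
			}
			_ = c.Close()
		}(c)
	}
	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()
	select {
	case <-done:
		return nil // clean drain
	case <-ctx.Done():
		// http.Server.Shutdown does NOT force-close listeners after ctx
		// expiry, so the deadline path closes each server explicitly.
		// Still-blocked close goroutines are abandoned; anything alive at
		// process exit is reaped by the kernel.
		for _, srv := range servers {
			_ = srv.Close()
		}
		return ctx.Err()
	}
}
```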
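Second, the #57 entry's load-bearing constraint: any middleware wrapping a WS upgrade must forward `http.Hijacker`, because the wrapped pipe terminates in `websocket.Accept`, which type-asserts the writer for hijacking. A minimal sketch; the `onDeny` callback is a hypothetical stand-in for the metrics increment, and the real wrappers are `WrapServerRateLimitDeny` / `WrapClientRateLimitDeny`.

```go
package relay

import (
	"bufio"
	"errors"
	"net"
	"net/http"
)

// statusObserverResponseWriter records the first status code written and
// forwards http.Hijacker; without the forward, every upgrade would fail
// with HTTP 500.
type statusObserverResponseWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusObserverResponseWriter) WriteHeader(code int) {
	if w.status == 0 {
		w.status = code
	}
	w.ResponseWriter.WriteHeader(code)
}

func (w *statusObserverResponseWriter) Hijack() (net.Conn, *bufio.ReadWriter, error) {
	hj, ok := w.ResponseWriter.(http.Hijacker)
	if !ok {
		return nil, nil, errors.New("underlying ResponseWriter does not support hijacking")
	}
	return hj.Hijack()
}

// observeRateLimitDeny (hypothetical name) is stacked OUTSIDE the
// rate-limit middleware. HTTP 429 on an upgrade pipe is unambiguously the
// rate-limit deny signal: the header-gate writes 400, success writes 101,
// and WS close codes travel inside the upgraded socket.
func observeRateLimitDeny(next http.Handler, onDeny func()) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ow := &statusObserverResponseWriter{ResponseWriter: w}
		next.ServeHTTP(ow, r)
		if ow.status == http.StatusTooManyRequests {
			onDeny()
		}
	})
}
```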
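Third, the documented shape of `relay.NewRateLimitMiddleware` from the #47 entry. `IPRateLimiter` and `ClientIP` are reduced to placeholder stand-ins so the sketch compiles; only the middleware body follows the entry's per-request sequence.

```go
package relay

import (
	"log/slog"
	"net"
	"net/http"
	"strconv"
)

// IPRateLimiter stands in for the project's token-bucket limiter.
type IPRateLimiter struct{}

func (l *IPRateLimiter) Allow(ip string) bool { return true } // placeholder

// ClientIP stands in for the project's extractor; a real implementation
// would parse the appropriate X-Forwarded-For hop when trusted.
func ClientIP(r *http.Request, trustForwardedFor bool) string {
	if trustForwardedFor {
		if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
			return xff
		}
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return ""
	}
	return host
}

// NewRateLimitMiddleware returns the conventional composition shape. The
// short-circuit keeps the empty-IP guard BEFORE Allow: the limiter treats
// "" as a normal map key, not a deny, and the loud-failure rule refuses
// un-attributable traffic rather than admitting it un-throttled.
func NewRateLimitMiddleware(limiter *IPRateLimiter, logger *slog.Logger, trustForwardedFor bool) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			ip := ClientIP(r, trustForwardedFor)
			if ip == "" || !limiter.Allow(ip) {
				// strconv.Quote defends against control-byte log injection;
				// the empty-IP deny renders remote="\"\"" so the two causes
				// stay distinguishable.
				logger.Warn("rate_limited", "remote", strconv.Quote(ip))
				http.Error(w, "", http.StatusTooManyRequests)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}
```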