Merged
18 changes: 18 additions & 0 deletions cmd/pyrycode-relay/main.go
@@ -92,6 +92,24 @@ func run(args []string, sigCtx context.Context) int {
return 2
}

// CheckSingleInstance is the deterministic backstop to the prose half in
// docs/architecture.md § Single-instance constraint (#64).
// Refuses to start on multi-instance-capable platforms (today: Fly.io,
// signalled by FLY_APP_NAME) unless the operator asserts single-instance
// intent via PYRYCODE_RELAY_SINGLE_INSTANCE=1. Runs after CheckEnvConfig
// so a typo'd bypass fails with the structured config error rather than
// being silently treated as unset; runs before CheckInsecureListenInProduction
// because a deploy-shape misconfiguration is more fundamental than a
// production-mode flag misconfiguration. See
// docs/specs/architecture/65-startup-multi-instance-check.md § Wiring.
if err := relay.CheckSingleInstance(os.Getenv); err != nil {
logger.Error("refusing to start: multi-instance deploy detected",
"err", err,
"bypass_env_var", "PYRYCODE_RELAY_SINGLE_INSTANCE",
"fix", "set PYRYCODE_RELAY_SINGLE_INSTANCE=1 in the deploy manifest (e.g. fly.toml [env]) to assert single-instance intent; see docs/architecture.md § Single-instance constraint")
return 2
}

if err := relay.CheckInsecureListenInProduction(*insecureListen, os.Getenv); err != nil {
logger.Error("refusing to start: production-mode misconfiguration",
"err", err,
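As a reading aid (not part of the diff): the INDEX entry below spells out `CheckSingleInstance`'s decision order, and a minimal sketch of it follows. The constant name and the exact sentinel wording here are assumptions; the env-var names, the `ErrMultiInstanceDeployDetected` identifier, and the strict-equality ordering come from this PR.

```go
package relay

import "errors"

// envSingleInstance is the operator's single-instance assertion; exactly
// "1" counts, anything else is treated as unset here (CheckEnvConfig has
// already rejected malformed values with a structured per-key error).
// #nosec G101 -- env var name, not a credential
const envSingleInstance = "PYRYCODE_RELAY_SINGLE_INSTANCE"

// ErrMultiInstanceDeployDetected carries both AC-mandated substrings so a
// `grep -i` over boot logs hits regardless of structured-field rendering.
var ErrMultiInstanceDeployDetected = errors.New(
	"multi-instance deploy detected; set PYRYCODE_RELAY_SINGLE_INSTANCE=1 to assert single-instance intent")

// CheckSingleInstance applies the documented decision order: an exact "1"
// bypass wins unconditionally; else a non-empty FLY_APP_NAME (Fly.io's
// platform signal) returns the sentinel; else nil.
func CheckSingleInstance(getenv func(string) string) error {
	if getenv(envSingleInstance) == "1" {
		return nil
	}
	if getenv("FLY_APP_NAME") != "" {
		return ErrMultiInstanceDeployDetected
	}
	return nil
}
```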
1 change: 1 addition & 0 deletions docs/knowledge/INDEX.md
@@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [Single-instance startup self-check](features/single-instance-check.md) — deterministic backstop to `docs/architecture.md § Single-instance constraint` (#64). `relay.CheckSingleInstance(getenv)` in `internal/relay/single_instance.go` refuses to start (exit 2, `ErrMultiInstanceDeployDetected`) when the relay detects it is running on a multi-instance-capable platform (today: Fly.io, signalled by `FLY_APP_NAME` non-empty) AND the operator has not asserted single-instance intent via `PYRYCODE_RELAY_SINGLE_INSTANCE=1`. Decision order: bypass exact `"1"` wins unconditionally; else `FLY_APP_NAME` non-empty returns the sentinel; else nil. Sentinel message contains both AC-mandated substrings (`multi-instance deploy detected` + `PYRYCODE_RELAY_SINGLE_INSTANCE`) so `grep -i` over boot logs hits regardless of structured-field rendering. Wired in `cmd/pyrycode-relay/main.go` between `CheckEnvConfig` (so a typo'd bypass like `=true` fails with the structured per-key config error before reaching here) and `CheckInsecureListenInProduction` (because deploy-shape misconfiguration is more fundamental than production-mode flag misconfiguration). `envContracts` in `env_config.go` gains a row for `PYRYCODE_RELAY_SINGLE_INSTANCE` with the same exact-`"1"`-or-unset shape as `envProductionMode`; the env-name constant lives in `single_instance.go` and is referenced by the registry row (single-source-of-truth invariant). The bypass constant carries `// #nosec G101 -- env var name, not a credential` to silence gosec's literal-credential false positive. **Heuristic, not adversarial**: catches "operator deployed to Fly and never read `architecture.md`"; does NOT catch "operator ran `fly scale count 3` after configuring the bypass in `fly.toml [env]`" — once the bypass propagates via the manifest, all replicas inherit it. The ticket Technical Notes precluded a Fly Machines API client or DNS-against-`<app>.internal` mechanism that could catch runtime scale-out; honest framing is "explicit deploy-time gate, not a runtime instance count". Fly.io only today; K8s/ECS/Lambda signals deferred until a non-Fly substrate is considered (speculative signals would false-positive in local-dev shells). Deterministic half of the belt-and-suspenders pair shipped jointly with #64's prose; both halves agree on the literal env var name (`PYRYCODE_RELAY_SINGLE_INSTANCE`) as the shared contract surface. Out of scope: updating `fly.toml [env]` to set the bypass (rides with `docs/deploy.md` bootstrap procedure); reconciling `architecture.md`'s "NOT recommended for production" framing with the deployed reality that the bypass must be set permanently in the Fly manifest (separate XS doc-only follow-up — code is the contract source). Tests: 7-row table matrix in `single_instance_test.go` covering AC #5(a)/(b)/(c) plus bypass-set-no-signal no-op plus three non-`"1"` bypass values (`"true"`, `"0"`, `" 1"`) to lock that `CheckSingleInstance` is strict-equality independently of `CheckEnvConfig`; AC-substring assertion as its own canary test; new row in the `env_config_test.go` malformed-value matrix locks the registry entry (#65).
- [Graceful shutdown on SIGTERM](features/graceful-shutdown.md) — `SIGTERM` / `SIGINT` triggers a bounded drain: stop accepting new connections, emit `1001 StatusGoingAway` on every live WS conn, give peers a fixed window (10s in v1) to ack, then force-close and exit. Closes the "graceful-shutdown signal handler" gap that #47, #57, and #60 deferred to this ticket. Three pieces: (a) `func (r *Registry) Snapshot() []Conn` — freshly-allocated slice of every registered binary + phone conn, RLock-scoped, nil on empty, mirrors `PhonesFor`'s shape; preserves ADR-0003's passive-registry contract (slow work happens outside the lock). (b) `func relay.Shutdown(ctx, logger, reg, servers...) error` in `internal/relay/shutdown.go` — concurrently invokes `srv.Shutdown(ctx)` on each `*http.Server` (closes listeners synchronously, waits for non-hijacked handlers) AND fans `CloseWithCode(StatusGoingAway, "shutting down")` out across `reg.Snapshot()` (one goroutine per conn; concurrent fan-out is load-bearing because `nhooyr.io/websocket.Conn.Close` blocks up to 5s per conn for the peer's reciprocal handshake — serial iteration would scale linearly in conn count and blow past the deadline at the second stuck peer). A type-assertion seam against the private `gracefulCloser` interface (`CloseWithCode(websocket.StatusCode, string)`) lets `*WSConn` (#7, ADR-0007) take the code-aware path while test fakes and any future narrow-`Conn` impls fall back to plain `Close()`. A minimal sketch of this fan-out appears after this list. Returns `nil` on clean drain, `ctx.Err()` on deadline expiry — and on the deadline path explicitly iterates `srv.Close()` per server because `http.Server.Shutdown` does NOT internally force-close its listeners after ctx expiry (the unit test asserts wall-clock < 300ms after a 50ms deadline specifically to pin this). Errors from `srv.Shutdown` are logged but never abort the drain — every server gets a Shutdown call, every conn gets a close call. (c) `cmd/pyrycode-relay/main.go` refactor splitting `main` into a testable `run(args []string, sigCtx context.Context) int` entry point invoked from a one-line `main` (`os.Exit(run(os.Args[1:], signalContextFor(SIGTERM, SIGINT)))`); the split exists so `main_e2e_test.go` can drive the full boot+drain sequence with a synthetic `context.Context` instead of forking a subprocess and sending real signals. `runServers(sigCtx, logger, reg, servers, listen)` is the single sink for both `--insecure-listen` and `--domain` modes — launches one goroutine per `*http.Server` running the `listen` closure (`ListenAndServe` or `ListenAndServeTLS("", "")` for the autocert HTTPS server), forwards the first non-`ErrServerClosed` listener error to a 1-slot buffered channel (first error wins, additional non-blocking sends drop), `select`s on `<-sigCtx.Done()` vs `<-listenerErr`, then builds `context.WithTimeout(_, drainDeadline)` and calls `relay.Shutdown(...)`. Exit codes: 0 for signal-triggered drain (clean operator action), 1 for listener-error-triggered drain (process supervisor restart) — preserves today's `os.Exit(1)` semantics on listener errors while enabling clean shutdown on signals. `drainDeadline = 10 * time.Second` declared as a `const` at the top of `main.go` per the established policy-values-at-wiring-site convention (#21, #60, #79); leaves ~5s of headroom beyond a single close-handshake. The five pre-existing `os.Exit(1)` calls inside listener goroutines collapse into the buffered-error channel — and `defer limiter.Close()` (introduced by #47) now actually runs on every shutdown path.
Shutdown ordering is structural, not enforced by a barrier: the listener-close is in flight before the snapshot is taken, so no new WS handshake can complete between snapshot and close. Handler goroutines are NOT joined by the helper — `CloseWithCode` unblocks the handler's `Read`, the handler unwinds through its existing `defer reg.ScheduleReleaseServer / wsconn.Close` chain, and anything still alive at process exit is reaped by the kernel. A second SIGTERM during drain is ignored (`NotifyContext` cancels its context once); SIGKILL bypasses Go entirely. Empty registry is a no-op (`Snapshot()` returns nil, fan-out has zero goroutines, helper returns nil promptly). Tests: `shutdown_test.go` (empty-registry, all-conns-get-StatusGoingAway, deadline-expiry-with-wall-clock-bound, real-`*WSConn`-idempotency under race), `registry_test.go` (Snapshot empty/membership/isolation/race-freedom), `main_e2e_test.go` (full boot + WS dial + cancel-sigCtx + assert `websocket.CloseStatus(err) == StatusGoingAway` + assert exit 0 within drainDeadline+2s). Out of scope: protocol-level reconnect/resume coordination (lives in `protocol-mobile.md`), in-flight frame persistence (relay is stateless), `--drain-deadline` CLI flag (10s constant for v1; promote when ops experience demands), SIGHUP config reload (no runtime-reloadable config exists). (#31).
- [Upgrade-attempt and register-failure counters](features/upgrade-and-register-failure-counters.md) — two CounterVecs at the `/v1/server` + `/v1/client` upgrade call sites so operators can distinguish "flood of malformed headers" from "flood of 4409 conflicts" from "flood of 4429 phone-cap rejections" from "flood of rate-limited clients" without log-grepping. `pyrycode_relay_ws_upgrade_attempts_total{endpoint, outcome}` (`endpoint` ∈ `{server, client}`, `outcome` ∈ `{accept, reject_headers, reject_409, reject_404, reject_429, reject_rate_limit}`) is incremented exactly once per terminal decision; `pyrycode_relay_register_failures_total{kind}` (`kind` ∈ `{no_server, server_in_use, phones_at_cap, ratelimit}`) co-increments in lockstep with the matching `outcome=reject_*` cell. All 16 series are pre-bound via `WithLabelValues` in `NewUpgradeMetrics(reg prometheus.Registerer)` — including the three structurally unreachable cells (`server×reject_404`, `server×reject_429`, `client×reject_409`) which stay at 0 by construction — so ops dashboards see a stable scrape shape rather than series flickering into existence on first occurrence. Nine event-named, nil-receiver-safe methods (`ServerAccept`, `ServerHeaderReject`, `ServerIDConflict`, `ServerRateLimitDeny`, `ClientAccept`, `ClientHeaderReject`, `ClientNoServer`, `ClientPhonesAtCap`, `ClientRateLimitDeny`); the five composite methods are the lockstep contract — a single call bumps both the endpoint×outcome and kind cells, so tests cannot accidentally advance one without the other. **Two load-bearing seam choices**: (a) handler-parameter threading (`*UpgradeMetrics` as the final arg to `ServerHandler` / `ClientHandler`) instead of #58's hooks-on-`Registry`, because three of the five terminal sites (header-reject, success, rate-limit-deny) have NO Registry call to hook through — fan-out is 7 mechanical `, nil` test appends, under the 10-call-site red line; (b) `WrapServerRateLimitDeny` / `WrapClientRateLimitDeny` as a status-code observer wrapper stacked OUTSIDE the rate-limit middleware (#47), because adding a callback parameter to `NewRateLimitMiddleware` would cascade across 11 call sites — HTTP 429 in either pipe is unambiguously the rate-limit deny signal (header-gate writes 400, success writes 101, WS close codes 4404/4409/4429 travel inside the upgraded socket). The wrapper's `statusObserverResponseWriter` **must implement `http.Hijacker`** because the wrapped pipe terminates in `websocket.Accept` which type-asserts the writer for hijacking; without forwarding, the upgrade fails with HTTP 500 — a constraint that applies to any future middleware wrapping a WS upgrade. Increments fire structurally tight to the deciding branch (between `http.Error`/`logger.Info` and `return`) so "did this code path execute?" is one statement from "did the counter increment?". All three label values (`endpoint`, `outcome`, `kind`) are hard-coded compile-time constants set per event-named method — none derived from request headers, request paths, or any other attacker-influenced input; cardinality is exactly 16 regardless of traffic. Registered against the private `metricsReg` per ADR-0008 § Scope of use; package-level `TestMetricsRegistry_NoGlobalRegistrarLeak` plus a local sibling `TestUpgradeMetrics_NoGlobalRegistrarLeak` enforce no `DefaultRegisterer` leak structurally. The defensive `wsconn.Close()` non-`Is`-matched branches in both handlers stay MUST-NOT-increment — a future error variant would be a new outcome and belongs in a follow-up ticket. 
Out of scope: latency histograms for the upgrade handshake, a second HTTP-429 source on the upgrade pipes (would invalidate the observer's filter), source-IP-labelled attempt counters (would carry attacker-influenced label values). The Hijacker-forwarding status observer is sketched after this list. (#57)
- [Per-IP rate-limit middleware (`/v1/server` + `/v1/client`)](features/rate-limit-middleware.md) — HTTP middleware that throttles WS upgrade attempts per source IP **before** `websocket.Accept` runs, before the registry is touched, before the mobile/binary token header is read. One new exported symbol: `relay.NewRateLimitMiddleware(limiter *IPRateLimiter, logger *slog.Logger, trustForwardedFor bool) func(http.Handler) http.Handler` — returns the conventional composition shape, matching `EnforceHost` (sketched after this list). Per request: extract `ip := ClientIP(r, trustForwardedFor)`, deny on `ip == ""` (empty-IP guard runs **before** `Allow` because the limiter's `Allow("")` is a normal map key, not a deny — loud-failure rule: refuse un-attributable traffic rather than admit it un-throttled), deny on `!limiter.Allow(ip)`, otherwise `next.ServeHTTP`. Both deny paths emit one `logger.Warn("rate_limited", "remote", strconv.Quote(ip))` line — single allowlisted key, `strconv.Quote` defends against control-byte log injection, the empty-IP case renders `remote="\"\""` so operators can distinguish the two deny causes by the value. 429 body is empty (`http.Error(w, "", http.StatusTooManyRequests)` — matches the existing `BadRequest` shape in `server_endpoint.go` / `client_endpoint.go`, no internal state leaked). New `--trust-x-forwarded-for` CLI flag (default `false`, Usage carries the explicit spoofing warning) is threaded once into the factory. Policy lives at the wiring site in `cmd/pyrycode-relay/main.go`: `rateLimitRefillEvery = 6s`, `rateLimitBurst = 20`, `rateLimitEvictionInterval = 5min` → ~10 sustained attempts/IP/min with 20-attempt burst headroom; `5min` sweep is well above `burst*refillEvery = 120s`. One shared `*IPRateLimiter` is applied to both `mux.Handle` registrations — a misbehaving IP retrying against either endpoint shares one bucket; `/healthz` is intentionally NOT wrapped (must remain pollable from monitoring). `defer limiter.Close()` is best-effort (current `os.Exit` listener path skips defers; graceful shutdown out of scope). Concurrency safety fully delegated to the limiter; middleware is stateless, no new locks, no new goroutines. Mobile/binary `X-Pyrycode-Token` structurally not read on this code path. Why middleware not constructor injection: keeps frame-routing handlers ignorant of admission control, makes `/healthz`'s exemption fall out for free, mirrors `EnforceHost`. Closes the third leg of `docs/threat-model.md` § *DoS resistance* alongside #29 (per-frame size cap) and #30 (per-server-id phone cap); the "token-bucket rate limit on /v1/server and /v1/client" line moves from *Future hardening* to v1 mitigation. Out of scope: connection-count cap (different threat surface — *attempt rate* vs *resident count*), CIDR-aware trusted-proxy chain (today's flag is flat all-or-nothing), multi-instance shared-state (would need Redis), rate-limit metrics counter (future ticket parallel to #58), adaptive policy, graceful-shutdown signal handler (#47).
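The three sketches below are reading aids, not part of the diff. First, the graceful-shutdown fan-out from the #31 entry: a minimal sketch assuming stand-in `Conn` and `snapshotter` shapes (the real registry type is wider); the `gracefulCloser` seam, the one-goroutine-per-conn fan-out, and the explicit `srv.Close()` on the deadline path follow the entry's description.

```go
package relay

import (
	"context"
	"log/slog"
	"net/http"
	"sync"

	"nhooyr.io/websocket"
)

// Conn and snapshotter are minimal stand-ins for the registry's types.
type Conn interface{ Close() error }

type snapshotter interface{ Snapshot() []Conn }

// gracefulCloser is the type-assertion seam: *WSConn takes the code-aware
// path; narrow test fakes fall back to plain Close().
type gracefulCloser interface {
	CloseWithCode(code websocket.StatusCode, reason string)
}

// Shutdown drains concurrently: every server gets a Shutdown call and
// every conn gets a close call, regardless of individual errors. One
// goroutine per conn is load-bearing: a stuck peer can block its close
// handshake for ~5s, so serial iteration would blow the deadline.
func Shutdown(ctx context.Context, logger *slog.Logger, reg snapshotter, servers ...*http.Server) error {
	var wg sync.WaitGroup
	for _, srv := range servers {
		wg.Add(1)
		go func(s *http.Server) {
			defer wg.Done()
			if err := s.Shutdown(ctx); err != nil {
				logger.Warn("server shutdown", "err", err) // logged, never aborts the drain
			}
		}(srv)
	}
	for _, c := range reg.Snapshot() {
		wg.Add(1)
		go func(c Conn) {
			defer wg.Done()
			if gc, ok := c.(gracefulCloser); ok {
				gc.CloseWithCode(websocket.StatusGoingAway, "shutting down")
				return
			}
			_ = c.Close()
		}(c)
	}
	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()
	select {
	case <-done:
		return nil // clean drain
	case <-ctx.Done():
		// http.Server.Shutdown does NOT force-close listeners after ctx
		// expiry, so the deadline path closes each server explicitly.
		// Still-blocked close goroutines are abandoned; anything alive at
		// process exit is reaped by the kernel.
		for _, srv := range servers {
			_ = srv.Close()
		}
		return ctx.Err()
	}
}
```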
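Second, the #57 entry's load-bearing constraint: any middleware wrapping a WS upgrade must forward `http.Hijacker`, because the wrapped pipe terminates in `websocket.Accept`, which type-asserts the writer for hijacking. A minimal sketch; the `onDeny` callback is a hypothetical stand-in for the metrics increment, and the real wrappers are `WrapServerRateLimitDeny` / `WrapClientRateLimitDeny`.

```go
package relay

import (
	"bufio"
	"errors"
	"net"
	"net/http"
)

// statusObserverResponseWriter records the first status code written and
// forwards http.Hijacker; without the forward, every upgrade would fail
// with HTTP 500.
type statusObserverResponseWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusObserverResponseWriter) WriteHeader(code int) {
	if w.status == 0 {
		w.status = code
	}
	w.ResponseWriter.WriteHeader(code)
}

func (w *statusObserverResponseWriter) Hijack() (net.Conn, *bufio.ReadWriter, error) {
	hj, ok := w.ResponseWriter.(http.Hijacker)
	if !ok {
		return nil, nil, errors.New("underlying ResponseWriter does not support hijacking")
	}
	return hj.Hijack()
}

// observeRateLimitDeny (hypothetical name) is stacked OUTSIDE the
// rate-limit middleware. HTTP 429 on an upgrade pipe is unambiguously the
// rate-limit deny signal: the header-gate writes 400, success writes 101,
// and WS close codes travel inside the upgraded socket.
func observeRateLimitDeny(next http.Handler, onDeny func()) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ow := &statusObserverResponseWriter{ResponseWriter: w}
		next.ServeHTTP(ow, r)
		if ow.status == http.StatusTooManyRequests {
			onDeny()
		}
	})
}
```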
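Third, the documented shape of `relay.NewRateLimitMiddleware` from the #47 entry. `IPRateLimiter` and `ClientIP` are reduced to placeholder stand-ins so the sketch compiles; only the middleware body follows the entry's per-request sequence.

```go
package relay

import (
	"log/slog"
	"net"
	"net/http"
	"strconv"
)

// IPRateLimiter stands in for the project's token-bucket limiter.
type IPRateLimiter struct{}

func (l *IPRateLimiter) Allow(ip string) bool { return true } // placeholder

// ClientIP stands in for the project's extractor; a real implementation
// would parse the appropriate X-Forwarded-For hop when trusted.
func ClientIP(r *http.Request, trustForwardedFor bool) string {
	if trustForwardedFor {
		if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
			return xff
		}
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return ""
	}
	return host
}

// NewRateLimitMiddleware returns the conventional composition shape. The
// short-circuit keeps the empty-IP guard BEFORE Allow: the limiter treats
// "" as a normal map key, not a deny, and the loud-failure rule refuses
// un-attributable traffic rather than admitting it un-throttled.
func NewRateLimitMiddleware(limiter *IPRateLimiter, logger *slog.Logger, trustForwardedFor bool) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			ip := ClientIP(r, trustForwardedFor)
			if ip == "" || !limiter.Allow(ip) {
				// strconv.Quote defends against control-byte log injection;
				// the empty-IP deny renders remote="\"\"" so the two causes
				// stay distinguishable.
				logger.Warn("rate_limited", "remote", strconv.Quote(ip))
				http.Error(w, "", http.StatusTooManyRequests)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}
```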