User Story
As an operator who accidentally runs fly scale count 3 (or any equivalent on another host), I want the relay to refuse to start on any replica past the first with a loud, specific error, so that silent server-id routing breakage is caught before phones connect.
Context
The connection registry (internal/relay/registry.go) is in-memory per process. Two replicas serve disjoint registries: a phone connected to replica A cannot reach a binary connected to replica B. Docker + Fly.io makes fly scale count 3 a one-line command. Without a runtime guard, an operator or AI agent could scale out and silently break server-id routing for half the connections.
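To make the disjoint-registry failure mode concrete, here is a minimal sketch of an in-memory registry of the shape described above. The type and method names (Registry, Claim, Lookup) are illustrative, not the actual contents of internal/relay/registry.go; the point is that each process owns its own map, so two "replicas" never see each other's claims.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a hypothetical sketch of the per-process in-memory registry:
// a plain map guarded by a mutex. A second replica constructs its own empty
// map, so server-ids claimed on replica A are invisible to replica B.
type Registry struct {
	mu    sync.Mutex
	conns map[string]string // server-id -> connection label (stand-in for a real conn)
}

func NewRegistry() *Registry {
	return &Registry{conns: make(map[string]string)}
}

// Claim records which connection owns a server-id, in this process only.
func (r *Registry) Claim(serverID, conn string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[serverID] = conn
}

// Lookup resolves a server-id to its connection, in this process only.
func (r *Registry) Lookup(serverID string) (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.conns[serverID]
	return c, ok
}

func main() {
	a, b := NewRegistry(), NewRegistry() // two "replicas"
	a.Claim("srv-1", "binary@replicaA")
	_, okA := a.Lookup("srv-1") // phone routed to replica A: found
	_, okB := b.Lookup("srv-1") // phone routed to replica B: not found
	fmt.Println(okA, okB)       // → true false
}
```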
This ticket is the deterministic backstop in the belt-and-suspenders pair noted in docs/PROJECT-MEMORY.md — the doc half (sibling #64, already merged) handles the stochastic guard.
Acceptance Criteria
At process startup, the relay performs a self-check to detect whether it is part of a multi-instance deploy.
On positive detection, the process refuses to start (non-zero exit) and emits an ERROR-level log whose message contains the substring multi-instance deploy detected and names the bypass env var PYRYCODE_RELAY_SINGLE_INSTANCE so an operator can grep for it.
Setting PYRYCODE_RELAY_SINGLE_INSTANCE=1 causes the self-check to be skipped, and startup proceeds. Bypass is intended for emergencies and migration windows.
When no multi-instance signal is present (the common single-instance case), startup proceeds unchanged. Existing single-instance deployments must not regress.
Tests cover at minimum: (a) multi-instance signal present → refuses to start with the documented error substring; (b) PYRYCODE_RELAY_SINGLE_INSTANCE=1 → starts; (c) no multi-instance signal → starts.
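The three required cases above can be sketched as a table-driven check. Everything here is hypothetical scaffolding: assertSingleInstance is a stand-in for whatever the self-check ends up being called, and the env map models the process environment so the table is self-contained; only PYRYCODE_RELAY_SINGLE_INSTANCE and the "multi-instance deploy detected" substring are fixed by the acceptance criteria.

```go
package main

import (
	"fmt"
	"strings"
)

// assertSingleInstance is a minimal stand-in for the startup self-check so
// the table below runs on its own; the real signature and detection logic
// are the implementer's call.
func assertSingleInstance(env map[string]string, multiInstanceSignal bool) error {
	if env["PYRYCODE_RELAY_SINGLE_INSTANCE"] == "1" {
		return nil // documented bypass
	}
	if multiInstanceSignal {
		return fmt.Errorf("multi-instance deploy detected; set PYRYCODE_RELAY_SINGLE_INSTANCE=1 to bypass")
	}
	return nil
}

func main() {
	cases := []struct {
		name      string
		env       map[string]string
		signal    bool
		wantStart bool
	}{
		// (a) multi-instance signal present -> refuses to start
		{"signal present refuses", map[string]string{}, true, false},
		// (b) bypass set -> starts despite the signal
		{"bypass set starts", map[string]string{"PYRYCODE_RELAY_SINGLE_INSTANCE": "1"}, true, true},
		// (c) no signal (common single-instance case) -> starts
		{"no signal starts", map[string]string{}, false, true},
	}
	for _, c := range cases {
		err := assertSingleInstance(c.env, c.signal)
		if (err == nil) != c.wantStart {
			panic(c.name)
		}
		// The error must carry the documented grep-able substring.
		if err != nil && !strings.Contains(err.Error(), "multi-instance deploy detected") {
			panic("missing documented substring")
		}
		fmt.Println(c.name, "ok")
	}
}
```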
Technical Notes
The bypass env var name PYRYCODE_RELAY_SINGLE_INSTANCE is fixed — sibling relay: document v1 single-instance constraint in architecture.md #64 has merged and committed that literal name (and the value 1) into docs/architecture.md. Do not rename; if a rename is genuinely needed, raise it as a fresh ticket against docs/architecture.md first.
The parent ticket (relay: single-instance constraint — doc + startup self-check (registry is in-memory) #39) sketched two candidate detection mechanisms: (1) an env-var contract, PYRYCODE_RELAY_ASSERT_SINGLE_INSTANCE=true set by the deploy manifest, with refuse-to-start when the variable is absent; (2) host-exposed signals such as FLY_MACHINE_ID combined with a count assertion. Mechanism (1) would regress existing deployments that don't set the assertion (violating AC relay: WS upgrade for /v1/server — accept binary connection, validate headers, claim server-id #4), and the repo has no fly.toml or other host manifest under our control today (the deploy artifact is the Dockerfile only) — so mechanism (2), or any equivalent local-only host-env signal, is the natural choice. Architect confirms.
No new runtime dependency on platform-specific SDKs/API clients (e.g. no Fly machines API client). The detection mechanism must be either an env-var contract or a local-only signal (env vars exposed by the host).
The check belongs in the cmd/pyrycode-relay/main.go startup path or a small helper in internal/relay/. Architect's call.
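Wherever the check lands, the startup wiring could look roughly like the sketch below. The names assertSingleInstance, ErrMultiInstance, and the injected detect predicate are all hypothetical; in particular, the detection predicate is deliberately left abstract here, since choosing the local-only signal (and the count assertion to pair with something like FLY_MACHINE_ID) is exactly the open design question in the notes above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ErrMultiInstance carries the documented grep-able substring and names the
// bypass env var, per the acceptance criteria.
var ErrMultiInstance = errors.New(
	"multi-instance deploy detected; set PYRYCODE_RELAY_SINGLE_INSTANCE=1 to bypass")

// assertSingleInstance is a hypothetical helper (it could live in
// internal/relay/). getenv and detect are injected so the check is testable
// without touching the real environment.
func assertSingleInstance(getenv func(string) string, detect func() bool) error {
	if getenv("PYRYCODE_RELAY_SINGLE_INSTANCE") == "1" {
		return nil // emergency / migration-window bypass
	}
	if detect() {
		return ErrMultiInstance
	}
	return nil
}

func main() {
	detect := func() bool {
		// Placeholder: a real detector would combine local-only host signals
		// (e.g. FLY_MACHINE_ID) with a count assertion, per mechanism (2).
		// It must NOT call platform APIs — env vars only.
		return false
	}
	if err := assertSingleInstance(os.Getenv, detect); err != nil {
		fmt.Fprintln(os.Stderr, "ERROR:", err) // ERROR-level log
		os.Exit(1)                             // refuse to start, non-zero exit
	}
	fmt.Println("relay starting (single-instance check passed)")
	// ... normal relay startup continues here
}
```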
Out of scope
Implementing shared-registry support (Redis / NATS / sticky-session).
Health-check integration with platform autoscalers.
Size Estimate
XS
Split from #39.