relay: startup self-check refuses to run as multi-instance deploy #65

@ilmoniemi

Description

User Story

As an operator who accidentally runs fly scale count 3 (or any equivalent on another host), I want the relay to refuse to start on any replica past the first with a loud, specific error, so that silent server-id routing breakage is caught before phones connect.

Context

The connection registry (internal/relay/registry.go) is in-memory per process. Two replicas serve disjoint registries: a phone connected to replica A cannot reach a binary connected to replica B. Docker + Fly.io makes fly scale count 3 a one-line command. Without a runtime guard, an operator or AI agent could scale out and silently break server-id routing for every phone and binary that land on different replicas.
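
To make the failure mode concrete, here is a minimal sketch of two processes that each hold their own in-memory map. The Registry type, its methods, and the srv-1 id are hypothetical stand-ins for illustration; the real type in internal/relay/registry.go may look quite different.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a HYPOTHETICAL stand-in for the in-memory connection registry.
// It only illustrates "one map per process"; it is not the project's type.
type Registry struct {
	mu    sync.RWMutex
	conns map[string]string // server-id -> connection label
}

// NewRegistry returns an empty registry, as each replica would at startup.
func NewRegistry() *Registry {
	return &Registry{conns: make(map[string]string)}
}

// Claim records that the binary holding serverID is connected to this process.
func (r *Registry) Claim(serverID, conn string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[serverID] = conn
}

// Lookup reports the connection for serverID, if this process holds it.
func (r *Registry) Lookup(serverID string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	conn, ok := r.conns[serverID]
	return conn, ok
}

func main() {
	// Two replicas means two independent registries with no shared state.
	replicaA, replicaB := NewRegistry(), NewRegistry()

	// The binary happens to land on replica B.
	replicaB.Claim("srv-1", "binary-conn")

	// A phone that lands on replica A cannot find it.
	_, okA := replicaA.Lookup("srv-1")
	_, okB := replicaB.Lookup("srv-1")
	fmt.Println("replica A finds srv-1:", okA)
	fmt.Println("replica B finds srv-1:", okB)
}
```

Nothing errors in this scenario: replica A simply reports that the server-id is unknown, which is why the breakage is silent.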

This ticket is the deterministic backstop in the belt-and-suspenders pair noted in docs/PROJECT-MEMORY.md — the doc half (sibling #64, already merged) handles the stochastic guard.

Acceptance Criteria

  • At process startup, the relay performs a self-check intended to detect that it is part of a multi-instance deploy.
  • On positive detection, the process refuses to start (non-zero exit) and emits an ERROR-level log whose message contains the substring multi-instance deploy detected and names the bypass env var PYRYCODE_RELAY_SINGLE_INSTANCE so an operator can grep for it.
  • Setting PYRYCODE_RELAY_SINGLE_INSTANCE=1 causes the self-check to be skipped, and startup proceeds. Bypass is intended for emergencies and migration windows.
  • When no multi-instance signal is present (the common single-instance case), startup proceeds unchanged. Existing single-instance deployments must not regress.
  • Tests cover at minimum: (a) multi-instance signal present → refuses to start with the documented error substring; (b) PYRYCODE_RELAY_SINGLE_INSTANCE=1 → starts; (c) no multi-instance signal → starts.
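
One possible shape for the criteria above, as a sketch rather than the agreed design. Only the env var name PYRYCODE_RELAY_SINGLE_INSTANCE, the value 1, and the substring multi-instance deploy detected come from this ticket. The names selfCheck, ErrMultiInstance, and the injected detect function are assumptions made for illustration, and log/slog assumes Go 1.21 or later.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
	"os"
)

// bypassEnv is the fixed bypass variable named in this ticket.
const bypassEnv = "PYRYCODE_RELAY_SINGLE_INSTANCE"

// ErrMultiInstance carries the documented, greppable substring.
// The identifier itself is a hypothetical name.
var ErrMultiInstance = errors.New("multi-instance deploy detected")

// selfCheck returns nil when startup may proceed.
// getenv and detect are injected so tests never touch the process environment.
// detect is a PLACEHOLDER for whichever signal the architect chooses.
func selfCheck(getenv func(string) string, detect func(func(string) string) bool) error {
	if getenv(bypassEnv) == "1" {
		return nil // explicit bypass: skip the check entirely
	}
	if detect(getenv) {
		return fmt.Errorf("%w: set %s=1 to bypass (emergencies and migration windows only)",
			ErrMultiInstance, bypassEnv)
	}
	return nil // no signal: the common single-instance case
}

func main() {
	// Placeholder detector that never fires; a real one would inspect host env vars.
	detect := func(func(string) string) bool { return false }

	if err := selfCheck(os.Getenv, detect); err != nil {
		slog.Error(err.Error()) // ERROR-level log whose message holds the substring
		os.Exit(1)              // non-zero exit: refuse to start
	}
	fmt.Println("self-check passed")
}
```

Injecting getenv keeps cases (a), (b), and (c) table-testable with a plain map, and checking the bypass first means the detector is never consulted when the operator has opted out.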

Technical Notes

  • The bypass env var name PYRYCODE_RELAY_SINGLE_INSTANCE is fixed: sibling #64 (relay: document v1 single-instance constraint in architecture.md) has merged and committed that literal name (and the value 1) into docs/architecture.md. Do not rename; if a rename is genuinely needed, raise it as a fresh ticket against docs/architecture.md first.
  • The body of parent issue #39 (relay: single-instance constraint — doc + startup self-check (registry is in-memory)) sketched two candidate detection mechanisms: (1) an env-var contract, PYRYCODE_RELAY_ASSERT_SINGLE_INSTANCE=true, set by the deploy manifest, with refuse-to-start when it is absent; (2) host-exposed signals like FLY_MACHINE_ID combined with a count assertion. Mechanism (1) would regress existing deployments that don't set the assertion, which violates acceptance criterion 4 above (existing single-instance deployments must not regress), and the repo has no fly.toml or other host manifest under our control today (the deploy artifact is the Dockerfile only). Mechanism (2), or any equivalent local-only host-env signal, is therefore the natural choice. Architect confirms.
  • No new runtime dependency on platform-specific SDKs/API clients (e.g. no Fly machines API client). The detection mechanism must be either an env-var contract or a local-only signal (env vars exposed by the host).
  • The check belongs in the cmd/pyrycode-relay/main.go startup path or a small helper in internal/relay/. Architect's call.
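
As a rough illustration of a local-only signal that needs no platform SDK, the sketch below reads nothing but environment variables. FLY_MACHINE_ID is the host variable named in the notes above; EXAMPLE_EXPECTED_INSTANCE_COUNT is a made-up name standing in for the unspecified "count assertion", and multiInstanceSignal is likewise hypothetical. Which signal is actually reliable remains the architect's call.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const (
	// hostIDEnv is the host-exposed id mentioned in the ticket.
	hostIDEnv = "FLY_MACHINE_ID"
	// countEnv is HYPOTHETICAL: a stand-in for whatever count assertion is chosen.
	countEnv = "EXAMPLE_EXPECTED_INSTANCE_COUNT"
)

// multiInstanceSignal reports whether local env vars indicate more than one
// instance. It makes no network calls and uses no platform API client.
func multiInstanceSignal(getenv func(string) string) bool {
	if getenv(hostIDEnv) == "" {
		return false // host does not expose the id: no signal
	}
	n, err := strconv.Atoi(getenv(countEnv))
	if err != nil {
		return false // absent or malformed count: treat as no signal
	}
	return n > 1
}

func main() {
	fmt.Println("multi-instance signal:", multiInstanceSignal(os.Getenv))
}
```

Treating a missing or malformed value as "no signal" keeps existing single-instance deployments starting unchanged. The trade-off is that the check only fires when something on the host really does expose the count.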

Out of scope

  • Implementing shared-registry support (Redis / NATS / sticky-session).
  • Health-check integration with platform autoscalers.

Size Estimate

XS

Split from #39.

Metadata

Assignees

No one assigned

    Labels

    security-sensitive (Touches auth, crypto, or internet-exposed input paths)
    size:xs (Tiny ticket: <30 lines production code)
