diff --git a/docs/architecture.md b/docs/architecture.md index 52c1709..6089017 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -30,6 +30,38 @@ The wire protocol lives in [`pyrycode/pyrycode/docs/protocol-mobile.md`](https:/ - Implement push notifications. The binary calls APNs/FCM directly. - Operate on multiple users / tenants in v1. One operator, many users' server-ids. +## Single-instance constraint (v1) + +v1 is single-instance. Operators MUST NOT scale horizontally (`fly scale count > 1`, multiple `docker run` of the same image behind a load balancer, etc.). + +### Why + +The connection registry (`internal/relay/registry.go`) lives in process memory. Two replicas hold two disjoint registries. When a phone and its binary land on different replicas — a coincidence the load balancer is free to produce — the phone's replica sees a server-id miss in its local table and closes with `4404`, even though the binary is connected to a sibling replica. The failure is silent: nothing in the relay logs identifies "the binary is on the other replica" as the cause, because no replica knows the others exist. + +### What multi-instance would require + +Documented as future work, not a commitment. Neither path is planned for v1, and adopting either would invalidate the bypass env var below. + +1. **Shared registry.** A cross-replica store that every replica consults on every `/v1/server` claim and `/v1/client` lookup — for example Redis pub/sub keyed on server-id, or NATS subjects per server-id. Adds a network hop to every routing decision and a new failure mode (registry down → relay cannot route). +2. **Sticky-session-on-server-id at the LB layer.** The load balancer hashes on the `x-pyrycode-server` header so that all traffic for a given server-id reaches the same replica. Avoids the shared-store cost but requires LB awareness of the header; most L4 load balancers do not route on request headers. + +### The `PYRYCODE_RELAY_SINGLE_INSTANCE` bypass + +A startup self-check (see #65) refuses to run when the relay detects it is one of several instances. Setting + +```text +PYRYCODE_RELAY_SINGLE_INSTANCE=1 +``` + +skips that check. The variable name matches what the relay binary checks at startup. + +Intended uses: + +- **Emergency rollback** — the self-check itself misfires and blocks startup. +- **Migration windows** — the operator is mid-switch to a multi-instance topology and accepts the routing breakage transiently. + +**NOT recommended for production.** Setting it permanently silences a check whose entire purpose is to catch the silent-routing-failure mode described above. + ## Threat model Wire-protocol threats live in the protocol spec's [Security model](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#security-model). Operational threats specific to the relay binary as a deployed process — deploy, supply chain, DoS, log hygiene, cert handling, TLS config, error-leakage — live in [`docs/threat-model.md`](./threat-model.md). diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index 34e73c9..2e28e6a 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -30,7 +30,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Architecture -- [System overview](../architecture.md) — top-level: stateless WS router between phones and pyry binaries. (Lives at `docs/architecture.md`; not yet split into `architecture/`.) +- [System overview](../architecture.md) — top-level: stateless WS router between phones and pyry binaries. Names the v1 single-instance constraint (in-memory registry → two replicas hold disjoint registries → silent `4404`), the two multi-instance paths documented as future work (shared registry; sticky-on-`x-pyrycode-server`), and the `PYRYCODE_RELAY_SINGLE_INSTANCE=1` bypass for the #65 startup self-check (#64). (Lives at `docs/architecture.md`; not yet split into `architecture/`.) ## Cross-cutting diff --git a/docs/knowledge/codebase/64.md b/docs/knowledge/codebase/64.md new file mode 100644 index 0000000..f02faef --- /dev/null +++ b/docs/knowledge/codebase/64.md @@ -0,0 +1,40 @@ +# Ticket #64 — document v1 single-instance constraint in `docs/architecture.md` + +Doc-only. Makes the implicit "v1 is single-instance" constraint explicit in the canonical architecture doc, so an operator (or AI agent) eyeing `fly scale count > 1` reads the failure mode before issuing the command. Stochastic half of the belt-and-suspenders pair noted in `docs/PROJECT-MEMORY.md`; deterministic backstop (a startup self-check that refuses to run multi-instance) ships in sibling #65. + +## Implementation + +- **`docs/architecture.md`** — new H2 section `## Single-instance constraint (v1)` placed between *What this binary does NOT do* and *Threat model*. Three subsections: + - **Why** — names the in-memory per-process registry (`internal/relay/registry.go`), the two-disjoint-registries failure mode, and the `4404` symptom an operator would see for any phone↔binary pair the load balancer happens to place on different replicas. Notes the failure is silent: no replica knows the others exist, so logs don't say "the binary is on the other replica." + - **What multi-instance would require** — two enumerated paths labelled future-work, not commitments: shared registry (Redis pub/sub on server-id, NATS subjects per server-id) with the routing-decision network-hop cost and registry-down failure mode called out; sticky-session-on-`x-pyrycode-server` at the LB layer with the L4-LBs-don't-route-on-headers caveat. + - **The `PYRYCODE_RELAY_SINGLE_INSTANCE` bypass** — exact env var name, what setting it to `1` skips (the #65 self-check), the two intended uses (emergency rollback when the check itself misfires; migration windows mid-switch to multi-instance), and the explicit "NOT recommended for production" because setting it permanently silences the very check that catches the silent-routing-failure mode. The shared-contract sentence framed operator-facing ("the variable name matches what the relay binary checks at startup") per the architect's preference. +- **No cross-link to `docs/threat-model.md`.** Architect-recommended skip: operator misconfiguration silently degrading routing is a foot-gun category, not an adversarial threat — conflating them on a "see also" line is the worse outcome. Filed implicitly for a future "operator foot-guns" section when one is added to the threat model. +- **No code changes.** Out-of-scope by design — the enforcement is #65. + +## Acceptance-criteria literals (the canary set) + +- `grep -n single-instance docs/architecture.md` → AC-1 (heading contains the phrase). +- `grep -n PYRYCODE_RELAY_SINGLE_INSTANCE docs/architecture.md` → AC-4 (exact env var string present). + +Both are spec-mandated literal checks; mirroring them in this note so a future search for the constraint hits the same anchor strings used at AC time. + +## Cross-ticket contract with #65 + +The env var name `PYRYCODE_RELAY_SINGLE_INSTANCE` is the shared surface between this doc and the enforcement in #65. Resolution rule, per the ticket and spec: whichever ticket merges second updates the other to match. If #65 lands with a different name, this section is amended in #65's change set, or this doc PR is re-opened. The doc and the code agree on the emitted bypass env var name at merge time, not at spec time. + +## Patterns established + +- **Doc-only ticket as the stochastic half of a belt-and-suspenders pair.** When a constraint needs both an advisory prose anchor and a deterministic startup check, splitting them across two tickets (#64 prose, #65 code) keeps each PR small and lets the doc land without waiting on the code. The shared env var name is the contract; the merge-second-fixes-it rule resolves the contract divergence risk. This is the same pattern that `docs/PROJECT-MEMORY.md` and per-ticket `codebase/.md` rules implement at the agent layer — stochastic advisory + deterministic backstop, both labelled as such. +- **AC-as-literal-grep.** When the architect specifies "the heading text contains 'single-instance'" and "the section contains the exact string `PYRYCODE_RELAY_SINGLE_INSTANCE'", the verification is a literal grep before commit, not a re-read. Cheap, mechanical, repeatable — and survives any future agent rewriting the section for prose flow. + +## Lessons learned + +- **A doc that names code-side behaviour by ticket number ages well, prose pinning a specific identifier ages badly.** The bypass subsection refers to "the startup self-check shipped in #65", not to a function name or flag identifier on the #65 side that does not exist yet. If #65's architect picks a different env var name, the doc updates a single string; if the doc had pinned a function or struct name, the rename surface would be larger. +- **Skip a tempting cross-link when the categories don't match.** The ticket allowed a `docs/threat-model.md` link "only if natural". Operator misconfiguration is not on the threat-model's adversarial axis. The architect's call to skip the link and defer to a future "foot-guns" section is the right shape — adding a half-fit cross-link now makes the eventual proper section harder to introduce, because the wrong link is already there. + +## Cross-links + +- [Architecture § Single-instance constraint (v1)](../../architecture.md#single-instance-constraint-v1) — the section this ticket added. +- [Spec: 64-single-instance-architecture-doc](../../specs/architecture/64-single-instance-architecture-doc.md) — architect's design (file structure, prose style constraints, AC literal-grep rules). +- [#65 — startup self-check](https://github.com/pyrycode/pyrycode-relay/issues/65) — the deterministic enforcement half; shares the `PYRYCODE_RELAY_SINGLE_INSTANCE` env var contract. +- [Docker image § single-instance](../features/docker-image.md) — prior prose pattern ("single-instance enforcement is the host's problem") referenced by the spec; this ticket does not duplicate it. diff --git a/docs/specs/architecture/64-single-instance-architecture-doc.md b/docs/specs/architecture/64-single-instance-architecture-doc.md new file mode 100644 index 0000000..720a291 --- /dev/null +++ b/docs/specs/architecture/64-single-instance-architecture-doc.md @@ -0,0 +1,109 @@ +# Spec: document v1 single-instance constraint in `docs/architecture.md` + +Ticket: [#64](https://github.com/pyrycode/pyrycode-relay/issues/64). Doc-only, XS. Split from #39. + +## Files to read first + +- `docs/architecture.md` (whole file, 36 lines) — the canonical structure to extend. Note the existing two-list pattern ("What this binary does" / "does NOT do") and the trailing **Threat model** section that cross-links to `docs/threat-model.md`. The new section slots between those. +- `internal/relay/registry.go:1-60` — confirm the registry's in-memory, per-process nature. Skim only; the doc summarises the constraint, it does not re-document the type. +- `docs/threat-model.md:1-20` — the cross-link target if you choose to add one. Note the format: `[Section name](./threat-model.md#anchor)` works because both files live in `docs/`. +- `docs/knowledge/features/docker-image.md:70-75` — the existing prose pattern for "single-instance enforcement is the host's problem" — reuse the same tone, do not duplicate the content. +- `docs/specs/architecture/50-ip-rate-limiter.md:260` and `docs/knowledge/codebase/50.md:43` — prior in-tree uses of the phrase "v1 is single-instance"; mirror that phrasing for consistency. + +## Context + +The ticket is the **doc (stochastic) half** of the belt-and-suspenders pair noted in `docs/PROJECT-MEMORY.md`. The deterministic backstop — a startup self-check that refuses to run when the relay detects it is one of several instances — ships in sibling #65. + +The risk being mitigated: Docker + Fly.io makes `fly scale count 3` a one-line command. The connection registry (`internal/relay/registry.go`) is in-memory per process, so two replicas hold two disjoint registries: a phone connected to replica A cannot reach a binary connected to replica B because replica A's server-id table never sees replica B's claim. An operator (or AI agent) who scales out without reading the code will silently break server-id routing for any phone↔binary pair that load-balancer-coincidence places on different replicas. Today `docs/architecture.md` does not call this out; it lists what the relay does and does not do, but the single-instance constraint is implicit. This spec makes it explicit. + +No code changes. Doc-only. + +## Design + +Add a new top-level section to `docs/architecture.md`. The heading **must contain the phrase "single-instance"** (AC-1). Place it after **"What this binary does NOT do"** and before **"Threat model"** — the section is a constraint on deployment topology, which sits naturally between the binary's behavioural contract and the operational threat surface. + +### Section structure + +The section is short — three subsections plus an optional cross-link line. Total target: 25–40 lines of prose, no code blocks needed except for the env var name and a single example deployment command. + +#### Heading + +```markdown +## Single-instance constraint (v1) +``` + +The exact wording is not load-bearing as long as the heading text contains "single-instance" (the AC-1 anchor). Match the title-case style of existing H2s in the file ("What this binary does", "Threat model"). + +#### Subsection 1 — Why v1 is single-instance + +State the structural reason directly. Required content (paraphrase, do not copy verbatim): + +- The connection registry (`internal/relay/registry.go`) lives in process memory. +- Two replicas → two disjoint registries. +- Routing fails when a phone and its binary land on different replicas: the phone's replica sees a server-id miss and returns `4404`, even though the binary is connected to a sibling replica. +- v1 ships single-instance for this reason. Operators MUST NOT scale horizontally (`fly scale count > 1`, multiple `docker run` of the same image behind a load balancer, etc.). + +Anchor the registry claim with an inline reference to `internal/relay/registry.go` (no line numbers — the file is small and they will rot). + +#### Subsection 2 — What multi-instance would require + +Enumerate the two known paths (AC-3), documented as **future work, not a commitment**. Phrase as "would need", not "we will add": + +1. **Shared registry.** A cross-replica store that every replica consults on every `/v1/server` claim and `/v1/client` lookup. Examples: Redis pub/sub keyed on server-id, NATS subjects per server-id. Imposes a network hop on every routing decision and a new failure mode (registry down → relay can't route). +2. **Sticky-session-on-server-id at the LB layer.** The load balancer hashes on the `x-pyrycode-server` header so that all traffic for a given server-id reaches the same replica. Avoids the shared-store cost but requires LB awareness of the header and breaks if the LB does not support request-routing on headers (most L4 LBs do not). + +Make explicit that **neither is planned for v1** and that adopting either would invalidate the bypass env var contract below. + +#### Subsection 3 — The `PYRYCODE_RELAY_SINGLE_INSTANCE` bypass + +Document the env var (AC-4). Required content: + +- The env var name is exactly `PYRYCODE_RELAY_SINGLE_INSTANCE`. Setting it to `1` skips the startup self-check shipped in #65. +- Intended use cases: **emergency rollback** (the self-check itself misfires and blocks startup), and **migration windows** (operator is in the middle of switching to a multi-instance topology and accepts the routing breakage transiently). +- **NOT recommended for production.** Setting it permanently silences a check whose entire purpose is to catch the silent-routing-failure mode this section documents. +- The variable name is the shared contract between this document and #65. If #65 lands with a different name, this section is updated in the same change set. (This sentence may be omitted if it reads as agent-internal scaffolding — the operator-facing version is "the variable name matches what the relay binary checks at startup".) + +A one-line code block showing the variable form is fine: + +```text +PYRYCODE_RELAY_SINGLE_INSTANCE=1 +``` + +#### Optional cross-link + +The ticket notes a cross-link to `docs/threat-model.md` is allowed only if natural. Recommendation: **skip it.** The threat model catalogues adversarial threats (DoS, supply chain, TLS, log hygiene). Operator misconfiguration that silently degrades routing is a different category — it is not an attack, it is a foot-gun. Adding a "see also" link conflates them. If a future ticket adds an "operator foot-guns" section to the threat model, link it then. + +### Constraints on prose style + +- Match the existing tone of `docs/architecture.md`: declarative, present-tense, no marketing voice. The existing file says "Read message payloads. Frames are opaque bytes." That is the target register. +- Do not duplicate the per-feature detail that already lives in `docs/knowledge/features/docker-image.md`. The architecture doc states the constraint; the features doc states host-wiring detail. Refer by name (`#65`) for the enforcement detail rather than restating it. +- Do not name a target release for multi-instance support. The constraint is "v1 is single-instance"; v2 is not a commitment. + +## Concurrency model + +N/A — doc-only. + +## Error handling + +N/A — doc-only. + +## Testing strategy + +No automated tests. Verification is by reading the diff against the acceptance criteria: + +1. The new section's heading text contains the substring "single-instance" (case-insensitive grep against the rendered heading). +2. The section names the in-memory registry, the two-disjoint-registries failure mode, and the resulting `4404` symptom. +3. The section enumerates at least the two multi-instance paths (shared registry, sticky-on-header) and labels them as future work, not commitments. +4. The section contains the exact string `PYRYCODE_RELAY_SINGLE_INSTANCE` and explains: what setting it skips, when it is acceptable, and that it is not recommended for production. +5. No file outside `docs/architecture.md` is changed. + +A developer who finishes the edit should run `grep -n single-instance docs/architecture.md` and `grep -n PYRYCODE_RELAY_SINGLE_INSTANCE docs/architecture.md` to confirm AC-1 and AC-4 are literally satisfied before committing. + +## Open questions + +1. **Env var name.** The ticket nominates `PYRYCODE_RELAY_SINGLE_INSTANCE`. The architect for #65 may pick a different name. Resolution at merge time, not at spec time: whichever of #64 and #65 lands second updates the other to match. Both this doc PR and the #65 implementation PR carry the same env var name at merge. +2. **Future "operator foot-guns" threat-model section.** Out of scope for this ticket. Filed implicitly as something to consider when a sibling ticket touches `docs/threat-model.md`. + +## Why not split further + +The work is one heading + ~30 lines of prose in one file. There is no fan-out, no cross-package coordination, no test surface. Splitting would produce two tickets that each rewrite the same paragraph.