Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 32 additions & 0 deletions docs/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,38 @@ The wire protocol lives in [`pyrycode/pyrycode/docs/protocol-mobile.md`](https:/
- Implement push notifications. The binary calls APNs/FCM directly.
- Operate on multiple users / tenants in v1. One operator, many users' server-ids.

## Single-instance constraint (v1)

v1 is single-instance. Operators MUST NOT scale horizontally (`fly scale count > 1`, multiple `docker run` of the same image behind a load balancer, etc.).

### Why

The connection registry (`internal/relay/registry.go`) lives in process memory. Two replicas hold two disjoint registries. When a phone and its binary land on different replicas — a coincidence the load balancer is free to produce — the phone's replica sees a server-id miss in its local table and closes with `4404`, even though the binary is connected to a sibling replica. The failure is silent: nothing in the relay logs identifies "the binary is on the other replica" as the cause, because no replica knows the others exist.

### What multi-instance would require

Documented as future work, not a commitment. Neither path is planned for v1, and adopting either would invalidate the bypass env var below.

1. **Shared registry.** A cross-replica store that every replica consults on every `/v1/server` claim and `/v1/client` lookup — for example Redis pub/sub keyed on server-id, or NATS subjects per server-id. Adds a network hop to every routing decision and a new failure mode (registry down → relay cannot route).
2. **Sticky-session-on-server-id at the LB layer.** The load balancer hashes on the `x-pyrycode-server` header so that all traffic for a given server-id reaches the same replica. Avoids the shared-store cost but requires LB awareness of the header; most L4 load balancers do not route on request headers.

### The `PYRYCODE_RELAY_SINGLE_INSTANCE` bypass

A startup self-check (see #65) refuses to run when the relay detects it is one of several instances. Setting

```text
PYRYCODE_RELAY_SINGLE_INSTANCE=1
```

skips that check. The variable name matches what the relay binary checks at startup.

Intended uses:

- **Emergency rollback** — the self-check itself misfires and blocks startup.
- **Migration windows** — the operator is mid-switch to a multi-instance topology and accepts the routing breakage transiently.

**NOT recommended for production.** Setting it permanently silences a check whose entire purpose is to catch the silent-routing-failure mode described above.

## Threat model

Wire-protocol threats live in the protocol spec's [Security model](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#security-model). Operational threats specific to the relay binary as a deployed process — deploy, supply chain, DoS, log hygiene, cert handling, TLS config, error-leakage — live in [`docs/threat-model.md`](./threat-model.md).
2 changes: 1 addition & 1 deletion docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Architecture

- [System overview](../architecture.md) — top-level: stateless WS router between phones and pyry binaries. (Lives at `docs/architecture.md`; not yet split into `architecture/`.)
- [System overview](../architecture.md) — top-level: stateless WS router between phones and pyry binaries. Names the v1 single-instance constraint (in-memory registry → two replicas hold disjoint registries → silent `4404`), the two multi-instance paths documented as future work (shared registry; sticky-on-`x-pyrycode-server`), and the `PYRYCODE_RELAY_SINGLE_INSTANCE=1` bypass for the #65 startup self-check (#64). (Lives at `docs/architecture.md`; not yet split into `architecture/`.)

## Cross-cutting

Expand Down
40 changes: 40 additions & 0 deletions docs/knowledge/codebase/64.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# Ticket #64 — document v1 single-instance constraint in `docs/architecture.md`

Doc-only. Makes the implicit "v1 is single-instance" constraint explicit in the canonical architecture doc, so an operator (or AI agent) eyeing `fly scale count > 1` reads the failure mode before issuing the command. Stochastic half of the belt-and-suspenders pair noted in `docs/PROJECT-MEMORY.md`; deterministic backstop (a startup self-check that refuses to run multi-instance) ships in sibling #65.

## Implementation

- **`docs/architecture.md`** — new H2 section `## Single-instance constraint (v1)` placed between *What this binary does NOT do* and *Threat model*. Three subsections:
- **Why** — names the in-memory per-process registry (`internal/relay/registry.go`), the two-disjoint-registries failure mode, and the `4404` symptom an operator would see for any phone↔binary pair the load balancer happens to place on different replicas. Notes the failure is silent: no replica knows the others exist, so logs don't say "the binary is on the other replica."
- **What multi-instance would require** — two enumerated paths labelled future-work, not commitments: shared registry (Redis pub/sub on server-id, NATS subjects per server-id) with the routing-decision network-hop cost and registry-down failure mode called out; sticky-session-on-`x-pyrycode-server` at the LB layer with the L4-LBs-don't-route-on-headers caveat.
- **The `PYRYCODE_RELAY_SINGLE_INSTANCE` bypass** — exact env var name, what setting it to `1` skips (the #65 self-check), the two intended uses (emergency rollback when the check itself misfires; migration windows mid-switch to multi-instance), and the explicit "NOT recommended for production" because setting it permanently silences the very check that catches the silent-routing-failure mode. The shared-contract sentence framed operator-facing ("the variable name matches what the relay binary checks at startup") per the architect's preference.
- **No cross-link to `docs/threat-model.md`.** Architect-recommended skip: operator misconfiguration silently degrading routing is a foot-gun category, not an adversarial threat — conflating them on a "see also" line is the worse outcome. Filed implicitly for a future "operator foot-guns" section when one is added to the threat model.
- **No code changes.** Out-of-scope by design — the enforcement is #65.

## Acceptance-criteria literals (the canary set)

- `grep -n single-instance docs/architecture.md` → AC-1 (heading contains the phrase).
- `grep -n PYRYCODE_RELAY_SINGLE_INSTANCE docs/architecture.md` → AC-4 (exact env var string present).

Both are spec-mandated literal checks; mirroring them in this note so a future search for the constraint hits the same anchor strings used at AC time.

## Cross-ticket contract with #65

The env var name `PYRYCODE_RELAY_SINGLE_INSTANCE` is the shared surface between this doc and the enforcement in #65. Resolution rule, per the ticket and spec: whichever ticket merges second updates the other to match. If #65 lands with a different name, this section is amended in #65's change set, or this doc PR is re-opened. The doc and the code agree on the emitted bypass env var name at merge time, not at spec time.

## Patterns established

- **Doc-only ticket as the stochastic half of a belt-and-suspenders pair.** When a constraint needs both an advisory prose anchor and a deterministic startup check, splitting them across two tickets (#64 prose, #65 code) keeps each PR small and lets the doc land without waiting on the code. The shared env var name is the contract; the merge-second-fixes-it rule resolves the contract divergence risk. This is the same pattern that `docs/PROJECT-MEMORY.md` and per-ticket `codebase/<N>.md` rules implement at the agent layer — stochastic advisory + deterministic backstop, both labelled as such.
- **AC-as-literal-grep.** When the architect specifies "the heading text contains 'single-instance'" and "the section contains the exact string `PYRYCODE_RELAY_SINGLE_INSTANCE'", the verification is a literal grep before commit, not a re-read. Cheap, mechanical, repeatable — and survives any future agent rewriting the section for prose flow.

## Lessons learned

- **A doc that names code-side behaviour by ticket number ages well, prose pinning a specific identifier ages badly.** The bypass subsection refers to "the startup self-check shipped in #65", not to a function name or flag identifier on the #65 side that does not exist yet. If #65's architect picks a different env var name, the doc updates a single string; if the doc had pinned a function or struct name, the rename surface would be larger.
- **Skip a tempting cross-link when the categories don't match.** The ticket allowed a `docs/threat-model.md` link "only if natural". Operator misconfiguration is not on the threat-model's adversarial axis. The architect's call to skip the link and defer to a future "foot-guns" section is the right shape — adding a half-fit cross-link now makes the eventual proper section harder to introduce, because the wrong link is already there.

## Cross-links

- [Architecture § Single-instance constraint (v1)](../../architecture.md#single-instance-constraint-v1) — the section this ticket added.
- [Spec: 64-single-instance-architecture-doc](../../specs/architecture/64-single-instance-architecture-doc.md) — architect's design (file structure, prose style constraints, AC literal-grep rules).
- [#65 — startup self-check](https://github.com/pyrycode/pyrycode-relay/issues/65) — the deterministic enforcement half; shares the `PYRYCODE_RELAY_SINGLE_INSTANCE` env var contract.
- [Docker image § single-instance](../features/docker-image.md) — prior prose pattern ("single-instance enforcement is the host's problem") referenced by the spec; this ticket does not duplicate it.
109 changes: 109 additions & 0 deletions docs/specs/architecture/64-single-instance-architecture-doc.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
# Spec: document v1 single-instance constraint in `docs/architecture.md`

Ticket: [#64](https://github.com/pyrycode/pyrycode-relay/issues/64). Doc-only, XS. Split from #39.

## Files to read first

- `docs/architecture.md` (whole file, 36 lines) — the canonical structure to extend. Note the existing two-list pattern ("What this binary does" / "does NOT do") and the trailing **Threat model** section that cross-links to `docs/threat-model.md`. The new section slots between those.
- `internal/relay/registry.go:1-60` — confirm the registry's in-memory, per-process nature. Skim only; the doc summarises the constraint, it does not re-document the type.
- `docs/threat-model.md:1-20` — the cross-link target if you choose to add one. Note the format: `[Section name](./threat-model.md#anchor)` works because both files live in `docs/`.
- `docs/knowledge/features/docker-image.md:70-75` — the existing prose pattern for "single-instance enforcement is the host's problem" — reuse the same tone, do not duplicate the content.
- `docs/specs/architecture/50-ip-rate-limiter.md:260` and `docs/knowledge/codebase/50.md:43` — prior in-tree uses of the phrase "v1 is single-instance"; mirror that phrasing for consistency.

## Context

The ticket is the **doc (stochastic) half** of the belt-and-suspenders pair noted in `docs/PROJECT-MEMORY.md`. The deterministic backstop — a startup self-check that refuses to run when the relay detects it is one of several instances — ships in sibling #65.

The risk being mitigated: Docker + Fly.io makes `fly scale count 3` a one-line command. The connection registry (`internal/relay/registry.go`) is in-memory per process, so two replicas hold two disjoint registries: a phone connected to replica A cannot reach a binary connected to replica B because replica A's server-id table never sees replica B's claim. An operator (or AI agent) who scales out without reading the code will silently break server-id routing for any phone↔binary pair that load-balancer-coincidence places on different replicas. Today `docs/architecture.md` does not call this out; it lists what the relay does and does not do, but the single-instance constraint is implicit. This spec makes it explicit.

No code changes. Doc-only.

## Design

Add a new top-level section to `docs/architecture.md`. The heading **must contain the phrase "single-instance"** (AC-1). Place it after **"What this binary does NOT do"** and before **"Threat model"** — the section is a constraint on deployment topology, which sits naturally between the binary's behavioural contract and the operational threat surface.

### Section structure

The section is short — three subsections plus an optional cross-link line. Total target: 25–40 lines of prose, no code blocks needed except for the env var name and a single example deployment command.

#### Heading

```markdown
## Single-instance constraint (v1)
```

The exact wording is not load-bearing as long as the heading text contains "single-instance" (the AC-1 anchor). Match the title-case style of existing H2s in the file ("What this binary does", "Threat model").

#### Subsection 1 — Why v1 is single-instance

State the structural reason directly. Required content (paraphrase, do not copy verbatim):

- The connection registry (`internal/relay/registry.go`) lives in process memory.
- Two replicas → two disjoint registries.
- Routing fails when a phone and its binary land on different replicas: the phone's replica sees a server-id miss and returns `4404`, even though the binary is connected to a sibling replica.
- v1 ships single-instance for this reason. Operators MUST NOT scale horizontally (`fly scale count > 1`, multiple `docker run` of the same image behind a load balancer, etc.).

Anchor the registry claim with an inline reference to `internal/relay/registry.go` (no line numbers — the file is small and they will rot).

#### Subsection 2 — What multi-instance would require

Enumerate the two known paths (AC-3), documented as **future work, not a commitment**. Phrase as "would need", not "we will add":

1. **Shared registry.** A cross-replica store that every replica consults on every `/v1/server` claim and `/v1/client` lookup. Examples: Redis pub/sub keyed on server-id, NATS subjects per server-id. Imposes a network hop on every routing decision and a new failure mode (registry down → relay can't route).
2. **Sticky-session-on-server-id at the LB layer.** The load balancer hashes on the `x-pyrycode-server` header so that all traffic for a given server-id reaches the same replica. Avoids the shared-store cost but requires LB awareness of the header and breaks if the LB does not support request-routing on headers (most L4 LBs do not).

Make explicit that **neither is planned for v1** and that adopting either would invalidate the bypass env var contract below.

#### Subsection 3 — The `PYRYCODE_RELAY_SINGLE_INSTANCE` bypass

Document the env var (AC-4). Required content:

- The env var name is exactly `PYRYCODE_RELAY_SINGLE_INSTANCE`. Setting it to `1` skips the startup self-check shipped in #65.
- Intended use cases: **emergency rollback** (the self-check itself misfires and blocks startup), and **migration windows** (operator is in the middle of switching to a multi-instance topology and accepts the routing breakage transiently).
- **NOT recommended for production.** Setting it permanently silences a check whose entire purpose is to catch the silent-routing-failure mode this section documents.
- The variable name is the shared contract between this document and #65. If #65 lands with a different name, this section is updated in the same change set. (This sentence may be omitted if it reads as agent-internal scaffolding — the operator-facing version is "the variable name matches what the relay binary checks at startup".)

A one-line code block showing the variable form is fine:

```text
PYRYCODE_RELAY_SINGLE_INSTANCE=1
```

#### Optional cross-link

The ticket notes a cross-link to `docs/threat-model.md` is allowed only if natural. Recommendation: **skip it.** The threat model catalogues adversarial threats (DoS, supply chain, TLS, log hygiene). Operator misconfiguration that silently degrades routing is a different category — it is not an attack, it is a foot-gun. Adding a "see also" link conflates them. If a future ticket adds an "operator foot-guns" section to the threat model, link it then.

### Constraints on prose style

- Match the existing tone of `docs/architecture.md`: declarative, present-tense, no marketing voice. The existing file says "Read message payloads. Frames are opaque bytes." That is the target register.
- Do not duplicate the per-feature detail that already lives in `docs/knowledge/features/docker-image.md`. The architecture doc states the constraint; the features doc states host-wiring detail. Refer by name (`#65`) for the enforcement detail rather than restating it.
- Do not name a target release for multi-instance support. The constraint is "v1 is single-instance"; v2 is not a commitment.

## Concurrency model

N/A — doc-only.

## Error handling

N/A — doc-only.

## Testing strategy

No automated tests. Verification is by reading the diff against the acceptance criteria:

1. The new section's heading text contains the substring "single-instance" (case-insensitive grep against the rendered heading).
2. The section names the in-memory registry, the two-disjoint-registries failure mode, and the resulting `4404` symptom.
3. The section enumerates at least the two multi-instance paths (shared registry, sticky-on-header) and labels them as future work, not commitments.
4. The section contains the exact string `PYRYCODE_RELAY_SINGLE_INSTANCE` and explains: what setting it skips, when it is acceptable, and that it is not recommended for production.
5. No file outside `docs/architecture.md` is changed.

A developer who finishes the edit should run `grep -n single-instance docs/architecture.md` and `grep -n PYRYCODE_RELAY_SINGLE_INSTANCE docs/architecture.md` to confirm AC-1 and AC-4 are literally satisfied before committing.

## Open questions

1. **Env var name.** The ticket nominates `PYRYCODE_RELAY_SINGLE_INSTANCE`. The architect for #65 may pick a different name. Resolution at merge time, not at spec time: whichever of #64 and #65 lands second updates the other to match. Both this doc PR and the #65 implementation PR carry the same env var name at merge.
2. **Future "operator foot-guns" threat-model section.** Out of scope for this ticket. Filed implicitly as something to consider when a sibling ticket touches `docs/threat-model.md`.

## Why not split further

The work is one heading + ~30 lines of prose in one file. There is no fan-out, no cross-package coordination, no test surface. Splitting would produce two tickets that each rewrite the same paragraph.