Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion docs/PROJECT-MEMORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,10 +9,10 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
| Go skeleton | Done | `cmd/pyrycode-relay/main.go`, `internal/relay/doc.go` |
| `http.Server` with explicit timeouts (gosec G114) | Done | `cmd/pyrycode-relay/main.go` |
| Routing envelope wrapper type (`Envelope`, `Marshal`, `Unmarshal`, sentinel errors) | Done (#1) | `internal/relay/envelope.go` |
| Connection registry (`Conn`, `Registry`; 1:1 binary, 1:N phones; sentinel errors for `4409`/`4404`; race-tested) | Done (#3) | `internal/relay/registry.go` |
| TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` |
| WS upgrade on `/v1/server` and `/v1/client` | Not started | — |
| Header validation (`x-pyrycode-server`, `x-pyrycode-token`) | Not started | — |
| Connection registry (server-id ↔ binary, server-id ↔ phones) | Not started | — |
| Frame forwarding using the routing envelope | Not started | — |
| `conn_id` generation scheme | Not started | — |
| Threat model doc — operational surface (deploy, supply chain, DoS, log hygiene, cert handling, TLS, error leakage) | Done (#11) | `docs/threat-model.md` |
Expand All @@ -25,6 +25,8 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
- **Stdlib + `golang.org/x/crypto`.** First non-stdlib dep arrived in #9 (`acme/autocert`). Keep the dependency surface deliberate; new deps need a justification.
- **Loud failure over silent correction.** `--domain` and `--insecure-listen` are mutually exclusive and explicit. `ErrCacheDirInsecure` refuses to start on a world/group-readable cert cache rather than re-chmoding it. `:80` returns `404` for non-challenge traffic instead of redirecting to HTTPS (ADR-0002).
- **Tests live in the same package** (`package relay`, not `relay_test`) so they can `errors.Is` against unexported sentinels if needed and reach private helpers without re-exporting.
- **Passive in-memory stores guard mutation under one RWMutex; reads return copies, never references.** `PhonesFor` allocates a fresh slice so callers do slow work (broadcast, `Send` over the network) without holding the registry lock. Adopted in `internal/relay/registry.go` (#3); the same shape applies to any future "shared map of conns" type.
- **Interface methods called under the lock are documented as non-blocking getters.** The registry's `Conn.ConnID()` is invoked under the write lock during `UnregisterPhone`; `Send` and `Close` are never called while the lock is held. Pattern: state the contract on the interface, never call something that could block on I/O while a mutex is held.

## Conventions

Expand Down
2 changes: 2 additions & 0 deletions docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,13 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`.
- [Autocert TLS](features/autocert-tls.md) — `--domain` mode wiring: `:443` WSS via Let's Encrypt + `:80` ACME http-01, host gates, cache-dir permission policy, TLS 1.2 floor.
- [Routing envelope](features/routing-envelope.md) — typed Go wrapper (`Envelope`, `Marshal`, `Unmarshal`) for the `{conn_id, frame}` wire shape exchanged between relay and binary.

## Decisions

- [ADR-0003: Connection registry as a passive store](decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, narrow `ReleaseServer` so grace logic (#8) can wrap it.
- [ADR-0002: Explicit-failure 404 on `:80` instead of HTTPS redirect](decisions/0002-autocert-explicit-failure-on-port-80.md) — pass `http.NotFoundHandler()` to `autocert.Manager.HTTPHandler`, not `nil`; loud failure over silent scheme-switching.
- [ADR-0001: Routing envelope shape and opacity](decisions/0001-routing-envelope-shape-and-opacity.md) — `json.RawMessage` for the inner frame; sentinel errors; validate at boundary, never parse payloads.

Expand Down
59 changes: 59 additions & 0 deletions docs/knowledge/decisions/0003-connection-registry-passive-store.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
# ADR-0003: Connection registry as a passive store

**Status:** Accepted (#3)
**Date:** 2026-05-08

## Context

The relay needs one canonical structure that owns "who is connected as what for which server-id." Every routing-layer ticket (#4 server upgrade, #5 client upgrade, #6 forwarding, #7 heartbeat, #8 grace, #10 health) reads or writes through it. The shape comes from `docs/architecture.md`: 1:1 server-id → binary with first-claim-wins, 1:N server-id → phones gated on a binary holding the slot.

The design space had several axes — what the registry *owns*, what locks it takes, how it returns data, and where the grace period lives.

## Decision

`internal/relay/registry.go` defines a `Conn` interface, a `Registry` type with one `sync.RWMutex` over both maps, seven methods (`ClaimServer`, `ReleaseServer`, `RegisterPhone`, `UnregisterPhone`, `BinaryFor`, `PhonesFor`, `Counts`), and two sentinel errors (`ErrServerIDConflict`, `ErrNoServer`). The registry is a **passive store**: it never invokes `Send` / `Close` on conns it tracks, never times anything, never calls back to anyone. Stdlib only.

Rules adopted:

1. **One RWMutex over both maps.** No sharding, no per-server-id locks.
2. **`Conn` interface lives in `relay` package.** Defined at the consumer.
3. **`PhonesFor` returns a freshly allocated copy.** Callers iterate the snapshot lock-free.
4. **`UnregisterPhone` calls `ConnID()` under the lock.** The `Conn` contract requires it to be a non-blocking getter.
5. **`ReleaseServer` does not touch phones.** Orphan phones survive a release; their owner must unregister them.
6. **Sentinel errors over wrapped errors.** Same idiom as ADR-0001 (#1).

## Rationale

### One RWMutex, not per-map or sharded

Two separate locks (one for `binaries`, one for `phones`) invite ordering bugs (always-A-then-B) for no measurable gain — `RegisterPhone` already needs both, and the critical sections are sub-microsecond map operations. Sharding by server-id is a possible later optimisation but irrelevant at v1: the relay handles tens to hundreds of concurrent connections, not millions, and lock contention is unobserved.

### `Conn` defined in `relay`, not at the consumer

Conventional Go style says "define interfaces at the consumer." Here there is one consumer package — `internal/relay` — and the Conn impls themselves will live in a `relay/ws` subpackage in #4/#5. Defining `Conn` in `relay` avoids an import cycle and matches the practical scope (one consumer).

### `PhonesFor` returns a copy, not a reference

Returning the underlying slice would force callers to broadcast under the registry lock or risk a data race against `RegisterPhone` / `UnregisterPhone`. A copy is cheap (pointer-sized entries, typically 1–3 phones per server-id) and lets callers do the slow thing — `Send` over the network — without holding the lock. The Conn references inside the copy are still live; that's intentional, and it's what makes the broadcast pattern work.

### `ConnID` is a getter, called under the lock

Alternative: copy the slice, drop the lock, scan for the matching `ConnID`, re-acquire the lock to remove. This creates a TOCTOU window — a phone added between snapshot and removal could be erroneously skipped or removed. The cleaner contract is "`ConnID` is a non-blocking getter" stated on the interface; if a future Conn impl violates that, fix the impl. Documented; tested only structurally.

### `ReleaseServer` is narrow

It removes the binary entry and nothing else. The grace ticket (#8) needs phones to remain reachable for 30 seconds across a binary disconnect — so cleanup of orphan phones cannot live inside `ReleaseServer`. Keeping release immediate-and-narrow lets #8 layer policy on top without changing the registry's API. If a future ticket wants auto-close-phones-on-release, it adds a wrapper, not a flag on this method.

### Sentinel errors

Same idiom as ADR-0001. `errors.Is` for branching, no string matching, downstream WS handlers map sentinels → close codes (`4404`, `4409`).

## Consequences

- The registry is the single canonical place for "who is connected." Future routing-layer code reads/writes through it; ad-hoc maps in #4–#10 are a code-review smell.
- Concurrency is provable from the lock graph (one node) plus the documented invariant that `Send` / `Close` never run under the lock. Race-test coverage (32 goroutines × 200 ops) plus `make test`'s default `-race` mode is the enforcement.
- The registry is testable without WebSockets — a 5-line `fakeConn` is enough. WS-library choice in #4/#5 doesn't reach back into this code.
- `Conn` is a contract: any future Conn impl must honour non-blocking `ConnID`, idempotent `Close`, and per-impl write serialisation in `Send`. Those properties belong in #4/#5's spec.
- Per-server connection caps are explicitly NOT here. #5 owns that.
- Orphan phones after `ReleaseServer` are a documented state. #8 will own the grace timer that decides when to close them.
- Linear scan in `UnregisterPhone` is correct for v1 (1–3 phones per server-id). If profiling later flags it, switch to `map[connID]Conn` per server-id; the public API doesn't change.
104 changes: 104 additions & 0 deletions docs/knowledge/features/connection-registry.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
# Connection registry

The relay's routing core needs one canonical structure that owns "who is connected as what for which server-id." Every routing-layer ticket downstream — `/v1/server` upgrade (#4), `/v1/client` upgrade (#5), frame forwarding (#6), heartbeat (#7), grace period (#8), health (#10) — reads or writes through this registry.

Two non-negotiable shapes flow from `docs/architecture.md`:

1. **server-id → binary is 1:1, first-claim-wins.** A second binary claiming a held server-id is a `4409` conflict.
2. **server-id → phones is 1:N, gated on a binary holding the slot.** A phone whose server-id has no binary is a `4404` no-server.

The registry is a **passive in-memory store**. It holds opaque `Conn` handles, never speaks WebSocket, never invokes `Send` / `Close` on the conns it tracks, and never times anything out. The 30-second grace period on binary disconnect (#8) wraps the registry rather than living inside it.

## API

Package `internal/relay` (`registry.go`):

```go
type Conn interface {
ConnID() string // stable, non-blocking; called under the registry lock
Send(msg []byte) error // never called under the registry lock
Close() // idempotent; never called under the registry lock
}

type Registry struct { /* sync.RWMutex + binaries map + phones map */ }

func NewRegistry() *Registry

func (r *Registry) ClaimServer(serverID string, conn Conn) error
func (r *Registry) ReleaseServer(serverID string) (released bool)
func (r *Registry) RegisterPhone(serverID string, conn Conn) error
func (r *Registry) UnregisterPhone(serverID string, connID string)
func (r *Registry) BinaryFor(serverID string) (Conn, bool)
func (r *Registry) PhonesFor(serverID string) []Conn
func (r *Registry) Counts() (binaries, phones int)
```

Sentinel errors (branch with `errors.Is`):

| Error | Returned by | Maps to |
|---|---|---|
| `ErrServerIDConflict` | `ClaimServer` when slot already held | WS close `4409` |
| `ErrNoServer` | `RegisterPhone` when no binary holds the slot | WS close `4404` |

The mapping to close codes is informational; the registry doesn't know about WebSockets — the upgrade handlers translate.

## Concurrency model

- **One `sync.RWMutex` covers both maps.** Mutating methods take `Lock`; lookups take `RLock`. Sharding is a possible later optimisation but irrelevant at v1 (tens to hundreds of conns, sub-microsecond critical sections, no measured contention).
- **Conn callbacks are never invoked under the lock,** with one documented exception: `UnregisterPhone` calls `ConnID()` while holding the write lock to scan for the target. The `Conn` contract requires `ConnID` to be a non-blocking getter.
- **Broadcast pattern.** A caller wanting to fan out a frame to all phones for a server-id calls `PhonesFor` (returns a freshly allocated copy) and iterates the snapshot calling `Send` per conn — no registry lock held during I/O.
- **No callbacks, no channels.** The registry is a passive store. Notification of binary loss / new phone arrival happens elsewhere (the WS upgrade handlers and #8's grace timer).

## Invariants the registry enforces

- `ClaimServer` is atomic check-then-set under the write lock; concurrent claims for the same server-id cannot both succeed.
- `RegisterPhone` is atomic check-binary-then-append under the write lock; a phone cannot land in the map for a server-id whose binary was released between check and append.
- `UnregisterPhone` removes by `ConnID` via swap-and-truncate, nils the trailing slot for GC hygiene, and `delete`s the map key when the slice empties — no orphaned empty slices.
- `PhonesFor` returns nil for unknown server-ids or empty slices — callers don't have to distinguish "unknown" from "known with zero phones."
- `PhonesFor` returns a copy: mutating the returned slice cannot affect registry state.
- `Counts` is internally consistent for one call (read under RLock); two concurrent calls may observe different values.

## What the registry deliberately does NOT do

- **Does not close `Conn` handles.** `ReleaseServer` and `UnregisterPhone` only remove map entries. Connection lifetime is owned by the WS upgrade handlers.
- **Does not deduplicate phones by `ConnID`.** Caller invariant: each `Conn` is registered once.
- **Does not bound the per-server-id phone slice.** Per-server caps belong to the WS upgrade layer (#5).
- **Does not propagate binary loss to phones.** When `ReleaseServer` runs, the phones slice for that server-id is left in place. The owner of phones must `UnregisterPhone` (and choose whether to close them) when their server's binary is gone. This is intentional: the grace ticket (#8) needs phones to remain reachable across the 30-second disconnect window.
- **Does not validate `serverID` content.** Length, charset, prefix — owned by `x-pyrycode-server` header validation in #4. The registry treats any string as an opaque map key.

## Adversarial framing

The registry has no networking and no direct exposure to adversarial input. Adversaries reach it only through future tickets (#4/#5) that pass strings and `Conn` handles into its API. Notes for the layers above:

- **`serverID` of pathological size or content** is harmless to the registry (just a map key) but the WS upgrade layer must cap header lengths.
- **`Conn.ConnID()` returning unstable values** would cause `UnregisterPhone` to remove the wrong phone (or none). Documented as a contract on the `Conn` interface; violation is a leak, not a cross-tenant data leak.
- **Repeated `RegisterPhone` for the same server-id** is unbounded memory growth, gated only by the requirement that a binary holds the slot. Per-server connection caps are #5's job.
- **Slow `ConnID` blocking the write lock** is the cost of the documented non-blocking contract; copying the slice and scanning outside the lock would create a TOCTOU window where a phone could be added between snapshot and removal.

## Testing

`internal/relay/registry_test.go`, `package relay`. Tests use a minimal `fakeConn` (id, sent buffer, closed flag — no synchronisation; the race test gives each goroutine its own fakes).

Functional cases (one subtest per AC bullet):

- `ClaimServer` first-wins / second-conflicts; `BinaryFor` still resolves to the first conn.
- `ReleaseServer` reclaim — release returns `true`, second release of an unheld id returns `false`, claim again with a different conn succeeds.
- `RegisterPhone` requires a binary (`ErrNoServer` until `ClaimServer`).
- `UnregisterPhone` removes by `ConnID`, leaves siblings intact, no-op on unknown id.
- `PhonesFor` snapshot isolation: mutating the returned slice doesn't affect the registry; later registrations don't appear in earlier snapshots.
- `Counts` across the full lifecycle including the `ReleaseServer` orphan case (`(0, 1)` while phones survive a release).
- `UnregisterPhone` deletes the empty-slice map entry.

Race coverage: one `TestRegistry_RaceFreedom` hammers `ClaimServer` / `BinaryFor` / `RegisterPhone` / `PhonesFor` / `UnregisterPhone` / `Counts` / `ReleaseServer` from 32 goroutines × 200 ops over four contended server-ids. The test deliberately ignores returned errors — `ErrServerIDConflict` under contention is expected behaviour, not a failure. The point is the absence of `DATA RACE` reports.

`make test` runs `-race` by default. The `-count=20` invocation in the AC is a manual stress command, documented in the test's doc comment, not a knob in the test code:

```sh
go test -race -count=20 -run TestRegistry_RaceFreedom ./internal/relay
```

## Related

- [ADR-0003: Connection registry as a passive store](../decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, why grace logic lives outside.
- [Routing envelope](routing-envelope.md) — the wrapper used by frame forwarding (#6) once the registry is wired up.
- [Architecture overview](../../architecture.md) — where the registry fits in the data flow.
12 changes: 12 additions & 0 deletions docs/lessons.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,18 @@ When unmarshalling `{"frame": null}` into a struct with `Frame json.RawMessage`,

`encoding/json` re-encodes `RawMessage` through the outer marshaller and may strip insignificant whitespace. Tests asserting opacity should canonicalise both sides with `json.Compact` before `bytes.Equal`, otherwise they flake on formatting. Source: routing-envelope tests (#1).

## Slice swap-and-truncate needs an explicit nil for GC hygiene

Removing an element from `[]Conn` (or any slice of pointer-bearing values) by `slice[i] = slice[len(slice)-1]; slice = slice[:len(slice)-1]` leaves the original last element reachable from the underlying array's spare capacity, pinning whatever it references in memory. Set `slice[len(slice)-1] = nil` *before* re-slicing. Source: `Registry.UnregisterPhone` (#3). Catches: connection handles can be heavy structs; a leaked `Conn` keeps a goroutine and a socket buffer alive.

## Race-test count is a CI-runner knob, not test code

The AC asked for `go test -race -count=20` coverage. `count` belongs on the command line, not as a loop inside the test. `make test` already runs `-race`; document the manual stress invocation in the test's doc comment so future contributors know how to reproduce locally. Source: `TestRegistry_RaceFreedom` (#3).

## A passive registry returns copies, callers do the slow work outside the lock

When a shared map holds connection handles (or anything you'll iterate to do I/O over), the read accessor should allocate and return a fresh copy. Callers can then `Send` / `Close` / fan out without holding the lock — and concurrent mutators don't trip the iteration. The cost is one allocation per call; the benefit is no foot-gun where a slow `Send` blocks every other registry caller. Source: `Registry.PhonesFor` (#3).

## `autocert.Manager.HTTPHandler(nil)` redirects to HTTPS — it does not 404

The autocert docs and the `nil` argument are easy to misread as "no fallback → 404." Actual behaviour: GET/HEAD get a `302` to HTTPS; other methods get `400`. To get an explicit 404 (or any other handler), pass it explicitly: `mgr.HTTPHandler(http.NotFoundHandler())`. See ADR-0002. Source: autocert TLS wiring (#9).
Expand Down
Loading