diff --git a/docs/PROJECT-MEMORY.md b/docs/PROJECT-MEMORY.md index 987761e..fcbaea2 100644 --- a/docs/PROJECT-MEMORY.md +++ b/docs/PROJECT-MEMORY.md @@ -9,10 +9,10 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex | Go skeleton | Done | `cmd/pyrycode-relay/main.go`, `internal/relay/doc.go` | | `http.Server` with explicit timeouts (gosec G114) | Done | `cmd/pyrycode-relay/main.go` | | Routing envelope wrapper type (`Envelope`, `Marshal`, `Unmarshal`, sentinel errors) | Done (#1) | `internal/relay/envelope.go` | +| Connection registry (`Conn`, `Registry`; 1:1 binary, 1:N phones; sentinel errors for `4409`/`4404`; race-tested) | Done (#3) | `internal/relay/registry.go` | | TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` | | WS upgrade on `/v1/server` and `/v1/client` | Not started | — | | Header validation (`x-pyrycode-server`, `x-pyrycode-token`) | Not started | — | -| Connection registry (server-id ↔ binary, server-id ↔ phones) | Not started | — | | Frame forwarding using the routing envelope | Not started | — | | `conn_id` generation scheme | Not started | — | | Threat model doc — operational surface (deploy, supply chain, DoS, log hygiene, cert handling, TLS, error leakage) | Done (#11) | `docs/threat-model.md` | @@ -25,6 +25,8 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex - **Stdlib + `golang.org/x/crypto`.** First non-stdlib dep arrived in #9 (`acme/autocert`). Keep the dependency surface deliberate; new deps need a justification. - **Loud failure over silent correction.** `--domain` and `--insecure-listen` are mutually exclusive and explicit. `ErrCacheDirInsecure` refuses to start on a world/group-readable cert cache rather than re-chmoding it. `:80` returns `404` for non-challenge traffic instead of redirecting to HTTPS (ADR-0002). - **Tests live in the same package** (`package relay`, not `relay_test`) so they can `errors.Is` against unexported sentinels if needed and reach private helpers without re-exporting. +- **Passive in-memory stores guard mutation under one RWMutex; reads return copies, never references.** `PhonesFor` allocates a fresh slice so callers do slow work (broadcast, `Send` over the network) without holding the registry lock. Adopted in `internal/relay/registry.go` (#3); the same shape applies to any future "shared map of conns" type. +- **Interface methods called under the lock are documented as non-blocking getters.** The registry's `Conn.ConnID()` is invoked under the write lock during `UnregisterPhone`; `Send` and `Close` are never called while the lock is held. Pattern: state the contract on the interface, never call something that could block on I/O while a mutex is held. ## Conventions diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index 00e41af..3d7a5e1 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -4,11 +4,13 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Features +- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`. - [Autocert TLS](features/autocert-tls.md) — `--domain` mode wiring: `:443` WSS via Let's Encrypt + `:80` ACME http-01, host gates, cache-dir permission policy, TLS 1.2 floor. - [Routing envelope](features/routing-envelope.md) — typed Go wrapper (`Envelope`, `Marshal`, `Unmarshal`) for the `{conn_id, frame}` wire shape exchanged between relay and binary. ## Decisions +- [ADR-0003: Connection registry as a passive store](decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, narrow `ReleaseServer` so grace logic (#8) can wrap it. - [ADR-0002: Explicit-failure 404 on `:80` instead of HTTPS redirect](decisions/0002-autocert-explicit-failure-on-port-80.md) — pass `http.NotFoundHandler()` to `autocert.Manager.HTTPHandler`, not `nil`; loud failure over silent scheme-switching. - [ADR-0001: Routing envelope shape and opacity](decisions/0001-routing-envelope-shape-and-opacity.md) — `json.RawMessage` for the inner frame; sentinel errors; validate at boundary, never parse payloads. diff --git a/docs/knowledge/decisions/0003-connection-registry-passive-store.md b/docs/knowledge/decisions/0003-connection-registry-passive-store.md new file mode 100644 index 0000000..4b72e02 --- /dev/null +++ b/docs/knowledge/decisions/0003-connection-registry-passive-store.md @@ -0,0 +1,59 @@ +# ADR-0003: Connection registry as a passive store + +**Status:** Accepted (#3) +**Date:** 2026-05-08 + +## Context + +The relay needs one canonical structure that owns "who is connected as what for which server-id." Every routing-layer ticket (#4 server upgrade, #5 client upgrade, #6 forwarding, #7 heartbeat, #8 grace, #10 health) reads or writes through it. The shape comes from `docs/architecture.md`: 1:1 server-id → binary with first-claim-wins, 1:N server-id → phones gated on a binary holding the slot. + +The design space had several axes — what the registry *owns*, what locks it takes, how it returns data, and where the grace period lives. + +## Decision + +`internal/relay/registry.go` defines a `Conn` interface, a `Registry` type with one `sync.RWMutex` over both maps, seven methods (`ClaimServer`, `ReleaseServer`, `RegisterPhone`, `UnregisterPhone`, `BinaryFor`, `PhonesFor`, `Counts`), and two sentinel errors (`ErrServerIDConflict`, `ErrNoServer`). The registry is a **passive store**: it never invokes `Send` / `Close` on conns it tracks, never times anything, never calls back to anyone. Stdlib only. + +Rules adopted: + +1. **One RWMutex over both maps.** No sharding, no per-server-id locks. +2. **`Conn` interface lives in `relay` package.** Defined at the consumer. +3. **`PhonesFor` returns a freshly allocated copy.** Callers iterate the snapshot lock-free. +4. **`UnregisterPhone` calls `ConnID()` under the lock.** The `Conn` contract requires it to be a non-blocking getter. +5. **`ReleaseServer` does not touch phones.** Orphan phones survive a release; their owner must unregister them. +6. **Sentinel errors over wrapped errors.** Same idiom as ADR-0001 (#1). + +## Rationale + +### One RWMutex, not per-map or sharded + +Two separate locks (one for `binaries`, one for `phones`) invite ordering bugs (always-A-then-B) for no measurable gain — `RegisterPhone` already needs both, and the critical sections are sub-microsecond map operations. Sharding by server-id is a possible later optimisation but irrelevant at v1: the relay handles tens to hundreds of concurrent connections, not millions, and lock contention is unobserved. + +### `Conn` defined in `relay`, not at the consumer + +Conventional Go style says "define interfaces at the consumer." Here there is one consumer package — `internal/relay` — and the Conn impls themselves will live in a `relay/ws` subpackage in #4/#5. Defining `Conn` in `relay` avoids an import cycle and matches the practical scope (one consumer). + +### `PhonesFor` returns a copy, not a reference + +Returning the underlying slice would force callers to broadcast under the registry lock or risk a data race against `RegisterPhone` / `UnregisterPhone`. A copy is cheap (pointer-sized entries, typically 1–3 phones per server-id) and lets callers do the slow thing — `Send` over the network — without holding the lock. The Conn references inside the copy are still live; that's intentional, and it's what makes the broadcast pattern work. + +### `ConnID` is a getter, called under the lock + +Alternative: copy the slice, drop the lock, scan for the matching `ConnID`, re-acquire the lock to remove. This creates a TOCTOU window — a phone added between snapshot and removal could be erroneously skipped or removed. The cleaner contract is "`ConnID` is a non-blocking getter" stated on the interface; if a future Conn impl violates that, fix the impl. Documented; tested only structurally. + +### `ReleaseServer` is narrow + +It removes the binary entry and nothing else. The grace ticket (#8) needs phones to remain reachable for 30 seconds across a binary disconnect — so cleanup of orphan phones cannot live inside `ReleaseServer`. Keeping release immediate-and-narrow lets #8 layer policy on top without changing the registry's API. If a future ticket wants auto-close-phones-on-release, it adds a wrapper, not a flag on this method. + +### Sentinel errors + +Same idiom as ADR-0001. `errors.Is` for branching, no string matching, downstream WS handlers map sentinels → close codes (`4404`, `4409`). + +## Consequences + +- The registry is the single canonical place for "who is connected." Future routing-layer code reads/writes through it; ad-hoc maps in #4–#10 are a code-review smell. +- Concurrency is provable from the lock graph (one node) plus the documented invariant that `Send` / `Close` never run under the lock. Race-test coverage (32 goroutines × 200 ops) plus `make test`'s default `-race` mode is the enforcement. +- The registry is testable without WebSockets — a 5-line `fakeConn` is enough. WS-library choice in #4/#5 doesn't reach back into this code. +- `Conn` is a contract: any future Conn impl must honour non-blocking `ConnID`, idempotent `Close`, and per-impl write serialisation in `Send`. Those properties belong in #4/#5's spec. +- Per-server connection caps are explicitly NOT here. #5 owns that. +- Orphan phones after `ReleaseServer` are a documented state. #8 will own the grace timer that decides when to close them. +- Linear scan in `UnregisterPhone` is correct for v1 (1–3 phones per server-id). If profiling later flags it, switch to `map[connID]Conn` per server-id; the public API doesn't change. diff --git a/docs/knowledge/features/connection-registry.md b/docs/knowledge/features/connection-registry.md new file mode 100644 index 0000000..e38e13d --- /dev/null +++ b/docs/knowledge/features/connection-registry.md @@ -0,0 +1,104 @@ +# Connection registry + +The relay's routing core needs one canonical structure that owns "who is connected as what for which server-id." Every routing-layer ticket downstream — `/v1/server` upgrade (#4), `/v1/client` upgrade (#5), frame forwarding (#6), heartbeat (#7), grace period (#8), health (#10) — reads or writes through this registry. + +Two non-negotiable shapes flow from `docs/architecture.md`: + +1. **server-id → binary is 1:1, first-claim-wins.** A second binary claiming a held server-id is a `4409` conflict. +2. **server-id → phones is 1:N, gated on a binary holding the slot.** A phone whose server-id has no binary is a `4404` no-server. + +The registry is a **passive in-memory store**. It holds opaque `Conn` handles, never speaks WebSocket, never invokes `Send` / `Close` on the conns it tracks, and never times anything out. The 30-second grace period on binary disconnect (#8) wraps the registry rather than living inside it. + +## API + +Package `internal/relay` (`registry.go`): + +```go +type Conn interface { + ConnID() string // stable, non-blocking; called under the registry lock + Send(msg []byte) error // never called under the registry lock + Close() // idempotent; never called under the registry lock +} + +type Registry struct { /* sync.RWMutex + binaries map + phones map */ } + +func NewRegistry() *Registry + +func (r *Registry) ClaimServer(serverID string, conn Conn) error +func (r *Registry) ReleaseServer(serverID string) (released bool) +func (r *Registry) RegisterPhone(serverID string, conn Conn) error +func (r *Registry) UnregisterPhone(serverID string, connID string) +func (r *Registry) BinaryFor(serverID string) (Conn, bool) +func (r *Registry) PhonesFor(serverID string) []Conn +func (r *Registry) Counts() (binaries, phones int) +``` + +Sentinel errors (branch with `errors.Is`): + +| Error | Returned by | Maps to | +|---|---|---| +| `ErrServerIDConflict` | `ClaimServer` when slot already held | WS close `4409` | +| `ErrNoServer` | `RegisterPhone` when no binary holds the slot | WS close `4404` | + +The mapping to close codes is informational; the registry doesn't know about WebSockets — the upgrade handlers translate. + +## Concurrency model + +- **One `sync.RWMutex` covers both maps.** Mutating methods take `Lock`; lookups take `RLock`. Sharding is a possible later optimisation but irrelevant at v1 (tens to hundreds of conns, sub-microsecond critical sections, no measured contention). +- **Conn callbacks are never invoked under the lock,** with one documented exception: `UnregisterPhone` calls `ConnID()` while holding the write lock to scan for the target. The `Conn` contract requires `ConnID` to be a non-blocking getter. +- **Broadcast pattern.** A caller wanting to fan out a frame to all phones for a server-id calls `PhonesFor` (returns a freshly allocated copy) and iterates the snapshot calling `Send` per conn — no registry lock held during I/O. +- **No callbacks, no channels.** The registry is a passive store. Notification of binary loss / new phone arrival happens elsewhere (the WS upgrade handlers and #8's grace timer). + +## Invariants the registry enforces + +- `ClaimServer` is atomic check-then-set under the write lock; concurrent claims for the same server-id cannot both succeed. +- `RegisterPhone` is atomic check-binary-then-append under the write lock; a phone cannot land in the map for a server-id whose binary was released between check and append. +- `UnregisterPhone` removes by `ConnID` via swap-and-truncate, nils the trailing slot for GC hygiene, and `delete`s the map key when the slice empties — no orphaned empty slices. +- `PhonesFor` returns nil for unknown server-ids or empty slices — callers don't have to distinguish "unknown" from "known with zero phones." +- `PhonesFor` returns a copy: mutating the returned slice cannot affect registry state. +- `Counts` is internally consistent for one call (read under RLock); two concurrent calls may observe different values. + +## What the registry deliberately does NOT do + +- **Does not close `Conn` handles.** `ReleaseServer` and `UnregisterPhone` only remove map entries. Connection lifetime is owned by the WS upgrade handlers. +- **Does not deduplicate phones by `ConnID`.** Caller invariant: each `Conn` is registered once. +- **Does not bound the per-server-id phone slice.** Per-server caps belong to the WS upgrade layer (#5). +- **Does not propagate binary loss to phones.** When `ReleaseServer` runs, the phones slice for that server-id is left in place. The owner of phones must `UnregisterPhone` (and choose whether to close them) when their server's binary is gone. This is intentional: the grace ticket (#8) needs phones to remain reachable across the 30-second disconnect window. +- **Does not validate `serverID` content.** Length, charset, prefix — owned by `x-pyrycode-server` header validation in #4. The registry treats any string as an opaque map key. + +## Adversarial framing + +The registry has no networking and no direct exposure to adversarial input. Adversaries reach it only through future tickets (#4/#5) that pass strings and `Conn` handles into its API. Notes for the layers above: + +- **`serverID` of pathological size or content** is harmless to the registry (just a map key) but the WS upgrade layer must cap header lengths. +- **`Conn.ConnID()` returning unstable values** would cause `UnregisterPhone` to remove the wrong phone (or none). Documented as a contract on the `Conn` interface; violation is a leak, not a cross-tenant data leak. +- **Repeated `RegisterPhone` for the same server-id** is unbounded memory growth, gated only by the requirement that a binary holds the slot. Per-server connection caps are #5's job. +- **Slow `ConnID` blocking the write lock** is the cost of the documented non-blocking contract; copying the slice and scanning outside the lock would create a TOCTOU window where a phone could be added between snapshot and removal. + +## Testing + +`internal/relay/registry_test.go`, `package relay`. Tests use a minimal `fakeConn` (id, sent buffer, closed flag — no synchronisation; the race test gives each goroutine its own fakes). + +Functional cases (one subtest per AC bullet): + +- `ClaimServer` first-wins / second-conflicts; `BinaryFor` still resolves to the first conn. +- `ReleaseServer` reclaim — release returns `true`, second release of an unheld id returns `false`, claim again with a different conn succeeds. +- `RegisterPhone` requires a binary (`ErrNoServer` until `ClaimServer`). +- `UnregisterPhone` removes by `ConnID`, leaves siblings intact, no-op on unknown id. +- `PhonesFor` snapshot isolation: mutating the returned slice doesn't affect the registry; later registrations don't appear in earlier snapshots. +- `Counts` across the full lifecycle including the `ReleaseServer` orphan case (`(0, 1)` while phones survive a release). +- `UnregisterPhone` deletes the empty-slice map entry. + +Race coverage: one `TestRegistry_RaceFreedom` hammers `ClaimServer` / `BinaryFor` / `RegisterPhone` / `PhonesFor` / `UnregisterPhone` / `Counts` / `ReleaseServer` from 32 goroutines × 200 ops over four contended server-ids. The test deliberately ignores returned errors — `ErrServerIDConflict` under contention is expected behaviour, not a failure. The point is the absence of `DATA RACE` reports. + +`make test` runs `-race` by default. The `-count=20` invocation in the AC is a manual stress command, documented in the test's doc comment, not a knob in the test code: + +```sh +go test -race -count=20 -run TestRegistry_RaceFreedom ./internal/relay +``` + +## Related + +- [ADR-0003: Connection registry as a passive store](../decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, why grace logic lives outside. +- [Routing envelope](routing-envelope.md) — the wrapper used by frame forwarding (#6) once the registry is wired up. +- [Architecture overview](../../architecture.md) — where the registry fits in the data flow. diff --git a/docs/lessons.md b/docs/lessons.md index de5bd4a..d93fbe7 100644 --- a/docs/lessons.md +++ b/docs/lessons.md @@ -14,6 +14,18 @@ When unmarshalling `{"frame": null}` into a struct with `Frame json.RawMessage`, `encoding/json` re-encodes `RawMessage` through the outer marshaller and may strip insignificant whitespace. Tests asserting opacity should canonicalise both sides with `json.Compact` before `bytes.Equal`, otherwise they flake on formatting. Source: routing-envelope tests (#1). +## Slice swap-and-truncate needs an explicit nil for GC hygiene + +Removing an element from `[]Conn` (or any slice of pointer-bearing values) by `slice[i] = slice[len(slice)-1]; slice = slice[:len(slice)-1]` leaves the original last element reachable from the underlying array's spare capacity, pinning whatever it references in memory. Set `slice[len(slice)-1] = nil` *before* re-slicing. Source: `Registry.UnregisterPhone` (#3). Catches: connection handles can be heavy structs; a leaked `Conn` keeps a goroutine and a socket buffer alive. + +## Race-test count is a CI-runner knob, not test code + +The AC asked for `go test -race -count=20` coverage. `count` belongs on the command line, not as a loop inside the test. `make test` already runs `-race`; document the manual stress invocation in the test's doc comment so future contributors know how to reproduce locally. Source: `TestRegistry_RaceFreedom` (#3). + +## A passive registry returns copies, callers do the slow work outside the lock + +When a shared map holds connection handles (or anything you'll iterate to do I/O over), the read accessor should allocate and return a fresh copy. Callers can then `Send` / `Close` / fan out without holding the lock — and concurrent mutators don't trip the iteration. The cost is one allocation per call; the benefit is no foot-gun where a slow `Send` blocks every other registry caller. Source: `Registry.PhonesFor` (#3). + ## `autocert.Manager.HTTPHandler(nil)` redirects to HTTPS — it does not 404 The autocert docs and the `nil` argument are easy to misread as "no fallback → 404." Actual behaviour: GET/HEAD get a `302` to HTTPS; other methods get `400`. To get an explicit 404 (or any other handler), pass it explicitly: `mgr.HTTPHandler(http.NotFoundHandler())`. See ADR-0002. Source: autocert TLS wiring (#9). diff --git a/docs/specs/architecture/3-connection-registry.md b/docs/specs/architecture/3-connection-registry.md new file mode 100644 index 0000000..ad38dda --- /dev/null +++ b/docs/specs/architecture/3-connection-registry.md @@ -0,0 +1,399 @@ +# Spec — Connection registry (#3) + +## Files to read first + +- `internal/relay/envelope.go` — sentinel-error idiom, doc-comment style, `package relay` framing. The new file mirrors this shape. +- `internal/relay/envelope_test.go` — `package relay` (not `_test`) so tests can `errors.Is` against unexported helpers; table-driven rejections via `errors.Is`. +- `internal/relay/doc.go` — package doc explicitly names "per-server connection registry" as a routing-core concern. The new file is its second pillar. +- `docs/architecture.md:18-23` — what the relay does: 1:1 server-id → binary, 1:N server-id → phones, frame forwarding. The registry's invariants flow from these bullets. +- `docs/PROJECT-MEMORY.md:22-26` — established patterns: sentinel errors via `errors.Is`, stdlib only, package-internal tests. +- `Makefile:13-14` — `make test` runs `go test -race ./...`. Race coverage isn't a CI nicety here; it's the default mode every PR is judged against. +- `go.mod` — Go 1.26.2, stdlib only. `sync` and `errors` are the only imports the production file needs. + +## Context + +The relay's routing core needs a single canonical structure that owns "who is connected as what for which server-id." Every routing-layer ticket downstream — `/v1/server` upgrade (#4), `/v1/client` upgrade (#5), frame forwarding (#6), heartbeat (#7), grace period (#8), health (#10) — reads or writes through it. Without this type each downstream ticket would invent its own ad-hoc map and lock; with it they share one contract and one race-tested implementation. + +Two non-negotiable shapes (per the protocol spec and `docs/architecture.md`): + +1. **Server-id → binary is 1:1, first-claim-wins.** A second binary claiming a held server-id is a `4409` conflict. The 30-second grace period on disconnect is a separate ticket (#8) and **wraps** the registry rather than living inside it; the registry itself implements immediate semantics (claim/release with no timer). +2. **Server-id → phones is 1:N, gated on a binary holding the slot.** A phone whose server-id has no binary is a `4404` no-server. Phones can only be registered when the slot is held. + +The registry holds **opaque connection handles** (a `Conn` interface) so the WS upgrade tickets can pick the WebSocket library independently and tests can use mocks. Concrete `Conn` implementations land in #4 and #5; this ticket ships only the interface. + +## Design + +### Package & files + +- New file: `internal/relay/registry.go` +- New test file: `internal/relay/registry_test.go` +- Both `package relay` — same convention as `envelope.go` / `envelope_test.go`. Tests can branch on unexported state via `errors.Is` and reach package-private helpers if needed. + +No edits to `doc.go`, `envelope.go`, `cmd/pyrycode-relay/main.go`, or `docs/`. The registry has no consumers yet; wiring lands in #4/#5. + +### The `Conn` interface + +```go +// Conn is the registry's view of a WebSocket connection. Real implementations +// land in the /v1/server (#4) and /v1/client (#5) upgrade tickets; tests use +// mocks. +// +// All three methods MUST be safe to call from any goroutine. ConnID and Send +// MUST NOT block on locks the registry itself takes — the registry never +// invokes Conn methods while holding its own lock, but callers iterating a +// PhonesFor snapshot may. +type Conn interface { + // ConnID returns the relay-assigned connection identifier. Stable for the + // lifetime of the connection. The format is decided in a later ticket; + // the registry treats it as an opaque key for slice lookups. + ConnID() string + + // Send delivers a fully-formed wire message. Implementations are + // responsible for serialising writes to the underlying socket. Errors + // are returned to the caller; the registry does not interpret them. + Send(msg []byte) error + + // Close terminates the connection. Idempotent. Safe to call concurrently + // with Send (the implementation must handle the race). + Close() +} +``` + +Three methods, named precisely after the AC. The interface is defined in the registry's package because consumers (frame forwarding, etc.) all live in `internal/relay`; defining at the consumer is moot when there is one consumer package. (Keeping the interface in `relay` also avoids an import cycle once `cmd/pyrycode-relay/main.go` wires up real implementations from a `relay/ws` subpackage in #4/#5.) + +### The `Registry` type + +```go +// Registry maps server-ids to a single binary connection (1:1) and to a list +// of phone connections (1:N). All methods are safe to call concurrently. +// +// The registry holds Conn handles by reference and never inspects their +// internal state. It does not call any Conn method while holding its own +// lock — close/send-fan-out is the caller's responsibility, performed +// against a PhonesFor snapshot. +type Registry struct { + mu sync.RWMutex + binaries map[string]Conn + phones map[string][]Conn +} + +// NewRegistry constructs an empty Registry. +func NewRegistry() *Registry +``` + +One `sync.RWMutex` covers both maps. Sharding by server-id is a possible later optimisation but irrelevant at v1 — the relay handles tens to hundreds of concurrent connections, not millions, and the lock is held for sub-microsecond critical sections (map lookup, slice append/copy). Two separate locks would invite ordering bugs (always-A-then-B) for no measurable gain. + +### Methods + +```go +// ClaimServer registers conn as the binary for serverID. First-claim-wins: +// the second concurrent caller for the same serverID receives +// ErrServerIDConflict. The conflicting caller's conn is left untouched +// (registry does not Close it). +// +// Use ReleaseServer to free the slot. ClaimServer does NOT inspect or +// modify the phones slice for serverID; phones registered before a release +// remain in the map until they are explicitly UnregisterPhone'd by their +// owner, or removed by a higher-layer cleanup (#8 grace period). +func (r *Registry) ClaimServer(serverID string, conn Conn) error + +// ReleaseServer removes the binary entry for serverID. Returns true if a +// binary held the slot, false otherwise. Does NOT close the connection; +// caller owns conn lifecycle. Does NOT touch the phones slice — see the +// "Open questions" section for why this is intentional and how #8 will +// layer cleanup on top. +func (r *Registry) ReleaseServer(serverID string) (released bool) + +// RegisterPhone appends conn to the phones slice for serverID. Returns +// ErrNoServer if no binary currently holds serverID. Does not deduplicate +// by ConnID — the caller is responsible for not registering the same +// phone twice. +func (r *Registry) RegisterPhone(serverID string, conn Conn) error + +// UnregisterPhone removes the phone whose ConnID equals connID from the +// slice for serverID. No-op if no matching phone is present (including +// when no slice exists for serverID). Does NOT close the connection; +// caller owns conn lifecycle. +// +// When the slice becomes empty as a result, the entry is deleted from the +// phones map so Counts and PhonesFor see no orphaned empty slices. +func (r *Registry) UnregisterPhone(serverID string, connID string) + +// BinaryFor returns the binary holding serverID, if any. +func (r *Registry) BinaryFor(serverID string) (Conn, bool) + +// PhonesFor returns a snapshot of the phones registered for serverID. +// The returned slice is freshly allocated; the caller may iterate, append, +// or otherwise mutate it without affecting the registry's internal state +// or holding any registry lock. Returns nil for an unknown serverID. +// +// The Conn handles inside the slice are the same references the registry +// holds; calling Send/Close on them affects the live connection. +func (r *Registry) PhonesFor(serverID string) []Conn + +// Counts returns the number of binaries currently claimed and the total +// number of phone connections summed across all server-ids. For #10's +// health endpoint. +func (r *Registry) Counts() (binaries, phones int) +``` + +### Sentinel errors + +```go +var ( + // ErrServerIDConflict is returned by ClaimServer when serverID is + // already held by another binary. Maps to WS close code 4409. + ErrServerIDConflict = errors.New("relay: server-id already claimed") + + // ErrNoServer is returned by RegisterPhone when no binary holds the + // requested serverID. Maps to WS close code 4404. + ErrNoServer = errors.New("relay: no binary for server-id") +) +``` + +Same idiom as `envelope.go`: callers use `errors.Is`. The mapping to close codes is informational; the registry doesn't know about WebSockets and won't return close codes itself — the WS upgrade handlers do that translation. + +### Algorithms + +**ClaimServer** — write lock, single-check map, set or error. + +``` +mu.Lock(); defer mu.Unlock() +if _, ok := binaries[serverID]; ok { return ErrServerIDConflict } +binaries[serverID] = conn +return nil +``` + +**ReleaseServer** — write lock, delete-and-report. + +``` +mu.Lock(); defer mu.Unlock() +if _, ok := binaries[serverID]; !ok { return false } +delete(binaries, serverID) +return true +``` + +**RegisterPhone** — write lock, gate on binary presence, append. + +``` +mu.Lock(); defer mu.Unlock() +if _, ok := binaries[serverID]; !ok { return ErrNoServer } +phones[serverID] = append(phones[serverID], conn) +return nil +``` + +**UnregisterPhone** — write lock, linear scan by ConnID, remove preserving order is unnecessary; use swap-and-truncate. + +``` +mu.Lock(); defer mu.Unlock() +slice, ok := phones[serverID] +if !ok { return } +for i, c := range slice { + if c.ConnID() == connID { + slice[i] = slice[len(slice)-1] + slice[len(slice)-1] = nil // GC hygiene; Conn may be a heavy struct. + slice = slice[:len(slice)-1] + if len(slice) == 0 { + delete(phones, serverID) + } else { + phones[serverID] = slice + } + return + } +} +``` + +ConnID is called *while holding the lock* — that's allowed because the contract on `Conn` says `ConnID` must not block (it's a getter). `Send` and `Close`, which can block on the network, are never called from inside any registry method. + +Linear scan is correct for v1: a single server-id holds a small number of phones (typically 1–3 per user across devices). If profiling later shows hot spots, switch to `map[connID]Conn` per server-id; the API doesn't change. + +**BinaryFor** — read lock, map lookup. + +``` +mu.RLock(); defer mu.RUnlock() +c, ok := binaries[serverID] +return c, ok +``` + +**PhonesFor** — read lock, copy. + +``` +mu.RLock(); defer mu.RUnlock() +src := phones[serverID] +if len(src) == 0 { return nil } +out := make([]Conn, len(src)) +copy(out, src) +return out +``` + +The `nil` return on empty is deliberate: callers that range over the result iterate zero times either way; the caller never has to distinguish "unknown server-id" from "known with zero phones" because `RegisterPhone` makes those the same state (the registry only stores non-empty slices, courtesy of `UnregisterPhone`'s `delete` step). + +**Counts** — read lock, len + summed iteration. + +``` +mu.RLock(); defer mu.RUnlock() +b := len(binaries) +p := 0 +for _, s := range phones { p += len(s) } +return b, p +``` + +Iteration is acceptable: `Counts` is called by the health endpoint, not on the hot path. Tracking running totals is a premature optimisation. + +### Concurrency model + +- **One `sync.RWMutex` for the whole registry.** All write methods take `Lock`; all read methods take `RLock`. +- **No fairness assumptions.** Go's `sync.RWMutex` does not guarantee starvation-freedom for writers; the registry's API doesn't expose that. The five-method surface area is small enough that lock contention is unobserved in practice. +- **No callbacks, no channels.** The registry is a passive store. Notification of binary loss / new phone arrival happens elsewhere (the WS upgrade handlers and #8's grace timer). +- **Conn methods are never invoked under the lock.** Callers wanting to broadcast a frame to all phones for a server-id must: + 1. Call `PhonesFor` to get a snapshot. + 2. Iterate the snapshot and call `Send` on each — no registry lock held. + This pattern is the entire reason `PhonesFor` returns a copy. +- **`ConnID()` IS called under the lock** by `UnregisterPhone`. Documented in `Conn`'s doc comment as a non-blocking, lock-free operation. + +### Error handling + +Two sentinels (`ErrServerIDConflict`, `ErrNoServer`). No wrapping with `fmt.Errorf` is needed — there is no underlying error to chain. No panics. No logging. Library code; the caller chooses the response. + +### Lifecycle expectations the registry does NOT enforce + +These are listed so the developer doesn't add defensive code for them: + +- The registry does not close `Conn` handles on its own. `ReleaseServer` and `UnregisterPhone` remove map entries but never call `Close`. Callers (WS upgrade handlers) own the connection lifetime. +- The registry does not deduplicate phones by `ConnID` on `RegisterPhone`. Caller invariant: each `Conn` is registered once. +- The registry does not bound the per-server-id phone slice. Per-server connection limits belong to the WS upgrade layer (rate limiting / cap headers, not this ticket's concern). +- The registry does not propagate binary loss to phones. When `ReleaseServer` runs, phones for that server-id remain in the map — the layer that owns phones must `UnregisterPhone` them (and choose whether to close them) when their server's binary is gone. This is intentional: the grace ticket (#8) needs that decoupling to keep phones alive across a 30-second window. + +## Testing strategy + +`internal/relay/registry_test.go`, `package relay`, `testing` only. + +### Mock `Conn` + +A minimal in-package test helper: + +```go +type fakeConn struct { + id string + sent [][]byte // optional, for forwarding tests in #6; harmless here. + closed bool +} + +func (c *fakeConn) ConnID() string { return c.id } +func (c *fakeConn) Send(msg []byte) error { c.sent = append(c.sent, msg); return nil } +func (c *fakeConn) Close() { c.closed = true } +``` + +`fakeConn` has no synchronisation. The tests do not concurrently mutate the same `fakeConn` — the registry's race tests register/unregister distinct fakes per goroutine (or read-only fakes via `BinaryFor`/`PhonesFor`). This is a deliberate scope cut: the registry's race coverage proves the registry is race-free with well-behaved Conns, not that fakeConn is race-free. + +### Functional tests (subtest names map 1:1 to AC bullets) + +Each is a focused unit test using `errors.Is` for error assertions; no string compare. + +1. **`TestClaimServer_FirstWinsSecondConflicts`** — claim succeeds, second claim with a different `Conn` returns `ErrServerIDConflict`, original `BinaryFor` lookup still resolves to the first conn. +2. **`TestReleaseServer_ReclaimSucceeds`** — claim, release (returns `true`), claim again with a different conn (returns `nil`), `BinaryFor` resolves to the new conn. Then release of an unheld id returns `false`. +3. **`TestRegisterPhone_RequiresBinary`** — `RegisterPhone("s1", phone)` with no binary returns `ErrNoServer`; after `ClaimServer("s1", bin)`, the same call succeeds. +4. **`TestUnregisterPhone_RemovesByConnID`** — register three phones, `UnregisterPhone` the middle one, `PhonesFor` returns exactly the other two and the removed phone is absent. `UnregisterPhone` with an unknown connID is a no-op. +5. **`TestPhonesFor_SnapshotIsolation`** — register two phones, take the snapshot, mutate the snapshot (overwrite an element with a different `*fakeConn`), assert `PhonesFor` again still returns the original two. Also: register a third phone after the snapshot was taken, assert the original snapshot still has length 2. +6. **`TestCounts_AcrossLifecycle`** — assert `(0,0)` initial, after `ClaimServer` `(1,0)`, after `RegisterPhone × 2` `(1,2)`, after `UnregisterPhone × 1` `(1,1)`, after `ReleaseServer` `(0,1)` (orphan phone retained — see open questions), after the orphan's `UnregisterPhone` `(0,0)`. +7. **`TestUnregisterPhone_RemovesEmptySlice`** — register one phone, unregister it, assert `PhonesFor` returns `nil` AND the underlying `phones` map has no entry for that server-id (verifiable via `Counts` returning 0 phones; this AC is structural so a length check is sufficient). + +### Race coverage + +One test, focused on hammering the public API under `-race`: + +```go +func TestRegistry_RaceFreedom(t *testing.T) { + t.Parallel() + r := NewRegistry() + + const goroutines = 32 + const opsPer = 200 + + var wg sync.WaitGroup + wg.Add(goroutines) + for g := 0; g < goroutines; g++ { + g := g + go func() { + defer wg.Done() + sid := fmt.Sprintf("s-%d", g%4) // four contended server-ids + bin := &fakeConn{id: fmt.Sprintf("b-%d", g)} + for i := 0; i < opsPer; i++ { + _ = r.ClaimServer(sid, bin) + _, _ = r.BinaryFor(sid) + _ = r.RegisterPhone(sid, &fakeConn{id: fmt.Sprintf("p-%d-%d", g, i)}) + _ = r.PhonesFor(sid) + r.UnregisterPhone(sid, fmt.Sprintf("p-%d-%d", g, i)) + _, _ = r.Counts() + _ = r.ReleaseServer(sid) + } + }() + } + wg.Wait() +} +``` + +The ticket asks for `-race -count=20`. The Makefile already runs `-race`; `count=20` is a CI-level invocation, not a test code knob. **The developer should NOT add `-count` flags to the test file.** Document the manual command in the test's doc comment so future contributors know how to stress it locally: + +```go +// Run with: go test -race -count=20 -run TestRegistry_RaceFreedom ./internal/relay +``` + +The race test deliberately ignores returned errors: a `ClaimServer` that returns `ErrServerIDConflict` is expected behaviour under contention, not a test failure. The point is the absence of `DATA RACE` reports from the runtime. + +### What we deliberately do not test + +- **`Conn.Send` / `Conn.Close` invocation under lock** — the registry never calls them; there is no behaviour to assert. A "lock-not-held" assertion would require introspecting runtime state, which Go does not portably expose. +- **Per-server-id phone bound** — out of scope; later ticket. +- **`Conn.ConnID()` panic safety** — registry assumes it doesn't panic; matches the rest of the codebase's "trust your interfaces" stance. + +## Open questions + +1. **Should `ReleaseServer` clear the phones slice for that server-id?** The AC says it removes the binary entry; nothing about phones. The chosen semantic is **no, leave phones in place** — orphan phones survive a release until their owner unregisters them. Rationale: the grace period (#8) needs phones to remain reachable across a 30-second binary disconnect window, and the simplest way to allow that without coupling #8 into the registry is to keep release immediate-and-narrow. If a future ticket decides orphan phones should be auto-closed on release, it adds a `ReleaseServerAndClosePhones` method or wraps `ReleaseServer` in higher-layer logic — the public API in this ticket stays minimal. +2. **`Counts` race-vs-snapshot consistency.** `Counts` is read under `RLock` and is internally consistent for one call. Two concurrent calls may observe different values — that's fine for a health endpoint; no caller needs cross-call consistency. +3. **Should we keep an `Empty()` or `Has(serverID)` helper?** No. `BinaryFor` already returns `(Conn, bool)`; that's the existence test. Adding more helpers is anticipatory. + +## Out of scope (re-stated, for the developer) + +- No edits to `cmd/pyrycode-relay/main.go`. The registry has no consumers in this ticket. +- No WS upgrade, no header validation, no `conn_id` generation — see #4, #5, and the dedicated conn-id ticket. +- No grace period, no heartbeat, no metrics beyond `Counts`. See #7, #8, #10. +- No external dependencies. `sync`, `errors`, plus `fmt`/`testing`/`sync` for tests. + +## Done means + +- `internal/relay/registry.go` exists with `Conn`, `Registry`, `NewRegistry`, the seven methods named in the AC, and the two sentinel errors. Every exported symbol carries a doc comment that names its concurrency contract (whether the lock is held, what the caller may/may not do with returned values). +- `internal/relay/registry_test.go` covers the seven functional cases plus the race-freedom test. The race test runs clean under `go test -race -count=20 -run TestRegistry_RaceFreedom ./internal/relay`. +- `make vet`, `make test`, `make build` all clean from the repo root. +- One commit on `feature/3`: `feat(relay): connection registry (#3)`. + +--- + +## Security review (security-sensitive label) + +### Threat surface for THIS ticket + +The registry has no networking and no direct exposure to adversarial input. Adversaries only reach it through future tickets (#4/#5 WS upgrades) that pass strings (`serverID`, `connID`) and `Conn` handles into its API. The review below confirms the registry's API does not turn well-behaved adversarial input from the WS layer into a security issue — it cannot vouch for the WS layer itself. + +### Categories walked + +- **Trust boundaries.** The registry trusts (a) `serverID` is whatever the WS upgrade handler decided to accept (header validation is a separate ticket; until it lands, the registry does not pretend to validate IDs); (b) `connID` returned by `Conn.ConnID()` is unique among connections registered for the same server-id (caller invariant — the conn-id generation ticket guarantees this); (c) `Conn` implementations honour the methods' contracts (non-blocking `ConnID`, idempotent `Close`). Each trust is documented in the doc comment of the API surface that depends on it. **Finding:** registry does not validate `serverID` content (length, charset). This is correct for this ticket — validation belongs at the WS upgrade layer (`x-pyrycode-server` header check). The registry's behaviour is well-defined for any string, including empty strings or strings with embedded NULs: they're just map keys. Enforced at the boundary (`internal/relay/registry.go`), not deeper. +- **Resource exhaustion.** Per-server-id phone slice is unbounded. An adversary who could repeatedly call `RegisterPhone` for the same server-id could grow the slice without bound and consume memory. **Finding:** acceptable in this ticket because (a) `RegisterPhone` is gated on `ErrNoServer`, so an adversary first needs a binary holding the slot — which means the adversary has compromised the binary or its owner connected the binary themselves; (b) the WS upgrade layer (#5) is the natural place to enforce per-server connection caps. Mentioned in "Lifecycle expectations" so #5's developer doesn't assume the registry caps phones. +- **Race conditions / TOCTOU.** ClaimServer's "check then set" is atomic under the write lock — a second claimant cannot slip between `binaries[serverID]` lookup and assignment. RegisterPhone's "check binary then append" is also atomic under the write lock. Reads (`BinaryFor`, `PhonesFor`, `Counts`) are atomic under the read lock. There is no API that returns a value the caller is supposed to act on before re-entering the registry — which is what would create a TOCTOU window. **Finding:** the API shape eliminates TOCTOU at this layer. +- **Lock-order / deadlock.** Registry takes only its own `mu`. Never invokes `Send` or `Close` while holding the lock. `ConnID` is invoked under the lock by `UnregisterPhone` — documented contract on `Conn` says `ConnID` must not block. **Finding:** lock graph is a single node; cannot deadlock with itself, cannot deadlock with caller locks because no callbacks are made under the lock except the `ConnID` getter. +- **Snapshot isolation / data leaks via aliasing.** `PhonesFor` returns a copy via `make` + `copy`. Callers cannot observe in-progress mutation by dereferencing slice header bytes. The Conn handles inside the copy are still the same references the registry holds — that's by design (broadcast pattern), and the Conn's own state is not registry state. **Finding:** matches the AC's stated invariant; tested explicitly in `TestPhonesFor_SnapshotIsolation`. +- **Memory leaks under adversarial workload.** `UnregisterPhone` deletes the phones map entry when the slice empties, preventing growth of orphaned `[]Conn{}` entries. Released binaries are deleted from the map. The only documented leak path is "phones survive `ReleaseServer`" — addressed in open question (1) and intentionally left for the layer that owns phone lifetime. +- **Information disclosure.** Counts returns aggregate ints; no per-server-id info. No method returns a Conn that wasn't already known to the caller (you can only retrieve Conns for server-ids you've already registered/claimed). **Finding:** API is not a sidechannel for enumerating other tenants. +- **Panics / nil-deref.** Maps are initialised in `NewRegistry`. No `*Conn` dereferences (interface, not pointer). No slice index past length. **Finding:** none observed. + +### Adversarial framings considered + +- *"What if an attacker sends a server-id of 100 KB?"* — Registry treats it as a map key. Memory cost is one allocation per unique key. Capping the header length is the WS upgrade layer's job; the registry does not need to second-guess. +- *"What if a malicious `Conn` impl returns different `ConnID()` values on each call?"* — `UnregisterPhone` linear-scans by ConnID; if `ConnID()` is unstable, the wrong phone (or no phone) may be removed. The `Conn` doc comment specifies "stable for the lifetime of the connection" — this is a contract the WS handler must honour. If a future Conn impl violates it, the bug manifests as a phone leak in the registry, not a security boundary failure (no other Conn's data is exposed). +- *"What if a binary disconnects without releasing?"* — Registry holds a stale `Conn` reference until something calls `ReleaseServer`. The grace ticket (#8) is responsible for this; the registry's contract makes no promises about liveness. +- *"What if two goroutines `UnregisterPhone` the same phone simultaneously?"* — Both take the write lock serially. The first finds and removes; the second finds nothing and is a no-op. No double-free, no panic. +- *"Slow `Conn.ConnID()` blocks `UnregisterPhone` while holding the lock and starves other callers."* — Yes, that's the cost of the documented contract. The `Conn` doc comment names `ConnID` as non-blocking; if a future impl violates that, fix the impl. The registry could copy the slice and scan outside the lock to defend against this, but that creates a new TOCTOU window (a phone could be added after the snapshot and erroneously not be removed). The cleaner contract — and the one this spec adopts — is "`ConnID` is a getter." + +### Verdict: PASS + +The registry's API and lock discipline do not introduce vulnerabilities under any adversarial framing examined. The findings flagged above are out-of-scope by design (per-server caps, header validation, conn-id generation) and are documented in the spec so downstream tickets pick them up. The `security-sensitive` label is justified by the registry's role on every routing path; the review is correspondingly thorough even though this ticket itself does no I/O. diff --git a/internal/relay/registry.go b/internal/relay/registry.go new file mode 100644 index 0000000..f3256c7 --- /dev/null +++ b/internal/relay/registry.go @@ -0,0 +1,188 @@ +package relay + +import ( + "errors" + "sync" +) + +// Sentinel errors returned by Registry methods. Callers branch on these with +// errors.Is. The mapping to WebSocket close codes is informational; the +// registry has no knowledge of WebSockets and the WS upgrade handlers do +// the translation. +var ( + // ErrServerIDConflict is returned by ClaimServer when serverID is + // already held by another binary. Maps to WS close code 4409. + ErrServerIDConflict = errors.New("relay: server-id already claimed") + + // ErrNoServer is returned by RegisterPhone when no binary holds the + // requested serverID. Maps to WS close code 4404. + ErrNoServer = errors.New("relay: no binary for server-id") +) + +// Conn is the registry's view of a WebSocket connection. Real implementations +// land in the /v1/server (#4) and /v1/client (#5) upgrade tickets; tests use +// mocks. +// +// All three methods MUST be safe to call from any goroutine. ConnID and Send +// MUST NOT block on locks the registry itself takes — the registry never +// invokes Send or Close while holding its own lock, but it does call ConnID +// under the lock during UnregisterPhone, so ConnID must be a non-blocking +// getter. +type Conn interface { + // ConnID returns the relay-assigned connection identifier. Stable for + // the lifetime of the connection. The registry treats it as an opaque + // key for slice lookups and calls it under its internal lock, so the + // implementation must not block. + ConnID() string + + // Send delivers a fully-formed wire message. Implementations are + // responsible for serialising writes to the underlying socket. Errors + // are returned to the caller; the registry does not interpret them. + Send(msg []byte) error + + // Close terminates the connection. Idempotent. Safe to call + // concurrently with Send (the implementation must handle the race). + Close() +} + +// Registry maps server-ids to a single binary connection (1:1) and to a list +// of phone connections (1:N). All methods are safe to call concurrently. +// +// The registry holds Conn handles by reference and never inspects their +// internal state. It does not call Send or Close on a Conn while holding its +// own lock — broadcast / close fan-out is the caller's responsibility, +// performed against a PhonesFor snapshot. +type Registry struct { + mu sync.RWMutex + binaries map[string]Conn + phones map[string][]Conn +} + +// NewRegistry constructs an empty Registry. +func NewRegistry() *Registry { + return &Registry{ + binaries: make(map[string]Conn), + phones: make(map[string][]Conn), + } +} + +// ClaimServer registers conn as the binary for serverID. First-claim-wins: +// a second concurrent caller for the same serverID receives +// ErrServerIDConflict and the conflicting caller's conn is left untouched +// (the registry does not Close it). +// +// Use ReleaseServer to free the slot. ClaimServer does NOT inspect or modify +// the phones slice for serverID; phones registered before a release remain +// in the map until they are explicitly UnregisterPhone'd by their owner, or +// removed by a higher-layer cleanup (#8 grace period). +func (r *Registry) ClaimServer(serverID string, conn Conn) error { + r.mu.Lock() + defer r.mu.Unlock() + if _, ok := r.binaries[serverID]; ok { + return ErrServerIDConflict + } + r.binaries[serverID] = conn + return nil +} + +// ReleaseServer removes the binary entry for serverID. Returns true if a +// binary held the slot, false otherwise. Does NOT close the connection; +// the caller owns conn lifecycle. Does NOT touch the phones slice — orphan +// phones survive a release until their owner unregisters them, so the grace +// period (#8) can keep phones reachable across a binary disconnect window. +func (r *Registry) ReleaseServer(serverID string) (released bool) { + r.mu.Lock() + defer r.mu.Unlock() + if _, ok := r.binaries[serverID]; !ok { + return false + } + delete(r.binaries, serverID) + return true +} + +// RegisterPhone appends conn to the phones slice for serverID. Returns +// ErrNoServer if no binary currently holds serverID. The registry does not +// deduplicate by ConnID — the caller is responsible for not registering the +// same phone twice. +func (r *Registry) RegisterPhone(serverID string, conn Conn) error { + r.mu.Lock() + defer r.mu.Unlock() + if _, ok := r.binaries[serverID]; !ok { + return ErrNoServer + } + r.phones[serverID] = append(r.phones[serverID], conn) + return nil +} + +// UnregisterPhone removes the phone whose ConnID equals connID from the +// slice for serverID. No-op if no matching phone is present (including when +// no slice exists for serverID). Does NOT close the connection; the caller +// owns conn lifecycle. +// +// When the slice becomes empty as a result, the entry is deleted from the +// phones map so Counts and PhonesFor see no orphaned empty slices. +func (r *Registry) UnregisterPhone(serverID string, connID string) { + r.mu.Lock() + defer r.mu.Unlock() + slice, ok := r.phones[serverID] + if !ok { + return + } + for i, c := range slice { + if c.ConnID() != connID { + continue + } + slice[i] = slice[len(slice)-1] + slice[len(slice)-1] = nil + slice = slice[:len(slice)-1] + if len(slice) == 0 { + delete(r.phones, serverID) + } else { + r.phones[serverID] = slice + } + return + } +} + +// BinaryFor returns the binary holding serverID, if any. +func (r *Registry) BinaryFor(serverID string) (Conn, bool) { + r.mu.RLock() + defer r.mu.RUnlock() + c, ok := r.binaries[serverID] + return c, ok +} + +// PhonesFor returns a snapshot of the phones registered for serverID. The +// returned slice is freshly allocated; the caller may iterate, append, or +// otherwise mutate it without affecting the registry's internal state or +// holding any registry lock. Returns nil for an unknown serverID or when no +// phones are registered. +// +// The Conn handles inside the slice are the same references the registry +// holds; calling Send or Close on them affects the live connection. +func (r *Registry) PhonesFor(serverID string) []Conn { + r.mu.RLock() + defer r.mu.RUnlock() + src := r.phones[serverID] + if len(src) == 0 { + return nil + } + out := make([]Conn, len(src)) + copy(out, src) + return out +} + +// Counts returns the number of binaries currently claimed and the total +// number of phone connections summed across all server-ids. For the health +// endpoint (#10). One call is internally consistent; two concurrent calls +// may observe different values. +func (r *Registry) Counts() (binaries, phones int) { + r.mu.RLock() + defer r.mu.RUnlock() + b := len(r.binaries) + p := 0 + for _, s := range r.phones { + p += len(s) + } + return b, p +} diff --git a/internal/relay/registry_test.go b/internal/relay/registry_test.go new file mode 100644 index 0000000..b1cf0e4 --- /dev/null +++ b/internal/relay/registry_test.go @@ -0,0 +1,255 @@ +package relay + +import ( + "errors" + "fmt" + "sync" + "testing" +) + +// fakeConn is a minimal in-package Conn for tests. The registry's race +// coverage proves the registry is race-free with well-behaved Conns; it does +// not vouch for fakeConn under concurrent mutation, so tests register +// distinct fakes per goroutine. +type fakeConn struct { + id string + sent [][]byte + closed bool +} + +func (c *fakeConn) ConnID() string { return c.id } +func (c *fakeConn) Send(msg []byte) error { c.sent = append(c.sent, msg); return nil } +func (c *fakeConn) Close() { c.closed = true } + +func TestClaimServer_FirstWinsSecondConflicts(t *testing.T) { + t.Parallel() + r := NewRegistry() + + first := &fakeConn{id: "b-1"} + second := &fakeConn{id: "b-2"} + + if err := r.ClaimServer("s1", first); err != nil { + t.Fatalf("first ClaimServer: unexpected err: %v", err) + } + if err := r.ClaimServer("s1", second); !errors.Is(err, ErrServerIDConflict) { + t.Fatalf("second ClaimServer: got %v, want errors.Is(_, ErrServerIDConflict)", err) + } + got, ok := r.BinaryFor("s1") + if !ok { + t.Fatal("BinaryFor: missing entry after conflict") + } + if got.ConnID() != "b-1" { + t.Errorf("BinaryFor: got conn id %q, want %q (original)", got.ConnID(), "b-1") + } +} + +func TestReleaseServer_ReclaimSucceeds(t *testing.T) { + t.Parallel() + r := NewRegistry() + + first := &fakeConn{id: "b-1"} + second := &fakeConn{id: "b-2"} + + if err := r.ClaimServer("s1", first); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + if released := r.ReleaseServer("s1"); !released { + t.Fatal("ReleaseServer: got false, want true") + } + if err := r.ClaimServer("s1", second); err != nil { + t.Fatalf("re-ClaimServer: %v", err) + } + got, ok := r.BinaryFor("s1") + if !ok || got.ConnID() != "b-2" { + t.Errorf("BinaryFor after reclaim: got (%v, %v), want (b-2, true)", got, ok) + } + if released := r.ReleaseServer("unknown"); released { + t.Error("ReleaseServer of unheld id: got true, want false") + } +} + +func TestRegisterPhone_RequiresBinary(t *testing.T) { + t.Parallel() + r := NewRegistry() + + phone := &fakeConn{id: "p-1"} + if err := r.RegisterPhone("s1", phone); !errors.Is(err, ErrNoServer) { + t.Fatalf("RegisterPhone without binary: got %v, want errors.Is(_, ErrNoServer)", err) + } + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + if err := r.RegisterPhone("s1", phone); err != nil { + t.Fatalf("RegisterPhone after binary: unexpected err: %v", err) + } +} + +func TestUnregisterPhone_RemovesByConnID(t *testing.T) { + t.Parallel() + r := NewRegistry() + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + p1 := &fakeConn{id: "p-1"} + p2 := &fakeConn{id: "p-2"} + p3 := &fakeConn{id: "p-3"} + for _, p := range []*fakeConn{p1, p2, p3} { + if err := r.RegisterPhone("s1", p); err != nil { + t.Fatalf("RegisterPhone %q: %v", p.id, err) + } + } + + r.UnregisterPhone("s1", "p-2") + + got := r.PhonesFor("s1") + if len(got) != 2 { + t.Fatalf("PhonesFor: got len=%d, want 2", len(got)) + } + ids := map[string]bool{} + for _, c := range got { + ids[c.ConnID()] = true + } + if !ids["p-1"] || !ids["p-3"] || ids["p-2"] { + t.Errorf("PhonesFor: got ids=%v, want {p-1, p-3}", ids) + } + + r.UnregisterPhone("s1", "does-not-exist") + if got := r.PhonesFor("s1"); len(got) != 2 { + t.Errorf("PhonesFor after no-op unregister: got len=%d, want 2", len(got)) + } + r.UnregisterPhone("unknown-server", "p-1") +} + +func TestPhonesFor_SnapshotIsolation(t *testing.T) { + t.Parallel() + r := NewRegistry() + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + p1 := &fakeConn{id: "p-1"} + p2 := &fakeConn{id: "p-2"} + if err := r.RegisterPhone("s1", p1); err != nil { + t.Fatalf("RegisterPhone p1: %v", err) + } + if err := r.RegisterPhone("s1", p2); err != nil { + t.Fatalf("RegisterPhone p2: %v", err) + } + + snap := r.PhonesFor("s1") + if len(snap) != 2 { + t.Fatalf("snap len: got %d, want 2", len(snap)) + } + snap[0] = &fakeConn{id: "evil"} + + again := r.PhonesFor("s1") + if len(again) != 2 { + t.Fatalf("again len: got %d, want 2", len(again)) + } + for _, c := range again { + if c.ConnID() == "evil" { + t.Fatal("registry observed mutation through snapshot") + } + } + + if err := r.RegisterPhone("s1", &fakeConn{id: "p-3"}); err != nil { + t.Fatalf("RegisterPhone p3: %v", err) + } + if len(snap) != 2 { + t.Errorf("original snapshot length changed: got %d, want 2", len(snap)) + } +} + +func TestCounts_AcrossLifecycle(t *testing.T) { + t.Parallel() + r := NewRegistry() + + assert := func(stage string, wantB, wantP int) { + t.Helper() + b, p := r.Counts() + if b != wantB || p != wantP { + t.Errorf("%s: Counts=(%d,%d), want (%d,%d)", stage, b, p, wantB, wantP) + } + } + + assert("initial", 0, 0) + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + assert("after claim", 1, 0) + + if err := r.RegisterPhone("s1", &fakeConn{id: "p-1"}); err != nil { + t.Fatalf("RegisterPhone p1: %v", err) + } + if err := r.RegisterPhone("s1", &fakeConn{id: "p-2"}); err != nil { + t.Fatalf("RegisterPhone p2: %v", err) + } + assert("after 2 phones", 1, 2) + + r.UnregisterPhone("s1", "p-1") + assert("after unregister 1", 1, 1) + + if released := r.ReleaseServer("s1"); !released { + t.Fatal("ReleaseServer: got false, want true") + } + // Orphan phones survive ReleaseServer by design (see open question 1). + assert("after release (orphan retained)", 0, 1) + + r.UnregisterPhone("s1", "p-2") + assert("after orphan unregister", 0, 0) +} + +func TestUnregisterPhone_RemovesEmptySlice(t *testing.T) { + t.Parallel() + r := NewRegistry() + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + if err := r.RegisterPhone("s1", &fakeConn{id: "p-1"}); err != nil { + t.Fatalf("RegisterPhone: %v", err) + } + r.UnregisterPhone("s1", "p-1") + + if got := r.PhonesFor("s1"); got != nil { + t.Errorf("PhonesFor after empty: got %v, want nil", got) + } + if _, p := r.Counts(); p != 0 { + t.Errorf("phones count after empty: got %d, want 0", p) + } +} + +// TestRegistry_RaceFreedom hammers the public API from many goroutines and +// asserts the absence of DATA RACE reports under -race. +// +// Run with: go test -race -count=20 -run TestRegistry_RaceFreedom ./internal/relay +func TestRegistry_RaceFreedom(t *testing.T) { + t.Parallel() + r := NewRegistry() + + const goroutines = 32 + const opsPer = 200 + + var wg sync.WaitGroup + wg.Add(goroutines) + for g := 0; g < goroutines; g++ { + g := g + go func() { + defer wg.Done() + sid := fmt.Sprintf("s-%d", g%4) + bin := &fakeConn{id: fmt.Sprintf("b-%d", g)} + for i := 0; i < opsPer; i++ { + _ = r.ClaimServer(sid, bin) + _, _ = r.BinaryFor(sid) + _ = r.RegisterPhone(sid, &fakeConn{id: fmt.Sprintf("p-%d-%d", g, i)}) + _ = r.PhonesFor(sid) + r.UnregisterPhone(sid, fmt.Sprintf("p-%d-%d", g, i)) + _, _ = r.Counts() + _ = r.ReleaseServer(sid) + } + }() + } + wg.Wait() +} diff --git a/pyrycode-relay b/pyrycode-relay new file mode 100755 index 0000000..9fb20d4 Binary files /dev/null and b/pyrycode-relay differ