diff --git a/docs/PROJECT-MEMORY.md b/docs/PROJECT-MEMORY.md index de99412..33baeab 100644 --- a/docs/PROJECT-MEMORY.md +++ b/docs/PROJECT-MEMORY.md @@ -10,6 +10,7 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex | `http.Server` with explicit timeouts (gosec G114) | Done | `cmd/pyrycode-relay/main.go` | | Routing envelope wrapper type (`Envelope`, `Marshal`, `Unmarshal`, sentinel errors) | Done (#1) | `internal/relay/envelope.go` | | Connection registry (`Conn`, `Registry`; 1:1 binary, 1:N phones; sentinel errors for `4409`/`4404`; race-tested) | Done (#3) | `internal/relay/registry.go` | +| Registry grace-period reclaim (`ScheduleReleaseServer`; grace-aware `ClaimServer` fast path; phones closed at expiry, inherited on reclaim; pointer-identity stale-fire defence) | Done (#20) | `internal/relay/registry.go` | | TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` | | `WSConn` adapter (`nhooyr.io/websocket.Conn` → registry `Conn`; per-conn write mutex; `Close`-cancelled context; 10s `Send` deadline) | Done (#15) | `internal/relay/ws_conn.go` | | `/healthz` JSON endpoint (`status`, `version`, `connected_binaries`, `connected_phones`, `uptime_seconds`; `Cache-Control: no-store`; unauthenticated) | Done (#10) | `internal/relay/healthz.go`, `cmd/pyrycode-relay/main.go` | @@ -35,6 +36,7 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex - **Validate adversarial input *before* allocating WS state.** Header-gate checks (presence + non-empty) run before `websocket.Accept`; a malformed request never receives an upgrade response and the registry is untouched on every error path. Adopted in `/v1/server` (#16); the same shape applies to `/v1/client` token validation (#5). - **Application WS close codes are emitted on the underlying `*websocket.Conn`, not via `WSConn`.** `WSConn.Close()` always emits `StatusNormalClosure` — no policy parameters. Handlers that need `4409` / `4401` / `4404` call `c.Close(websocket.StatusCode(code), reason)` directly in the stillborn-WSConn window (after construction, before any `Send`). See [ADR-0005](knowledge/decisions/0005-application-close-codes-via-underlying-conn.md). Adopted in `/v1/server` (#16); applies to `/v1/client` (#5). - **`defer { Release; Close; log }` is registered AFTER the successful claim**, never before. A conflict path (or any pre-claim error) returns before the defer is in place, so we never `ReleaseServer` on a slot we don't hold. Adopted in `/v1/server` (#16); same structural ordering applies to `/v1/client`'s phone registration. +- **Pointer-identity for stale `time.AfterFunc` fires.** `time.Timer.Stop()` returns false if the timer's func has already started executing. Wrap each pending timer in a small struct and store the wrapper pointer in a map; the `AfterFunc` closure captures the wrapper pointer and asserts `map[key] == self` under the lock before acting. If a faster goroutine replaced the entry, the pointer no longer matches and the closure no-ops. Capturing the `*time.Timer` directly forces a self-referential local var (assigned after `AfterFunc` returns) which trips the race detector under stress; the wrapper avoids that. Adopted in `Registry.ScheduleReleaseServer` (#20). Same shape applies to any "one cancellable timer per key" pattern. 
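+A minimal generic sketch of the "one cancellable timer per key" shape (names `keyed`, `entry`, `arm` are illustrative, not the relay's types; the real instance is `Registry.ScheduleReleaseServer` in `internal/relay/registry.go`):
+
+```go
+package sketch // illustrative standalone version
+
+import (
+	"sync"
+	"time"
+)
+
+// entry's pointer identity is the cancellation token the fire path checks.
+type entry struct{ timer *time.Timer }
+
+type keyed struct {
+	mu     sync.Mutex
+	timers map[string]*entry // construct with make(map[string]*entry)
+}
+
+// arm schedules fire() for key after d, cancelling any pending schedule.
+func (k *keyed) arm(key string, d time.Duration, fire func()) {
+	k.mu.Lock()
+	defer k.mu.Unlock()
+	if old, ok := k.timers[key]; ok {
+		old.timer.Stop() // may miss an in-flight fire; the identity check below catches it
+	}
+	e := &entry{}
+	k.timers[key] = e // wrapper stored before AfterFunc, so the closure's token always exists
+	e.timer = time.AfterFunc(d, func() {
+		k.mu.Lock()
+		if k.timers[key] != e { // replaced while blocked on the lock: stale fire, no-op
+			k.mu.Unlock()
+			return
+		}
+		delete(k.timers, key)
+		k.mu.Unlock()
+		fire()
+	})
+}
+```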
- **Hold long-lived WS handlers open with `c.CloseRead(r.Context())` plus `<-readCtx.Done()` until the real read loop lands.** `CloseRead` drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close). The frame loop ticket replaces `<-readCtx.Done()` with the actual read body in the same call site. Adopted in `/v1/server` (#16) pending #6. - **Capture process-state timestamps in `main` after `flag.Parse()`, not as package-level vars.** `startedAt := time.Now()` lives inside `main` and is passed into the handler factory. A package-level `var startedAt = time.Now()` would fire at import time — before flag parsing, before `--version` early-returns — and be wrong for short-lived test binaries and any future deferred-serve setup. Adopted in #10. diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index b5a24ff..451996f 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -7,12 +7,13 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o - [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, holds the conn via `CloseRead` until #6's frame loop replaces it. - [`/healthz` JSON endpoint](features/healthz.md) — unauthenticated `GET /healthz` returning `{status, version, connected_binaries, connected_phones, uptime_seconds}`; `Cache-Control: no-store`, body bounded ≈135 bytes. - [WSConn adapter](features/ws-conn-adapter.md) — wraps `nhooyr.io/websocket.Conn` to satisfy the registry's `Conn`; owns the per-conn write mutex and a `Close`-cancelled context with a 10s per-`Send` deadline. -- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`. +- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`, deferred-release `ScheduleReleaseServer` with grace-window reclaim semantics. - [Autocert TLS](features/autocert-tls.md) — `--domain` mode wiring: `:443` WSS via Let's Encrypt + `:80` ACME http-01, host gates, cache-dir permission policy, TLS 1.2 floor. - [Routing envelope](features/routing-envelope.md) — typed Go wrapper (`Envelope`, `Marshal`, `Unmarshal`) for the `{conn_id, frame}` wire shape exchanged between relay and binary. ## Decisions +- [ADR-0006: Grace window IS the reclaim path](decisions/0006-grace-period-as-reclaim-path.md) — `ClaimServer` during a pending grace timer succeeds (not `4409`); pointer-identity wrapper defends against stale `time.AfterFunc` fires after `Stop()`. - [ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`](decisions/0005-application-close-codes-via-underlying-conn.md) — `WSConn.Close()` stays minimal (`StatusNormalClosure`); handlers emit `4409` / `4401` / `4404` directly on the `*websocket.Conn` in stillborn-WSConn paths. - [ADR-0004: WS library choice and adapter context strategy](decisions/0004-ws-library-and-adapter-context-strategy.md) — `nhooyr.io/websocket` v1.8.x; adapter-owned context cancelled by `Close`, 10s per-`Send` deadline; registry's `Send` signature stays context-free. 
- [ADR-0003: Connection registry as a passive store](decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, narrow `ReleaseServer` so grace logic (#8) can wrap it. diff --git a/docs/knowledge/decisions/0006-grace-period-as-reclaim-path.md b/docs/knowledge/decisions/0006-grace-period-as-reclaim-path.md new file mode 100644 index 0000000..f7e30b6 --- /dev/null +++ b/docs/knowledge/decisions/0006-grace-period-as-reclaim-path.md @@ -0,0 +1,73 @@ +# ADR-0006: Binary disconnect grace window IS the reclaim path + +**Status:** Accepted (#20) +**Date:** 2026-05-08 + +## Context + +The protocol spec ([`protocol-mobile.md` § Authentication → Binary → relay](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#binary--relay)) requires that when a binary's holding connection drops, its server-id slot is held for 30 seconds before being released. The motivation is operational: a transient drop followed by a quick reconnect should land the binary back in the *same* slot with its phones still attached — instead of forcing every phone to receive a no-server close and re-handshake. + +The registry built in #3 ([ADR-0003](0003-connection-registry-passive-store.md)) deliberately left grace logic out. `ReleaseServer` is narrow and immediate; orphan phones survive a release. The doc-comment foreshadowed this ticket. + +Three shapes were on the table for how the grace window should interact with a *competing* claim during the window: + +1. **Block reclaims during grace.** A reclaim within the window returns `ErrServerIDConflict` (or a new `ErrGracePending`). The original binary must wait for the timer to fire, then re-claim cleanly. +2. **Reclaim only by the original binary, by some identity match.** Compare the new conn's identity with the prior one and accept only if they match. +3. **Grace window IS the reclaim path.** Any `ClaimServer` during the window succeeds, replaces the binary `Conn` atomically, cancels the timer, and inherits the phones registered for that slot. + +## Decision + +**Option 3.** During the grace window: + +- The binary entry stays in the registry, pointing at the (now-closed) prior `Conn`. `BinaryFor` returns it; `Send` through it errors per the underlying impl's contract. +- `RegisterPhone` continues to succeed — the slot is still considered claimed. +- `ClaimServer(serverID, conn)` for the same `serverID` returns `nil`, replaces the binary `Conn`, and `Stop()`s the pending timer. Phones registered during the grace window are inherited by the new binary. + +If the timer fires (no reclaim within the window), the binary entry is removed *and* every phone for that server-id is removed and `Close()`-called outside the registry's lock. + +The new method is `ScheduleReleaseServer(serverID, d time.Duration)`. The synchronous `ReleaseServer` is unchanged. Adopted in `internal/relay/registry.go` (#20). The `/v1/server` handler call-site swap lives in #21. + +## Rationale + +### Why not block reclaims (option 1) + +The protocol spec explicitly motivates the grace window with "a quick reconnect should land the binary back in the same slot." A 30-second hard block contradicts that goal: a binary that disconnects and immediately retries — the canonical fast-reconnect case — would receive `4409` and have to back off for 30 seconds before its slot becomes available. That's worse for the user than the no-grace baseline. + +### Why not identity-match reclaims (option 2) + +The relay has no durable identity for a binary. 
The `x-pyrycode-server` header is just a string; the relay cannot distinguish "the same binary reconnecting" from "a different binary with the same configured server-id." Any identity check would either reduce to the header (which is already the lookup key) or require new credential plumbing. The protocol spec doesn't mandate identity verification at this layer; that's the binary's auth concern. + +The pragmatic answer: anyone who can pass the `/v1/server` header gate ([#16](../features/server-endpoint.md)) is already authoritative for that server-id. If a competing claim arrives during a grace window, treating it as the legitimate reclaim is the same trust model as the pre-grace baseline (where claims succeed once the slot is free). The grace window doesn't *expand* the trust surface. + +### Why option 3 is structurally clean + +- **No new error sentinel.** The grace path produces the same return shape as a normal claim (`nil`). Callers don't branch. +- **Phones survive a reclaim with no separate dance.** The registry already keeps phones across `ReleaseServer` (the orphan-phones invariant from ADR-0003). The grace path simply doesn't run the cleanup that the timer would have run; phones land on the new binary by virtue of never having been removed. +- **`Counts` stays meaningful.** A server-id in its grace window contributes `1` to the binary count and its live phone count — matching the operational mental model ("we still consider this binary connected; we are about to find out"). + +### Pointer-identity stale-fire defence + +`time.AfterFunc(d, fn)`'s `Stop()` returns false if the timer has already fired *or* its `fn` is currently executing. The "currently executing" case is the race: another goroutine takes the registry lock first (e.g. `ClaimServer`'s grace fast path), `Stop()`s the timer (returns false because `fn` already started), replaces the entry, and returns. The `fn` is meanwhile blocked on the lock. When it acquires the lock, naïvely it would delete the binary entry and close the new claimant's phones. + +The defence: every armed timer is wrapped in a `*graceEntry` and stored in `r.timers[serverID]`. The `time.AfterFunc` closure captures the `*graceEntry` pointer. On fire, the handler checks `r.timers[serverID] == self` under the lock — if `ClaimServer` (or a later `ScheduleReleaseServer`) replaced the entry, the pointer no longer matches and the handler returns without touching state. + +The wrapper indirection (`graceEntry` rather than `*time.Timer` directly) is so the `AfterFunc` closure can capture a stable pointer at construction; capturing `*time.Timer` requires the local var to be assigned *after* `AfterFunc` returns, which both creates a self-reference and trips the race detector under stress. + +### Why the duration is a parameter, not a constant + +The protocol spec says 30 seconds; the handler in #21 passes `30*time.Second`. Tests use ms-scale durations to keep the suite fast. Pushing the constant into the registry would force tests to either wait 30s or expose a knob — both worse than just taking it as an argument. The registry trusts the duration; degenerate values (`d <= 0` fires immediately, `d` near `MaxInt64` never fires) are safe. + +## Consequences + +- **The reclaim path is opt-in for callers.** `ReleaseServer` (immediate) is still there for the "claim failed, clean up" use cases that don't want a timer. Only the disconnect path in `/v1/server` (and equivalents) calls `ScheduleReleaseServer`. 
+- **Phones registered during a grace window have asymmetric outcomes.** Reclaim → they survive on the new binary. Expiry → they get `Close()`d alongside the original phones. Documented in the registry doc-comments and the feature doc; bounded only by per-server caps (#5's concern). +- **The expiry handler is the only goroutine the registry spawns.** Lifetime is bounded by `d`; the `Stop()` + pointer-identity check rules out leaks. The registry itself is still passive — the goroutine is `time.AfterFunc`'s, not a long-running registry worker. +- **Phone-side close code on grace expiry is `1000`, not protocol's `1011`.** ADR-0005 fixes `WSConn.Close()` to `StatusNormalClosure`. Adopting `1011 "binary did not reconnect"` requires extending the `Conn` interface or adding a parallel close primitive. Phones treat any close as "reconnect" — user-visible behaviour is correct. Tracked as a follow-up. +- **`Counts` does not distinguish "live" from "in grace."** Operators reading `/healthz` see the binary as still claimed during the grace window. If a future ticket needs grace-window metrics, that's a separate field on `Counts` (or a sibling). +- **`ScheduleReleaseServer` on an unheld server-id is a defensive no-op.** It arms a timer that, on fire, deletes whatever happens to be present (nothing). Same shape as `ReleaseServer("unknown")` returning `false`. Documented to head off a "binary not found" return value. + +## Related + +- [ADR-0003: Connection registry as a passive store](0003-connection-registry-passive-store.md) — the orphan-phones invariant that makes option 3 work. +- [ADR-0005: Application WS close codes](0005-application-close-codes-via-underlying-conn.md) — why `Close()` on grace expiry emits `1000` rather than the protocol's `1011`. +- [Connection registry feature](../features/connection-registry.md) — current API surface including `ScheduleReleaseServer`. diff --git a/docs/knowledge/features/connection-registry.md b/docs/knowledge/features/connection-registry.md index e38e13d..067aa5b 100644 --- a/docs/knowledge/features/connection-registry.md +++ b/docs/knowledge/features/connection-registry.md @@ -1,13 +1,13 @@ # Connection registry -The relay's routing core needs one canonical structure that owns "who is connected as what for which server-id." Every routing-layer ticket downstream — `/v1/server` upgrade (#4), `/v1/client` upgrade (#5), frame forwarding (#6), heartbeat (#7), grace period (#8), health (#10) — reads or writes through this registry. +The relay's routing core needs one canonical structure that owns "who is connected as what for which server-id." Every routing-layer ticket downstream — `/v1/server` upgrade (#16), `/v1/client` upgrade (#5), frame forwarding (#6), heartbeat (#7), grace period (#20/#21), health (#10) — reads or writes through this registry. Two non-negotiable shapes flow from `docs/architecture.md`: -1. **server-id → binary is 1:1, first-claim-wins.** A second binary claiming a held server-id is a `4409` conflict. +1. **server-id → binary is 1:1, first-claim-wins.** A second binary claiming a held server-id is a `4409` conflict — *unless* a grace timer is pending (see "Grace-period reclaim" below), in which case the second claim succeeds and replaces the binary. 2. **server-id → phones is 1:N, gated on a binary holding the slot.** A phone whose server-id has no binary is a `4404` no-server. -The registry is a **passive in-memory store**. 
It holds opaque `Conn` handles, never speaks WebSocket, never invokes `Send` / `Close` on the conns it tracks, and never times anything out. The 30-second grace period on binary disconnect (#8) wraps the registry rather than living inside it. +The registry is a **passive in-memory store**. It holds opaque `Conn` handles, never speaks WebSocket, never invokes `Send` on the conns it tracks. With the grace-period primitive added in #20 it now owns *one* timer per server-id and calls `Close()` on phones at expiry — but always outside its lock, snapshot-and-iterate. ## API @@ -26,6 +26,7 @@ func NewRegistry() *Registry func (r *Registry) ClaimServer(serverID string, conn Conn) error func (r *Registry) ReleaseServer(serverID string) (released bool) +func (r *Registry) ScheduleReleaseServer(serverID string, d time.Duration) func (r *Registry) RegisterPhone(serverID string, conn Conn) error func (r *Registry) UnregisterPhone(serverID string, connID string) func (r *Registry) BinaryFor(serverID string) (Conn, bool) @@ -42,9 +43,39 @@ Sentinel errors (branch with `errors.Is`): The mapping to close codes is informational; the registry doesn't know about WebSockets — the upgrade handlers translate. +## Grace-period reclaim (`ScheduleReleaseServer`, #20) + +`ScheduleReleaseServer(serverID, d)` arms a deferred release after duration `d`. The grace window IS the reclaim path — see [ADR-0006](../decisions/0006-grace-period-as-reclaim-path.md) for the rationale. + +While the timer is pending: + +| Operation | Behaviour during grace window | +|---|---| +| `BinaryFor(serverID)` | Returns the (now-closed) binary `Conn`. `Send` through it errors per the underlying impl's contract; the registry adds no new error semantics. | +| `RegisterPhone(serverID, conn)` | Succeeds — the binary entry is still present, only the timer is pending. | +| `ClaimServer(serverID, conn)` | Returns `nil`, **not** `ErrServerIDConflict`. The pending timer is `Stop()`'d, the binary `Conn` is replaced atomically, and any phones registered during the window are inherited by the new binary. | +| `ScheduleReleaseServer(serverID, d')` | Replaces the pending timer (last call wins). Old timer is `Stop()`'d. | +| `ReleaseServer(serverID)` | Unchanged — still removes the binary entry immediately. Does not interact with the timer. | + +On expiry (no reclaim within `d`): the binary entry is removed, the phones slice for `serverID` is snapshotted and deleted under the lock, then `Close()` is called on every phone *outside* the lock. After expiry, `RegisterPhone` returns `ErrNoServer` and `ClaimServer` succeeds normally. + +`ScheduleReleaseServer` on an unheld server-id is a defensive no-op: it arms a timer that on fire deletes whatever happens to be present (nothing). Same shape as `ReleaseServer("unknown")` returning `false`. + +### Pointer-identity stale-fire defence + +The one race the registry's `sync.RWMutex` cannot prevent: a `time.AfterFunc` whose body has already started executing cannot be `Stop()`'d. If `ClaimServer` (or a re-armed `ScheduleReleaseServer`) takes the lock before the expiry handler does, it `Stop()`s a timer whose body is already in flight, replaces the entry, and returns. The expiry-handler goroutine then acquires the lock and would naïvely tear down the new claimant's state. + +The defence: each armed timer is wrapped in a `*graceEntry`, stored in `r.timers[serverID]`, and the `AfterFunc` closure captures the entry pointer. On fire, the handler asserts `r.timers[serverID] == self` under the lock. 
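+In code, a condensed mirror of the #20 implementation (`armGrace` is a hypothetical name for this sketch; the shipped `ScheduleReleaseServer`/`handleGraceExpiry` pair also stops any prior timer, removes the binary entry, and closes the snapshotted phones outside the lock):
+
+```go
+func (r *Registry) armGrace(serverID string, d time.Duration) {
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	entry := &graceEntry{}
+	r.timers[serverID] = entry // wrapper stored before AfterFunc runs
+	entry.timer = time.AfterFunc(d, func() {
+		r.mu.Lock()
+		defer r.mu.Unlock()
+		if r.timers[serverID] != entry { // superseded by a reclaim or re-arm: stale fire
+			return
+		}
+		delete(r.timers, serverID) // the real expiry path continues the teardown here
+	})
+}
+```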
If `ClaimServer` replaced the entry, the pointer no longer matches and the handler returns without touching state. The wrapper indirection (rather than capturing `*time.Timer` directly) avoids the self-referential local var that the race detector flags under stress.
+
+### Goroutine lifecycle
+
+`time.AfterFunc` is the only goroutine spawn the registry introduces. Lifetime is bounded by `d`. Cancelled-before-fire timers (`Stop()` returns true) never spawn the goroutine. Cancelled-after-fire timers spawn the goroutine but it returns at the pointer-identity check. The `timers` map holds at most one entry per server-id; rapid disconnect/reclaim cycles do not grow it.
+
+The duration `d` is fully trusted — degenerate values (`d <= 0` fires immediately, `d` near `MaxInt64` never fires before process exit) are safe and named in the doc-comment so callers don't add bounds at the registry layer. The `/v1/server` handler in #21 supplies `30*time.Second`; tests use ms-scale.
+
 ## Concurrency model
 
-- **One `sync.RWMutex` covers both maps.** Mutating methods take `Lock`; lookups take `RLock`. Sharding is a possible later optimisation but irrelevant at v1 (tens to hundreds of conns, sub-microsecond critical sections, no measured contention).
+- **One `sync.RWMutex` covers all three maps** (`binaries`, `phones`, `timers`). Mutating methods take `Lock`; lookups take `RLock`. Sharding is a possible later optimisation but irrelevant at v1 (tens to hundreds of conns, sub-microsecond critical sections, no measured contention).
 - **Conn callbacks are never invoked under the lock,** with one documented exception: `UnregisterPhone` calls `ConnID()` while holding the write lock to scan for the target. The `Conn` contract requires `ConnID` to be a non-blocking getter.
 - **Broadcast pattern.** A caller wanting to fan out a frame to all phones for a server-id calls `PhonesFor` (returns a freshly allocated copy) and iterates the snapshot calling `Send` per conn — no registry lock held during I/O.
-- **No callbacks, no channels.** The registry is a passive store. Notification of binary loss / new phone arrival happens elsewhere (the WS upgrade handlers and #8's grace timer).
+- **No callbacks, no channels.** Notification of binary loss / new phone arrival happens elsewhere (the WS upgrade handlers); the one self-initiated action is #20's grace-expiry handler closing phones when the timer lapses.
@@ -60,10 +91,10 @@ The mapping to close codes is informational; the registry doesn't know about Web
 
 ## What the registry deliberately does NOT do
 
-- **Does not close `Conn` handles.** `ReleaseServer` and `UnregisterPhone` only remove map entries. Connection lifetime is owned by the WS upgrade handlers.
+- **Does not close `Conn` handles in the immediate-release path.** `ReleaseServer` and `UnregisterPhone` only remove map entries. Connection lifetime is owned by the WS upgrade handlers. The grace-period expiry handler is the one exception — it `Close()`s phones whose binary did not reclaim within the grace window. See [ADR-0006](../decisions/0006-grace-period-as-reclaim-path.md).
 - **Does not deduplicate phones by `ConnID`.** Caller invariant: each `Conn` is registered once.
 - **Does not bound the per-server-id phone slice.** Per-server caps belong to the WS upgrade layer (#5).
-- **Does not propagate binary loss to phones.** When `ReleaseServer` runs, the phones slice for that server-id is left in place. The owner of phones must `UnregisterPhone` (and choose whether to close them) when their server's binary is gone. This is intentional: the grace ticket (#8) needs phones to remain reachable across the 30-second disconnect window.
+- **Does not propagate binary loss to phones via `ReleaseServer`.** When `ReleaseServer` runs, the phones slice is left in place. The owner of phones must `UnregisterPhone` (and choose whether to close them) when their server's binary is gone. The deferred sibling `ScheduleReleaseServer` *does* close phones — but only if the timer expires without a reclaim. This split lets the disconnect path opt into the grace window while the synchronous path (e.g. cleanup after a failed claim) stays narrow. - **Does not validate `serverID` content.** Length, charset, prefix — owned by `x-pyrycode-server` header validation in #4. The registry treats any string as an opaque map key. ## Adversarial framing @@ -77,7 +108,7 @@ The registry has no networking and no direct exposure to adversarial input. Adve ## Testing -`internal/relay/registry_test.go`, `package relay`. Tests use a minimal `fakeConn` (id, sent buffer, closed flag — no synchronisation; the race test gives each goroutine its own fakes). +`internal/relay/registry_test.go`, `package relay`. Tests use a minimal `fakeConn` (id, sent buffer, mutex-guarded `closed` flag, optional `closeCh` to signal close synchronously). The mutex was added in #20 because the grace-expiry handler runs `Close()` on a different goroutine from the test, making `closed` cross-goroutine state. Functional cases (one subtest per AC bullet): @@ -89,7 +120,16 @@ Functional cases (one subtest per AC bullet): - `Counts` across the full lifecycle including the `ReleaseServer` orphan case (`(0, 1)` while phones survive a release). - `UnregisterPhone` deletes the empty-slice map entry. -Race coverage: one `TestRegistry_RaceFreedom` hammers `ClaimServer` / `BinaryFor` / `RegisterPhone` / `PhonesFor` / `UnregisterPhone` / `Counts` / `ReleaseServer` from 32 goroutines × 200 ops over four contended server-ids. The test deliberately ignores returned errors — `ErrServerIDConflict` under contention is expected behaviour, not a failure. The point is the absence of `DATA RACE` reports. +Grace-period cases (added in #20): + +- Reclaim within grace → no release fires; new conn is the active binary; previously-registered phones remain. +- Grace expires without reclaim → binary entry gone, phones `Close()`-called and removed; subsequent `ClaimServer` succeeds, subsequent `RegisterPhone` returns `ErrNoServer`. +- Phone registered during grace window survives reclaim; same phone is closed on expiry. +- Re-arming `ScheduleReleaseServer` for the same id stops the prior timer (last call wins); only the latest timer's expiry fires. +- `ScheduleReleaseServer` on an unheld id is a no-op. +- Race-freedom under rapid disconnect/reclaim cycles (16 goroutines × 200 iterations, ms-scale durations). + +Race coverage: `TestRegistry_RaceFreedom` (#3) hammers `ClaimServer` / `BinaryFor` / `RegisterPhone` / `PhonesFor` / `UnregisterPhone` / `Counts` / `ReleaseServer` from 32 goroutines × 200 ops over four contended server-ids. `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles` (#20) covers the timer paths analogously. Both deliberately ignore returned errors — `ErrServerIDConflict` under contention is expected behaviour, not a failure. The point is the absence of `DATA RACE` reports. `make test` runs `-race` by default. 
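+A condensed sketch of the expiry case's shape (test name illustrative; the real subtests assert more, per the list above):
+
+```go
+func TestGraceExpiry_Sketch(t *testing.T) {
+	reg := NewRegistry()
+	p := &fakeConn{id: "p1", closeCh: make(chan struct{})}
+	_ = reg.ClaimServer("s1", &fakeConn{id: "b1"})
+	_ = reg.RegisterPhone("s1", p)
+	reg.ScheduleReleaseServer("s1", 20*time.Millisecond)
+	select {
+	case <-p.closeCh: // the expiry handler closed the phone outside the lock
+	case <-time.After(time.Second):
+		t.Fatal("phone not closed before the 1s deadline")
+	}
+	if _, ok := reg.BinaryFor("s1"); ok {
+		t.Fatal("binary entry should be gone after expiry")
+	}
+}
+```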
The `-count=20` invocation in the AC is a manual stress command, documented in the test's doc comment, not a knob in the test code: @@ -99,6 +139,7 @@ go test -race -count=20 -run TestRegistry_RaceFreedom ./internal/relay ## Related -- [ADR-0003: Connection registry as a passive store](../decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, why grace logic lives outside. +- [ADR-0003: Connection registry as a passive store](../decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, the orphan-phones invariant. +- [ADR-0006: Grace window IS the reclaim path](../decisions/0006-grace-period-as-reclaim-path.md) — why a `ClaimServer` during grace succeeds rather than conflicts; the pointer-identity stale-fire defence. - [Routing envelope](routing-envelope.md) — the wrapper used by frame forwarding (#6) once the registry is wired up. - [Architecture overview](../../architecture.md) — where the registry fits in the data flow. diff --git a/docs/lessons.md b/docs/lessons.md index faca29a..5968529 100644 --- a/docs/lessons.md +++ b/docs/lessons.md @@ -2,6 +2,14 @@ Gotchas worth carrying forward. Each entry: what bit us (or nearly did), and what we do about it. +## `time.Timer.Stop()` returns false if the func has already started — a mutex alone won't save you + +`time.AfterFunc(d, fn)` fires `fn` on a runtime goroutine. Calling `t.Stop()` returns true if the timer was cancelled before firing, false if it already fired *or its `fn` is currently executing*. The "currently executing" case is a race the registry's `sync.RWMutex` cannot prevent: the AfterFunc body has started, blocked on the lock; another goroutine acquires the lock first, calls `Stop()` (gets false), replaces the map entry, releases the lock; the AfterFunc body then takes the lock and would naïvely tear down the new claimant's state. Defence: wrap each pending timer in a small struct (`*graceEntry`), store the wrapper in the map keyed by the same id, and have the AfterFunc closure capture the wrapper pointer. On fire, check `m[key] == self` under the lock — if the entry was replaced, the pointer no longer matches and the body returns. Source: `Registry.ScheduleReleaseServer` (#20). + +## Capture a wrapper pointer, not `*time.Timer`, to avoid a self-referential local var + +The natural shape — `var t *time.Timer; t = time.AfterFunc(d, func() { ... uses t ... })` — assigns `t` *after* `AfterFunc` returns, so the closure reads `t` from the outer frame. Under stress, the race detector flags this as a write/read race on the local var. Hoisting the timer pointer into a heap-allocated wrapper (`entry := &graceEntry{}; r.timers[id] = entry; entry.timer = time.AfterFunc(d, func() { ... uses entry ... })`) sidesteps the issue: the closure captures `entry`, the field assignment on `entry` is visibility-sealed by the lock the AfterFunc body must acquire. Source: `Registry.ScheduleReleaseServer` (#20). + ## A long-lived WS handler that does not read frames will never observe peer close A WebSocket connection only sees a peer-side close when *something* on this side reads from the conn — control frames (ping/pong/close) are processed inline with reads. A handler that just blocks forever (e.g. `select {}` or `<-r.Context().Done()`) keeps the goroutine alive but does not move the close machinery: the conn stays "open" until the kernel TCP timeout, and `r.Context()` does not cancel on peer close (it cancels on *server* shutdown / client TCP RST seen by the http server). 
Use `c.CloseRead(r.Context())` to spawn a drain-and-discard goroutine and block on the returned context — that goroutine processes control frames and cancels the context on close. When the real frame loop lands later, swap `<-readCtx.Done()` for the loop body. Source: `/v1/server` (#16). diff --git a/docs/specs/architecture/20-grace-period-deferred-release.md b/docs/specs/architecture/20-grace-period-deferred-release.md new file mode 100644 index 0000000..971d585 --- /dev/null +++ b/docs/specs/architecture/20-grace-period-deferred-release.md @@ -0,0 +1,347 @@ +# Spec — Registry: schedule deferred binary release with grace-period reclaim (#20) + +## Files to read first + +- `internal/relay/registry.go` — full file. The new method, the timer-state field, and the `ClaimServer` change all live here. The doc-comment block at lines 22-46 (the `Conn` interface contract) constrains how the grace-expiry handler may invoke `Close()` and is referenced again below. +- `internal/relay/registry.go:78-86` — current `ClaimServer`. The grace-aware fast path inserts before the existing `ErrServerIDConflict` check. +- `internal/relay/registry.go:93-101` — current `ReleaseServer`. Doc-comment foreshadows this ticket ("orphan phones survive a release until their owner unregisters them, so the grace period (#8) can keep phones reachable across a binary disconnect window"). Stays unchanged. +- `internal/relay/registry.go:163-173` — `PhonesFor`'s snapshot pattern (allocate, copy, return; no locks held while caller iterates). The grace-expiry handler reuses this pattern *internally* — snapshot under the lock, release the lock, then iterate `Close()`. +- `internal/relay/registry_test.go` — full file. The new tests append here; they reuse `fakeConn` (lines 14-22) but `fakeConn.closed` becomes cross-goroutine state for the first time, so the developer must add a sync primitive — see "Testing strategy" below. +- `docs/PROJECT-MEMORY.md:31-32` — established patterns: passive store guarding mutation under one RWMutex; reads return copies. The grace-expiry handler honours both. +- `docs/knowledge/decisions/0003-connection-registry-passive-store.md` — full file, especially "ReleaseServer is narrow" (lines 43-45). This ticket *adds* a deferred sibling without touching the immediate primitive. +- `pyrycode/pyrycode/docs/protocol-mobile.md` § Authentication → Binary → relay — wire-protocol source for the 30-second grace window. The relay carries the **mechanism**; the duration is a parameter the WS handler will pass in (`30*time.Second`) in #21. Tests use ms-scale durations. +- Ticket #20 issue body (re-read for the AC structure; subtest names map 1:1 to AC bullets). + +## Context + +Ticket #3 shipped a passive registry with immediate `ReleaseServer`. Ticket #20 adds a deferred sibling so the `/v1/server` handler (sibling ticket #21) can wait 30 seconds before tearing down phones, giving a transient binary disconnect a chance to reclaim its slot without forcing every phone to receive `4404` and reconnect. + +The shape adopted here is **the grace window IS the reclaim path**. While a grace timer is pending: + +- The binary entry stays in the registry (now pointing at a closed `Conn` whose `Send` will return errors — that's the existing contract; the registry adds no new error semantics). +- Phones can be registered (`RegisterPhone` succeeds), so a phone that connects during the grace window lands in the slot atomically. 
+- A new `ClaimServer` for the same server-id **succeeds and replaces the binary `Conn`** instead of returning `ErrServerIDConflict`. The pending timer is cancelled.
+
+If the timer fires (no reclaim), the binary entry is removed *and* every phone for that server-id is removed and `Close()`'d, with `Close` calls happening outside the registry's lock (snapshot-and-iterate, same shape as `PhonesFor`).
+
+This ticket ships the registry primitive only. No call sites migrate. The handler swap (`ReleaseServer` → `ScheduleReleaseServer`) lives in #21.
+
+## Design
+
+### Files
+
+Modified: `internal/relay/registry.go`, `internal/relay/registry_test.go`. No new files. No new package. No edits outside `internal/relay`.
+
+### New state on `*Registry`
+
+```go
+type Registry struct {
+	mu       sync.RWMutex
+	binaries map[string]Conn
+	phones   map[string][]Conn
+	timers   map[string]*graceEntry // NEW: pending grace-period timers, keyed by serverID
+}
+
+// graceEntry wraps a pending timer. Its pointer identity (not *time.Timer's)
+// is what the expiry handler compares against the map to detect stale fires.
+type graceEntry struct {
+	timer *time.Timer
+}
+```
+
+`NewRegistry` initialises `timers` to an empty map. The map is guarded by the same `mu` as `binaries`/`phones` — no second lock, consistent with ADR-0003's "one RWMutex over both maps" rationale (which now reads "over all three"). The wrapper rather than a bare `*time.Timer` is load-bearing: the `AfterFunc` closure must capture a pointer that exists *before* `AfterFunc` returns — see "Grace-expiry handler" below.
+
+### New method
+
+```go
+// ScheduleReleaseServer arms (or replaces) a deferred release of serverID
+// after d. The grace window is the reclaim path: until the timer fires,
+//
+//   - BinaryFor(serverID) continues to return the (now-closed) binary Conn —
+//     callers that Send through it observe whatever error the underlying
+//     impl returns.
+//   - RegisterPhone(serverID, conn) continues to succeed.
+//   - ClaimServer(serverID, conn) replaces the binary atomically and
+//     cancels the pending timer (returns nil, NOT ErrServerIDConflict).
+//
+// If a timer is already pending for serverID, it is stopped and replaced
+// (last call wins). Calling ScheduleReleaseServer for a serverID with no
+// binary holding it arms the timer anyway; on expiry it removes only
+// whatever state happens to be present (defensive, matches the no-op
+// semantics of ReleaseServer on an unheld id).
+//
+// On expiry (no reclaim within d): the binary entry for serverID is
+// removed, every phone currently registered for serverID is removed from
+// the registry and Close() is invoked on it. Close calls happen OUTSIDE
+// the registry lock (snapshot-and-iterate, same shape as PhonesFor).
+//
+// This method does not block on d; the timer fires on Go's runtime
+// goroutine for time.AfterFunc.
+func (r *Registry) ScheduleReleaseServer(serverID string, d time.Duration)
+```
+
+`ReleaseServer` (immediate) is unchanged. The two coexist — `/v1/server`'s old "synchronous release" use case (e.g. failed claim cleanup) keeps working; the grace path is opt-in.
+
+### `ClaimServer` change
+
+The current method is a check-then-set under the write lock. The new shape inserts a grace-aware fast path **before** the conflict check:
+
+```go
+func (r *Registry) ClaimServer(serverID string, conn Conn) error {
+	r.mu.Lock()
+	defer r.mu.Unlock()
+
+	if entry, gracing := r.timers[serverID]; gracing {
+		// Grace window in flight: cancel the pending release and
+		// atomically replace the binary Conn. Phones registered during
+		// the grace window stay; the new binary inherits them.
+		entry.timer.Stop()
+		delete(r.timers, serverID)
+		r.binaries[serverID] = conn
+		return nil
+	}
+	if _, ok := r.binaries[serverID]; ok {
+		return ErrServerIDConflict
+	}
+	r.binaries[serverID] = conn
+	return nil
+}
+```
+
+Note the asymmetry between `entry.timer.Stop()` and the stale-firing concern: `Stop()` returning `false` means the timer has already fired or is currently running. If the func is currently running, it is blocked on `r.mu` (we hold it). When we release the lock and the func runs, it would otherwise erroneously remove the binary we just installed and close the new claimant's phones. The expiry handler defends against this with a pointer comparison — see below.
+
+### Grace-expiry handler
+
+The closure passed to `time.AfterFunc` captures the `*graceEntry` wrapper and a back-pointer to the registry, so it can verify on fire that it has not been superseded. The wrapper exists before `AfterFunc` is called, so the closure never reads a variable assigned only after `AfterFunc` returns — the naïve `var t *time.Timer; t = time.AfterFunc(d, func() { …t… })` shape does exactly that, and the race detector flags it under stress (see `docs/lessons.md`):
+
+```go
+func (r *Registry) ScheduleReleaseServer(serverID string, d time.Duration) {
+	r.mu.Lock()
+	defer r.mu.Unlock()
+
+	if existing, ok := r.timers[serverID]; ok {
+		existing.timer.Stop()
+		delete(r.timers, serverID)
+	}
+
+	entry := &graceEntry{}
+	r.timers[serverID] = entry
+	entry.timer = time.AfterFunc(d, func() { r.handleGraceExpiry(serverID, entry) })
+}
+
+func (r *Registry) handleGraceExpiry(serverID string, self *graceEntry) {
+	r.mu.Lock()
+	if r.timers[serverID] != self {
+		// Superseded: ClaimServer or a later ScheduleReleaseServer
+		// replaced the timer between fire and us acquiring the lock.
+		// No-op: the new owner manages whatever state is present.
+		r.mu.Unlock()
+		return
+	}
+	delete(r.timers, serverID)
+	delete(r.binaries, serverID)
+	snapshot := r.phones[serverID]
+	delete(r.phones, serverID)
+	r.mu.Unlock()
+
+	for _, p := range snapshot {
+		p.Close()
+	}
+}
+```
+
+The pointer-identity check (`r.timers[serverID] != self`) is the synchronisation point. After `Stop()` runs in `ClaimServer` or `ScheduleReleaseServer`, the *next* line replaces or deletes the entry; if the AfterFunc body has already started executing and is blocked on `r.mu`, by the time it acquires the lock the map either has a different `*graceEntry` (replaced) or no entry at all. Either way, `self` no longer matches and the body returns without touching state.
+
+We do **not** need to copy `snapshot` into a fresh slice the way `PhonesFor` does. We hold the lock during `delete(r.phones, serverID)`; after the delete, no other goroutine can reach `snapshot` via the registry, so the slice is exclusively ours. Releasing the lock before iterating is purely about not holding a mutex during `Close()` calls (which may block on the network).
+
+`Close()` invocation is a no-op for an already-closed conn (existing `Conn` contract: idempotent). Binaries that disconnected and triggered the grace window typically have a closed underlying socket already; the registry doesn't care.
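+### Call-site preview (#21)
+
+For orientation only: the handler swap ships in #21 and its exact shape may differ there; what this spec fixes is the method and the `30*time.Second` the handler passes. A sketch:
+
+```go
+// onBinaryDisconnect is illustrative of /v1/server's disconnect path post-#21.
+func onBinaryDisconnect(reg *Registry, serverID string) {
+	// Pre-#21 behaviour: reg.ReleaseServer(serverID), immediate teardown.
+	// Post-#21: arm the grace window; a fast-reconnecting binary reclaims
+	// the slot via ClaimServer and inherits its phones.
+	reg.ScheduleReleaseServer(serverID, 30*time.Second)
+}
+```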
+### Methods unchanged
+
+- `ReleaseServer` — still immediate, narrow, no phone touch. Doc comment unchanged.
+- `RegisterPhone` — gates on `binaries[serverID]` presence. During a grace window the binary entry IS still present (only the timer is pending), so `RegisterPhone` returns `nil` as required by AC. After expiry, the binary entry is gone and `RegisterPhone` returns `ErrNoServer` — also required by AC.
+- `UnregisterPhone`, `BinaryFor`, `PhonesFor`, `Counts` — no changes. `Counts` continues to count whatever is in the maps; a server-id in its grace window contributes 1 to the binary count (the entry is still there) and the live count of phones, which matches the operational mental model ("we still consider this binary connected; we are about to find out").
+
+### Concurrency model
+
+- Same single `sync.RWMutex` as ADR-0003. The new `timers` map is guarded by it.
+- The expiry-handler goroutine is the *only* new goroutine introduced by this ticket. Lifetime: bounded by `d`; no possibility of leakage because (a) `time.AfterFunc` does not retain the goroutine after the func returns; (b) the func holds the lock only briefly (acquires the write lock, mutates the maps, releases, then iterates `Close()` unlocked); (c) `Close()` is the WS adapter's responsibility — the registry's contract requires `Close()` to be idempotent and safe to call concurrently with `Send`, but does NOT require it to be non-blocking. If `Close()` blocks indefinitely, only that one timer goroutine blocks with it (goroutines are not pooled); the registry has no visible state stuck in cleanup.
+- Lock graph remains a single node. The expiry handler does not call back into any other registry method; it manipulates the maps directly.
+- `Stop()` is the documented cancellation primitive for `time.AfterFunc`. Returning `false` (already fired / already stopped) is fine — the pointer-identity check downstream catches the stale-firing case.
+
+### Error handling
+
+No new sentinels. `ScheduleReleaseServer` returns nothing (it's fire-and-arm; no failure mode the caller can act on). Calling it for an unheld server-id arms a timer that on expiry does nothing — same shape as `ReleaseServer("unknown")` returning `false`. The doc comment names this explicitly so the developer doesn't add a "binary not found" return value.
+
+### What this ticket deliberately does not do
+
+- **No phone-side close code.** `Conn.Close()` emits `1000` per ADR-0005 / `WSConn` contract. The protocol spec requests `1011 "binary did not reconnect"` on the phone side; adopting that requires extending the `Conn` interface or adding a parallel close primitive. Out of scope per AC; tracked as follow-up. Phones treat any close as "reconnect" — the user-visible behaviour is correct.
+- **No handler wiring.** `/v1/server` continues to call `ReleaseServer` synchronously. #21 swaps the call site.
+- **No grace-window phone limit.** A server-id whose binary disconnected can still accept new phone registrations during the grace window. Per-server caps are #5's concern.
+- **No metrics for grace-period expiry vs reclaim.** `/healthz` `Counts` stays as-is. If we want to surface "grace pendings" later, that's a `Counts2`-style addition; not now.
+
+### Algorithm summary
+
+| Operation | Behaviour |
+|---|---|
+| `ScheduleReleaseServer(s, d)` — no pending timer | Arm timer; on fire, snapshot+delete binary+phones, then `Close()` phones outside lock. |
+| `ScheduleReleaseServer(s, d)` — pending timer exists | `Stop()` old timer; arm new one. Last call wins. |
+| `ClaimServer(s, c)` — pending timer exists | Cancel timer; replace binary; return `nil`. (Phones inherited.) |
+| `ClaimServer(s, c)` — no timer, slot held | Return `ErrServerIDConflict`. (Existing behaviour.) |
+| `ClaimServer(s, c)` — no timer, slot free | Insert binary; return `nil`. (Existing behaviour.) |
+| `RegisterPhone(s, c)` — during grace window | Succeed (binary entry still present). |
+| `RegisterPhone(s, c)` — after expiry | `ErrNoServer` (binary entry gone). |
+| Timer fires, has not been `Stop()`'d | Pointer-identity check passes; remove binary, snapshot+delete phones, `Close()` outside lock. |
+| Timer fires, has been `Stop()`'d after fire latched | Pointer-identity check fails; no-op. |
+
+## Testing strategy
+
+`internal/relay/registry_test.go`, append to existing file. Subtest names map 1:1 to AC bullets.
+
+### `fakeConn` extension
+
+`fakeConn.Close` is currently `func (c *fakeConn) Close() { c.closed = true }`. With the grace-expiry handler running on `time.AfterFunc`'s goroutine, the test's read of `c.closed` from the test goroutine becomes a data race. Fix in this ticket:
+
+```go
+type fakeConn struct {
+	id      string
+	sent    [][]byte
+	mu      sync.Mutex
+	closed  bool
+	closeCh chan struct{} // optional; allocated by tests that wait for close
+}
+
+func (c *fakeConn) Close() {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+	if c.closed {
+		return // honour idempotence
+	}
+	c.closed = true
+	if c.closeCh != nil {
+		close(c.closeCh)
+	}
+}
+
+func (c *fakeConn) isClosed() bool {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+	return c.closed
+}
+```
+
+No existing test needs changes: grepping the suite for `closed` shows only `Close()` call sites, nothing reads the flag, so the mutex-guarded form is a drop-in and is safe for the existing race test as well.
+
+### Functional tests
+
+Each uses a small grace duration (10–50ms is plenty; `time.AfterFunc` resolution is fine on macOS/Linux; CI runs Linux). Where the test waits for expiry, prefer `closeCh`-style synchronisation over `time.Sleep` to keep the suite fast and deterministic.
+
+1. **`TestScheduleReleaseServer_ReclaimWithinGrace_NoReleaseFires`** (AC: schedule + immediate reclaim → no release fires; new conn is the active binary; previously-registered phones remain.)
+   - Claim `s1` with `b1`, register `p1`.
+   - Schedule release of `s1` with `d=50ms`.
+   - Immediately `ClaimServer(s1, b2)` — expect `nil` (NOT `ErrServerIDConflict`).
+   - `BinaryFor(s1)` returns `b2`.
+   - `PhonesFor(s1)` returns `[p1]`.
+   - Sleep 100ms (longer than the original `d`).
+   - Assert `p1.isClosed() == false`. (If the timer had fired stale and bypassed the pointer check, `p1` would have been closed.)
+
+2. **`TestScheduleReleaseServer_ExpiryRemovesBinaryAndClosesPhones`** (AC: grace expires with no reclaim → binary entry gone, phones `Close()`-called and removed, subsequent `ClaimServer` succeeds, subsequent `RegisterPhone` returns `ErrNoServer`.)
+   - Claim `s1`, register `p1` and `p2` with `closeCh` allocated.
+   - Schedule release with `d=20ms`.
+   - Wait on both `p1.closeCh` and `p2.closeCh` (`select` with a 1-second test timeout per channel).
+   - Assert `BinaryFor(s1)` returns `(_, false)`.
+   - Assert `PhonesFor(s1)` returns `nil`.
+   - Assert `RegisterPhone(s1, p3)` returns `ErrNoServer`.
+   - Assert `ClaimServer(s1, b2)` returns `nil`.
+
+3. **`TestScheduleReleaseServer_PhoneRegisteredDuringGrace_SurvivesReclaim`** (AC bullet identical.)
+   - Claim `s1` with `b1`.
+   - Schedule release with `d=50ms`.
+   - During the grace window: `RegisterPhone(s1, p1)` — expect `nil`.
+   - `ClaimServer(s1, b2)` — expect `nil`.
+   - `PhonesFor(s1)` returns `[p1]`.
+
+4. **`TestScheduleReleaseServer_PhoneRegisteredDuringGrace_ClosedOnExpiry`** (AC bullet identical.)
+   - Claim `s1`.
+   - Schedule release with `d=20ms`.
+   - During the grace window: `RegisterPhone(s1, p1)` (with `closeCh`).
+   - Wait on `p1.closeCh`.
+   - Confirm `p1.isClosed() == true`.
+
+5. **`TestScheduleReleaseServer_RaceFreedomUnderRapidCycles`** (AC: rapid disconnect/reclaim cycles do not leak goroutines or pending timers; race-detector clean; `time.AfterFunc` cancellation paths exercised.)
+   - 16 goroutines × 200 iterations each, hammering a small set of server-ids (`s-0..s-3`).
+   - Each iteration: `ClaimServer`, `RegisterPhone` (best-effort), `ScheduleReleaseServer(d=1ms)` (about half fire and half get cancelled by the next claim), `ClaimServer` again, `UnregisterPhone`, `Counts`.
+   - After `wg.Wait()`, sleep ~50ms (longer than max `d`), then `Counts()` and verify the registry is in a consistent state (no panic, no `nil`-deref). Goroutine-leak detection is via `runtime.NumGoroutine()` before/after the test with a small tolerance — or skipped if too flaky; the race detector catches the data-race class, which is the more important property.
+   - Doc comment names the manual stress invocation: `go test -race -count=20 -run TestScheduleReleaseServer_RaceFreedomUnderRapidCycles ./internal/relay`.
+
+6. **`TestScheduleReleaseServer_ReplacesPendingTimer`** (AC: "Calling it again for the same `serverID` cancels and replaces any pending timer (last call wins).")
+   - Claim `s1`, register `p1` with `closeCh`.
+   - `ScheduleReleaseServer(s1, 1*time.Hour)` (effectively never fires within the test).
+   - `ScheduleReleaseServer(s1, 20ms)` — should *replace*, not add.
+   - Wait on `p1.closeCh`. The 20ms timer fires; the 1-hour timer was `Stop()`'d before its closure could run.
+   - What this proves: the stop-and-replace path works end-to-end. The replacement timer's expiry runs cleanup exactly once, and the replaced timer never acts — even if it had latched a stale fire, the pointer-identity check turns it away, and `delete` of an absent key is a no-op, so double cleanup is structurally impossible. Adequate signal.
+
+7. **`TestScheduleReleaseServer_OnUnheldID_NoOp`** (defensive; not strictly an AC but cheap.)
+   - `ScheduleReleaseServer("nope", 10ms)`.
+   - Sleep 30ms.
+   - Assert no panic; `Counts()` returns `(0, 0)`.
+
+### Tests not added
+
+- Stop-then-fire interleaving — the pointer-identity check is the test's *premise*; we cover the stale-firing path implicitly via tests 1 and 6.
+- Goroutine-leak strict counts — `runtime.NumGoroutine` is flaky under parallel tests; the race detector + the bounded-by-`d` lifetime contract is the actual enforcement.
+
+## Open questions
+
+1. **Method name: `ScheduleReleaseServer` vs `ReleaseServerAfter`.** Picked `ScheduleReleaseServer` for the verb-up-front shape — "schedule a release", with the cancellation-by-re-call semantic implicit in "schedule". `ReleaseServerAfter` reads as a fluent extension of `ReleaseServer` but elides the cancellation primitive. Either is defensible; the spec commits to `ScheduleReleaseServer`. If the developer's gut on reading the call site disagrees, raise it before implementation — renaming after the fact costs nothing here (one consumer in #21) but is annoying.
+2. **Should `ScheduleReleaseServer` return `bool` to indicate "armed vs replaced"?** No. The caller never branches on it (#21's call site is a single line in the disconnect handler). The void return matches the AC's wording ("schedules a deferred release ...
last call wins") and avoids inviting a check that doesn't add behaviour. +3. **`Counts` during grace window.** Returns the binary as still claimed and phones as still present. Documented above; no change to `Counts` contract. If `/healthz` operators need to distinguish "live" vs "in grace", that's a separate field; not now. + +## Done means + +- `internal/relay/registry.go` has the `timers` field on `Registry`, initialised in `NewRegistry`; the new `ScheduleReleaseServer` method; the `handleGraceExpiry` helper (unexported); the grace-aware fast path in `ClaimServer`. `ReleaseServer`, `RegisterPhone`, `UnregisterPhone`, `BinaryFor`, `PhonesFor`, `Counts` — unchanged. +- `internal/relay/registry_test.go` has the seven new subtests above. `fakeConn` updated with mutex-guarded `closed` and optional `closeCh`. The race test runs clean under `make test` (`-race`). +- `make vet`, `make test`, `make build` all clean. +- One commit on `feature/20`: `feat(relay): schedule deferred binary release with grace-period reclaim (#20)`. + +--- + +## Security review + +### Threat surface for THIS ticket + +This ticket extends the registry with a deferred-release mechanism. The registry remains pass-through to adversarial input — `serverID` and `Conn` handles flow in from #16's WS upgrade handler, which validates the header gate. The new attack surface is **timer behaviour under adversarial reconnect patterns**: an attacker who can drive binary disconnects and reclaims rapidly might try to leak goroutines, exhaust memory via the `timers` map, or trick the grace handler into closing phones it shouldn't. + +### Categories walked + +- **Trust boundaries.** `ScheduleReleaseServer` is called by the WS upgrade handler (#21) when the binary's read loop ends. The handler decides the grace duration (the spec says "30s from #21"); the registry trusts the duration is non-negative and finite. **Finding:** `time.AfterFunc(d, …)` with `d <= 0` fires immediately — that's safe (degenerates to "release now"), not exploitable. With `d` of e.g. `math.MaxInt64`, the timer arms but never fires before the process exits — also safe. No bound needed at the registry layer; the handler picks `30*time.Second` and that's correct. +- **Tokens, secrets, credentials.** N/A — this ticket touches no auth state. +- **File operations.** N/A. +- **Subprocess / external command execution.** N/A. +- **Cryptographic primitives.** N/A. +- **Network & I/O.** The registry has none directly. The grace-expiry handler invokes `Close()` on phone conns — that's a network operation owned by the `WSConn` adapter (which has a 10s `Send` deadline per ADR-0005, but `Close` itself is best-effort). **Finding:** `Close` blocking the AfterFunc goroutine is bounded by `WSConn`'s own behaviour. The registry's responsibility ends at "call `Close` outside the lock," which it does. SHOULD-FIX-elsewhere: if `WSConn.Close()` ever grew an unbounded blocking path, the AfterFunc goroutine would leak — but this is a property of the `Conn` impl, not the registry. Tracked by ADR-0005's existing constraints. +- **Error messages, logs, telemetry.** No new logging. No error returns from `ScheduleReleaseServer`. **Finding:** none. +- **Concurrency.** + - **Lock ordering.** Single mutex; ordering is trivial. + - **TOCTOU.** The "stop the timer, replace the entry" sequence in `ClaimServer` and `ScheduleReleaseServer` is atomic under the write lock. 
The pointer-identity check in `handleGraceExpiry` defends against the one race the lock can't prevent: a timer that has *already fired* (its goroutine started) cannot be `Stop()`'d, but is queued behind us on the lock. When it gets the lock, it finds either no entry (cancelled and replaced) or a different `*time.Timer` (replaced) and bails. **Finding:** correct; verified by tests 1 and 6. + - **Shutdown safety.** Process exit cancels nothing in `time.AfterFunc` — pending timers are dropped when the runtime tears down. That's fine; no on-disk state. + - **Goroutine lifecycle.** Each timer firing spawns one runtime-managed goroutine that runs `handleGraceExpiry`, returns. Lifetime ≤ `d` + cleanup time. Cancelled timers (via `Stop()` returning true) never spawn the goroutine. Cancelled-after-fire timers spawn the goroutine but it returns immediately at the pointer-check. **Finding:** no leak path. + - **Rapid reconnect = `timers` map growth?** Each new `ScheduleReleaseServer` for a given `serverID` `Stop()`'s and replaces the prior entry; the map cap stays at "number of distinct server-ids in grace at this instant." Cleanup: `handleGraceExpiry`'s success path calls `delete(r.timers, serverID)`; `ClaimServer`'s grace fast-path does the same. **Finding:** no unbounded growth. +- **Threat model alignment.** Protocol spec § Authentication → Binary → relay specifies the 30-second grace on the protocol level; this ticket ships the relay's mechanism. Phone-side close code (`1011`) is explicitly out of scope per the ticket's "Out of scope" section and tracked as a follow-up. Out-of-scope deferral named in the spec. + +### Adversarial framings considered + +- *"Attacker rapidly cycles binary connect/disconnect for a server-id they own — does the registry leak timers or goroutines?"* — Each cycle: claim (cancels timer if present, deletes entry); disconnect (handler calls `ScheduleReleaseServer`, arms a new timer). The map holds at most one entry per server-id. Goroutines from cancelled-after-fire timers run and exit. No leak. Race test (5) hammers this exact pattern. +- *"Attacker arms a grace timer with a huge `d` to keep a server-id occupied indefinitely."* — The attacker is the binary that *just disconnected*. They no longer hold the slot beyond the grace window's reclaim path; the entry is in pseudo-stale state but `ClaimServer` from a competing binary inside the grace window now **succeeds** (it's the reclaim path), so a competing legit binary can take the slot by completing the WS handshake. Outside the grace window, `ClaimServer` succeeds normally (entry was removed on expiry). The attacker does not gain durable squatting power. (Pre-grace squatting via `ClaimServer` and never-disconnecting is a different threat handled by header validation and is out of scope here.) +- *"Attacker triggers a phone disconnect-flood during the grace window to register many phones, then cancels the grace via reclaim — do the phones survive forever?"* — Yes, that's the AC. Phones are bounded only by the per-server cap which is #5's responsibility. The registry already documents this in ADR-0003 and PROJECT-MEMORY's "lifecycle expectations" section. Out of scope here. +- *"Attacker schedules grace for server-ids they don't hold, hoping to mass-spam timer arms."* — `ScheduleReleaseServer` is called by the WS upgrade handler, not by network input directly; an attacker would need to trigger the handler's disconnect path, which requires having connected as that server-id (passing #16's header gate and `ClaimServer`). 
+
+### Adversarial framings considered
+
+- *"Attacker rapidly cycles binary connect/disconnect for a server-id they own — does the registry leak timers or goroutines?"* — Each cycle: claim (cancels timer if present, deletes entry); disconnect (handler calls `ScheduleReleaseServer`, arms a new timer). The map holds at most one entry per server-id. Goroutines from cancelled-after-fire timers run and exit. No leak. Race test (5) hammers this exact pattern.
+- *"Attacker arms a grace timer with a huge `d` to keep a server-id occupied indefinitely."* — The attacker is the binary that *just disconnected*. Its entry lingers only as stale state: `ClaimServer` from a competing binary inside the grace window **succeeds** (that is the reclaim path), so a legitimate binary can take the slot at any time by completing the WS handshake. Outside the grace window, `ClaimServer` succeeds normally (the entry was removed on expiry). The attacker gains no durable squatting power. (Pre-grace squatting via `ClaimServer` and never disconnecting is a different threat, handled by header validation and out of scope here.)
+- *"Attacker triggers a phone disconnect-flood during the grace window to register many phones, then cancels the grace via reclaim — do the phones survive forever?"* — Yes, that's the AC. Phones are bounded only by the per-server cap, which is #5's responsibility. The registry already documents this in ADR-0003 and PROJECT-MEMORY's "lifecycle expectations" section. Out of scope here.
+- *"Attacker schedules grace for server-ids they don't hold, hoping to mass-spam timer arms."* — `ScheduleReleaseServer` is called by the WS upgrade handler, not by network input directly; an attacker would need to trigger the handler's disconnect path, which requires having connected as that server-id (passing #16's header gate and `ClaimServer`). Even if abused, the registry caps at one timer per server-id and timers stop firing after expiry — no amplification.
+- *"`Close()` panics in a phone Conn impl, killing the AfterFunc goroutine."* — The `Conn` contract specifies `Close` is idempotent and safe; a panic is a bug in the impl, not a registry concern. A panic on `time.AfterFunc`'s goroutine propagates as a normal Go panic and crashes the process. That's the right behaviour for a contract violation. The registry does not `recover()`; consistent with the rest of the codebase's "trust your interfaces" stance (ADR-0003 § "ConnID is a getter").
+- *"What if `ScheduleReleaseServer` is called for a server-id that's also currently held by a normal claim (no prior disconnect)?"* — Allowed; it arms a timer that, on expiry, deletes the binary entry and closes the phones. From the registry's view this is indistinguishable from the production path: on every disconnect, `/v1/server` calls `ScheduleReleaseServer` while the dead binary's Conn is still registered. There's no way for the registry to tell "normal disconnect" from "buggy caller scheduling release on a healthy binary"; both produce the same action. The handler is responsible for only calling `ScheduleReleaseServer` on the disconnect path — documented in #21's spec, not enforced by the registry.
+
+### Verdict: PASS
+
+The new method's API and lock discipline maintain the passive-store invariant from ADR-0003. The pointer-identity stale-firing check closes the only race the lock can't (timer-already-fired). Goroutine and map growth are bounded. No new untrusted-input boundaries are introduced — the registry continues to trust the WS handler at its API surface, and the WS handler's adversarial-input handling is reviewed in #16's spec.
+
+**Findings:**
+
+- [Trust boundaries] No findings — the duration is fully trusted from #21's caller; degenerate values (`d <= 0`, very large `d`) are safe.
+- [Network & I/O] No findings — `Close()` is invoked outside the lock; bounded by `WSConn`'s own contract. Mentioned for the developer so they don't add a deadline at the registry layer.
+- [Concurrency] No findings — the pointer-identity stale-firing check covers the one race the lock can't; map cleanup happens in both `ClaimServer` and the expiry handler.
+- [Threat model alignment] OUT OF SCOPE — phone-side close code `1011` per protocol spec; tracked as follow-up per AC.
+
+**Reviewer:** architect (self-review per `architect/security-review.md`)
+**Date:** 2026-05-08
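+
+### Appendix: expected caller shape (non-normative)
+
+The review above leans on the contract that only #21's disconnect path calls `ScheduleReleaseServer`. A sketch of that call shape, for orientation only — `serveBinary`, `readLoop`, and `binaryGrace` are illustrative names, the real wiring lands in #21, and logging/`Close` handling is elided:
+
+```go
+package relay // illustrative: the real caller is #16's /v1/server handler
+
+import "time"
+
+const binaryGrace = 30 * time.Second // grace window from the protocol spec
+
+// serveBinary sketches the disconnect path: any exit after a successful
+// claim defers the release, so a reconnecting binary can reclaim the slot
+// within binaryGrace instead of hitting a 4409 conflict.
+func serveBinary(reg *Registry, serverID string, conn Conn, readLoop func() error) error {
+	if err := reg.ClaimServer(serverID, conn); err != nil {
+		return err // conflict / pre-claim error: no defer registered, slot untouched
+	}
+	defer reg.ScheduleReleaseServer(serverID, binaryGrace)
+	return readLoop() // read loop ending (peer close, read error) triggers the defer
+}
+```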
diff --git a/internal/relay/registry.go b/internal/relay/registry.go
index f3256c7..e65254b 100644
--- a/internal/relay/registry.go
+++ b/internal/relay/registry.go
@@ -3,6 +3,7 @@ package relay
 import (
 	"errors"
 	"sync"
+	"time"
 )
 
 // Sentinel errors returned by Registry methods. Callers branch on these with
@@ -56,6 +57,16 @@ type Registry struct {
 	mu sync.RWMutex
 	binaries map[string]Conn
 	phones map[string][]Conn
+	timers map[string]*graceEntry
 }
 
+// graceEntry wraps a pending grace-period timer. Its pointer identity
+// (NOT *time.Timer's) is what the expiry handler compares against the map
+// to detect stale fires. Capturing the entry pointer in the AfterFunc
+// closure — rather than the *time.Timer assigned after AfterFunc returns —
+// avoids the race-detector flag on a self-referential local var.
+type graceEntry struct {
+	timer *time.Timer
+}
 
 // NewRegistry constructs an empty Registry.
@@ -63,6 +74,7 @@ func NewRegistry() *Registry {
 	return &Registry{
 		binaries: make(map[string]Conn),
 		phones: make(map[string][]Conn),
+		timers: make(map[string]*graceEntry),
 	}
 }
@@ -78,6 +90,16 @@ func NewRegistry() *Registry {
 func (r *Registry) ClaimServer(serverID string, conn Conn) error {
 	r.mu.Lock()
 	defer r.mu.Unlock()
+	if entry, gracing := r.timers[serverID]; gracing {
+		// Grace window in flight: cancel the pending release and atomically
+		// replace the binary Conn. Phones registered during the grace window
+		// stay; the new binary inherits them. The expiry handler defends
+		// against a concurrent stale fire via pointer-identity on r.timers.
+		entry.timer.Stop()
+		delete(r.timers, serverID)
+		r.binaries[serverID] = conn
+		return nil
+	}
 	if _, ok := r.binaries[serverID]; ok {
 		return ErrServerIDConflict
 	}
@@ -100,6 +122,67 @@ func (r *Registry) ReleaseServer(serverID string) (released bool) {
 	return true
 }
 
+// ScheduleReleaseServer arms (or replaces) a deferred release of serverID
+// after d. The grace window is the reclaim path: until the timer fires,
+//
+//   - BinaryFor(serverID) continues to return the (now-closed) binary Conn —
+//     callers that Send through it observe whatever error the underlying
+//     impl returns.
+//   - RegisterPhone(serverID, conn) continues to succeed.
+//   - ClaimServer(serverID, conn) replaces the binary atomically and
+//     cancels the pending timer (returns nil, NOT ErrServerIDConflict).
+//
+// If a timer is already pending for serverID, it is stopped and replaced
+// (last call wins). Calling ScheduleReleaseServer for a serverID with no
+// binary holding it arms the timer anyway; on expiry it removes only
+// whatever state happens to be present (defensive, matches the no-op
+// semantics of ReleaseServer on an unheld id).
+//
+// On expiry (no reclaim within d): the binary entry for serverID is
+// removed, every phone currently registered for serverID is removed from
+// the registry and Close() is invoked on it. Close calls happen OUTSIDE
+// the registry lock (snapshot-and-iterate, same shape as PhonesFor).
+//
+// This method does not block for d; when the timer fires, time.AfterFunc
+// runs the expiry handler in its own goroutine.
+func (r *Registry) ScheduleReleaseServer(serverID string, d time.Duration) {
+	r.mu.Lock()
+	defer r.mu.Unlock()
+
+	if existing, ok := r.timers[serverID]; ok {
+		existing.timer.Stop()
+		delete(r.timers, serverID)
+	}
+
+	entry := &graceEntry{}
+	r.timers[serverID] = entry
+	entry.timer = time.AfterFunc(d, func() { r.handleGraceExpiry(serverID, entry) })
+}
+
+// handleGraceExpiry runs on the time.AfterFunc goroutine when the grace
+// timer for serverID fires. The pointer-identity check on r.timers[serverID]
+// closes the one race the lock cannot: a timer whose body has already
+// started executing cannot be Stop()'d, but is queued behind us on the
+// lock. By the time it acquires the lock, ClaimServer or a later
+// ScheduleReleaseServer may have replaced the entry — in which case self
+// no longer matches and we no-op.
+func (r *Registry) handleGraceExpiry(serverID string, self *graceEntry) {
+	r.mu.Lock()
+	if r.timers[serverID] != self {
+		r.mu.Unlock()
+		return
+	}
+	delete(r.timers, serverID)
+	delete(r.binaries, serverID)
+	snapshot := r.phones[serverID]
+	delete(r.phones, serverID)
+	r.mu.Unlock()
+
+	for _, p := range snapshot {
+		p.Close()
+	}
+}
+
 // RegisterPhone appends conn to the phones slice for serverID. Returns
 // ErrNoServer if no binary currently holds serverID.
The registry does not // deduplicate by ConnID — the caller is responsible for not registering the diff --git a/internal/relay/registry_test.go b/internal/relay/registry_test.go index b1cf0e4..042c9f5 100644 --- a/internal/relay/registry_test.go +++ b/internal/relay/registry_test.go @@ -5,21 +5,44 @@ import ( "fmt" "sync" "testing" + "time" ) // fakeConn is a minimal in-package Conn for tests. The registry's race // coverage proves the registry is race-free with well-behaved Conns; it does // not vouch for fakeConn under concurrent mutation, so tests register // distinct fakes per goroutine. +// +// Close mutates closed under mu so tests on a different goroutine than the +// grace-expiry handler can safely read it via isClosed. Tests that need to +// block until Close fires allocate closeCh; Close closes it once. type fakeConn struct { - id string - sent [][]byte - closed bool + id string + sent [][]byte + mu sync.Mutex + closed bool + closeCh chan struct{} } func (c *fakeConn) ConnID() string { return c.id } func (c *fakeConn) Send(msg []byte) error { c.sent = append(c.sent, msg); return nil } -func (c *fakeConn) Close() { c.closed = true } +func (c *fakeConn) Close() { + c.mu.Lock() + defer c.mu.Unlock() + if c.closed { + return + } + c.closed = true + if c.closeCh != nil { + close(c.closeCh) + } +} + +func (c *fakeConn) isClosed() bool { + c.mu.Lock() + defer c.mu.Unlock() + return c.closed +} func TestClaimServer_FirstWinsSecondConflicts(t *testing.T) { t.Parallel() @@ -221,6 +244,212 @@ func TestUnregisterPhone_RemovesEmptySlice(t *testing.T) { } } +func TestScheduleReleaseServer_ReclaimWithinGrace_NoReleaseFires(t *testing.T) { + t.Parallel() + r := NewRegistry() + + b1 := &fakeConn{id: "b-1"} + b2 := &fakeConn{id: "b-2"} + p1 := &fakeConn{id: "p-1"} + + if err := r.ClaimServer("s1", b1); err != nil { + t.Fatalf("ClaimServer b1: %v", err) + } + if err := r.RegisterPhone("s1", p1); err != nil { + t.Fatalf("RegisterPhone p1: %v", err) + } + + r.ScheduleReleaseServer("s1", 50*time.Millisecond) + + if err := r.ClaimServer("s1", b2); err != nil { + t.Fatalf("reclaim ClaimServer: got %v, want nil", err) + } + got, ok := r.BinaryFor("s1") + if !ok || got.ConnID() != "b-2" { + t.Errorf("BinaryFor after reclaim: got (%v, %v), want (b-2, true)", got, ok) + } + phones := r.PhonesFor("s1") + if len(phones) != 1 || phones[0].ConnID() != "p-1" { + t.Errorf("PhonesFor after reclaim: got %v, want [p-1]", phones) + } + + // Sleep past the original grace window. If the timer had fired stale + // and bypassed the pointer check, p1 would now be closed. 
+ time.Sleep(100 * time.Millisecond) + + if p1.isClosed() { + t.Error("p1 was closed after reclaim — stale timer fire was not suppressed") + } + if got, ok := r.BinaryFor("s1"); !ok || got.ConnID() != "b-2" { + t.Errorf("BinaryFor after grace window elapsed: got (%v, %v), want (b-2, true)", got, ok) + } +} + +func TestScheduleReleaseServer_ExpiryRemovesBinaryAndClosesPhones(t *testing.T) { + t.Parallel() + r := NewRegistry() + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + p1 := &fakeConn{id: "p-1", closeCh: make(chan struct{})} + p2 := &fakeConn{id: "p-2", closeCh: make(chan struct{})} + if err := r.RegisterPhone("s1", p1); err != nil { + t.Fatalf("RegisterPhone p1: %v", err) + } + if err := r.RegisterPhone("s1", p2); err != nil { + t.Fatalf("RegisterPhone p2: %v", err) + } + + r.ScheduleReleaseServer("s1", 20*time.Millisecond) + + for _, p := range []*fakeConn{p1, p2} { + select { + case <-p.closeCh: + case <-time.After(time.Second): + t.Fatalf("timed out waiting for %s.Close()", p.id) + } + } + + if _, ok := r.BinaryFor("s1"); ok { + t.Error("BinaryFor: binary entry still present after grace expiry") + } + if got := r.PhonesFor("s1"); got != nil { + t.Errorf("PhonesFor after expiry: got %v, want nil", got) + } + if err := r.RegisterPhone("s1", &fakeConn{id: "p-3"}); !errors.Is(err, ErrNoServer) { + t.Errorf("RegisterPhone after expiry: got %v, want errors.Is(_, ErrNoServer)", err) + } + if err := r.ClaimServer("s1", &fakeConn{id: "b-2"}); err != nil { + t.Errorf("ClaimServer after expiry: got %v, want nil", err) + } +} + +func TestScheduleReleaseServer_PhoneRegisteredDuringGrace_SurvivesReclaim(t *testing.T) { + t.Parallel() + r := NewRegistry() + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer b1: %v", err) + } + + r.ScheduleReleaseServer("s1", 50*time.Millisecond) + + p1 := &fakeConn{id: "p-1"} + if err := r.RegisterPhone("s1", p1); err != nil { + t.Fatalf("RegisterPhone during grace: got %v, want nil", err) + } + + if err := r.ClaimServer("s1", &fakeConn{id: "b-2"}); err != nil { + t.Fatalf("reclaim ClaimServer: got %v, want nil", err) + } + + phones := r.PhonesFor("s1") + if len(phones) != 1 || phones[0].ConnID() != "p-1" { + t.Errorf("PhonesFor after reclaim: got %v, want [p-1]", phones) + } +} + +func TestScheduleReleaseServer_PhoneRegisteredDuringGrace_ClosedOnExpiry(t *testing.T) { + t.Parallel() + r := NewRegistry() + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + + r.ScheduleReleaseServer("s1", 20*time.Millisecond) + + p1 := &fakeConn{id: "p-1", closeCh: make(chan struct{})} + if err := r.RegisterPhone("s1", p1); err != nil { + t.Fatalf("RegisterPhone during grace: got %v, want nil", err) + } + + select { + case <-p1.closeCh: + case <-time.After(time.Second): + t.Fatal("timed out waiting for p1.Close() on grace expiry") + } + if !p1.isClosed() { + t.Error("p1.isClosed() = false after closeCh signalled") + } +} + +func TestScheduleReleaseServer_ReplacesPendingTimer(t *testing.T) { + t.Parallel() + r := NewRegistry() + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + p1 := &fakeConn{id: "p-1", closeCh: make(chan struct{})} + if err := r.RegisterPhone("s1", p1); err != nil { + t.Fatalf("RegisterPhone: %v", err) + } + + // Arm a long timer first, then replace with a short one. 
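+	// Stop() on the hour-long timer returns true here (it cannot have fired
+	// yet), so only the 20ms replacement can signal closeCh; if the
+	// replacement failed to arm, the 1s timeout below fails the test.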
+ r.ScheduleReleaseServer("s1", time.Hour) + r.ScheduleReleaseServer("s1", 20*time.Millisecond) + + select { + case <-p1.closeCh: + case <-time.After(time.Second): + t.Fatal("timed out waiting for replacement timer to fire") + } +} + +func TestScheduleReleaseServer_OnUnheldID_NoOp(t *testing.T) { + t.Parallel() + r := NewRegistry() + + r.ScheduleReleaseServer("nope", 10*time.Millisecond) + time.Sleep(40 * time.Millisecond) + + if b, p := r.Counts(); b != 0 || p != 0 { + t.Errorf("Counts after unheld-id schedule: got (%d,%d), want (0,0)", b, p) + } +} + +// TestScheduleReleaseServer_RaceFreedomUnderRapidCycles hammers the +// disconnect/reclaim cycle from many goroutines and asserts the absence of +// DATA RACE reports under -race. Exercises time.AfterFunc cancel/replace and +// stale-fire pointer-identity paths. +// +// Run with: go test -race -count=20 -run TestScheduleReleaseServer_RaceFreedomUnderRapidCycles ./internal/relay +func TestScheduleReleaseServer_RaceFreedomUnderRapidCycles(t *testing.T) { + t.Parallel() + r := NewRegistry() + + const goroutines = 16 + const opsPer = 200 + + var wg sync.WaitGroup + wg.Add(goroutines) + for g := 0; g < goroutines; g++ { + g := g + go func() { + defer wg.Done() + sid := fmt.Sprintf("s-%d", g%4) + for i := 0; i < opsPer; i++ { + _ = r.ClaimServer(sid, &fakeConn{id: fmt.Sprintf("b-%d-%d", g, i)}) + _ = r.RegisterPhone(sid, &fakeConn{id: fmt.Sprintf("p-%d-%d", g, i)}) + r.ScheduleReleaseServer(sid, time.Millisecond) + _ = r.ClaimServer(sid, &fakeConn{id: fmt.Sprintf("b2-%d-%d", g, i)}) + r.UnregisterPhone(sid, fmt.Sprintf("p-%d-%d", g, i)) + _, _ = r.Counts() + } + }() + } + wg.Wait() + + // Sleep past max grace duration so any in-flight timers fire and exit. + time.Sleep(50 * time.Millisecond) + + // Final state need not be empty (some claims won the cancellation race); + // we only assert no panic and that Counts is callable. + _, _ = r.Counts() +} + // TestRegistry_RaceFreedom hammers the public API from many goroutines and // asserts the absence of DATA RACE reports under -race. //