From e36e2fadf07c680e54b06bfd0f8591b00b4fd882 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Fri, 8 May 2026 22:51:23 +0300 Subject: [PATCH 1/3] spec: /v1/server defer binary release with 30s grace window (#21) --- .../21-server-endpoint-grace-release.md | 274 ++++++++++++++++++ 1 file changed, 274 insertions(+) create mode 100644 docs/specs/architecture/21-server-endpoint-grace-release.md diff --git a/docs/specs/architecture/21-server-endpoint-grace-release.md b/docs/specs/architecture/21-server-endpoint-grace-release.md new file mode 100644 index 0000000..3b4e61b --- /dev/null +++ b/docs/specs/architecture/21-server-endpoint-grace-release.md @@ -0,0 +1,274 @@ +# Spec — /v1/server: defer binary release by 30s grace window (#21) + +## Files to read first + +- `internal/relay/server_endpoint.go` — full file. The single call-site swap and the new constructor parameter both live here. The disconnect `defer` block is lines 79-83. +- `internal/relay/server_endpoint.go:79-83` — exact `defer` to modify. Preserve `wsconn.Close()` and the `server_released` log line; only the registry call changes. +- `internal/relay/registry.go:125-160` — `ScheduleReleaseServer` contract (signature, doc, semantics). Read it in full so the spec choices match the registry's actual behaviour. Note: returns nothing; arms a timer; `ClaimServer` during the grace window cancels and reclaims. +- `internal/relay/server_endpoint_test.go` — full file. The existing helper `startServer` (lines 20-27) and the four test functions all construct `ServerHandler(reg, logger)`; they all need the new third argument. The new grace test goes here too. +- `internal/relay/server_endpoint_test.go:197-248` — `TestServerEndpoint_PeerClose_ReleasesSlot`. Today it asserts release within 2 s; with the swap the release becomes scheduled, so this test must pass a *short* grace through the helper to keep the 2-second wait meaningful. +- `cmd/pyrycode-relay/main.go:50` — single production wiring site. Becomes `ServerHandler(reg, logger, 30*time.Second)`. +- `docs/lessons.md` lines 21-23 — *"`defer ReleaseServer(...)` must be registered AFTER the successful `ClaimServer`."* The new call inherits this constraint verbatim; the defer position does not move. +- `docs/specs/architecture/20-grace-period-deferred-release.md` — sibling registry spec. Especially the "What this ticket deliberately does not do" section: phone-side close code is the registry's deferral, not this ticket's. The handler does *not* pick it up here either. +- `pyrycode/pyrycode/docs/protocol-mobile.md` § Authentication → Binary → relay — wire-protocol source for the 30-second grace ("the server-id is released after a 30-second grace period"). The duration constant lives at the handler's wiring site (production) so it stays near the policy definition. + +## Context + +#20 shipped `Registry.ScheduleReleaseServer(serverID, d)` — a deferred sibling of `ReleaseServer` that arms a timer, keeps the binary entry visible during the grace window so `ClaimServer` can reclaim it, and on expiry closes orphan phones. No call sites migrated in #20 by design; the `/v1/server` handler still calls the immediate `ReleaseServer`. + +This ticket flips the handler. After this lands: + +- A binary disconnect (any reason: clean close, ping timeout, network error) leaves the server-id slot reserved for `30*time.Second`. +- A reconnect within the grace window from the same logical binary lands back in the same slot atomically (registry-side behaviour from #20). +- If no reconnect arrives, the timer fires, the binary entry is removed, and any phones registered against that server-id are closed (registry-side behaviour from #20). + +Scope is narrow: one call swap in `ServerHandler`, one grace duration threaded through the constructor, one new test, one wiring update in `main.go`. + +## Design + +### Files + +Modified: `internal/relay/server_endpoint.go`, `internal/relay/server_endpoint_test.go`, `cmd/pyrycode-relay/main.go`. No new files. No new package. + +### `ServerHandler` signature change + +```go +// ServerHandler returns the http.Handler for /v1/server. On binary +// disconnect (clean close, network error, ping timeout) the server-id slot +// is scheduled for release after `grace`; a reconnect within that window +// (#20's reclaim path) inherits the slot atomically. Production passes +// 30*time.Second per protocol spec § Authentication → Binary → relay. +// Tests pass a short duration to exercise the scheduled-release path +// without slowing the suite. +func ServerHandler(reg *Registry, logger *slog.Logger, grace time.Duration) http.Handler +``` + +Rationale for "constructor parameter" over the two alternatives PO offered: + +- **Constructor parameter (chosen).** Honest about the dependency; consistent with how `ServerHandler` already takes `reg` and `logger`; the duration is policy data and policy lives at the wiring site (`cmd/pyrycode-relay/main.go`), not buried in a package-level var. +- *Unexported package-level override.* Action-at-a-distance; tests would need a `setGracePeriod(t, d)` helper or a mutex to keep parallel tests honest. Avoid. +- *Options struct.* Premature abstraction for a one-field config. Reach for it when there's a second knob. + +The grace duration is fully trusted from the wiring site. The handler does not validate it (`d <= 0` degrades to "release now," `d` very large degrades to "never release before process exit" — both safe per #20's registry contract; see the security-review section below). + +### Disconnect-cleanup `defer` + +The current block (`internal/relay/server_endpoint.go:79-83`): + +```go +defer func() { + reg.ReleaseServer(serverID) + wsconn.Close() + logger.Info("server_released", "server_id", serverID) +}() +``` + +Becomes: + +```go +defer func() { + reg.ScheduleReleaseServer(serverID, grace) + wsconn.Close() + logger.Info("server_released", "server_id", serverID) +}() +``` + +Three observations the developer should preserve: + +1. **`defer` registration order is unchanged.** The lesson at `docs/lessons.md:21-23` still applies: register the cleanup *after* `ClaimServer` succeeds, so the conflict path doesn't arm a stray timer for a slot it never owned. The `defer` already lives below the `ClaimServer` success branch (line 79, after line 74's success log) — leave it there. +2. **`wsconn.Close()` stays.** The grace window concerns *registry state*; the underlying WebSocket is still gone (the peer closed it; we close our side too for hygiene). Leaving the WS open during the grace would leak file descriptors. The closed `Conn` remains in the registry's `binaries` map — that's #20's contract; phones that try to `Send` through it observe whatever error the WS adapter returns, and frame forwarding's reaction is #6's territory (deliberately out of scope). +3. **The `server_released` log line stays unchanged.** Despite the slight inaccuracy ("released" now means "scheduled for release"), the existing string is what operators search for and what #20's spec calls out. A rename would ripple into log queries with no observable benefit. If we ever distinguish "scheduled" from "fired," that's a separate field on the *expiry* path, not here. + +The `readCtx := c.CloseRead(r.Context()); <-readCtx.Done()` block at the end of the handler is unchanged. The `defer` runs on return from the handler — same as before. + +No other handler changes: header gate, conflict path (`ErrServerIDConflict` → 4409 close on the underlying `*websocket.Conn`), `connID` generation, `ClaimServer` invocation, success log — all untouched. + +### `cmd/pyrycode-relay/main.go` wiring + +One line change at line 50: + +```go +mux.Handle("/v1/server", relay.ServerHandler(reg, logger, 30*time.Second)) +``` + +The `time` package is already imported (line 14: `"time"`). The `30*time.Second` literal stays inline at the wiring site rather than a package-level constant: it appears exactly once, the value is policy not implementation detail, and inlining it keeps the protocol-spec linkage visible in the file that wires the relay together. + +### Concurrency model + +No new goroutines in this ticket. The handler's lifecycle is unchanged: one goroutine per upgrade (Go's `http.Server` default), `CloseRead` adds one drain goroutine on the underlying conn (existing behaviour), and the disconnect `defer` runs on the handler's goroutine. + +The grace timer's goroutine is owned by `time.AfterFunc` inside the registry — that machinery is #20's responsibility and is already exercised by the registry's test suite. The handler does not block on, observe, or reason about the timer's goroutine. + +### Error handling + +`ScheduleReleaseServer` returns nothing. The `defer` is fire-and-forget; there is no failure mode for the handler to surface. Any operational anomaly (timer didn't fire, phones not closed) is a registry-internal concern owned by #20's tests. + +### Testing strategy + +`internal/relay/server_endpoint_test.go`. Three pieces of work, in order. + +#### 1. Helper signature update + +`startServer(t)` becomes `startServer(t, grace)`: + +```go +// startServer spins up an httptest.NewServer running ServerHandler against a +// fresh registry. The grace duration is the third arg to ServerHandler; tests +// that don't exercise the disconnect path can pass any reasonable value +// (a small duration is fine — none of the non-disconnect tests wait on it). +func startServer(t *testing.T, grace time.Duration) (*Registry, string, func()) { + t.Helper() + reg := NewRegistry() + logger := slog.New(slog.NewTextHandler(io.Discard, nil)) + srv := httptest.NewServer(ServerHandler(reg, logger, grace)) + wsURL := "ws" + strings.TrimPrefix(srv.URL, "http") + return reg, wsURL, srv.Close +} +``` + +Updating call sites (3 of them in the existing file): + +- `TestServerEndpoint_ValidUpgrade_RegistersBinary` (line 45): pass `100*time.Millisecond` — test does not wait on disconnect. +- `TestServerEndpoint_DuplicateClaim_4409` (line 148): pass `100*time.Millisecond` — conflict path, no disconnect wait. +- `TestServerEndpoint_PeerClose_ReleasesSlot` (line 198): pass `100*time.Millisecond`. The existing 2-second wait for release still works (100 ms ≪ 2 s) and the test continues to verify the end-to-end "disconnect → eventual release → re-claim succeeds" property. + +The two inline `httptest.NewServer(ServerHandler(reg, logger))` sites that don't go through `startServer`: + +- `TestServerEndpoint_HeaderGate_400` (line 105): pass `100*time.Millisecond`. +- `TestServerEndpoint_WrongMethod_NoPanic` (line 289): pass `100*time.Millisecond`. + +These are mechanical edits. Total updates: 5 call sites (1 helper + 2 inline + the 2 tests that go through the helper get the change for free). + +#### 2. New test: scheduled-not-immediate release + +```go +func TestServerEndpoint_PeerClose_SchedulesGraceRelease(t *testing.T) { + grace := 200 * time.Millisecond + reg, wsURL, cleanup := startServer(t, grace) + defer cleanup() + + c, _, err := dialWith(t, wsURL, validHeaders("s1")) + if err != nil { + t.Fatalf("dial: %v", err) + } + + // Wait for the claim to land (handler runs in its own goroutine). + deadline := time.Now().Add(time.Second) + for time.Now().Before(deadline) { + if _, ok := reg.BinaryFor("s1"); ok { + break + } + time.Sleep(5 * time.Millisecond) + } + if _, ok := reg.BinaryFor("s1"); !ok { + t.Fatalf("BinaryFor(s1) not registered before close") + } + + closeAt := time.Now() + if err := c.Close(websocket.StatusNormalClosure, ""); err != nil { + t.Fatalf("client Close: %v", err) + } + + // Phase 1 (scheduled, NOT immediate): for ~grace/4 after closeAt, the + // binary entry must remain in the registry. If the handler were calling + // ReleaseServer (the pre-#21 path) instead of ScheduleReleaseServer, + // BinaryFor would flip to false within milliseconds of the handler + // observing the close. + earlyDeadline := closeAt.Add(grace / 4) + for time.Now().Before(earlyDeadline) { + if _, ok := reg.BinaryFor("s1"); !ok { + t.Fatalf("BinaryFor(s1) released at T+%v < grace=%v (handler used immediate ReleaseServer, not scheduled)", time.Since(closeAt), grace) + } + time.Sleep(5 * time.Millisecond) + } + + // Phase 2 (timer fires): within grace + slack, the binary entry must be + // gone. Registry's #20 expiry handler does the deletion. + lateDeadline := closeAt.Add(2*grace + 500*time.Millisecond) + for time.Now().Before(lateDeadline) { + if _, ok := reg.BinaryFor("s1"); !ok { + return // pass + } + time.Sleep(5 * time.Millisecond) + } + t.Fatalf("BinaryFor(s1) still registered at T+%v; expected scheduled release to have fired (grace=%v)", time.Since(closeAt), grace) +} +``` + +Why these knobs: + +- **`grace = 200 ms`.** Long enough that the early-deadline window (50 ms) is comfortably outside scheduler jitter on slow CI runners; short enough that the late-deadline window (~900 ms) keeps the suite fast. The existing `TestServerEndpoint_PeerClose_ReleasesSlot` uses 100 ms successfully on the same harness; doubling it for the assertion-rich variant is a safe margin. +- **Early window is `grace/4` (50 ms), not `grace - ε`.** With ε too small the test races the timer; `grace/4` proves "scheduled, not immediate" with adequate margin. The handler-observes-close lag is at most a few ms in `httptest`, so the binary entry's earliest possible removal (under a buggy immediate-release path) is closeAt + handler-lag (~few ms), well inside the 50 ms early window. +- **Direct registry inspection** rather than waiting on a `Conn.Close()` channel: the AC explicitly suggests this approach ("by inspecting `Registry` state directly, since `ScheduleReleaseServer` exposes no return value"). It also avoids dragging the not-yet-built `/v1/client` handler into this test. + +#### 3. AC: "Existing /v1/server tests still pass unchanged" + +The four existing tests are unchanged in *behaviour* — header gate, conflict path, and the claim/release happy path all continue to assert what they always asserted. The mechanical helper-signature update is the only edit. Verify with `go test ./internal/relay -run TestServerEndpoint`. + +### Open questions + +1. **Should the grace duration be a package-level constant `defaultGrace = 30 * time.Second` exported from the relay package?** Picked **no** — the constant lives at the policy site (`main.go`). If a second binary ever needs the same value (e.g. an integration test harness or a second wiring entry point), promote it to `package relay` then. Premature constants make grep harder. +2. **Should `ServerHandler` validate `grace > 0`?** No. The registry tolerates `d <= 0` (degenerates to immediate release on the AfterFunc goroutine — same end state, marginally different timing). Adding a panic or a clamp at the handler's API surface would be defensive code with no observed failure mode (per CLAUDE.md "evidence-based fix selection"). The wiring site is the policy, and the wiring site passes `30*time.Second`. +3. **Should the `server_released` log line gain a `grace` field?** Tempting, but: the log fires at the *moment the defer runs*, which is "scheduled," and adding `grace=30s` to every disconnect log is structured-log noise. If operators ever want to distinguish "this slot is in grace" from "this slot fully released," the right place is a new event the registry's expiry handler emits, not this log line. + +## Done means + +- `internal/relay/server_endpoint.go`'s `ServerHandler` takes `grace time.Duration` as the third parameter; the disconnect `defer` calls `reg.ScheduleReleaseServer(serverID, grace)` instead of `reg.ReleaseServer(serverID)`. `wsconn.Close()` and the `server_released` log line are unchanged. No other edits to the file. +- `internal/relay/server_endpoint_test.go`: `startServer` signature updated; all 5 `ServerHandler(...)` call sites updated; new `TestServerEndpoint_PeerClose_SchedulesGraceRelease` added. No other test changes. +- `cmd/pyrycode-relay/main.go:50` passes `30*time.Second` as the third argument to `relay.ServerHandler`. +- `make vet`, `make test`, `make build` all clean. +- One commit on `feature/21`: `feat(relay): defer binary release with 30s grace window (#21)`. + +--- + +## Security review + +### Threat surface for THIS ticket + +This ticket swaps one call site and threads one duration through the constructor. The new attack surface is whatever the swap and the constructor parameter expose: + +- A binary that was previously evicted immediately on disconnect now lingers in the registry for `grace` seconds. +- The `grace` value is passed at process startup; not adversarial input. +- The handler does not gain new code paths handling untrusted input — header gate, claim, conflict path, hold-open loop are all unchanged. + +### Categories walked + +- **Trust boundaries.** No new boundary. The `serverID` flows in from the request header (`X-Pyrycode-Server`), is gated by the existing presence/non-empty check, and is passed verbatim to `ScheduleReleaseServer`. The registry already accepts arbitrary string keys (#3) and doesn't interpret them. No finding. +- **Tokens, secrets, credentials.** N/A — this ticket touches no auth state. Token-based binary auth is a future ticket; no surface here. +- **File operations.** N/A. +- **Subprocess / external command execution.** N/A. +- **Cryptographic primitives.** N/A — `randHex8` for `connID` is unchanged and already audited in #16. +- **Network & I/O.** + - **Per-server slot retention.** A binary that disconnects holds its server-id slot for `grace` (30 s) instead of being evicted immediately. Operationally this is the protocol-spec-mandated reclaim window; adversarially it means an attacker who connects with a chosen server-id and then disconnects keeps that slot reserved against legitimate competing claims for 30 s. **Finding:** *not exploitable* — the reclaim-during-grace path lets a *competing legitimate connection succeed* (registry-side: `ClaimServer` during the grace window returns `nil` and replaces the binary). So a competing legit binary can take the slot inside the grace window by completing the WS handshake. The attacker's "squat" only delays a clean transition; it does not block legitimate access. (Compare: pre-#21 the attacker could also squat by *not disconnecting*; that's the same threat, neither better nor worse here.) + - **Server-id pool exhaustion via grace.** Could an attacker open N WS connections with N distinct server-ids, drop them all, and force the relay to retain N entries for 30 s? Yes — the `binaries` map and the `timers` map both grow proportional to "distinct server-ids in flight + in grace." The total memory cost is bounded by (a) the registry's per-entry footprint (one map entry per id; one timer; phone slices are empty for never-claimed-by-phones ids) and (b) the connection-acceptance rate. **Finding:** *Mentioned, not addressed here.* Per-IP and total-connection caps are #5/#16's territory and the AC explicitly defers them. The `ScheduleReleaseServer` shape doesn't introduce *unbounded* growth — the timer eventually fires and reclaims the entry — but it does multiply the steady-state cap by `grace × accept-rate`. CLAUDE.md "evidence-based" applies: no observed exhaustion, deferral is correct. + - **Timeouts on the WS upgrade.** Unchanged from #16/#9 (`http.Server` has the four explicit timeouts). The grace duration is unrelated to read/write deadlines. +- **Error messages, logs, telemetry.** No new error branches. The `server_released` log line is unchanged in shape and content, even though its meaning shifts slightly (now "scheduled for release"). Operators searching for the existing string keep working. **Finding:** none. +- **Concurrency.** + - **Defer ordering.** The lesson at `docs/lessons.md:21-23` still holds — the `defer` is registered after the successful `ClaimServer`, so the conflict path doesn't arm a stray grace timer. Verified the spec preserves the line position. **Finding:** none. + - **`wsconn.Close()` after `ScheduleReleaseServer`.** The order in the new `defer` body is `Schedule → Close → Log`, same as old `Release → Close → Log`. Closing the WS happens after the registry has the timer armed, so a phone trying to `Send` during the grace window observes whatever the closed `WSConn` returns (existing #15 contract). No race the registry's lock doesn't already cover. + - **Goroutine lifecycle.** The handler does not spawn anything new. The grace timer's goroutine is owned by `time.AfterFunc` and audited in #20. + - **Shutdown safety.** Process exit drops pending timers (registry has no on-disk state). A binary mid-grace at shutdown loses its slot the moment the new process starts — same as today's "binary mid-claim at shutdown" behaviour. **Finding:** none. +- **Threat model alignment.** Protocol spec § Authentication → Binary → relay specifies the 30-second grace window — this ticket implements that mechanism on the handler side. Phone-side `1011` close code is *deliberately deferred* per #20's spec (the registry's `Close()` invocation goes through the `Conn.Close()` interface which emits `1000`; emitting `1011` requires a parallel close primitive on `Conn` and is tracked as a follow-up). Out-of-scope deferral named in the spec. No further misalignment. + +### Adversarial framings considered + +- *"Attacker connects as server-id `X`, disconnects, reconnects within 30s — does the relay treat this as a fresh claim or a reclaim?"* — Reclaim, by design (#20). The new binary inherits the slot. This is the AC; not exploitable, it's the feature. +- *"Attacker connects as server-id `X`, disconnects, immediately tries to connect as a *different* server-id `Y` — does the grace on `X` interfere?"* — No. `X` and `Y` are independent map keys. The grace timer on `X` is unrelated to claims on `Y`. +- *"Attacker rapidly connect-disconnects with random server-ids to flood the `timers` map."* — Bounded by `grace × accept-rate`. The accept-rate cap (per-IP / total) is not in this ticket; it's #16's continuation territory and #5's per-server cap. CLAUDE.md's "evidence-based" framing applies: no observed flood, no defensive code added here. Mentioned in the spec so the developer doesn't ad-hoc a cap. +- *"Attacker passes a malicious value through the `grace` parameter."* — They can't. `grace` is set at process startup from a hard-coded literal in `main.go`; it never crosses the network boundary. A test that passes `0` or a negative value would degenerate the registry to "fire immediately"; safe per #20's contract. +- *"Race between client close and a competing `ClaimServer`."* — The competing claim either: (a) arrives before the disconnect `defer` runs, in which case it observes the slot still claimed by the first binary and gets `ErrServerIDConflict` (correct, that's the conflict path); (b) arrives after the `defer` runs, in which case it observes the timer-pending state and reclaims (correct, that's #20's reclaim path). The handler does no extra coordination here — the registry's lock orders the operations. +- *"What if `grace = 0` slips into production via a config typo?"* — `30*time.Second` is a literal in `main.go`, not a config field. There is no path to inject `0` short of editing the source. If/when a CLI flag for the duration is added, *that* ticket gets the validation; not this one. + +### Verdict: PASS + +The call swap inherits #20's well-tested registry-side semantics. No new untrusted-input boundary is introduced. The handler does not validate the grace duration because the duration is fully trusted (compile-time literal at the wiring site) and the registry tolerates degenerate values safely. The one concrete operational property worth flagging — `binaries` and `timers` map growth scales as `grace × accept-rate` — is named explicitly and tracked to the per-IP/total-cap tickets that own it. + +**Findings:** + +- [Trust boundaries] No findings — the only data crossing a boundary is `serverID`, gated by the existing header check and accepted by the registry as an opaque string key. +- [Network & I/O] OUT OF SCOPE — slot retention multiplies steady-state map size by `grace × accept-rate`. Per-IP and total-connection caps are #5/#16 territory and explicitly deferred by this ticket's "Out of scope" section. +- [Concurrency] No findings — defer position preserves the `docs/lessons.md:21-23` invariant; `Schedule → Close → Log` order is consistent with the old `Release → Close → Log` order; no new goroutines. +- [Threat model alignment] OUT OF SCOPE — phone-side `1011` close code per protocol spec is tracked as a follow-up to #20, not picked up here. + +**Reviewer:** architect (self-review per `architect/security-review.md`) +**Date:** 2026-05-08 From a6175a01352a58cfcc0576f11a806833dfe95922 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Fri, 8 May 2026 22:53:48 +0300 Subject: [PATCH 2/3] feat(relay): defer binary release with 30s grace window (#21) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Swap the /v1/server disconnect-cleanup defer from the immediate Registry.ReleaseServer to ScheduleReleaseServer, threading a grace duration through ServerHandler's constructor. Production wires 30*time.Second per protocol spec § Authentication → Binary → relay; tests use short durations to exercise the scheduled-release path without slowing the suite. A reconnect within the grace window now lands in the same server-id slot atomically (registry-side reclaim from #20). On expiry, orphan phones for that server-id are closed by the registry's expiry handler. Adds TestServerEndpoint_PeerClose_SchedulesGraceRelease asserting the binary entry persists for grace/4 after close (proves scheduled, not immediate) and is gone within grace + slack (proves the timer fires). Co-Authored-By: Claude Opus 4.7 --- cmd/pyrycode-relay/main.go | 2 +- internal/relay/server_endpoint.go | 11 ++++- internal/relay/server_endpoint_test.go | 66 ++++++++++++++++++++++---- 3 files changed, 68 insertions(+), 11 deletions(-) diff --git a/cmd/pyrycode-relay/main.go b/cmd/pyrycode-relay/main.go index 5b1083f..e6164b9 100644 --- a/cmd/pyrycode-relay/main.go +++ b/cmd/pyrycode-relay/main.go @@ -47,7 +47,7 @@ func main() { mux := http.NewServeMux() mux.Handle("/healthz", relay.NewHealthzHandler(reg, Version, startedAt)) - mux.Handle("/v1/server", relay.ServerHandler(reg, logger)) + mux.Handle("/v1/server", relay.ServerHandler(reg, logger, 30*time.Second)) if *insecureListen != "" { logger.Info("starting", "version", Version, "mode", "insecure", "listen", *insecureListen) diff --git a/internal/relay/server_endpoint.go b/internal/relay/server_endpoint.go index 970cb3f..1aa43e9 100644 --- a/internal/relay/server_endpoint.go +++ b/internal/relay/server_endpoint.go @@ -16,6 +16,7 @@ import ( "log/slog" "net" "net/http" + "time" "nhooyr.io/websocket" ) @@ -24,7 +25,13 @@ import ( // WebSocket upgrade endpoint. It validates required headers, upgrades the // connection, claims the server-id slot in reg, and holds the connection // open until the binary closes it (or the slot is released). -func ServerHandler(reg *Registry, logger *slog.Logger) http.Handler { +// +// On binary disconnect (clean close, network error, ping timeout) the +// server-id slot is scheduled for release after grace; a reconnect within +// that window inherits the slot atomically (registry-side reclaim path, +// see Registry.ScheduleReleaseServer). Production passes 30*time.Second +// per protocol spec § Authentication → Binary → relay. +func ServerHandler(reg *Registry, logger *slog.Logger, grace time.Duration) http.Handler { return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { serverID := r.Header.Get("X-Pyrycode-Server") versionHeader := r.Header.Get("X-Pyrycode-Version") @@ -77,7 +84,7 @@ func ServerHandler(reg *Registry, logger *slog.Logger) http.Handler { "remote", remoteHost(r)) defer func() { - reg.ReleaseServer(serverID) + reg.ScheduleReleaseServer(serverID, grace) wsconn.Close() logger.Info("server_released", "server_id", serverID) }() diff --git a/internal/relay/server_endpoint_test.go b/internal/relay/server_endpoint_test.go index 3ce2358..af4688a 100644 --- a/internal/relay/server_endpoint_test.go +++ b/internal/relay/server_endpoint_test.go @@ -16,12 +16,13 @@ import ( ) // startServer spins up an httptest.NewServer running ServerHandler against a -// fresh registry. It returns the registry, the WS URL, and a cleanup. -func startServer(t *testing.T) (*Registry, string, func()) { +// fresh registry. The grace duration is the third arg to ServerHandler; +// tests that don't exercise the disconnect path can pass a small duration. +func startServer(t *testing.T, grace time.Duration) (*Registry, string, func()) { t.Helper() reg := NewRegistry() logger := slog.New(slog.NewTextHandler(io.Discard, nil)) - srv := httptest.NewServer(ServerHandler(reg, logger)) + srv := httptest.NewServer(ServerHandler(reg, logger, grace)) wsURL := "ws" + strings.TrimPrefix(srv.URL, "http") return reg, wsURL, srv.Close } @@ -42,7 +43,7 @@ func validHeaders(serverID string) http.Header { } func TestServerEndpoint_ValidUpgrade_RegistersBinary(t *testing.T) { - reg, wsURL, cleanup := startServer(t) + reg, wsURL, cleanup := startServer(t, 100*time.Millisecond) defer cleanup() c, _, err := dialWith(t, wsURL, validHeaders("s1")) @@ -102,7 +103,7 @@ func TestServerEndpoint_HeaderGate_400(t *testing.T) { t.Run(tc.name, func(t *testing.T) { reg := NewRegistry() logger := slog.New(slog.NewTextHandler(io.Discard, nil)) - srv := httptest.NewServer(ServerHandler(reg, logger)) + srv := httptest.NewServer(ServerHandler(reg, logger, 100*time.Millisecond)) defer srv.Close() req, err := http.NewRequest(http.MethodGet, srv.URL+"/", nil) @@ -145,7 +146,7 @@ func TestServerEndpoint_HeaderGate_400(t *testing.T) { } func TestServerEndpoint_DuplicateClaim_4409(t *testing.T) { - reg, wsURL, cleanup := startServer(t) + reg, wsURL, cleanup := startServer(t, 100*time.Millisecond) defer cleanup() c1, _, err := dialWith(t, wsURL, validHeaders("s1")) @@ -195,7 +196,7 @@ func TestServerEndpoint_DuplicateClaim_4409(t *testing.T) { } func TestServerEndpoint_PeerClose_ReleasesSlot(t *testing.T) { - reg, wsURL, cleanup := startServer(t) + reg, wsURL, cleanup := startServer(t, 100*time.Millisecond) defer cleanup() c, _, err := dialWith(t, wsURL, validHeaders("s1")) @@ -247,6 +248,55 @@ func TestServerEndpoint_PeerClose_ReleasesSlot(t *testing.T) { t.Fatalf("re-claim did not register") } +func TestServerEndpoint_PeerClose_SchedulesGraceRelease(t *testing.T) { + grace := 200 * time.Millisecond + reg, wsURL, cleanup := startServer(t, grace) + defer cleanup() + + c, _, err := dialWith(t, wsURL, validHeaders("s1")) + if err != nil { + t.Fatalf("dial: %v", err) + } + + deadline := time.Now().Add(time.Second) + for time.Now().Before(deadline) { + if _, ok := reg.BinaryFor("s1"); ok { + break + } + time.Sleep(5 * time.Millisecond) + } + if _, ok := reg.BinaryFor("s1"); !ok { + t.Fatalf("BinaryFor(s1) not registered before close") + } + + closeAt := time.Now() + if err := c.Close(websocket.StatusNormalClosure, ""); err != nil { + t.Fatalf("client Close: %v", err) + } + + // Phase 1: scheduled, NOT immediate. For ~grace/4 after closeAt the + // binary entry must remain in the registry. The pre-#21 path (immediate + // ReleaseServer) would flip BinaryFor to false within milliseconds of + // the handler observing the close. + earlyDeadline := closeAt.Add(grace / 4) + for time.Now().Before(earlyDeadline) { + if _, ok := reg.BinaryFor("s1"); !ok { + t.Fatalf("BinaryFor(s1) released at T+%v < grace=%v (handler used immediate ReleaseServer, not scheduled)", time.Since(closeAt), grace) + } + time.Sleep(5 * time.Millisecond) + } + + // Phase 2: timer fires within grace + slack. + lateDeadline := closeAt.Add(2*grace + 500*time.Millisecond) + for time.Now().Before(lateDeadline) { + if _, ok := reg.BinaryFor("s1"); !ok { + return + } + time.Sleep(5 * time.Millisecond) + } + t.Fatalf("BinaryFor(s1) still registered at T+%v; expected scheduled release to have fired (grace=%v)", time.Since(closeAt), grace) +} + func TestServerEndpoint_WrongMethod_NoPanic(t *testing.T) { cases := []struct { name string @@ -286,7 +336,7 @@ func TestServerEndpoint_WrongMethod_NoPanic(t *testing.T) { t.Run(tc.name, func(t *testing.T) { reg := NewRegistry() logger := slog.New(slog.NewTextHandler(io.Discard, nil)) - srv := httptest.NewServer(ServerHandler(reg, logger)) + srv := httptest.NewServer(ServerHandler(reg, logger, 100*time.Millisecond)) defer srv.Close() resp, err := tc.do(srv.URL) From 9e6b5d765bc1e38e37d3e911653d71297db08256 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Fri, 8 May 2026 22:59:54 +0300 Subject: [PATCH 3/3] docs: /v1/server grace-window deferred release (#21) Update server-endpoint feature doc, PROJECT-MEMORY, lessons, and INDEX to reflect the constructor-parameter grace duration, the ScheduleReleaseServer swap on disconnect, and the sharpened defer-after-claim invariant. --- docs/PROJECT-MEMORY.md | 5 ++- docs/knowledge/INDEX.md | 4 +- .../knowledge/features/connection-registry.md | 2 +- docs/knowledge/features/server-endpoint.md | 39 ++++++++++++------- docs/lessons.md | 4 +- 5 files changed, 33 insertions(+), 21 deletions(-) diff --git a/docs/PROJECT-MEMORY.md b/docs/PROJECT-MEMORY.md index 33baeab..19a9823 100644 --- a/docs/PROJECT-MEMORY.md +++ b/docs/PROJECT-MEMORY.md @@ -14,7 +14,7 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex | TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` | | `WSConn` adapter (`nhooyr.io/websocket.Conn` → registry `Conn`; per-conn write mutex; `Close`-cancelled context; 10s `Send` deadline) | Done (#15) | `internal/relay/ws_conn.go` | | `/healthz` JSON endpoint (`status`, `version`, `connected_binaries`, `connected_phones`, `uptime_seconds`; `Cache-Control: no-store`; unauthenticated) | Done (#10) | `internal/relay/healthz.go`, `cmd/pyrycode-relay/main.go` | -| WS upgrade on `/v1/server` (header gate pre-upgrade; `ClaimServer`; `4409` on conflict; `CloseRead`-held until #6) | Done (#16) | `internal/relay/server_endpoint.go`, `cmd/pyrycode-relay/main.go` | +| WS upgrade on `/v1/server` (header gate pre-upgrade; `ClaimServer`; `4409` on conflict; `CloseRead`-held until #6; disconnect defers a 30s grace release via `ScheduleReleaseServer`) | Done (#16, #21) | `internal/relay/server_endpoint.go`, `cmd/pyrycode-relay/main.go` | | WS upgrade on `/v1/client` | Not started | — | | Header validation (`x-pyrycode-server`, `x-pyrycode-version`, `user-agent` on `/v1/server`; `x-pyrycode-token` on `/v1/client`) | `/v1/server` done (#16); `/v1/client` not started | `internal/relay/server_endpoint.go` | | Frame forwarding using the routing envelope | Not started | — | @@ -35,7 +35,8 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex - **Handler factories return `http.Handler`, not `http.HandlerFunc`.** `NewHealthzHandler(reg, version, startedAt) http.Handler` keeps construction factory-shaped so a future ticket adding per-handler state (logger, injectable clock) can do so without touching the call site in `main`. Adopted in `/healthz` (#10). - **Validate adversarial input *before* allocating WS state.** Header-gate checks (presence + non-empty) run before `websocket.Accept`; a malformed request never receives an upgrade response and the registry is untouched on every error path. Adopted in `/v1/server` (#16); the same shape applies to `/v1/client` token validation (#5). - **Application WS close codes are emitted on the underlying `*websocket.Conn`, not via `WSConn`.** `WSConn.Close()` always emits `StatusNormalClosure` — no policy parameters. Handlers that need `4409` / `4401` / `4404` call `c.Close(websocket.StatusCode(code), reason)` directly in the stillborn-WSConn window (after construction, before any `Send`). See [ADR-0005](knowledge/decisions/0005-application-close-codes-via-underlying-conn.md). Adopted in `/v1/server` (#16); applies to `/v1/client` (#5). -- **`defer { Release; Close; log }` is registered AFTER the successful claim**, never before. A conflict path (or any pre-claim error) returns before the defer is in place, so we never `ReleaseServer` on a slot we don't hold. Adopted in `/v1/server` (#16); same structural ordering applies to `/v1/client`'s phone registration. +- **`defer { ScheduleRelease; Close; log }` is registered AFTER the successful claim**, never before. A conflict path (or any pre-claim error) returns before the defer is in place, so we never schedule a release on a slot we don't hold. Originally `ReleaseServer` (#16); swapped to `ScheduleReleaseServer(serverID, grace)` in #21 so a quick reconnect lands back in the same slot. The structural rule sharpened in #21 — pre-grace this was a benign no-op when violated; now violating it would arm a stray timer for an id the handler never owned. Same ordering applies to `/v1/client`'s phone registration. +- **Policy values live at the wiring site, not as package-level constants.** `30*time.Second` for the grace window is a literal in `cmd/pyrycode-relay/main.go`, threaded into `ServerHandler` as a constructor parameter — it appears exactly once, the value is policy (matches protocol spec), and inlining keeps the protocol-spec linkage visible where the relay is composed. Tests pass ms-scale durations through the same parameter. Promote to a package-level constant only when a second wiring entry point needs the same value. Adopted in `/v1/server` grace duration (#21). - **Pointer-identity for stale `time.AfterFunc` fires.** `time.Timer.Stop()` returns false if the timer's func has already started executing. Wrap each pending timer in a small struct and store the wrapper pointer in a map; the `AfterFunc` closure captures the wrapper pointer and asserts `map[key] == self` under the lock before acting. If a faster goroutine replaced the entry, the pointer no longer matches and the closure no-ops. Capturing the `*time.Timer` directly forces a self-referential local var (assigned after `AfterFunc` returns) which trips the race detector under stress; the wrapper avoids that. Adopted in `Registry.ScheduleReleaseServer` (#20). Same shape applies to any "one cancellable timer per key" pattern. - **Hold long-lived WS handlers open with `c.CloseRead(r.Context())` plus `<-readCtx.Done()` until the real read loop lands.** `CloseRead` drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close). The frame loop ticket replaces `<-readCtx.Done()` with the actual read body in the same call site. Adopted in `/v1/server` (#16) pending #6. - **Capture process-state timestamps in `main` after `flag.Parse()`, not as package-level vars.** `startedAt := time.Now()` lives inside `main` and is passed into the handler factory. A package-level `var startedAt = time.Now()` would fire at import time — before flag parsing, before `--version` early-returns — and be wrong for short-lived test binaries and any future deferred-serve setup. Adopted in #10. diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index 451996f..60ca3e5 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -4,7 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Features -- [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, holds the conn via `CloseRead` until #6's frame loop replaces it. +- [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, holds the conn via `CloseRead` until #6's frame loop replaces it; on disconnect schedules a 30s grace release so a quick reconnect inherits the slot (#21). - [`/healthz` JSON endpoint](features/healthz.md) — unauthenticated `GET /healthz` returning `{status, version, connected_binaries, connected_phones, uptime_seconds}`; `Cache-Control: no-store`, body bounded ≈135 bytes. - [WSConn adapter](features/ws-conn-adapter.md) — wraps `nhooyr.io/websocket.Conn` to satisfy the registry's `Conn`; owns the per-conn write mutex and a `Close`-cancelled context with a 10s per-`Send` deadline. - [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`, deferred-release `ScheduleReleaseServer` with grace-window reclaim semantics. @@ -16,7 +16,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o - [ADR-0006: Grace window IS the reclaim path](decisions/0006-grace-period-as-reclaim-path.md) — `ClaimServer` during a pending grace timer succeeds (not `4409`); pointer-identity wrapper defends against stale `time.AfterFunc` fires after `Stop()`. - [ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`](decisions/0005-application-close-codes-via-underlying-conn.md) — `WSConn.Close()` stays minimal (`StatusNormalClosure`); handlers emit `4409` / `4401` / `4404` directly on the `*websocket.Conn` in stillborn-WSConn paths. - [ADR-0004: WS library choice and adapter context strategy](decisions/0004-ws-library-and-adapter-context-strategy.md) — `nhooyr.io/websocket` v1.8.x; adapter-owned context cancelled by `Close`, 10s per-`Send` deadline; registry's `Send` signature stays context-free. -- [ADR-0003: Connection registry as a passive store](decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, narrow `ReleaseServer` so grace logic (#8) can wrap it. +- [ADR-0003: Connection registry as a passive store](decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, narrow `ReleaseServer` so grace logic (#20/#21) can wrap it. - [ADR-0002: Explicit-failure 404 on `:80` instead of HTTPS redirect](decisions/0002-autocert-explicit-failure-on-port-80.md) — pass `http.NotFoundHandler()` to `autocert.Manager.HTTPHandler`, not `nil`; loud failure over silent scheme-switching. - [ADR-0001: Routing envelope shape and opacity](decisions/0001-routing-envelope-shape-and-opacity.md) — `json.RawMessage` for the inner frame; sentinel errors; validate at boundary, never parse payloads. diff --git a/docs/knowledge/features/connection-registry.md b/docs/knowledge/features/connection-registry.md index 067aa5b..a302317 100644 --- a/docs/knowledge/features/connection-registry.md +++ b/docs/knowledge/features/connection-registry.md @@ -78,7 +78,7 @@ The duration `d` is fully trusted — degenerate values (`d <= 0` fires immediat - **One `sync.RWMutex` covers all three maps** (`binaries`, `phones`, `timers`). Mutating methods take `Lock`; lookups take `RLock`. Sharding is a possible later optimisation but irrelevant at v1 (tens to hundreds of conns, sub-microsecond critical sections, no measured contention). - **Conn callbacks are never invoked under the lock,** with one documented exception: `UnregisterPhone` calls `ConnID()` while holding the write lock to scan for the target. The `Conn` contract requires `ConnID` to be a non-blocking getter. - **Broadcast pattern.** A caller wanting to fan out a frame to all phones for a server-id calls `PhonesFor` (returns a freshly allocated copy) and iterates the snapshot calling `Send` per conn — no registry lock held during I/O. -- **No callbacks, no channels.** The registry is a passive store. Notification of binary loss / new phone arrival happens elsewhere (the WS upgrade handlers and #8's grace timer). +- **No callbacks, no channels.** The registry is a passive store. Notification of binary loss / new phone arrival happens elsewhere (the WS upgrade handlers and the grace timer the registry arms in `ScheduleReleaseServer`). ## Invariants the registry enforces diff --git a/docs/knowledge/features/server-endpoint.md b/docs/knowledge/features/server-endpoint.md index 216fa16..e1ee028 100644 --- a/docs/knowledge/features/server-endpoint.md +++ b/docs/knowledge/features/server-endpoint.md @@ -1,8 +1,8 @@ # `/v1/server` — binary-side WebSocket upgrade -`/v1/server` is the relay's ingress for pyry binaries. A binary opens an outbound WSS to the relay, sends three required headers, and — if no other binary holds the same `serverID` — gets registered in the connection registry and held there until the connection ends. The endpoint enforces first-claim-wins via `Registry.ClaimServer`; a duplicate claim is closed with WS code `4409`. +`/v1/server` is the relay's ingress for pyry binaries. A binary opens an outbound WSS to the relay, sends three required headers, and — if no other binary holds the same `serverID` — gets registered in the connection registry and held there until the connection ends. The endpoint enforces first-claim-wins via `Registry.ClaimServer`; a duplicate claim is closed with WS code `4409`. On disconnect (clean close, network error, ping timeout), the slot is scheduled for release after a 30-second grace window so a quick reconnect lands back in the same slot with phones still attached. -This is the binary side only. Phone ingress (`/v1/client`) is #5; frame forwarding is #6; heartbeat is #7; the 30-second grace period on disconnect is #8. The handler currently holds the connection open by draining-and-discarding frames — #6 swaps in the real read loop in the same call site. +This is the binary side only. Phone ingress (`/v1/client`) is #5; frame forwarding is #6; heartbeat is #7. The handler currently holds the connection open by draining-and-discarding frames — #6 swaps in the real read loop in the same call site. ## Wire shape @@ -40,15 +40,17 @@ User-Agent: Package `internal/relay` (`server_endpoint.go`): ```go -func ServerHandler(reg *Registry, logger *slog.Logger) http.Handler +func ServerHandler(reg *Registry, logger *slog.Logger, grace time.Duration) http.Handler ``` -One exported symbol. No new types, no new sentinel errors. Wired in `cmd/pyrycode-relay/main.go` next to `/healthz`: +One exported symbol. No new types, no new sentinel errors. `grace` is the duration the server-id slot is held after disconnect before the registry releases it (see [Connection registry § grace-period reclaim](connection-registry.md)). Wired in `cmd/pyrycode-relay/main.go` next to `/healthz`: ```go -mux.Handle("/v1/server", relay.ServerHandler(reg, logger)) +mux.Handle("/v1/server", relay.ServerHandler(reg, logger, 30*time.Second)) ``` +The `30*time.Second` literal lives at the wiring site, not as a package-level constant: it appears exactly once, the value is policy (matches protocol spec § Authentication → Binary → relay), and inlining it keeps the protocol-spec linkage visible in the file that wires the relay together. Tests pass ms-scale durations. + ## Algorithm 1. Validate the three required headers. Any missing/empty → `400`, return. @@ -59,7 +61,7 @@ mux.Handle("/v1/server", relay.ServerHandler(reg, logger)) - On `ErrServerIDConflict` → close the underlying `*websocket.Conn` with code `4409`, log `server_id_conflict`, return. - On any other error → `wsconn.Close()`, return (defensive; not currently reachable). 6. Log `server_claimed`. -7. `defer { reg.ReleaseServer; wsconn.Close; log server_released }`. The defer is registered **after** the successful claim so a missed/conflicted claim never tries to release a slot we don't hold. +7. `defer { reg.ScheduleReleaseServer(serverID, grace); wsconn.Close; log server_released }`. The defer is registered **after** the successful claim so a missed/conflicted claim never tries to schedule a release for a slot we don't hold — and never arms a stray grace timer for an id it never owned. 8. `readCtx := c.CloseRead(r.Context()); <-readCtx.Done()` — block until the peer closes. ## Logging @@ -72,6 +74,8 @@ Three structured event types. Field set is fixed; nothing else (full headers, pa | `server_id_conflict` | `server_id`, `remote` | | `server_released` | `server_id` | +`server_released` fires at the moment the disconnect `defer` runs — i.e. when the slot is *scheduled* for release, not when the grace timer fires. The wording is preserved despite the slight semantic shift (the existing string is what operators search for; renaming would ripple into log queries with no observable benefit). If a future ticket needs to distinguish "scheduled" from "fired," that's a separate event on the registry's expiry path, not a rename here. + `binary_version` is the value of `X-Pyrycode-Version`, advertised by the binary itself — informational. `remote` is the IP host portion of `r.RemoteAddr`, no port (via `net.SplitHostPort`, with a verbatim fallback). `User-Agent` is required for entry but never logged (no operational use). The conflict path **deliberately omits** `binary_version`: an unauthenticated probe surface that echoed the version of the legitimate squatter would be a version-disclosure oracle. Header gate failures are not logged at all — avoids amplifying header-floods into log volume. @@ -85,9 +89,9 @@ The conflict path **deliberately omits** `binary_version`: an unauthenticated pr | `ClaimServer` | request goroutine | registry write lock (held internally) | one-shot | | `CloseRead` | spawns one read-discard goroutine | none | terminates on conn close / `r.Context()` cancel | | `<-readCtx.Done()` | request goroutine, blocking | none | unblocks on peer close | -| `defer` (release/close/log) | request goroutine | `wsconn.closeOnce` | runs only on the success path; idempotent | +| `defer` (schedule-release/close/log) | request goroutine | `wsconn.closeOnce` | runs only on the success path; idempotent | -Shutdown: `http.Server` cancels `r.Context()` → `CloseRead`'s read errors → `readCtx` cancels → handler unblocks → defer runs → slot released. `ReleaseServer`'s grace-period wrapper from #8 will hook the same defer. +Shutdown: `http.Server` cancels `r.Context()` → `CloseRead`'s read errors → `readCtx` cancels → handler unblocks → defer runs → slot scheduled for release. The grace timer is owned by `time.AfterFunc` inside the registry; the handler does not block on, observe, or reason about it. Process exit drops pending timers (the registry has no on-disk state); a binary mid-grace at shutdown loses its slot the moment the new process starts — same shape as a binary mid-claim at shutdown. ## Design notes @@ -96,6 +100,8 @@ Shutdown: `http.Server` cancels `r.Context()` → `CloseRead`'s read errors → - **`CloseRead` instead of `for { c.Read() }`.** The handler must keep the goroutine alive until the peer disconnects, but reading frames is #6's job. `CloseRead` spawns a goroutine that drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close) and returns a context cancelled on close. When #6 lands, `<-readCtx.Done()` is replaced with a real frame loop in the same call site. - **`crypto/rand`-backed `connID` suffix.** 32 bits is sufficient: scoped per server-id, used only as an opaque map key in `UnregisterPhone`. Birthday bound non-trivial only at ~2¹⁶ live conns under one server-id — far beyond v1 scale. Using `crypto/rand` over `math/rand` is forward-compat hardening: if a future ticket exposes the conn-id (e.g. echoing it in an error envelope), unguessable bytes avoid creating an oracle by accident. - **`crypto/rand` failure is fatal-by-design.** `randHex8` panics on RNG read failure; on Linux/macOS the OS RNG does not block once boot-initialised, so a failure here means a broken host. `http.Server` panic recovery terminates the connection; the slot is not yet claimed when `randHex8` runs, so the registry stays clean. +- **`grace` is a constructor parameter, not a package-level var or `Options` struct.** Honest about the dependency; consistent with how the handler already takes `reg` and `logger`; the value is policy and policy belongs at the wiring site (`cmd/pyrycode-relay/main.go`). A package-level override (e.g. `setGracePeriod(t, d)`) would be action-at-a-distance and force test serialisation. An `Options` struct would be premature for a single field. +- **The handler does not validate `grace`.** `grace` never crosses the network boundary — it's a compile-time literal in `main.go`. The registry tolerates degenerate values safely (`d <= 0` fires immediately, `d` near `MaxInt64` never fires before exit). Adding a clamp/panic at this layer would be defensive code with no observed failure mode. ## What this handler deliberately does NOT do @@ -105,7 +111,7 @@ Shutdown: `http.Server` cancels `r.Context()` → `CloseRead`'s read errors → - **No connection caps (per-IP or global).** Documented residual in `docs/threat-model.md` § DoS resistance; the WS upgrade path is named there. Inherited gap, not widened. - **No frame loop.** `CloseRead` discards frames until #6 replaces it. - **No heartbeat / ping-pong.** #7. -- **No grace-period release.** Immediate release; #8 wraps `ReleaseServer` in a deferred timer. +- **No frame-forward error handling during grace.** A phone whose `Send` errors against the closed binary `Conn` during the grace window is #6's territory; the handler does not observe these. - **No token validation.** `/v1/server` is unauthenticated by design — the binary owns the trust relationship with phones. - **No log-line sanitisation.** `slog`'s text handler quotes string values, so a newline in `binary_version` cannot forge a log line. JSON handler in production preserves the property. @@ -114,7 +120,8 @@ Shutdown: `http.Server` cancels `r.Context()` → `CloseRead`'s read errors → The handler is internet-exposed and unauthenticated. Threats considered: - **Fake `serverID` claim.** First-claim-wins is the trust primitive at this layer. Mitigation lives at the protocol layer (binary validates phone tokens; phones validate the binary out-of-band). Out of scope here. -- **Squatting on a `serverID`.** Legitimate binary sees `4409` until the squatter disconnects. Phones treat unresponsive binaries as offline. Connection caps are a future DoS ticket. +- **Squatting on a `serverID`.** Legitimate binary sees `4409` until the squatter disconnects. After the squatter disconnects the slot is held for `grace` seconds — but `ClaimServer` *during* the grace window succeeds (registry's reclaim path), so a competing legitimate binary lands in the slot inside that window rather than being blocked. The squatter's "drop and reserve for 30 s" is no worse than "stay connected forever," which they could already do pre-#21. Phones treat unresponsive binaries as offline. Connection caps are a future DoS ticket. +- **Grace-window map growth.** `binaries` and `timers` map sizes scale as `grace × accept-rate` in the worst case (an attacker rapidly connect-disconnects with distinct server-ids, leaving a 30 s residual entry per id). Each entry is bounded — the timer eventually fires and reclaims it — so growth is not unbounded. Per-IP and total-connection caps that would ceiling the accept-rate are #5 / #16 territory and explicitly deferred. - **Crafted headers (NULs, very long values).** `slog` quotes string values; `net.SplitHostPort` runs on `r.RemoteAddr` (set by `net/http`, not attacker-controlled); `r.Header.Get` does no parsing. No fragile parse step the attacker controls. - **Space-only headers.** `!= ""` accepts them. Hardening to "trim then non-empty" would be invented enforcement; the fields are informational, not enforcement, and no observed failure motivates it. - **Frame floods after upgrade.** `CloseRead` discards. One goroutine per connection. No amplification. Per-frame work is a header read + body drop. Acceptable for v1. @@ -124,18 +131,21 @@ The handler is internet-exposed and unauthenticated. Threats considered: - **`crypto/rand` failure.** Panic terminates the connection; the http server recovers; other connections continue. A failing host at this depth is operator-visible; refusing to serve is correct (loud failure over silent correction). - **Goroutine leak on shutdown.** `r.Context()` cancellation propagates through `CloseRead` to `readCtx`; defer runs; no leak. -Verdict from #16's security review: **PASS**. No new defences invented; no new threats opened; the residual risks are the same connection-cap and slow-peer gaps `docs/threat-model.md` already names. +Verdicts from #16's and #21's security reviews: **PASS**. No new defences invented; no new threats opened. #21 inherits #20's well-tested registry-side semantics for grace; the residual `grace × accept-rate` map-size scaling is named here and tracked to the per-IP/total-cap tickets that own it. ## Testing -`internal/relay/server_endpoint_test.go`, `package relay`. End-to-end against a real `*websocket.Conn` via `httptest.NewServer(ServerHandler(reg, logger))`. Logger is `slog.New(slog.NewTextHandler(io.Discard, nil))` — log-field assertions are out of scope; future tickets that need them swap in a JSON handler over a `bytes.Buffer`. +`internal/relay/server_endpoint_test.go`, `package relay`. End-to-end against a real `*websocket.Conn` via `httptest.NewServer(ServerHandler(reg, logger, grace))`. Logger is `slog.New(slog.NewTextHandler(io.Discard, nil))` — log-field assertions are out of scope; future tickets that need them swap in a JSON handler over a `bytes.Buffer`. + +The shared `startServer(t, grace)` helper threads the grace duration through. Tests that don't exercise the disconnect path pass any small value (e.g. `100*time.Millisecond`); the assertion-rich grace test passes `200*time.Millisecond` so its early-window check has comfortable margin against scheduler jitter on slow CI runners. Tests (1:1 with AC bullets): - `TestServerEndpoint_ValidUpgrade_RegistersBinary` — valid headers; assert dial succeeds, `reg.BinaryFor(...)` returns a non-nil `Conn` whose `ConnID()` is `"server--<8 hex>"`. - `TestServerEndpoint_HeaderGate_400` — table-driven over the three required headers (each row deletes or empties one); plain `http.Client.Do` with standard upgrade headers; expect `400`; `reg.Counts() == (0, 0)`. - `TestServerEndpoint_DuplicateClaim_4409` — first dial succeeds; second dial with same headers; one `Read` on the second conn returns a `*websocket.CloseError` with `Code == 4409` and `Reason == "server-id already claimed"` (asserted via `errors.As`). First conn stays registered. -- `TestServerEndpoint_PeerClose_ReleasesSlot` — dial, assert registered; client `Close(StatusNormalClosure, "")`; poll `reg.BinaryFor` (~1s, 10ms ticks) because release races the client close; assert eventually unregistered; re-claim succeeds. +- `TestServerEndpoint_PeerClose_ReleasesSlot` — dial, assert registered; client `Close(StatusNormalClosure, "")`; poll `reg.BinaryFor` (~1s, 10ms ticks); assert eventually unregistered; re-claim succeeds. End-to-end "disconnect → eventual release → re-claim" property; the short grace makes the existing 2-second wait still meaningful. +- `TestServerEndpoint_PeerClose_SchedulesGraceRelease` (#21) — proves *scheduled, not immediate*. Two phases against a 200 ms grace: phase 1 polls `reg.BinaryFor` for `grace/4` after the client close and fails if the entry disappears (a buggy immediate-release path would flip it within milliseconds); phase 2 polls for `2*grace + 500ms` and fails if the entry is still there. Inspects the registry directly because `ScheduleReleaseServer` returns nothing. - `TestServerEndpoint_WrongMethod_NoPanic` — `GET` without upgrade headers and `POST` with valid pyrycode headers; both expect a 4xx, no panic, registry empty. Not tested: registry mocks (use the real one — it's race-tested in #3); library mocks; `randHex8` distribution (stdlib); exact log output. @@ -146,6 +156,7 @@ Not tested: registry mocks (use the real one — it's race-tested in #3); librar - [WSConn adapter](ws-conn-adapter.md) — what the handler hands to `ClaimServer`. The handler's direct-`c.Close` on conflict is the documented exception to the WSConn-only invariant. - [`/healthz`](healthz.md) — sibling unauthenticated endpoint; mirrors the handler-factory shape. - [ADR-0005](../decisions/0005-application-close-codes-via-underlying-conn.md) — close-code emission goes through the underlying `*websocket.Conn`, not WSConn. -- [ADR-0003](../decisions/0003-connection-registry-passive-store.md) — narrow `ReleaseServer` so #8's grace logic can wrap it. +- [ADR-0006](../decisions/0006-grace-period-as-reclaim-path.md) — the grace window IS the reclaim path; the handler's disconnect defer hooks into this via `ScheduleReleaseServer`. +- [ADR-0003](../decisions/0003-connection-registry-passive-store.md) — narrow `ReleaseServer` left room for the grace-period wrapper this handler now uses. - [Threat model](../../threat-model.md) § DoS resistance — connection-cap residual is named there. - [Protocol spec § Authentication → Binary→relay](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#binary--relay) — authoritative wire shape. diff --git a/docs/lessons.md b/docs/lessons.md index 5968529..f5b05cb 100644 --- a/docs/lessons.md +++ b/docs/lessons.md @@ -18,9 +18,9 @@ A WebSocket connection only sees a peer-side close when *something* on this side The registry's `WSConn.Close()` is fixed to `StatusNormalClosure` by contract. To emit `4409` / `4401` / `4404`, call `Close` directly on the underlying `*websocket.Conn` from the upgrade handler — but only in the stillborn-WSConn window (after construction, before any `Send`, before the close `defer` is registered). Outside that window you'd race the adapter's `closeOnce` and `writeMu`. The handler comments name the exception so it doesn't read as a "use the underlying conn whenever you like" pattern. Captured as ADR-0005. Source: `/v1/server` conflict path (#16). -## `defer ReleaseServer(...)` must be registered AFTER the successful `ClaimServer` +## `defer { release-or-schedule }` must be registered AFTER the successful `ClaimServer` -If you `defer reg.ReleaseServer(serverID)` before `ClaimServer` returns, the conflict path (`ErrServerIDConflict`) ends up calling `ReleaseServer` on a slot it does not hold — currently a no-op (returns false) but a structural foot-gun if a future grace-period wrapper from #8 starts a timer on every `ReleaseServer` call. Order: claim, then log success, then `defer`. Source: `/v1/server` (#16). +If you `defer reg.ReleaseServer(serverID)` (or `ScheduleReleaseServer`) before `ClaimServer` returns, the conflict path (`ErrServerIDConflict`) ends up acting on a slot it does not hold. Pre-#21 this was a benign no-op (`ReleaseServer` returned false); after #21 swapped the defer to `ScheduleReleaseServer`, violating the rule would *arm a 30-second grace timer for an id the handler never owned* — at expiry the timer fires against whatever entry happens to be present (typically nothing, but indistinguishable in shape from a deletion). Order is non-negotiable: claim, then log success, then `defer`. Source: `/v1/server` (#16, sharpened by #21). ## `crypto/rand.Read` failure is fatal-by-design on Linux/macOS