From d1db2aaa7b6add85c86760303389a6692afbf213 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Fri, 8 May 2026 21:55:13 +0300 Subject: [PATCH 1/3] spec: /v1/server WS upgrade with header gate and server-id claim (#16) --- docs/specs/architecture/16-server-endpoint.md | 327 ++++++++++++++++++ 1 file changed, 327 insertions(+) create mode 100644 docs/specs/architecture/16-server-endpoint.md diff --git a/docs/specs/architecture/16-server-endpoint.md b/docs/specs/architecture/16-server-endpoint.md new file mode 100644 index 0000000..b3a78e1 --- /dev/null +++ b/docs/specs/architecture/16-server-endpoint.md @@ -0,0 +1,327 @@ +# Spec — `/v1/server` WS upgrade, header gate, server-id claim (#16) + +## Files to read first + +- `internal/relay/registry.go:69-101` — `ClaimServer` / `ReleaseServer` semantics and `ErrServerIDConflict`. The handler is a thin shell around these two calls. +- `internal/relay/registry.go:22-46` — `Conn` interface contract. The handler hands `WSConn` (which already implements it) to `ClaimServer`; nothing about the interface changes. +- `internal/relay/ws_conn.go:1-83` — `WSConn` adapter. The handler constructs one with `NewWSConn(c, connID)`. Note the doc invariant: "callers must reach the connection only through `WSConn` methods" — the conflict path needs a principled exception (see § Design). +- `internal/relay/healthz.go:35-57` — handler-factory shape: `func NewXHandler(...) http.Handler` returning an `http.HandlerFunc` literal. Convention to mirror. +- `internal/relay/ws_conn_test.go:20-54` — `httptest.NewServer` + `websocket.Dial` test harness. The endpoint test reuses this shape (server-side handler is `ServerHandler(reg, logger)` instead of an inline accept). +- `cmd/pyrycode-relay/main.go:45-50` — current mux registration site. One new `mux.Handle("/v1/server", ...)` line slots in next to `/healthz`. +- `docs/threat-model.md` § "Log hygiene" (line 42–55) — what must never be logged: payload bodies, token values, full headers. Headers we DO log are enumerated in the AC. +- `docs/threat-model.md` § "Error response leakage" (line 77–85) — public-facing handlers return generic messages. The `400` body is empty / generic; the conflict close-reason is the literal `"server-id already claimed"` (no internal state). +- `docs/PROJECT-MEMORY.md:30-34` — handler-factory pattern, package-internal tests, "loud failure over silent correction." All three apply. +- [`pyrycode/pyrycode/docs/protocol-mobile.md` § Authentication → Binary→relay](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#binary--relay) — authoritative spec for the headers and the `4409` close code. The relay implements; the spec defines. +- `nhooyr.io/websocket` package docs (godoc — already a direct dep) — `Accept`, `AcceptOptions.OriginPatterns`, `StatusCode`, `Conn.CloseRead`, `Conn.Close`. The handler uses each exactly once. + +## Context + +Implements the binary-side handshake. The endpoint accepts an outbound WebSocket from a pyry binary on `/v1/server`, validates three required headers BEFORE upgrading, upgrades on success, and registers the connection in the registry under the binary's `x-pyrycode-server` header. First-claim-wins is enforced by `ClaimServer` (atomic in #3); a losing claim sends close code `4409`. + +The relay treats every byte from a binary as adversarial: header validation runs before `websocket.Accept` so a malformed request never receives an upgrade response. The registry is untouched on every error path. + +This is the binary side only. Phone (`/v1/client`) is #5, frame forwarding is #6, heartbeat is #7, and the 30-second grace period on disconnect is #8. + +## Design + +### Package & files + +- New file: `internal/relay/server_endpoint.go` (`package relay`). +- New test file: `internal/relay/server_endpoint_test.go` (`package relay`). +- Edit: `cmd/pyrycode-relay/main.go` adds one line registering the handler on the mux. + +No changes to `registry.go`, `ws_conn.go`, `envelope.go`, `tls.go`, `healthz.go`, `doc.go`, `go.mod`, or `Makefile`. + +### Public API + +```go +package relay + +// ServerHandler returns the http.Handler for /v1/server: the binary-side +// WebSocket upgrade endpoint. It validates required headers, upgrades the +// connection, claims the server-id slot in reg, and holds the connection +// open until the binary closes it (or the slot is released). +// +// Close codes used by this endpoint: +// - 1000 (StatusNormalClosure) clean close on shutdown / release. +// - 4409 (server-id already claimed) another binary holds the slot. +// +// Future tickets layer 4401 (token), 4404 (no server) on /v1/client and +// frame-forward errors on /v1/server. +func ServerHandler(reg *Registry, logger *slog.Logger) http.Handler +``` + +One exported symbol. No new types, no new sentinel errors. + +### Required headers (validated pre-upgrade) + +| Header (canonical) | Source | Enforcement | +|---|---|---| +| `X-Pyrycode-Server` | binary's server-id | non-empty | +| `X-Pyrycode-Version` | binary's version string | non-empty | +| `User-Agent` | HTTP standard | non-empty | + +`r.Header.Get` performs canonicalisation (case-insensitive). Missing or empty → `http.Error(w, "", http.StatusBadRequest)` with no body, no `WWW-Authenticate`, no internal-state leakage. Registry untouched, `Accept` not called. + +### Algorithm + +``` +1. Validate the three headers. Any missing/empty → 400, return. +2. websocket.Accept(w, r, &websocket.AcceptOptions{ + OriginPatterns: []string{"*"}, + }) + - On error: the library has already written a 4xx; just return. + Registry untouched. +3. connID := "server-" + serverID + "-" + randHex8() +4. wsconn := NewWSConn(c, connID) +5. err := reg.ClaimServer(serverID, wsconn) + - errors.Is(err, ErrServerIDConflict): + _ = c.Close(websocket.StatusCode(4409), "server-id already claimed") + logger.Info("server_id_conflict", + "server_id", serverID, + "remote", remoteHost(r)) + return + - other err (not produced by current registry, but for completeness): + wsconn.Close() + return +6. logger.Info("server_claimed", + "server_id", serverID, + "binary_version", versionHeader, + "remote", remoteHost(r)) +7. defer: + reg.ReleaseServer(serverID) + wsconn.Close() + logger.Info("server_released", "server_id", serverID) +8. // Hold the connection open until the peer closes it. Frame loop (#6) + // replaces this block with a real read loop later. + readCtx := c.CloseRead(r.Context()) + <-readCtx.Done() +``` + +Two design notes inside this algorithm deserve calling out: + +**Why call `c.Close` directly on conflict instead of `wsconn.Close()`.** `WSConn.Close()` always sends `StatusNormalClosure`; the conflict path needs the application-defined `4409`. Adding a "close with custom code" method to `WSConn` would inflate #15's surface for one caller. The conflict case is a stillborn WSConn: no `Send` was attempted, no concurrent goroutine holds `writeMu`, so the WSConn doc's "callers must reach the connection only through WSConn methods" invariant is preserved in spirit — we never *use* the WSConn after construction. The handler's comment names this exception so a future reader doesn't generalise the pattern. + +**Why `CloseRead` instead of `for { c.Read() }` or a bare block.** The handler must keep the goroutine alive until the peer disconnects, but reading frames is #6's job. `nhooyr.io/websocket.Conn.CloseRead(ctx)` spawns a goroutine that drains-and-discards frames and returns a context cancelled when the conn closes. When #6 lands, the read goroutine is replaced with a real frame loop in the same goroutine — the call site swaps `<-readCtx.Done()` for the loop body. Using `CloseRead` here also satisfies a subtle correctness point: WebSocket pings / control frames need to be processed, otherwise the connection never observes a close from the peer side. `CloseRead` handles that internally. + +### Helper functions (file-local, unexported) + +```go +// randHex8 returns 8 hex chars (32 bits) from crypto/rand. Panics on a +// crypto/rand read failure — the OS RNG failing is not a recoverable +// error in this process. +func randHex8() string { + var b [4]byte + if _, err := rand.Read(b[:]); err != nil { + panic("crypto/rand: " + err.Error()) + } + return hex.EncodeToString(b[:]) +} + +// remoteHost strips the port from r.RemoteAddr. Falls back to RemoteAddr +// verbatim if SplitHostPort fails (handlers behind some proxies set it +// without a port). Used only for logging. +func remoteHost(r *http.Request) string { + if host, _, err := net.SplitHostPort(r.RemoteAddr); err == nil { + return host + } + return r.RemoteAddr +} +``` + +`crypto/rand.Read` is documented to never fail on platforms the relay supports (Linux, macOS); the `panic` matches Go-stdlib idiom (`crypto/rand`'s own examples treat it as fatal). 32 bits is sufficient: the conn-id is scoped per server-id, and the registry uses it only as an opaque key for slice lookups (`UnregisterPhone`); collisions inside a single server-id slice are statistically negligible at v1 scale. + +### `cmd/pyrycode-relay/main.go` edit + +One line, inserted after the `/healthz` registration (`main.go:49`): + +```go +mux.Handle("/v1/server", relay.ServerHandler(reg, logger)) +``` + +`logger` is already in scope (`main.go:37`). No other edits. + +### Concurrency model + +The handler runs in the http.Server's per-request goroutine. The `WSConn` it constructs is the registry's reference for `serverID`; broadcasts (future tickets) reach it through `reg.BinaryFor`. There is one `CloseRead`-spawned read-discard goroutine per accepted connection; it terminates when the peer closes or `r.Context()` is cancelled. + +| Method / step | Goroutine | Lock | Lifecycle | +|---|---|---|---| +| Header validation | request goroutine | none | pre-upgrade; no resources held | +| `websocket.Accept` | request goroutine | none | conn allocated on success | +| `ClaimServer` | request goroutine | registry write lock (held internally) | one-shot | +| `CloseRead` | spawns one read-discard goroutine | none | terminates on conn close / ctx cancel | +| `<-readCtx.Done()` | request goroutine, blocking | none | unblocks on peer close | +| `defer { Release; wsconn.Close; log }` | request goroutine | `wsconn`'s `closeOnce` | runs on every successful claim path; idempotent | + +Shutdown path: when the http server shuts down, `r.Context()` is cancelled, the `CloseRead` goroutine's read returns an error, `readCtx` cancels, the handler unblocks, the defer runs, the slot is released. Clean. (The grace-period behaviour from #8 will wrap `ReleaseServer` in a deferred timer; nothing about this handler's structure precludes that — `defer` is the same hook.) + +### Error handling + +- **Missing/empty header** → `400 Bad Request`, empty body. No log line (avoids amplifying header-floods into log volume; the threat model already names "future endpoints may write `err.Error()` to the response" as a leakage risk — generic body is the mitigation). +- **`websocket.Accept` error** (wrong method, missing `Upgrade`, wrong subprotocol) — the library writes its own 4xx response. The handler returns without logging. Registry untouched. +- **`ClaimServer` returns `ErrServerIDConflict`** — close 4409, log `event=server_id_conflict server_id= remote=`. Header values are NOT logged (no `binary_version`, no `user_agent`) — matches the AC and avoids leaking the conflicting binary's version through what is, observationally, an unauthenticated probe surface. +- **`ClaimServer` returns any other error** — not produced by the current registry, but defensively: `wsconn.Close()` and return. Not logged because no such error is reachable from current code; if it becomes reachable, the diff that introduces it should add the log. +- **`CloseRead` / `<-readCtx.Done()`** does not fail — it returns when the conn ends. +- **No panics from this handler.** `randHex8`'s panic on RNG failure is fatal-by-design; if it fired, the http server's panic recovery would terminate the connection, the defer would still run, and the slot would be released. + +### Logging — exact field set + +Per `docs/threat-model.md` § Log hygiene, the relay must log structured fields only, never full headers or payloads. This handler emits three event types: + +| event | fields | +|---|---| +| `server_claimed` | `server_id`, `binary_version`, `remote` | +| `server_id_conflict` | `server_id`, `remote` | +| `server_released` | `server_id` | + +`binary_version` is the value of the `X-Pyrycode-Version` header, advertised by the binary itself — informational, not a secret. `remote` is the IP host portion of `r.RemoteAddr`, no port. `User-Agent` is validated for presence but never logged (no operational use; would inflate log volume without value). + +## Testing strategy + +`internal/relay/server_endpoint_test.go`, `package relay`. End-to-end against a real `*websocket.Conn` via `httptest.NewServer(ServerHandler(reg, logger))`, mirroring the harness in `ws_conn_test.go`. + +### Test harness + +```go +// startServer spins up an httptest.NewServer running ServerHandler against +// a fresh registry. It returns the registry, the WS URL, and a cleanup. +func startServer(t *testing.T) (*Registry, string, func()) { + t.Helper() + reg := NewRegistry() + logger := slog.New(slog.NewTextHandler(io.Discard, nil)) // silence test output + srv := httptest.NewServer(ServerHandler(reg, logger)) + wsURL := "ws" + strings.TrimPrefix(srv.URL, "http") + return reg, wsURL, srv.Close +} + +// dialWith calls websocket.Dial with the given header set. +func dialWith(t *testing.T, wsURL string, hdr http.Header) (*websocket.Conn, *http.Response, error) { + t.Helper() + ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer cancel() + return websocket.Dial(ctx, wsURL, &websocket.DialOptions{HTTPHeader: hdr}) +} + +// validHeaders returns a header set that passes the gate for serverID. +func validHeaders(serverID string) http.Header { + h := http.Header{} + h.Set("X-Pyrycode-Server", serverID) + h.Set("X-Pyrycode-Version", "0.1.0-test") + h.Set("User-Agent", "pyry-test/0.1.0") + return h +} +``` + +### Tests (1:1 with AC bullets) + +1. **`TestServerEndpoint_ValidUpgrade_RegistersBinary`** — dial with `validHeaders("s1")`; assert `Dial` returns no error; assert `reg.BinaryFor("s1")` returns a non-nil Conn whose `ConnID()` starts with `"server-s1-"` and ends with 8 hex chars (regex or `len`+`hex.DecodeString` check). + +2. **`TestServerEndpoint_HeaderGate_400`** — table-driven over the three required headers; each row builds `validHeaders` and either deletes the header or sets it to `""`. Use a plain `http.Client.Do` (not `websocket.Dial`) with the standard upgrade headers (`Upgrade: websocket`, `Connection: Upgrade`, `Sec-WebSocket-Version: 13`, `Sec-WebSocket-Key: `); expect `StatusCode == 400`. After the call, `reg.Counts()` returns `(0, 0)`. + +3. **`TestServerEndpoint_DuplicateClaim_4409`** — dial once with `validHeaders("s1")` (success). Dial again with the same headers. Read once on the second connection; expect `websocket.CloseError` whose `Code == websocket.StatusCode(4409)` and `Reason == "server-id already claimed"`. Assert with `errors.As(err, &ce)`. The first conn remains registered. + +4. **`TestServerEndpoint_PeerClose_ReleasesSlot`** — dial with valid headers; assert registered; call `c.Close(websocket.StatusNormalClosure, "")` on the client; poll `reg.BinaryFor("s1")` (with a short bounded retry — up to ~1s, sleeping 10ms — because release is observed via the server-side handler returning, which races the client close); assert it eventually returns `(_, false)`. Re-claim succeeds afterwards. + +5. **`TestServerEndpoint_WrongMethod_NoPanic`** — `http.Get(srv.URL + "/v1/server")` (no upgrade headers, but with the three pyrycode headers set, to prove the upgrade gate fails *after* the header gate passes); expect a 4xx (the library will return 426/400); no panic; `reg.Counts() == (0, 0)`. + + And one more: `http.Post(srv.URL+"/v1/server", "application/octet-stream", nil)` with valid pyrycode headers — expect a 4xx, no panic, registry empty. Documents that POST is not a regression hole. + +### What this test file deliberately does not do + +- **Mock the registry.** Use the real one — `Registry` is already race-tested in #3. +- **Mock `nhooyr.io/websocket`.** Same reasoning as #15. +- **Test `randHex8` for distribution / collision.** It's `crypto/rand` + `hex.EncodeToString`; testing the stdlib is out of scope. +- **Assert exact logger output.** The slog handler is `io.Discard` in tests. If a future ticket needs to verify log fields, swap to a JSON handler writing to a `bytes.Buffer` and decode — out of scope here. + +### Lint expectations + +`make vet`, `make test` (under `-race`), `make build` all clean. `gosec ./...` and `govulncheck ./...` clean. No new dependencies — `nhooyr.io/websocket`, `crypto/rand`, `encoding/hex`, `net`, `log/slog`, `errors` are already in scope. + +## Open questions + +1. **Should the handler emit a `server_upgrade_rejected` log when `websocket.Accept` errors?** No, leave it silent. The library writes a 4xx and the failure is visible in the http access log. Adding our own log line per upgrade-failure invites log-flood from misconfigured clients. If observability later wants this, a single counter (Prometheus, future) is the right shape, not a log line. + +2. **Why `OriginPatterns: []string{"*"}` and not omit it / use `InsecureSkipVerify`?** `nhooyr.io/websocket.Accept` rejects all cross-origin upgrades by default unless the request has no `Origin` header. Browsers always send `Origin`; programmatic clients don't have to. Setting `OriginPatterns: ["*"]` says explicitly "this endpoint is not browser-facing; same-origin policy does not apply." `InsecureSkipVerify` is for TLS dial verification — the wrong knob entirely. The choice is documented with a comment in the source. + +3. **Should `binary_version` be sanitised before logging?** It's an attacker-controlled string. `slog.NewTextHandler` quotes string values, so a newline in the version cannot forge a fake log line. We rely on slog's escaping; no manual sanitisation. If the relay later switches to a JSON handler in production, that property is preserved (JSON encoders escape). + +## Out of scope (re-stated, for the developer) + +- Frame forwarding loop (`#6`) — `CloseRead` discards frames; replace with the real loop in #6. +- Heartbeat / ping-pong (`#7`). +- 30-second grace period on disconnect (`#8`) — this handler releases immediately; the grace wrapper goes around `ReleaseServer` in #8. +- `/v1/client` endpoint (`#5`) — separate handler, separate spec. +- Per-IP connection limit / global cap — deferred to the DoS-hardening ticket (`docs/threat-model.md` § DoS resistance). +- Token validation — the relay does not see tokens on `/v1/server`; the binary side is unauthenticated by design (the binary owns the trust relationship with phones). +- Updating `docs/PROJECT-MEMORY.md` and `docs/knowledge/` — owned by the documentation step in the pipeline, not this ticket's commit. + +## Done means + +- `internal/relay/server_endpoint.go` exists with `ServerHandler`, `randHex8`, `remoteHost`, and a top-of-file comment block listing close codes. +- `internal/relay/server_endpoint_test.go` covers the five tests above. +- `cmd/pyrycode-relay/main.go` registers the handler on `/v1/server`. +- `make vet`, `make test` (under `-race`), `make build`, `gosec ./...`, `govulncheck ./...` all clean. +- One commit on `feature/16`: `feat(relay): /v1/server WS upgrade with header gate and server-id claim (#16)`. + +--- + +## Security review (security-sensitive label) + +### Threat surface for THIS ticket + +This is the binary-side ingress. The handler is one of the two places the relay accepts traffic from the public internet (the other being `/v1/client`, #5, and `/healthz`, which is read-only). Every byte of every frame the relay later forwards arrives through this WS upgrade. The endpoint is unauthenticated — the protocol has no token on `/v1/server`; trust is established post-claim by virtue of the binary holding the slot for `serverID`. + +### Categories walked + +- **Trust boundaries.** The handler's input is everything in `*http.Request`: method, path (already routed by mux), headers, RemoteAddr. (a) `serverID` from `X-Pyrycode-Server` is treated as an opaque map key by the registry — no parsing, no filesystem use, no command construction, no SQL. (b) `versionHeader` is logged but not parsed, compared, or used for routing. (c) `userAgent` is validated for presence and discarded. (d) `r.RemoteAddr` is parsed by `net.SplitHostPort` (stdlib, well-tested) for logging only. **Finding:** no trust boundary widened beyond what the registry already accepts; `serverID` is the same opaque-key shape #3 documents. + +- **Input validation / pre-upgrade gate.** All three required headers are checked before `websocket.Accept`. A request missing any header exits with `400` and zero allocation beyond the response. The gate is intentionally minimal — no length cap, no charset check, no regex — because the only consumer (registry) treats `serverID` as opaque. Adding a length cap would be invented enforcement (no observed failure to motivate it). The protocol spec is the right place to define `serverID` shape; if it adds constraints, this gate inherits them. **Finding:** validation matches the registry's contract; no defense added beyond what the AC requires. + +- **Resource exhaustion / DoS — connection floods.** This handler does not cap concurrent connections. `docs/threat-model.md` § "DoS resistance" already names this as a deferred v1 gap with the WS upgrade path called out by name. **Finding:** no new mitigation; the residual is the same residual the threat model already accepts. Per-IP / global caps belong to a future DoS ticket. + +- **Resource exhaustion / DoS — slow-loris on upgrade.** `http.Server.ReadHeaderTimeout: 10s` (`cmd/pyrycode-relay/main.go:53-95`) bounds the time to receive headers. If a peer completes the WS handshake but never sends frames, `CloseRead` simply waits — at the cost of one goroutine per such peer. Connection-count caps are the right answer (deferred). The 10-second `Send` timeout in `WSConn` (#15) bounds the write side. **Finding:** no new exposure; existing mitigations apply. + +- **Resource exhaustion / memory.** Per-connection memory is one `WSConn` (~120 B), one `CloseRead` goroutine (~8 KB initial stack), the registry's map entry. No per-connection buffer pool, no read queue. Logging fields are stack-allocated by slog's variadic API. **Finding:** baseline; identical to what #5/#6 will add per phone connection. + +- **Race conditions / first-claim-wins atomicity.** `ClaimServer` holds the registry write lock across the read-then-write check, so two concurrent claims see one winner and one `ErrServerIDConflict`. The handler's caller-side ordering is irrelevant to atomicity; the lock owns it. The handler's defer (`ReleaseServer + wsconn.Close + log`) runs only when `ClaimServer` succeeded — the conflict path returns before the defer is registered. **Finding:** atomicity preserved; no TOCTOU window. + +- **Race conditions / release-before-success.** If the handler returned without claim succeeding but with the defer already registered, `ReleaseServer` would run on a slot we don't hold (no-op on the registry, false return — safe). The chosen ordering puts `defer` AFTER the successful `ClaimServer`, so this race is structurally absent. **Finding:** safe by construction; spec calls out the ordering explicitly. + +- **Information disclosure — log content.** Logged fields: `server_id`, `binary_version`, `remote`. Not logged: `User-Agent`, full headers, payloads, conn-id internals (the random suffix is a relay-internal value with no leakage value, but it's also not adversarial-relevant; not logging it is just thrift). The conflict path deliberately omits `binary_version` to avoid turning the conflict signal into a version-disclosure oracle for an attacker probing whether a `serverID` is held — the threat model's "do not log header values on the conflict path" is encoded in the AC and reproduced here. **Finding:** matches `docs/threat-model.md` § Log hygiene exactly. + +- **Information disclosure — response bodies.** `400` body is empty. The `4409` close-reason is the literal `"server-id already claimed"` — a public, protocol-defined string with no internal-state leakage. The `websocket.Accept` failure path lets the library write its own 4xx (`nhooyr.io/websocket` returns short, generic strings). **Finding:** matches `docs/threat-model.md` § "Error response leakage." + +- **Conn-id unpredictability.** 32 bits from `crypto/rand`. The conn-id is scoped per server-id and used only as an opaque map key inside `UnregisterPhone`; it is not a security token. An attacker who knew it could not impersonate the connection (the connection IS the trust). Using `crypto/rand` over `math/rand` is principled defence-in-depth: if a future ticket exposes the conn-id (e.g. echoing it in an error envelope), an unguessable id avoids creating an oracle by accident. **Finding:** no harm, modest forward-compat benefit. + +- **Lock-order / deadlock.** Registry write lock is taken inside `ClaimServer` and `ReleaseServer`. Neither call invokes `wsconn.Send` or `wsconn.Close` while holding the lock — `Close` runs in the defer, after `ReleaseServer` has returned. The `CloseRead` goroutine never touches the registry. **Finding:** no deadlock path. + +- **Use-after-close.** If the binary closes the connection mid-handler, `CloseRead`'s read errors and cancels `readCtx`; `<-readCtx.Done()` returns; defer runs. `ReleaseServer` then `wsconn.Close()`; the latter is `sync.Once`-guarded (`#15`) so a stray `Close` from a future broadcast loop is a no-op. **Finding:** safe. + +- **Panics / nil-deref.** `randHex8` panics on RNG failure (intentional). All other paths are nil-safe: `r.Header.Get` returns `""` for missing headers; `net.SplitHostPort` returns an error rather than panicking. The http server's panic recovery would catch a panic if one fired and close the connection; defer would still run because Go's `recover` happens after registered defers — except the defer is registered AFTER `ClaimServer`, so a panic before claim leaves the registry untouched. **Finding:** safe. + +### Adversarial framings considered + +- *"Attacker fakes `X-Pyrycode-Server: ` to claim the slot first."* This is the underlying threat the protocol model accepts: the binary side of `/v1/server` is unauthenticated; first-claim-wins is the trust primitive. Mitigation lives at the protocol layer (binary validates phone tokens; phones validate the binary out-of-band). Out of scope for this ticket; named here so a reader doesn't think it's an oversight. + +- *"Attacker spam-claims a `serverID`, holds the slot, never serves frames."* Same as above. The legitimate binary sees `4409` until the attacker disconnects (or the grace period closes the squatter via another mechanism, post-#8). Mitigations: connection caps (deferred), and at the protocol level, phones treat unresponsive binaries as offline. The relay's job here is to enforce the claim atomically; it does. + +- *"Attacker sends crafted headers with NULs / very long values to try to crash slog or net.SplitHostPort."* slog's text handler quotes string values (newlines, NULs become escapes). `net.SplitHostPort` operates on `r.RemoteAddr`, which is set by `net/http` (not the attacker's headers). `r.Header.Get` returns whatever the client sent without further parsing. **Finding:** no parse step the attacker controls is fragile. + +- *"Attacker omits `User-Agent`, `X-Pyrycode-Version`, or sends them as space-only strings."* `r.Header.Get` returns `""` for unset headers; the gate explicitly checks for non-empty (`!= ""`). A space-only string passes the `!= ""` check — accepted, logged literally. Hardening to "trim then check non-empty" would be invented enforcement; per evidence-based defence, deferred until an observed failure motivates it. The fields are informational, not enforcement. + +- *"Attacker sends a valid upgrade then floods the connection with frames."* `CloseRead` reads-and-discards them. Each goroutine handles one connection. Per-frame work is tiny (read header, drop body). A high-rate attacker hits OS kernel buffers, then `read` rate-limits naturally. No amplification (the relay does not respond). **Finding:** acceptable for v1; named in DoS deferral. + +- *"Attacker uses `OriginPatterns: ["*"]` permissiveness to mount a CSWSH (cross-site WebSocket hijacking) from a browser."* Browsers cannot generate a `pyry` binary's headers (`X-Pyrycode-Server`, `X-Pyrycode-Version`) by simple cross-site request — `fetch` with custom headers triggers CORS pre-flight; raw WebSocket from a browser cannot set custom request headers at all. So a browser-driven attacker cannot pass the header gate. The threat is structurally blocked one layer up. **Finding:** `OriginPatterns: ["*"]` is safe given the header gate; the comment in source documents the reasoning. + +- *"Attacker calls `POST /v1/server` to confuse the upgrade path."* `websocket.Accept` rejects non-GET upgrade requests with a library-written 4xx. Registry untouched. Test coverage explicitly exercises this. + +- *"Attacker holds the slot, then triggers handler shutdown to leak the goroutine."* On shutdown the http server cancels `r.Context()`; the `CloseRead` goroutine's read returns with the cancelled context; `readCtx` cancels; the handler unblocks; defer runs. No goroutine leak. + +- *"Random conn-id collision."* Birthday bound: collisions become non-negligible at ~2¹⁶ live conns under ONE server-id. Far beyond v1 scale. If it ever matters, widen the suffix; the public API does not change. + +- *"`crypto/rand` fails (panic)."* On Linux/macOS, `getrandom`/`/dev/urandom` does not block once the kernel pool is initialised at boot. A failure here means the host is broken. Panic terminates the connection, http server recovers, other connections continue. A failing host at this depth is an operator-visible fault; refusing to serve is the correct response (loud failure over silent correction). + +### Verdict: PASS + +The handler enforces the AC's required header gate before allocating any WebSocket state, defers to `ClaimServer` for the atomic-claim primitive, and uses log fields that match the threat-model's hygiene rules exactly (with the conflict path's deliberate omission of `binary_version` called out as defence against version-disclosure probing). `OriginPatterns: ["*"]` is safe because the custom-header gate one layer above structurally excludes browser attackers. The `crypto/rand` conn-id suffix adds modest forward-compat hardening at zero cost. Connection caps and the slow-peer fleet are documented residual risks already named in `docs/threat-model.md`; this ticket inherits, not widens, those gaps. The `security-sensitive` label is justified — this is internet-facing, unauthenticated ingress on the binary side — and the review is correspondingly thorough. From 5df54ebae6ac18abd991c0d831068002fc34b623 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Fri, 8 May 2026 21:59:38 +0300 Subject: [PATCH 2/3] feat(relay): /v1/server WS upgrade with header gate and server-id claim (#16) Implements the binary-side WebSocket upgrade endpoint. Validates X-Pyrycode-Server, X-Pyrycode-Version, and User-Agent before invoking websocket.Accept; on success, wraps the conn in WSConn (with a crypto/rand-suffixed conn-id) and registers it via Registry.ClaimServer. Returns close code 4409 with reason "server-id already claimed" when the slot is already held. Holds the connection open via CloseRead until the peer disconnects; releases the slot in a defer. Logs structured fields only (server_id, binary_version, remote on claim; server_id and remote on conflict; server_id on release) per docs/threat-model.md log-hygiene rules. Closes #16 --- cmd/pyrycode-relay/main.go | 1 + internal/relay/server_endpoint.go | 114 +++++++++ internal/relay/server_endpoint_test.go | 306 +++++++++++++++++++++++++ 3 files changed, 421 insertions(+) create mode 100644 internal/relay/server_endpoint.go create mode 100644 internal/relay/server_endpoint_test.go diff --git a/cmd/pyrycode-relay/main.go b/cmd/pyrycode-relay/main.go index 6d54d7f..5b1083f 100644 --- a/cmd/pyrycode-relay/main.go +++ b/cmd/pyrycode-relay/main.go @@ -47,6 +47,7 @@ func main() { mux := http.NewServeMux() mux.Handle("/healthz", relay.NewHealthzHandler(reg, Version, startedAt)) + mux.Handle("/v1/server", relay.ServerHandler(reg, logger)) if *insecureListen != "" { logger.Info("starting", "version", Version, "mode", "insecure", "listen", *insecureListen) diff --git a/internal/relay/server_endpoint.go b/internal/relay/server_endpoint.go new file mode 100644 index 0000000..970cb3f --- /dev/null +++ b/internal/relay/server_endpoint.go @@ -0,0 +1,114 @@ +package relay + +// WebSocket close codes used by /v1/server. RFC 6455 reserves 4000–4999 for +// application use. +// +// - 1000 (StatusNormalClosure) clean close on shutdown / release. +// - 4409 (server-id already claimed) another binary holds the slot. +// +// Future tickets layer 4401 (token), 4404 (no server) on /v1/client and +// frame-forward errors on /v1/server. + +import ( + "crypto/rand" + "encoding/hex" + "errors" + "log/slog" + "net" + "net/http" + + "nhooyr.io/websocket" +) + +// ServerHandler returns the http.Handler for /v1/server: the binary-side +// WebSocket upgrade endpoint. It validates required headers, upgrades the +// connection, claims the server-id slot in reg, and holds the connection +// open until the binary closes it (or the slot is released). +func ServerHandler(reg *Registry, logger *slog.Logger) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + serverID := r.Header.Get("X-Pyrycode-Server") + versionHeader := r.Header.Get("X-Pyrycode-Version") + userAgent := r.Header.Get("User-Agent") + if serverID == "" || versionHeader == "" || userAgent == "" { + http.Error(w, "", http.StatusBadRequest) + return + } + + // OriginPatterns: ["*"] — this endpoint is not browser-facing. + // The custom-header gate above structurally excludes browser-driven + // CSWSH attempts: raw WebSocket from a browser cannot set custom + // request headers, and a fetch with custom headers triggers CORS + // pre-flight that this endpoint does not satisfy. + c, err := websocket.Accept(w, r, &websocket.AcceptOptions{ + OriginPatterns: []string{"*"}, + }) + if err != nil { + // nhooyr.io/websocket has already written a 4xx response. + return + } + + connID := "server-" + serverID + "-" + randHex8() + wsconn := NewWSConn(c, connID) + + if err := reg.ClaimServer(serverID, wsconn); err != nil { + if errors.Is(err, ErrServerIDConflict) { + // Close directly on the underlying *websocket.Conn so we + // can send the 4409 close code; WSConn.Close always emits + // StatusNormalClosure. This is a stillborn WSConn — no + // Send was attempted, no goroutine holds writeMu — so the + // "reach the connection only through WSConn methods" + // invariant is preserved in spirit. + _ = c.Close(websocket.StatusCode(4409), "server-id already claimed") + logger.Info("server_id_conflict", + "server_id", serverID, + "remote", remoteHost(r)) + return + } + // No other error is reachable from the current registry; close + // defensively without logging, since the diff that introduces + // such an error should add the log. + wsconn.Close() + return + } + + logger.Info("server_claimed", + "server_id", serverID, + "binary_version", versionHeader, + "remote", remoteHost(r)) + + defer func() { + reg.ReleaseServer(serverID) + wsconn.Close() + logger.Info("server_released", "server_id", serverID) + }() + + // Hold the connection open until the peer closes it. CloseRead + // spawns a goroutine that drains-and-discards frames (including + // control frames like ping/pong, which must be processed for the + // connection to observe a peer-side close) and returns a context + // cancelled when the conn ends. The frame loop (#6) replaces this + // block with a real read loop later. + readCtx := c.CloseRead(r.Context()) + <-readCtx.Done() + }) +} + +// randHex8 returns 8 hex chars (32 bits) from crypto/rand. Panics on a +// crypto/rand read failure — the OS RNG failing is not a recoverable error +// in this process. +func randHex8() string { + var b [4]byte + if _, err := rand.Read(b[:]); err != nil { + panic("crypto/rand: " + err.Error()) + } + return hex.EncodeToString(b[:]) +} + +// remoteHost strips the port from r.RemoteAddr. Falls back to RemoteAddr +// verbatim if SplitHostPort fails. Used only for logging. +func remoteHost(r *http.Request) string { + if host, _, err := net.SplitHostPort(r.RemoteAddr); err == nil { + return host + } + return r.RemoteAddr +} diff --git a/internal/relay/server_endpoint_test.go b/internal/relay/server_endpoint_test.go new file mode 100644 index 0000000..3ce2358 --- /dev/null +++ b/internal/relay/server_endpoint_test.go @@ -0,0 +1,306 @@ +package relay + +import ( + "context" + "encoding/hex" + "errors" + "io" + "log/slog" + "net/http" + "net/http/httptest" + "strings" + "testing" + "time" + + "nhooyr.io/websocket" +) + +// startServer spins up an httptest.NewServer running ServerHandler against a +// fresh registry. It returns the registry, the WS URL, and a cleanup. +func startServer(t *testing.T) (*Registry, string, func()) { + t.Helper() + reg := NewRegistry() + logger := slog.New(slog.NewTextHandler(io.Discard, nil)) + srv := httptest.NewServer(ServerHandler(reg, logger)) + wsURL := "ws" + strings.TrimPrefix(srv.URL, "http") + return reg, wsURL, srv.Close +} + +func dialWith(t *testing.T, wsURL string, hdr http.Header) (*websocket.Conn, *http.Response, error) { + t.Helper() + ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer cancel() + return websocket.Dial(ctx, wsURL, &websocket.DialOptions{HTTPHeader: hdr}) +} + +func validHeaders(serverID string) http.Header { + h := http.Header{} + h.Set("X-Pyrycode-Server", serverID) + h.Set("X-Pyrycode-Version", "0.1.0-test") + h.Set("User-Agent", "pyry-test/0.1.0") + return h +} + +func TestServerEndpoint_ValidUpgrade_RegistersBinary(t *testing.T) { + reg, wsURL, cleanup := startServer(t) + defer cleanup() + + c, _, err := dialWith(t, wsURL, validHeaders("s1")) + if err != nil { + t.Fatalf("dial: %v", err) + } + defer c.Close(websocket.StatusNormalClosure, "") + + // The handler runs in its own goroutine; ClaimServer happens after + // Accept returns to the dialer. Poll briefly for the registration. + deadline := time.Now().Add(time.Second) + var conn Conn + var ok bool + for time.Now().Before(deadline) { + conn, ok = reg.BinaryFor("s1") + if ok { + break + } + time.Sleep(10 * time.Millisecond) + } + if !ok { + t.Fatalf("BinaryFor(s1) not registered") + } + + id := conn.ConnID() + const prefix = "server-s1-" + if !strings.HasPrefix(id, prefix) { + t.Fatalf("ConnID() = %q, want prefix %q", id, prefix) + } + suffix := strings.TrimPrefix(id, prefix) + if len(suffix) != 8 { + t.Fatalf("ConnID suffix = %q (len %d), want 8 hex chars", suffix, len(suffix)) + } + if _, err := hex.DecodeString(suffix); err != nil { + t.Fatalf("ConnID suffix %q is not hex: %v", suffix, err) + } +} + +func TestServerEndpoint_HeaderGate_400(t *testing.T) { + type row struct { + name string + header string + value string + delete bool + } + rows := []row{ + {name: "missing_X-Pyrycode-Server", header: "X-Pyrycode-Server", delete: true}, + {name: "empty_X-Pyrycode-Server", header: "X-Pyrycode-Server", value: ""}, + {name: "missing_X-Pyrycode-Version", header: "X-Pyrycode-Version", delete: true}, + {name: "empty_X-Pyrycode-Version", header: "X-Pyrycode-Version", value: ""}, + {name: "missing_User-Agent", header: "User-Agent", delete: true}, + {name: "empty_User-Agent", header: "User-Agent", value: ""}, + } + + for _, tc := range rows { + tc := tc + t.Run(tc.name, func(t *testing.T) { + reg := NewRegistry() + logger := slog.New(slog.NewTextHandler(io.Discard, nil)) + srv := httptest.NewServer(ServerHandler(reg, logger)) + defer srv.Close() + + req, err := http.NewRequest(http.MethodGet, srv.URL+"/", nil) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("Upgrade", "websocket") + req.Header.Set("Connection", "Upgrade") + req.Header.Set("Sec-WebSocket-Version", "13") + req.Header.Set("Sec-WebSocket-Key", "dGhlIHNhbXBsZSBub25jZQ==") + req.Header.Set("X-Pyrycode-Server", "s1") + req.Header.Set("X-Pyrycode-Version", "0.1.0-test") + req.Header.Set("User-Agent", "pyry-test/0.1.0") + if tc.delete { + // For User-Agent, http.Client adds a default value when + // the header is Del'd. Setting the key to a nil slice + // keeps Header.has true so the default is not injected, + // while sending no header on the wire. + req.Header[tc.header] = nil + } else { + req.Header.Set(tc.header, tc.value) + } + + client := &http.Client{Timeout: 5 * time.Second} + resp, err := client.Do(req) + if err != nil { + t.Fatalf("do: %v", err) + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusBadRequest { + t.Fatalf("status = %d, want 400", resp.StatusCode) + } + + if b, p := reg.Counts(); b != 0 || p != 0 { + t.Fatalf("registry counts = (%d, %d), want (0, 0)", b, p) + } + }) + } +} + +func TestServerEndpoint_DuplicateClaim_4409(t *testing.T) { + reg, wsURL, cleanup := startServer(t) + defer cleanup() + + c1, _, err := dialWith(t, wsURL, validHeaders("s1")) + if err != nil { + t.Fatalf("dial #1: %v", err) + } + defer c1.Close(websocket.StatusNormalClosure, "") + + // Wait for the first claim to land. + deadline := time.Now().Add(time.Second) + for time.Now().Before(deadline) { + if _, ok := reg.BinaryFor("s1"); ok { + break + } + time.Sleep(10 * time.Millisecond) + } + if _, ok := reg.BinaryFor("s1"); !ok { + t.Fatalf("first claim did not register") + } + + c2, _, err := dialWith(t, wsURL, validHeaders("s1")) + if err != nil { + t.Fatalf("dial #2: %v", err) + } + defer c2.Close(websocket.StatusNormalClosure, "") + + readCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer cancel() + _, _, readErr := c2.Read(readCtx) + if readErr == nil { + t.Fatalf("Read on duplicate conn returned nil error, want close") + } + var ce websocket.CloseError + if !errors.As(readErr, &ce) { + t.Fatalf("Read err = %v (%T), want *websocket.CloseError", readErr, readErr) + } + if ce.Code != websocket.StatusCode(4409) { + t.Fatalf("close code = %d, want 4409", ce.Code) + } + if ce.Reason != "server-id already claimed" { + t.Fatalf("close reason = %q, want %q", ce.Reason, "server-id already claimed") + } + + if _, ok := reg.BinaryFor("s1"); !ok { + t.Fatalf("first claim was evicted by duplicate; registry no longer holds s1") + } +} + +func TestServerEndpoint_PeerClose_ReleasesSlot(t *testing.T) { + reg, wsURL, cleanup := startServer(t) + defer cleanup() + + c, _, err := dialWith(t, wsURL, validHeaders("s1")) + if err != nil { + t.Fatalf("dial: %v", err) + } + + deadline := time.Now().Add(time.Second) + for time.Now().Before(deadline) { + if _, ok := reg.BinaryFor("s1"); ok { + break + } + time.Sleep(10 * time.Millisecond) + } + if _, ok := reg.BinaryFor("s1"); !ok { + t.Fatalf("claim did not register") + } + + if err := c.Close(websocket.StatusNormalClosure, ""); err != nil { + t.Fatalf("client Close: %v", err) + } + + deadline = time.Now().Add(2 * time.Second) + released := false + for time.Now().Before(deadline) { + if _, ok := reg.BinaryFor("s1"); !ok { + released = true + break + } + time.Sleep(10 * time.Millisecond) + } + if !released { + t.Fatalf("BinaryFor(s1) still registered after peer close") + } + + c2, _, err := dialWith(t, wsURL, validHeaders("s1")) + if err != nil { + t.Fatalf("re-dial: %v", err) + } + defer c2.Close(websocket.StatusNormalClosure, "") + + deadline = time.Now().Add(time.Second) + for time.Now().Before(deadline) { + if _, ok := reg.BinaryFor("s1"); ok { + return + } + time.Sleep(10 * time.Millisecond) + } + t.Fatalf("re-claim did not register") +} + +func TestServerEndpoint_WrongMethod_NoPanic(t *testing.T) { + cases := []struct { + name string + do func(srvURL string) (*http.Response, error) + }{ + { + name: "GET_no_upgrade_headers", + do: func(srvURL string) (*http.Response, error) { + req, err := http.NewRequest(http.MethodGet, srvURL+"/", nil) + if err != nil { + return nil, err + } + req.Header.Set("X-Pyrycode-Server", "s1") + req.Header.Set("X-Pyrycode-Version", "0.1.0-test") + req.Header.Set("User-Agent", "pyry-test/0.1.0") + return (&http.Client{Timeout: 5 * time.Second}).Do(req) + }, + }, + { + name: "POST", + do: func(srvURL string) (*http.Response, error) { + req, err := http.NewRequest(http.MethodPost, srvURL+"/", nil) + if err != nil { + return nil, err + } + req.Header.Set("Content-Type", "application/octet-stream") + req.Header.Set("X-Pyrycode-Server", "s1") + req.Header.Set("X-Pyrycode-Version", "0.1.0-test") + req.Header.Set("User-Agent", "pyry-test/0.1.0") + return (&http.Client{Timeout: 5 * time.Second}).Do(req) + }, + }, + } + + for _, tc := range cases { + tc := tc + t.Run(tc.name, func(t *testing.T) { + reg := NewRegistry() + logger := slog.New(slog.NewTextHandler(io.Discard, nil)) + srv := httptest.NewServer(ServerHandler(reg, logger)) + defer srv.Close() + + resp, err := tc.do(srv.URL) + if err != nil { + t.Fatalf("do: %v", err) + } + defer resp.Body.Close() + + if resp.StatusCode < 400 || resp.StatusCode >= 500 { + t.Fatalf("status = %d, want 4xx", resp.StatusCode) + } + if b, p := reg.Counts(); b != 0 || p != 0 { + t.Fatalf("registry counts = (%d, %d), want (0, 0)", b, p) + } + }) + } +} From 24a032404ade058c01bd16482198df745fa1932d Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Fri, 8 May 2026 22:08:03 +0300 Subject: [PATCH 3/3] docs: /v1/server WS upgrade feature, close-code ADR, memory updates (#16) - Add features/server-endpoint.md covering the binary-side ingress: header gate pre-upgrade, ClaimServer integration, 4409 conflict path, CloseRead-held connection until #6's frame loop lands. - Add ADR-0005 capturing the rule that application close codes go through the underlying *websocket.Conn (stillborn-WSConn window), keeping WSConn.Close fixed to StatusNormalClosure. - Update INDEX with new feature and ADR. - Update PROJECT-MEMORY: mark /v1/server done; add patterns for pre-upgrade validation, close-code emission, defer-after-claim ordering, and the CloseRead bridge for long-lived WS handlers. - Add four lessons: control-frame requirement, close-code emission rule, defer ordering for ReleaseServer, and crypto/rand panic-as-fatal idiom. --- docs/PROJECT-MEMORY.md | 9 +- docs/knowledge/INDEX.md | 2 + ...ication-close-codes-via-underlying-conn.md | 52 ++++++ docs/knowledge/features/server-endpoint.md | 151 ++++++++++++++++++ docs/lessons.md | 16 ++ 5 files changed, 228 insertions(+), 2 deletions(-) create mode 100644 docs/knowledge/decisions/0005-application-close-codes-via-underlying-conn.md create mode 100644 docs/knowledge/features/server-endpoint.md diff --git a/docs/PROJECT-MEMORY.md b/docs/PROJECT-MEMORY.md index aa10785..de99412 100644 --- a/docs/PROJECT-MEMORY.md +++ b/docs/PROJECT-MEMORY.md @@ -13,8 +13,9 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex | TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` | | `WSConn` adapter (`nhooyr.io/websocket.Conn` → registry `Conn`; per-conn write mutex; `Close`-cancelled context; 10s `Send` deadline) | Done (#15) | `internal/relay/ws_conn.go` | | `/healthz` JSON endpoint (`status`, `version`, `connected_binaries`, `connected_phones`, `uptime_seconds`; `Cache-Control: no-store`; unauthenticated) | Done (#10) | `internal/relay/healthz.go`, `cmd/pyrycode-relay/main.go` | -| WS upgrade on `/v1/server` and `/v1/client` | Not started | — | -| Header validation (`x-pyrycode-server`, `x-pyrycode-token`) | Not started | — | +| WS upgrade on `/v1/server` (header gate pre-upgrade; `ClaimServer`; `4409` on conflict; `CloseRead`-held until #6) | Done (#16) | `internal/relay/server_endpoint.go`, `cmd/pyrycode-relay/main.go` | +| WS upgrade on `/v1/client` | Not started | — | +| Header validation (`x-pyrycode-server`, `x-pyrycode-version`, `user-agent` on `/v1/server`; `x-pyrycode-token` on `/v1/client`) | `/v1/server` done (#16); `/v1/client` not started | `internal/relay/server_endpoint.go` | | Frame forwarding using the routing envelope | Not started | — | | `conn_id` generation scheme | Not started | — | | Threat model doc — operational surface (deploy, supply chain, DoS, log hygiene, cert handling, TLS, error leakage) | Done (#11) | `docs/threat-model.md` | @@ -31,6 +32,10 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex - **Interface methods called under the lock are documented as non-blocking getters.** The registry's `Conn.ConnID()` is invoked under the write lock during `UnregisterPhone`; `Send` and `Close` are never called while the lock is held. Pattern: state the contract on the interface, never call something that could block on I/O while a mutex is held. - **Adapters bridge interface↔library API mismatches by owning policy locally.** When a library method needs a `context.Context` but the registry's `Conn` interface doesn't take one (and shouldn't — most callers don't have a context to thread), the adapter owns its own context: derived in the constructor, cancelled by `Close`, narrowed per-call with `WithTimeout` for deadline policy. Adopted in `WSConn` (#15); avoids forcing context-plumbing changes into upstream interfaces. - **Handler factories return `http.Handler`, not `http.HandlerFunc`.** `NewHealthzHandler(reg, version, startedAt) http.Handler` keeps construction factory-shaped so a future ticket adding per-handler state (logger, injectable clock) can do so without touching the call site in `main`. Adopted in `/healthz` (#10). +- **Validate adversarial input *before* allocating WS state.** Header-gate checks (presence + non-empty) run before `websocket.Accept`; a malformed request never receives an upgrade response and the registry is untouched on every error path. Adopted in `/v1/server` (#16); the same shape applies to `/v1/client` token validation (#5). +- **Application WS close codes are emitted on the underlying `*websocket.Conn`, not via `WSConn`.** `WSConn.Close()` always emits `StatusNormalClosure` — no policy parameters. Handlers that need `4409` / `4401` / `4404` call `c.Close(websocket.StatusCode(code), reason)` directly in the stillborn-WSConn window (after construction, before any `Send`). See [ADR-0005](knowledge/decisions/0005-application-close-codes-via-underlying-conn.md). Adopted in `/v1/server` (#16); applies to `/v1/client` (#5). +- **`defer { Release; Close; log }` is registered AFTER the successful claim**, never before. A conflict path (or any pre-claim error) returns before the defer is in place, so we never `ReleaseServer` on a slot we don't hold. Adopted in `/v1/server` (#16); same structural ordering applies to `/v1/client`'s phone registration. +- **Hold long-lived WS handlers open with `c.CloseRead(r.Context())` plus `<-readCtx.Done()` until the real read loop lands.** `CloseRead` drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close). The frame loop ticket replaces `<-readCtx.Done()` with the actual read body in the same call site. Adopted in `/v1/server` (#16) pending #6. - **Capture process-state timestamps in `main` after `flag.Parse()`, not as package-level vars.** `startedAt := time.Now()` lives inside `main` and is passed into the handler factory. A package-level `var startedAt = time.Now()` would fire at import time — before flag parsing, before `--version` early-returns — and be wrong for short-lived test binaries and any future deferred-serve setup. Adopted in #10. ## Conventions diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index a7a3355..b5a24ff 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Features +- [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, holds the conn via `CloseRead` until #6's frame loop replaces it. - [`/healthz` JSON endpoint](features/healthz.md) — unauthenticated `GET /healthz` returning `{status, version, connected_binaries, connected_phones, uptime_seconds}`; `Cache-Control: no-store`, body bounded ≈135 bytes. - [WSConn adapter](features/ws-conn-adapter.md) — wraps `nhooyr.io/websocket.Conn` to satisfy the registry's `Conn`; owns the per-conn write mutex and a `Close`-cancelled context with a 10s per-`Send` deadline. - [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`. @@ -12,6 +13,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Decisions +- [ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`](decisions/0005-application-close-codes-via-underlying-conn.md) — `WSConn.Close()` stays minimal (`StatusNormalClosure`); handlers emit `4409` / `4401` / `4404` directly on the `*websocket.Conn` in stillborn-WSConn paths. - [ADR-0004: WS library choice and adapter context strategy](decisions/0004-ws-library-and-adapter-context-strategy.md) — `nhooyr.io/websocket` v1.8.x; adapter-owned context cancelled by `Close`, 10s per-`Send` deadline; registry's `Send` signature stays context-free. - [ADR-0003: Connection registry as a passive store](decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, narrow `ReleaseServer` so grace logic (#8) can wrap it. - [ADR-0002: Explicit-failure 404 on `:80` instead of HTTPS redirect](decisions/0002-autocert-explicit-failure-on-port-80.md) — pass `http.NotFoundHandler()` to `autocert.Manager.HTTPHandler`, not `nil`; loud failure over silent scheme-switching. diff --git a/docs/knowledge/decisions/0005-application-close-codes-via-underlying-conn.md b/docs/knowledge/decisions/0005-application-close-codes-via-underlying-conn.md new file mode 100644 index 0000000..1a07fd3 --- /dev/null +++ b/docs/knowledge/decisions/0005-application-close-codes-via-underlying-conn.md @@ -0,0 +1,52 @@ +# ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`, not `WSConn` + +**Status:** Accepted (#16) +**Date:** 2026-05-08 + +## Context + +`WSConn` ([ADR-0004](0004-ws-library-and-adapter-context-strategy.md), #15) is the registry's `Conn`-interface adapter over `nhooyr.io/websocket.Conn`. Its `Close()` always emits `StatusNormalClosure` — that is the registry's contract: "drop this connection cleanly," with no policy on *why*. + +The protocol defines application close codes in the 4000–4999 range: + +- `4401` — token invalid (on `/v1/client`, #5). +- `4404` — no server for `serverID` (on `/v1/client`, #5). +- `4409` — `serverID` already claimed (on `/v1/server`, #16). + +Plus future frame-forward error codes on both endpoints. The upgrade handlers need to emit these, but `WSConn` does not know *why* a close is happening — that knowledge lives in the handler. + +So: where does the code that picks a close code (and the human-readable reason) belong? + +## Decision + +**Application close codes are emitted by calling `Close(code, reason)` directly on the underlying `*websocket.Conn`, not through `WSConn`.** `WSConn.Close()` keeps a fixed `StatusNormalClosure` body. The handler that holds the policy holds the close-code call site. + +Specifically: at the moment a handler decides to refuse a connection (e.g. `4409` on duplicate claim), it calls `c.Close(websocket.StatusCode(code), reason)` on the `*websocket.Conn` it received from `websocket.Accept`, before `wsconn.Close()` would have run. The `WSConn` was constructed but no `Send` was attempted and no goroutine holds `writeMu` — a stillborn WSConn — so the WSConn invariant ("callers must reach the connection only through WSConn methods") is preserved in spirit: the conn was never *used* through the adapter. + +Adopted in `internal/relay/server_endpoint.go` (#16) for `4409`. The same pattern applies to `4401` / `4404` in `/v1/client` (#5). + +## Rationale + +Three options were on the table: + +1. **Add `WSConn.CloseWithCode(code websocket.StatusCode, reason string)`.** Inflates the adapter's API surface for one caller per close code. Each new code adds a method or makes the existing one's signature creep. The adapter would also need policy on what to do if `Close` was already called — the `closeOnce` guard exists precisely to make `Close` idempotent, and a new method would have to either share the guard (then which code wins?) or duplicate it. Rejected — the adapter's small-surface property is load-bearing for review (#15's security review hinged on it). + +2. **Pass the close code into `WSConn` at construction time, store it, and emit it from `Close`.** Doesn't fit: the close code at construction time is *not* the close code the handler eventually decides on. By the time we know it's `4409`, the WSConn already exists (we constructed it to pass to `ClaimServer`). + +3. **Direct `c.Close(code, reason)` from the handler, no method on `WSConn`.** Adopted. Keeps `WSConn` a closed, narrow type. Keeps the close-code knowledge in the handler that holds the policy. + +Option 3 has one cost: it documents an exception to WSConn's "all access goes through the adapter" invariant. The exception is bounded by structure — a stillborn WSConn before any `Send`, no concurrent writer to interleave with, no `writeMu` contender — so the spirit of the invariant survives. The handler comments name the exception so a future reader doesn't generalise it into a "you can use `c` directly any time" pattern. + +## Consequences + +- `WSConn.Close()` stays minimal: `StatusNormalClosure`, idempotent, owns the cancel of `closeCtx`. No close-code variants, no policy parameters. +- Handlers that emit application close codes do so on the `*websocket.Conn` they received from `websocket.Accept`, before the WSConn would have been closed via `defer`. The handler must comment the exception inline so it doesn't read as a violation. +- The pattern only applies to **stillborn WSConn** paths — pre-`Send`, pre-defer. Once a WSConn is in use (registered with `Send` reachable from broadcasters), close goes through `WSConn.Close()` and the code is `StatusNormalClosure`. Application close codes for an *active* WSConn (e.g. binary disconnects mid-stream and we want to surface a forward-error code) need a different solution — likely a small `WSConn.CloseWithCode` added at the time the first such use case lands, justified by an ADR amendment. +- Each new application close code (`4401`, `4404`, future frame-forward codes) follows this pattern at its handler, not by widening `WSConn`. + +## Related + +- [ADR-0003](0003-connection-registry-passive-store.md) — registry's `Conn` interface and the `Send/Close` contract `WSConn` implements. +- [ADR-0004](0004-ws-library-and-adapter-context-strategy.md) — why `WSConn.Close()` cancels `closeCtx` instead of taking `writeMu`; the property the stillborn-conn case relies on. +- [Server endpoint feature](../features/server-endpoint.md) — first concrete user of this pattern (`4409`). +- [WSConn adapter](../features/ws-conn-adapter.md) — the type whose surface this ADR keeps minimal. diff --git a/docs/knowledge/features/server-endpoint.md b/docs/knowledge/features/server-endpoint.md new file mode 100644 index 0000000..216fa16 --- /dev/null +++ b/docs/knowledge/features/server-endpoint.md @@ -0,0 +1,151 @@ +# `/v1/server` — binary-side WebSocket upgrade + +`/v1/server` is the relay's ingress for pyry binaries. A binary opens an outbound WSS to the relay, sends three required headers, and — if no other binary holds the same `serverID` — gets registered in the connection registry and held there until the connection ends. The endpoint enforces first-claim-wins via `Registry.ClaimServer`; a duplicate claim is closed with WS code `4409`. + +This is the binary side only. Phone ingress (`/v1/client`) is #5; frame forwarding is #6; heartbeat is #7; the 30-second grace period on disconnect is #8. The handler currently holds the connection open by draining-and-discarding frames — #6 swaps in the real read loop in the same call site. + +## Wire shape + +```http +GET /v1/server HTTP/1.1 +Host: relay.example.com +Upgrade: websocket +Connection: Upgrade +Sec-WebSocket-Version: 13 +Sec-WebSocket-Key: +X-Pyrycode-Server: +X-Pyrycode-Version: +User-Agent: +``` + +| Header (canonical) | Source | Enforcement | +|---|---|---| +| `X-Pyrycode-Server` | binary's server-id | non-empty | +| `X-Pyrycode-Version` | binary's version string | non-empty | +| `User-Agent` | HTTP standard | non-empty | + +`r.Header.Get` canonicalises case. Validation runs **before** `websocket.Accept`: a missing header returns `400 Bad Request` with an empty body, no upgrade attempted, registry untouched. Wrong method (`POST`) or missing `Upgrade` headers fall through to `websocket.Accept`, which writes the library's standard 4xx. The handler returns silently in both cases. + +## Close codes + +| Code | Meaning | +|---|---| +| `1000` (`StatusNormalClosure`) | clean close on shutdown / release. | +| `4409` | `serverID` already claimed by another binary. Reason: `"server-id already claimed"`. | + +`4409` is application-defined per RFC 6455 (4000–4999 range). Future tickets layer `4401` (token) and `4404` (no server) on `/v1/client`, plus frame-forward errors here. + +## API + +Package `internal/relay` (`server_endpoint.go`): + +```go +func ServerHandler(reg *Registry, logger *slog.Logger) http.Handler +``` + +One exported symbol. No new types, no new sentinel errors. Wired in `cmd/pyrycode-relay/main.go` next to `/healthz`: + +```go +mux.Handle("/v1/server", relay.ServerHandler(reg, logger)) +``` + +## Algorithm + +1. Validate the three required headers. Any missing/empty → `400`, return. +2. `websocket.Accept` with `OriginPatterns: []string{"*"}`. On error, the library has already written a 4xx; return silently. +3. Construct `connID = "server-" + serverID + "-" + randHex8()` (8 hex chars from `crypto/rand`). +4. Wrap with `NewWSConn(c, connID)`. +5. `reg.ClaimServer(serverID, wsconn)`: + - On `ErrServerIDConflict` → close the underlying `*websocket.Conn` with code `4409`, log `server_id_conflict`, return. + - On any other error → `wsconn.Close()`, return (defensive; not currently reachable). +6. Log `server_claimed`. +7. `defer { reg.ReleaseServer; wsconn.Close; log server_released }`. The defer is registered **after** the successful claim so a missed/conflicted claim never tries to release a slot we don't hold. +8. `readCtx := c.CloseRead(r.Context()); <-readCtx.Done()` — block until the peer closes. + +## Logging + +Three structured event types. Field set is fixed; nothing else (full headers, payloads, conn-id internals) is logged. + +| event | fields | +|---|---| +| `server_claimed` | `server_id`, `binary_version`, `remote` | +| `server_id_conflict` | `server_id`, `remote` | +| `server_released` | `server_id` | + +`binary_version` is the value of `X-Pyrycode-Version`, advertised by the binary itself — informational. `remote` is the IP host portion of `r.RemoteAddr`, no port (via `net.SplitHostPort`, with a verbatim fallback). `User-Agent` is required for entry but never logged (no operational use). + +The conflict path **deliberately omits** `binary_version`: an unauthenticated probe surface that echoed the version of the legitimate squatter would be a version-disclosure oracle. Header gate failures are not logged at all — avoids amplifying header-floods into log volume. + +## Concurrency + +| Step | Goroutine | Lock | Lifecycle | +|---|---|---|---| +| Header validation | request goroutine | none | pre-upgrade; no resources held | +| `websocket.Accept` | request goroutine | none | conn allocated on success | +| `ClaimServer` | request goroutine | registry write lock (held internally) | one-shot | +| `CloseRead` | spawns one read-discard goroutine | none | terminates on conn close / `r.Context()` cancel | +| `<-readCtx.Done()` | request goroutine, blocking | none | unblocks on peer close | +| `defer` (release/close/log) | request goroutine | `wsconn.closeOnce` | runs only on the success path; idempotent | + +Shutdown: `http.Server` cancels `r.Context()` → `CloseRead`'s read errors → `readCtx` cancels → handler unblocks → defer runs → slot released. `ReleaseServer`'s grace-period wrapper from #8 will hook the same defer. + +## Design notes + +- **`OriginPatterns: ["*"]`.** `nhooyr.io/websocket.Accept` rejects cross-origin upgrades by default. Programmatic clients don't have to send `Origin`; setting `["*"]` says "this endpoint is not browser-facing." Safe because the custom-header gate one layer above structurally excludes browser-driven CSWSH: raw browser WebSocket cannot set custom request headers, and `fetch` with custom headers triggers a CORS pre-flight this endpoint does not satisfy. +- **Direct `c.Close` on conflict, not `wsconn.Close()`.** `WSConn.Close` always emits `StatusNormalClosure`; the conflict path needs `4409`. Adding a custom-code close to `WSConn` would inflate its surface for one caller. The conflict case is a stillborn WSConn — no `Send` was attempted, no goroutine holds `writeMu` — so the WSConn invariant ("reach the connection only through WSConn methods") is preserved in spirit. The handler comments name this exception so a future reader doesn't generalise it. See [ADR-0005](../decisions/0005-application-close-codes-via-underlying-conn.md). +- **`CloseRead` instead of `for { c.Read() }`.** The handler must keep the goroutine alive until the peer disconnects, but reading frames is #6's job. `CloseRead` spawns a goroutine that drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close) and returns a context cancelled on close. When #6 lands, `<-readCtx.Done()` is replaced with a real frame loop in the same call site. +- **`crypto/rand`-backed `connID` suffix.** 32 bits is sufficient: scoped per server-id, used only as an opaque map key in `UnregisterPhone`. Birthday bound non-trivial only at ~2¹⁶ live conns under one server-id — far beyond v1 scale. Using `crypto/rand` over `math/rand` is forward-compat hardening: if a future ticket exposes the conn-id (e.g. echoing it in an error envelope), unguessable bytes avoid creating an oracle by accident. +- **`crypto/rand` failure is fatal-by-design.** `randHex8` panics on RNG read failure; on Linux/macOS the OS RNG does not block once boot-initialised, so a failure here means a broken host. `http.Server` panic recovery terminates the connection; the slot is not yet claimed when `randHex8` runs, so the registry stays clean. + +## What this handler deliberately does NOT do + +- **No length cap or charset check on `serverID`.** Registry treats it as opaque; the protocol spec is the right layer to define shape. +- **No `400` log line.** Avoids amplifying header-floods into log volume. +- **No log on `websocket.Accept` errors.** Library writes a 4xx; the failure is visible in the http access log. A per-failure log would invite log-flood from misconfigured clients. A counter (Prometheus, future) is the right shape if observability later wants this. +- **No connection caps (per-IP or global).** Documented residual in `docs/threat-model.md` § DoS resistance; the WS upgrade path is named there. Inherited gap, not widened. +- **No frame loop.** `CloseRead` discards frames until #6 replaces it. +- **No heartbeat / ping-pong.** #7. +- **No grace-period release.** Immediate release; #8 wraps `ReleaseServer` in a deferred timer. +- **No token validation.** `/v1/server` is unauthenticated by design — the binary owns the trust relationship with phones. +- **No log-line sanitisation.** `slog`'s text handler quotes string values, so a newline in `binary_version` cannot forge a log line. JSON handler in production preserves the property. + +## Adversarial framing + +The handler is internet-exposed and unauthenticated. Threats considered: + +- **Fake `serverID` claim.** First-claim-wins is the trust primitive at this layer. Mitigation lives at the protocol layer (binary validates phone tokens; phones validate the binary out-of-band). Out of scope here. +- **Squatting on a `serverID`.** Legitimate binary sees `4409` until the squatter disconnects. Phones treat unresponsive binaries as offline. Connection caps are a future DoS ticket. +- **Crafted headers (NULs, very long values).** `slog` quotes string values; `net.SplitHostPort` runs on `r.RemoteAddr` (set by `net/http`, not attacker-controlled); `r.Header.Get` does no parsing. No fragile parse step the attacker controls. +- **Space-only headers.** `!= ""` accepts them. Hardening to "trim then non-empty" would be invented enforcement; the fields are informational, not enforcement, and no observed failure motivates it. +- **Frame floods after upgrade.** `CloseRead` discards. One goroutine per connection. No amplification. Per-frame work is a header read + body drop. Acceptable for v1. +- **Browser-driven CSWSH via `OriginPatterns: ["*"]`.** Browsers cannot set custom request headers on raw WebSocket; `fetch` with custom headers triggers CORS pre-flight. The header gate structurally excludes browser attackers; `OriginPatterns: ["*"]` is safe because of the gate, not despite it. +- **`POST /v1/server` to confuse the upgrade.** `websocket.Accept` rejects non-GET with a library 4xx. Registry untouched. Tested. +- **Conn-id collision.** Birthday bound is fine at v1 scale. Widen the suffix if scale ever demands; public API doesn't change. +- **`crypto/rand` failure.** Panic terminates the connection; the http server recovers; other connections continue. A failing host at this depth is operator-visible; refusing to serve is correct (loud failure over silent correction). +- **Goroutine leak on shutdown.** `r.Context()` cancellation propagates through `CloseRead` to `readCtx`; defer runs; no leak. + +Verdict from #16's security review: **PASS**. No new defences invented; no new threats opened; the residual risks are the same connection-cap and slow-peer gaps `docs/threat-model.md` already names. + +## Testing + +`internal/relay/server_endpoint_test.go`, `package relay`. End-to-end against a real `*websocket.Conn` via `httptest.NewServer(ServerHandler(reg, logger))`. Logger is `slog.New(slog.NewTextHandler(io.Discard, nil))` — log-field assertions are out of scope; future tickets that need them swap in a JSON handler over a `bytes.Buffer`. + +Tests (1:1 with AC bullets): + +- `TestServerEndpoint_ValidUpgrade_RegistersBinary` — valid headers; assert dial succeeds, `reg.BinaryFor(...)` returns a non-nil `Conn` whose `ConnID()` is `"server--<8 hex>"`. +- `TestServerEndpoint_HeaderGate_400` — table-driven over the three required headers (each row deletes or empties one); plain `http.Client.Do` with standard upgrade headers; expect `400`; `reg.Counts() == (0, 0)`. +- `TestServerEndpoint_DuplicateClaim_4409` — first dial succeeds; second dial with same headers; one `Read` on the second conn returns a `*websocket.CloseError` with `Code == 4409` and `Reason == "server-id already claimed"` (asserted via `errors.As`). First conn stays registered. +- `TestServerEndpoint_PeerClose_ReleasesSlot` — dial, assert registered; client `Close(StatusNormalClosure, "")`; poll `reg.BinaryFor` (~1s, 10ms ticks) because release races the client close; assert eventually unregistered; re-claim succeeds. +- `TestServerEndpoint_WrongMethod_NoPanic` — `GET` without upgrade headers and `POST` with valid pyrycode headers; both expect a 4xx, no panic, registry empty. + +Not tested: registry mocks (use the real one — it's race-tested in #3); library mocks; `randHex8` distribution (stdlib); exact log output. + +## Related + +- [Connection registry](connection-registry.md) — `ClaimServer` / `ReleaseServer` / `ErrServerIDConflict` are the primitives this handler routes through. +- [WSConn adapter](ws-conn-adapter.md) — what the handler hands to `ClaimServer`. The handler's direct-`c.Close` on conflict is the documented exception to the WSConn-only invariant. +- [`/healthz`](healthz.md) — sibling unauthenticated endpoint; mirrors the handler-factory shape. +- [ADR-0005](../decisions/0005-application-close-codes-via-underlying-conn.md) — close-code emission goes through the underlying `*websocket.Conn`, not WSConn. +- [ADR-0003](../decisions/0003-connection-registry-passive-store.md) — narrow `ReleaseServer` so #8's grace logic can wrap it. +- [Threat model](../../threat-model.md) § DoS resistance — connection-cap residual is named there. +- [Protocol spec § Authentication → Binary→relay](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#binary--relay) — authoritative wire shape. diff --git a/docs/lessons.md b/docs/lessons.md index c07a80f..faca29a 100644 --- a/docs/lessons.md +++ b/docs/lessons.md @@ -2,6 +2,22 @@ Gotchas worth carrying forward. Each entry: what bit us (or nearly did), and what we do about it. +## A long-lived WS handler that does not read frames will never observe peer close + +A WebSocket connection only sees a peer-side close when *something* on this side reads from the conn — control frames (ping/pong/close) are processed inline with reads. A handler that just blocks forever (e.g. `select {}` or `<-r.Context().Done()`) keeps the goroutine alive but does not move the close machinery: the conn stays "open" until the kernel TCP timeout, and `r.Context()` does not cancel on peer close (it cancels on *server* shutdown / client TCP RST seen by the http server). Use `c.CloseRead(r.Context())` to spawn a drain-and-discard goroutine and block on the returned context — that goroutine processes control frames and cancels the context on close. When the real frame loop lands later, swap `<-readCtx.Done()` for the loop body. Source: `/v1/server` (#16). + +## `*websocket.Conn.Close(code, reason)` emits an application close code; the `WSConn` adapter does not + +The registry's `WSConn.Close()` is fixed to `StatusNormalClosure` by contract. To emit `4409` / `4401` / `4404`, call `Close` directly on the underlying `*websocket.Conn` from the upgrade handler — but only in the stillborn-WSConn window (after construction, before any `Send`, before the close `defer` is registered). Outside that window you'd race the adapter's `closeOnce` and `writeMu`. The handler comments name the exception so it doesn't read as a "use the underlying conn whenever you like" pattern. Captured as ADR-0005. Source: `/v1/server` conflict path (#16). + +## `defer ReleaseServer(...)` must be registered AFTER the successful `ClaimServer` + +If you `defer reg.ReleaseServer(serverID)` before `ClaimServer` returns, the conflict path (`ErrServerIDConflict`) ends up calling `ReleaseServer` on a slot it does not hold — currently a no-op (returns false) but a structural foot-gun if a future grace-period wrapper from #8 starts a timer on every `ReleaseServer` call. Order: claim, then log success, then `defer`. Source: `/v1/server` (#16). + +## `crypto/rand.Read` failure is fatal-by-design on Linux/macOS + +The OS RNG (`getrandom` / `/dev/urandom`) does not block once the kernel pool is initialised at boot. A failure at runtime means the host is broken in a way the relay cannot meaningfully recover from. Stdlib `crypto/rand` examples treat it as fatal; `panic("crypto/rand: " + err.Error())` is the idiom. The http server's panic recovery terminates the connection cleanly; other connections continue. Source: `randHex8` in `/v1/server` (#16). + ## `json.Valid` rejects nil and empty bytes `encoding/json.Valid(nil)` and `json.Valid([]byte{})` both return `false`. Useful: a single `if !json.Valid(frame)` check covers nil, empty, and malformed bytes — no need for separate length guards. Source: routing-envelope `Marshal` (#1).