From 91194ce8b7f241a0be8d7d9c8c9c66f9f5661775 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Mon, 11 May 2026 09:46:21 +0300 Subject: [PATCH 1/3] spec: cap phones per server-id (#30) --- docs/specs/architecture/30-phone-cap.md | 489 ++++++++++++++++++++++++ 1 file changed, 489 insertions(+) create mode 100644 docs/specs/architecture/30-phone-cap.md diff --git a/docs/specs/architecture/30-phone-cap.md b/docs/specs/architecture/30-phone-cap.md new file mode 100644 index 0000000..76c425a --- /dev/null +++ b/docs/specs/architecture/30-phone-cap.md @@ -0,0 +1,489 @@ +# Spec — relay: cap phones per server-id (#30) + +## Files to read first + +- `internal/relay/registry.go` — full file (≈270 lines, dense). The sentinel block (lines 13-21) is the pattern this ticket extends; `RegisterPhone` (lines 186-198) is the method this ticket adds a sibling to; `Conn` interface (lines 32-47) is unchanged but worth re-reading because the new method takes the same `Conn` shape. +- `internal/relay/registry.go:186-198` — `RegisterPhone`'s current body. The cap-check must happen on the same lock-held path, BEFORE the `append`. Pattern to mirror: lock, check pre-conditions (binary present, cap), then mutate, then unlock. +- `internal/relay/client_endpoint.go:43-66` — the stillborn-WSConn 4404 branch. The new at-cap branch lives next to it, structurally identical: `errors.Is`, close the underlying `*websocket.Conn` with a 44xx code, log, return. WSConn.Close was not called — there's no goroutine holding writeMu — so the underlying-conn direct close is correct (ADR-0005). +- `internal/relay/server_endpoint.go:61-80` — sibling 4409 stillborn-WSConn branch (same shape; reference when matching the new 4429 branch's structure). +- `internal/relay/client_endpoint_test.go:156-185` — `TestClientEndpoint_NoBinary_4404` — the integration-test template for the new at-cap close-code test (dial, read, assert `websocket.CloseError` with the expected `Code` and `Reason`). 
+- `internal/relay/client_endpoint_test.go:18-50` — helpers (`startClient`, `dialWithClient`, `validClientHeaders`, `seedBinary`, `waitForPhones`). The new test reuses every one of these. +- `internal/relay/registry_test.go:94-109` — `TestRegisterPhone_RequiresBinary` — the unit-test template for the new cap tests (claim a binary, then exercise the registration path). +- `cmd/pyrycode-relay/main.go:45-52` — wiring site. The literal `16` lands at line 51 as the third arg to `ClientHandler`. The pattern is already in place for ServerHandler's `30*time.Second` on line 50. +- `docs/PROJECT-MEMORY.md` lines 41-43 — three patterns this ticket inherits: (a) "Application WS close codes are emitted on the underlying `*websocket.Conn`" (stillborn-conn variant), (b) "Policy values live at the wiring site", (c) "Sentinel errors, branched via `errors.Is`". +- `docs/PROJECT-MEMORY.md` line 40 — "Validate adversarial input *before* allocating WS state." Not directly applicable (the cap check is per-server-id, post-upgrade), but worth noting that the cap-rejection happens AFTER `websocket.Accept` because the cap is per-server-id state, not header-shape state. +- `docs/specs/architecture/21-server-endpoint-grace-release.md` § Design → "Rationale for constructor parameter" — the precedent this ticket mirrors for threading policy through the handler factory. +- `docs/threat-model.md` lines 32-40 — DoS resistance, the threat-model entry this ticket partially addresses (the per-server-id leg of the three-ticket family). +- `pyrycode/pyrycode/docs/protocol-mobile.md:544-552` — the protocol spec's WS close-code table. **4429 is not currently listed.** See Open Questions §1 — this ticket's close-code choice depends on a coordinated spec-side addition. + +## Context + +The relay's `phones[serverID]` slice grows without bound. 
`Registry.RegisterPhone` (`internal/relay/registry.go:190-198`) appends without checking a cap, so a misbehaving binary — or a leaked binary token shared across many devices — can grow per-server memory linearly with attacker effort. The cost compounds at the data path: `StartBinaryForwarder` (#26) walks the full slice for every binary→phone frame, so unbounded phone count means unbounded per-frame fanout cost on the same server-id. + +`docs/threat-model.md` § DoS resistance flags this as one of three deferred-from-v1 hardenings. The other two are explicitly out of scope here and tracked separately: + +- `SetReadLimit` on inbound WS frames — caps **per-frame** byte cost. +- Per-IP rate limit on `/v1/server` + `/v1/client` upgrades (#34) — caps **upgrade-attempt** cost. + +This ticket caps the **per-server-id phone-count** cost only. + +The constraint shape: the cap check and the slice append must be atomic under one lock — anything else admits a TOCTOU race where two concurrent registrations both observe `len < max` and both append, ending at `len > max`. The race-shape mirrors `ClaimServer`'s first-claim-wins. + +## Design + +### Files touched + +Modified: + +- `internal/relay/registry.go` — add `ErrPhonesAtCap`; add `RegisterPhoneCapped`; refactor `RegisterPhone` body to delegate to it. +- `internal/relay/registry_test.go` — three new unit tests for the cap behaviour. +- `internal/relay/client_endpoint.go` — add `maxPhones int` parameter to `ClientHandler`; switch the registration call from `RegisterPhone` → `RegisterPhoneCapped`; add the at-cap branch (close 4429, log `phone_register_at_cap`). +- `internal/relay/client_endpoint_test.go` — add `maxPhones int` to the 3 `ClientHandler(...)` call sites (`startClient` helper at line 25, inline at line 114, plus any new helper for the cap-integration test). Add one integration test for the 4429 close. +- `cmd/pyrycode-relay/main.go` — pass `16` as the third arg to `relay.ClientHandler`. + +No new files. 
No new packages. No new exported types.

### Why dual-method `RegisterPhone` + `RegisterPhoneCapped` (and not a signature change)

The naive design is `RegisterPhone(serverID, conn, max int) error`. That changes one method signature with 23 existing call sites across `registry_test.go`, `forward_test.go`, `healthz_test.go`, and `client_endpoint.go`. Even with mechanical `, 0)` appends, the architect's >10-call-site red line (`agents/architect/CLAUDE.md` § "Edit fan-out check") trips — and the precedent (Pyrycode #75: 26 mechanical `, nil` appends → 61 turns / `max_turns`) is exactly this shape.

The Strangler Fig alternative — three children (introduce new method, migrate prod, migrate tests + delete old) — is also overkill for production code that's ~30 lines.

The chosen design adds a sibling method:

```go
// RegisterPhoneCapped registers conn against serverID, returning
// ErrPhonesAtCap when len(phones[serverID]) already equals max. The cap
// check and the slice append run under one write lock, so two concurrent
// callers at max-1 cannot both succeed (race shape mirrors ClaimServer's
// first-claim-wins; see Registry.ClaimServer).
//
// max <= 0 disables the cap and the method behaves as RegisterPhone. This
// is what RegisterPhone uses internally to preserve its no-cap contract
// without duplicating logic.
//
// Returns ErrNoServer when no binary holds serverID (same precondition as
// RegisterPhone). ErrNoServer takes precedence over ErrPhonesAtCap when
// both apply — the no-server check runs first, so a missing binary always
// surfaces as ErrNoServer regardless of max; the check order is the
// documented contract, not an implementation accident.
//
// The mapping ErrPhonesAtCap → WS close code 4429 is informational; the
// registry does not interpret close codes. /v1/client (#30) emits 4429
// in the stillborn-WSConn window.
+func (r *Registry) RegisterPhoneCapped(serverID string, conn Conn, max int) error { + r.mu.Lock() + defer r.mu.Unlock() + if _, ok := r.binaries[serverID]; !ok { + return ErrNoServer + } + if max > 0 && len(r.phones[serverID]) >= max { + return ErrPhonesAtCap + } + r.phones[serverID] = append(r.phones[serverID], conn) + return nil +} + +func (r *Registry) RegisterPhone(serverID string, conn Conn) error { + return r.RegisterPhoneCapped(serverID, conn, 0) +} +``` + +Tradeoff: a second method on the registry's small API surface. Justification: cap-awareness is a permanent semantic distinction (the test fixture path genuinely doesn't want a cap; the prod path always wants one). The two are sibling APIs, not legacy + new — they are intended to coexist. `RegisterPhone` is the "no cap" convenience; `RegisterPhoneCapped` is the cap-aware variant. + +The alternative considered and rejected: a `Registry.SetMaxPhones(n int)` setter with a `Registry.maxPhones` field. Cleaner external API (one method, no `max` parameter on `RegisterPhone`), but a setter on a struct that is otherwise immutable post-construction violates the "passive in-memory store" framing in `docs/PROJECT-MEMORY.md` and creates a two-step-init bug surface ("forgot the setter"). Rejected. + +### Sentinel error + +Added alongside the existing two in `internal/relay/registry.go`: + +```go +// ErrPhonesAtCap is returned by RegisterPhoneCapped when len(phones[serverID]) +// already equals the max-phones policy value passed by the caller. The error +// is returned BEFORE the slice is mutated. Maps to WS close code 4429. +ErrPhonesAtCap = errors.New("relay: phones at cap for server-id") +``` + +The error string follows the existing `relay: ...` prefix convention (`ErrServerIDConflict`: `"relay: server-id already claimed"`; `ErrNoServer`: `"relay: no binary for server-id"`). + +### `ClientHandler` signature change + +```go +// ClientHandler returns the http.Handler for /v1/client. 
The maxPhones +// parameter caps the number of phones that may register against a single +// server-id at one time; an over-cap registration is rejected with WS +// close code 4429. Production passes 16 per docs/threat-model.md § DoS +// resistance — sized for phone + tablet + several desktops + headroom. +// Tests pass 0 to disable the cap (or a small value to exercise the +// boundary). +func ClientHandler(reg *Registry, logger *slog.Logger, maxPhones int) http.Handler +``` + +The `maxPhones` parameter is fully trusted (compile-time literal at the wiring site, never crosses the network). `maxPhones <= 0` degrades to "no cap" — same contract as `RegisterPhoneCapped`'s `max <= 0`. The handler does not validate it (see Open Questions §2). + +### Cap-rejection branch in `/v1/client` + +The current 4404 branch (`internal/relay/client_endpoint.go:53-66`): + +```go +if err := reg.RegisterPhone(serverID, wsconn); err != nil { + if errors.Is(err, ErrNoServer) { + _ = c.Close(websocket.StatusCode(4404), "no server with that id") + logger.Info("phone_register_no_server", + "server_id", serverID, + "remote", remoteHost(r)) + return + } + wsconn.Close() + return +} +``` + +Becomes: + +```go +if err := reg.RegisterPhoneCapped(serverID, wsconn, maxPhones); err != nil { + if errors.Is(err, ErrNoServer) { + _ = c.Close(websocket.StatusCode(4404), "no server with that id") + logger.Info("phone_register_no_server", + "server_id", serverID, + "remote", remoteHost(r)) + return + } + if errors.Is(err, ErrPhonesAtCap) { + _ = c.Close(websocket.StatusCode(4429), "too many phones for server-id") + logger.Info("phone_register_at_cap", + "server_id", serverID, + "remote", remoteHost(r)) + return + } + wsconn.Close() + return +} +``` + +Three structural points the developer should preserve: + +1. 
**Order of checks is `ErrNoServer` first, then `ErrPhonesAtCap`.** This matches the registry's check order (no-server before cap) and means a no-binary state yields 4404, not 4429, even when the cap is 0. Symmetric with the registry; consistent error semantics for the caller. +2. **The at-cap log line does NOT include `device_name` or `conn_id`.** Registration didn't succeed, so the conn_id is local-only (the random hex was generated but the conn is going to be closed); device_name is uninteresting for a rejected registration. The field set matches the no-server line: `server_id` + `remote`. +3. **Token is still never logged.** The handler reads the token into a local string for the presence check and lets it go out of scope (`internal/relay/client_endpoint.go:30-35`). The at-cap path runs after that, so the local is already out of scope. Per `docs/PROJECT-MEMORY.md` line 45 — "Credentials the relay does not validate are presence-checked then discarded — never logged". + +The unregister-defer block (`internal/relay/client_endpoint.go:74-80`) is unchanged — like the 4404 path, the at-cap path returns before the defer is registered, so `UnregisterPhone` is never called for a registration that never landed. + +### Wiring (`cmd/pyrycode-relay/main.go`) + +One line change at line 51: + +```go +mux.Handle("/v1/client", relay.ClientHandler(reg, logger, 16)) +``` + +The literal `16` lives at the wiring site, mirroring `30*time.Second` on the line above. The value comes from PO's technical-note suggestion ("phone + tablet + several desktops + headroom"); see "Cap value rationale" below. + +### Cap value rationale + +`16` is the upper bound of "plausible legitimate device fleet for a single binary": one primary phone, one tablet, two or three desktops, plus headroom for short-lived debug clients and reconnect races where an old phone hasn't been swept yet. 
The threat-model concern starts at the *unbounded* end of the spectrum — there's no precise number that distinguishes "legitimate fleet" from "attacker squat", just a heuristic. `16` is large enough that no legitimate user trips it, small enough that an attacker who maximally exploits it incurs a 16× memory amplification per server-id, not a 1000× or 10000× one. + +If a legitimate user trips `16`, they see `phone_register_at_cap` in the relay logs and we revisit. No production deploys exist at the time of writing (#28 deploys the binary; the relay is pre-prod) so this is the no-cost moment to pick a value. + +### Concurrency model + +No new goroutines. The cap-check-and-append happens under the registry's existing `r.mu` write lock — same lock that `RegisterPhone`, `ClaimServer`, `UnregisterPhone`, `ScheduleReleaseServer`, and `handleGraceExpiry` already serialise on. No new lock ordering. The race shape ("first call wins; second observes `>= max` and rejects") is structurally identical to `ClaimServer`'s "first claim wins; second observes already-set and gets `ErrServerIDConflict`" path, which is already race-tested via `TestClaimServer_FirstWinsSecondConflicts` and the broad `TestRegistry_RaceFreedom`. + +### Error handling + +Errors from `RegisterPhoneCapped` are surfaced through the existing `errors.Is` branch pattern. The `wsconn.Close()` fallthrough at `internal/relay/client_endpoint.go:65` continues to exist — it is the catch-all for any future sentinel the registry might add. The fallthrough does not log; per the same convention as the server endpoint's defensive close (`internal/relay/server_endpoint.go:78`), the diff that introduces a new sentinel adds its own log line. + +### Testing strategy + +Three new unit tests in `internal/relay/registry_test.go` for `RegisterPhoneCapped`. One new integration test in `internal/relay/client_endpoint_test.go` for the 4429 close path. 
Two trivial mechanical edits in `client_endpoint_test.go` to thread `0` through existing `ClientHandler(...)` call sites (these don't exercise the cap). + +#### 1. `TestRegisterPhoneCapped_BoundaryAndRejection` + +Boundary: registering exactly `max` phones succeeds; the next attempt returns `ErrPhonesAtCap`. Template: `TestRegisterPhone_RequiresBinary` shape. + +```go +func TestRegisterPhoneCapped_BoundaryAndRejection(t *testing.T) { + t.Parallel() + r := NewRegistry() + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + const max = 3 + for i := 0; i < max; i++ { + p := &fakeConn{id: fmt.Sprintf("p-%d", i)} + if err := r.RegisterPhoneCapped("s1", p, max); err != nil { + t.Fatalf("RegisterPhoneCapped #%d at <=cap: got %v, want nil", i, err) + } + } + over := &fakeConn{id: "p-over"} + if err := r.RegisterPhoneCapped("s1", over, max); !errors.Is(err, ErrPhonesAtCap) { + t.Fatalf("RegisterPhoneCapped at cap: got %v, want errors.Is(_, ErrPhonesAtCap)", err) + } + // The rejected registration must NOT have appended. + if got := r.PhonesFor("s1"); len(got) != max { + t.Errorf("PhonesFor after rejection: got len=%d, want %d", len(got), max) + } +} +``` + +#### 2. `TestRegisterPhoneCapped_RecoveryAfterUnregister` + +After `UnregisterPhone` brings the count below `max`, the next `RegisterPhoneCapped` succeeds again. 
+ +```go +func TestRegisterPhoneCapped_RecoveryAfterUnregister(t *testing.T) { + t.Parallel() + r := NewRegistry() + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + const max = 2 + p1 := &fakeConn{id: "p-1"} + p2 := &fakeConn{id: "p-2"} + if err := r.RegisterPhoneCapped("s1", p1, max); err != nil { + t.Fatalf("RegisterPhoneCapped p1: %v", err) + } + if err := r.RegisterPhoneCapped("s1", p2, max); err != nil { + t.Fatalf("RegisterPhoneCapped p2: %v", err) + } + if err := r.RegisterPhoneCapped("s1", &fakeConn{id: "p-3"}, max); !errors.Is(err, ErrPhonesAtCap) { + t.Fatalf("RegisterPhoneCapped at cap: got %v, want errors.Is(_, ErrPhonesAtCap)", err) + } + r.UnregisterPhone("s1", "p-1") + p3 := &fakeConn{id: "p-3"} + if err := r.RegisterPhoneCapped("s1", p3, max); err != nil { + t.Fatalf("RegisterPhoneCapped after unregister: got %v, want nil", err) + } +} +``` + +#### 3. `TestRegisterPhoneCapped_PerServerIDIndependent` + +The cap is per-server-id, not global — a second server-id can register up to its own cap independently. + +```go +func TestRegisterPhoneCapped_PerServerIDIndependent(t *testing.T) { + t.Parallel() + r := NewRegistry() + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer s1: %v", err) + } + if err := r.ClaimServer("s2", &fakeConn{id: "b-2"}); err != nil { + t.Fatalf("ClaimServer s2: %v", err) + } + const max = 2 + // Fill s1 to cap. + for i := 0; i < max; i++ { + if err := r.RegisterPhoneCapped("s1", &fakeConn{id: fmt.Sprintf("p-s1-%d", i)}, max); err != nil { + t.Fatalf("RegisterPhoneCapped s1 #%d: %v", i, err) + } + } + if err := r.RegisterPhoneCapped("s1", &fakeConn{id: "p-s1-over"}, max); !errors.Is(err, ErrPhonesAtCap) { + t.Fatalf("RegisterPhoneCapped s1 over: got %v, want ErrPhonesAtCap", err) + } + // s2 is still empty; up-to-cap registrations on s2 must succeed. 
	for i := 0; i < max; i++ {
		if err := r.RegisterPhoneCapped("s2", &fakeConn{id: fmt.Sprintf("p-s2-%d", i)}, max); err != nil {
			t.Fatalf("RegisterPhoneCapped s2 #%d (s1 at cap): %v", i, err)
		}
	}
}
```

#### 4. Optional: `TestRegisterPhone_NoCapAfterDelegation`

The wrapper invariant — `RegisterPhone(sid, conn) == RegisterPhoneCapped(sid, conn, 0)`. Not strictly required by the AC, but a one-liner that documents the invariant in code and catches future regressions in the wrapper.

```go
func TestRegisterPhone_NoCapAfterDelegation(t *testing.T) {
	t.Parallel()
	r := NewRegistry()
	if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil {
		t.Fatalf("ClaimServer: %v", err)
	}
	// 64 registrations through RegisterPhone must all succeed — the
	// wrapper passes max=0 which disables the cap.
	for i := 0; i < 64; i++ {
		if err := r.RegisterPhone("s1", &fakeConn{id: fmt.Sprintf("p-%d", i)}); err != nil {
			t.Fatalf("RegisterPhone #%d: %v", i, err)
		}
	}
}
```

Include this test — it is cheap and pins the wrapper-delegation contract.

#### 5. `TestClientEndpoint_AtCap_4429` (integration)

Template: `TestClientEndpoint_NoBinary_4404` (`internal/relay/client_endpoint_test.go:156-185`). Spin up a `ClientHandler` with a small cap (e.g. `maxPhones = 2`), pre-register a binary, dial the cap-many phones successfully, then dial one more and assert the WS read yields a `websocket.CloseError` with `Code == 4429` and `Reason == "too many phones for server-id"`.
Add a dedicated helper to spin up the server with the cap, parallel to the existing `startClient`: + +```go +func startClientWithCap(t *testing.T, maxPhones int) (*Registry, string, func()) { + t.Helper() + reg := NewRegistry() + logger := slog.New(slog.NewTextHandler(io.Discard, nil)) + srv := httptest.NewServer(ClientHandler(reg, logger, maxPhones)) + wsURL := "ws" + strings.TrimPrefix(srv.URL, "http") + return reg, wsURL, srv.Close +} +``` + +The test: + +```go +func TestClientEndpoint_AtCap_4429(t *testing.T) { + const cap = 2 + reg, wsURL, cleanup := startClientWithCap(t, cap) + defer cleanup() + seedBinary(t, reg, "s1") + + // Dial the cap-many phones successfully. + inCap := make([]*websocket.Conn, 0, cap) + for i := 0; i < cap; i++ { + c, _, err := dialWithClient(t, wsURL, validClientHeaders("s1")) + if err != nil { + t.Fatalf("dial #%d: %v", i, err) + } + inCap = append(inCap, c) + } + waitForPhones(t, reg, "s1", cap, time.Second) + + // The (cap+1)-th must be closed with 4429. + over, _, err := dialWithClient(t, wsURL, validClientHeaders("s1")) + if err != nil { + t.Fatalf("dial over-cap: %v", err) + } + defer over.Close(websocket.StatusNormalClosure, "") + + readCtx, cancelRead := context.WithTimeout(context.Background(), 5*time.Second) + defer cancelRead() + _, _, readErr := over.Read(readCtx) + var ce websocket.CloseError + if !errors.As(readErr, &ce) { + t.Fatalf("over-cap Read err = %v (%T), want *websocket.CloseError", readErr, readErr) + } + if ce.Code != websocket.StatusCode(4429) { + t.Fatalf("over-cap close code = %d, want 4429", ce.Code) + } + if ce.Reason != "too many phones for server-id" { + t.Fatalf("over-cap close reason = %q, want %q", ce.Reason, "too many phones for server-id") + } + + // The cap-many in-cap phones are unaffected and remain registered. + if got := reg.PhonesFor("s1"); len(got) != cap { + t.Errorf("PhonesFor after over-cap rejection: got len=%d, want %d", len(got), cap) + } + + // Clean up in-cap phones. 
+ for _, c := range inCap { + _ = c.Close(websocket.StatusNormalClosure, "") + } +} +``` + +#### 6. Existing `ClientHandler(...)` call sites — mechanical edits + +Two sites in `internal/relay/client_endpoint_test.go` need `maxPhones` threaded: + +- `startClient` helper at line 25: `httptest.NewServer(ClientHandler(reg, logger))` → `httptest.NewServer(ClientHandler(reg, logger, 0))`. The existing tests (`TestClientEndpoint_ValidUpgrade_RegistersPhone`, `TestClientEndpoint_NoBinary_4404`, `TestClientEndpoint_PeerClose_UnregistersPhone`, `TestClientEndpoint_MultiplePhones_IndependentLifecycle`, `TestClientEndpoint_DeviceNameOptional_HandlerAccepts`) do not exercise the cap; passing `0` (no cap) preserves their semantics. +- Inline at line 114 inside `TestClientEndpoint_HeaderGate_400`: same `, 0)` append. + +The cap-integration test uses `startClientWithCap` (above), so it does not perturb existing tests. + +`cmd/pyrycode-relay/main.go:51` becomes `relay.ClientHandler(reg, logger, 16)` — one Edit. + +Total simultaneous-update call-site count: **3 `ClientHandler(...)` sites** (1 production + 2 test). Well below the 10 red line. + +### What this ticket deliberately does NOT do + +- **No global (across-server-ids) cap.** A different data structure and a different policy conversation. Out of scope. +- **No change to the data path.** `StartPhoneForwarder` / `StartBinaryForwarder` are untouched. The 1:N fanout cost is still proportional to slice length; this ticket bounds the slice but doesn't optimise the walk. +- **No change to existing 4404 / 4409 close-code semantics.** The new 4429 is additive. +- **No `SetReadLimit` on inbound WS frames.** Tracked separately (per-frame cost cap). +- **No per-IP rate limit on `/v1/server` + `/v1/client` upgrades.** Tracked in #34 (per-attempt cost cap). +- **No protocol-spec edit.** See Open Questions §1 — 4429 needs to be added to `protocol-mobile.md`'s WS close-code table in a coordinated spec-side change. 
Not done here because this repo doesn't own the canonical spec. + +## Open questions + +1. **WS close code 4429 is not currently listed in `pyrycode/pyrycode/docs/protocol-mobile.md:544-552`.** The existing table enumerates `1000`, `1011`, `4401`, `4404`, `4409`. The relay-side architect skill's instruction is explicit: "Do not invent message shapes; if the spec doesn't cover a case, surface that as a ticket against the spec, not as ad-hoc relay code." (`agents/architect/CLAUDE.md` § Repo Context). + + **Decision:** the spec adds 4429 with the rationale below; this ticket ships the relay-side enforcement; a paired protocol-spec PR adds 4429 to the close-code table before the relay change reaches production. The two are coordinated, not interlocked — a relay deployed with 4429 enforcement before the spec lands would just emit a 44xx code the phone client treats as a generic abort (existing close codes are already application-defined; the phone client must already be tolerant of unknown 44xx codes per RFC 6455's "no specific application-level meaning" framing). The developer of this ticket does NOT block on the spec PR; the developer of the spec PR cites this ticket. + + **Rationale for 4429:** The existing relay-emitted codes follow the convention "WS app code = HTTP status code value" (`4404` → HTTP 404 Not Found, `4409` → HTTP 409 Conflict). HTTP 429 is "Too Many Requests" — the standard rate-limit / cap-rejection status. Continuing the convention, `4429` is the unambiguous choice. PO endorsed "4429-range, treating it as the WS analogue of HTTP 429"; this spec confirms the exact value `4429`. + + **What the developer should do:** use `4429` as-is. If the spec PR proposes a different value (extremely unlikely; the convention is established), the relay change rebases trivially — one literal in the close-code line, one literal in the integration test's assertion. + +2. **Should `ClientHandler` validate `maxPhones > 0`?** No. 
Same reasoning as `ServerHandler`'s `grace` (`docs/specs/architecture/21-server-endpoint-grace-release.md` § Open Questions §2): the wiring site is the policy, and the wiring site passes `16`. `maxPhones <= 0` degrades to "no cap" — the same contract `RegisterPhoneCapped` documents. A test that wants no cap passes `0`; a production deploy that wants no cap can't get there short of a source edit, and that source edit is exactly the protocol-level signal we'd want. No panic, no clamp. + +3. **Should the at-cap log line include `device_name`?** No. Device-name is advertised by the phone via `X-Pyrycode-Device-Name`, an *optional* header (`docs/PROJECT-MEMORY.md` line 21 — "optional `x-pyrycode-device-name`"). The success log includes it (`phone_registered` log line at `internal/relay/client_endpoint.go:68-72`) because the registration succeeded and operators correlate by device. A rejected registration has nothing to correlate against — `server_id` and `remote` (the IP) are the operator-actionable fields. Symmetric with the 4404 line. + +4. **Should the at-cap log line distinguish from the 4404 path with a `close_code` field?** No new field. The log event name (`phone_register_at_cap` vs `phone_register_no_server`) is the distinguisher. Adding a `close_code` field to both would duplicate the event-name information without adding signal. Match the existing 4404 line's field set verbatim. + +## Done means + +- `internal/relay/registry.go` exports `ErrPhonesAtCap` and `RegisterPhoneCapped(serverID string, conn Conn, max int) error`. `RegisterPhone(serverID, conn) error` is unchanged in signature; its body delegates to `RegisterPhoneCapped(serverID, conn, 0)`. +- `internal/relay/registry_test.go` contains four new tests: `TestRegisterPhoneCapped_BoundaryAndRejection`, `TestRegisterPhoneCapped_RecoveryAfterUnregister`, `TestRegisterPhoneCapped_PerServerIDIndependent`, `TestRegisterPhone_NoCapAfterDelegation`. No existing tests modified. 
+- `internal/relay/client_endpoint.go`'s `ClientHandler` takes `maxPhones int` as the third parameter. The registration call is `reg.RegisterPhoneCapped(serverID, wsconn, maxPhones)`. A new `errors.Is(err, ErrPhonesAtCap)` branch closes the underlying `*websocket.Conn` with `websocket.StatusCode(4429)` and reason `"too many phones for server-id"`; it logs `phone_register_at_cap` with `server_id` and `remote`. The 4404 branch and the unregister-defer are unchanged. +- `internal/relay/client_endpoint_test.go` has the two existing `ClientHandler(...)` call sites updated to pass `0`, plus a new `startClientWithCap` helper and `TestClientEndpoint_AtCap_4429` test. +- `cmd/pyrycode-relay/main.go:51` passes `16` as the third arg. +- `make vet`, `make test` (`-race`), and `make build` all clean. +- One commit on `feature/30`: `feat(relay): cap phones per server-id (#30)`. + +--- + +## Security review + +**Verdict:** PASS + +### Threat surface for THIS ticket + +This ticket adds a per-server-id cap on phone registrations, a new sentinel, and a new 4429 close-code path on `/v1/client`. The new attack surface is whatever the cap-enforcement adds, plus whatever the new 4429 path exposes. The `maxPhones` value is passed at process startup; not adversarial input. + +### Categories walked + +- **Trust boundaries.** No new boundary. `serverID` continues to flow in via the existing `X-Pyrycode-Server` presence-check (`internal/relay/client_endpoint.go:29-35`); it is passed verbatim to `RegisterPhoneCapped`, which treats it as an opaque map key. The `maxPhones int` parameter is a compile-time literal at the wiring site (`cmd/pyrycode-relay/main.go:51`); no path injects it from the network. **Finding:** No findings. +- **Tokens, secrets, credentials.** The at-cap path runs AFTER the token presence check (lines 30-35) but BEFORE any token-bearing field could enter a log line. The token local is already out of scope by the time `RegisterPhoneCapped` returns. 
The new `phone_register_at_cap` log line includes only `server_id` and `remote` (the IP) — symmetric with the 4404 line. The reason string `"too many phones for server-id"` on the close frame contains no token-derived data. **Finding:** No findings — token-handling invariants preserved. +- **File operations.** N/A — no file I/O. +- **Subprocess / external command execution.** N/A. +- **Cryptographic primitives.** N/A — no crypto. `randHex8` for `connID` (`internal/relay/server_endpoint.go:118`) is unchanged. +- **Network & I/O.** + - **Resource exhaustion (the threat this ticket addresses).** The cap bounds `len(phones[serverID]) <= maxPhones = 16` per server-id, so a single binary's phone slice can't grow without bound, and the data-path fanout (`StartBinaryForwarder` walks the slice per frame) is bounded above. The threat from a leaked-token-fan-out attack — N devices all sharing a binary's token — is reduced from "linear in attacker effort" to "bounded by 16 per server-id". **Finding:** addresses the ticket's stated DoS-resistance goal. Per `docs/threat-model.md` § DoS resistance, this is one leg of a three-leg hardening (per-frame cap and per-IP upgrade rate-limit are the other two, deferred). + - **Cross-server-id resource exhaustion.** This ticket does NOT bound total phone count across all server-ids. An attacker can still register many distinct server-ids (each within its own cap) and grow the registry. **Finding:** **OUT OF SCOPE.** Per-IP rate limit on `/v1/client` upgrades (#34) is the per-attacker cap; total-connection cap is its own future ticket. Named explicitly in "What this ticket deliberately does NOT do". + - **Cap value choice.** `16` is the rationale-backed pick in § "Cap value rationale". It's a heuristic, not an attacker-derived bound. **Finding:** No findings — the value's selection is in scope for review; chosen value documented. 
  - **Memory amplification per close-frame.** A 4429 close emits a single small close frame (the 2-byte status code plus the short reason string). Cost per rejection is bounded. **Finding:** None.
  - **Timeouts on the WS upgrade.** Unchanged. `http.Server` has the four explicit timeouts in `cmd/pyrycode-relay/main.go`. **Finding:** None.
- **Error messages, logs, telemetry.**
  - New log event `phone_register_at_cap`. Fields: `server_id` (request-header value; already-logged elsewhere), `remote` (client IP via `remoteHost(r)`; already-logged elsewhere). No token. No device-name. No payload. **Finding:** None — log field set is symmetric with the existing 4404 line and complies with `docs/threat-model.md` § Log hygiene.
  - New error string `"relay: phones at cap for server-id"` and close reason `"too many phones for server-id"`. Neither names the cap value, neither leaks state. The cap value is policy that lives at the wiring site; emitting `16` over the wire would tell an attacker the exact cap, which is mildly useful for a probe; not emitting it costs nothing. **Finding:** None — cap value is not in the close reason.
- **Concurrency.**
  - **Cap check and append are atomic.** Both run under `r.mu.Lock()` in `RegisterPhoneCapped`. The TOCTOU race shape (two callers at `max-1` both passing the check, both appending, ending at `max+1`) is structurally impossible. **Finding:** None — atomic by construction.
  - **Lock ordering.** No new locks. `RegisterPhoneCapped` takes `r.mu` exactly as `RegisterPhone` does; no change to the lock graph. **Finding:** None.
  - **Race tests already cover the new method.** `TestRegistry_RaceFreedom` (`internal/relay/registry_test.go:457-484`) hammers `RegisterPhone` from many goroutines. The wrapper delegation means `RegisterPhoneCapped` is exercised under the same race conditions.
Adding a `RegisterPhoneCapped` direct call into that race loop is cheap and worth doing for posterity — the developer should add `_ = r.RegisterPhoneCapped(sid, &fakeConn{id: ...}, 16)` to one of the existing race-test goroutines. **Finding:** SHOULD FIX — non-blocking; ask the developer to add one call into the race test. Not a gate. + - **Goroutine lifecycle.** No new goroutines. **Finding:** None. + - **Shutdown safety.** No on-disk state. Process exit drops cap-rejected close frames mid-flight — same as today's 4404 close on process exit. **Finding:** None. +- **Threat model alignment.** `pyrycode/pyrycode/docs/protocol-mobile.md` § Security model lists the resource-exhaustion / DoS surface as a known v1-deferred area; `docs/threat-model.md` § DoS resistance is the in-relay counterpart. This ticket lands one of the three named hardenings. The protocol spec's WS close-code table (line 544-552) needs a coordinated update to add 4429 — flagged as Open Questions §1, not a security gate but a coordination requirement. **Finding:** No security findings; protocol-spec alignment tracked outside this ticket. + +### Adversarial framings considered + +- *"Attacker connects 17 phones to the same server-id."* — 16 succeed; the 17th is closed 4429. The 16 in-cap phones are unaffected. The data path's per-frame fanout is bounded at 16. This is the feature. +- *"Attacker connects to server-id A up to cap, then to B, C, D, ..."* — Per-server-id cap doesn't stop this. Total memory grows as (number of server-ids in flight) × 16. Out of scope; addressed by per-IP rate-limit (#34) and any future total-connection cap. +- *"Attacker disconnects an in-cap phone to free a slot, then immediately registers."* — The slot frees as soon as the disconnect handler runs (`reg.UnregisterPhone` in the cleanup defer). The cap is "live count", not "high-water mark", so an attacker can churn the slot. 
This is correct behaviour — a legitimate user re-pairing a device after a network blip should not be permanently locked out. No finding. +- *"Race: two registrations arrive at exactly the cap-1 point."* — Atomic under `r.mu.Lock()`. The first wins (count → max, ok), the second observes count == max, returns `ErrPhonesAtCap`. Same race shape as `ClaimServer` and covered by `TestRegistry_RaceFreedom` via the `RegisterPhone` delegation. +- *"Negative or zero `maxPhones` slips into production."* — `cmd/pyrycode-relay/main.go:51` is a literal `16`. No CLI flag; no config file; no path to inject. If `0` somehow got through, the cap would be disabled and the relay would behave as it does today (unbounded slice growth). Safe degradation, not unsafe. +- *"Attacker passes a malicious value for `serverID` to exploit the cap check."* — The cap check is `len(r.phones[serverID]) >= max`. `serverID` is a map key; Go's map indexing returns the zero value (nil slice, len 0) for an unknown key. No injection; no panic; no allocation of attacker-controlled size. +- *"Attacker triggers many 4429 closes hoping to amplify cost."* — Per-rejection cost is bounded by one map read + one close-frame write. Same shape as the existing 4404 path. No amplification. +- *"4429 not in the protocol spec — does the phone client misbehave?"* — Per RFC 6455, an application-level close code in [4000, 4999] is "no specific application-level meaning" without prior agreement. A phone client unaware of 4429 sees the close, reads the reason string, and aborts. The relay does not depend on the phone client's specific reaction. The phone client team's interpretation of 4429 is tracked in the coordinated spec PR (Open Questions §1). + +### Findings summary + +- [Trust boundaries] No findings — boundary unchanged; `maxPhones` is a compile-time literal at the wiring site. +- [Tokens] No findings — at-cap log line includes neither token nor device-name; symmetric with the 4404 line. 
+- [Network & I/O] No findings on the per-server-id cap itself; cross-server-id total cap is OUT OF SCOPE (deferred to #34 and a future total-cap ticket, named in "What this ticket deliberately does NOT do"). +- [Logs] No findings — `phone_register_at_cap` field set is `server_id` + `remote`, identical to the 4404 line. +- [Concurrency] SHOULD FIX (non-gating) — the developer should add one `RegisterPhoneCapped` call into `TestRegistry_RaceFreedom`'s inner loop for posterity, even though the wrapper-delegation already exercises the same path. +- [Threat model alignment] OUT OF SCOPE for the protocol-spec edit — coordinated outside this ticket per Open Questions §1. + +**Reviewer:** architect (self-review per `architect/security-review.md`) +**Date:** 2026-05-11 From aa414d2b9474e256ca141e209cb2a4d959ef1bc0 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Mon, 11 May 2026 09:49:50 +0300 Subject: [PATCH 2/3] feat(relay): cap phones per server-id (#30) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bound the 1:N phone slice per server-id so a misbehaving binary — or one whose token has leaked to many phones — cannot grow phones[serverID] without limit and inflate per-frame fanout cost. - Add ErrPhonesAtCap sentinel and RegisterPhoneCapped(serverID, conn, max) to Registry. RegisterPhone now delegates with max=0 (no cap). The cap check and the slice append run under one write lock; race shape mirrors ClaimServer's first-claim-wins. - ClientHandler gains a maxPhones parameter. Over-cap registrations are rejected with WS close code 4429 and reason "too many phones for server-id" on the underlying *websocket.Conn (stillborn-WSConn pattern, symmetric with the 4404 branch). Log event phone_register_at_cap carries server_id + remote; no token, no device-name. - Wiring: cmd/pyrycode-relay/main.go passes 16 (phone + tablet + several desktops + headroom) at construction; tests pass 0 to disable the cap. 
- Tests: four new registry unit tests (boundary, recovery, per-server-id independence, wrapper-delegation invariant) and one integration test asserting 4429 reaches the wire. TestRegistry_RaceFreedom now also exercises RegisterPhoneCapped under the race detector. Co-Authored-By: Claude Opus 4.7 --- cmd/pyrycode-relay/main.go | 4 +- internal/relay/client_endpoint.go | 18 ++++- internal/relay/client_endpoint_test.go | 61 ++++++++++++++++- internal/relay/registry.go | 30 +++++++++ internal/relay/registry_test.go | 91 ++++++++++++++++++++++++++ 5 files changed, 199 insertions(+), 5 deletions(-) diff --git a/cmd/pyrycode-relay/main.go b/cmd/pyrycode-relay/main.go index b4a3109..54cadf9 100644 --- a/cmd/pyrycode-relay/main.go +++ b/cmd/pyrycode-relay/main.go @@ -54,7 +54,9 @@ func main() { mux := http.NewServeMux() mux.Handle("/healthz", relay.NewHealthzHandler(reg, Version, startedAt)) mux.Handle("/v1/server", relay.ServerHandler(reg, logger, 30*time.Second, maxFrameBytes)) - mux.Handle("/v1/client", relay.ClientHandler(reg, logger, maxFrameBytes)) + // maxPhones=16 caps phones per server-id; over-cap registrations are + // rejected with WS close 4429. Per #30 architect spec. + mux.Handle("/v1/client", relay.ClientHandler(reg, logger, maxFrameBytes, 16)) if *insecureListen != "" { logger.Info("starting", "version", Version, "mode", "insecure", "listen", *insecureListen) diff --git a/internal/relay/client_endpoint.go b/internal/relay/client_endpoint.go index 8273ff3..7659e4c 100644 --- a/internal/relay/client_endpoint.go +++ b/internal/relay/client_endpoint.go @@ -5,6 +5,8 @@ package relay // // - 1000 (StatusNormalClosure) clean close on shutdown / unregister. // - 4404 (no server with that id) no binary currently holds the slot. +// - 4429 (too many phones for server-id) +// phones-per-server-id cap exceeded. // // The relay treats x-pyrycode-token as opaque — its presence is required, // its value is never parsed, compared, or logged. 
Token verification is the @@ -27,7 +29,12 @@ import ( // // maxFrameBytes is the per-frame read cap threaded into NewWSConn; see // docs/specs/architecture/29-wsconn-read-limit.md for the derivation. -func ClientHandler(reg *Registry, logger *slog.Logger, maxFrameBytes int64) http.Handler { +// +// maxPhones caps the number of phones that may register against a single +// server-id at one time; an over-cap registration is rejected with WS close +// code 4429. maxPhones <= 0 disables the cap (used by tests that do not +// exercise the boundary). +func ClientHandler(reg *Registry, logger *slog.Logger, maxFrameBytes int64, maxPhones int) http.Handler { return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { serverID := r.Header.Get("X-Pyrycode-Server") token := r.Header.Get("X-Pyrycode-Token") @@ -53,7 +60,7 @@ func ClientHandler(reg *Registry, logger *slog.Logger, maxFrameBytes int64) http connID := "client-" + serverID + "-" + randHex8() wsconn := NewWSConn(c, connID, maxFrameBytes) - if err := reg.RegisterPhone(serverID, wsconn); err != nil { + if err := reg.RegisterPhoneCapped(serverID, wsconn, maxPhones); err != nil { if errors.Is(err, ErrNoServer) { // Stillborn WSConn — close directly on the underlying // *websocket.Conn so the 4404 application code reaches the @@ -64,6 +71,13 @@ func ClientHandler(reg *Registry, logger *slog.Logger, maxFrameBytes int64) http "remote", remoteHost(r)) return } + if errors.Is(err, ErrPhonesAtCap) { + _ = c.Close(websocket.StatusCode(4429), "too many phones for server-id") + logger.Info("phone_register_at_cap", + "server_id", serverID, + "remote", remoteHost(r)) + return + } wsconn.Close() return } diff --git a/internal/relay/client_endpoint_test.go b/internal/relay/client_endpoint_test.go index 2327a4d..e94a15f 100644 --- a/internal/relay/client_endpoint_test.go +++ b/internal/relay/client_endpoint_test.go @@ -22,7 +22,18 @@ func startClient(t *testing.T) (*Registry, string, func()) { t.Helper() reg := 
NewRegistry() logger := slog.New(slog.NewTextHandler(io.Discard, nil)) - srv := httptest.NewServer(ClientHandler(reg, logger, 256*1024)) + srv := httptest.NewServer(ClientHandler(reg, logger, 256*1024, 0)) + wsURL := "ws" + strings.TrimPrefix(srv.URL, "http") + return reg, wsURL, srv.Close +} + +// startClientWithCap is like startClient but threads maxPhones through to +// ClientHandler. Used by the 4429 cap-rejection integration test. +func startClientWithCap(t *testing.T, maxPhones int) (*Registry, string, func()) { + t.Helper() + reg := NewRegistry() + logger := slog.New(slog.NewTextHandler(io.Discard, nil)) + srv := httptest.NewServer(ClientHandler(reg, logger, 256*1024, maxPhones)) wsURL := "ws" + strings.TrimPrefix(srv.URL, "http") return reg, wsURL, srv.Close } @@ -111,7 +122,7 @@ func TestClientEndpoint_HeaderGate_400(t *testing.T) { t.Run(tc.name, func(t *testing.T) { reg := NewRegistry() logger := slog.New(slog.NewTextHandler(io.Discard, nil)) - srv := httptest.NewServer(ClientHandler(reg, logger, 256*1024)) + srv := httptest.NewServer(ClientHandler(reg, logger, 256*1024, 0)) defer srv.Close() req, err := http.NewRequest(http.MethodGet, srv.URL+"/", nil) @@ -290,3 +301,49 @@ func TestClientEndpoint_DeviceNameOptional_HandlerAccepts(t *testing.T) { _ = c2.Close(websocket.StatusNormalClosure, "") waitForPhones(t, reg, "s1", 0, 2*time.Second) } + +func TestClientEndpoint_AtCap_4429(t *testing.T) { + const cap = 2 + reg, wsURL, cleanup := startClientWithCap(t, cap) + defer cleanup() + seedBinary(t, reg, "s1") + + inCap := make([]*websocket.Conn, 0, cap) + for i := 0; i < cap; i++ { + c, _, err := dialWithClient(t, wsURL, validClientHeaders("s1")) + if err != nil { + t.Fatalf("dial #%d: %v", i, err) + } + inCap = append(inCap, c) + } + waitForPhones(t, reg, "s1", cap, time.Second) + + over, _, err := dialWithClient(t, wsURL, validClientHeaders("s1")) + if err != nil { + t.Fatalf("dial over-cap: %v", err) + } + defer over.Close(websocket.StatusNormalClosure, 
"") + + readCtx, cancelRead := context.WithTimeout(context.Background(), 5*time.Second) + defer cancelRead() + _, _, readErr := over.Read(readCtx) + var ce websocket.CloseError + if !errors.As(readErr, &ce) { + t.Fatalf("over-cap Read err = %v (%T), want *websocket.CloseError", readErr, readErr) + } + if ce.Code != websocket.StatusCode(4429) { + t.Fatalf("over-cap close code = %d, want 4429", ce.Code) + } + if ce.Reason != "too many phones for server-id" { + t.Fatalf("over-cap close reason = %q, want %q", ce.Reason, "too many phones for server-id") + } + + if got := reg.PhonesFor("s1"); len(got) != cap { + t.Errorf("PhonesFor after over-cap rejection: got len=%d, want %d", len(got), cap) + } + + for _, c := range inCap { + _ = c.Close(websocket.StatusNormalClosure, "") + } + waitForPhones(t, reg, "s1", 0, 2*time.Second) +} diff --git a/internal/relay/registry.go b/internal/relay/registry.go index e65254b..fba7abb 100644 --- a/internal/relay/registry.go +++ b/internal/relay/registry.go @@ -18,6 +18,12 @@ var ( // ErrNoServer is returned by RegisterPhone when no binary holds the // requested serverID. Maps to WS close code 4404. ErrNoServer = errors.New("relay: no binary for server-id") + + // ErrPhonesAtCap is returned by RegisterPhoneCapped when + // len(phones[serverID]) already equals the max-phones policy value + // passed by the caller. The error is returned BEFORE the slice is + // mutated. Maps to WS close code 4429. + ErrPhonesAtCap = errors.New("relay: phones at cap for server-id") ) // Conn is the registry's view of a WebSocket connection. Real implementations @@ -187,12 +193,36 @@ func (r *Registry) handleGraceExpiry(serverID string, self *graceEntry) { // ErrNoServer if no binary currently holds serverID. The registry does not // deduplicate by ConnID — the caller is responsible for not registering the // same phone twice. +// +// Equivalent to RegisterPhoneCapped(serverID, conn, 0): no cap. 
The dedicated +// no-cap entry point exists for test fixtures and callers that explicitly do +// not want the cap-aware contract. func (r *Registry) RegisterPhone(serverID string, conn Conn) error { + return r.RegisterPhoneCapped(serverID, conn, 0) +} + +// RegisterPhoneCapped appends conn to the phones slice for serverID, +// returning ErrPhonesAtCap when len(phones[serverID]) already equals max. +// The cap check and the slice append run under one write lock, so two +// concurrent callers at max-1 cannot both succeed (race shape mirrors +// ClaimServer's first-claim-wins). +// +// max <= 0 disables the cap and the method behaves as the legacy +// RegisterPhone. ErrNoServer takes precedence over ErrPhonesAtCap when both +// would apply — the no-server check runs first. +// +// The mapping ErrPhonesAtCap → WS close code 4429 is informational; the +// registry does not interpret close codes. /v1/client emits 4429 in the +// stillborn-WSConn window. +func (r *Registry) RegisterPhoneCapped(serverID string, conn Conn, max int) error { r.mu.Lock() defer r.mu.Unlock() if _, ok := r.binaries[serverID]; !ok { return ErrNoServer } + if max > 0 && len(r.phones[serverID]) >= max { + return ErrPhonesAtCap + } r.phones[serverID] = append(r.phones[serverID], conn) return nil } diff --git a/internal/relay/registry_test.go b/internal/relay/registry_test.go index 042c9f5..54e6202 100644 --- a/internal/relay/registry_test.go +++ b/internal/relay/registry_test.go @@ -108,6 +108,95 @@ func TestRegisterPhone_RequiresBinary(t *testing.T) { } } +func TestRegisterPhoneCapped_BoundaryAndRejection(t *testing.T) { + t.Parallel() + r := NewRegistry() + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + const max = 3 + for i := 0; i < max; i++ { + p := &fakeConn{id: fmt.Sprintf("p-%d", i)} + if err := r.RegisterPhoneCapped("s1", p, max); err != nil { + t.Fatalf("RegisterPhoneCapped #%d at <=cap: got %v, want nil", i, err) + } + } + over 
:= &fakeConn{id: "p-over"} + if err := r.RegisterPhoneCapped("s1", over, max); !errors.Is(err, ErrPhonesAtCap) { + t.Fatalf("RegisterPhoneCapped at cap: got %v, want errors.Is(_, ErrPhonesAtCap)", err) + } + if got := r.PhonesFor("s1"); len(got) != max { + t.Errorf("PhonesFor after rejection: got len=%d, want %d", len(got), max) + } +} + +func TestRegisterPhoneCapped_RecoveryAfterUnregister(t *testing.T) { + t.Parallel() + r := NewRegistry() + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + const max = 2 + p1 := &fakeConn{id: "p-1"} + p2 := &fakeConn{id: "p-2"} + if err := r.RegisterPhoneCapped("s1", p1, max); err != nil { + t.Fatalf("RegisterPhoneCapped p1: %v", err) + } + if err := r.RegisterPhoneCapped("s1", p2, max); err != nil { + t.Fatalf("RegisterPhoneCapped p2: %v", err) + } + if err := r.RegisterPhoneCapped("s1", &fakeConn{id: "p-3"}, max); !errors.Is(err, ErrPhonesAtCap) { + t.Fatalf("RegisterPhoneCapped at cap: got %v, want errors.Is(_, ErrPhonesAtCap)", err) + } + r.UnregisterPhone("s1", "p-1") + p3 := &fakeConn{id: "p-3"} + if err := r.RegisterPhoneCapped("s1", p3, max); err != nil { + t.Fatalf("RegisterPhoneCapped after unregister: got %v, want nil", err) + } +} + +func TestRegisterPhoneCapped_PerServerIDIndependent(t *testing.T) { + t.Parallel() + r := NewRegistry() + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer s1: %v", err) + } + if err := r.ClaimServer("s2", &fakeConn{id: "b-2"}); err != nil { + t.Fatalf("ClaimServer s2: %v", err) + } + const max = 2 + for i := 0; i < max; i++ { + if err := r.RegisterPhoneCapped("s1", &fakeConn{id: fmt.Sprintf("p-s1-%d", i)}, max); err != nil { + t.Fatalf("RegisterPhoneCapped s1 #%d: %v", i, err) + } + } + if err := r.RegisterPhoneCapped("s1", &fakeConn{id: "p-s1-over"}, max); !errors.Is(err, ErrPhonesAtCap) { + t.Fatalf("RegisterPhoneCapped s1 over: got %v, want ErrPhonesAtCap", err) + } + for i := 0; i < max; 
i++ { + if err := r.RegisterPhoneCapped("s2", &fakeConn{id: fmt.Sprintf("p-s2-%d", i)}, max); err != nil { + t.Fatalf("RegisterPhoneCapped s2 #%d (s1 at cap): %v", i, err) + } + } +} + +// TestRegisterPhone_NoCapAfterDelegation pins the wrapper-delegation +// contract: RegisterPhone passes max=0 to RegisterPhoneCapped, which +// disables the cap. A regression that hard-codes a cap into the wrapper +// would trip this test. +func TestRegisterPhone_NoCapAfterDelegation(t *testing.T) { + t.Parallel() + r := NewRegistry() + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + for i := 0; i < 64; i++ { + if err := r.RegisterPhone("s1", &fakeConn{id: fmt.Sprintf("p-%d", i)}); err != nil { + t.Fatalf("RegisterPhone #%d: %v", i, err) + } + } +} + func TestUnregisterPhone_RemovesByConnID(t *testing.T) { t.Parallel() r := NewRegistry() @@ -473,8 +562,10 @@ func TestRegistry_RaceFreedom(t *testing.T) { _ = r.ClaimServer(sid, bin) _, _ = r.BinaryFor(sid) _ = r.RegisterPhone(sid, &fakeConn{id: fmt.Sprintf("p-%d-%d", g, i)}) + _ = r.RegisterPhoneCapped(sid, &fakeConn{id: fmt.Sprintf("pc-%d-%d", g, i)}, 16) _ = r.PhonesFor(sid) r.UnregisterPhone(sid, fmt.Sprintf("p-%d-%d", g, i)) + r.UnregisterPhone(sid, fmt.Sprintf("pc-%d-%d", g, i)) _, _ = r.Counts() _ = r.ReleaseServer(sid) } From d8c9ef5487377c2e2e163017f25ff0676d86f0b1 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Mon, 11 May 2026 09:58:18 +0300 Subject: [PATCH 3/3] docs: cap phones per server-id (#30) Per-ticket note in codebase/30.md plus updates to the client-endpoint and connection-registry feature docs and the knowledge index for the new ErrPhonesAtCap sentinel, RegisterPhoneCapped, the 4429 close code, and the phone_register_at_cap log event. 
Co-Authored-By: Claude Opus 4.7 --- docs/knowledge/INDEX.md | 4 +- docs/knowledge/codebase/30.md | 39 +++++++++++++++++++ docs/knowledge/features/client-endpoint.md | 31 +++++++++------ .../knowledge/features/connection-registry.md | 11 ++++-- 4 files changed, 69 insertions(+), 16 deletions(-) create mode 100644 docs/knowledge/codebase/30.md diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index b5daf33..e0b90e9 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -11,11 +11,11 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o - [Binary-side frame forwarder](features/binary-forwarder.md) — per-binary read pump: unwraps each inbound routing envelope, linear-scans `PhonesFor(serverID)` for `env.ConnID`, writes `env.Frame` verbatim to that phone; opaque inner bytes; synchronous (handler discards the return); diverges from #25 in error policy — unknown `conn_id`, malformed envelope, phone `Send` error all log+continue (a single bad frame never tears down the binary); replaced `/v1/server`'s `CloseRead` placeholder (#26). - [WebSocket heartbeat](features/heartbeat.md) — per-conn goroutine on both endpoints sends RFC 6455 ping every 30s; closes with `1011 "heartbeat timeout"` if no pong within 30s. Detects half-open TCP within 60s; ctx-cancel exit path leaves close to the handler defer (#7). - [Phone-side frame forwarder](features/phone-forwarder.md) — per-phone read pump: wraps each inbound phone frame in the routing envelope keyed by the phone's `conn_id` and `Send`s it to the binary holding `serverID`; opaque inner bytes; synchronous (handler discards the return); replaced `/v1/client`'s `CloseRead` placeholder; added `WSConn.Read` (single-caller) (#25). 
-- [`/v1/client` WS upgrade](features/client-endpoint.md) — phone-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Token` / `User-Agent` pre-upgrade (token presence-only, never parsed/logged); registers phone on the binary's slot, emits `4404` if no binary holds the id, hands the conn to `StartPhoneForwarder` for the data path (#5, #25). +- [`/v1/client` WS upgrade](features/client-endpoint.md) — phone-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Token` / `User-Agent` pre-upgrade (token presence-only, never parsed/logged); registers phone on the binary's slot, emits `4404` if no binary holds the id, `4429` if the per-server-id phone cap (16) is reached (#30), hands the conn to `StartPhoneForwarder` for the data path (#5, #25, #30). - [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, hands the conn to `StartBinaryForwarder` for the data path (#26); on disconnect schedules a 30s grace release so a quick reconnect inherits the slot (#16, #21, #26). - [`/healthz` JSON endpoint](features/healthz.md) — unauthenticated `GET /healthz` returning `{status, version, connected_binaries, connected_phones, uptime_seconds}`; `Cache-Control: no-store`, body bounded ≈135 bytes. - [WSConn adapter](features/ws-conn-adapter.md) — wraps `nhooyr.io/websocket.Conn` to satisfy the registry's `Conn`; owns the per-conn write mutex and a `Close`-cancelled context with a 10s per-`Send` deadline. -- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`, deferred-release `ScheduleReleaseServer` with grace-window reclaim semantics. 
+- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404` / `4429`, cap-aware `RegisterPhoneCapped` (caller supplies the cap value, #30), deferred-release `ScheduleReleaseServer` with grace-window reclaim semantics. - [Autocert TLS](features/autocert-tls.md) — `--domain` mode wiring: `:443` WSS via Let's Encrypt + `:80` ACME http-01, host gates, cache-dir permission policy, TLS 1.2 floor. - [Routing envelope](features/routing-envelope.md) — typed Go wrapper (`Envelope`, `Marshal`, `Unmarshal`) for the `{conn_id, frame}` wire shape exchanged between relay and binary. diff --git a/docs/knowledge/codebase/30.md b/docs/knowledge/codebase/30.md new file mode 100644 index 0000000..35630f3 --- /dev/null +++ b/docs/knowledge/codebase/30.md @@ -0,0 +1,39 @@ +# Ticket #30 — Cap phones per server-id (`ErrPhonesAtCap` / WS `4429`) + +Bounds the per-server-id phone slice so a misbehaving binary (or one whose token has leaked to many devices) can't grow `phones[serverID]` without limit and inflate per-frame fanout cost in `StartBinaryForwarder` (#26). One of the three deferred-from-v1 DoS hardenings named in `docs/threat-model.md` § DoS resistance — the other two (`SetReadLimit` on inbound frames; per-IP upgrade rate-limit, #34) are explicitly out of scope here. + +## Implementation + +- **`internal/relay/registry.go`** — added `ErrPhonesAtCap` sentinel (string `"relay: phones at cap for server-id"`, follows the existing `relay: …` prefix) and `RegisterPhoneCapped(serverID, conn, max) error`. The cap check (`len(r.phones[serverID]) >= max`) and the append run under one `r.mu.Lock()`; race shape mirrors `ClaimServer`'s first-claim-wins. `ErrNoServer` takes precedence over `ErrPhonesAtCap` — no-binary check runs first. 
`RegisterPhone` is unchanged in signature; its body now delegates to `RegisterPhoneCapped(serverID, conn, 0)`. `max <= 0` disables the cap (the no-cap contract is preserved verbatim). +- **`internal/relay/client_endpoint.go`** — `ClientHandler` gains `maxPhones int` as the fourth parameter (after `maxFrameBytes`). The registration call switches to `RegisterPhoneCapped(serverID, wsconn, maxPhones)`. A new `errors.Is(err, ErrPhonesAtCap)` branch closes the underlying `*websocket.Conn` with `websocket.StatusCode(4429)` and reason `"too many phones for server-id"`, then logs `phone_register_at_cap` with `server_id` + `remote` (symmetric with the 4404 line — no token, no `device_name`, no `conn_id`). Branch order is `ErrNoServer` → `ErrPhonesAtCap` → defensive `wsconn.Close()` fallthrough. +- **`cmd/pyrycode-relay/main.go:51`** — wiring site passes `16` (phone + tablet + several desktops + headroom; rationale lives in the architecture spec). Mirrors the `30*time.Second` literal on the `ServerHandler` line. +- **Sibling-method choice, not signature change.** A `max int` parameter on `RegisterPhone` would have rippled into ~23 call sites across the relay test suite. Adding `RegisterPhoneCapped` as a sibling keeps the wrapper trivial and mechanically touches three `ClientHandler(...)` call sites (one prod + two test). The two methods are intended to coexist permanently — `RegisterPhone` is the no-cap convenience for test fixtures; `RegisterPhoneCapped` is the cap-aware production entry point. +- **`internal/relay/registry_test.go`** — four new unit tests: + - `TestRegisterPhoneCapped_BoundaryAndRejection` — exactly `max` succeed; the next returns `ErrPhonesAtCap`; the rejected registration does NOT append (`PhonesFor` length unchanged). + - `TestRegisterPhoneCapped_RecoveryAfterUnregister` — after `UnregisterPhone` brings the count below cap, the next `RegisterPhoneCapped` succeeds. + - `TestRegisterPhoneCapped_PerServerIDIndependent` — `s1` at cap doesn't block `s2`'s registrations.
+ - `TestRegisterPhone_NoCapAfterDelegation` — 64 `RegisterPhone` calls all succeed; pins the wrapper-delegation invariant in code. + - `TestRegistry_RaceFreedom` was extended to also drive `RegisterPhoneCapped` (security-review SHOULD-FIX). The wrapper-delegation already exercises the same path, but the explicit call documents the intent for posterity. +- **`internal/relay/client_endpoint_test.go`** — `startClient` and the inline `ClientHandler(...)` in `TestClientEndpoint_HeaderGate_400` thread `0` (no cap). New `startClientWithCap(t, maxPhones)` helper plus `TestClientEndpoint_AtCap_4429` integration test: dials cap-many phones with a real binary seeded, asserts the over-cap dial reads back a `*websocket.CloseError` with `Code == 4429` and `Reason == "too many phones for server-id"`, and confirms the in-cap phones remain registered. + +## Patterns this ticket inherits (no new patterns coined) + +- **Sentinel + `errors.Is` branch** (#3 / #20). +- **Stillborn-WSConn application close code on the underlying `*websocket.Conn`** — ADR-0005, the same window the 4404 branch uses on this endpoint and the 4409 branch uses on `/v1/server`. +- **Policy values live at the wiring site, not as package-level constants** — `16` lands at `main.go:51`, threaded as a constructor parameter. +- **Log field set excludes credentials and uncorrelatable extras** — `phone_register_at_cap` carries `server_id` + `remote` only, identical to the 4404 line. + +## Out of scope (named explicitly) + +- **No global (across-server-ids) cap.** Per-IP upgrade rate-limit (#34) and any future total-connection cap own that. +- **No change to the data path.** `StartPhoneForwarder` / `StartBinaryForwarder` are untouched; per-frame fanout cost is now bounded by `16`, but the linear walk is still linear. +- **No `SetReadLimit` on inbound WS frames.** Separate ticket (per-frame cost cap). 
+- **No protocol-spec edit.** `4429` is not yet listed in `pyrycode/pyrycode/docs/protocol-mobile.md`'s WS close-code table; a coordinated spec-side PR adds it. The relay does not block on that PR — `4429` follows the established `WS app code = HTTP status code` convention (`4404` → 404, `4409` → 409), so the value is the unambiguous continuation, and RFC 6455 lets a phone client treat unknown 44xx codes as generic aborts in the interim. + +## Cross-links + +- [Feature: Connection registry](../features/connection-registry.md) — adds `ErrPhonesAtCap` / `RegisterPhoneCapped` to the API surface and the "what the registry deliberately does NOT do" entry on capping flips to "does, via an explicit `max` parameter from the caller". +- [Feature: `/v1/client` WS upgrade](../features/client-endpoint.md) — adds the `maxPhones` parameter, the `4429` close code, and the `phone_register_at_cap` log event. +- [ADR-0005: Application WS close codes via the underlying conn](../decisions/0005-application-close-codes-via-underlying-conn.md) — the stillborn-WSConn pattern the new branch inherits. +- [Threat model § DoS resistance](../../threat-model.md) — names this as one of three deferred-from-v1 hardenings; this ticket lands the per-server-id leg. +- [Protocol spec § Phone → relay → binary](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#phone--relay--binary) — authoritative WS close-code table; `4429` addition tracked in a coordinated spec-side PR. diff --git a/docs/knowledge/features/client-endpoint.md b/docs/knowledge/features/client-endpoint.md index 50ea98b..b83117c 100644 --- a/docs/knowledge/features/client-endpoint.md +++ b/docs/knowledge/features/client-endpoint.md @@ -1,6 +1,6 @@ # `/v1/client` — phone-side WebSocket upgrade -`/v1/client` is the relay's ingress for mobile phones. 
A phone opens an outbound WSS to the relay, sends three required headers (one of which is the device token, opaque to the relay), and — if a binary currently holds the requested `serverID` — gets registered on that binary's phone slice in the connection registry. If no binary holds the slot, the WS is closed with application code `4404`. +`/v1/client` is the relay's ingress for mobile phones. A phone opens an outbound WSS to the relay, sends three required headers (one of which is the device token, opaque to the relay), and — if a binary currently holds the requested `serverID` and the per-server-id phone cap is not reached — gets registered on that binary's phone slice in the connection registry. If no binary holds the slot, the WS is closed with application code `4404`; if the slot is at cap, with `4429`. This is the public, internet-exposed endpoint. The peer is *less* trusted than `/v1/server`'s peer (which at least runs operator-issued software); anyone on the internet who learns the relay hostname can connect. Header validation runs **before** `websocket.Accept`; the token is presence-checked only and never logged. @@ -38,25 +38,28 @@ The relay does not validate the token — the binary does. The token is read int |---|---| | `1000` (`StatusNormalClosure`) | clean close on shutdown / unregister / binary-grace expiry. | | `4404` | no binary currently holds the requested `serverID`. Reason: `"no server with that id"`. | +| `4429` | per-server-id phone cap reached (#30). Reason: `"too many phones for server-id"`. | -`4404` is application-defined per RFC 6455 (4000–4999 range); mirrors the `4409` pattern on `/v1/server`. The close-reason is the protocol-spec literal — it deliberately does not echo the requested server-id back (that would invite probing). 
+`4404` / `4429` are application-defined per RFC 6455 (4000–4999 range); both mirror the `4409` pattern on `/v1/server` (the convention is `WS app code = HTTP status code value` — 4404 → 404, 4409 → 409, 4429 → 429). The close-reasons are protocol-spec literals — they deliberately do not echo the requested server-id (or the exact cap value) back, to avoid probe oracles. `4429` is pending a coordinated addition to `pyrycode/pyrycode/docs/protocol-mobile.md`'s close-code table; per RFC 6455 a phone client that does not recognise it sees a generic abort. ## API Package `internal/relay` (`client_endpoint.go`): ```go -func ClientHandler(reg *Registry, logger *slog.Logger) http.Handler +func ClientHandler(reg *Registry, logger *slog.Logger, maxPhones int) http.Handler ``` -One exported symbol. No new types, no new sentinel errors, no new package-level constants. No `grace` parameter — phone disconnect is immediate (`UnregisterPhone`); the only grace concept on this endpoint lives entirely inside the registry's `handleGraceExpiry`, which closes orphan phones when a binary's grace window expires (see § Concurrency below). +One exported symbol. No new types. `maxPhones` is the per-server-id cap on concurrent phone registrations (#30); `<= 0` disables the cap and is the value test fixtures pass to opt out. No `grace` parameter — phone disconnect is immediate (`UnregisterPhone`); the only grace concept on this endpoint lives entirely inside the registry's `handleGraceExpiry`, which closes orphan phones when a binary's grace window expires (see § Concurrency below). Wired in `cmd/pyrycode-relay/main.go` next to `/v1/server`: ```go -mux.Handle("/v1/client", relay.ClientHandler(reg, logger)) +mux.Handle("/v1/client", relay.ClientHandler(reg, logger, 16)) ``` +The literal `16` lives at the wiring site, mirroring `30*time.Second` on the `ServerHandler` line above — "policy values live at the wiring site" (#21).
`16` is sized for phone + tablet + several desktops + headroom; an attacker who maximally exploits it incurs a 16× memory amplification per server-id, not unbounded. + ## Algorithm 1. Validate the three required headers. Any missing/empty → `400`, return. @@ -64,9 +67,12 @@ mux.Handle("/v1/client", relay.ClientHandler(reg, logger)) 3. `websocket.Accept` with `OriginPatterns: []string{"*"}`. On error, the library has already written a 4xx; return silently. 4. Construct `connID = "client-" + serverID + "-" + randHex8()` (8 hex chars from `crypto/rand`). 5. Wrap with `NewWSConn(c, connID)`. -6. `reg.RegisterPhone(serverID, wsconn)`: +6. `reg.RegisterPhoneCapped(serverID, wsconn, maxPhones)`: - On `ErrNoServer` → close the underlying `*websocket.Conn` with code `4404` and reason `"no server with that id"`, log `phone_register_no_server`, return. + - On `ErrPhonesAtCap` → close the underlying `*websocket.Conn` with code `4429` and reason `"too many phones for server-id"`, log `phone_register_at_cap`, return. - On any other error → `wsconn.Close()`, return (defensive; not currently reachable). + + Branch order is `ErrNoServer` first, then `ErrPhonesAtCap` — mirrors the registry's internal check order so an empty slot yields `4404`, not `4429`, even when the cap would otherwise apply. 7. Log `phone_registered`. 8. `defer { reg.UnregisterPhone(serverID, connID); wsconn.Close(); log phone_unregistered }`. Registered **after** the successful `RegisterPhone` so a no-server path never tries to unregister a slot we never owned. 9. `_ = StartPhoneForwarder(r.Context(), reg, serverID, wsconn, logger)` — synchronous read pump that wraps inbound frames and `Send`s them to the binary holding `serverID`. Returns on phone close, ctx cancel, missing binary, or `Send` failure. The handler discards the return; the forwarder logs the cause. See [phone-forwarder.md](phone-forwarder.md). @@ -81,6 +87,7 @@ Three structured event types. 
Field set is fixed; nothing else (token, full head |---|---| | `phone_registered` | `server_id`, `conn_id`, `device_name`, `remote` | | `phone_register_no_server` | `server_id`, `remote` | +| `phone_register_at_cap` | `server_id`, `remote` | | `phone_unregistered` | `server_id`, `conn_id` | Explicitly **not logged on any path:** @@ -91,7 +98,7 @@ Explicitly **not logged on any path:** `device_name` is the user-supplied label (e.g. `"Juhana's iPhone"`); the threat model's MAY-be-logged list names it explicitly. Empty string on absence; logged literally — slog's text/JSON handlers escape control characters, so log-line forging via crafted device names is structurally blocked. `remote` is the IP host portion of `r.RemoteAddr`, no port. -`conn_id` is logged on register and unregister so operators can correlate a session across both events. `phone_register_no_server` deliberately omits `conn_id` because no `WSConn` was constructed before the close on that path. +`conn_id` is logged on register and unregister so operators can correlate a session across both events. `phone_register_no_server` and `phone_register_at_cap` deliberately omit `conn_id` and `device_name`: a rejected registration has nothing to correlate against — `server_id` and `remote` (the IP) are the operator-actionable fields. Both rejection lines carry the same field set verbatim. Header gate failures (`400`) are not logged — same hygiene rationale as `/v1/server`: avoid amplifying header-floods into log volume. @@ -137,8 +144,8 @@ The phone observes the close on its socket as `StatusNormalClosure` — by delib ## What this handler deliberately does NOT do - **No token validation.** The relay is not the trust boundary for the phone token; the binary owns it. The relay's authorization model on `/v1/client` is purely "is there a binary holding this server-id?" -- **No connection caps (per-IP or global).** Documented residual in `docs/threat-model.md` § DoS resistance. 
Same gap as `/v1/server`, named there, not widened. -- **No phone-count-per-server-id cap.** `phones[serverID]` grows under attack; the broadcast cost (when #6 lands) is the DoS shape that owns it. +- **No connection caps (per-IP or global).** Documented residual in `docs/threat-model.md` § DoS resistance. Same gap as `/v1/server`, named there, not widened. Per-IP rate-limit is tracked in #34. +- **No global (across-server-ids) phone cap.** Only the per-server-id cap (#30) is enforced here; an attacker can still register many distinct server-ids (each within its own cap) and grow the registry — owned by #34 and any future total-connection cap. - **No inner-frame parsing.** `StartPhoneForwarder` wraps each frame in the routing envelope and forwards opaque bytes; the binary owns inner-frame validation. - **No heartbeat policy in the handler.** The handler launches `go runHeartbeat(...)` after the successful register and registers `defer cancelHB()` so the goroutine exits cleanly under handler unwind (#7). The heartbeat policy itself — 30s interval, 30s pong timeout, `1011 "heartbeat timeout"` close — lives in `heartbeat.go`. See [Heartbeat feature](heartbeat.md). - **No phone-side reconnect grace.** The binary-side grace from #20 already closes orphan phones cleanly on expiry; phone-side grace is not in the protocol spec and is out of scope. @@ -152,7 +159,8 @@ The phone observes the close on its socket as `StatusNormalClosure` — by delib - **Crafted `X-Pyrycode-Device-Name` to forge log lines.** slog's text and JSON handlers quote string values and escape control characters. Log-line forging is structurally blocked. - **Slow-loris on upgrade.** Peer that completes handshake but never sends frames parks one goroutine in `WSConn.Read` indefinitely. Connection caps deferred. 
- **Race: binary disconnects between phone's `RegisterPhone` and phone observing its conn live.** `RegisterPhone` succeeds; the binary's grace timer arms in parallel; if grace expires, `handleGraceExpiry` closes this phone; the handler's defer runs cleanly. Phone observes `StatusNormalClosure` and reconnects. -- **Phone re-dials thousands of times to grow the phones slice forever.** Each successful register appends; each disconnect removes via `UnregisterPhone`. The slice does not leak across connections. Steady state is "concurrent live phones" — bounded by file descriptors / connection cap (deferred). +- **Phone re-dials thousands of times to grow the phones slice forever.** Each successful register appends; each disconnect removes via `UnregisterPhone`. The slice does not leak across connections. Concurrent live phones per server-id are bounded by `maxPhones = 16` (#30); total concurrent phones across the relay are bounded by file descriptors / per-IP rate-limit (#34, deferred). +- **Many devices share a leaked binary token to balloon `phones[serverID]`.** Past 16 concurrent registrations on the same server-id, the next attempt closes with `4429`. The in-cap 16 remain unaffected; the data-path fanout in `StartBinaryForwarder` (#26) is bounded by the same 16. Slot churn (disconnect-then-re-register) is allowed by design — a legitimate user re-pairing after a network blip must not be permanently locked out. - **Token-presence oracle via 4404.** Token-absent → 400; token-present + no binary → 4404; token-present + binary → success. The 400 vs 4404 difference is gated on token presence, but the *validity* of the token is not testable through the relay (the binary owns that). 4404 is not a presence oracle — it confirms the gate passed, which an attacker already knew because they sent a non-empty value. 
- **`crypto/rand` panic.** Same posture as `/v1/server`: panic terminates the connection, http server recovers, defer is not yet registered (panic fires inside `randHex8`, before `RegisterPhone`), registry stays clean. @@ -160,7 +168,7 @@ Verdict from the security review: **PASS**. Reuses `/v1/server`'s audited shape ## Testing -`internal/relay/client_endpoint_test.go`, `package relay`. Same harness as `server_endpoint_test.go`: `httptest.NewServer(ClientHandler(reg, logger))` with `websocket.Dial` clients, `slog.NewTextHandler(io.Discard, nil)` for logs. +`internal/relay/client_endpoint_test.go`, `package relay`. Same harness as `server_endpoint_test.go`: `httptest.NewServer(ClientHandler(reg, logger, 0))` with `websocket.Dial` clients, `slog.NewTextHandler(io.Discard, nil)` for logs. Tests that do not exercise the cap pass `0` (the no-cap contract); the cap-integration test uses a dedicated `startClientWithCap(t, maxPhones)` helper. `fakeConn` is reused from `registry_test.go` to seed a binary on tests that exercise the success path; tests for the 4404 path skip the seed. Tests (1:1 with AC bullets): - `TestClientEndpoint_ValidUpgrade_RegistersPhone` — seed binary; dial with valid headers; poll `reg.PhonesFor` until populated; assert `ConnID()` matches `"client-<serverID>-<8 hex>"`. - `TestClientEndpoint_HeaderGate_400` — table-driven over the three required headers; expect `400`; registry empty. - `TestClientEndpoint_NoBinary_4404` — no seed; dial; one `Read` returns `*websocket.CloseError` with `Code == 4404` and `Reason == "no server with that id"`; registry empty. +- `TestClientEndpoint_AtCap_4429` (#30) — `startClientWithCap(t, 2)`; seed; dial 2 phones (in-cap, succeed); dial a 3rd; one `Read` on it returns `*websocket.CloseError` with `Code == 4429` and `Reason == "too many phones for server-id"`; in-cap phones remain registered.
- `TestClientEndpoint_PeerClose_UnregistersPhone` — seed; dial; close client; poll `reg.PhonesFor` until empty; binary remains claimed. - `TestClientEndpoint_MultiplePhones_IndependentLifecycle` — seed; dial three phones; close them in non-FIFO order; assert removal order does not corrupt other entries (covers `UnregisterPhone`'s swap-with-last-then-truncate). - `TestClientEndpoint_DeviceNameOptional_HandlerAccepts` — dial without and with `X-Pyrycode-Device-Name`; both succeed. diff --git a/docs/knowledge/features/connection-registry.md b/docs/knowledge/features/connection-registry.md index a302317..20a0683 100644 --- a/docs/knowledge/features/connection-registry.md +++ b/docs/knowledge/features/connection-registry.md @@ -28,6 +28,7 @@ func (r *Registry) ClaimServer(serverID string, conn Conn) error func (r *Registry) ReleaseServer(serverID string) (released bool) func (r *Registry) ScheduleReleaseServer(serverID string, d time.Duration) func (r *Registry) RegisterPhone(serverID string, conn Conn) error +func (r *Registry) RegisterPhoneCapped(serverID string, conn Conn, max int) error func (r *Registry) UnregisterPhone(serverID string, connID string) func (r *Registry) BinaryFor(serverID string) (Conn, bool) func (r *Registry) PhonesFor(serverID string) []Conn @@ -39,10 +40,13 @@ Sentinel errors (branch with `errors.Is`): | Error | Returned by | Maps to | |---|---|---| | `ErrServerIDConflict` | `ClaimServer` when slot already held | WS close `4409` | -| `ErrNoServer` | `RegisterPhone` when no binary holds the slot | WS close `4404` | +| `ErrNoServer` | `RegisterPhone` / `RegisterPhoneCapped` when no binary holds the slot | WS close `4404` | +| `ErrPhonesAtCap` | `RegisterPhoneCapped` when `len(phones[serverID]) >= max` | WS close `4429` | The mapping to close codes is informational; the registry doesn't know about WebSockets — the upgrade handlers translate. 
+`RegisterPhone(serverID, conn)` is equivalent to `RegisterPhoneCapped(serverID, conn, 0)` — the legacy no-cap entry point, retained for test fixtures and callers that explicitly do not want the cap-aware contract. Production traffic on `/v1/client` uses `RegisterPhoneCapped` with the wiring-site `maxPhones` value (16 today, see [`/v1/client`](client-endpoint.md)). `ErrNoServer` takes precedence over `ErrPhonesAtCap`; the cap check (`max > 0 && len(phones[serverID]) >= max`) and the slice append both run under the same write lock, so two concurrent callers at `max-1` cannot both succeed (race shape mirrors `ClaimServer`'s first-claim-wins). + ## Grace-period reclaim (`ScheduleReleaseServer`, #20) `ScheduleReleaseServer(serverID, d)` arms a deferred release after duration `d`. The grace window IS the reclaim path — see [ADR-0006](../decisions/0006-grace-period-as-reclaim-path.md) for the rationale. @@ -93,7 +97,7 @@ The duration `d` is fully trusted — degenerate values (`d <= 0` fires immediat - **Does not close `Conn` handles in the immediate-release path.** `ReleaseServer` and `UnregisterPhone` only remove map entries. Connection lifetime is owned by the WS upgrade handlers. The grace-period expiry handler is the one exception — it `Close()`s phones whose binary did not reclaim within the grace window. See [ADR-0006](../decisions/0006-grace-period-as-reclaim-path.md). - **Does not deduplicate phones by `ConnID`.** Caller invariant: each `Conn` is registered once. -- **Does not bound the per-server-id phone slice.** Per-server caps belong to the WS upgrade layer (#5). +- **Does not own the per-server-id phone cap value.** `RegisterPhoneCapped` enforces a cap supplied by the caller (#30); the value itself lives at the wiring site in `cmd/pyrycode-relay/main.go`. `RegisterPhone` (no cap) remains available for fixtures that opt out. - **Does not propagate binary loss to phones via `ReleaseServer`.** When `ReleaseServer` runs, the phones slice is left in place. 
The owner of phones must `UnregisterPhone` (and choose whether to close them) when their server's binary is gone. The deferred sibling `ScheduleReleaseServer` *does* close phones — but only if the timer expires without a reclaim. This split lets the disconnect path opt into the grace window while the synchronous path (e.g. cleanup after a failed claim) stays narrow. - **Does not validate `serverID` content.** Length, charset, prefix — owned by `x-pyrycode-server` header validation in #4. The registry treats any string as an opaque map key. @@ -103,7 +107,7 @@ The registry has no networking and no direct exposure to adversarial input. Adve - **`serverID` of pathological size or content** is harmless to the registry (just a map key) but the WS upgrade layer must cap header lengths. - **`Conn.ConnID()` returning unstable values** would cause `UnregisterPhone` to remove the wrong phone (or none). Documented as a contract on the `Conn` interface; violation is a leak, not a cross-tenant data leak. -- **Repeated `RegisterPhone` for the same server-id** is unbounded memory growth, gated only by the requirement that a binary holds the slot. Per-server connection caps are #5's job. +- **Repeated `RegisterPhone` for the same server-id** is unbounded memory growth. The cap-aware `RegisterPhoneCapped` (#30) is the entry point production wiring uses; the no-cap `RegisterPhone` remains for test fixtures. Across-server-id total caps are not the registry's job (deferred to per-IP rate-limit #34 and any future global cap). - **Slow `ConnID` blocking the write lock** is the cost of the documented non-blocking contract; copying the slice and scanning outside the lock would create a TOCTOU window where a phone could be added between snapshot and removal. ## Testing @@ -115,6 +119,7 @@ Functional cases (one subtest per AC bullet): - `ClaimServer` first-wins / second-conflicts; `BinaryFor` still resolves to the first conn. 
- `ReleaseServer` reclaim — release returns `true`, second release of an unheld id returns `false`, claim again with a different conn succeeds. - `RegisterPhone` requires a binary (`ErrNoServer` until `ClaimServer`). +- `RegisterPhoneCapped` cap semantics (#30): exact-cap registrations succeed, the next returns `ErrPhonesAtCap` without mutating the slice (`TestRegisterPhoneCapped_BoundaryAndRejection`); slot frees after `UnregisterPhone` and the next registration succeeds again (`TestRegisterPhoneCapped_RecoveryAfterUnregister`); cap is per-server-id, not global (`TestRegisterPhoneCapped_PerServerIDIndependent`); `RegisterPhone` delegates to `RegisterPhoneCapped(_, _, 0)` and the no-cap contract is preserved (`TestRegisterPhone_NoCapAfterDelegation`). `TestRegistry_RaceFreedom` was extended to drive `RegisterPhoneCapped` under the race detector. - `UnregisterPhone` removes by `ConnID`, leaves siblings intact, no-op on unknown id. - `PhonesFor` snapshot isolation: mutating the returned slice doesn't affect the registry; later registrations don't appear in earlier snapshots. - `Counts` across the full lifecycle including the `ReleaseServer` orphan case (`(0, 1)` while phones survive a release).