Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion cmd/pyrycode-relay/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,9 @@ func main() {
mux := http.NewServeMux()
mux.Handle("/healthz", relay.NewHealthzHandler(reg, Version, startedAt))
mux.Handle("/v1/server", relay.ServerHandler(reg, logger, 30*time.Second, maxFrameBytes))
mux.Handle("/v1/client", relay.ClientHandler(reg, logger, maxFrameBytes))
// maxPhones=16 caps phones per server-id; over-cap registrations are
// rejected with WS close 4429. Per #30 architect spec.
mux.Handle("/v1/client", relay.ClientHandler(reg, logger, maxFrameBytes, 16))

if *insecureListen != "" {
logger.Info("starting", "version", Version, "mode", "insecure", "listen", *insecureListen)
Expand Down
4 changes: 2 additions & 2 deletions docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,11 +11,11 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o
- [Binary-side frame forwarder](features/binary-forwarder.md) — per-binary read pump: unwraps each inbound routing envelope, linear-scans `PhonesFor(serverID)` for `env.ConnID`, writes `env.Frame` verbatim to that phone; opaque inner bytes; synchronous (handler discards the return); diverges from #25 in error policy — unknown `conn_id`, malformed envelope, phone `Send` error all log+continue (a single bad frame never tears down the binary); replaced `/v1/server`'s `CloseRead` placeholder (#26).
- [WebSocket heartbeat](features/heartbeat.md) — per-conn goroutine on both endpoints sends RFC 6455 ping every 30s; closes with `1011 "heartbeat timeout"` if no pong within 30s. Detects half-open TCP within 60s; ctx-cancel exit path leaves close to the handler defer (#7).
- [Phone-side frame forwarder](features/phone-forwarder.md) — per-phone read pump: wraps each inbound phone frame in the routing envelope keyed by the phone's `conn_id` and `Send`s it to the binary holding `serverID`; opaque inner bytes; synchronous (handler discards the return); replaced `/v1/client`'s `CloseRead` placeholder; added `WSConn.Read` (single-caller) (#25).
- [`/v1/client` WS upgrade](features/client-endpoint.md) — phone-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Token` / `User-Agent` pre-upgrade (token presence-only, never parsed/logged); registers phone on the binary's slot, emits `4404` if no binary holds the id, hands the conn to `StartPhoneForwarder` for the data path (#5, #25).
- [`/v1/client` WS upgrade](features/client-endpoint.md) — phone-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Token` / `User-Agent` pre-upgrade (token presence-only, never parsed/logged); registers phone on the binary's slot, emits `4404` if no binary holds the id, `4429` if the per-server-id phone cap (16) is reached (#30), hands the conn to `StartPhoneForwarder` for the data path (#5, #25, #30).
- [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, hands the conn to `StartBinaryForwarder` for the data path (#26); on disconnect schedules a 30s grace release so a quick reconnect inherits the slot (#16, #21, #26).
- [`/healthz` JSON endpoint](features/healthz.md) — unauthenticated `GET /healthz` returning `{status, version, connected_binaries, connected_phones, uptime_seconds}`; `Cache-Control: no-store`, body bounded ≈135 bytes.
- [WSConn adapter](features/ws-conn-adapter.md) — wraps `nhooyr.io/websocket.Conn` to satisfy the registry's `Conn`; owns the per-conn write mutex and a `Close`-cancelled context with a 10s per-`Send` deadline.
- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`, deferred-release `ScheduleReleaseServer` with grace-window reclaim semantics.
- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404` / `4429`, cap-aware `RegisterPhoneCapped` (caller supplies the cap value, #30), deferred-release `ScheduleReleaseServer` with grace-window reclaim semantics.
- [Autocert TLS](features/autocert-tls.md) — `--domain` mode wiring: `:443` WSS via Let's Encrypt + `:80` ACME http-01, host gates, cache-dir permission policy, TLS 1.2 floor.
- [Routing envelope](features/routing-envelope.md) — typed Go wrapper (`Envelope`, `Marshal`, `Unmarshal`) for the `{conn_id, frame}` wire shape exchanged between relay and binary.

Expand Down
39 changes: 39 additions & 0 deletions docs/knowledge/codebase/30.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
# Ticket #30 — Cap phones per server-id (`ErrPhonesAtCap` / WS `4429`)

Bounds the per-server-id phone slice so a misbehaving binary (or one whose token has leaked to many devices) can't grow `phones[serverID]` without limit and inflate per-frame fanout cost in `StartBinaryForwarder` (#26). One of the three deferred-from-v1 DoS hardenings named in `docs/threat-model.md` § DoS resistance — the other two (`SetReadLimit` on inbound frames; per-IP upgrade rate-limit, #34) are explicitly out of scope here.

## Implementation

- **`internal/relay/registry.go`** — added `ErrPhonesAtCap` sentinel (string `"relay: phones at cap for server-id"`, follows the existing `relay: …` prefix) and `RegisterPhoneCapped(serverID, conn, max) error`. The cap check (`len(r.phones[serverID]) >= max`) and the append run under one `r.mu.Lock()`; race shape mirrors `ClaimServer`'s first-claim-wins. `ErrNoServer` takes precedence over `ErrPhonesAtCap` — no-binary check runs first. `RegisterPhone` is unchanged in signature; its body now delegates to `RegisterPhoneCapped(serverID, conn, 0)`. `max <= 0` disables the cap (the no-cap contract is preserved verbatim).
- **`internal/relay/client_endpoint.go`** — `ClientHandler` gains `maxPhones int` as the third parameter. The registration call switches to `RegisterPhoneCapped(serverID, wsconn, maxPhones)`. A new `errors.Is(err, ErrPhonesAtCap)` branch closes the underlying `*websocket.Conn` with `websocket.StatusCode(4429)` and reason `"too many phones for server-id"`, then logs `phone_register_at_cap` with `server_id` + `remote` (symmetric with the 4404 line — no token, no `device_name`, no `conn_id`). Branch order is `ErrNoServer` → `ErrPhonesAtCap` → defensive `wsconn.Close()` fallthrough.
- **`cmd/pyrycode-relay/main.go:51`** — wiring site passes `16` (phone + tablet + several desktops + headroom; rationale lives in the architecture spec). Mirrors the `30*time.Second` literal on the `ServerHandler` line.
- **Sibling-method choice, not signature change.** A `max int` parameter on `RegisterPhone` would have rippled into ~23 call sites across the relay test suite. Adding `RegisterPhoneCapped` as a sibling keeps the wrapper trivial and mechanically touches three `ClientHandler(...)` call sites (one prod + two test). The two methods are intended to coexist permanently — `RegisterPhone` is the no-cap convenience for test fixtures; `RegisterPhoneCapped` is the cap-aware production entry point.
- **`internal/relay/registry_test.go`** — four new unit tests:
- `TestRegisterPhoneCapped_BoundaryAndRejection` — exactly `max` succeed; the next returns `ErrPhonesAtCap`; the rejected registration does NOT append (`PhonesFor` length unchanged).
- `TestRegisterPhoneCapped_RecoveryAfterUnregister` — after `UnregisterPhone` brings the count below cap, the next `RegisterPhoneCapped` succeeds.
- `TestRegisterPhoneCapped_PerServerIDIndependent` — `s1` at cap doesn't block `s2`'s registrations.
- `TestRegisterPhone_NoCapAfterDelegation` — 64 `RegisterPhone` calls all succeed; pins the wrapper-delegation invariant in code.
- `TestRegistry_RaceFreedom` was extended to also drive `RegisterPhoneCapped` (security-review SHOULD-FIX). The wrapper-delegation already exercises the same path, but the explicit call documents the intent for posterity.
- **`internal/relay/client_endpoint_test.go`** — `startClient` and the inline `ClientHandler(...)` in `TestClientEndpoint_HeaderGate_400` thread `0` (no cap). New `startClientWithCap(t, maxPhones)` helper plus `TestClientEndpoint_AtCap_4429` integration test: dials cap-many phones with a real binary seeded, asserts the over-cap dial reads back a `*websocket.CloseError` with `Code == 4429` and `Reason == "too many phones for server-id"`, and confirms the in-cap phones remain registered.

## Patterns this ticket inherits (no new patterns coined)

- **Sentinel + `errors.Is` branch** (#3 / #20).
- **Stillborn-WSConn application close code on the underlying `*websocket.Conn`** — ADR-0005, the same window the 4404 branch uses on this endpoint and the 4409 branch uses on `/v1/server`.
- **Policy values live at the wiring site, not as package-level constants** — `16` lands at `main.go:51`, threaded as a constructor parameter.
- **Log field set excludes credentials and uncorrelatable extras** — `phone_register_at_cap` carries `server_id` + `remote` only, identical to the 4404 line.

## Out of scope (named explicitly)

- **No global (across-server-ids) cap.** Per-IP upgrade rate-limit (#34) and any future total-connection cap own that.
- **No change to the data path.** `StartPhoneForwarder` / `StartBinaryForwarder` are untouched; per-frame fanout cost is now bounded by `16`, but the linear walk is still linear.
- **No `SetReadLimit` on inbound WS frames.** Separate ticket (per-frame cost cap).
- **No protocol-spec edit.** `4429` is not yet listed in `pyrycode/pyrycode/docs/protocol-mobile.md`'s WS close-code table; a coordinated spec-side PR adds it. The relay does not block on that PR — `4429` follows the established `WS app code = HTTP status code` convention (`4404` → 404, `4409` → 409), so the value is the unambiguous continuation, and RFC 6455 lets a phone client treat unknown 44xx codes as generic aborts in the interim.

## Cross-links

- [Feature: Connection registry](../features/connection-registry.md) — adds `ErrPhonesAtCap` / `RegisterPhoneCapped` to the API surface and the "what the registry deliberately does NOT do" entry on capping flips to "does, via an explicit `max` parameter from the caller".
- [Feature: `/v1/client` WS upgrade](../features/client-endpoint.md) — adds the `maxPhones` parameter, the `4429` close code, and the `phone_register_at_cap` log event.
- [ADR-0005: Application WS close codes via the underlying conn](../decisions/0005-application-close-codes-via-underlying-conn.md) — the stillborn-WSConn pattern the new branch inherits.
- [Threat model § DoS resistance](../../threat-model.md) — names this as one of three deferred-from-v1 hardenings; this ticket lands the per-server-id leg.
- [Protocol spec § Phone → relay → binary](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#phone--relay--binary) — authoritative WS close-code table; `4429` addition tracked in a coordinated spec-side PR.
Loading