Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 8 additions & 2 deletions cmd/pyrycode-relay/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -45,10 +45,16 @@ func main() {
startedAt := time.Now()
reg := relay.NewRegistry()

// maxFrameBytes: 256 KiB per-frame read cap. Derivation:
// docs/specs/architecture/29-wsconn-read-limit.md (≤50-message
// message_chunk envelope + routing wrapper, headroom for outliers,
// four orders of magnitude below nhooyr's 32 MiB default).
const maxFrameBytes int64 = 256 * 1024

mux := http.NewServeMux()
mux.Handle("/healthz", relay.NewHealthzHandler(reg, Version, startedAt))
mux.Handle("/v1/server", relay.ServerHandler(reg, logger, 30*time.Second))
mux.Handle("/v1/client", relay.ClientHandler(reg, logger))
mux.Handle("/v1/server", relay.ServerHandler(reg, logger, 30*time.Second, maxFrameBytes))
mux.Handle("/v1/client", relay.ClientHandler(reg, logger, maxFrameBytes))

if *insecureListen != "" {
logger.Info("starting", "version", Version, "mode", "insecure", "listen", *insecureListen)
Expand Down
1 change: 1 addition & 0 deletions docs/PROJECT-MEMORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
- **Define narrow read-side interfaces at the consumer, not on the adapter.** `phoneSource` and `binarySource` (`ConnID() + Read(ctx)`) live in `forward.go`, not on `WSConn`. Production passes `*WSConn`; tests substitute a fake without touching `WSConn`. The two interfaces are structurally identical but named distinctly so the `Start*Forwarder` signatures read self-describingly at call sites; promote to a shared interface only if a third caller emerges. Adopted in #25 and #26.
- **Per-conn goroutines exit cleanly via LIFO defers, not by closing the conn themselves.** Pattern for any goroutine launched alongside an upgraded WS conn (heartbeat, future binary-side ack pump, etc.): derive `hbCtx, cancelHB := context.WithCancel(r.Context())`, register `defer cancelHB()` *after* the release/unregister defer, then `go runX(hbCtx, wsconn, ...)`. Under LIFO unwind `cancelHB` fires first → goroutine observes `ctx.Done()` and returns without touching the conn → release defer's `wsconn.Close()` runs unopposed. The goroutine owns ONLY the failure-path close (heartbeat: `CloseWithCode(1011, "heartbeat timeout")`); cleanup-path close is the handler defer's. `closeOnce` makes the two paths idempotent against each other regardless of which fires first. Adopted in heartbeat (#7); same shape applies to any future per-conn watchdog/timer goroutine.
- **Active-conn application close codes go through `WSConn.CloseWithCode`; stillborn-conn close codes go through the underlying `*websocket.Conn`.** ADR-0005's stillborn-conn pattern (`c.Close(websocket.StatusCode(4409), reason)` directly on the underlying conn before any `Send` could have run) coexists with ADR-0007's active-conn pattern (`wsconn.CloseWithCode(1011, "heartbeat timeout")` post-claim, where `closeOnce`/`writeMu` invariants are in play). Both share the same `closeOnce` guard via `Close()` delegating to `CloseWithCode(StatusNormalClosure, "")`. The two patterns describe structurally distinct windows of a WSConn's lifecycle. Adopted: stillborn `4409`/`4404` (#16, #5); active `1011` (#7).
- **Input-bounding policy applied at the adapter constructor, not per-handler.** When a library default is dangerously generous (nhooyr's 32 MiB `SetReadLimit` default), set the policy at the single chokepoint *both* endpoints reach — the wrapping adapter's constructor — rather than at each handler entry. `NewWSConn(c, connID, maxFrameBytes)` calls `c.SetReadLimit(maxFrameBytes)` before returning the struct, which structurally discharges "before any `Read` is performed": no goroutine other than the constructor holds a reference at that point. Distinct from the "policy values live at the wiring site" pattern (which is about *literal placement* — the value still lives as a `const` in `main`); this one is about *enforcement placement* — apply at the choke point so neither handler can forget. Adopted in `WSConn` (#29). Same shape applies to any future "bound the adversarial input at the wrapping seam" policy (e.g. max-frames-per-second, if it lands).
- **Capture process-state timestamps in `main` after `flag.Parse()`, not as package-level vars.** `startedAt := time.Now()` lives inside `main` and is passed into the handler factory. A package-level `var startedAt = time.Now()` would fire at import time — before flag parsing, before `--version` early-returns — and be wrong for short-lived test binaries and any future deferred-serve setup. Adopted in #10.

## Conventions
Expand Down
42 changes: 42 additions & 0 deletions docs/knowledge/codebase/29.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
# Ticket #29 — `SetReadLimit` on `WSConn`: 256 KiB per-frame cap

Retires the "no per-message size cap on `Read`" follow-up flagged in `ws-conn-adapter.md` since #15. `nhooyr.io/websocket`'s 32 MiB default read limit was two orders of magnitude above the largest legitimate pyrycode envelope (`message_chunk` at ≤50 messages, text-only payloads — worst case comfortably under 256 KiB with the routing wrapper). With both forwarders now landed (#25, #26), the cap is applied at the single chokepoint both reach: `NewWSConn`.

## Implementation

- **`internal/relay/ws_conn.go`** — `NewWSConn` gains `maxFrameBytes int64` and calls `c.SetReadLimit(maxFrameBytes)` as its first statement, *before* the struct is returned. This discharges the "before any `Read` is performed against it" guarantee structurally: no goroutine other than the constructor holds a reference at that point. Type is `int64` because that is the exact `*websocket.Conn.SetReadLimit` signature — no conversion at the call site. Doc-comment on `NewWSConn` names the on-overlimit contract (library closes with `StatusMessageTooBig` / 1009; next `Read` surfaces a non-nil error).
- **`internal/relay/server_endpoint.go`** — `ServerHandler` gains `maxFrameBytes int64` after `grace`, threaded into `NewWSConn`. Doc-comment points to the spec for derivation.
- **`internal/relay/client_endpoint.go`** — `ClientHandler` gains `maxFrameBytes int64` after `logger`, threaded into `NewWSConn`. Doc-comment points to the spec.
- **`cmd/pyrycode-relay/main.go`** — single `const maxFrameBytes int64 = 256 * 1024` at the composition root, with an inline comment summarising the derivation and pointing to `docs/specs/architecture/29-wsconn-read-limit.md`. Same shape as the `30*time.Second` grace literal (#21): one literal, one place, no package-level constant in `internal/relay`. Mux registrations pass the const into both handlers.
- **`internal/relay/ws_conn_test.go`** — `startEcho` gains a `maxFrameBytes int64` parameter and is rebuilt to echo received frames back to the client (in addition to the existing push-onto-channel behaviour). The echo path lets the cap tests round-trip an oversize frame through the server so the *client-side* cap surfaces on the receiving `WSConn.Read`. Four existing callers updated to pass the production literal (`256 * 1024`); cap tests pass tight values (64 / 256 bytes). New tests:
- `TestWSConn_Read_FrameExceedingCap_ReturnsError` — cap=64, send 256-byte payload; assert first `Read` returns non-nil error AND a subsequent `Read` (under a tight ctx) also returns non-nil. Error type is **not** asserted on (library-dependent wrapping; the spec's "subsequent reads fail" contract is structurally distinct from the close-error type).
- `TestWSConn_Read_FrameAtCap_DeliveredIntact` — cap=256, send exactly 256 bytes; assert it arrives byte-stable on the server-channel and the echoed frame returns to the client with `err == nil`.
- **`internal/relay/heartbeat_test.go`, `internal/relay/server_endpoint_test.go`, `internal/relay/client_endpoint_test.go`** — mechanical signature updates for the new `NewWSConn` / `ServerHandler` / `ClientHandler` parameter. Production-sized cap (`256 * 1024`) at every test wiring site; no test-only flag/hook.
- **`docs/knowledge/features/ws-conn-adapter.md`** — retired the "No per-message size cap on `Read`" bullet from § *What the adapter deliberately does NOT do*; added a "Per-frame read cap" entry to § *Adversarial framing* documenting the 256 KiB policy, the wiring-site literal, the library's 1009-close behaviour, and the spec link.

## Cap derivation (256 KiB)

From `pyrycode/pyrycode/docs/protocol-mobile.md` § *Backfill semantics* (l.501–510) — `message_chunk` envelopes are bounded at ≤50 messages, v1 is text-only (l.327). A generous-but-not-absurd assistant message at ≤4 KiB UTF-8 gives 50 × 4 KiB ≈ 200 KiB. Plus envelope keys, JSON whitespace, and the binary-side routing wrapper (`conn_id` + `frame`, ~80 bytes) stays comfortably under 256 KiB. All other envelopes (`hello`, `hello_ack`, `error`, `send_message`, `ack`, `register_push_token`, `backfill_since`, `backfill_done`) are <2 KiB. 256 KiB is one order of magnitude above the worst-case legitimate envelope, four orders below the library default. Single value covers both forwarders; differentiation is deferred until binary-side framing demonstrates a different bound in practice.

## Concurrency

No new concurrency. `SetReadLimit` mutates state on `*websocket.Conn` once, on the construction goroutine, before any reference escapes. `nhooyr.io/websocket` does not document `SetReadLimit` as needing concurrency protection, and the post-construction `WSConn` API never calls it again.

## Failure semantics

When a peer sends an over-cap frame, `nhooyr.io/websocket` closes the underlying conn with `StatusMessageTooBig` (1009) and the next `WSConn.Read` returns a non-nil error. The forwarder loop sees the error and returns; the handler's existing `defer { Unregister/ScheduleRelease; Close; log }` (#16/#21/#5) runs in the same LIFO chain it always has. No new defer, no new code path, no new log call. `wsconn.Close()` in the defer is idempotent under `closeOnce` against the library's already-performed close.

## Deliberately out of scope

- **Per-IP / per-server-id rate limiting.** Connection-count caps remain separate.
- **Slow-loris hardening on the WS upgrade handshake.** `http.Server.ReadHeaderTimeout` already covers the pre-upgrade phase; post-upgrade slow-drain is bounded by the forwarder's read context.
- **Differentiating phone vs. binary caps.** Single value covers both; revisit when binary-side framing demands a different bound.
- **Operator-configurable via flag.** Promote from `const` only when a second operator needs a different value (current "literal at wiring site" pattern, #21).

## Cross-links

- [Feature: WSConn adapter](../features/ws-conn-adapter.md) — § *Adversarial framing* documents the live policy; § *What the adapter deliberately does NOT do* no longer lists this as a follow-up.
- [Spec: 29-wsconn-read-limit](../../specs/architecture/29-wsconn-read-limit.md) — architect's derivation and security review.
- [ADR-0004](../decisions/0004-ws-library-and-adapter-context-strategy.md) — adapter-owned context strategy that `SetReadLimit` slots into structurally (one more piece of policy applied at constructor time).
- [Phone-side frame forwarder](../features/phone-forwarder.md) / [Binary-side frame forwarder](../features/binary-forwarder.md) — the two callers of `WSConn.Read` whose pre-#29 residual this retires.
- [Protocol spec § Backfill semantics](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#backfill-semantics) — the upper-bound source for the 256 KiB derivation.
8 changes: 5 additions & 3 deletions docs/knowledge/features/ws-conn-adapter.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ Package `internal/relay` (`ws_conn.go`):
```go
type WSConn struct { /* *websocket.Conn + connID + writeMu + closeOnce + closeCtx */ }

func NewWSConn(c *websocket.Conn, connID string) *WSConn
func NewWSConn(c *websocket.Conn, connID string, maxFrameBytes int64) *WSConn

func (w *WSConn) ConnID() string
func (w *WSConn) Send(msg []byte) error
Expand Down Expand Up @@ -66,7 +66,6 @@ These are documented so the next contributor doesn't add defensive code that doe
- **No handshake / header validation / subprotocol selection.** Lives at the upgrade boundary (#4/#16, #5).
- **No heartbeat policy.** The adapter exposes `Ping(ctx)` as a pure forwarder; the *policy* (interval, timeout, what to do on failure) lives in `runHeartbeat` in `internal/relay/heartbeat.go` (#7). The adapter does not own the heartbeat goroutine, the ticker, or the close decision — it only provides the pinging primitive.
- **No read-side frame loop or envelope wrap/unwrap.** `Read` is a single-frame primitive; the loop and envelope wrapping live in `internal/relay/forward.go` ([phone-forwarder.md](phone-forwarder.md)).
- **No per-message size cap on `Read`.** Inherited from `*websocket.Conn`'s default (nhooyr's 32 MiB read limit). A deliberate `SetReadLimit` policy is a follow-up so it covers both forwarders.
- **No per-conn send queue / backpressure / rate limit.** `Send` writes synchronously and returns. None of those are in the registry's `Conn` contract.
- **No close-on-`Send`-error.** Caller observes the error and chooses to call `Close`.
- **No close-code semantics beyond `StatusNormalClosure`.** The adapter doesn't know why the registry asked it to close. Close-code mapping (`4401`/`4404`/`4409`) is the upgrade handler's job.
Expand All @@ -86,19 +85,22 @@ The adapter is on the hot path: every routed frame passes through `WSConn.Send`.
- **`Close` racing with mid-`Write` `Send`.** Library is documented to handle this; relying on that property is explicit. If the property regresses in a future library version, the race test surfaces a `DATA RACE`.
- **Same `*websocket.Conn` wrapped by two `WSConn` constructors.** Each gets its own `writeMu`; serial-write guarantee is broken; the wire interleaves. Caller-invariant; not enforced.
- **Adversarial `connID` bytes.** Stored and returned as-is; no interpretation. Charset/length is the conn-id-scheme ticket's job.
- **Per-frame read cap.** `NewWSConn` applies `*websocket.Conn.SetReadLimit(maxFrameBytes)` before returning. Production wires **256 KiB** as a `const` in `cmd/pyrycode-relay/main.go` (derivation: `pyrycode/pyrycode/docs/protocol-mobile.md` § *Backfill semantics* — `message_chunk` envelopes are bounded at ≤50 messages with text-only payloads, worst case comfortably under 256 KiB once routing-envelope overhead is added; see `docs/specs/architecture/29-wsconn-read-limit.md`). Setting the cap at the constructor — the single chokepoint both forwarders reach — discharges the "cover every reader" guarantee without per-handler code. On an over-cap frame the library closes with `StatusMessageTooBig` (1009); subsequent `Read` calls surface a non-nil error. The relay does not emit its own close code on this path.

The `security-sensitive` label was applied because `Send` is on every routed-frame path. Verdict from review: PASS — one new code-level slow-loris mitigation, no widening of documented threat surface beyond the supply-chain cost the threat model already names for adding any WS library.

## Testing

`internal/relay/ws_conn_test.go`, `package relay`. End-to-end against a real `*websocket.Conn` via `httptest.NewServer` — no library mocks. The `startEcho` helper spins up a server whose handler runs a tiny read-loop pushing received frames onto a buffered channel; the test gets back a connected `*WSConn` (client side) and the channel.
`internal/relay/ws_conn_test.go`, `package relay`. End-to-end against a real `*websocket.Conn` via `httptest.NewServer` — no library mocks. The `startEcho` helper spins up a server whose handler runs a tiny read-loop pushing received frames onto a buffered channel *and* echoing them back to the client; the test gets back a connected `*WSConn` (client side, capped at the caller-supplied `maxFrameBytes`) and the channel. The echo path lets cap tests round-trip a frame so the client-side cap surfaces on `Read`.

Tests (1:1 with the AC):

- `TestWSConn_ConnID_ReturnsConstructorValue` — locks the contract.
- `TestWSConn_ConcurrentSend_ProducesIntactFrames` — N=16 goroutines each `Send` a unique tagged payload; the test asserts N intact, distinct frames. Race detector under `-race` is the primary signal; interleaved frames would surface as malformed messages or corrupted writes.
- `TestWSConn_DoubleClose_DoesNotPanic`.
- `TestWSConn_SendAfterClose_ReturnsError` — non-nil error is the contract; the specific error type is library-dependent (any of `context.Canceled`, a closed-connection error, or a wrapped variant) and not asserted on.
- `TestWSConn_Read_FrameExceedingCap_ReturnsError` — round-trips an oversize frame and asserts the first AND a subsequent `Read` both return non-nil. The error type is not asserted on (library-dependent close-error wrapping).
- `TestWSConn_Read_FrameAtCap_DeliveredIntact` — round-trips a frame whose payload is exactly at the cap; asserts the server-side channel receives the bytes intact and the client-side `Read` of the echoed frame returns the same bytes with `err == nil`.

What we deliberately do not test: the library's behaviour itself (we trust `Write` to honour `ctx` and `Close` to be safe with in-flight `Write`); unit-level mocks of the library; performance.

Expand Down
4 changes: 4 additions & 0 deletions docs/lessons.md
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,10 @@ RFC 7230 §5.4 allows it. When comparing to a configured hostname, strip the por

The library's auto-pong machinery runs inline with `Read` (or `CloseRead`'s background drain). A peer that completes the handshake and then never reads — even just to discard — cannot respond to incoming pings, because the ping frame sits in the kernel's TCP buffer unobserved. This is exactly what makes RFC 6455 ping/pong a useful liveness signal: a wedged peer that has stopped processing the conn fails the heartbeat structurally, not just because it "chose not to" pong. Implication for tests: to test the *unresponsive-peer* path, the test client must NOT read; to test the *healthy-peer* path, the client must read (or `CloseRead`) so the library auto-pongs. Reverse the two and the assertions swap meanings. Source: `internal/relay/heartbeat_test.go` (#7).

## `nhooyr.io/websocket` ships with a 32 MiB default per-frame read limit — explicit cap, applied at the constructor

The library's `*websocket.Conn` accepts frames up to ~32 MiB out of the box. For a relay routing typed envelopes (worst-case `message_chunk` ≈ 200 KiB), that default lets a misbehaving peer pin two orders of magnitude more read buffer than any legitimate message needs. `SetReadLimit(n)` adjusts the cap; on an over-cap frame the library closes with `StatusMessageTooBig` (1009) and surfaces a non-nil error on the next `Read`. Apply it inside the wrapping adapter's constructor *before the struct returns* — that closes the window where a `Read` could fire against an uncapped conn without each handler having to remember. The cap value belongs at the composition root (`cmd/.../main.go` as a `const`), threaded through the handler constructor: one literal, one place, no package-level constant in the relay package. Source: `NewWSConn(c, connID, maxFrameBytes)` (#29).

## `autocert.Manager.TLSConfig()` doesn't set `MinVersion`

`gosec` G402 fires on `make lint` if you use it raw. Wrap it in a helper that pins `MinVersion = tls.VersionTLS12` (or 1.3) before handing it to `http.Server`. Centralising the override means a future bump is a one-line change. Source: `relay.TLSConfig` (#9).
Loading