diff --git a/docs/PROJECT-MEMORY.md b/docs/PROJECT-MEMORY.md index fcbaea2..16685bb 100644 --- a/docs/PROJECT-MEMORY.md +++ b/docs/PROJECT-MEMORY.md @@ -11,6 +11,7 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex | Routing envelope wrapper type (`Envelope`, `Marshal`, `Unmarshal`, sentinel errors) | Done (#1) | `internal/relay/envelope.go` | | Connection registry (`Conn`, `Registry`; 1:1 binary, 1:N phones; sentinel errors for `4409`/`4404`; race-tested) | Done (#3) | `internal/relay/registry.go` | | TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` | +| `WSConn` adapter (`nhooyr.io/websocket.Conn` → registry `Conn`; per-conn write mutex; `Close`-cancelled context; 10s `Send` deadline) | Done (#15) | `internal/relay/ws_conn.go` | | WS upgrade on `/v1/server` and `/v1/client` | Not started | — | | Header validation (`x-pyrycode-server`, `x-pyrycode-token`) | Not started | — | | Frame forwarding using the routing envelope | Not started | — | @@ -22,11 +23,12 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex - **Sentinel errors, branched via `errors.Is`** for validation failures at protocol boundaries. New routing-layer code follows the `Err...` naming and wraps with `fmt.Errorf("…: %w", err, sentinel)` when adding context. - **Opacity by type.** Inner-frame payloads are carried as `json.RawMessage`. The relay never deserialises payloads; the type makes that hard to violate accidentally. - **Validate at the envelope boundary, not deeper.** Structural checks (presence, non-empty, JSON well-formedness) belong here; semantic checks (token validity, message kind) belong to the binary. -- **Stdlib + `golang.org/x/crypto`.** First non-stdlib dep arrived in #9 (`acme/autocert`). Keep the dependency surface deliberate; new deps need a justification. +- **Stdlib + `golang.org/x/crypto` + `nhooyr.io/websocket`.** Direct deps arrived in #9 (`acme/autocert`) and #15 (`nhooyr.io/websocket` v1.8.x). Keep the dependency surface deliberate; new deps need a justification (the ADR is the justification). - **Loud failure over silent correction.** `--domain` and `--insecure-listen` are mutually exclusive and explicit. `ErrCacheDirInsecure` refuses to start on a world/group-readable cert cache rather than re-chmoding it. `:80` returns `404` for non-challenge traffic instead of redirecting to HTTPS (ADR-0002). - **Tests live in the same package** (`package relay`, not `relay_test`) so they can `errors.Is` against unexported sentinels if needed and reach private helpers without re-exporting. - **Passive in-memory stores guard mutation under one RWMutex; reads return copies, never references.** `PhonesFor` allocates a fresh slice so callers do slow work (broadcast, `Send` over the network) without holding the registry lock. Adopted in `internal/relay/registry.go` (#3); the same shape applies to any future "shared map of conns" type. - **Interface methods called under the lock are documented as non-blocking getters.** The registry's `Conn.ConnID()` is invoked under the write lock during `UnregisterPhone`; `Send` and `Close` are never called while the lock is held. Pattern: state the contract on the interface, never call something that could block on I/O while a mutex is held. +- **Adapters bridge interface↔library API mismatches by owning policy locally.** When a library method needs a `context.Context` but the registry's `Conn` interface doesn't take one (and shouldn't — most callers don't have a context to thread), the adapter owns its own context: derived in the constructor, cancelled by `Close`, narrowed per-call with `WithTimeout` for deadline policy. Adopted in `WSConn` (#15); avoids forcing context-plumbing changes into upstream interfaces. ## Conventions diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index 3d7a5e1..26ebf65 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -4,12 +4,14 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Features +- [WSConn adapter](features/ws-conn-adapter.md) — wraps `nhooyr.io/websocket.Conn` to satisfy the registry's `Conn`; owns the per-conn write mutex and a `Close`-cancelled context with a 10s per-`Send` deadline. - [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`. - [Autocert TLS](features/autocert-tls.md) — `--domain` mode wiring: `:443` WSS via Let's Encrypt + `:80` ACME http-01, host gates, cache-dir permission policy, TLS 1.2 floor. - [Routing envelope](features/routing-envelope.md) — typed Go wrapper (`Envelope`, `Marshal`, `Unmarshal`) for the `{conn_id, frame}` wire shape exchanged between relay and binary. ## Decisions +- [ADR-0004: WS library choice and adapter context strategy](decisions/0004-ws-library-and-adapter-context-strategy.md) — `nhooyr.io/websocket` v1.8.x; adapter-owned context cancelled by `Close`, 10s per-`Send` deadline; registry's `Send` signature stays context-free. - [ADR-0003: Connection registry as a passive store](decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, narrow `ReleaseServer` so grace logic (#8) can wrap it. - [ADR-0002: Explicit-failure 404 on `:80` instead of HTTPS redirect](decisions/0002-autocert-explicit-failure-on-port-80.md) — pass `http.NotFoundHandler()` to `autocert.Manager.HTTPHandler`, not `nil`; loud failure over silent scheme-switching. - [ADR-0001: Routing envelope shape and opacity](decisions/0001-routing-envelope-shape-and-opacity.md) — `json.RawMessage` for the inner frame; sentinel errors; validate at boundary, never parse payloads. diff --git a/docs/knowledge/decisions/0004-ws-library-and-adapter-context-strategy.md b/docs/knowledge/decisions/0004-ws-library-and-adapter-context-strategy.md new file mode 100644 index 0000000..4534d0c --- /dev/null +++ b/docs/knowledge/decisions/0004-ws-library-and-adapter-context-strategy.md @@ -0,0 +1,53 @@ +# ADR-0004: WS library choice and adapter context strategy + +**Status:** Accepted (#15) +**Date:** 2026-05-08 + +## Context + +The registry's `Conn` interface from #3 is implementation-agnostic, but everything that actually moves bytes in the relay needs a concrete WebSocket library. Two interlocking choices had to be made before the upgrade handlers (#4/#16, #5) could land: + +1. **Which Go WS library?** +2. **How does the `Send([]byte) error` ↔ library `Write(ctx, …)` API mismatch get bridged?** The registry's `Conn.Send` deliberately has no `context.Context` parameter (set in ADR-0003 / #3); the library requires one for every write. + +Both were decided in #15 along with the `WSConn` adapter that embodies them. + +## Decision + +**Library:** `nhooyr.io/websocket` (v1.8.x), pulled via the vanity import path. No alternative library is wrapped or kept compatible. + +**Context strategy in `WSConn`:** the adapter owns its own context. `NewWSConn` derives `closeCtx`/`cancel` from `context.Background()`. Each `Send` uses `context.WithTimeout(closeCtx, writeTimeout)` — `writeTimeout = 10 * time.Second`. `Close` cancels `closeCtx` first, then calls the library's `Close`. + +## Rationale + +### Library: `nhooyr.io/websocket` + +- **Context-aware API on every blocking method** (`Read`, `Write`, `Close`). Fits the relay's existing `context.Context` discipline; the adapter's per-call timeout strategy is only possible because of this. +- **`Conn.Close` is documented as safe with an in-flight `Write`.** This is the property the adapter relies on for non-deadlocking close — `Close` cancels `closeCtx` without taking `writeMu`, the in-flight `Write` aborts, and the lock unwinds. +- **Maintained.** `gorilla/websocket` was unmaintained as of 2022 (last meaningful release predates Go 1.18); `gobwas/ws` is a frame-level toolkit that would force the adapter to own framing details. + +The vanity path `nhooyr.io/websocket` resolves to v1.8.x; upstream development moved to `github.com/coder/websocket` but the published line continues. Use the vanity path; revisit when v2 (or the coder fork) introduces a meaningful behavioural change. + +### Context strategy: adapter-owned, per-call deadline + +Three options were on the table: + +1. **`context.Background()` only.** A slow peer hangs `Send` indefinitely. Rejected — the threat model already treats slow-loris on the read side as a known DoS vector; the same shape applies on the write side, and a broadcast loop fanning out to many phones cannot afford one slow phone parking a goroutine forever. +2. **Caller-supplied context.** Would require changing `relay.Conn.Send`'s signature, which is fixed by ADR-0003 and consumed by callers (broadcast loops iterating a `PhonesFor` snapshot) that don't naturally have a context to thread. Rejected — would force a registry-API change to fix an adapter-local concern. +3. **Adapter-owned: `closeCtx` cancelled by `Close`, plus per-call `WithTimeout`.** Bounds individual writes (slow peer); cancels in-flight writes on close (clean shutdown); does not require a registry API change. Adopted. + +`writeTimeout = 10s` is conservative — long enough that a momentarily-laggy mobile client on a poor network is not killed mid-frame, short enough that one hung `Send` cannot park a goroutine indefinitely. One constant, one place to change. Heartbeat (#7) will give real RTT signal to recalibrate against. + +### Why `Close` does not take `writeMu` + +Locking `writeMu` inside `Close` would deadlock against a slow peer holding the mutex inside `Send`. Cancelling `closeCtx` instead aborts the in-flight `Write`, which releases the mutex on its own. This is the crux of the design and the reason the library's "Close-safe-with-in-flight-Write" property mattered when picking it. + +## Consequences + +- The relay has exactly one WS library and one adapter. Future routing-layer code (#4/#16, #5, #6, #7) hands the registry a `*WSConn`, never a hand-rolled adapter. +- The `Conn.Send` signature stays context-free. Callers (broadcast loops, the upgrade handler) pass `[]byte`, get `error`, and the adapter handles deadline policy internally. +- `writeTimeout = 10s` is the relay's first code-level slow-loris mitigation on the write side. A peer that accepts but never reads costs one goroutine for at most 10 seconds; connection-count caps remain the upgrade layer's job. +- `govulncheck` now has a wider scan surface. If it flags a v1.8.x release, bump within the line. Do not pin to a non-stable line; do not switch to `github.com/coder/websocket` opportunistically. +- The adapter relies on a documented library property (`Close` safe with in-flight `Write`). If a future library version regresses it, the race test surfaces a `DATA RACE` and the lint pass breaks; that is the early-warning signal. +- `WSConn` owns its `*websocket.Conn` after construction. Calling the library directly on the same conn is documented as a caller invariant; there is no defensive code, and there cannot be without a global registry of wrapped conns. +- `Conn.Close` returns no error. Idempotency is `sync.Once`-guarded; the underlying `conn.Close`'s error is discarded by design — the registry's contract has no return value, and there is no useful action on a "close-after-close" or "close-after-network-error" report. diff --git a/docs/knowledge/features/ws-conn-adapter.md b/docs/knowledge/features/ws-conn-adapter.md new file mode 100644 index 0000000..1ed7ecb --- /dev/null +++ b/docs/knowledge/features/ws-conn-adapter.md @@ -0,0 +1,104 @@ +# WSConn — WebSocket adapter for the registry + +`WSConn` is the single adapter that wraps `nhooyr.io/websocket.Conn` so it satisfies the registry's `Conn` interface ([connection-registry.md](connection-registry.md)). The relay's WS upgrade handlers (`/v1/server` in #4/#16, `/v1/client` in #5) hand the registry a `*WSConn` rather than each inventing its own wrapper. Frame forwarding, broadcasts, and the future heartbeat ticket all reach the wire through this type. + +The adapter is small on purpose: one file, one mutex, one one-shot guard, no goroutines, no queues. Everything else (handshake, header validation, close-code semantics, ping/pong, read-side frame loop) lives in adjacent tickets. + +## API + +Package `internal/relay` (`ws_conn.go`): + +```go +type WSConn struct { /* *websocket.Conn + connID + writeMu + closeOnce + closeCtx */ } + +func NewWSConn(c *websocket.Conn, connID string) *WSConn + +func (w *WSConn) ConnID() string +func (w *WSConn) Send(msg []byte) error +func (w *WSConn) Close() +``` + +Plus one package-level constant: + +```go +const writeTimeout = 10 * time.Second +``` + +`writeTimeout` bounds a single `Send`. A slow peer cannot stall a caller past this deadline. + +## Concurrency model + +| Method | Lock | Context | Concurrent-safe with | +|---|---|---|---| +| `ConnID` | none | none | every other method | +| `Send` | `writeMu` (held across one frame write) | `WithTimeout(closeCtx, writeTimeout)` | other `Send` (serialised), `Close` (cancellation breaks it), `ConnID` | +| `Close` | `closeOnce` (one shot) | cancels `closeCtx` | `Send` (in-flight write is cancelled), other `Close` (no-op), `ConnID` | + +One mutex, one one-shot. The lock graph is a single node. There are no callbacks, no channels, no goroutines spawned by the adapter. + +`Close` deliberately does **not** take `writeMu`. Acquiring it would deadlock against a slow peer holding the mutex inside `Send`: the whole point of cancelling `closeCtx` is to abort the in-flight `Write` so it releases the mutex on its own. `nhooyr.io/websocket.Conn.Close` is documented to be safe with an in-flight `Write` — that property is why this library was chosen. + +## Context strategy + +The non-trivial design problem is the `Send([]byte) error` ↔ `Conn.Write(ctx, ...)` API mismatch: the registry's contract has no context, but the library requires one for every write. The adapter owns its own context: + +- `NewWSConn` derives a `closeCtx`/`cancel` pair from `context.Background()`. The context's only purpose is to be cancelled by `Close`. +- Each `Send` derives a per-call `context.WithTimeout(closeCtx, writeTimeout)`, calls `Write`, cancels the timer. +- `Close` cancels `closeCtx` first, then calls the library's `Close`. After that, every queued or future `Send` finds an already-cancelled context and `Write` short-circuits before touching the socket — no partial frame goes out. + +`writeTimeout = 10s` is conservative — long enough that a momentarily-laggy mobile client on a poor network is not killed mid-frame, short enough that one hung `Send` cannot park a goroutine indefinitely. Same order of magnitude as `http.Server.WriteTimeout: 60s` (which bounds a whole upgrade response) but tighter because it bounds a single forwarded frame. Constant lives in one place; revisit when the heartbeat ticket (#7) gives real RTT signal. + +See [ADR-0004](../decisions/0004-ws-library-and-adapter-context-strategy.md) for the alternatives considered (`context.Background()` only; caller-supplied context). + +## What the adapter deliberately does NOT do + +These are documented so the next contributor doesn't add defensive code that doesn't belong: + +- **No `connID` validation.** Length, charset, uniqueness are owned by the conn-id-scheme ticket. The adapter treats the id as opaque. +- **No handshake / header validation / subprotocol selection.** Lives at the upgrade boundary (#4/#16, #5). +- **No ping/pong.** Heartbeat is #7. +- **No read-side frame loop or envelope wrap/unwrap.** That is #6. +- **No per-conn send queue / backpressure / rate limit.** `Send` writes synchronously and returns. None of those are in the registry's `Conn` contract. +- **No close-on-`Send`-error.** Caller observes the error and chooses to call `Close`. +- **No close-code semantics beyond `StatusNormalClosure`.** The adapter doesn't know why the registry asked it to close. Close-code mapping (`4401`/`4404`/`4409`) is the upgrade handler's job. +- **No nil-checks on `c`.** Passing a nil `*websocket.Conn` is a programmer error. + +## Caller invariant: WSConn owns the underlying conn + +Documented on `NewWSConn`: after construction, callers must reach the connection only through `WSConn` methods. Calling `c.Write` or `c.Close` directly defeats the adapter's serialisation and cancellation guarantees. There is no defensive code to enforce this — the adapter cannot keep a global registry of wrapped `*websocket.Conn` — so it lives as a doc-comment contract on the constructor. + +## Adversarial framing + +The adapter is on the hot path: every routed frame passes through `WSConn.Send`. Threats considered: + +- **Slow-loris on the write side.** A peer that completes the handshake and stops reading would, with `context.Background()` writes, hang every `Send` forever. Mitigated by `writeTimeout = 10s` — the first code-level slow-loris defence in the relay. A flood of such peers consumes one goroutine each for at most 10 seconds; connection-count caps belong to the WS upgrade ticket. +- **Concurrent `Send` corrupting frames.** `nhooyr.io/websocket.Conn.Write` is *not* safe to call from multiple goroutines at once — concurrent calls would interleave bytes mid-frame. `writeMu` serialises every `Send`. Race-detector test under `-race` is the verification, not a smoke test. +- **`Close` while many goroutines are blocked on `writeMu`.** `Close` cancels `closeCtx` without taking `writeMu`. The currently-writing goroutine's `ctx` is cancelled and `Write` returns. Each next-in-line goroutine acquires the mutex, derives a `WithTimeout(closeCtx, …)` from an already-cancelled parent, and `Write` short-circuits with the cancellation error. No goroutine is stuck. +- **`Close` racing with mid-`Write` `Send`.** Library is documented to handle this; relying on that property is explicit. If the property regresses in a future library version, the race test surfaces a `DATA RACE`. +- **Same `*websocket.Conn` wrapped by two `WSConn` constructors.** Each gets its own `writeMu`; serial-write guarantee is broken; the wire interleaves. Caller-invariant; not enforced. +- **Adversarial `connID` bytes.** Stored and returned as-is; no interpretation. Charset/length is the conn-id-scheme ticket's job. + +The `security-sensitive` label was applied because `Send` is on every routed-frame path. Verdict from review: PASS — one new code-level slow-loris mitigation, no widening of documented threat surface beyond the supply-chain cost the threat model already names for adding any WS library. + +## Testing + +`internal/relay/ws_conn_test.go`, `package relay`. End-to-end against a real `*websocket.Conn` via `httptest.NewServer` — no library mocks. The `startEcho` helper spins up a server whose handler runs a tiny read-loop pushing received frames onto a buffered channel; the test gets back a connected `*WSConn` (client side) and the channel. + +Tests (1:1 with the AC): + +- `TestWSConn_ConnID_ReturnsConstructorValue` — locks the contract. +- `TestWSConn_ConcurrentSend_ProducesIntactFrames` — N=16 goroutines each `Send` a unique tagged payload; the test asserts N intact, distinct frames. Race detector under `-race` is the primary signal; interleaved frames would surface as malformed messages or corrupted writes. +- `TestWSConn_DoubleClose_DoesNotPanic`. +- `TestWSConn_SendAfterClose_ReturnsError` — non-nil error is the contract; the specific error type is library-dependent (any of `context.Canceled`, a closed-connection error, or a wrapped variant) and not asserted on. + +What we deliberately do not test: the library's behaviour itself (we trust `Write` to honour `ctx` and `Close` to be safe with in-flight `Write`); unit-level mocks of the library; performance. + +## Dependency + +`nhooyr.io/websocket` (v1.8.x line). First WS library in the project; second non-stdlib direct dep (after `golang.org/x/crypto` for autocert). The vanity import path remains `nhooyr.io/websocket` even though the upstream module home moved to `github.com/coder/websocket`. `make lint` (`govulncheck`) covers known CVEs; the residual supply-chain risk (a malicious release tagged by an authentic maintainer would see every routed frame in cleartext) is named in `docs/threat-model.md` § "Supply chain — Go dependencies." + +## Related + +- [ADR-0004: WS library choice and adapter context strategy](../decisions/0004-ws-library-and-adapter-context-strategy.md) +- [Connection registry](connection-registry.md) — the `Conn` interface this implements. +- [Threat model](../../threat-model.md) — slow-loris and supply-chain framings. diff --git a/docs/lessons.md b/docs/lessons.md index d93fbe7..c07a80f 100644 --- a/docs/lessons.md +++ b/docs/lessons.md @@ -26,6 +26,14 @@ The AC asked for `go test -race -count=20` coverage. `count` belongs on the comm When a shared map holds connection handles (or anything you'll iterate to do I/O over), the read accessor should allocate and return a fresh copy. Callers can then `Send` / `Close` / fan out without holding the lock — and concurrent mutators don't trip the iteration. The cost is one allocation per call; the benefit is no foot-gun where a slow `Send` blocks every other registry caller. Source: `Registry.PhonesFor` (#3). +## To close cleanly through a serialised writer, cancel a context — don't take the write mutex + +Wrapping a library whose `Write` is *not* concurrency-safe forces a per-conn write mutex. The naive `Close` then takes the same mutex to ensure no `Write` is in flight — and that deadlocks against a slow peer holding the mutex inside an in-flight `Write`. The fix: give the adapter a `closeCtx` cancelled by `Close`; `Send` derives `WithTimeout(closeCtx, …)` per call. `Close` cancels the context *without taking the write mutex*; the in-flight `Write` aborts on cancellation and releases the mutex on its own. Requires the underlying library to document `Close`-safe-with-in-flight-`Write` (the case for `nhooyr.io/websocket`). Source: `internal/relay/ws_conn.go` (#15). + +## `nhooyr.io/websocket` is the v1.8.x line under a vanity import path + +Vanity import path is `nhooyr.io/websocket`. Upstream development moved to `github.com/coder/websocket`, but the `nhooyr.io/websocket` line continues to be published. Use the vanity path; do not opportunistically switch to the coder fork without an ADR. `go get nhooyr.io/websocket@latest` resolves into v1.8.x. Source: ADR-0004 (#15). + ## `autocert.Manager.HTTPHandler(nil)` redirects to HTTPS — it does not 404 The autocert docs and the `nil` argument are easy to misread as "no fallback → 404." Actual behaviour: GET/HEAD get a `302` to HTTPS; other methods get `400`. To get an explicit 404 (or any other handler), pass it explicitly: `mgr.HTTPHandler(http.NotFoundHandler())`. See ADR-0002. Source: autocert TLS wiring (#9). diff --git a/docs/specs/architecture/15-wsconn-adapter.md b/docs/specs/architecture/15-wsconn-adapter.md new file mode 100644 index 0000000..4e93f1b --- /dev/null +++ b/docs/specs/architecture/15-wsconn-adapter.md @@ -0,0 +1,321 @@ +# Spec — `WSConn` adapter for `nhooyr.io/websocket` (#15) + +## Files to read first + +- `internal/relay/registry.go:22-46` — the `Conn` interface this adapter must implement, with the doc-comment contract for each of `ConnID`, `Send`, `Close`. The source of truth. +- `internal/relay/registry.go:117-145` — `UnregisterPhone` is the one site that calls `ConnID()` under the registry write lock; that is why `ConnID` must be a non-blocking pure getter. +- `internal/relay/registry_test.go:14-22` — the `fakeConn` shape established for the registry's own tests. Useful as a reference for "how `Conn` implementations are framed in this package," not as a thing this ticket touches. +- `internal/relay/envelope.go:1-19` — sentinel-error idiom and doc-comment style. New code matches it. +- `internal/relay/doc.go` — `package relay` doc string. The new file lives in the same package. +- `docs/PROJECT-MEMORY.md:22-29` — established patterns: stdlib-default, `errors.Is`, "loud failure over silent correction," interface-methods-called-under-lock are documented as non-blocking, package-internal tests. +- `docs/architecture.md:7-15` — relay is the TLS terminus and routes opaque frames; the WS library sees plaintext, hence the supply-chain framing of the new dependency. +- `docs/threat-model.md` § "Supply chain — Go dependencies" and § "DoS resistance — connection floods, slow-loris, fork-bomb retry" — `nhooyr.io/websocket` joins the dependency surface for `govulncheck`; `Send` must not block indefinitely (slow-loris on the write side). +- `Makefile:14-24` — `make test` runs `go test -race ./...`. `make lint` runs `gosec` and `govulncheck`. The race detector is the default mode for every PR. +- `go.mod` — current dep set: stdlib + `golang.org/x/crypto`. This ticket adds the second non-stdlib direct dep. +- `docs/specs/architecture/3-connection-registry.md:34-62` — historical context for why `Conn` is defined where it is; do not re-derive. + +## Context + +The registry from #3 takes any `Conn` (interface in `internal/relay/registry.go:22-46`); its contract requires `ConnID()` non-blocking, `Send` serialised per connection, and `Close` idempotent and safe concurrently with `Send`. Both endpoint handlers (`/v1/server` in #4/#16, `/v1/client` in #5) need a wrapper around the chosen WebSocket library. Extracting the adapter into one ticket prevents two handlers from inventing two adapters that drift, and isolates the new dependency introduction in a single, small, reviewable PR. + +`nhooyr.io/websocket` is the chosen library (modern, context-aware, idiomatic Go, actively maintained — `gorilla/websocket` was unmaintained as of 2022; `gobwas/ws` is lower-level than required). The vanity import path remains `nhooyr.io/websocket`; the upstream module home moved to `github.com/coder/websocket` but the `nhooyr.io/websocket` import path is still the published v1.8.x line. The developer pulls it with `go get nhooyr.io/websocket@latest` and verifies the resolved version is in the v1.8.x line. + +The non-trivial design problem is the `Send([]byte) error` ↔ `Conn.Write(ctx, ...)` API mismatch: the registry's contract has no context, but the library requires one for every write. The adapter must own a context strategy that (a) bounds write duration so a slow peer cannot stall callers, and (b) cancels in-flight writes when `Close` runs. + +## Design + +### Package & files + +- New file: `internal/relay/ws_conn.go` (`package relay`). +- New test file: `internal/relay/ws_conn_test.go` (`package relay` — same convention as `envelope_test.go` and `registry_test.go`). +- `go.mod` and `go.sum` updated for `nhooyr.io/websocket` v1.8.x. + +No edits to `registry.go`, `envelope.go`, `doc.go`, `cmd/pyrycode-relay/main.go`, or `docs/`. Wiring the adapter into HTTP handlers lands in #4/#16 and #5. + +### Types & API + +```go +package relay + +import ( + "context" + "sync" + "time" + + "nhooyr.io/websocket" +) + +// writeTimeout bounds a single Send. A slow peer cannot stall a caller +// past this deadline; Send returns context.DeadlineExceeded (wrapped by +// the library) and the caller decides whether to drop the connection. +const writeTimeout = 10 * time.Second + +// WSConn adapts a *websocket.Conn from nhooyr.io/websocket to the +// registry's Conn interface. It owns the per-connection write mutex +// (the underlying library forbids concurrent Write) and a per-connection +// cancellation context that Close trips to abort in-flight writes. +// +// All exported methods are safe for concurrent use. Send serialises; +// ConnID is a pure getter; Close is idempotent and may run concurrently +// with Send. +type WSConn struct { + conn *websocket.Conn + connID string + + writeMu sync.Mutex // serialises Conn.Write; held only across one frame. + + closeOnce sync.Once + closeCtx context.Context // cancelled by Close. + cancel context.CancelFunc // called once via closeOnce. +} + +// NewWSConn wraps c with the relay-assigned connection id. The caller +// retains responsibility for the WebSocket handshake and for choosing +// connID; this constructor neither validates connID nor inspects c. +// +// After construction, the WSConn owns c: callers must reach the +// connection only through WSConn methods. Calling c.Write or c.Close +// directly defeats the adapter's serialisation and cancellation +// guarantees. +func NewWSConn(c *websocket.Conn, connID string) *WSConn + +// ConnID returns the constructor-supplied id. Pure getter; safe to call +// from any goroutine and never blocks. The registry calls this under +// its write lock — see registry.go:117-145. +func (w *WSConn) ConnID() string + +// Send writes msg as a single binary WebSocket frame. Concurrent Send +// callers are serialised on a per-WSConn mutex; the wire receives whole, +// non-interleaved frames in some order. Each call has a fixed write +// deadline (writeTimeout); Send after Close returns a non-nil error +// (the library reports the cancelled context; the caller treats any +// non-nil return as "drop the connection"). +func (w *WSConn) Send(msg []byte) error + +// Close cancels in-flight writes and closes the WebSocket with +// StatusNormalClosure. Idempotent: only the first call reaches the +// underlying *websocket.Conn; subsequent calls are no-ops. Safe to call +// concurrently with Send. Returns no value, matching the registry's +// Conn.Close contract. +func (w *WSConn) Close() +``` + +### Algorithms + +**`NewWSConn`** — capture the conn and id, derive a `closeCtx`/`cancel` from `context.Background()`. The context's only purpose is to be cancelled by `Close`; it has no parent because the WSConn outlives the request scope (the registry holds the reference until `Close` runs). + +**`Send`** — take `writeMu`, derive a per-call `context.WithTimeout(closeCtx, writeTimeout)`, call `w.conn.Write(ctx, websocket.MessageBinary, msg)`, release the mutex. + +```go +func (w *WSConn) Send(msg []byte) error { + w.writeMu.Lock() + defer w.writeMu.Unlock() + ctx, cancel := context.WithTimeout(w.closeCtx, writeTimeout) + defer cancel() + return w.conn.Write(ctx, websocket.MessageBinary, msg) +} +``` + +If `Close` has already run, `closeCtx` is already cancelled and `context.WithTimeout` returns an immediately-cancelled context; `Write` short-circuits and returns the cancellation error. No partial frame goes out. (`nhooyr.io/websocket.Conn.Write` honours `ctx.Done()` before touching the socket.) + +**`Close`** — `closeOnce.Do` of "cancel `closeCtx`; call `w.conn.Close(websocket.StatusNormalClosure, "")`; discard its returned error". The library's `Close` is safe to call concurrently with an in-flight `Write` — that's a key reason for choosing `nhooyr.io/websocket` over alternatives. + +```go +func (w *WSConn) Close() { + w.closeOnce.Do(func() { + w.cancel() + _ = w.conn.Close(websocket.StatusNormalClosure, "") + }) +} +``` + +`Close` does **not** take `writeMu`. Acquiring it would deadlock against a slow peer holding the mutex inside `Send`: the whole point of cancelling `closeCtx` is to abort the in-flight `Write` so it releases the mutex on its own. Once the cancelled `Write` returns, the second `Close` step (`conn.Close`) finalises the WebSocket; if it races with the still-unwinding `Send`, `nhooyr.io/websocket` is documented to handle that race internally. + +The returned error from `conn.Close` is discarded by design: `Conn.Close` in the registry's contract returns nothing, and there is no useful action the adapter can take on a "close-after-close" or "close-after-network-error" report. Logging the close code is the WS upgrade handler's responsibility (#4/#16, #5), not the adapter's. + +**`ConnID`** — `return w.connID`. No lock, no allocation. + +### Concurrency model + +| Method | Lock | Context | Concurrent-safe with | +|---|---|---|---| +| `ConnID` | none | none | every other method | +| `Send` | `writeMu` (held across one frame write) | `WithTimeout(closeCtx, writeTimeout)` | other `Send` (serialised), `Close` (cancellation breaks it), `ConnID` | +| `Close` | `closeOnce` (one shot) | cancels `closeCtx` | `Send` (in-flight write is cancelled), other `Close` (no-op), `ConnID` | + +There is exactly one mutex (`writeMu`) and one one-shot guard (`closeOnce`). The lock graph is a single node; there are no callbacks, no channels, no goroutines spawned by the adapter. + +### Error handling + +- `Send` returns the underlying library error verbatim; no wrapping. The registry never inspects it; the WS upgrade handler decides whether to log and whether to drop the connection. (No new sentinel errors — there is no protocol-level distinction this adapter needs to expose.) +- `Close` returns nothing, matching the registry's `Conn.Close` contract. +- `NewWSConn` returns no error: every input is trusted by the caller (the WS upgrade handler chose to upgrade and chose the `connID`). +- No panics. No nil checks on `c` — passing a nil `*websocket.Conn` is a programmer error, not a runtime case. + +### Lifecycle expectations the adapter does NOT enforce + +These are listed so the developer doesn't add defensive code: + +- The adapter does not validate `connID` (length, charset, uniqueness). Generation rules belong to the conn-id-scheme ticket. +- The adapter does not perform the WebSocket handshake, header validation, subprotocol selection, ping/pong, or read-side frame loop. Those live in #4/#16, #5, #6, #7. +- The adapter does not bound the per-connection send queue. There is no queue: `Send` writes synchronously and returns. Throttling and backpressure are caller decisions. +- The adapter does not close on `Send` errors. The caller observes the error and chooses to call `Close`. + +### Why this context strategy + +Three options were considered: + +1. **`context.Background()` only.** A slow peer hangs `Send` indefinitely. Rejected — the threat model already treats slow-loris on the read side as a known DoS vector; the same shape applies on the write side. +2. **Caller-supplied context.** Would require changing `relay.Conn.Send`'s signature, which is fixed by #3 and consumed by code that does not naturally have a context to thread (e.g. broadcast loops iterating a `PhonesFor` snapshot). Rejected — would force a registry-API change to fix an adapter-local concern. +3. **Adapter-owned context: `closeCtx` cancelled by `Close`, plus a per-call `WithTimeout`.** Bounds individual writes (slow peer); cancels in-flight writes on close (clean shutdown); does not require a registry API change. Adopted. + +`writeTimeout = 10s` is conservative — long enough that a momentarily-laggy mobile client on a poor network is not killed mid-frame, short enough that a single hung `Send` cannot park a goroutine indefinitely. This is the same order of magnitude as the existing `http.Server.WriteTimeout: 60s` (which bounds the whole upgrade response) but tighter because it bounds a single forwarded frame, not a full HTTP exchange. If a future ticket finds this wrong, it changes one constant. + +### Why the library is `nhooyr.io/websocket` + +Choice fixed by the ticket; this spec records the rationale so the developer doesn't relitigate it: + +- Context-aware API on every blocking method (`Read`, `Write`, `Close`) — fits the relay's existing `context.Context` discipline. +- `Conn.Close` is documented as safe-with-an-in-flight-`Write`, which is the property the adapter relies on for non-deadlocking close. +- Maintained: `gorilla/websocket` is in maintenance-only mode, last meaningful release predates Go 1.18; `gobwas/ws` is a frame-level toolkit that would force the adapter to own framing details that aren't relevant to the relay. + +The vanity import path `nhooyr.io/websocket` resolves to the v1.8.x module (the upstream maintainer moved development to `github.com/coder/websocket`, but the `nhooyr.io/websocket` line continues to be published). Use the vanity path; do not switch to `github.com/coder/websocket` in this ticket. + +## Testing strategy + +`internal/relay/ws_conn_test.go`, `package relay`. End-to-end against a real `*websocket.Conn` via `httptest.NewServer` — no mocks of the library, per the ticket. + +### Test harness + +A single helper in the test file spins up an `httptest.NewServer` whose handler accepts a WebSocket and runs a tiny read-loop that pushes received messages onto a channel: + +```go +// startEcho returns a connected *WSConn (client side) and a channel that +// receives every frame the test server reads. Caller defers the cleanup. +func startEcho(t *testing.T) (*WSConn, <-chan []byte, func()) { + t.Helper() + received := make(chan []byte, 1024) + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + c, err := websocket.Accept(w, r, nil) + if err != nil { t.Errorf("accept: %v", err); return } + defer c.Close(websocket.StatusInternalError, "test ended") + for { + _, data, err := c.Read(r.Context()) + if err != nil { return } + received <- data + } + })) + + wsURL := "ws" + strings.TrimPrefix(srv.URL, "http") + client, _, err := websocket.Dial(context.Background(), wsURL, nil) + if err != nil { t.Fatalf("dial: %v", err) } + + wc := NewWSConn(client, "test-conn-id") + cleanup := func() { + wc.Close() + srv.Close() + } + return wc, received, cleanup +} +``` + +The test server's handler reads with `r.Context()` so closing the test server tears down the read loop cleanly. The receive channel is buffered generously so the producer side is never blocked by an inattentive test. + +### Tests (1:1 with AC bullets) + +1. **`TestWSConn_ConnID_ReturnsConstructorValue`** — call `NewWSConn(c, "abc")`, assert `ConnID() == "abc"`. Trivial; included for AC coverage and to lock the contract. + +2. **`TestWSConn_ConcurrentSend_ProducesIntactFrames`** — start the harness; spawn N goroutines (e.g. N=16), each calling `wc.Send([]byte(fmt.Sprintf("g%d", id)))` once (or M times with a goroutine-tagged payload); `wg.Wait`; drain `received` and assert N (or N×M) intact, distinct frames. The race detector is the primary signal — interleaved frames would produce malformed messages on the receiver, and corrupted writes would surface as `Read` errors. Use `-race` (the Makefile already does). + +3. **`TestWSConn_DoubleClose_DoesNotPanic`** — `wc.Close(); wc.Close()`. The test passes by not panicking. + +4. **`TestWSConn_SendAfterClose_ReturnsError`** — `wc.Close()`; assert `wc.Send([]byte("late")) != nil`. Do not assert on the specific error type; the library may return any of `context.Canceled`, a closed-connection error, or a wrapped variant. The contract is "non-nil," not a specific value. + +5. **(Optional, free coverage)** **`TestWSConn_SlowPeer_TimesOut`** — sketch only; deferred. Building a "peer that accepts but never reads" requires either hijacking the TCP socket in the handler or filling the OS send buffer, both of which are flaky in CI. Not in the ticket's AC. Listed here so the developer doesn't add it on a hunch. + +### What we deliberately do not test + +- **The library's behaviour itself.** We trust `nhooyr.io/websocket.Conn.Write` to honour `ctx`, and `Conn.Close` to be safe with in-flight `Write`. Wrapping the library to test the library is out of scope. +- **Unit-level mocks of the library.** The ticket explicitly asks for a real `*websocket.Conn` end-to-end. The race-and-correctness signal from `httptest` + `-race` is stronger than what a mock could prove. +- **Performance.** No throughput or latency assertions. Capacity work belongs to a profiling ticket if and when it's needed. + +### Lint expectations + +`make vet`, `make test`, `make build` all clean. `gosec` and `govulncheck` clean. Adding the dependency widens `govulncheck`'s scan surface; if it flags a known-issue version, bump within v1.8.x to a clean release before merging. Do not pin to a non-stable line. + +## Open questions + +1. **`writeTimeout = 10s` — right number?** This is a guess informed by typical mobile-network frame budgets (a single forwarded text/voice frame should not take seconds; if it does, the connection is lost). No real traffic to calibrate against in v1. Leave as `const writeTimeout = 10 * time.Second` for now; revisit when the heartbeat ticket (#7) lands and we have a sense of in-the-wild RTT distributions. +2. **Should `Close` set a `StatusNormalClosure` reason string?** No, leave it empty. Close-code semantics for protocol errors (`4401`, `4404`, `4409`) are owned by the WS upgrade handlers; the adapter sees only "the registry asked us to close" and cannot distinguish reasons. Adding placeholder text would mislead operators reading captures. +3. **Is the per-call `context.WithTimeout` allocation acceptable on every `Send`?** Yes — the relay's hot path is human-mediated frame forwarding (chat, command), not microsecond-bound RPC. The allocation is a few-hundred-nanosecond `context.WithDeadline` plus a goroutine-cancel timer; insignificant against the WebSocket frame cost. If profiling later disagrees, swap to a long-lived context with manual deadline management; the public API does not change. + +## Out of scope (re-stated, for the developer) + +- No edits to `internal/relay/registry.go` or its test. The `Conn` interface is fixed. +- No HTTP handler, no upgrade logic. That is #4/#16 (`/v1/server`) and #5 (`/v1/client`). +- No header validation (`x-pyrycode-server`, `x-pyrycode-token`). Header checks live at the upgrade boundary; they have no presence in the adapter. +- No close-code semantics beyond `StatusNormalClosure`. `4401`/`4404`/`4409` mapping is the upgrade handler's job (#4/#5/#16). +- No heartbeat / ping-pong (#7). +- No frame loop, no envelope wrap/unwrap (#6). +- No conn-id generation. `connID` is supplied by the caller. +- No throttling, queueing, or per-conn rate limit. None of those are in the registry's contract. + +## Done means + +- `internal/relay/ws_conn.go` exists with `WSConn`, `NewWSConn`, `ConnID`, `Send`, `Close`, plus the `writeTimeout` constant. Every exported symbol carries a doc comment that names its concurrency contract — what is locked, what is cancelled, what is safe to call concurrently with what. +- `internal/relay/ws_conn_test.go` covers the four functional tests above. +- `go.mod` lists `nhooyr.io/websocket` (v1.8.x) as a direct dependency; `go.sum` is updated. +- `make vet`, `make test` (under `-race`), `make build` all clean. `gosec ./...` and `govulncheck ./...` clean. +- One commit on `feature/15`: `feat(relay): WSConn adapter for nhooyr.io/websocket (#15)`. + +--- + +## Security review (security-sensitive label) + +### Threat surface for THIS ticket + +The adapter sits directly on the path between the network and the registry: every byte from a binary or a phone passes through `WSConn.Send` (in the reverse direction it is a future ticket's read-loop). Because the relay is the TLS terminus, the WSConn's underlying `*websocket.Conn` sees plaintext. The adapter itself does not interpret payloads, but it must not introduce footguns that a downstream caller's mistake turns into a vulnerability. + +### Categories walked + +- **Trust boundaries.** The adapter trusts (a) `connID` is whatever the WS upgrade handler assigned (validation is the conn-id-scheme ticket's job — `WSConn` treats it as opaque); (b) the caller, which holds the only `*WSConn` reference for that connection, will not reach around it to call `c.Write` or `c.Close` directly; (c) `nhooyr.io/websocket` honours its documented contract (context cancellation aborts `Write`; `Close` is safe with in-flight `Write`). Each trust is named in the doc comment of `NewWSConn` or in the spec's "lifecycle expectations." **Finding:** none introduced; trust shape matches what #3 already documented for the `Conn` interface. + +- **Resource exhaustion / DoS.** The threat-model document calls out slow-loris and connection-flood vectors as known v1 gaps. This adapter is the right place to defend against the **write-side** slow-peer case — a peer that completes the WS handshake and then stops reading would, with `context.Background()` writes, hang every `Send` to that peer indefinitely, parking goroutines and (more importantly) blocking broadcast loops that fan out to many phones. **Finding:** mitigated by `writeTimeout = 10s` on every `Send`. A single slow peer cannot stall a broadcast goroutine for more than 10 seconds before it errors out and the upper layer drops the connection. This is the first code-level slow-loris mitigation in the relay; document the constant clearly. + +- **Resource exhaustion / memory.** No internal queue, no background goroutine, no buffer pool. `Send` is synchronous; the only allocation per call is the `context.WithTimeout` triplet. **Finding:** the adapter contributes nothing to per-connection memory growth. + +- **Race conditions / data corruption on the wire.** `nhooyr.io/websocket.Conn.Write` is **not** safe to call from multiple goroutines at once; concurrent calls would interleave bytes mid-frame. The adapter's `writeMu` serialises every `Send`. The race-detector test under `-race` is the verification mechanism, not just a smoke test. **Finding:** mitigated; tested. + +- **Lock-order / deadlock.** Single mutex (`writeMu`); one one-shot guard (`closeOnce`). `Close` deliberately does **not** take `writeMu` — instead it cancels `closeCtx`, which causes the in-flight `Write` to abort and release the mutex on its own. **Finding:** no deadlock path; the design rationale is named in the spec so a future contributor doesn't "tidy up" by adding `Close`-takes-`writeMu` and reintroduce the deadlock. + +- **Use-after-close / double-close.** `Close` is `sync.Once`-guarded; the underlying `conn.Close` runs at most once. `Send` after `Close` finds `closeCtx` already cancelled and `Write` returns the cancellation error without touching the socket. **Finding:** safe; tested. + +- **Information disclosure.** The adapter sends what it is given; it does not log payloads, headers, or `connID`. (No `slog` calls anywhere in the adapter.) **Finding:** matches `docs/threat-model.md` § "Log hygiene" — no leakage path through this file. + +- **Supply chain.** Adds `nhooyr.io/websocket` to `go.mod`. The threat model already names this as a known cost when the WS library lands (`docs/threat-model.md` § "Supply chain — Go dependencies": *"A compromised future WebSocket library would see every routed frame in cleartext"*). The mitigation lives in `make lint` (`govulncheck`) and the project's "new dependencies need a justification" rule (this spec is the justification). **Finding:** acceptable; documented. + +- **Panics / nil-deref.** No `nil` checks; the adapter trusts the caller to pass a valid `*websocket.Conn`. No slice indexing, no map access. **Finding:** none observed. + +- **Error leakage to the wire.** `Send` returns library errors verbatim. The error is consumed by registry callers (broadcast loops, the upgrade handler), not sent to a peer. `Close` discards `conn.Close`'s error — no leakage path. **Finding:** none. + +### Adversarial framings considered + +- *"What if a peer completes the handshake, then never reads?"* — `Send` blocks for up to `writeTimeout` (10s), then returns `context.DeadlineExceeded`. The caller drops the connection. A flood of such peers consumes one goroutine each for 10 seconds, capped. Connection-count caps belong to the WS upgrade ticket; this is the per-connection bound. + +- *"What if many goroutines call `Send` concurrently with one slow peer?"* — `writeMu` serialises them. Each waits in turn; each that does eventually win the mutex hits the same 10-second timeout. The serial-tail is bounded — N callers wait at most N×10s in the worst case. Acceptable for v1; if it becomes a problem, the upgrade layer adds connection caps before the adapter changes. + +- *"What if `Close` is called while ten goroutines are blocked on `writeMu`?"* — `Close` cancels `closeCtx` without taking `writeMu`. The currently-writing goroutine's `ctx` is cancelled and `Write` returns. The mutex is released. The next-in-line goroutine acquires `writeMu`, derives its own `WithTimeout(closeCtx, ...)` — `closeCtx` is already cancelled, so `Write` short-circuits. Each queued `Send` returns in turn with a cancellation error. No goroutine is stuck. + +- *"What if `Close` runs concurrently with a `Send` that is mid-`Write`?"* — `Close` cancels `closeCtx`, then calls `conn.Close`. `nhooyr.io/websocket` is documented to handle `Close`-with-in-flight-`Write` cleanly: the `Write` returns an error; the `Close` finalises. The adapter relies on this property — it's why this library was chosen. If the property regresses in a future library version, `govulncheck` won't catch it (it's a behaviour change, not a CVE), but the race test would surface a `DATA RACE` and the lint pass would break. + +- *"What if the caller passes the same `*websocket.Conn` to two `WSConn` constructors?"* — Each gets its own `writeMu`; serial-write guarantee is broken; the wire interleaves. **The doc comment on `NewWSConn` says the WSConn owns `c`.** This is a caller-invariant; the adapter does not (and cannot, without keeping a global registry of wrapped `*websocket.Conn`) defend against it. Documented; not coded. + +- *"What if `connID` contains adversarial bytes (NULs, very long, control characters)?"* — `WSConn` stores it and returns it from `ConnID()`. The registry uses it as a map key (slice-scan key). No interpretation. The conn-id-scheme ticket is responsible for charset/length; the adapter's contract is "opaque string." This matches #3's stance on `serverID`. + +- *"What if `nhooyr.io/websocket` v1.8.x is compromised?"* — Plaintext exposure of every routed frame, as `docs/threat-model.md` § "Supply chain" already names. `govulncheck` catches known CVEs; it does not catch a malicious release tagged by an authentic maintainer. No new mitigation in this ticket; the residual is the same risk the threat model already accepts for adding any WS library. + +- *"Does the per-call `context.WithTimeout` create a goroutine that leaks?"* — No. `context.WithTimeout` schedules a single timer that is cancelled by the `defer cancel()` in `Send`. No goroutine is spawned by the cancel-context machinery; the underlying timer is reclaimed. + +### Verdict: PASS + +The adapter introduces one new code-level slow-loris mitigation (`writeTimeout`), preserves the registry's lock-discipline assumptions, and does not widen any documented threat surface beyond the supply-chain cost the threat model already names for adding a WS library. The doc-comment-encoded caller invariants ("WSConn owns `c`") are clearly out of the adapter's hands and are explicitly noted; no defensive code can be added against them without changing the API. The `security-sensitive` label is justified — `Send` is on every routed-frame path — and the review is correspondingly thorough. diff --git a/go.mod b/go.mod index d2246ec..7737307 100644 --- a/go.mod +++ b/go.mod @@ -2,7 +2,10 @@ module github.com/pyrycode/pyrycode-relay go 1.26.2 -require golang.org/x/crypto v0.50.0 +require ( + golang.org/x/crypto v0.50.0 + nhooyr.io/websocket v1.8.17 +) require ( golang.org/x/net v0.52.0 // indirect diff --git a/go.sum b/go.sum index 00cc43c..5b6d84b 100644 --- a/go.sum +++ b/go.sum @@ -4,3 +4,5 @@ golang.org/x/net v0.52.0 h1:He/TN1l0e4mmR3QqHMT2Xab3Aj3L9qjbhRm78/6jrW0= golang.org/x/net v0.52.0/go.mod h1:R1MAz7uMZxVMualyPXb+VaqGSa3LIaUqk0eEt3w36Sw= golang.org/x/text v0.36.0 h1:JfKh3XmcRPqZPKevfXVpI1wXPTqbkE5f7JA92a55Yxg= golang.org/x/text v0.36.0/go.mod h1:NIdBknypM8iqVmPiuco0Dh6P5Jcdk8lJL0CUebqK164= +nhooyr.io/websocket v1.8.17 h1:KEVeLJkUywCKVsnLIDlD/5gtayKp8VoCkksHCGGfT9Y= +nhooyr.io/websocket v1.8.17/go.mod h1:rN9OFWIUwuxg4fR5tELlYC04bXYowCP9GX47ivo2l+c= diff --git a/internal/relay/ws_conn.go b/internal/relay/ws_conn.go new file mode 100644 index 0000000..29d179e --- /dev/null +++ b/internal/relay/ws_conn.go @@ -0,0 +1,82 @@ +package relay + +import ( + "context" + "sync" + "time" + + "nhooyr.io/websocket" +) + +// writeTimeout bounds a single Send. A slow peer cannot stall a caller +// past this deadline; Send returns the library's wrapped cancellation +// error and the caller decides whether to drop the connection. +const writeTimeout = 10 * time.Second + +// WSConn adapts a *websocket.Conn from nhooyr.io/websocket to the +// registry's Conn interface. It owns the per-connection write mutex +// (the underlying library forbids concurrent Write) and a per-connection +// cancellation context that Close trips to abort in-flight writes. +// +// All exported methods are safe for concurrent use. Send serialises +// concurrent callers; ConnID is a pure getter; Close is idempotent and +// may run concurrently with Send. +type WSConn struct { + conn *websocket.Conn + connID string + + writeMu sync.Mutex + + closeOnce sync.Once + closeCtx context.Context + cancel context.CancelFunc +} + +// NewWSConn wraps c with the relay-assigned connection id. The caller +// retains responsibility for the WebSocket handshake and for choosing +// connID; this constructor neither validates connID nor inspects c. +// +// After construction, the WSConn owns c: callers must reach the +// connection only through WSConn methods. Calling c.Write or c.Close +// directly defeats the adapter's serialisation and cancellation +// guarantees. +func NewWSConn(c *websocket.Conn, connID string) *WSConn { + ctx, cancel := context.WithCancel(context.Background()) + return &WSConn{ + conn: c, + connID: connID, + closeCtx: ctx, + cancel: cancel, + } +} + +// ConnID returns the constructor-supplied id. Pure getter; safe to call +// from any goroutine and never blocks. The registry calls this under +// its write lock — see registry.go UnregisterPhone. +func (w *WSConn) ConnID() string { + return w.connID +} + +// Send writes msg as a single binary WebSocket frame. Concurrent Send +// callers are serialised on a per-WSConn mutex; the wire receives whole, +// non-interleaved frames in some order. Each call has a fixed write +// deadline (writeTimeout); Send after Close returns a non-nil error. +// Callers treat any non-nil return as "drop the connection." +func (w *WSConn) Send(msg []byte) error { + w.writeMu.Lock() + defer w.writeMu.Unlock() + ctx, cancel := context.WithTimeout(w.closeCtx, writeTimeout) + defer cancel() + return w.conn.Write(ctx, websocket.MessageBinary, msg) +} + +// Close cancels in-flight writes and closes the WebSocket with +// StatusNormalClosure. Idempotent: only the first call reaches the +// underlying *websocket.Conn; subsequent calls are no-ops. Safe to call +// concurrently with Send. +func (w *WSConn) Close() { + w.closeOnce.Do(func() { + w.cancel() + _ = w.conn.Close(websocket.StatusNormalClosure, "") + }) +} diff --git a/internal/relay/ws_conn_test.go b/internal/relay/ws_conn_test.go new file mode 100644 index 0000000..e3c6652 --- /dev/null +++ b/internal/relay/ws_conn_test.go @@ -0,0 +1,117 @@ +package relay + +import ( + "context" + "fmt" + "net/http" + "net/http/httptest" + "strings" + "sync" + "testing" + "time" + + "nhooyr.io/websocket" +) + +// startEcho stands up an httptest server whose handler upgrades to a +// WebSocket and forwards every received frame to a buffered channel. +// It returns a connected WSConn (client side), the receive channel, +// and a cleanup function the caller defers. +func startEcho(t *testing.T) (*WSConn, <-chan []byte, func()) { + t.Helper() + received := make(chan []byte, 1024) + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + c, err := websocket.Accept(w, r, nil) + if err != nil { + t.Errorf("accept: %v", err) + return + } + defer c.Close(websocket.StatusInternalError, "test ended") + for { + _, data, err := c.Read(r.Context()) + if err != nil { + return + } + received <- data + } + })) + + wsURL := "ws" + strings.TrimPrefix(srv.URL, "http") + dialCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer cancel() + client, _, err := websocket.Dial(dialCtx, wsURL, nil) + if err != nil { + srv.Close() + t.Fatalf("dial: %v", err) + } + + wc := NewWSConn(client, "test-conn-id") + cleanup := func() { + wc.Close() + srv.Close() + } + return wc, received, cleanup +} + +func TestWSConn_ConnID_ReturnsConstructorValue(t *testing.T) { + wc := NewWSConn(nil, "abc") + if got := wc.ConnID(); got != "abc" { + t.Fatalf("ConnID() = %q, want %q", got, "abc") + } +} + +func TestWSConn_ConcurrentSend_ProducesIntactFrames(t *testing.T) { + wc, received, cleanup := startEcho(t) + defer cleanup() + + const n = 16 + var wg sync.WaitGroup + wg.Add(n) + for i := 0; i < n; i++ { + go func(id int) { + defer wg.Done() + payload := []byte(fmt.Sprintf("g%02d", id)) + if err := wc.Send(payload); err != nil { + t.Errorf("Send(%d): %v", id, err) + } + }(i) + } + wg.Wait() + + got := make(map[string]int, n) + timeout := time.After(5 * time.Second) + for i := 0; i < n; i++ { + select { + case frame := <-received: + got[string(frame)]++ + case <-timeout: + t.Fatalf("timed out after receiving %d/%d frames; got=%v", i, n, got) + } + } + if len(got) != n { + t.Fatalf("got %d distinct frames, want %d (got=%v)", len(got), n, got) + } + for k, v := range got { + if v != 1 { + t.Errorf("frame %q received %d times, want 1", k, v) + } + } +} + +func TestWSConn_DoubleClose_DoesNotPanic(t *testing.T) { + wc, _, cleanup := startEcho(t) + defer cleanup() + + wc.Close() + wc.Close() +} + +func TestWSConn_SendAfterClose_ReturnsError(t *testing.T) { + wc, _, cleanup := startEcho(t) + defer cleanup() + + wc.Close() + if err := wc.Send([]byte("late")); err == nil { + t.Fatal("Send after Close returned nil error, want non-nil") + } +}