Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion docs/PROJECT-MEMORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
| Routing envelope wrapper type (`Envelope`, `Marshal`, `Unmarshal`, sentinel errors) | Done (#1) | `internal/relay/envelope.go` |
| Connection registry (`Conn`, `Registry`; 1:1 binary, 1:N phones; sentinel errors for `4409`/`4404`; race-tested) | Done (#3) | `internal/relay/registry.go` |
| TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` |
| `WSConn` adapter (`nhooyr.io/websocket.Conn` → registry `Conn`; per-conn write mutex; `Close`-cancelled context; 10s `Send` deadline) | Done (#15) | `internal/relay/ws_conn.go` |
| WS upgrade on `/v1/server` and `/v1/client` | Not started | — |
| Header validation (`x-pyrycode-server`, `x-pyrycode-token`) | Not started | — |
| Frame forwarding using the routing envelope | Not started | — |
Expand All @@ -22,11 +23,12 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
- **Sentinel errors, branched via `errors.Is`** for validation failures at protocol boundaries. New routing-layer code follows the `Err...` naming and wraps with `fmt.Errorf("…: %w", err, sentinel)` when adding context.
- **Opacity by type.** Inner-frame payloads are carried as `json.RawMessage`. The relay never deserialises payloads; the type makes that hard to violate accidentally.
- **Validate at the envelope boundary, not deeper.** Structural checks (presence, non-empty, JSON well-formedness) belong here; semantic checks (token validity, message kind) belong to the binary.
- **Stdlib + `golang.org/x/crypto`.** First non-stdlib dep arrived in #9 (`acme/autocert`). Keep the dependency surface deliberate; new deps need a justification.
- **Stdlib + `golang.org/x/crypto` + `nhooyr.io/websocket`.** Direct deps arrived in #9 (`acme/autocert`) and #15 (`nhooyr.io/websocket` v1.8.x). Keep the dependency surface deliberate; new deps need a justification (the ADR is the justification).
- **Loud failure over silent correction.** `--domain` and `--insecure-listen` are mutually exclusive and explicit. `ErrCacheDirInsecure` refuses to start on a world/group-readable cert cache rather than re-chmoding it. `:80` returns `404` for non-challenge traffic instead of redirecting to HTTPS (ADR-0002).
- **Tests live in the same package** (`package relay`, not `relay_test`) so they can `errors.Is` against unexported sentinels if needed and reach private helpers without re-exporting.
- **Passive in-memory stores guard mutation under one RWMutex; reads return copies, never references.** `PhonesFor` allocates a fresh slice so callers do slow work (broadcast, `Send` over the network) without holding the registry lock. Adopted in `internal/relay/registry.go` (#3); the same shape applies to any future "shared map of conns" type.
- **Interface methods called under the lock are documented as non-blocking getters.** The registry's `Conn.ConnID()` is invoked under the write lock during `UnregisterPhone`; `Send` and `Close` are never called while the lock is held. Pattern: state the contract on the interface, never call something that could block on I/O while a mutex is held.
- **Adapters bridge interface↔library API mismatches by owning policy locally.** When a library method needs a `context.Context` but the registry's `Conn` interface doesn't take one (and shouldn't — most callers don't have a context to thread), the adapter owns its own context: derived in the constructor, cancelled by `Close`, narrowed per-call with `WithTimeout` for deadline policy. Adopted in `WSConn` (#15); avoids forcing context-plumbing changes into upstream interfaces.

## Conventions

Expand Down
2 changes: 2 additions & 0 deletions docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,12 +4,14 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [WSConn adapter](features/ws-conn-adapter.md) — wraps `nhooyr.io/websocket.Conn` to satisfy the registry's `Conn`; owns the per-conn write mutex and a `Close`-cancelled context with a 10s per-`Send` deadline.
- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`.
- [Autocert TLS](features/autocert-tls.md) — `--domain` mode wiring: `:443` WSS via Let's Encrypt + `:80` ACME http-01, host gates, cache-dir permission policy, TLS 1.2 floor.
- [Routing envelope](features/routing-envelope.md) — typed Go wrapper (`Envelope`, `Marshal`, `Unmarshal`) for the `{conn_id, frame}` wire shape exchanged between relay and binary.

## Decisions

- [ADR-0004: WS library choice and adapter context strategy](decisions/0004-ws-library-and-adapter-context-strategy.md) — `nhooyr.io/websocket` v1.8.x; adapter-owned context cancelled by `Close`, 10s per-`Send` deadline; registry's `Send` signature stays context-free.
- [ADR-0003: Connection registry as a passive store](decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, narrow `ReleaseServer` so grace logic (#8) can wrap it.
- [ADR-0002: Explicit-failure 404 on `:80` instead of HTTPS redirect](decisions/0002-autocert-explicit-failure-on-port-80.md) — pass `http.NotFoundHandler()` to `autocert.Manager.HTTPHandler`, not `nil`; loud failure over silent scheme-switching.
- [ADR-0001: Routing envelope shape and opacity](decisions/0001-routing-envelope-shape-and-opacity.md) — `json.RawMessage` for the inner frame; sentinel errors; validate at boundary, never parse payloads.
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
# ADR-0004: WS library choice and adapter context strategy

**Status:** Accepted (#15)
**Date:** 2026-05-08

## Context

The registry's `Conn` interface from #3 is implementation-agnostic, but everything that actually moves bytes in the relay needs a concrete WebSocket library. Two interlocking choices had to be made before the upgrade handlers (#4/#16, #5) could land:

1. **Which Go WS library?**
2. **How does the `Send([]byte) error` ↔ library `Write(ctx, …)` API mismatch get bridged?** The registry's `Conn.Send` deliberately has no `context.Context` parameter (set in ADR-0003 / #3); the library requires one for every write.

Both were decided in #15 along with the `WSConn` adapter that embodies them.

## Decision

**Library:** `nhooyr.io/websocket` (v1.8.x), pulled via the vanity import path. No alternative library is wrapped or kept compatible.

**Context strategy in `WSConn`:** the adapter owns its own context. `NewWSConn` derives `closeCtx`/`cancel` from `context.Background()`. Each `Send` uses `context.WithTimeout(closeCtx, writeTimeout)` — `writeTimeout = 10 * time.Second`. `Close` cancels `closeCtx` first, then calls the library's `Close`.

## Rationale

### Library: `nhooyr.io/websocket`

- **Context-aware API on every blocking method** (`Read`, `Write`, `Close`). Fits the relay's existing `context.Context` discipline; the adapter's per-call timeout strategy is only possible because of this.
- **`Conn.Close` is documented as safe with an in-flight `Write`.** This is the property the adapter relies on for non-deadlocking close — `Close` cancels `closeCtx` without taking `writeMu`, the in-flight `Write` aborts, and the lock unwinds.
- **Maintained.** `gorilla/websocket` was unmaintained as of 2022 (last meaningful release predates Go 1.18); `gobwas/ws` is a frame-level toolkit that would force the adapter to own framing details.

The vanity path `nhooyr.io/websocket` resolves to v1.8.x; upstream development moved to `github.com/coder/websocket` but the published line continues. Use the vanity path; revisit when v2 (or the coder fork) introduces a meaningful behavioural change.

### Context strategy: adapter-owned, per-call deadline

Three options were on the table:

1. **`context.Background()` only.** A slow peer hangs `Send` indefinitely. Rejected — the threat model already treats slow-loris on the read side as a known DoS vector; the same shape applies on the write side, and a broadcast loop fanning out to many phones cannot afford one slow phone parking a goroutine forever.
2. **Caller-supplied context.** Would require changing `relay.Conn.Send`'s signature, which is fixed by ADR-0003 and consumed by callers (broadcast loops iterating a `PhonesFor` snapshot) that don't naturally have a context to thread. Rejected — would force a registry-API change to fix an adapter-local concern.
3. **Adapter-owned: `closeCtx` cancelled by `Close`, plus per-call `WithTimeout`.** Bounds individual writes (slow peer); cancels in-flight writes on close (clean shutdown); does not require a registry API change. Adopted.

`writeTimeout = 10s` is conservative — long enough that a momentarily-laggy mobile client on a poor network is not killed mid-frame, short enough that one hung `Send` cannot park a goroutine indefinitely. One constant, one place to change. Heartbeat (#7) will give real RTT signal to recalibrate against.

### Why `Close` does not take `writeMu`

Locking `writeMu` inside `Close` would deadlock against a slow peer holding the mutex inside `Send`. Cancelling `closeCtx` instead aborts the in-flight `Write`, which releases the mutex on its own. This is the crux of the design and the reason the library's "Close-safe-with-in-flight-Write" property mattered when picking it.

## Consequences

- The relay has exactly one WS library and one adapter. Future routing-layer code (#4/#16, #5, #6, #7) hands the registry a `*WSConn`, never a hand-rolled adapter.
- The `Conn.Send` signature stays context-free. Callers (broadcast loops, the upgrade handler) pass `[]byte`, get `error`, and the adapter handles deadline policy internally.
- `writeTimeout = 10s` is the relay's first code-level slow-loris mitigation on the write side. A peer that accepts but never reads costs one goroutine for at most 10 seconds; connection-count caps remain the upgrade layer's job.
- `govulncheck` now has a wider scan surface. If it flags a v1.8.x release, bump within the line. Do not pin to a non-stable line; do not switch to `github.com/coder/websocket` opportunistically.
- The adapter relies on a documented library property (`Close` safe with in-flight `Write`). If a future library version regresses it, the race test surfaces a `DATA RACE` and the lint pass breaks; that is the early-warning signal.
- `WSConn` owns its `*websocket.Conn` after construction. Calling the library directly on the same conn is documented as a caller invariant; there is no defensive code, and there cannot be without a global registry of wrapped conns.
- `Conn.Close` returns no error. Idempotency is `sync.Once`-guarded; the underlying `conn.Close`'s error is discarded by design — the registry's contract has no return value, and there is no useful action on a "close-after-close" or "close-after-network-error" report.
104 changes: 104 additions & 0 deletions docs/knowledge/features/ws-conn-adapter.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
# WSConn — WebSocket adapter for the registry

`WSConn` is the single adapter that wraps `nhooyr.io/websocket.Conn` so it satisfies the registry's `Conn` interface ([connection-registry.md](connection-registry.md)). The relay's WS upgrade handlers (`/v1/server` in #4/#16, `/v1/client` in #5) hand the registry a `*WSConn` rather than each inventing its own wrapper. Frame forwarding, broadcasts, and the future heartbeat ticket all reach the wire through this type.

The adapter is small on purpose: one file, one mutex, one one-shot guard, no goroutines, no queues. Everything else (handshake, header validation, close-code semantics, ping/pong, read-side frame loop) lives in adjacent tickets.

## API

Package `internal/relay` (`ws_conn.go`):

```go
type WSConn struct { /* *websocket.Conn + connID + writeMu + closeOnce + closeCtx */ }

func NewWSConn(c *websocket.Conn, connID string) *WSConn

func (w *WSConn) ConnID() string
func (w *WSConn) Send(msg []byte) error
func (w *WSConn) Close()
```

Plus one package-level constant:

```go
const writeTimeout = 10 * time.Second
```

`writeTimeout` bounds a single `Send`. A slow peer cannot stall a caller past this deadline.

## Concurrency model

| Method | Lock | Context | Concurrent-safe with |
|---|---|---|---|
| `ConnID` | none | none | every other method |
| `Send` | `writeMu` (held across one frame write) | `WithTimeout(closeCtx, writeTimeout)` | other `Send` (serialised), `Close` (cancellation breaks it), `ConnID` |
| `Close` | `closeOnce` (one shot) | cancels `closeCtx` | `Send` (in-flight write is cancelled), other `Close` (no-op), `ConnID` |

One mutex, one one-shot. The lock graph is a single node. There are no callbacks, no channels, no goroutines spawned by the adapter.

`Close` deliberately does **not** take `writeMu`. Acquiring it would deadlock against a slow peer holding the mutex inside `Send`: the whole point of cancelling `closeCtx` is to abort the in-flight `Write` so it releases the mutex on its own. `nhooyr.io/websocket.Conn.Close` is documented to be safe with an in-flight `Write` — that property is why this library was chosen.

## Context strategy

The non-trivial design problem is the `Send([]byte) error` ↔ `Conn.Write(ctx, ...)` API mismatch: the registry's contract has no context, but the library requires one for every write. The adapter owns its own context:

- `NewWSConn` derives a `closeCtx`/`cancel` pair from `context.Background()`. The context's only purpose is to be cancelled by `Close`.
- Each `Send` derives a per-call `context.WithTimeout(closeCtx, writeTimeout)`, calls `Write`, cancels the timer.
- `Close` cancels `closeCtx` first, then calls the library's `Close`. After that, every queued or future `Send` finds an already-cancelled context and `Write` short-circuits before touching the socket — no partial frame goes out.

`writeTimeout = 10s` is conservative — long enough that a momentarily-laggy mobile client on a poor network is not killed mid-frame, short enough that one hung `Send` cannot park a goroutine indefinitely. Same order of magnitude as `http.Server.WriteTimeout: 60s` (which bounds a whole upgrade response) but tighter because it bounds a single forwarded frame. Constant lives in one place; revisit when the heartbeat ticket (#7) gives real RTT signal.

See [ADR-0004](../decisions/0004-ws-library-and-adapter-context-strategy.md) for the alternatives considered (`context.Background()` only; caller-supplied context).

## What the adapter deliberately does NOT do

These are documented so the next contributor doesn't add defensive code that doesn't belong:

- **No `connID` validation.** Length, charset, uniqueness are owned by the conn-id-scheme ticket. The adapter treats the id as opaque.
- **No handshake / header validation / subprotocol selection.** Lives at the upgrade boundary (#4/#16, #5).
- **No ping/pong.** Heartbeat is #7.
- **No read-side frame loop or envelope wrap/unwrap.** That is #6.
- **No per-conn send queue / backpressure / rate limit.** `Send` writes synchronously and returns. None of those are in the registry's `Conn` contract.
- **No close-on-`Send`-error.** Caller observes the error and chooses to call `Close`.
- **No close-code semantics beyond `StatusNormalClosure`.** The adapter doesn't know why the registry asked it to close. Close-code mapping (`4401`/`4404`/`4409`) is the upgrade handler's job.
- **No nil-checks on `c`.** Passing a nil `*websocket.Conn` is a programmer error.

## Caller invariant: WSConn owns the underlying conn

Documented on `NewWSConn`: after construction, callers must reach the connection only through `WSConn` methods. Calling `c.Write` or `c.Close` directly defeats the adapter's serialisation and cancellation guarantees. There is no defensive code to enforce this — the adapter cannot keep a global registry of wrapped `*websocket.Conn` — so it lives as a doc-comment contract on the constructor.

## Adversarial framing

The adapter is on the hot path: every routed frame passes through `WSConn.Send`. Threats considered:

- **Slow-loris on the write side.** A peer that completes the handshake and stops reading would, with `context.Background()` writes, hang every `Send` forever. Mitigated by `writeTimeout = 10s` — the first code-level slow-loris defence in the relay. A flood of such peers consumes one goroutine each for at most 10 seconds; connection-count caps belong to the WS upgrade ticket.
- **Concurrent `Send` corrupting frames.** `nhooyr.io/websocket.Conn.Write` is *not* safe to call from multiple goroutines at once — concurrent calls would interleave bytes mid-frame. `writeMu` serialises every `Send`. Race-detector test under `-race` is the verification, not a smoke test.
- **`Close` while many goroutines are blocked on `writeMu`.** `Close` cancels `closeCtx` without taking `writeMu`. The currently-writing goroutine's `ctx` is cancelled and `Write` returns. Each next-in-line goroutine acquires the mutex, derives a `WithTimeout(closeCtx, …)` from an already-cancelled parent, and `Write` short-circuits with the cancellation error. No goroutine is stuck.
- **`Close` racing with mid-`Write` `Send`.** Library is documented to handle this; relying on that property is explicit. If the property regresses in a future library version, the race test surfaces a `DATA RACE`.
- **Same `*websocket.Conn` wrapped by two `WSConn` constructors.** Each gets its own `writeMu`; serial-write guarantee is broken; the wire interleaves. Caller-invariant; not enforced.
- **Adversarial `connID` bytes.** Stored and returned as-is; no interpretation. Charset/length is the conn-id-scheme ticket's job.

The `security-sensitive` label was applied because `Send` is on every routed-frame path. Verdict from review: PASS — one new code-level slow-loris mitigation, no widening of documented threat surface beyond the supply-chain cost the threat model already names for adding any WS library.

## Testing

`internal/relay/ws_conn_test.go`, `package relay`. End-to-end against a real `*websocket.Conn` via `httptest.NewServer` — no library mocks. The `startEcho` helper spins up a server whose handler runs a tiny read-loop pushing received frames onto a buffered channel; the test gets back a connected `*WSConn` (client side) and the channel.

Tests (1:1 with the AC):

- `TestWSConn_ConnID_ReturnsConstructorValue` — locks the contract.
- `TestWSConn_ConcurrentSend_ProducesIntactFrames` — N=16 goroutines each `Send` a unique tagged payload; the test asserts N intact, distinct frames. Race detector under `-race` is the primary signal; interleaved frames would surface as malformed messages or corrupted writes.
- `TestWSConn_DoubleClose_DoesNotPanic`.
- `TestWSConn_SendAfterClose_ReturnsError` — non-nil error is the contract; the specific error type is library-dependent (any of `context.Canceled`, a closed-connection error, or a wrapped variant) and not asserted on.

What we deliberately do not test: the library's behaviour itself (we trust `Write` to honour `ctx` and `Close` to be safe with in-flight `Write`); unit-level mocks of the library; performance.

## Dependency

`nhooyr.io/websocket` (v1.8.x line). First WS library in the project; second non-stdlib direct dep (after `golang.org/x/crypto` for autocert). The vanity import path remains `nhooyr.io/websocket` even though the upstream module home moved to `github.com/coder/websocket`. `make lint` (`govulncheck`) covers known CVEs; the residual supply-chain risk (a malicious release tagged by an authentic maintainer would see every routed frame in cleartext) is named in `docs/threat-model.md` § "Supply chain — Go dependencies."

## Related

- [ADR-0004: WS library choice and adapter context strategy](../decisions/0004-ws-library-and-adapter-context-strategy.md)
- [Connection registry](connection-registry.md) — the `Conn` interface this implements.
- [Threat model](../../threat-model.md) — slow-loris and supply-chain framings.
Loading