Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions cmd/pyrycode-relay/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,7 @@ func main() {

mux := http.NewServeMux()
mux.Handle("/healthz", relay.NewHealthzHandler(reg, Version, startedAt))
mux.Handle("/v1/server", relay.ServerHandler(reg, logger))

if *insecureListen != "" {
logger.Info("starting", "version", Version, "mode", "insecure", "listen", *insecureListen)
Expand Down
9 changes: 7 additions & 2 deletions docs/PROJECT-MEMORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,9 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
| TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` |
| `WSConn` adapter (`nhooyr.io/websocket.Conn` → registry `Conn`; per-conn write mutex; `Close`-cancelled context; 10s `Send` deadline) | Done (#15) | `internal/relay/ws_conn.go` |
| `/healthz` JSON endpoint (`status`, `version`, `connected_binaries`, `connected_phones`, `uptime_seconds`; `Cache-Control: no-store`; unauthenticated) | Done (#10) | `internal/relay/healthz.go`, `cmd/pyrycode-relay/main.go` |
| WS upgrade on `/v1/server` and `/v1/client` | Not started | — |
| Header validation (`x-pyrycode-server`, `x-pyrycode-token`) | Not started | — |
| WS upgrade on `/v1/server` (header gate pre-upgrade; `ClaimServer`; `4409` on conflict; `CloseRead`-held until #6) | Done (#16) | `internal/relay/server_endpoint.go`, `cmd/pyrycode-relay/main.go` |
| WS upgrade on `/v1/client` | Not started | — |
| Header validation (`x-pyrycode-server`, `x-pyrycode-version`, `user-agent` on `/v1/server`; `x-pyrycode-token` on `/v1/client`) | `/v1/server` done (#16); `/v1/client` not started | `internal/relay/server_endpoint.go` |
| Frame forwarding using the routing envelope | Not started | — |
| `conn_id` generation scheme | Not started | — |
| Threat model doc — operational surface (deploy, supply chain, DoS, log hygiene, cert handling, TLS, error leakage) | Done (#11) | `docs/threat-model.md` |
Expand All @@ -31,6 +32,10 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
- **Interface methods called under the lock are documented as non-blocking getters.** The registry's `Conn.ConnID()` is invoked under the write lock during `UnregisterPhone`; `Send` and `Close` are never called while the lock is held. Pattern: state the contract on the interface, never call something that could block on I/O while a mutex is held.
- **Adapters bridge interface↔library API mismatches by owning policy locally.** When a library method needs a `context.Context` but the registry's `Conn` interface doesn't take one (and shouldn't — most callers don't have a context to thread), the adapter owns its own context: derived in the constructor, cancelled by `Close`, narrowed per-call with `WithTimeout` for deadline policy. Adopted in `WSConn` (#15); avoids forcing context-plumbing changes into upstream interfaces.
- **Handler factories return `http.Handler`, not `http.HandlerFunc`.** `NewHealthzHandler(reg, version, startedAt) http.Handler` keeps construction factory-shaped so a future ticket adding per-handler state (logger, injectable clock) can do so without touching the call site in `main`. Adopted in `/healthz` (#10).
- **Validate adversarial input *before* allocating WS state.** Header-gate checks (presence + non-empty) run before `websocket.Accept`; a malformed request never receives an upgrade response and the registry is untouched on every error path. Adopted in `/v1/server` (#16); the same shape applies to `/v1/client` token validation (#5).
- **Application WS close codes are emitted on the underlying `*websocket.Conn`, not via `WSConn`.** `WSConn.Close()` always emits `StatusNormalClosure` — no policy parameters. Handlers that need `4409` / `4401` / `4404` call `c.Close(websocket.StatusCode(code), reason)` directly in the stillborn-WSConn window (after construction, before any `Send`). See [ADR-0005](knowledge/decisions/0005-application-close-codes-via-underlying-conn.md). Adopted in `/v1/server` (#16); applies to `/v1/client` (#5).
- **`defer { Release; Close; log }` is registered AFTER the successful claim**, never before. A conflict path (or any pre-claim error) returns before the defer is in place, so we never `ReleaseServer` on a slot we don't hold. Adopted in `/v1/server` (#16); same structural ordering applies to `/v1/client`'s phone registration.
- **Hold long-lived WS handlers open with `c.CloseRead(r.Context())` plus `<-readCtx.Done()` until the real read loop lands.** `CloseRead` drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close). The frame loop ticket replaces `<-readCtx.Done()` with the actual read body in the same call site. Adopted in `/v1/server` (#16) pending #6.
- **Capture process-state timestamps in `main` after `flag.Parse()`, not as package-level vars.** `startedAt := time.Now()` lives inside `main` and is passed into the handler factory. A package-level `var startedAt = time.Now()` would fire at import time — before flag parsing, before `--version` early-returns — and be wrong for short-lived test binaries and any future deferred-serve setup. Adopted in #10.

## Conventions
Expand Down
2 changes: 2 additions & 0 deletions docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, holds the conn via `CloseRead` until #6's frame loop replaces it.
- [`/healthz` JSON endpoint](features/healthz.md) — unauthenticated `GET /healthz` returning `{status, version, connected_binaries, connected_phones, uptime_seconds}`; `Cache-Control: no-store`, body bounded ≈135 bytes.
- [WSConn adapter](features/ws-conn-adapter.md) — wraps `nhooyr.io/websocket.Conn` to satisfy the registry's `Conn`; owns the per-conn write mutex and a `Close`-cancelled context with a 10s per-`Send` deadline.
- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`.
Expand All @@ -12,6 +13,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Decisions

- [ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`](decisions/0005-application-close-codes-via-underlying-conn.md) — `WSConn.Close()` stays minimal (`StatusNormalClosure`); handlers emit `4409` / `4401` / `4404` directly on the `*websocket.Conn` in stillborn-WSConn paths.
- [ADR-0004: WS library choice and adapter context strategy](decisions/0004-ws-library-and-adapter-context-strategy.md) — `nhooyr.io/websocket` v1.8.x; adapter-owned context cancelled by `Close`, 10s per-`Send` deadline; registry's `Send` signature stays context-free.
- [ADR-0003: Connection registry as a passive store](decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, narrow `ReleaseServer` so grace logic (#8) can wrap it.
- [ADR-0002: Explicit-failure 404 on `:80` instead of HTTPS redirect](decisions/0002-autocert-explicit-failure-on-port-80.md) — pass `http.NotFoundHandler()` to `autocert.Manager.HTTPHandler`, not `nil`; loud failure over silent scheme-switching.
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`, not `WSConn`

**Status:** Accepted (#16)
**Date:** 2026-05-08

## Context

`WSConn` ([ADR-0004](0004-ws-library-and-adapter-context-strategy.md), #15) is the registry's `Conn`-interface adapter over `nhooyr.io/websocket.Conn`. Its `Close()` always emits `StatusNormalClosure` — that is the registry's contract: "drop this connection cleanly," with no policy on *why*.

The protocol defines application close codes in the 4000–4999 range:

- `4401` — token invalid (on `/v1/client`, #5).
- `4404` — no server for `serverID` (on `/v1/client`, #5).
- `4409` — `serverID` already claimed (on `/v1/server`, #16).

Plus future frame-forward error codes on both endpoints. The upgrade handlers need to emit these, but `WSConn` does not know *why* a close is happening — that knowledge lives in the handler.

So: where does the code that picks a close code (and the human-readable reason) belong?

## Decision

**Application close codes are emitted by calling `Close(code, reason)` directly on the underlying `*websocket.Conn`, not through `WSConn`.** `WSConn.Close()` keeps a fixed `StatusNormalClosure` body. The handler that holds the policy holds the close-code call site.

Specifically: at the moment a handler decides to refuse a connection (e.g. `4409` on duplicate claim), it calls `c.Close(websocket.StatusCode(code), reason)` on the `*websocket.Conn` it received from `websocket.Accept`, before `wsconn.Close()` would have run. The `WSConn` was constructed but no `Send` was attempted and no goroutine holds `writeMu` — a stillborn WSConn — so the WSConn invariant ("callers must reach the connection only through WSConn methods") is preserved in spirit: the conn was never *used* through the adapter.

Adopted in `internal/relay/server_endpoint.go` (#16) for `4409`. The same pattern applies to `4401` / `4404` in `/v1/client` (#5).

## Rationale

Three options were on the table:

1. **Add `WSConn.CloseWithCode(code websocket.StatusCode, reason string)`.** Inflates the adapter's API surface for one caller per close code. Each new code adds a method or makes the existing one's signature creep. The adapter would also need policy on what to do if `Close` was already called — the `closeOnce` guard exists precisely to make `Close` idempotent, and a new method would have to either share the guard (then which code wins?) or duplicate it. Rejected — the adapter's small-surface property is load-bearing for review (#15's security review hinged on it).

2. **Pass the close code into `WSConn` at construction time, store it, and emit it from `Close`.** Doesn't fit: the close code at construction time is *not* the close code the handler eventually decides on. By the time we know it's `4409`, the WSConn already exists (we constructed it to pass to `ClaimServer`).

3. **Direct `c.Close(code, reason)` from the handler, no method on `WSConn`.** Adopted. Keeps `WSConn` a closed, narrow type. Keeps the close-code knowledge in the handler that holds the policy.

Option 3 has one cost: it documents an exception to WSConn's "all access goes through the adapter" invariant. The exception is bounded by structure — a stillborn WSConn before any `Send`, no concurrent writer to interleave with, no `writeMu` contender — so the spirit of the invariant survives. The handler comments name the exception so a future reader doesn't generalise it into a "you can use `c` directly any time" pattern.

## Consequences

- `WSConn.Close()` stays minimal: `StatusNormalClosure`, idempotent, owns the cancel of `closeCtx`. No close-code variants, no policy parameters.
- Handlers that emit application close codes do so on the `*websocket.Conn` they received from `websocket.Accept`, before the WSConn would have been closed via `defer`. The handler must comment the exception inline so it doesn't read as a violation.
- The pattern only applies to **stillborn WSConn** paths — pre-`Send`, pre-defer. Once a WSConn is in use (registered with `Send` reachable from broadcasters), close goes through `WSConn.Close()` and the code is `StatusNormalClosure`. Application close codes for an *active* WSConn (e.g. binary disconnects mid-stream and we want to surface a forward-error code) need a different solution — likely a small `WSConn.CloseWithCode` added at the time the first such use case lands, justified by an ADR amendment.
- Each new application close code (`4401`, `4404`, future frame-forward codes) follows this pattern at its handler, not by widening `WSConn`.

## Related

- [ADR-0003](0003-connection-registry-passive-store.md) — registry's `Conn` interface and the `Send/Close` contract `WSConn` implements.
- [ADR-0004](0004-ws-library-and-adapter-context-strategy.md) — why `WSConn.Close()` cancels `closeCtx` instead of taking `writeMu`; the property the stillborn-conn case relies on.
- [Server endpoint feature](../features/server-endpoint.md) — first concrete user of this pattern (`4409`).
- [WSConn adapter](../features/ws-conn-adapter.md) — the type whose surface this ADR keeps minimal.
Loading