diff --git a/.gitignore b/.gitignore index 87e1b05..1aab78e 100644 --- a/.gitignore +++ b/.gitignore @@ -5,6 +5,14 @@ bin/ *.out coverage.txt +# Default `go build ./...` from repo root emits a binary named after the +# module's last path segment ("pyrycode-relay") into CWD. Anchored with +# leading slash so it matches the repo-root binary only, not any future +# package directory of the same name. Same shape as the .codegraph fix +# below — auto-commit safety nets need a trustworthy gitignore. Salvage +# on PR #27 swept the binary in before this entry existed. +/pyrycode-relay + # Go go.work go.work.sum diff --git a/docs/PROJECT-MEMORY.md b/docs/PROJECT-MEMORY.md index f6c3b49..0d6e151 100644 --- a/docs/PROJECT-MEMORY.md +++ b/docs/PROJECT-MEMORY.md @@ -46,6 +46,8 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex - **Hold long-lived WS handlers open with `c.CloseRead(r.Context())` plus `<-readCtx.Done()` until the real read loop lands.** `CloseRead` drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close). The frame-loop ticket replaces both the `CloseRead` call and the `<-readCtx.Done()` block with the actual read pump in the same call site — keeping `CloseRead` alongside a real reader would race the sole-reader contract. Adopted in `/v1/server` (#16) and `/v1/client` (#5); `/v1/client` swapped to `StartPhoneForwarder` in #25 (`/v1/server`'s placeholder remains until the binary-side forwarder lands). - **Forwarder owns reading; handler owns cleanup.** The phone-side frame forwarder (#25) is a pure read pump — no `UnregisterPhone`, no `wsconn.Close()`, no extra goroutine. It runs synchronously on the HTTP handler goroutine; on return, the handler's `defer { UnregisterPhone; Close; log }` runs in the right order. Adding cleanup inside the forwarder would either double-close (idempotent, but muddies the lifecycle) or unregister twice. 
Pattern extends to the symmetric binary-side forwarder. - **Define narrow read-side interfaces at the consumer, not on the adapter.** `phoneSource` (`ConnID() + Read(ctx)`) lives in `forward.go`, not on `WSConn`. Production passes `*WSConn`; tests substitute a fake without touching `WSConn`. If a future caller needs the same shape, promote the interface then — don't anticipate. Adopted in #25. +- **Per-conn goroutines exit cleanly via LIFO defers, not by closing the conn themselves.** Pattern for any goroutine launched alongside an upgraded WS conn (heartbeat, future binary-side ack pump, etc.): derive `hbCtx, cancelHB := context.WithCancel(r.Context())`, register `defer cancelHB()` *after* the release/unregister defer, then `go runX(hbCtx, wsconn, ...)`. Under LIFO unwind `cancelHB` fires first → goroutine observes `ctx.Done()` and returns without touching the conn → release defer's `wsconn.Close()` runs unopposed. The goroutine owns ONLY the failure-path close (heartbeat: `CloseWithCode(1011, "heartbeat timeout")`); cleanup-path close is the handler defer's. `closeOnce` makes the two paths idempotent against each other regardless of which fires first. Adopted in heartbeat (#7); same shape applies to any future per-conn watchdog/timer goroutine. +- **Active-conn application close codes go through `WSConn.CloseWithCode`; stillborn-conn close codes go through the underlying `*websocket.Conn`.** ADR-0005's stillborn-conn pattern (`c.Close(websocket.StatusCode(4409), reason)` directly on the underlying conn before any `Send` could have run) coexists with ADR-0007's active-conn pattern (`wsconn.CloseWithCode(1011, "heartbeat timeout")` post-claim, where `closeOnce`/`writeMu` invariants are in play). Both share the same `closeOnce` guard via `Close()` delegating to `CloseWithCode(StatusNormalClosure, "")`. The two patterns describe structurally distinct windows of a WSConn's lifecycle. Adopted: stillborn `4409`/`4404` (#16, #5); active `1011` (#7). 
- **Capture process-state timestamps in `main` after `flag.Parse()`, not as package-level vars.** `startedAt := time.Now()` lives inside `main` and is passed into the handler factory. A package-level `var startedAt = time.Now()` would fire at import time — before flag parsing, before `--version` early-returns — and be wrong for short-lived test binaries and any future deferred-serve setup. Adopted in #10. ## Conventions diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index 3104fa3..cd7aa1b 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Features +- [WebSocket heartbeat](features/heartbeat.md) — per-conn goroutine on both endpoints sends RFC 6455 ping every 30s; closes with `1011 "heartbeat timeout"` if no pong within 30s. Detects half-open TCP within 60s; ctx-cancel exit path leaves close to the handler defer (#7). - [Phone-side frame forwarder](features/phone-forwarder.md) — per-phone read pump: wraps each inbound phone frame in the routing envelope keyed by the phone's `conn_id` and `Send`s it to the binary holding `serverID`; opaque inner bytes; synchronous (handler discards the return); replaced `/v1/client`'s `CloseRead` placeholder; added `WSConn.Read` (single-caller) (#25). - [`/v1/client` WS upgrade](features/client-endpoint.md) — phone-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Token` / `User-Agent` pre-upgrade (token presence-only, never parsed/logged); registers phone on the binary's slot, emits `4404` if no binary holds the id, hands the conn to `StartPhoneForwarder` for the data path (#5, #25). 
- [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, holds the conn via `CloseRead` until #6's frame loop replaces it; on disconnect schedules a 30s grace release so a quick reconnect inherits the slot (#21). @@ -15,6 +16,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Decisions +- [ADR-0007: `WSConn.CloseWithCode` for active-conn application close codes](decisions/0007-wsconn-closewithcode-for-active-conn.md) — extends ADR-0005 for the post-claim window: heartbeat (#7) needs `1011 "heartbeat timeout"` on a live WSConn; `Close()` delegates to `CloseWithCode(StatusNormalClosure, "")`, both share `closeOnce`. - [ADR-0006: Grace window IS the reclaim path](decisions/0006-grace-period-as-reclaim-path.md) — `ClaimServer` during a pending grace timer succeeds (not `4409`); pointer-identity wrapper defends against stale `time.AfterFunc` fires after `Stop()`. - [ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`](decisions/0005-application-close-codes-via-underlying-conn.md) — `WSConn.Close()` stays minimal (`StatusNormalClosure`); handlers emit `4409` / `4401` / `4404` directly on the `*websocket.Conn` in stillborn-WSConn paths. - [ADR-0004: WS library choice and adapter context strategy](decisions/0004-ws-library-and-adapter-context-strategy.md) — `nhooyr.io/websocket` v1.8.x; adapter-owned context cancelled by `Close`, 10s per-`Send` deadline; registry's `Send` signature stays context-free. diff --git a/docs/knowledge/codebase/7.md b/docs/knowledge/codebase/7.md new file mode 100644 index 0000000..94da528 --- /dev/null +++ b/docs/knowledge/codebase/7.md @@ -0,0 +1,36 @@ +# Ticket #7 — WebSocket heartbeat (RFC 6455 ping/pong every 30s, close at 60s) + +Per-connection liveness regime for both endpoints. 
A goroutine launched after each successful claim/register sends an RFC 6455 `Ping` every 30s; if the matching pong does not arrive within 30s, it closes the conn with `1011 server error` reason `"heartbeat timeout"`. Worst-case dead-connection detection: 60s. The relay's only mechanism for detecting half-open TCP from disappeared peers (NAT timeout, mobile interface flap, kernel panic, severed cable). + +## Implementation + +- **`internal/relay/heartbeat.go` (new)** — file-level constants `heartbeatInterval = 30s` and `heartbeatTimeout = 30s`; one unexported function `runHeartbeat(ctx, *WSConn, interval, timeout)`. `time.NewTicker` (not `time.Tick`) with `defer ticker.Stop()`; per-tick `context.WithTimeout` for the `Ping`; cancellation invariant — two `return` paths bypass the close (`<-ctx.Done()` between ticks, and `ctx.Err() != nil` after a Ping that returned ctx-cancellation), exactly one `CloseWithCode` call site gated on `ctx.Err() == nil`. +- **`internal/relay/ws_conn.go`** — surface widened by exactly two methods. `Ping(ctx) error` is a pure forwarder over `*websocket.Conn.Ping`; does not take `writeMu` (the library serialises control frames against writes internally). `CloseWithCode(code, reason)` is `closeOnce`-guarded; cancels `closeCtx` then calls `c.Close(code, reason)`. `Close()` is refactored to delegate: `CloseWithCode(StatusNormalClosure, "")` — both share one `closeOnce`, so whichever close-family call fires first wins. Behaviour of the existing `Close()` is unchanged. +- **`internal/relay/server_endpoint.go` and `internal/relay/client_endpoint.go`** — identical insertion in both handlers, immediately after the existing release/unregister defer is in place: `hbCtx, cancelHB := context.WithCancel(r.Context()); defer cancelHB(); go runHeartbeat(hbCtx, wsconn, heartbeatInterval, heartbeatTimeout)`. 
LIFO defer order: `cancelHB` fires before the release defer's `wsconn.Close()`, so the goroutine exits cleanly on the peer-close path before the conn is closed by the defer. On the heartbeat-timeout path, the goroutine has already emitted `CloseWithCode(1011, ...)`; the subsequent defer-driven `Close()` is a `closeOnce` no-op. +- **`internal/relay/heartbeat_test.go` (new)** — `startHeartbeatPair(t, interval, timeout)` helper sets up `httptest.NewServer` whose handler accepts the upgrade, wraps in `WSConn`, calls `c.CloseRead(r.Context())` (mirrors production handler shape so the server side processes incoming pongs), runs `runHeartbeat` in a goroutine. Three tests, 1:1 with AC bullets: + - `TestHeartbeat_HealthyPeer_KeepsConnOpen` — client `CloseRead` drives auto-pong; assert `done` is not closed across ~3 ping cycles. + - `TestHeartbeat_UnresponsivePeer_TriggersClose` — client deliberately does NOT read; sleep past `interval+timeout`, then a single `client.Read` returns `*websocket.CloseError{Code:1011, Reason:"heartbeat timeout"}`. `done` deadline is 7s because nhooyr's `Conn.Close` performs a 5s close handshake when the peer never reciprocates. + - `TestHeartbeat_ContextCancelled_ExitsCleanly` — cancel ctx; assert `done` closes within 1s; assert `client.Read` returns `context.DeadlineExceeded`, NOT a `CloseError` (proves heartbeat did not close on the cancel path). +- Production constants (30s/30s) are exercised implicitly by existing handler tests at sub-second flows; an explicit slow integration test that waits 30s is out of scope. + +## Concurrency + +3 goroutines per accepted upgrade (was 2 pre-#7): handler, `CloseRead` drain, heartbeat. No leak path — `defer cancelHB()` is unconditional once heartbeat has been launched, and heartbeat is launched only after the release defer is in place. `Ping` and `Send` interleave correctly via the library's internal control-frame serialisation; `WSConn.writeMu` guards Send-vs-Send only. 
+ +## Deliberately out of scope + +- No application-level keepalive messages (would be a protocol change). +- No per-endpoint interval differences (phone vs binary). +- No adaptive intervals based on connection quality. +- No flag/config exposure of interval/timeout — file-level constants only. Revisit only if mobile-flapping evidence emerges. +- No new heartbeat-induced-close log line. The handler's `server_released` / `phone_unregistered` defer log fires regardless. Spec named diagnosability as an open question; deferred until ops evidence motivates it. + +## Cross-links + +- [Feature: WebSocket heartbeat](../features/heartbeat.md) — full design / cancellation invariant / close-handshake-timing / testing rationale. +- [ADR-0007: `WSConn.CloseWithCode` for active-conn application close codes](../decisions/0007-wsconn-closewithcode-for-active-conn.md) — the surface widening this ticket required; amends ADR-0005. +- [ADR-0005: Application WS close codes via the underlying `*websocket.Conn`](../decisions/0005-application-close-codes-via-underlying-conn.md) — still applies to stillborn-WSConn paths (`4409`/`4404`/`4401`). +- [WSConn adapter](../features/ws-conn-adapter.md) — `Ping` and `CloseWithCode` were added here. +- [`/v1/server` WS upgrade](../features/server-endpoint.md), [`/v1/client` WS upgrade](../features/client-endpoint.md) — the two wiring sites. +- [Spec: 7-heartbeat](../../specs/architecture/7-heartbeat.md) — the architect's pre-implementation design. +- [Protocol spec § Heartbeat](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#heartbeat) — authoritative source for 30s/30s/60s. 
diff --git a/docs/knowledge/decisions/0007-wsconn-closewithcode-for-active-conn.md b/docs/knowledge/decisions/0007-wsconn-closewithcode-for-active-conn.md new file mode 100644 index 0000000..c53b932 --- /dev/null +++ b/docs/knowledge/decisions/0007-wsconn-closewithcode-for-active-conn.md @@ -0,0 +1,65 @@ +# ADR-0007: `WSConn.CloseWithCode` for active-conn application close codes + +**Status:** Accepted (#7) +**Date:** 2026-05-10 + +## Context + +[ADR-0005](0005-application-close-codes-via-underlying-conn.md) decided that application WS close codes (`4409`, `4404`, `4401`) are emitted by calling `Close(code, reason)` directly on the underlying `*websocket.Conn`, **not** through `WSConn`. The decision held the adapter's surface narrow: one `Close()` whose body is fixed to `StatusNormalClosure`. The justification rested on the close-codes-so-far being emitted from a *stillborn-WSConn* window — after construction, before any `Send`, before any defer wires the conn into the registry — where the WSConn's `closeOnce`/`writeMu` invariants are not in play. + +ADR-0005 anticipated *this* ticket explicitly: + +> The pattern only applies to **stillborn WSConn** paths — pre-`Send`, pre-defer. Once a WSConn is in use (registered with `Send` reachable from broadcasters), close goes through `WSConn.Close()` and the code is `StatusNormalClosure`. Application close codes for an *active* WSConn (e.g. binary disconnects mid-stream and we want to surface a forward-error code) need a different solution — likely a small `WSConn.CloseWithCode` added at the time the first such use case lands, justified by an ADR amendment. + +The heartbeat ticket (#7) is that first use case. A goroutine running alongside the registered conn must, on a peer that stops responding to RFC 6455 pings, close the WS with `1011 server error` + reason `"heartbeat timeout"`. 
By the time heartbeat decides to close, the WSConn has been live in the registry — concurrent broadcasters may be in `Send`, the close defer is wired up, `closeOnce` is the only safe single-funnel for the close. + +So: where does the `1011 + "heartbeat timeout"` close-code call site live? + +## Decision + +**Add `WSConn.CloseWithCode(code websocket.StatusCode, reason string)`.** Refactor `WSConn.Close()` to delegate (`CloseWithCode(StatusNormalClosure, "")`). Both share the same `closeOnce` guard and both cancel `closeCtx` before calling the underlying `Close`. + +Concretely: + +```go +func (w *WSConn) CloseWithCode(code websocket.StatusCode, reason string) { + w.closeOnce.Do(func() { + w.cancel() + _ = w.conn.Close(code, reason) + }) +} + +func (w *WSConn) Close() { + w.CloseWithCode(websocket.StatusNormalClosure, "") +} +``` + +`closeOnce` ensures that whichever Close-family call fires first wins. If heartbeat fires `CloseWithCode(1011, "heartbeat timeout")` and the handler defer subsequently calls `wsconn.Close()`, the second call is a no-op — the wire saw 1011. Conversely, on the clean-shutdown path the defer's `Close()` runs first and any later `CloseWithCode` from a goroutine racing with the cancel becomes a no-op. Single-funnel guarantee preserved. + +Also added: `WSConn.Ping(ctx context.Context) error` — a pure forwarder over `c.Ping`. Heartbeat uses `Ping` rather than reaching for the underlying `*websocket.Conn`, so `runHeartbeat` operates entirely above the `WSConn` boundary. + +ADR-0005 still applies for *stillborn* WSConn paths. The pattern there (handler calls `c.Close(websocket.StatusCode(4409), reason)` directly on the underlying conn before any `Send` could have run) remains the right shape for `4409`/`4404`/`4401` — those close codes are decided pre-claim, where the WSConn is provably stillborn. This ADR extends the surface specifically for *post-claim* close-code emission, which is structurally distinct. 
+ +## Rationale + +The three options ADR-0005 weighed (add a method; embed the code in the constructor; emit on the underlying conn from the call site) get re-evaluated against the active-conn case: + +1. **Add `WSConn.CloseWithCode`.** Adopted. Two callers (`Close`, future post-claim closes) share the `closeOnce` funnel through one body. Adapter surface grows by one method but keeps its single-mutex / single-once shape intact. No new `closeOnce` (rejected in ADR-0005 as "which close wins?") because `Close` now *delegates* to `CloseWithCode` — the body is a single function, the guard a single `sync.Once`, so the question of which close wins is settled by `Once.Do` semantics, exactly as before. +2. **Pass the close code into `WSConn` at construction.** Same reason it failed in ADR-0005: the close code at construction time is *not* the code the caller eventually decides on. Heartbeat decides "1011 because timeout" only after a Ping fails; the WSConn was constructed at `RegisterPhone`/`ClaimServer` time long before. +3. **Emit on the underlying `*websocket.Conn` from the heartbeat call site.** Rejected for the post-claim case. Reaching for `c` from a goroutine that holds neither `writeMu` nor `closeOnce` would race the adapter's invariants: a concurrent `Send` could be holding `writeMu` and have a `Write` mid-flight; a concurrent `Close` (handler defer) could be running `c.Close` independently. The library is documented to handle concurrent `Close` calls safely, but our `closeOnce` guarantee — "exactly one close-frame leaves this WSConn" — would be violated. The stillborn-conn carve-out from ADR-0005 cannot be extended here because the conn is no longer stillborn. + +The cost of (1) is one method's worth of API surface widening, kept minimal: same shape as `Close`, same idempotency guarantee, no new state field, no new mutex. The benefit is that any future "active-conn close with custom code" (frame-forward error codes from #6, e.g.) 
plugs into the same call site without a fresh ADR amendment. + +## Consequences + +- `WSConn.Close()` is now defined as `CloseWithCode(StatusNormalClosure, "")`. Behaviour is unchanged — same close code, same idempotency, same `closeCtx` cancel — but the implementation funnels through `CloseWithCode`. Existing call sites (handler defers, registry-side teardown) require no change. +- `WSConn` now exposes `Ping(ctx)` as a pure forwarder. Used by `runHeartbeat`; safe to use for any future caller that needs RFC 6455 ping/pong from above the adapter boundary. +- Application close codes for **active** WSConns are emitted via `wsconn.CloseWithCode(code, reason)`. The first user is `runHeartbeat` (`1011`/`"heartbeat timeout"`). Future frame-forward error codes on `/v1/server` and `/v1/client` (per #6) follow the same call site. +- ADR-0005 remains in force for **stillborn** WSConns: handlers continue to call `c.Close(websocket.StatusCode(4409), reason)` directly for `4409`, `4404`, `4401`, with the inline comment naming the exception. The two patterns coexist — they describe structurally distinct windows of a WSConn's lifecycle. + +## Related + +- [ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`](0005-application-close-codes-via-underlying-conn.md) — the parent decision; this ADR is the explicit amendment ADR-0005 named. +- [ADR-0004: WS library choice and adapter context strategy](0004-ws-library-and-adapter-context-strategy.md) — `closeCtx` cancellation is what `CloseWithCode` cancels, same as `Close`. +- [Heartbeat feature](../features/heartbeat.md) — first user of `CloseWithCode` and `Ping`. +- [WSConn adapter](../features/ws-conn-adapter.md) — the type whose surface this ADR widens by exactly two methods. 
diff --git a/docs/knowledge/features/client-endpoint.md b/docs/knowledge/features/client-endpoint.md index d02ce01..50ea98b 100644 --- a/docs/knowledge/features/client-endpoint.md +++ b/docs/knowledge/features/client-endpoint.md @@ -4,7 +4,7 @@ This is the public, internet-exposed endpoint. The peer is *less* trusted than `/v1/server`'s peer (which at least runs operator-issued software); anyone on the internet who learns the relay hostname can connect. Header validation runs **before** `websocket.Accept`; the token is presence-checked only and never logged. -This is the phone side only. Heartbeat is a future ticket. After header validation and `RegisterPhone`, the handler hands the connection to `StartPhoneForwarder` ([phone-forwarder.md](phone-forwarder.md), #25), which is the read pump for the data path. +This is the phone side only. After header validation and `RegisterPhone`, the handler hands the connection to `StartPhoneForwarder` ([phone-forwarder.md](phone-forwarder.md), #25), which is the read pump for the data path. The heartbeat goroutine ([Heartbeat feature](heartbeat.md), #7) runs alongside the handler — see § Concurrency below for the LIFO defer ordering. ## Wire shape @@ -140,7 +140,7 @@ The phone observes the close on its socket as `StatusNormalClosure` — by delib - **No connection caps (per-IP or global).** Documented residual in `docs/threat-model.md` § DoS resistance. Same gap as `/v1/server`, named there, not widened. - **No phone-count-per-server-id cap.** `phones[serverID]` grows under attack; the broadcast cost (when #6 lands) is the DoS shape that owns it. - **No inner-frame parsing.** `StartPhoneForwarder` wraps each frame in the routing envelope and forwards opaque bytes; the binary owns inner-frame validation. -- **No heartbeat / ping-pong.** Future ticket. 
+- **No heartbeat policy in the handler.** The handler launches `go runHeartbeat(...)` after the successful register and registers `defer cancelHB()` so the goroutine exits cleanly under handler unwind (#7). The heartbeat policy itself — 30s interval, 30s pong timeout, `1011 "heartbeat timeout"` close — lives in `heartbeat.go`. See [Heartbeat feature](heartbeat.md). - **No phone-side reconnect grace.** The binary-side grace from #20 already closes orphan phones cleanly on expiry; phone-side grace is not in the protocol spec and is out of scope. - **No `400` log line.** Avoids amplifying header-floods into log volume. - **No log on `websocket.Accept` errors.** Library writes a 4xx; the failure is visible in the http access log. diff --git a/docs/knowledge/features/heartbeat.md b/docs/knowledge/features/heartbeat.md new file mode 100644 index 0000000..5eb4a7d --- /dev/null +++ b/docs/knowledge/features/heartbeat.md @@ -0,0 +1,150 @@ +# WebSocket heartbeat — RFC 6455 ping/pong + +Every accepted WS connection (binary on `/v1/server`, phone on `/v1/client`) gets a per-connection heartbeat goroutine that sends an RFC 6455 ping every 30 seconds. If the matching pong does not arrive within 30 seconds of the most recent ping, the goroutine closes the conn with code `1011 server error` and reason `"heartbeat timeout"`. The handler's existing release/unregister defer then runs naturally (peer-close path, via `CloseRead` surfacing the close to `readCtx`). + +This is the only mechanism the relay has to detect a half-open TCP connection — a peer whose socket dies silently (NAT timeout, mobile interface flap, kernel panic, severed cable) leaves no application-layer signal otherwise. Worst-case dead-connection detection: **60 seconds** (interval + timeout). + +## Wire shape + +Standard RFC 6455 control frames. The library (`nhooyr.io/websocket`) handles ping/pong framing internally: + +- Server sends a `Ping` opcode with library-default empty payload. 
+- Peer's library auto-responds with a matching `Pong`. +- A peer that does not process incoming pings (no `Read` / `CloseRead` on its side) cannot pong. + +No application-level keepalive messages. No protocol-spec changes. Per-endpoint behaviour is **identical** for binaries and phones. + +## API + +Package `internal/relay` (`heartbeat.go`): + +```go +const ( + heartbeatInterval = 30 * time.Second + heartbeatTimeout = 30 * time.Second +) + +func runHeartbeat(ctx context.Context, w *WSConn, interval, timeout time.Duration) +``` + +One unexported function and two file-level constants. `runHeartbeat` is called from both upgrade handlers as `go runHeartbeat(hbCtx, wsconn, heartbeatInterval, heartbeatTimeout)` after the registry claim/register succeeds. + +Uses two methods on `WSConn` added in this ticket (see [ADR-0007](../decisions/0007-wsconn-closewithcode-for-active-conn.md)): + +- `WSConn.Ping(ctx) error` — pure forwarder over `*websocket.Conn.Ping`. +- `WSConn.CloseWithCode(code, reason)` — `closeOnce`-guarded close with custom code/reason; `Close()` now delegates to it as `CloseWithCode(StatusNormalClosure, "")`. + +## Constants are file-level, not flags + +`heartbeatInterval` and `heartbeatTimeout` are file-level constants — not `cmd/pyrycode-relay/main.go` flags, not constructor parameters. Two reasons: + +1. **Two wiring entry points share the same value.** `/v1/server` and `/v1/client` both call `runHeartbeat(..., heartbeatInterval, heartbeatTimeout)`. Naming the value once and referring to it from both sites is the threshold the project pattern names ("Promote to a package-level constant only when a second wiring entry point needs the same value"). The grace duration in `/v1/server` (#21) had only one wiring entry point and stayed inline; heartbeat has two and gets a constant. +2. **The AC explicitly requires file-level constants.** Surfacing them as flags would invite operator-tweaking against the protocol spec. v1 deliberately does not configure these. 
+ +Tests pass shorter values (50ms / 100ms) directly to `runHeartbeat` so production constants are not coupled to test wall-clock cost. + +## Handler wiring + +Both handlers, immediately after the successful register and after the existing release/unregister defer is in place: + +```go +hbCtx, cancelHB := context.WithCancel(r.Context()) +defer cancelHB() +go runHeartbeat(hbCtx, wsconn, heartbeatInterval, heartbeatTimeout) +``` + +`cancelHB` is registered as a defer **after** the release defer, so it runs **first** under LIFO unwind. The order matters: + +1. `cancelHB()` fires → heartbeat goroutine observes `ctx.Done()` and returns without touching the conn. +2. Then the release defer runs `wsconn.Close()` (`StatusNormalClosure`). + +On the clean-shutdown / peer-close path the wire sees exactly one normal-close. On the heartbeat-timeout path, heartbeat has already emitted `CloseWithCode(1011, ...)` before unwind; the subsequent `wsconn.Close()` is a `closeOnce` no-op (the wire saw 1011). + +## The cancellation invariant + +`runHeartbeat`'s body has two `return` paths that **do not** call `CloseWithCode`: + +```go +case <-ctx.Done(): + return // (1) ctx fired between ticks +case <-ticker.C: + pingCtx, cancel := context.WithTimeout(ctx, timeout) + err := w.Ping(pingCtx) + cancel() + if err == nil { continue } + if ctx.Err() != nil { + return // (2) ctx fired while Ping was in flight + } + w.CloseWithCode(websocket.StatusInternalError, "heartbeat timeout") + return +``` + +Both ctx-cancellation cases — (1) and (2) — exit cleanly. The handler defer owns the clean-close path; `runHeartbeat` owns **only** the timeout-driven close. This is what AC bullet (c) — "exits cleanly when the connection's context is cancelled, without attempting a close call" — encodes. The structural property: there is exactly one `CloseWithCode` call site, gated by `ctx.Err() == nil`. 
+ +`time.NewTicker` (with `defer ticker.Stop()`) is used rather than `time.Tick`; `Tick` leaks its underlying ticker for the lifetime of the program, which is wrong inside any goroutine that exits. + +## Concurrency + +Per accepted upgrade: **3 goroutines** (was 2 pre-#7). + +| Goroutine | Source | Lifecycle | +|---|---|---| +| handler | http.Server | request-scoped; blocks on `<-readCtx.Done()` | +| `CloseRead` drain | `c.CloseRead(r.Context())` | exits on conn close / r.Context cancel | +| heartbeat | `go runHeartbeat(...)` (this ticket) | exits via ctx.Done OR timeout-then-close | + +No leak path: `defer cancelHB()` is unconditional once heartbeat has been launched, and heartbeat is launched only after the release defer is in place. Any panic between the two still cancels. + +`Ping` (a control frame) and `Send` (a binary frame) at the underlying-library level: `nhooyr.io/websocket` serialises pings against writes via its own internal mutex. `WSConn.writeMu` guards *concurrent `Send` callers* (broadcasters), not Send-vs-Ping. `runHeartbeat` correctly does not take `writeMu` — control frames and data frames have separate ordering requirements, and the library's internal serialisation is the right granularity. + +`CloseWithCode` and `Close` share the same `closeOnce`. Whichever close-family call fires first wins; the wire sees exactly one close frame. + +## Close-frame timing — what the test had to reckon with + +`nhooyr.io/websocket.Conn.Close` performs a close handshake: it writes the close frame (5s write timeout) then waits up to 5s for the peer's reciprocal close frame before tearing down the TCP. For a peer that has stopped reading entirely (the heartbeat-target case), the reciprocal close never arrives and the handshake's 5s grace bounds how long `runHeartbeat` blocks inside `CloseWithCode`. + +Operationally this adds at most ~5s onto the 60s detection window before the handler unwinds and the registry slot is released. 
The protocol spec's "60s worst case" is about **detection**; teardown is bounded by the library's handshake timeout. No additional configuration is needed. + +For tests, the implication is that `done` (the channel that closes when `runHeartbeat` returns) lags the close-frame-on-the-wire by up to ~5s if the test client never reads. The test gives `done` a 7s deadline; the close-frame assertion is independent of `done` and uses a shorter window. + +## Testing + +`internal/relay/heartbeat_test.go`, `package relay`. The `startHeartbeatPair(t, interval, timeout)` helper sets up an `httptest.NewServer` whose handler accepts the upgrade, wraps in `WSConn`, calls `c.CloseRead(r.Context())` to mirror the production handler shape (so the server side processes incoming pongs and pings), and runs `runHeartbeat` in a goroutine. Returns the `*WSConn` (server side), the client `*websocket.Conn`, a `done` channel, the heartbeat `cancel` func, and a `teardown`. + +Three tests, 1:1 with AC bullets (a)/(b)/(c): + +- **`TestHeartbeat_HealthyPeer_KeepsConnOpen`** — client side runs `client.CloseRead(...)` so the lib auto-pongs server's pings. Sleep through ~3 ping cycles (interval=50ms, 3 cycles + slack). Assert `done` is **not** closed. +- **`TestHeartbeat_UnresponsivePeer_TriggersClose`** — client side does **not** read until after the heartbeat has had time to fire and emit its close. Reading earlier would auto-pong and keep the heartbeat satisfied (the inverse of what's wanted). After `interval + timeout + 200ms` the test does a single `client.Read` with a 5s deadline; assert it returns `*websocket.CloseError{Code: 1011, Reason: "heartbeat timeout"}`. The Read also triggers the client lib's automatic reciprocal close, which unblocks the server's Close handshake and lets `done` fire; assert `done` closes within 7s. 
+- **`TestHeartbeat_ContextCancelled_ExitsCleanly`** — cancel the heartbeat ctx immediately; assert `done` closes within 1s; assert a `client.Read` with a short (200ms) deadline returns `context.DeadlineExceeded` rather than a `CloseError` (proving the heartbeat did **not** emit a close on the cancel path). + +Production constants (30s/30s) are exercised implicitly by existing handler tests that already pass at sub-second flows; an explicit slow integration test that waits 30s is out of scope. + +## What this ticket deliberately does NOT do + +- **No application-level keepalive messages.** Would be a protocol change. +- **No per-endpoint interval differences (phone vs binary).** Both endpoints get the same regime; the relay treats both as untrusted internet sockets with identical liveness requirements. +- **No adaptive intervals based on connection quality.** v1. +- **No configurability of interval / timeout.** Revisit only if mobile-flapping evidence emerges. +- **No new logging on heartbeat-induced close.** The handler's existing `server_released` / `phone_unregistered` log fires on the defer path regardless of whether the close was peer-initiated or heartbeat-initiated. Distinguishing the two would help mobile-flapping diagnosability and the spec named it as an open question; not adopted in this ticket because no operational signal currently demands it. Revisit when ops evidence motivates it. + +## Adversarial framing + +The heartbeat operates on an already-authorised connection (post-`ClaimServer` / `RegisterPhone`). It introduces no new untrusted input. + +- **Bounded write rate.** One ping per 30s per connection — server-driven, not amplifiable by a hostile peer. +- **Bounded write size.** Library-default empty ping payload. No attacker-controlled byte enters the wire. +- **Slow-loris resistance — improvement.** A peer that withholds pongs is now detected within 60s and its slot reclaimed. Pre-#7, dead peers wedged the routing tables forever. 
A peer that slow-pongs (responds at 29s) keeps the conn alive but consumes nothing beyond the existing per-conn footprint. +- **Goroutine count.** +1 per accepted upgrade (3 total per conn). Bounded above by `Registry` slot limits. +- **No new log content.** The `1011` close reason is a fixed literal; no peer-supplied value enters logs or wire frames. +- **`Ping`-vs-`Send` interleaving.** Library serialises control frames against data writes internally; `WSConn.writeMu` guards Send-vs-Send. No new lock-order interaction. + +Verdict from the architect's pre-implementation security review (in the spec): **PASS**. No new defences invented; this ticket is a strict improvement over the pre-#7 dead-peer-detection state. + +## Related + +- [WSConn adapter](ws-conn-adapter.md) — `Ping` and `CloseWithCode` were added here for this ticket. +- [`/v1/server` WS upgrade](server-endpoint.md) — first heartbeat wiring site. +- [`/v1/client` WS upgrade](client-endpoint.md) — second heartbeat wiring site. +- [ADR-0007](../decisions/0007-wsconn-closewithcode-for-active-conn.md) — the WSConn-surface widening that this feature required. +- [ADR-0005](../decisions/0005-application-close-codes-via-underlying-conn.md) — still applies for stillborn-WSConn close codes (`4409`/`4404`/`4401`). +- [Protocol spec § Heartbeat](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#heartbeat) — authoritative source for 30s interval / 30s timeout / 60s detection. diff --git a/docs/knowledge/features/server-endpoint.md b/docs/knowledge/features/server-endpoint.md index e1ee028..15ee63b 100644 --- a/docs/knowledge/features/server-endpoint.md +++ b/docs/knowledge/features/server-endpoint.md @@ -110,7 +110,7 @@ Shutdown: `http.Server` cancels `r.Context()` → `CloseRead`'s read errors → - **No log on `websocket.Accept` errors.** Library writes a 4xx; the failure is visible in the http access log. A per-failure log would invite log-flood from misconfigured clients. 
A counter (Prometheus, future) is the right shape if observability later wants this. - **No connection caps (per-IP or global).** Documented residual in `docs/threat-model.md` § DoS resistance; the WS upgrade path is named there. Inherited gap, not widened. - **No frame loop.** `CloseRead` discards frames until #6 replaces it. -- **No heartbeat / ping-pong.** #7. +- **No heartbeat policy in the handler.** The handler launches `go runHeartbeat(...)` after the successful claim and registers `defer cancelHB()` so the goroutine exits cleanly under handler unwind (#7). The heartbeat policy itself — 30s interval, 30s pong timeout, `1011 "heartbeat timeout"` close — lives in `heartbeat.go`. See [Heartbeat feature](heartbeat.md). - **No frame-forward error handling during grace.** A phone whose `Send` errors against the closed binary `Conn` during the grace window is #6's territory; the handler does not observe these. - **No token validation.** `/v1/server` is unauthenticated by design — the binary owns the trust relationship with phones. - **No log-line sanitisation.** `slog`'s text handler quotes string values, so a newline in `binary_version` cannot forge a log line. JSON handler in production preserves the property. diff --git a/docs/knowledge/features/ws-conn-adapter.md b/docs/knowledge/features/ws-conn-adapter.md index b5fb55b..52650ea 100644 --- a/docs/knowledge/features/ws-conn-adapter.md +++ b/docs/knowledge/features/ws-conn-adapter.md @@ -17,6 +17,8 @@ func (w *WSConn) ConnID() string func (w *WSConn) Send(msg []byte) error func (w *WSConn) Read(ctx context.Context) ([]byte, error) func (w *WSConn) Close() +func (w *WSConn) CloseWithCode(code websocket.StatusCode, reason string) // #7 +func (w *WSConn) Ping(ctx context.Context) error // #7 ``` Plus one package-level constant: @@ -27,6 +29,8 @@ const writeTimeout = 10 * time.Second `writeTimeout` bounds a single `Send`. A slow peer cannot stall a caller past this deadline. 
+`CloseWithCode` and `Ping` were added in #7 (heartbeat). `Close()` now delegates to `CloseWithCode(StatusNormalClosure, "")` — both share the `closeOnce` funnel, so whichever close-family call fires first wins. `Ping` is a pure forwarder over `*websocket.Conn.Ping`; it does not take `writeMu` because `nhooyr.io/websocket` serialises control frames against data writes internally. See [ADR-0007](../decisions/0007-wsconn-closewithcode-for-active-conn.md) for why active-conn application close codes go on `WSConn` while stillborn-WSConn close codes (`4409`/`4404`/`4401`, [ADR-0005](../decisions/0005-application-close-codes-via-underlying-conn.md)) keep the underlying-`*websocket.Conn` pattern. + ## Concurrency model | Method | Lock | Context | Concurrent-safe with | @@ -60,7 +64,7 @@ These are documented so the next contributor doesn't add defensive code that doe - **No `connID` validation.** Length, charset, uniqueness are owned by the conn-id-scheme ticket. The adapter treats the id as opaque. - **No handshake / header validation / subprotocol selection.** Lives at the upgrade boundary (#4/#16, #5). -- **No ping/pong.** Heartbeat is #7. +- **No heartbeat policy.** The adapter exposes `Ping(ctx)` as a pure forwarder; the *policy* (interval, timeout, what to do on failure) lives in `runHeartbeat` in `internal/relay/heartbeat.go` (#7). The adapter does not own the heartbeat goroutine, the ticker, or the close decision — it only provides the pinging primitive. - **No read-side frame loop or envelope wrap/unwrap.** `Read` is a single-frame primitive; the loop and envelope wrapping live in `internal/relay/forward.go` ([phone-forwarder.md](phone-forwarder.md)). - **No per-message size cap on `Read`.** Inherited from `*websocket.Conn`'s default (nhooyr's 32 MiB read limit). A deliberate `SetReadLimit` policy is a follow-up so it covers both forwarders. - **No per-conn send queue / backpressure / rate limit.** `Send` writes synchronously and returns. 
None of those are in the registry's `Conn` contract. diff --git a/docs/lessons.md b/docs/lessons.md index 04951c3..27db8ce 100644 --- a/docs/lessons.md +++ b/docs/lessons.md @@ -70,6 +70,14 @@ A client can complete the TLS handshake using SNI for `relay.example.com` and th RFC 7230 §5.4 allows it. When comparing to a configured hostname, strip the port via `net.SplitHostPort` first (it errors out cleanly if there's no port — use the original string in that case). Compare with `strings.EqualFold` because hostnames are case-insensitive. Source: `EnforceHost` (#9). +## `nhooyr.io/websocket.Conn.Close` performs a 5s close-handshake, gating goroutine exit + +`Conn.Close(code, reason)` writes the close frame (5s write timeout) then **waits up to 5s for the peer's reciprocal close frame** before tearing down the TCP. If the peer has stopped reading entirely (the heartbeat-target case — that's why heartbeat fired), the reciprocal close never arrives and `Close` blocks for the full 5s. Implication for callers: the close frame reaches the wire promptly, but the goroutine sitting inside `Close` (e.g. `runHeartbeat`'s `CloseWithCode` path) only returns when the handshake's 5s grace expires. Tests that assert "the goroutine exited" need a >5s deadline; tests that assert "the close frame arrived" can use a much shorter window. Operationally this adds at most ~5s onto the worst-case dead-conn detection window before the handler unwinds and the registry slot releases. Source: `internal/relay/heartbeat_test.go` `TestHeartbeat_UnresponsivePeer_TriggersClose` (#7). + +## A WebSocket peer that does not read frames cannot pong + +The library's auto-pong machinery runs inline with `Read` (or `CloseRead`'s background drain). A peer that completes the handshake and then never reads — even just to discard — cannot respond to incoming pings, because the ping frame sits in the kernel's TCP buffer unobserved. 
This is exactly what makes RFC 6455 ping/pong a useful liveness signal: a wedged peer that has stopped processing the conn fails the heartbeat structurally, not just because it "chose not to" pong. Implication for tests: to test the *unresponsive-peer* path, the test client must NOT read; to test the *healthy-peer* path, the client must read (or `CloseRead`) so the library auto-pongs. Reverse the two and the assertions swap meanings. Source: `internal/relay/heartbeat_test.go` (#7). + ## `autocert.Manager.TLSConfig()` doesn't set `MinVersion` `gosec` G402 fires on `make lint` if you use it raw. Wrap it in a helper that pins `MinVersion = tls.VersionTLS12` (or 1.3) before handing it to `http.Server`. Centralising the override means a future bump is a one-line change. Source: `relay.TLSConfig` (#9). diff --git a/docs/specs/architecture/7-heartbeat.md b/docs/specs/architecture/7-heartbeat.md new file mode 100644 index 0000000..7d84d51 --- /dev/null +++ b/docs/specs/architecture/7-heartbeat.md @@ -0,0 +1,244 @@ +# Spec: WebSocket heartbeat — RFC 6455 ping/pong (#7) + +## Files to read first + +- `internal/relay/server_endpoint.go` (full, 122 lines) — handler structure, `defer { ScheduleReleaseServer; wsconn.Close; log }` pattern, `c.CloseRead(r.Context()) ; <-readCtx.Done()` placeholder. The new heartbeat goroutine launches inside this handler, after `ClaimServer` succeeds and after the existing release defer is registered. +- `internal/relay/client_endpoint.go` (full, 88 lines) — symmetric structure for `/v1/client`. Heartbeat wiring is identical in shape; the only difference is which register/unregister functions the handler defers around. +- `internal/relay/ws_conn.go` (full, 83 lines) — `WSConn` adapter, `closeOnce`, `closeCtx`, `writeMu`. The new `CloseWithCode` and `Ping` methods land here; `Close()` is refactored to delegate to `CloseWithCode`. 
+- `docs/knowledge/decisions/0005-application-close-codes-via-underlying-conn.md` — explicitly anticipates *this* ticket: "Application close codes for an *active* WSConn ... need a different solution — likely a small `WSConn.CloseWithCode` added at the time the first such use case lands, justified by an ADR amendment." Heartbeat is the first such use case. ADR-0005 still applies to *stillborn* WSConn paths (4404/4409); the new method extends the surface for *active* WSConn closes. +- `internal/relay/server_endpoint_test.go:21-43` — `startServer` / `dialWith` / `validHeaders` test helpers. The new heartbeat tests use the same `httptest.NewServer` + raw `nhooyr.io/websocket` pair pattern. +- `internal/relay/client_endpoint_test.go:21-93` — analogous helpers for `/v1/client`. +- `pyrycode/pyrycode/docs/protocol-mobile.md` § Heartbeat (linked from the ticket body) — authoritative behaviour: 30s ping interval, 30s pong timeout, 60s worst-case dead-connection detection, both endpoints. +- `docs/PROJECT-MEMORY.md` § Patterns established — esp. "Application WS close codes are emitted on the underlying `*websocket.Conn`, not via `WSConn`" (which this ticket *amends* for the active-conn case) and "Adapters bridge interface↔library API mismatches by owning policy locally" (the WSConn extension follows the same shape). + +## Context + +Both endpoints (`/v1/server` and `/v1/client`) currently park on `<-readCtx.Done()` after a successful register. `CloseRead` drains control frames in the background — including pongs — so the connection survives ping/pong handshakes initiated by the *peer*, but the relay never **initiates** any. A peer whose TCP socket dies silently (NAT timeout, mobile interface flap, kernel panic, severed cable) leaves the relay holding a registered `Conn` indefinitely, blocking the server-id slot and consuming a phone slot. + +RFC 6455 ping/pong at the WebSocket layer is the only signal the relay has to detect this. 
Per protocol-mobile.md § Heartbeat: send a ping every 30 seconds; if a pong does not arrive within 30 seconds of the ping, treat the connection as dead. Worst-case detection: 60 seconds. + +This ticket adds the heartbeat. It is intentionally symmetric across both endpoints — same interval, same timeout, same goroutine shape — because the relay treats both peers as untrusted internet-exposed sockets with identical liveness requirements. + +## Design + +### Module layout + +Add `internal/relay/heartbeat.go` containing two file-level constants and one unexported function: + +```go +const heartbeatInterval = 30 * time.Second +const heartbeatTimeout = 30 * time.Second + +func runHeartbeat(ctx context.Context, w *WSConn, interval, timeout time.Duration) +``` + +The function is unexported; both upgrade handlers live in `package relay` and call it directly. The constants are file-level — not exported, not configurable from `cmd/pyrycode-relay/main.go`. Tests in `heartbeat_test.go` call `runHeartbeat` directly with shorter values; the constants are used only at the production wiring sites. + +Why constants in the file (rather than a literal at the wiring site, which is the established `grace` pattern from #21): there are *two* wiring entry points (server + client) using the same value. Naming the value once and referring to it from both sites is the threshold the existing pattern memo names ("Promote to a package-level constant only when a second wiring entry point needs the same value"). The AC also asks for "file-level constants" explicitly. + +### `WSConn` extension + +`WSConn` gains two methods. Both are minimal forwarders; together they let heartbeat operate on a `*WSConn` without exposing the underlying `*websocket.Conn`. + +```go +// CloseWithCode cancels in-flight writes and closes the WebSocket with +// code/reason. Idempotent under closeOnce: only the first Close-family +// call reaches the underlying conn. Safe to call concurrently with Send. 
+// +// Used by paths that need to signal a non-normal close on an *active* +// WSConn (heartbeat timeout → 1011 "heartbeat timeout"). See ADR-0005 +// for why this lives on WSConn rather than at the call site. +func (w *WSConn) CloseWithCode(code websocket.StatusCode, reason string) { + w.closeOnce.Do(func() { + w.cancel() + _ = w.conn.Close(code, reason) + }) +} + +// Ping sends an RFC 6455 ping and blocks until the pong returns or ctx +// expires. Pure forwarder over the library's c.Ping; does not take +// writeMu (the library serialises control frames against writes +// internally). Used by runHeartbeat. +func (w *WSConn) Ping(ctx context.Context) error { + return w.conn.Ping(ctx) +} + +// Close is preserved as a normal-closure convenience. Existing call +// sites (handler defers, registry-side teardown) keep their current +// signature. +func (w *WSConn) Close() { + w.CloseWithCode(websocket.StatusNormalClosure, "") +} +``` + +Idempotency note: `closeOnce` ensures whichever `CloseWithCode`-family call arrives first wins. If heartbeat fires `CloseWithCode(1011, "heartbeat timeout")` and the handler defer subsequently calls `wsconn.Close()` (`StatusNormalClosure`), the second call is a `closeOnce` no-op — the wire saw 1011. Conversely, on the clean-shutdown path, the defer's `Close()` runs first and any later `CloseWithCode` from a goroutine racing with the cancel becomes a no-op. This is the single-funnel guarantee `closeOnce` was built for. + +### Heartbeat function + +```go +// runHeartbeat sends RFC 6455 pings every interval and closes w with +// 1011 "heartbeat timeout" if a pong does not arrive within timeout. +// Returns when ctx is cancelled (clean-shutdown path) without calling +// CloseWithCode itself: the handler's existing defer owns the +// clean-close path. runHeartbeat owns ONLY the timeout-driven close. 
+func runHeartbeat(ctx context.Context, w *WSConn, interval, timeout time.Duration) { + ticker := time.NewTicker(interval) + defer ticker.Stop() + for { + select { + case <-ctx.Done(): + return + case <-ticker.C: + pingCtx, cancel := context.WithTimeout(ctx, timeout) + err := w.Ping(pingCtx) + cancel() + if err == nil { + continue + } + // Cancellation race: if ctx fired while Ping was in flight, + // Ping returns ctx.Err(). Bail without closing — the handler + // defer owns the clean-close path. + if ctx.Err() != nil { + return + } + w.CloseWithCode(websocket.StatusInternalError, "heartbeat timeout") + return + } + } +} +``` + +Two invariants encoded: + +1. **Only one path calls `CloseWithCode`.** The ctx-cancelled paths (`<-ctx.Done()` and `ctx.Err() != nil` after a Ping that returned ctx-cancellation) both `return` without touching the conn. Acceptance criterion (c) — "exits cleanly when the connection's context is cancelled, without attempting a close call" — falls out structurally. +2. **`time.NewTicker` is used, not `time.Tick`.** `Tick` leaks; `NewTicker` + `defer Stop` is the only correct shape inside a goroutine that exits. + +### Handler wiring + +`/v1/server` (`server_endpoint.go`): immediately after the existing release-defer is registered and before `c.CloseRead`, launch the heartbeat with a cancellable context derived from `r.Context()`. The cancel is registered as a defer so it runs **before** the release defer (LIFO). + +```go +logger.Info("server_claimed", ...) 
+ +defer func() { // EXISTING — release defer + reg.ScheduleReleaseServer(serverID, grace) + wsconn.Close() + logger.Info("server_released", "server_id", serverID) +}() + +hbCtx, cancelHB := context.WithCancel(r.Context()) // NEW +defer cancelHB() // NEW — runs first (LIFO) +go runHeartbeat(hbCtx, wsconn, heartbeatInterval, heartbeatTimeout) // NEW + +readCtx := c.CloseRead(r.Context()) // EXISTING +<-readCtx.Done() // EXISTING +``` + +`/v1/client` (`client_endpoint.go`): identical insertion at the analogous point (after the unregister defer is registered, before `c.CloseRead`). + +Defer-order rationale: when the handler returns, deferred funcs run LIFO. `cancelHB` fires first → heartbeat goroutine observes `ctx.Done()` and returns without closing. Then the release defer runs `wsconn.Close()` (StatusNormalClosure). Net wire effect on the clean-shutdown path: a single normal close. On the heartbeat-timeout path: heartbeat has already emitted `CloseWithCode(1011, "heartbeat timeout")` before the handler unwinds — the subsequent `wsconn.Close()` is a `closeOnce` no-op. + +### Concurrency model + +Per accepted upgrade: + +- 1 handler goroutine (the HTTP handler itself, runs `<-readCtx.Done()`). +- 1 `CloseRead` drain goroutine (existing, owned by the library after `c.CloseRead`). +- **1 heartbeat goroutine (new).** Spawned via `go runHeartbeat(...)` after register success. + +Goroutine count per connection: 3 (was 2). Bounded by `Registry`-imposed slot limits (1 binary per server-id; phone count cap is #19's responsibility). + +Lifecycle: + +- Heartbeat exits via `ctx.Done()` when the handler returns (peer-close path, conflict path is unreachable because heartbeat hasn't started yet on early-return). +- Heartbeat exits via the timeout path when `Ping` returns a non-ctx error after `timeout` elapsed; it emits `CloseWithCode(1011, ...)` first. 
+- No leak path: `defer cancelHB()` is unconditional once heartbeat has been launched, and heartbeat is launched only after the release defer is in place, so any panic between them still cancels. + +`Ping` (control frame) and `Send` (binary frame) at the underlying-library level: nhooyr.io/websocket serialises pings against writes via its own internal mutex. WSConn's `writeMu` guards *concurrent Send callers* (broadcasters), not Send-vs-Ping. Heartbeat does not take `writeMu`. This is correct: control frames and data frames have separate ordering requirements, and the library's internal serialisation is the right granularity for this case. + +### Error handling + +- `Ping` returns `nil` → loop continues to next tick. +- `Ping` returns non-nil error AND `ctx.Err() == nil` → real timeout / dead conn → emit 1011 and return. +- `Ping` returns non-nil error AND `ctx.Err() != nil` → handler is shutting down → return without close. +- `CloseWithCode` returns no error (it ignores the underlying close error via `_ = ...`); this matches `Close`'s existing pattern. +- The handler defer is unchanged: it still calls `wsconn.Close()` and `ScheduleReleaseServer` regardless of how readCtx ended (peer close, heartbeat-induced close, server shutdown). + +The "ping returning a non-nil error (timeout, connection-closed, etc.)" wording in the AC encompasses both the timeout case and the case where Ping fails because the conn is *already* closed (e.g., peer hung up between two ticks; Ping fails on the next tick before `CloseRead` has surfaced the close to readCtx). Both branches converge on the same path: `CloseWithCode(1011, "heartbeat timeout")`. If the conn is already closed, the underlying lib's `Close` no-ops and the wire effect is whatever the original close emitted; that's fine — the goroutine exits either way. + +### Testing strategy + +`internal/relay/heartbeat_test.go` (new). 
Each test sets up a websocket pair via `httptest.NewServer` running a small handler that calls `runHeartbeat` directly (NOT through the upgrade handlers). The function-direct shape lets tests pass tiny intervals (e.g., 50ms) without reaching for the production constants. + +Test fixture (single helper at the top of the file): + +```go +// startHeartbeatPair upgrades a server-side connection, wraps it in a +// WSConn, runs runHeartbeat in a goroutine with the supplied +// interval/timeout, and returns: +// - the *WSConn (server side; for assertions) +// - the *websocket.Conn (client side; for driving pings/pongs) +// - a done channel that closes when runHeartbeat returns +// - a cancel func that fires the heartbeat ctx +// - a teardown that closes everything +func startHeartbeatPair(t *testing.T, interval, timeout time.Duration) (...) +``` + +Three tests, one per AC bullet (a/b/c): + +**(a) Healthy peer keeps the conn open across cycles.** Client side runs a `c.Read(ctx)` loop in a goroutine — this drives the library's control-frame processing, which auto-pongs incoming pings. Sleep through 3 intervals (e.g. 150ms with interval=50ms). Assert: `done` channel is NOT closed; client's last Read has not seen a CloseError. Tear down. + +**(b) Unresponsive peer triggers 1011 close.** Client side does NOT read — pings sit in the conn unobserved, no pong is sent. After (interval + timeout + slack), the heartbeat must have emitted the 1011 close. Drive a single client Read with a generous deadline; assert it returns a `*websocket.CloseError` with `Code == 1011` and `Reason == "heartbeat timeout"`. Assert `done` channel closed. (Guard against an autoflush window: read deadline of ~1s on top of (interval + timeout) is plenty for the close frame to land.) + +**(c) Cancelling the parent ctx exits cleanly without closing.** Cancel the ctx immediately after launching. Wait for `done` to close. 
Assert: client's `Read(ctx)` does NOT return a `CloseError` from the heartbeat path — instead, when the test tears down (which closes everything), the close code is `StatusNormalClosure` (the test's teardown call, not the heartbeat). The cleanest assertion: after cancel + done, the underlying conn is still open enough for the *test* to send a `c.Ping(ctx)` from the client side and receive a pong. If that succeeds, the heartbeat did not close the conn. + +Production constants are exercised by an existing handler test (e.g. `TestServerEndpoint_PeerClose_ReleasesSlot`) implicitly — those tests already pass and a 30s heartbeat does not interfere with their sub-second flows. Adding a slow integration test that waits 30s is out of scope; the constants are validated by inspection of the wiring in `server_endpoint.go` / `client_endpoint.go`. + +### What does not change + +- `WSConn.Send`, `WSConn.ConnID`, `closeCtx` semantics. +- `Registry` API. Heartbeat operates entirely above the registry layer; from the registry's perspective, a heartbeat-induced close is indistinguishable from a peer-initiated close. +- `cmd/pyrycode-relay/main.go`. Heartbeat constants are not flag-controlled in v1. +- `CloseRead` placeholder in both handlers. Frame forwarding (#6) replaces it later; heartbeat is independent of that work. + +### Knowledge artefacts + +The developer should add (alongside the implementation): + +1. A new feature doc `docs/knowledge/features/heartbeat.md` (one page) — what runHeartbeat does, why 30s/30s, the cancellation invariant, the test approach. Mirrors the shape of existing feature docs. +2. An ADR amendment OR a new ADR-0007 covering the `WSConn.CloseWithCode` extension. ADR-0005 already names this case as "an ADR amendment"; either form is acceptable. If a new ADR, keep it short (<50 lines) and link from ADR-0005 as the realised follow-up. +3. The `INDEX.md` entry under `docs/knowledge/`. 
+ +These are routine for this repo — the spec calls them out so they don't get missed. + +## Open questions + +- **Logging on heartbeat-induced close.** Should heartbeat log `heartbeat_timeout` with `conn_id` before calling `CloseWithCode`? The handler's existing `server_released` / `phone_unregistered` log fires regardless (defer path), but it doesn't distinguish "peer hung up cleanly" from "we 1011'd them after timeout." For mobile-flapping diagnosability this distinction matters. Recommendation: pass `*slog.Logger` and `connID` into `runHeartbeat` and log `heartbeat_timeout` at INFO before the `CloseWithCode` call. Cost: 2 extra parameters; benefit: ops visibility into a failure mode that's otherwise invisible. Decide during implementation; this spec does not gate on it. +- **Unit tests for the WSConn additions.** `Ping` is a pure forwarder; `CloseWithCode` is mostly tested transitively by the heartbeat tests. A small direct `TestWSConn_CloseWithCode_IsIdempotent` may be worth adding to `ws_conn_test.go`, but it is not strictly required by the AC. + +## Security review + +**Verdict:** PASS + +**Findings:** + +- **[Trust boundaries]** No findings. Heartbeat operates on an already-authorised connection (post-`ClaimServer` / `RegisterPhone`). Ping is a control frame; the relay neither reads nor writes application data here. There is no new untrusted input to validate. +- **[Tokens, secrets, credentials]** Not applicable. Heartbeat does not read or emit credentials. Close reason `"heartbeat timeout"` is a fixed literal — no credential, header, or peer-supplied value enters the close frame. +- **[File operations]** Not applicable. Heartbeat is purely in-memory + network. +- **[Subprocess / external command execution]** Not applicable. +- **[Cryptographic primitives]** Not applicable. RFC 6455 ping payload is unspecified by this design (library default — empty); pong is the library's automatic response. No keys, no comparisons. 
+- **[Network & I/O]** + - Bounded concurrency: one heartbeat goroutine per accepted upgrade. Bounded above by `Registry` slot limits (1 binary per server-id; phone-count cap is #19's responsibility — explicitly out of scope). + - Bounded write rate: one ping per 30 seconds per connection. A hostile peer cannot amplify or accelerate this — the interval is server-driven. + - Bounded write size: control frame, library-default empty payload. No attacker-controlled byte enters the wire. + - Slow-loris resistance: a peer that slow-pongs (responds to ping at 29s) keeps the conn alive but consumes no extra resources beyond a single heartbeat goroutine. A peer that withholds pongs is detected within 60s and the slot is reclaimed via the existing release path. This is *strictly an improvement* over the pre-#7 state, where dead peers wedged forever. +- **[Error messages, logs, telemetry]** No tokens, headers, or peer payloads enter logs. The proposed `heartbeat_timeout` log (open question) carries `conn_id` only — `conn_id` is a relay-assigned random hex8 + server-id, contains no peer secret. Close reason `"heartbeat timeout"` is fixed. +- **[Concurrency]** + - Goroutine lifecycle: heartbeat exits via `ctx.Done()` (handler unwind) or via the timeout-then-close path. No third exit; no leak. `defer cancelHB()` is unconditional. + - Lock ordering: heartbeat holds no locks. `WSConn.CloseWithCode` is `closeOnce`-guarded, same as `Close()`; no new lock-order interaction. + - Race with concurrent `Send`: nhooyr.io/websocket's internal write serialisation handles ping-vs-write ordering. WSConn's `writeMu` is for multiple Send callers, not Send-vs-Ping; heartbeat correctly does not take it. + - Race on close: heartbeat-emitted 1011 vs handler-defer-emitted normal close — `closeOnce` ensures whichever fires first wins. The wire shows exactly one close frame. +- **[Threat model alignment]** protocol-mobile.md § Heartbeat is the canonical source; this design implements it directly. 
The threat addressed: half-open TCP from disappeared peers wedging the routing tables forever (a DoS-by-leak vector against an internet-exposed relay). No threat is moved or expanded by this change. Future-ticket scope (per-IP connection limits, application-level keepalive) is named explicitly out of scope by the ticket body. + +**Reviewer:** architect (self-review per `architect/security-review.md`) +**Date:** 2026-05-09 diff --git a/internal/relay/client_endpoint.go b/internal/relay/client_endpoint.go index ce80104..3c11e25 100644 --- a/internal/relay/client_endpoint.go +++ b/internal/relay/client_endpoint.go @@ -11,6 +11,7 @@ package relay // binary's responsibility (per protocol-mobile.md § Phone → relay → binary). import ( + "context" "errors" "log/slog" "net/http" @@ -78,12 +79,22 @@ func ClientHandler(reg *Registry, logger *slog.Logger) http.Handler { "conn_id", connID) }() - // Phone-side read pump: wraps each inbound frame in the routing - // envelope and writes it to the binary holding serverID. Blocks - // until the phone closes the WS, ctx cancels, the binary - // disappears, or a Send to the binary fails. The handler's defer - // above runs when this returns. Return value is observability- - // only; the forwarder logs the cause. + // Heartbeat: per-conn goroutine sends RFC 6455 pings at + // heartbeatInterval and closes the conn with 1011 "heartbeat + // timeout" if a pong does not arrive within heartbeatTimeout. + // cancelHB runs first under LIFO defer order: it signals the + // goroutine to exit cleanly before the unregister defer closes + // the conn. closeOnce ensures whichever close fires first wins. + hbCtx, cancelHB := context.WithCancel(r.Context()) + defer cancelHB() + go runHeartbeat(hbCtx, wsconn, heartbeatInterval, heartbeatTimeout) + + // Phone-side read pump (#25): wraps each inbound frame in the + // routing envelope and writes it to the binary holding serverID. 
+ // Blocks until the phone closes the WS, ctx cancels, the binary + // disappears, or a Send to the binary fails. Replaces the old + // CloseRead+Done placeholder. Return value is observability-only; + // the forwarder logs the cause. _ = StartPhoneForwarder(r.Context(), reg, serverID, wsconn, logger) }) } diff --git a/internal/relay/heartbeat.go b/internal/relay/heartbeat.go new file mode 100644 index 0000000..d3e4721 --- /dev/null +++ b/internal/relay/heartbeat.go @@ -0,0 +1,58 @@ +package relay + +import ( + "context" + "time" + + "nhooyr.io/websocket" +) + +// heartbeatInterval and heartbeatTimeout implement the per-connection +// liveness regime from protocol-mobile.md § Heartbeat: send an RFC 6455 +// ping every interval; if the matching pong has not arrived within +// timeout of the most recent ping, treat the connection as dead. Worst- +// case dead-connection detection is interval+timeout = 60s. +// +// File-level constants (not exposed as flags in v1) because two wiring +// entry points — /v1/server and /v1/client — share the same value, and +// the AC names them as such. Tests in heartbeat_test.go pass shorter +// values directly to runHeartbeat. +const ( + heartbeatInterval = 30 * time.Second + heartbeatTimeout = 30 * time.Second +) + +// runHeartbeat sends a Ping every interval and closes w with +// 1011 "heartbeat timeout" if a pong does not arrive within timeout. +// +// Returns when ctx is cancelled (clean-shutdown path) without calling +// CloseWithCode itself: the handler's existing release defer owns the +// clean-close path. runHeartbeat owns ONLY the timeout-driven close. +// +// Two cancellation cases must be distinguished. In both, ctx has fired +// while a Ping was either pending or in flight; in neither does +// runHeartbeat close the conn — the handler defer will. The second +// case (Ping returns a ctx-cancellation error) is detected by checking +// ctx.Err() after the call. 
+func runHeartbeat(ctx context.Context, w *WSConn, interval, timeout time.Duration) { + ticker := time.NewTicker(interval) + defer ticker.Stop() + for { + select { + case <-ctx.Done(): + return + case <-ticker.C: + pingCtx, cancel := context.WithTimeout(ctx, timeout) + err := w.Ping(pingCtx) + cancel() + if err == nil { + continue + } + if ctx.Err() != nil { + return + } + w.CloseWithCode(websocket.StatusInternalError, "heartbeat timeout") + return + } + } +} diff --git a/internal/relay/heartbeat_test.go b/internal/relay/heartbeat_test.go new file mode 100644 index 0000000..05b0053 --- /dev/null +++ b/internal/relay/heartbeat_test.go @@ -0,0 +1,203 @@ +package relay + +import ( + "context" + "errors" + "net/http" + "net/http/httptest" + "strings" + "testing" + "time" + + "nhooyr.io/websocket" +) + +// startHeartbeatPair stands up an httptest server whose handler upgrades +// to WebSocket, wraps the conn in a WSConn, calls c.CloseRead so the +// library auto-pongs and processes incoming control frames (mirrors the +// production handler shape), and runs runHeartbeat in a goroutine with +// the supplied interval/timeout. +// +// Returns: +// - the *WSConn (server side, for assertions; the heartbeat operates on it) +// - the *websocket.Conn (client side, for the test to drive ping/pong/read) +// - done — closes when runHeartbeat returns +// - cancel — fires the heartbeat ctx (lets test (c) trigger clean exit) +// - teardown — closes everything; safe to defer. +// +// The handler keeps the conn alive until teardown so individual tests +// can probe the conn's state after heartbeat exits. 
+func startHeartbeatPair(t *testing.T, interval, timeout time.Duration) ( + *WSConn, *websocket.Conn, <-chan struct{}, context.CancelFunc, func(), +) { + t.Helper() + + done := make(chan struct{}) + hbCtx, cancelHB := context.WithCancel(context.Background()) + handlerStop := make(chan struct{}) + handlerReady := make(chan struct{}) + var serverWS *WSConn + + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + c, err := websocket.Accept(w, r, nil) + if err != nil { + t.Errorf("accept: %v", err) + close(handlerReady) + return + } + serverWS = NewWSConn(c, "test-hb") + // CloseRead drains control frames (so the server side auto- + // processes incoming pongs and pings). Mirrors the production + // /v1/server and /v1/client handler shape. + c.CloseRead(r.Context()) + close(handlerReady) + + go func() { + runHeartbeat(hbCtx, serverWS, interval, timeout) + close(done) + }() + + <-handlerStop + })) + + wsURL := "ws" + strings.TrimPrefix(srv.URL, "http") + dialCtx, dialCancel := context.WithTimeout(context.Background(), 5*time.Second) + defer dialCancel() + client, _, err := websocket.Dial(dialCtx, wsURL, nil) + if err != nil { + cancelHB() + close(handlerStop) + srv.Close() + t.Fatalf("dial: %v", err) + } + <-handlerReady + if serverWS == nil { + cancelHB() + close(handlerStop) + _ = client.Close(websocket.StatusNormalClosure, "") + srv.Close() + t.Fatalf("accept failed; no serverWS") + } + + var teardownOnce bool + teardown := func() { + if teardownOnce { + return + } + teardownOnce = true + cancelHB() + close(handlerStop) + _ = client.Close(websocket.StatusNormalClosure, "") + srv.Close() + } + return serverWS, client, done, cancelHB, teardown +} + +// (a) Healthy peer responds to pings; heartbeat keeps the conn open +// across multiple cycles. 
+func TestHeartbeat_HealthyPeer_KeepsConnOpen(t *testing.T) { + interval := 50 * time.Millisecond + timeout := 50 * time.Millisecond + _, client, done, _, teardown := startHeartbeatPair(t, interval, timeout) + defer teardown() + + // Drive client reads so the library processes incoming pings and + // auto-pongs them. Without this, heartbeat would time out on the + // first cycle (which is exactly what test (b) below relies on). + client.CloseRead(context.Background()) + + select { + case <-done: + t.Fatalf("runHeartbeat exited unexpectedly during healthy cycles") + case <-time.After(3*interval + 100*time.Millisecond): + // Expected: heartbeat still running across ~3 ping cycles. + } +} + +// (b) Unresponsive peer (pong never arrives) → heartbeat closes the +// conn with code 1011 and reason "heartbeat timeout" within one +// timeout window. +// +// nhooyr's Conn.Close performs a close handshake with a 5s grace +// waiting for the peer's reciprocal close frame. The close frame +// itself reaches the wire immediately; runHeartbeat's exit (and so +// `done`) is gated on the handshake's grace expiring (the unresponsive +// peer never replies). Test asserts both: the close frame arrives at +// the client side promptly (proves the 1011 path), then waits longer +// for `done` to close (proves the goroutine exits without leaking). +func TestHeartbeat_UnresponsivePeer_TriggersClose(t *testing.T) { + interval := 50 * time.Millisecond + timeout := 100 * time.Millisecond + _, client, done, _, teardown := startHeartbeatPair(t, interval, timeout) + defer teardown() + + // Client deliberately does NOT read until the heartbeat has had + // time to fire and emit its close. Reading earlier would cause the + // client library to auto-pong incoming pings and keep the heartbeat + // satisfied — exactly the inverse of what this test wants. 
Sleep + // past one full ping cycle (interval + timeout) plus margin, then + // read; the close frame the heartbeat queued is waiting in the + // client's TCP buffer and the Read also auto-sends the reciprocal + // close that lets the server's Close handshake complete. + time.Sleep(interval + timeout + 200*time.Millisecond) + + readCtx, cancelRead := context.WithTimeout(context.Background(), 5*time.Second) + defer cancelRead() + _, _, err := client.Read(readCtx) + if err == nil { + t.Fatalf("client.Read returned nil error, want CloseError") + } + var ce websocket.CloseError + if !errors.As(err, &ce) { + t.Fatalf("client.Read err = %v (%T), want websocket.CloseError", err, err) + } + if ce.Code != websocket.StatusInternalError { + t.Fatalf("close code = %d, want %d (StatusInternalError / 1011)", ce.Code, websocket.StatusInternalError) + } + if ce.Reason != "heartbeat timeout" { + t.Fatalf("close reason = %q, want %q", ce.Reason, "heartbeat timeout") + } + + // The library's Close handshake holds runHeartbeat in CloseWithCode + // for up to ~5s waiting for the peer's reciprocal close. Wait long + // enough for that to drain. + select { + case <-done: + case <-time.After(7 * time.Second): + t.Fatalf("runHeartbeat did not exit within 7s after emitting close; goroutine leak suspected") + } +} + +// (c) Cancelling the parent ctx → heartbeat exits cleanly without +// closing the conn itself. The handler defer (mirrored here by the +// fixture's teardown) owns the clean-close path.
+func TestHeartbeat_ContextCancelled_ExitsCleanly(t *testing.T) { + interval := 50 * time.Millisecond + timeout := 50 * time.Millisecond + _, client, done, cancel, teardown := startHeartbeatPair(t, interval, timeout) + defer teardown() + + cancel() + + select { + case <-done: + case <-time.After(time.Second): + t.Fatalf("runHeartbeat did not exit after ctx cancellation") + } + + // Conn must still be alive: a Read with a short deadline must + // return a context-deadline-exceeded, NOT a *websocket.CloseError. + // If heartbeat had wrongly emitted CloseWithCode on the cancel + // path, the close frame would have landed and Read would surface + // it as a CloseError instead of timing out. + readCtx, cancelRead := context.WithTimeout(context.Background(), 200*time.Millisecond) + defer cancelRead() + _, _, err := client.Read(readCtx) + if err == nil { + t.Fatalf("client.Read returned nil error, want context-deadline-exceeded") + } + var ce websocket.CloseError + if errors.As(err, &ce) { + t.Fatalf("client.Read got CloseError %v — heartbeat closed the conn on the cancel path (must not)", err) + } +} diff --git a/internal/relay/server_endpoint.go b/internal/relay/server_endpoint.go index 1aa43e9..4fd710e 100644 --- a/internal/relay/server_endpoint.go +++ b/internal/relay/server_endpoint.go @@ -10,6 +10,7 @@ package relay // frame-forward errors on /v1/server. import ( + "context" "crypto/rand" "encoding/hex" "errors" @@ -89,6 +90,16 @@ func ServerHandler(reg *Registry, logger *slog.Logger, grace time.Duration) http logger.Info("server_released", "server_id", serverID) }() + // Heartbeat: per-conn goroutine sends RFC 6455 pings at + // heartbeatInterval and closes the conn with 1011 "heartbeat + // timeout" if a pong does not arrive within heartbeatTimeout. + // cancelHB runs first under LIFO defer order: it signals the + // goroutine to exit cleanly before the release defer closes the + // conn. closeOnce ensures whichever close fires first wins. 
+ hbCtx, cancelHB := context.WithCancel(r.Context()) + defer cancelHB() + go runHeartbeat(hbCtx, wsconn, heartbeatInterval, heartbeatTimeout) + // Hold the connection open until the peer closes it. CloseRead // spawns a goroutine that drains-and-discards frames (including // control frames like ping/pong, which must be processed for the diff --git a/internal/relay/ws_conn.go b/internal/relay/ws_conn.go index f538560..7bff98a 100644 --- a/internal/relay/ws_conn.go +++ b/internal/relay/ws_conn.go @@ -88,8 +88,36 @@ func (w *WSConn) Read(ctx context.Context) ([]byte, error) { // underlying *websocket.Conn; subsequent calls are no-ops. Safe to call // concurrently with Send. func (w *WSConn) Close() { + w.CloseWithCode(websocket.StatusNormalClosure, "") +} + +// CloseWithCode cancels in-flight writes and closes the WebSocket with +// the given code and reason. Idempotent under the same closeOnce guard +// as Close: only the first Close-family call reaches the underlying +// *websocket.Conn; later calls (whether Close or CloseWithCode) are +// no-ops. Safe to call concurrently with Send. +// +// Used by paths that need to signal a non-normal close on an *active* +// WSConn — e.g. the heartbeat goroutine emitting 1011 "heartbeat +// timeout" when a peer stops responding to pings. See ADR-0007 for why +// this lives on WSConn rather than at the call site (ADR-0005 covers +// the stillborn-WSConn handler-side pattern that this method extends). +func (w *WSConn) CloseWithCode(code websocket.StatusCode, reason string) { w.closeOnce.Do(func() { w.cancel() - _ = w.conn.Close(websocket.StatusNormalClosure, "") + _ = w.conn.Close(code, reason) }) } + +// Ping sends an RFC 6455 ping control frame and blocks until the +// matching pong returns or ctx expires. Pure forwarder over the +// library's Conn.Ping; does not take writeMu — nhooyr.io/websocket +// serialises control frames against data writes internally. +// +// Used by runHeartbeat. 
A non-nil return is the caller's signal that +// the peer is unresponsive (ctx-deadline elapsed without a pong) or +// that the conn has died; runHeartbeat translates this into a +// CloseWithCode(1011, "heartbeat timeout"). +func (w *WSConn) Ping(ctx context.Context) error { + return w.conn.Ping(ctx) +} diff --git a/pyrycode-relay b/pyrycode-relay deleted file mode 100755 index 29cacdf..0000000 Binary files a/pyrycode-relay and /dev/null differ