Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,14 @@ bin/
*.out
coverage.txt

# Default `go build ./...` from repo root emits a binary named after the
# module's last path segment ("pyrycode-relay") into CWD. Anchored with
# leading slash so it matches the repo-root binary only, not any future
# package directory of the same name. Same shape as the .codegraph fix
# below — auto-commit safety nets need a trustworthy gitignore. Salvage
# on PR #27 swept the binary in before this entry existed.
/pyrycode-relay

# Go
go.work
go.work.sum
Expand Down
2 changes: 2 additions & 0 deletions docs/PROJECT-MEMORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,8 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
- **Hold long-lived WS handlers open with `c.CloseRead(r.Context())` plus `<-readCtx.Done()` until the real read loop lands.** `CloseRead` drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close). The frame-loop ticket replaces both the `CloseRead` call and the `<-readCtx.Done()` block with the actual read pump in the same call site — keeping `CloseRead` alongside a real reader would race the sole-reader contract. Adopted in `/v1/server` (#16) and `/v1/client` (#5); `/v1/client` swapped to `StartPhoneForwarder` in #25 (`/v1/server`'s placeholder remains until the binary-side forwarder lands).
- **Forwarder owns reading; handler owns cleanup.** The phone-side frame forwarder (#25) is a pure read pump — no `UnregisterPhone`, no `wsconn.Close()`, no extra goroutine. It runs synchronously on the HTTP handler goroutine; on return, the handler's `defer { UnregisterPhone; Close; log }` runs in the right order. Adding cleanup inside the forwarder would either double-close (idempotent, but muddies the lifecycle) or unregister twice. Pattern extends to the symmetric binary-side forwarder.
- **Define narrow read-side interfaces at the consumer, not on the adapter.** `phoneSource` (`ConnID() + Read(ctx)`) lives in `forward.go`, not on `WSConn`. Production passes `*WSConn`; tests substitute a fake without touching `WSConn`. If a future caller needs the same shape, promote the interface then — don't anticipate. Adopted in #25.
- **Per-conn goroutines exit cleanly via LIFO defers, not by closing the conn themselves.** Pattern for any goroutine launched alongside an upgraded WS conn (heartbeat, future binary-side ack pump, etc.): derive `hbCtx, cancelHB := context.WithCancel(r.Context())`, register `defer cancelHB()` *after* the release/unregister defer, then `go runX(hbCtx, wsconn, ...)`. Under LIFO unwind `cancelHB` fires first → goroutine observes `ctx.Done()` and returns without touching the conn → release defer's `wsconn.Close()` runs unopposed. The goroutine owns ONLY the failure-path close (heartbeat: `CloseWithCode(1011, "heartbeat timeout")`); cleanup-path close is the handler defer's. `closeOnce` makes the two paths idempotent against each other regardless of which fires first. Adopted in heartbeat (#7); same shape applies to any future per-conn watchdog/timer goroutine.
- **Active-conn application close codes go through `WSConn.CloseWithCode`; stillborn-conn close codes go through the underlying `*websocket.Conn`.** ADR-0005's stillborn-conn pattern (`c.Close(websocket.StatusCode(4409), reason)` directly on the underlying conn before any `Send` could have run) coexists with ADR-0007's active-conn pattern (`wsconn.CloseWithCode(1011, "heartbeat timeout")` post-claim, where `closeOnce`/`writeMu` invariants are in play). Both share the same `closeOnce` guard via `Close()` delegating to `CloseWithCode(StatusNormalClosure, "")`. The two patterns describe structurally distinct windows of a WSConn's lifecycle. Adopted: stillborn `4409`/`4404` (#16, #5); active `1011` (#7).
- **Capture process-state timestamps in `main` after `flag.Parse()`, not as package-level vars.** `startedAt := time.Now()` lives inside `main` and is passed into the handler factory. A package-level `var startedAt = time.Now()` would fire at import time — before flag parsing, before `--version` early-returns — and be wrong for short-lived test binaries and any future deferred-serve setup. Adopted in #10.

## Conventions
Expand Down
2 changes: 2 additions & 0 deletions docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [WebSocket heartbeat](features/heartbeat.md) — per-conn goroutine on both endpoints sends RFC 6455 ping every 30s; closes with `1011 "heartbeat timeout"` if no pong within 30s. Detects half-open TCP within 60s; ctx-cancel exit path leaves close to the handler defer (#7).
- [Phone-side frame forwarder](features/phone-forwarder.md) — per-phone read pump: wraps each inbound phone frame in the routing envelope keyed by the phone's `conn_id` and `Send`s it to the binary holding `serverID`; opaque inner bytes; synchronous (handler discards the return); replaced `/v1/client`'s `CloseRead` placeholder; added `WSConn.Read` (single-caller) (#25).
- [`/v1/client` WS upgrade](features/client-endpoint.md) — phone-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Token` / `User-Agent` pre-upgrade (token presence-only, never parsed/logged); registers phone on the binary's slot, emits `4404` if no binary holds the id, hands the conn to `StartPhoneForwarder` for the data path (#5, #25).
- [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, holds the conn via `CloseRead` until #6's frame loop replaces it; on disconnect schedules a 30s grace release so a quick reconnect inherits the slot (#21).
Expand All @@ -15,6 +16,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Decisions

- [ADR-0007: `WSConn.CloseWithCode` for active-conn application close codes](decisions/0007-wsconn-closewithcode-for-active-conn.md) — extends ADR-0005 for the post-claim window: heartbeat (#7) needs `1011 "heartbeat timeout"` on a live WSConn; `Close()` delegates to `CloseWithCode(StatusNormalClosure, "")`, both share `closeOnce`.
- [ADR-0006: Grace window IS the reclaim path](decisions/0006-grace-period-as-reclaim-path.md) — `ClaimServer` during a pending grace timer succeeds (not `4409`); pointer-identity wrapper defends against stale `time.AfterFunc` fires after `Stop()`.
- [ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`](decisions/0005-application-close-codes-via-underlying-conn.md) — `WSConn.Close()` stays minimal (`StatusNormalClosure`); handlers emit `4409` / `4401` / `4404` directly on the `*websocket.Conn` in stillborn-WSConn paths.
- [ADR-0004: WS library choice and adapter context strategy](decisions/0004-ws-library-and-adapter-context-strategy.md) — `nhooyr.io/websocket` v1.8.x; adapter-owned context cancelled by `Close`, 10s per-`Send` deadline; registry's `Send` signature stays context-free.
Expand Down
36 changes: 36 additions & 0 deletions docs/knowledge/codebase/7.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
# Ticket #7 — WebSocket heartbeat (RFC 6455 ping/pong every 30s, close at 60s)

Per-connection liveness regime for both endpoints. A goroutine launched after each successful claim/register sends an RFC 6455 `Ping` every 30s; if the matching pong does not arrive within 30s, it closes the conn with `1011 server error` reason `"heartbeat timeout"`. Worst-case dead-connection detection: 60s. The relay's only mechanism for detecting half-open TCP from disappeared peers (NAT timeout, mobile interface flap, kernel panic, severed cable).

## Implementation

- **`internal/relay/heartbeat.go` (new)** — file-level constants `heartbeatInterval = 30s` and `heartbeatTimeout = 30s`; one unexported function `runHeartbeat(ctx, *WSConn, interval, timeout)`. `time.NewTicker` (not `time.Tick`) with `defer ticker.Stop()`; per-tick `context.WithTimeout` for the `Ping`; cancellation invariant — two `return` paths bypass the close (`<-ctx.Done()` between ticks, and `ctx.Err() != nil` after a Ping that returned ctx-cancellation), exactly one `CloseWithCode` call site gated on `ctx.Err() == nil`.
- **`internal/relay/ws_conn.go`** — surface widened by exactly two methods. `Ping(ctx) error` is a pure forwarder over `*websocket.Conn.Ping`; does not take `writeMu` (the library serialises control frames against writes internally). `CloseWithCode(code, reason)` is `closeOnce`-guarded; cancels `closeCtx` then calls `c.Close(code, reason)`. `Close()` is refactored to delegate: `CloseWithCode(StatusNormalClosure, "")` — both share one `closeOnce`, so whichever close-family call fires first wins. Behaviour of the existing `Close()` is unchanged.
- **`internal/relay/server_endpoint.go` and `internal/relay/client_endpoint.go`** — identical insertion in both handlers, immediately after the existing release/unregister defer is in place: `hbCtx, cancelHB := context.WithCancel(r.Context()); defer cancelHB(); go runHeartbeat(hbCtx, wsconn, heartbeatInterval, heartbeatTimeout)`. LIFO defer order: `cancelHB` fires before the release defer's `wsconn.Close()`, so the goroutine exits cleanly on the peer-close path before the conn is closed by the defer. On the heartbeat-timeout path, the goroutine has already emitted `CloseWithCode(1011, ...)`; the subsequent defer-driven `Close()` is a `closeOnce` no-op.
- **`internal/relay/heartbeat_test.go` (new)** — `startHeartbeatPair(t, interval, timeout)` helper sets up `httptest.NewServer` whose handler accepts the upgrade, wraps in `WSConn`, calls `c.CloseRead(r.Context())` (mirrors production handler shape so the server side processes incoming pongs), runs `runHeartbeat` in a goroutine. Three tests, 1:1 with AC bullets:
- `TestHeartbeat_HealthyPeer_KeepsConnOpen` — client `CloseRead` drives auto-pong; assert `done` is not closed across ~3 ping cycles.
- `TestHeartbeat_UnresponsivePeer_TriggersClose` — client deliberately does NOT read; sleep past `interval+timeout`, then a single `client.Read` returns `*websocket.CloseError{Code:1011, Reason:"heartbeat timeout"}`. `done` deadline is 7s because nhooyr's `Conn.Close` performs a 5s close handshake when the peer never reciprocates.
- `TestHeartbeat_ContextCancelled_ExitsCleanly` — cancel ctx; assert `done` closes within 1s; assert `client.Read` returns `context.DeadlineExceeded`, NOT a `CloseError` (proves heartbeat did not close on the cancel path).
- Production constants (30s/30s) are exercised implicitly by existing handler tests at sub-second flows; an explicit slow integration test that waits 30s is out of scope.

## Concurrency

3 goroutines per accepted upgrade (was 2 pre-#7): handler, `CloseRead` drain, heartbeat. No leak path — `defer cancelHB()` is unconditional once heartbeat has been launched, and heartbeat is launched only after the release defer is in place. `Ping` and `Send` interleave correctly via the library's internal control-frame serialisation; `WSConn.writeMu` guards Send-vs-Send only.

## Deliberately out of scope

- No application-level keepalive messages (would be a protocol change).
- No per-endpoint interval differences (phone vs binary).
- No adaptive intervals based on connection quality.
- No flag/config exposure of interval/timeout — file-level constants only. Revisit only if mobile-flapping evidence emerges.
- No new heartbeat-induced-close log line. The handler's `server_released` / `phone_unregistered` defer log fires regardless. Spec named diagnosability as an open question; deferred until ops evidence motivates it.

## Cross-links

- [Feature: WebSocket heartbeat](../features/heartbeat.md) — full design / cancellation invariant / close-handshake-timing / testing rationale.
- [ADR-0007: `WSConn.CloseWithCode` for active-conn application close codes](../decisions/0007-wsconn-closewithcode-for-active-conn.md) — the surface widening this ticket required; amends ADR-0005.
- [ADR-0005: Application WS close codes via the underlying `*websocket.Conn`](../decisions/0005-application-close-codes-via-underlying-conn.md) — still applies to stillborn-WSConn paths (`4409`/`4404`/`4401`).
- [WSConn adapter](../features/ws-conn-adapter.md) — `Ping` and `CloseWithCode` were added here.
- [`/v1/server` WS upgrade](../features/server-endpoint.md), [`/v1/client` WS upgrade](../features/client-endpoint.md) — the two wiring sites.
- [Spec: 7-heartbeat](../../specs/architecture/7-heartbeat.md) — the architect's pre-implementation design.
- [Protocol spec § Heartbeat](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#heartbeat) — authoritative source for 30s/30s/60s.
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
# ADR-0007: `WSConn.CloseWithCode` for active-conn application close codes

**Status:** Accepted (#7)
**Date:** 2026-05-10

## Context

[ADR-0005](0005-application-close-codes-via-underlying-conn.md) decided that application WS close codes (`4409`, `4404`, `4401`) are emitted by calling `Close(code, reason)` directly on the underlying `*websocket.Conn`, **not** through `WSConn`. The decision held the adapter's surface narrow: one `Close()` whose body is fixed to `StatusNormalClosure`. The justification rested on the close-codes-so-far being emitted from a *stillborn-WSConn* window — after construction, before any `Send`, before any defer wires the conn into the registry — where the WSConn's `closeOnce`/`writeMu` invariants are not in play.

ADR-0005 anticipated *this* ticket explicitly:

> The pattern only applies to **stillborn WSConn** paths — pre-`Send`, pre-defer. Once a WSConn is in use (registered with `Send` reachable from broadcasters), close goes through `WSConn.Close()` and the code is `StatusNormalClosure`. Application close codes for an *active* WSConn (e.g. binary disconnects mid-stream and we want to surface a forward-error code) need a different solution — likely a small `WSConn.CloseWithCode` added at the time the first such use case lands, justified by an ADR amendment.

The heartbeat ticket (#7) is that first use case. A goroutine running alongside the registered conn must, on a peer that stops responding to RFC 6455 pings, close the WS with `1011 server error` + reason `"heartbeat timeout"`. By the time heartbeat decides to close, the WSConn has been live in the registry — concurrent broadcasters may be in `Send`, the close defer is wired up, `closeOnce` is the only safe single-funnel for the close.

So: where does the `1011 + "heartbeat timeout"` close-code call site live?

## Decision

**Add `WSConn.CloseWithCode(code websocket.StatusCode, reason string)`.** Refactor `WSConn.Close()` to delegate (`CloseWithCode(StatusNormalClosure, "")`). Both share the same `closeOnce` guard and both cancel `closeCtx` before calling the underlying `Close`.

Concretely:

```go
func (w *WSConn) CloseWithCode(code websocket.StatusCode, reason string) {
w.closeOnce.Do(func() {
w.cancel()
_ = w.conn.Close(code, reason)
})
}

func (w *WSConn) Close() {
w.CloseWithCode(websocket.StatusNormalClosure, "")
}
```

`closeOnce` ensures that whichever Close-family call fires first wins. If heartbeat fires `CloseWithCode(1011, "heartbeat timeout")` and the handler defer subsequently calls `wsconn.Close()`, the second call is a no-op — the wire saw 1011. Conversely, on the clean-shutdown path the defer's `Close()` runs first and any later `CloseWithCode` from a goroutine racing with the cancel becomes a no-op. Single-funnel guarantee preserved.

Also added: `WSConn.Ping(ctx context.Context) error` — a pure forwarder over `c.Ping`. Heartbeat uses `Ping` rather than reaching for the underlying `*websocket.Conn`, so `runHeartbeat` operates entirely above the `WSConn` boundary.

ADR-0005 still applies for *stillborn* WSConn paths. The pattern there (handler calls `c.Close(websocket.StatusCode(4409), reason)` directly on the underlying conn before any `Send` could have run) remains the right shape for `4409`/`4404`/`4401` — those close codes are decided pre-claim, where the WSConn is provably stillborn. This ADR extends the surface specifically for *post-claim* close-code emission, which is structurally distinct.

## Rationale

The three options ADR-0005 weighed (add a method; embed the code in the constructor; emit on the underlying conn from the call site) get re-evaluated against the active-conn case:

1. **Add `WSConn.CloseWithCode`.** Adopted. Two callers (`Close`, future post-claim closes) share the `closeOnce` funnel through one body. Adapter surface grows by one method but keeps its single-mutex / single-once shape intact. No new `closeOnce` (rejected in ADR-0005 as "which close wins?") because `Close` now *delegates* to `CloseWithCode` — the body is a single function, the guard a single `sync.Once`, so the question of which close wins is settled by `Once.Do` semantics, exactly as before.
2. **Pass the close code into `WSConn` at construction.** Same reason it failed in ADR-0005: the close code at construction time is *not* the code the caller eventually decides on. Heartbeat decides "1011 because timeout" only after a Ping fails; the WSConn was constructed at `RegisterPhone`/`ClaimServer` time long before.
3. **Emit on the underlying `*websocket.Conn` from the heartbeat call site.** Rejected for the post-claim case. Reaching for `c` from a goroutine that holds neither `writeMu` nor `closeOnce` would race the adapter's invariants: a concurrent `Send` could be holding `writeMu` and have a `Write` mid-flight; a concurrent `Close` (handler defer) could be running `c.Close` independently. The library is documented to handle concurrent `Close` calls safely, but our `closeOnce` guarantee — "exactly one close-frame leaves this WSConn" — would be violated. The stillborn-conn carve-out from ADR-0005 cannot be extended here because the conn is no longer stillborn.

The cost of (1) is one method's worth of API surface widening, kept minimal: same shape as `Close`, same idempotency guarantee, no new state field, no new mutex. The benefit is that any future "active-conn close with custom code" (frame-forward error codes from #6, e.g.) plugs into the same call site without a fresh ADR amendment.

## Consequences

- `WSConn.Close()` is now defined as `CloseWithCode(StatusNormalClosure, "")`. Behaviour is unchanged — same close code, same idempotency, same `closeCtx` cancel — but the implementation funnels through `CloseWithCode`. Existing call sites (handler defers, registry-side teardown) require no change.
- `WSConn` now exposes `Ping(ctx)` as a pure forwarder. Used by `runHeartbeat`; safe to use for any future caller that needs RFC 6455 ping/pong from above the adapter boundary.
- Application close codes for **active** WSConns are emitted via `wsconn.CloseWithCode(code, reason)`. The first user is `runHeartbeat` (`1011`/`"heartbeat timeout"`). Future frame-forward error codes on `/v1/server` and `/v1/client` (per #6) follow the same call site.
- ADR-0005 remains in force for **stillborn** WSConns: handlers continue to call `c.Close(websocket.StatusCode(4409), reason)` directly for `4409`, `4404`, `4401`, with the inline comment naming the exception. The two patterns coexist — they describe structurally distinct windows of a WSConn's lifecycle.

## Related

- [ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`](0005-application-close-codes-via-underlying-conn.md) — the parent decision; this ADR is the explicit amendment ADR-0005 named.
- [ADR-0004: WS library choice and adapter context strategy](0004-ws-library-and-adapter-context-strategy.md) — `closeCtx` cancellation is what `CloseWithCode` cancels, same as `Close`.
- [Heartbeat feature](../features/heartbeat.md) — first user of `CloseWithCode` and `Ping`.
- [WSConn adapter](../features/ws-conn-adapter.md) — the type whose surface this ADR widens by exactly two methods.
Loading