Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions docs/PROJECT-MEMORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -43,9 +43,10 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
- **Policy values live at the wiring site, not as package-level constants.** `30*time.Second` for the grace window is a literal in `cmd/pyrycode-relay/main.go`, threaded into `ServerHandler` as a constructor parameter — it appears exactly once, the value is policy (matches protocol spec), and inlining keeps the protocol-spec linkage visible where the relay is composed. Tests pass ms-scale durations through the same parameter. Promote to a package-level constant only when a second wiring entry point needs the same value. Adopted in `/v1/server` grace duration (#21).
- **Pointer-identity for stale `time.AfterFunc` fires.** `time.Timer.Stop()` returns false if the timer's func has already started executing. Wrap each pending timer in a small struct and store the wrapper pointer in a map; the `AfterFunc` closure captures the wrapper pointer and asserts `map[key] == self` under the lock before acting. If a faster goroutine replaced the entry, the pointer no longer matches and the closure no-ops. Capturing the `*time.Timer` directly forces a self-referential local var (assigned after `AfterFunc` returns) which trips the race detector under stress; the wrapper avoids that. Adopted in `Registry.ScheduleReleaseServer` (#20). Same shape applies to any "one cancellable timer per key" pattern.
- **Credentials the relay does not validate are presence-checked then discarded — never logged, never put in error strings.** `/v1/client`'s `X-Pyrycode-Token` is opaque to the relay (the binary owns verification). The handler reads the token into a local string, branches on `!= ""`, and lets the local go out of scope unread. The log-event field set is enumerated explicitly in the doc; the token name does not appear in any `slog` call, `fmt.Errorf`, or response body. Defence is layered: spec, code review, and the structural absence of any code path that uses the value after the gate. Same posture extends to any future "courier credential" the relay carries but does not own (e.g. session resume tokens). Adopted in `/v1/client` (#5).
- **Hold long-lived WS handlers open with `c.CloseRead(r.Context())` plus `<-readCtx.Done()` until the real read loop lands.** `CloseRead` drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close). The frame-loop ticket replaces both the `CloseRead` call and the `<-readCtx.Done()` block with the actual read pump in the same call site — keeping `CloseRead` alongside a real reader would race the sole-reader contract. Adopted in `/v1/server` (#16) and `/v1/client` (#5); `/v1/client` swapped to `StartPhoneForwarder` in #25 (`/v1/server`'s placeholder remains until the binary-side forwarder lands).
- **Forwarder owns reading; handler owns cleanup.** The phone-side frame forwarder (#25) is a pure read pump — no `UnregisterPhone`, no `wsconn.Close()`, no extra goroutine. It runs synchronously on the HTTP handler goroutine; on return, the handler's `defer { UnregisterPhone; Close; log }` runs in the right order. Adding cleanup inside the forwarder would either double-close (idempotent, but muddies the lifecycle) or unregister twice. Pattern extends to the symmetric binary-side forwarder.
- **Define narrow read-side interfaces at the consumer, not on the adapter.** `phoneSource` (`ConnID() + Read(ctx)`) lives in `forward.go`, not on `WSConn`. Production passes `*WSConn`; tests substitute a fake without touching `WSConn`. If a future caller needs the same shape, promote the interface then — don't anticipate. Adopted in #25.
- **Hold long-lived WS handlers open with `c.CloseRead(r.Context())` plus `<-readCtx.Done()` until the real read loop lands.** `CloseRead` drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close). The frame-loop ticket replaces both the `CloseRead` call and the `<-readCtx.Done()` block with the actual read pump in the same call site — keeping `CloseRead` alongside a real reader would race the sole-reader contract. Adopted in `/v1/server` (#16) and `/v1/client` (#5); `/v1/client` swapped to `StartPhoneForwarder` in #25, `/v1/server` swapped to `StartBinaryForwarder` in #26. Both `CloseRead` placeholders are now gone.
- **Forwarder owns reading; handler owns cleanup.** Frame forwarders are pure read pumps — no `UnregisterPhone` / `ScheduleReleaseServer`, no `wsconn.Close()`, no `cancelHB`, no extra goroutine. The forwarder runs synchronously on the HTTP handler goroutine; on return, the handler's `defer cancelHB` (#7) and `defer { release/unregister; Close; log }` (#16/#21/#5) run in LIFO order. Adding cleanup inside the forwarder would either double-close (idempotent, but muddies the lifecycle) or unregister twice. Adopted in `StartPhoneForwarder` (#25) and `StartBinaryForwarder` (#26).
- **Forwarder error policy tracks the sink fan-out.** A forwarder with **one sink** returns on sink errors (no alternative — the conn is dead): `StartPhoneForwarder` (#25) returns on `BinaryFor` miss, marshal error, and `binary.Send` error. A forwarder with **N sinks** continues on per-sink errors (other sinks are still good): `StartBinaryForwarder` (#26) logs+continues on unknown `conn_id`, malformed envelope, and `phone.Send` error. Both return on source `Read` error or ctx cancel — those are unambiguous. The shape isn't a stylistic choice; it follows from how many independent downstream lifecycles the loop is multiplexing.
- **Define narrow read-side interfaces at the consumer, not on the adapter.** `phoneSource` and `binarySource` (`ConnID() + Read(ctx)`) live in `forward.go`, not on `WSConn`. Production passes `*WSConn`; tests substitute a fake without touching `WSConn`. The two interfaces are structurally identical but named distinctly so the `Start*Forwarder` signatures read self-describingly at call sites; promote to a shared interface only if a third caller emerges. Adopted in #25 and #26.
- **Per-conn goroutines exit cleanly via LIFO defers, not by closing the conn themselves.** Pattern for any goroutine launched alongside an upgraded WS conn (heartbeat, future binary-side ack pump, etc.): derive `hbCtx, cancelHB := context.WithCancel(r.Context())`, register `defer cancelHB()` *after* the release/unregister defer, then `go runX(hbCtx, wsconn, ...)`. Under LIFO unwind `cancelHB` fires first → goroutine observes `ctx.Done()` and returns without touching the conn → release defer's `wsconn.Close()` runs unopposed. The goroutine owns ONLY the failure-path close (heartbeat: `CloseWithCode(1011, "heartbeat timeout")`); cleanup-path close is the handler defer's. `closeOnce` makes the two paths idempotent against each other regardless of which fires first. Adopted in heartbeat (#7); same shape applies to any future per-conn watchdog/timer goroutine.
- **Active-conn application close codes go through `WSConn.CloseWithCode`; stillborn-conn close codes go through the underlying `*websocket.Conn`.** ADR-0005's stillborn-conn pattern (`c.Close(websocket.StatusCode(4409), reason)` directly on the underlying conn before any `Send` could have run) coexists with ADR-0007's active-conn pattern (`wsconn.CloseWithCode(1011, "heartbeat timeout")` post-claim, where `closeOnce`/`writeMu` invariants are in play). Both share the same `closeOnce` guard via `Close()` delegating to `CloseWithCode(StatusNormalClosure, "")`. The two patterns describe structurally distinct windows of a WSConn's lifecycle. Adopted: stillborn `4409`/`4404` (#16, #5); active `1011` (#7).
- **Capture process-state timestamps in `main` after `flag.Parse()`, not as package-level vars.** `startedAt := time.Now()` lives inside `main` and is passed into the handler factory. A package-level `var startedAt = time.Now()` would fire at import time — before flag parsing, before `--version` early-returns — and be wrong for short-lived test binaries and any future deferred-serve setup. Adopted in #10.
Expand Down
3 changes: 2 additions & 1 deletion docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,11 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [Binary-side frame forwarder](features/binary-forwarder.md) — per-binary read pump: unwraps each inbound routing envelope, linear-scans `PhonesFor(serverID)` for `env.ConnID`, writes `env.Frame` verbatim to that phone; opaque inner bytes; synchronous (handler discards the return); diverges from #25 in error policy — unknown `conn_id`, malformed envelope, phone `Send` error all log+continue (a single bad frame never tears down the binary); replaced `/v1/server`'s `CloseRead` placeholder (#26).
- [WebSocket heartbeat](features/heartbeat.md) — per-conn goroutine on both endpoints sends RFC 6455 ping every 30s; closes with `1011 "heartbeat timeout"` if no pong within 30s. Detects half-open TCP within 60s; ctx-cancel exit path leaves close to the handler defer (#7).
- [Phone-side frame forwarder](features/phone-forwarder.md) — per-phone read pump: wraps each inbound phone frame in the routing envelope keyed by the phone's `conn_id` and `Send`s it to the binary holding `serverID`; opaque inner bytes; synchronous (handler discards the return); replaced `/v1/client`'s `CloseRead` placeholder; added `WSConn.Read` (single-caller) (#25).
- [`/v1/client` WS upgrade](features/client-endpoint.md) — phone-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Token` / `User-Agent` pre-upgrade (token presence-only, never parsed/logged); registers phone on the binary's slot, emits `4404` if no binary holds the id, hands the conn to `StartPhoneForwarder` for the data path (#5, #25).
- [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, holds the conn via `CloseRead` until #6's frame loop replaces it; on disconnect schedules a 30s grace release so a quick reconnect inherits the slot (#21).
- [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, hands the conn to `StartBinaryForwarder` for the data path (#26); on disconnect schedules a 30s grace release so a quick reconnect inherits the slot (#16, #21, #26).
- [`/healthz` JSON endpoint](features/healthz.md) — unauthenticated `GET /healthz` returning `{status, version, connected_binaries, connected_phones, uptime_seconds}`; `Cache-Control: no-store`, body bounded ≈135 bytes.
- [WSConn adapter](features/ws-conn-adapter.md) — wraps `nhooyr.io/websocket.Conn` to satisfy the registry's `Conn`; owns the per-conn write mutex and a `Close`-cancelled context with a 10s per-`Send` deadline.
- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`, deferred-release `ScheduleReleaseServer` with grace-window reclaim semantics.
Expand Down
44 changes: 44 additions & 0 deletions docs/knowledge/codebase/26.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# Ticket #26 — Binary-side frame forwarder (unwrap envelope, dispatch to addressed phone)

Lands the binary-to-phone data path. After `/v1/server` claims the slot (#16/#21) and launches the heartbeat goroutine (#7), the handler now hands the conn to `StartBinaryForwarder`, which reads frames from the binary, unwraps each routing envelope, looks up the phone matching `env.ConnID` under the binary's `serverID`, and writes `env.Frame` verbatim to that phone. Replaces the `c.CloseRead(r.Context())` + `<-readCtx.Done()` placeholder #16 left in place; the relay still treats inner frames as opaque bytes.

## Implementation

- **`internal/relay/forward.go`** — added `binarySource` interface (`ConnID() + Read(ctx)`) and `StartBinaryForwarder(ctx, reg, serverID, binary, logger) error`. Structurally a mirror of `StartPhoneForwarder` (synchronous despite the `Start` verb; consumer-defined source interface; return value is observability-only and the handler discards it). Diverges in **error policy**: unknown `conn_id`, malformed envelope, and `phone.Send` error all `log + continue`, never `return`. Only `binary.Read` errors or ctx cancellation end the loop. The divergence is structural — a binary serves N phones, and a single failing sink (or an envelope addressed to a phone that just disconnected) is a normal race, not a binary fault.
- **`internal/relay/server_endpoint.go`** — the placeholder `readCtx := c.CloseRead(r.Context()); <-readCtx.Done()` block is replaced by `_ = StartBinaryForwarder(r.Context(), reg, serverID, wsconn, logger)`. The `CloseRead` call is **deleted**, not retained — keeping it would spawn a second reader that races the forwarder for the `*websocket.Conn` sole-reader contract. Handler's existing `defer { ScheduleReleaseServer; Close; log server_released }` (#16/#21) and `defer cancelHB()` (#7) run on return in the correct LIFO order: `cancelHB → release defer`. The forwarder owns reading; the handler owns cleanup.
- **`internal/relay/forward_test.go`** — extended with seven new tests. New fakes: `fakeBinarySource` (Read-side, mirrors `fakePhone`'s frames-chan shape) and `Send`-capture fields added to the existing `fakePhone` (`mu` + `sent` + `sendErr`) so it can serve as the sink. New helpers: `runBinaryForwarder`, `waitForPhoneSent`, `claimAndRegister`, `mustMarshal`. Tests (1:1 with AC):
- `TestStartBinaryForwarder_RoutesToAddressedPhone` — envelope to P1 lands at P1, byte-stable modulo whitespace; P2 untouched.
- `TestStartBinaryForwarder_MultiplePhones` — one envelope per phone, each receives only its own.
- `TestStartBinaryForwarder_UnknownConnID_DropsAndContinues` — bogus `conn_id` dropped; subsequent valid envelope still lands on P1; closing the binary's frames chan returns `io.EOF` (proves loop never died).
- `TestStartBinaryForwarder_MalformedEnvelope_DropsAndContinues` — raw `not-json` dropped; subsequent valid envelope still lands.
- `TestStartBinaryForwarder_PhoneSendError_DropsAndContinues` — `p1.sendErr` set; envelope to P1 dropped (no panic, no return); envelope to P2 still lands. Encodes the divergence from `StartPhoneForwarder`'s return-on-Send-error.
- `TestStartBinaryForwarder_BinaryDisconnect_Returns` — close source frames chan → forwarder returns `io.EOF`; test then mimics the handler defer via `reg.ScheduleReleaseServer("s1", 0)` and polls `BinaryFor` to confirm the slot clears — verifies the handler-defer-runs-as-expected AC bullet structurally.
- `TestStartBinaryForwarder_ContextCancellation_Returns` — `cancel()` → forwarder returns `context.Canceled` within 100 ms.
- The package-level `// Manual stress: go test -race -count=20 ./internal/relay/` doc comment already in `forward_test.go` (from #25) continues to cover the new tests; no edit needed there.

## Concurrency

One forwarder goroutine per binary (the HTTP handler goroutine itself; no extra goroutine spawned). Runs alongside the heartbeat goroutine; the two never share state. `phone.Send` is multi-caller across the system (every binary's forwarder writing to a given phone — `WSConn.writeMu` serialises). `reg.PhonesFor` returns a fresh snapshot under RLock and is iterated without holding any registry lock — `phone.ConnID()` / `phone.Send` are called outside the lock, matching the established passive-store pattern.

Four termination paths:
1. Binary closes WS → `Read` close error → loop returns → handler defer.
2. Server shutdown / request cancel → ctx cancels → `Read` ctx error → loop returns.
3. Heartbeat fires → `wsconn.CloseWithCode(1011, "heartbeat timeout")` → in-flight `Read` aborts → loop returns.
4. Per-frame errors → `continue`. Loop never terminated by frame faults.

## Deliberately out of scope

- **Per-message size cap on `WSConn`.** Same residual as #25 — belongs on `WSConn.SetReadLimit` so it covers both `/v1/server` and `/v1/client`. nhooyr's default 32 MiB read limit is a soft floor. Follow-up ticket.
- **Per-phone send queues.** A high-rate stream of envelopes destined for a single wedged phone can block the binary's loop for up to 10 s per envelope (the `WSConn.Send` deadline). Not observed yet; deferred.
- **Per-server-id `map[connID]Conn` for O(1) phone lookup.** Current iteration is O(N) per envelope; N is small (handful of phones per server). Registry-shape change should be motivated by observed cost, not anticipated.
- **No branching on specific `Unmarshal` sentinels.** All four (`ErrMalformedEnvelope`, `ErrMissingConnID`, `ErrMissingFrame`, defensive catch-all) get the same `warn + continue`; branching adds no behaviour and risks divergence if envelope errors are added later.

## Cross-links

- [Feature: Binary-side frame forwarder](../features/binary-forwarder.md) — full design / error-policy divergence / testing rationale.
- [Feature: `/v1/server` WS upgrade](../features/server-endpoint.md) — the wiring site; `CloseRead` placeholder removed.
- [Feature: Phone-side frame forwarder](../features/phone-forwarder.md) — mirror image; explains the shared `*Source` interface pattern.
- [Routing envelope](../features/routing-envelope.md) — `Unmarshal` is the per-frame structural check.
- [ADR-0001](../decisions/0001-routing-envelope-shape-and-opacity.md) — `json.RawMessage` opacity; sentinel errors funnel through one warn-and-continue branch.
- [Spec: 26-binary-forwarder](../../specs/architecture/26-binary-forwarder.md) — architect's pre-implementation design.
- [Protocol spec § Routing envelope](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#routing-envelope) — authoritative wire shape.
Loading