From 86f42b32a4c9e33a42bae7a447190f49cd0f06fa Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Mon, 11 May 2026 12:55:55 +0300 Subject: [PATCH] chore(docs): shrink PROJECT-MEMORY.md to an index; archive content sections MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Same shape as pyrycode/pyrycode PR #285 and agent-dispatcher-v2 PR #56. PROJECT-MEMORY.md was drifting from index toward content store (70 lines today, growing per-ticket). Restored to the canonical index shape — Where things live + project-level conventions + open follow-ups — with the historical content preserved verbatim in docs/archive/. Changes: - docs/archive/PROJECT-MEMORY-history-2026-05-11.md — verbatim snapshot of the pre-2026-05-11 "What's built" table (#1..#32) and "Patterns established" section (35 patterns). Preserved for cross-reference resolution and audit trail. - docs/PROJECT-MEMORY.md — restored to ~40-line index. Distilled the patterns down to the 7 genuinely project-level rules that apply to all future tickets; the rest were per-ticket and belong in codebase/.md going forward. Tests: make test (-race) unchanged, all green. --- docs/PROJECT-MEMORY.md | 78 ++++++------------- .../PROJECT-MEMORY-history-2026-05-11.md | 68 ++++++++++++++++ 2 files changed, 92 insertions(+), 54 deletions(-) create mode 100644 docs/archive/PROJECT-MEMORY-history-2026-05-11.md diff --git a/docs/PROJECT-MEMORY.md b/docs/PROJECT-MEMORY.md index 10df6ed..ada06f4 100644 --- a/docs/PROJECT-MEMORY.md +++ b/docs/PROJECT-MEMORY.md @@ -4,67 +4,37 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex > **READ-ONLY FOR AGENTS (as of 2026-05-11).** All five pipeline agents have explicit "Never Update docs/PROJECT-MEMORY.md" rules. Per-ticket content (implementation, patterns, lessons) goes in [`docs/knowledge/codebase/.md`](knowledge/codebase/). Humans maintain this file directly. -## What's built +## Where things live -> **FROZEN.** Pre-2026-05-10 history. Per-ticket implementation summaries live in [`docs/knowledge/codebase/.md`](knowledge/codebase/). The table below is kept for context — do not append. +- `docs/knowledge/codebase/.md` — **per-ticket implementation summary + patterns established + lessons learned.** One file per ticket. Directory listing IS the index. +- `docs/knowledge/features/` — evergreen feature docs. +- `docs/knowledge/decisions/` — ADRs, numbered sequentially. +- `docs/knowledge/INDEX.md` — one-line summaries. **Documentation phase is the sole writer.** +- `docs/lessons.md` — **frozen 2026-05-11.** Historical reference. New lessons go in `docs/knowledge/codebase/.md`. +- `docs/specs/architecture/-.md` — architect specs. +- `docs/architecture.md` — system-level design overview. +- `docs/threat-model.md` — operational threat model (deploy, supply chain, DoS, log hygiene, TLS). +- `docs/archive/PROJECT-MEMORY-history-2026-05-11.md` — pre-2026-05-11 "What's built" table + "Patterns established" content store, archived in place so existing cross-references still resolve. -| Area | Status | Where | -|---|---|---| -| Go skeleton | Done | `cmd/pyrycode-relay/main.go`, `internal/relay/doc.go` | -| `http.Server` with explicit timeouts (gosec G114) | Done | `cmd/pyrycode-relay/main.go` | -| Routing envelope wrapper type (`Envelope`, `Marshal`, `Unmarshal`, sentinel errors) | Done (#1) | `internal/relay/envelope.go` | -| Connection registry (`Conn`, `Registry`; 1:1 binary, 1:N phones; sentinel errors for `4409`/`4404`; race-tested) | Done (#3) | `internal/relay/registry.go` | -| Registry grace-period reclaim (`ScheduleReleaseServer`; grace-aware `ClaimServer` fast path; phones closed at expiry, inherited on reclaim; pointer-identity stale-fire defence) | Done (#20) | `internal/relay/registry.go` | -| TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` | -| `WSConn` adapter (`nhooyr.io/websocket.Conn` → registry `Conn`; per-conn write mutex; `Close`-cancelled context; 10s `Send` deadline) | Done (#15) | `internal/relay/ws_conn.go` | -| `/healthz` JSON endpoint (`status`, `version`, `connected_binaries`, `connected_phones`, `uptime_seconds`; `Cache-Control: no-store`; unauthenticated) | Done (#10) | `internal/relay/healthz.go`, `cmd/pyrycode-relay/main.go` | -| WS upgrade on `/v1/server` (header gate pre-upgrade; `ClaimServer`; `4409` on conflict; `CloseRead`-held until #6; disconnect defers a 30s grace release via `ScheduleReleaseServer`) | Done (#16, #21) | `internal/relay/server_endpoint.go`, `cmd/pyrycode-relay/main.go` | -| WS upgrade on `/v1/client` (header gate pre-upgrade; `RegisterPhone`; `4404` if no binary; data path via `StartPhoneForwarder`; disconnect calls `UnregisterPhone`; token never parsed or logged) | Done (#5, #25) | `internal/relay/client_endpoint.go`, `cmd/pyrycode-relay/main.go` | -| Header validation (`x-pyrycode-server`, `x-pyrycode-version`, `user-agent` on `/v1/server`; `x-pyrycode-server`, `x-pyrycode-token`, `user-agent` on `/v1/client`; optional `x-pyrycode-device-name` on `/v1/client`) | Done (#16, #5) | `internal/relay/server_endpoint.go`, `internal/relay/client_endpoint.go` | -| Phone-side frame forwarder (`StartPhoneForwarder`: read frame → `Marshal(connID, frame)` → `BinaryFor(serverID).Send(wrapped)`; synchronous; opaque inner bytes; per-frame `BinaryFor` picks up grace-window state) | Done (#25) | `internal/relay/forward.go`, `internal/relay/client_endpoint.go` | -| `WSConn.Read` (single-caller; no `closeCtx` join — underlying `Close` aborts in-flight reads) | Done (#25) | `internal/relay/ws_conn.go` | -| Binary-side frame forwarder | Not started | — | -| `conn_id` generation scheme | Not started | — | -| Threat model doc — operational surface (deploy, supply chain, DoS, log hygiene, cert handling, TLS, error leakage) | Done (#11) | `docs/threat-model.md` | +## Project-level conventions (human-maintained) -## Patterns established (frozen 2026-05-11) +Stable rules that span tickets. New entries land here only when genuinely cross-cutting; per-ticket detail goes in `codebase/.md`. -> **FROZEN.** Pre-2026-05-11 patterns. New per-ticket patterns live in `docs/knowledge/codebase/.md`. Do not append here. - -- **Sentinel errors, branched via `errors.Is`** for validation failures at protocol boundaries. New routing-layer code follows the `Err...` naming and wraps with `fmt.Errorf("…: %w", err, sentinel)` when adding context. -- **Opacity by type.** Inner-frame payloads are carried as `json.RawMessage`. The relay never deserialises payloads; the type makes that hard to violate accidentally. +- **Sentinel errors + `errors.Is` branching at protocol boundaries.** New routing-layer code follows the `Err...` naming and wraps with `fmt.Errorf("…: %w", err, sentinel)`. +- **Opacity by type for inner-frame payloads.** Carried as `json.RawMessage`. The relay never deserialises payloads; the type makes that hard to violate accidentally. - **Validate at the envelope boundary, not deeper.** Structural checks (presence, non-empty, JSON well-formedness) belong here; semantic checks (token validity, message kind) belong to the binary. -- **Stdlib + `golang.org/x/crypto` + `nhooyr.io/websocket`.** Direct deps arrived in #9 (`acme/autocert`) and #15 (`nhooyr.io/websocket` v1.8.x). Keep the dependency surface deliberate; new deps need a justification (the ADR is the justification). -- **Loud failure over silent correction.** `--domain` and `--insecure-listen` are mutually exclusive and explicit. `ErrCacheDirInsecure` refuses to start on a world/group-readable cert cache rather than re-chmoding it. `:80` returns `404` for non-challenge traffic instead of redirecting to HTTPS (ADR-0002). -- **Tests live in the same package** (`package relay`, not `relay_test`) so they can `errors.Is` against unexported sentinels if needed and reach private helpers without re-exporting. -- **Passive in-memory stores guard mutation under one RWMutex; reads return copies, never references.** `PhonesFor` allocates a fresh slice so callers do slow work (broadcast, `Send` over the network) without holding the registry lock. Adopted in `internal/relay/registry.go` (#3); the same shape applies to any future "shared map of conns" type. -- **Interface methods called under the lock are documented as non-blocking getters.** The registry's `Conn.ConnID()` is invoked under the write lock during `UnregisterPhone`; `Send` and `Close` are never called while the lock is held. Pattern: state the contract on the interface, never call something that could block on I/O while a mutex is held. -- **Adapters bridge interface↔library API mismatches by owning policy locally.** When a library method needs a `context.Context` but the registry's `Conn` interface doesn't take one (and shouldn't — most callers don't have a context to thread), the adapter owns its own context: derived in the constructor, cancelled by `Close`, narrowed per-call with `WithTimeout` for deadline policy. Adopted in `WSConn` (#15); avoids forcing context-plumbing changes into upstream interfaces. -- **Handler factories return `http.Handler`, not `http.HandlerFunc`.** `NewHealthzHandler(reg, version, startedAt) http.Handler` keeps construction factory-shaped so a future ticket adding per-handler state (logger, injectable clock) can do so without touching the call site in `main`. Adopted in `/healthz` (#10). -- **Validate adversarial input *before* allocating WS state.** Header-gate checks (presence + non-empty) run before `websocket.Accept`; a malformed request never receives an upgrade response and the registry is untouched on every error path. Adopted in `/v1/server` (#16) and `/v1/client` (#5). -- **Application WS close codes are emitted on the underlying `*websocket.Conn`, not via `WSConn`.** `WSConn.Close()` always emits `StatusNormalClosure` — no policy parameters. Handlers that need `4409` / `4401` / `4404` call `c.Close(websocket.StatusCode(code), reason)` directly in the stillborn-WSConn window (after construction, before any `Send`). See [ADR-0005](knowledge/decisions/0005-application-close-codes-via-underlying-conn.md). Adopted in `/v1/server` `4409` (#16) and `/v1/client` `4404` (#5). -- **`defer { ScheduleRelease; Close; log }` is registered AFTER the successful claim**, never before. A conflict path (or any pre-claim error) returns before the defer is in place, so we never schedule a release on a slot we don't hold. Originally `ReleaseServer` (#16); swapped to `ScheduleReleaseServer(serverID, grace)` in #21 so a quick reconnect lands back in the same slot. The structural rule sharpened in #21 — pre-grace this was a benign no-op when violated; now violating it would arm a stray timer for an id the handler never owned. Adopted on `/v1/client`'s phone register/unregister in #5: defer is registered after `RegisterPhone` returns nil, so the no-server (`4404`) path never tries to unregister a slot we never claimed. -- **Policy values live at the wiring site, not as package-level constants.** `30*time.Second` for the grace window is a literal in `cmd/pyrycode-relay/main.go`, threaded into `ServerHandler` as a constructor parameter — it appears exactly once, the value is policy (matches protocol spec), and inlining keeps the protocol-spec linkage visible where the relay is composed. Tests pass ms-scale durations through the same parameter. Promote to a package-level constant only when a second wiring entry point needs the same value. Adopted in `/v1/server` grace duration (#21). -- **Pointer-identity for stale `time.AfterFunc` fires.** `time.Timer.Stop()` returns false if the timer's func has already started executing. Wrap each pending timer in a small struct and store the wrapper pointer in a map; the `AfterFunc` closure captures the wrapper pointer and asserts `map[key] == self` under the lock before acting. If a faster goroutine replaced the entry, the pointer no longer matches and the closure no-ops. Capturing the `*time.Timer` directly forces a self-referential local var (assigned after `AfterFunc` returns) which trips the race detector under stress; the wrapper avoids that. Adopted in `Registry.ScheduleReleaseServer` (#20). Same shape applies to any "one cancellable timer per key" pattern. -- **Credentials the relay does not validate are presence-checked then discarded — never logged, never put in error strings.** `/v1/client`'s `X-Pyrycode-Token` is opaque to the relay (the binary owns verification). The handler reads the token into a local string, branches on `!= ""`, and lets the local go out of scope unread. The log-event field set is enumerated explicitly in the doc; the token name does not appear in any `slog` call, `fmt.Errorf`, or response body. Defence is layered: spec, code review, and the structural absence of any code path that uses the value after the gate. Same posture extends to any future "courier credential" the relay carries but does not own (e.g. session resume tokens). Adopted in `/v1/client` (#5). -- **Hold long-lived WS handlers open with `c.CloseRead(r.Context())` plus `<-readCtx.Done()` until the real read loop lands.** `CloseRead` drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close). The frame-loop ticket replaces both the `CloseRead` call and the `<-readCtx.Done()` block with the actual read pump in the same call site — keeping `CloseRead` alongside a real reader would race the sole-reader contract. Adopted in `/v1/server` (#16) and `/v1/client` (#5); `/v1/client` swapped to `StartPhoneForwarder` in #25, `/v1/server` swapped to `StartBinaryForwarder` in #26. Both `CloseRead` placeholders are now gone. -- **Forwarder owns reading; handler owns cleanup.** Frame forwarders are pure read pumps — no `UnregisterPhone` / `ScheduleReleaseServer`, no `wsconn.Close()`, no `cancelHB`, no extra goroutine. The forwarder runs synchronously on the HTTP handler goroutine; on return, the handler's `defer cancelHB` (#7) and `defer { release/unregister; Close; log }` (#16/#21/#5) run in LIFO order. Adding cleanup inside the forwarder would either double-close (idempotent, but muddies the lifecycle) or unregister twice. Adopted in `StartPhoneForwarder` (#25) and `StartBinaryForwarder` (#26). -- **Forwarder error policy tracks the sink fan-out.** A forwarder with **one sink** returns on sink errors (no alternative — the conn is dead): `StartPhoneForwarder` (#25) returns on `BinaryFor` miss, marshal error, and `binary.Send` error. A forwarder with **N sinks** continues on per-sink errors (other sinks are still good): `StartBinaryForwarder` (#26) logs+continues on unknown `conn_id`, malformed envelope, and `phone.Send` error. Both return on source `Read` error or ctx cancel — those are unambiguous. The shape isn't a stylistic choice; it follows from how many independent downstream lifecycles the loop is multiplexing. -- **Define narrow read-side interfaces at the consumer, not on the adapter.** `phoneSource` and `binarySource` (`ConnID() + Read(ctx)`) live in `forward.go`, not on `WSConn`. Production passes `*WSConn`; tests substitute a fake without touching `WSConn`. The two interfaces are structurally identical but named distinctly so the `Start*Forwarder` signatures read self-describingly at call sites; promote to a shared interface only if a third caller emerges. Adopted in #25 and #26. -- **Per-conn goroutines exit cleanly via LIFO defers, not by closing the conn themselves.** Pattern for any goroutine launched alongside an upgraded WS conn (heartbeat, future binary-side ack pump, etc.): derive `hbCtx, cancelHB := context.WithCancel(r.Context())`, register `defer cancelHB()` *after* the release/unregister defer, then `go runX(hbCtx, wsconn, ...)`. Under LIFO unwind `cancelHB` fires first → goroutine observes `ctx.Done()` and returns without touching the conn → release defer's `wsconn.Close()` runs unopposed. The goroutine owns ONLY the failure-path close (heartbeat: `CloseWithCode(1011, "heartbeat timeout")`); cleanup-path close is the handler defer's. `closeOnce` makes the two paths idempotent against each other regardless of which fires first. Adopted in heartbeat (#7); same shape applies to any future per-conn watchdog/timer goroutine. -- **Active-conn application close codes go through `WSConn.CloseWithCode`; stillborn-conn close codes go through the underlying `*websocket.Conn`.** ADR-0005's stillborn-conn pattern (`c.Close(websocket.StatusCode(4409), reason)` directly on the underlying conn before any `Send` could have run) coexists with ADR-0007's active-conn pattern (`wsconn.CloseWithCode(1011, "heartbeat timeout")` post-claim, where `closeOnce`/`writeMu` invariants are in play). Both share the same `closeOnce` guard via `Close()` delegating to `CloseWithCode(StatusNormalClosure, "")`. The two patterns describe structurally distinct windows of a WSConn's lifecycle. Adopted: stillborn `4409`/`4404` (#16, #5); active `1011` (#7). -- **Input-bounding policy applied at the adapter constructor, not per-handler.** When a library default is dangerously generous (nhooyr's 32 MiB `SetReadLimit` default), set the policy at the single chokepoint *both* endpoints reach — the wrapping adapter's constructor — rather than at each handler entry. `NewWSConn(c, connID, maxFrameBytes)` calls `c.SetReadLimit(maxFrameBytes)` before returning the struct, which structurally discharges "before any `Read` is performed": no goroutine other than the constructor holds a reference at that point. Distinct from the "policy values live at the wiring site" pattern (which is about *literal placement* — the value still lives as a `const` in `main`); this one is about *enforcement placement* — apply at the choke point so neither handler can forget. Adopted in `WSConn` (#29). Same shape applies to any future "bound the adversarial input at the wrapping seam" policy (e.g. max-frames-per-second, if it lands). -- **Capture process-state timestamps in `main` after `flag.Parse()`, not as package-level vars.** `startedAt := time.Now()` lives inside `main` and is passed into the handler factory. A package-level `var startedAt = time.Now()` would fire at import time — before flag parsing, before `--version` early-returns — and be wrong for short-lived test binaries and any future deferred-serve setup. Adopted in #10. -- **Digest-pin third-party base images with a `# Tracks: ` sibling comment.** `FROM image:tag@sha256:…` cannot be silently shifted by a tag-swap attack between Renovate bumps; the `# Tracks:` line tells a reviewer (or Renovate's diff) what the digest is supposed to correspond to, so a digest-changed-while-tag-comment-unchanged proposal is structurally reviewable. Same shape applies to any future supply-chain pin (action SHAs, downloaded toolchain archives) — never pin by digest alone, always pair with the human-readable tag the digest tracks. Adopted in `Dockerfile` for both `golang:1.26-bookworm` and `gcr.io/distroless/static-debian12:nonroot` (#32). -- **Belt-and-suspenders invariants survive base-layer regressions.** When a base image already enforces a security property upstream (distroless `:nonroot` sets `USER 65532`), re-state the property in our own layer (`USER nonroot:nonroot`) so a future base swap that drops the upstream property doesn't silently regress us. Both layers would need to regress simultaneously for the property to lapse. Cost is one line; benefit is regression resistance under unattended dependency updates. Adopted in the runtime stage of `Dockerfile` (#32). -- **Type-owned background goroutine: `ticker + done chan + WaitGroup`, idempotent + synchronous `Close`.** When a type owns a long-lived bookkeeping goroutine (sweep/eviction loop, not a per-conn read pump), start it from the constructor and stop it via the type's own `Close()`. Shape: `closeOnce.Do(func(){ close(done) })` then `wg.Wait()` **outside** the once — putting `wg.Wait()` inside would let a second concurrent caller return early while shutdown is still in flight. The wiring site registers `defer x.Close()` at the composition root. Adopted in `IPRateLimiter` (#50). Distinct from "Per-conn goroutines exit cleanly via LIFO defers" — that pattern is for goroutines whose lifecycle is owned by an HTTP handler; this one is for goroutines whose lifecycle is owned by the type itself. -- **When one timestamp must serve two semantic purposes, split it.** In `IPRateLimiter` (#50), refill math needs an anchor that advances in token-sized chunks (preserving fractional elapsed time for the next call) while eviction needs an anchor that updates on every poll (so an actively-hammered/denied bucket isn't dropped and recycled for a fresh burst). A single `last`-of-anything cannot do both without one purpose corrupting the other. Two fields (`refillLast`, `pollLast`) decouple them at zero structural cost and make a load-bearing security invariant (no free-burst-recycling) structurally true rather than carefully maintained. Generalisable: whenever a timestamp's "should it update here?" question gets conflicting answers from different callers, the answer is two timestamps. -- **Injectable wall-clock as an unexported `now func() time.Time` field, defaulting to `time.Now`.** When a type's correctness depends on `time.Now` and tests need determinism without sleeping, give the struct a package-private `now func() time.Time` field set to `time.Now` in the constructor; in-package tests assign a fake (`l.now = fakeClock.Now`) directly. No exported `WithClock` option, no clock interface, no constructor variants. The single eviction goroutine reads `l.now` on every tick; assignment before the first tick is race-free in practice and `-race` tests cover the steady-state. Adopted in `IPRateLimiter` (#50); applies to any future timer-driven primitive whose tests need fake-clock determinism. -- **Empty-string return as the "deny" signal on a pure security primitive — no `(string, error)` ceremony.** When every caller would handle the error identically (treat as deny), an `(value, error)` pair pushes a meaningless `if err != nil` into every call site. Instead: document the empty-string contract in the function's docstring, name the deny semantics explicitly, and let the wiring ticket enforce it in code. The helper performs no policy; the caller's `if ip == "" { deny }` is the policy. Adopted in `ClientIP` (#51). Applies to any pure security primitive whose output is keyed (rate-limit key, bucket id) and whose only failure mode is "no usable input." -- **Two helpers diverge on contract, not on mechanics → add a new export, don't mutate the existing one.** `internal/relay/server_endpoint.go`'s unexported `remoteHost` (logging helper: falls back to `r.RemoteAddr` verbatim on parse error, no XFF awareness) and the new `ClientIP` (security primitive: empty-string-as-deny, opt-in XFF trust) share the same `net.SplitHostPort` mechanics but have structurally different contracts. Mutating `remoteHost` to add XFF and change its empty-string semantics would have broken its log-line callers; adding a new export preserves both contracts and lets the wiring ticket decide whether logging migrates. Pattern: when a new caller needs the same mechanics with a *different* contract (deny semantics, validation strictness, return shape), the answer is a new export, not a parameter on the old one. Adopted in `ClientIP` (#51). +- **Credentials the relay does not validate are presence-checked then discarded** — never logged, never put in error strings. `X-Pyrycode-Token` is opaque to the relay (binary owns verification). +- **Loud failure over silent correction.** `--domain` and `--insecure-listen` are mutually exclusive and explicit. `ErrCacheDirInsecure` refuses to start on a world/group-readable cert cache rather than re-chmoding it. +- **Tests live in the same package** (`package relay`, not `relay_test`) so they can `errors.Is` against unexported sentinels. +- **Per-conn goroutines exit via LIFO defers, not by closing the conn themselves.** Handler owns cleanup; goroutines own only the failure-path close. ## Conventions - Single ticket = single commit on `feature/` named `feat(relay): (#)`. - `make vet`, `make test` (`-race`), and `make build` must all be clean before merge. -- Knowledge base lives under `docs/knowledge/` (features, decisions). One-line index in `docs/knowledge/INDEX.md`. + +## Open follow-ups + +*Human-maintained. Agents: do not edit. File new follow-ups as GitHub issues.* + +- (none currently — file new entries as GitHub issues) diff --git a/docs/archive/PROJECT-MEMORY-history-2026-05-11.md b/docs/archive/PROJECT-MEMORY-history-2026-05-11.md new file mode 100644 index 0000000..6c83c4a --- /dev/null +++ b/docs/archive/PROJECT-MEMORY-history-2026-05-11.md @@ -0,0 +1,68 @@ +# PROJECT-MEMORY history snapshot — 2026-05-11 + +Verbatim archive of the pre-2026-05-11 `docs/PROJECT-MEMORY.md` "What's built" table + "Patterns established" section. + +**Why archived.** PROJECT-MEMORY.md was drifting toward becoming a content store (per-ticket entries + per-ticket patterns). Per-ticket implementation summaries already have a per-ticket-file home in [`docs/knowledge/codebase/.md`](../knowledge/codebase/); per-ticket patterns + lessons co-locate there. The new `docs/PROJECT-MEMORY.md` is restored to its intended index shape. + +Cross-references in other docs that point at section anchors in this content still resolve here. + +--- + +## What's built + +> **FROZEN.** Pre-2026-05-10 history. Per-ticket implementation summaries live in [`docs/knowledge/codebase/.md`](knowledge/codebase/). The table below is kept for context — do not append. + +| Area | Status | Where | +|---|---|---| +| Go skeleton | Done | `cmd/pyrycode-relay/main.go`, `internal/relay/doc.go` | +| `http.Server` with explicit timeouts (gosec G114) | Done | `cmd/pyrycode-relay/main.go` | +| Routing envelope wrapper type (`Envelope`, `Marshal`, `Unmarshal`, sentinel errors) | Done (#1) | `internal/relay/envelope.go` | +| Connection registry (`Conn`, `Registry`; 1:1 binary, 1:N phones; sentinel errors for `4409`/`4404`; race-tested) | Done (#3) | `internal/relay/registry.go` | +| Registry grace-period reclaim (`ScheduleReleaseServer`; grace-aware `ClaimServer` fast path; phones closed at expiry, inherited on reclaim; pointer-identity stale-fire defence) | Done (#20) | `internal/relay/registry.go` | +| TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` | +| `WSConn` adapter (`nhooyr.io/websocket.Conn` → registry `Conn`; per-conn write mutex; `Close`-cancelled context; 10s `Send` deadline) | Done (#15) | `internal/relay/ws_conn.go` | +| `/healthz` JSON endpoint (`status`, `version`, `connected_binaries`, `connected_phones`, `uptime_seconds`; `Cache-Control: no-store`; unauthenticated) | Done (#10) | `internal/relay/healthz.go`, `cmd/pyrycode-relay/main.go` | +| WS upgrade on `/v1/server` (header gate pre-upgrade; `ClaimServer`; `4409` on conflict; `CloseRead`-held until #6; disconnect defers a 30s grace release via `ScheduleReleaseServer`) | Done (#16, #21) | `internal/relay/server_endpoint.go`, `cmd/pyrycode-relay/main.go` | +| WS upgrade on `/v1/client` (header gate pre-upgrade; `RegisterPhone`; `4404` if no binary; data path via `StartPhoneForwarder`; disconnect calls `UnregisterPhone`; token never parsed or logged) | Done (#5, #25) | `internal/relay/client_endpoint.go`, `cmd/pyrycode-relay/main.go` | +| Header validation (`x-pyrycode-server`, `x-pyrycode-version`, `user-agent` on `/v1/server`; `x-pyrycode-server`, `x-pyrycode-token`, `user-agent` on `/v1/client`; optional `x-pyrycode-device-name` on `/v1/client`) | Done (#16, #5) | `internal/relay/server_endpoint.go`, `internal/relay/client_endpoint.go` | +| Phone-side frame forwarder (`StartPhoneForwarder`: read frame → `Marshal(connID, frame)` → `BinaryFor(serverID).Send(wrapped)`; synchronous; opaque inner bytes; per-frame `BinaryFor` picks up grace-window state) | Done (#25) | `internal/relay/forward.go`, `internal/relay/client_endpoint.go` | +| `WSConn.Read` (single-caller; no `closeCtx` join — underlying `Close` aborts in-flight reads) | Done (#25) | `internal/relay/ws_conn.go` | +| Binary-side frame forwarder | Not started | — | +| `conn_id` generation scheme | Not started | — | +| Threat model doc — operational surface (deploy, supply chain, DoS, log hygiene, cert handling, TLS, error leakage) | Done (#11) | `docs/threat-model.md` | + +## Patterns established (frozen 2026-05-11) + +> **FROZEN.** Pre-2026-05-11 patterns. New per-ticket patterns live in `docs/knowledge/codebase/.md`. Do not append here. + +- **Sentinel errors, branched via `errors.Is`** for validation failures at protocol boundaries. New routing-layer code follows the `Err...` naming and wraps with `fmt.Errorf("…: %w", err, sentinel)` when adding context. +- **Opacity by type.** Inner-frame payloads are carried as `json.RawMessage`. The relay never deserialises payloads; the type makes that hard to violate accidentally. +- **Validate at the envelope boundary, not deeper.** Structural checks (presence, non-empty, JSON well-formedness) belong here; semantic checks (token validity, message kind) belong to the binary. +- **Stdlib + `golang.org/x/crypto` + `nhooyr.io/websocket`.** Direct deps arrived in #9 (`acme/autocert`) and #15 (`nhooyr.io/websocket` v1.8.x). Keep the dependency surface deliberate; new deps need a justification (the ADR is the justification). +- **Loud failure over silent correction.** `--domain` and `--insecure-listen` are mutually exclusive and explicit. `ErrCacheDirInsecure` refuses to start on a world/group-readable cert cache rather than re-chmoding it. `:80` returns `404` for non-challenge traffic instead of redirecting to HTTPS (ADR-0002). +- **Tests live in the same package** (`package relay`, not `relay_test`) so they can `errors.Is` against unexported sentinels if needed and reach private helpers without re-exporting. +- **Passive in-memory stores guard mutation under one RWMutex; reads return copies, never references.** `PhonesFor` allocates a fresh slice so callers do slow work (broadcast, `Send` over the network) without holding the registry lock. Adopted in `internal/relay/registry.go` (#3); the same shape applies to any future "shared map of conns" type. +- **Interface methods called under the lock are documented as non-blocking getters.** The registry's `Conn.ConnID()` is invoked under the write lock during `UnregisterPhone`; `Send` and `Close` are never called while the lock is held. Pattern: state the contract on the interface, never call something that could block on I/O while a mutex is held. +- **Adapters bridge interface↔library API mismatches by owning policy locally.** When a library method needs a `context.Context` but the registry's `Conn` interface doesn't take one (and shouldn't — most callers don't have a context to thread), the adapter owns its own context: derived in the constructor, cancelled by `Close`, narrowed per-call with `WithTimeout` for deadline policy. Adopted in `WSConn` (#15); avoids forcing context-plumbing changes into upstream interfaces. +- **Handler factories return `http.Handler`, not `http.HandlerFunc`.** `NewHealthzHandler(reg, version, startedAt) http.Handler` keeps construction factory-shaped so a future ticket adding per-handler state (logger, injectable clock) can do so without touching the call site in `main`. Adopted in `/healthz` (#10). +- **Validate adversarial input *before* allocating WS state.** Header-gate checks (presence + non-empty) run before `websocket.Accept`; a malformed request never receives an upgrade response and the registry is untouched on every error path. Adopted in `/v1/server` (#16) and `/v1/client` (#5). +- **Application WS close codes are emitted on the underlying `*websocket.Conn`, not via `WSConn`.** `WSConn.Close()` always emits `StatusNormalClosure` — no policy parameters. Handlers that need `4409` / `4401` / `4404` call `c.Close(websocket.StatusCode(code), reason)` directly in the stillborn-WSConn window (after construction, before any `Send`). See [ADR-0005](knowledge/decisions/0005-application-close-codes-via-underlying-conn.md). Adopted in `/v1/server` `4409` (#16) and `/v1/client` `4404` (#5). +- **`defer { ScheduleRelease; Close; log }` is registered AFTER the successful claim**, never before. A conflict path (or any pre-claim error) returns before the defer is in place, so we never schedule a release on a slot we don't hold. Originally `ReleaseServer` (#16); swapped to `ScheduleReleaseServer(serverID, grace)` in #21 so a quick reconnect lands back in the same slot. The structural rule sharpened in #21 — pre-grace this was a benign no-op when violated; now violating it would arm a stray timer for an id the handler never owned. Adopted on `/v1/client`'s phone register/unregister in #5: defer is registered after `RegisterPhone` returns nil, so the no-server (`4404`) path never tries to unregister a slot we never claimed. +- **Policy values live at the wiring site, not as package-level constants.** `30*time.Second` for the grace window is a literal in `cmd/pyrycode-relay/main.go`, threaded into `ServerHandler` as a constructor parameter — it appears exactly once, the value is policy (matches protocol spec), and inlining keeps the protocol-spec linkage visible where the relay is composed. Tests pass ms-scale durations through the same parameter. Promote to a package-level constant only when a second wiring entry point needs the same value. Adopted in `/v1/server` grace duration (#21). +- **Pointer-identity for stale `time.AfterFunc` fires.** `time.Timer.Stop()` returns false if the timer's func has already started executing. Wrap each pending timer in a small struct and store the wrapper pointer in a map; the `AfterFunc` closure captures the wrapper pointer and asserts `map[key] == self` under the lock before acting. If a faster goroutine replaced the entry, the pointer no longer matches and the closure no-ops. Capturing the `*time.Timer` directly forces a self-referential local var (assigned after `AfterFunc` returns) which trips the race detector under stress; the wrapper avoids that. Adopted in `Registry.ScheduleReleaseServer` (#20). Same shape applies to any "one cancellable timer per key" pattern. +- **Credentials the relay does not validate are presence-checked then discarded — never logged, never put in error strings.** `/v1/client`'s `X-Pyrycode-Token` is opaque to the relay (the binary owns verification). The handler reads the token into a local string, branches on `!= ""`, and lets the local go out of scope unread. The log-event field set is enumerated explicitly in the doc; the token name does not appear in any `slog` call, `fmt.Errorf`, or response body. Defence is layered: spec, code review, and the structural absence of any code path that uses the value after the gate. Same posture extends to any future "courier credential" the relay carries but does not own (e.g. session resume tokens). Adopted in `/v1/client` (#5). +- **Hold long-lived WS handlers open with `c.CloseRead(r.Context())` plus `<-readCtx.Done()` until the real read loop lands.** `CloseRead` drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close). The frame-loop ticket replaces both the `CloseRead` call and the `<-readCtx.Done()` block with the actual read pump in the same call site — keeping `CloseRead` alongside a real reader would race the sole-reader contract. Adopted in `/v1/server` (#16) and `/v1/client` (#5); `/v1/client` swapped to `StartPhoneForwarder` in #25, `/v1/server` swapped to `StartBinaryForwarder` in #26. Both `CloseRead` placeholders are now gone. +- **Forwarder owns reading; handler owns cleanup.** Frame forwarders are pure read pumps — no `UnregisterPhone` / `ScheduleReleaseServer`, no `wsconn.Close()`, no `cancelHB`, no extra goroutine. The forwarder runs synchronously on the HTTP handler goroutine; on return, the handler's `defer cancelHB` (#7) and `defer { release/unregister; Close; log }` (#16/#21/#5) run in LIFO order. Adding cleanup inside the forwarder would either double-close (idempotent, but muddies the lifecycle) or unregister twice. Adopted in `StartPhoneForwarder` (#25) and `StartBinaryForwarder` (#26). +- **Forwarder error policy tracks the sink fan-out.** A forwarder with **one sink** returns on sink errors (no alternative — the conn is dead): `StartPhoneForwarder` (#25) returns on `BinaryFor` miss, marshal error, and `binary.Send` error. A forwarder with **N sinks** continues on per-sink errors (other sinks are still good): `StartBinaryForwarder` (#26) logs+continues on unknown `conn_id`, malformed envelope, and `phone.Send` error. Both return on source `Read` error or ctx cancel — those are unambiguous. The shape isn't a stylistic choice; it follows from how many independent downstream lifecycles the loop is multiplexing. +- **Define narrow read-side interfaces at the consumer, not on the adapter.** `phoneSource` and `binarySource` (`ConnID() + Read(ctx)`) live in `forward.go`, not on `WSConn`. Production passes `*WSConn`; tests substitute a fake without touching `WSConn`. The two interfaces are structurally identical but named distinctly so the `Start*Forwarder` signatures read self-describingly at call sites; promote to a shared interface only if a third caller emerges. Adopted in #25 and #26. +- **Per-conn goroutines exit cleanly via LIFO defers, not by closing the conn themselves.** Pattern for any goroutine launched alongside an upgraded WS conn (heartbeat, future binary-side ack pump, etc.): derive `hbCtx, cancelHB := context.WithCancel(r.Context())`, register `defer cancelHB()` *after* the release/unregister defer, then `go runX(hbCtx, wsconn, ...)`. Under LIFO unwind `cancelHB` fires first → goroutine observes `ctx.Done()` and returns without touching the conn → release defer's `wsconn.Close()` runs unopposed. The goroutine owns ONLY the failure-path close (heartbeat: `CloseWithCode(1011, "heartbeat timeout")`); cleanup-path close is the handler defer's. `closeOnce` makes the two paths idempotent against each other regardless of which fires first. Adopted in heartbeat (#7); same shape applies to any future per-conn watchdog/timer goroutine. +- **Active-conn application close codes go through `WSConn.CloseWithCode`; stillborn-conn close codes go through the underlying `*websocket.Conn`.** ADR-0005's stillborn-conn pattern (`c.Close(websocket.StatusCode(4409), reason)` directly on the underlying conn before any `Send` could have run) coexists with ADR-0007's active-conn pattern (`wsconn.CloseWithCode(1011, "heartbeat timeout")` post-claim, where `closeOnce`/`writeMu` invariants are in play). Both share the same `closeOnce` guard via `Close()` delegating to `CloseWithCode(StatusNormalClosure, "")`. The two patterns describe structurally distinct windows of a WSConn's lifecycle. Adopted: stillborn `4409`/`4404` (#16, #5); active `1011` (#7). +- **Input-bounding policy applied at the adapter constructor, not per-handler.** When a library default is dangerously generous (nhooyr's 32 MiB `SetReadLimit` default), set the policy at the single chokepoint *both* endpoints reach — the wrapping adapter's constructor — rather than at each handler entry. `NewWSConn(c, connID, maxFrameBytes)` calls `c.SetReadLimit(maxFrameBytes)` before returning the struct, which structurally discharges "before any `Read` is performed": no goroutine other than the constructor holds a reference at that point. Distinct from the "policy values live at the wiring site" pattern (which is about *literal placement* — the value still lives as a `const` in `main`); this one is about *enforcement placement* — apply at the choke point so neither handler can forget. Adopted in `WSConn` (#29). Same shape applies to any future "bound the adversarial input at the wrapping seam" policy (e.g. max-frames-per-second, if it lands). +- **Capture process-state timestamps in `main` after `flag.Parse()`, not as package-level vars.** `startedAt := time.Now()` lives inside `main` and is passed into the handler factory. A package-level `var startedAt = time.Now()` would fire at import time — before flag parsing, before `--version` early-returns — and be wrong for short-lived test binaries and any future deferred-serve setup. Adopted in #10. +- **Digest-pin third-party base images with a `# Tracks: ` sibling comment.** `FROM image:tag@sha256:…` cannot be silently shifted by a tag-swap attack between Renovate bumps; the `# Tracks:` line tells a reviewer (or Renovate's diff) what the digest is supposed to correspond to, so a digest-changed-while-tag-comment-unchanged proposal is structurally reviewable. Same shape applies to any future supply-chain pin (action SHAs, downloaded toolchain archives) — never pin by digest alone, always pair with the human-readable tag the digest tracks. Adopted in `Dockerfile` for both `golang:1.26-bookworm` and `gcr.io/distroless/static-debian12:nonroot` (#32). +- **Belt-and-suspenders invariants survive base-layer regressions.** When a base image already enforces a security property upstream (distroless `:nonroot` sets `USER 65532`), re-state the property in our own layer (`USER nonroot:nonroot`) so a future base swap that drops the upstream property doesn't silently regress us. Both layers would need to regress simultaneously for the property to lapse. Cost is one line; benefit is regression resistance under unattended dependency updates. Adopted in the runtime stage of `Dockerfile` (#32). +- **Type-owned background goroutine: `ticker + done chan + WaitGroup`, idempotent + synchronous `Close`.** When a type owns a long-lived bookkeeping goroutine (sweep/eviction loop, not a per-conn read pump), start it from the constructor and stop it via the type's own `Close()`. Shape: `closeOnce.Do(func(){ close(done) })` then `wg.Wait()` **outside** the once — putting `wg.Wait()` inside would let a second concurrent caller return early while shutdown is still in flight. The wiring site registers `defer x.Close()` at the composition root. Adopted in `IPRateLimiter` (#50). Distinct from "Per-conn goroutines exit cleanly via LIFO defers" — that pattern is for goroutines whose lifecycle is owned by an HTTP handler; this one is for goroutines whose lifecycle is owned by the type itself. +- **When one timestamp must serve two semantic purposes, split it.** In `IPRateLimiter` (#50), refill math needs an anchor that advances in token-sized chunks (preserving fractional elapsed time for the next call) while eviction needs an anchor that updates on every poll (so an actively-hammered/denied bucket isn't dropped and recycled for a fresh burst). A single `last`-of-anything cannot do both without one purpose corrupting the other. Two fields (`refillLast`, `pollLast`) decouple them at zero structural cost and make a load-bearing security invariant (no free-burst-recycling) structurally true rather than carefully maintained. Generalisable: whenever a timestamp's "should it update here?" question gets conflicting answers from different callers, the answer is two timestamps. +- **Injectable wall-clock as an unexported `now func() time.Time` field, defaulting to `time.Now`.** When a type's correctness depends on `time.Now` and tests need determinism without sleeping, give the struct a package-private `now func() time.Time` field set to `time.Now` in the constructor; in-package tests assign a fake (`l.now = fakeClock.Now`) directly. No exported `WithClock` option, no clock interface, no constructor variants. The single eviction goroutine reads `l.now` on every tick; assignment before the first tick is race-free in practice and `-race` tests cover the steady-state. Adopted in `IPRateLimiter` (#50); applies to any future timer-driven primitive whose tests need fake-clock determinism. +- **Empty-string return as the "deny" signal on a pure security primitive — no `(string, error)` ceremony.** When every caller would handle the error identically (treat as deny), an `(value, error)` pair pushes a meaningless `if err != nil` into every call site. Instead: document the empty-string contract in the function's docstring, name the deny semantics explicitly, and let the wiring ticket enforce it in code. The helper performs no policy; the caller's `if ip == "" { deny }` is the policy. Adopted in `ClientIP` (#51). Applies to any pure security primitive whose output is keyed (rate-limit key, bucket id) and whose only failure mode is "no usable input." +- **Two helpers diverge on contract, not on mechanics → add a new export, don't mutate the existing one.** `internal/relay/server_endpoint.go`'s unexported `remoteHost` (logging helper: falls back to `r.RemoteAddr` verbatim on parse error, no XFF awareness) and the new `ClientIP` (security primitive: empty-string-as-deny, opt-in XFF trust) share the same `net.SplitHostPort` mechanics but have structurally different contracts. Mutating `remoteHost` to add XFF and change its empty-string semantics would have broken its log-line callers; adding a new export preserves both contracts and lets the wiring ticket decide whether logging migrates. Pattern: when a new caller needs the same mechanics with a *different* contract (deny semantics, validation strictness, return shape), the answer is a new export, not a parameter on the old one. Adopted in `ClientIP` (#51).