Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/PROJECT-MEMORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
| `http.Server` with explicit timeouts (gosec G114) | Done | `cmd/pyrycode-relay/main.go` |
| Routing envelope wrapper type (`Envelope`, `Marshal`, `Unmarshal`, sentinel errors) | Done (#1) | `internal/relay/envelope.go` |
| Connection registry (`Conn`, `Registry`; 1:1 binary, 1:N phones; sentinel errors for `4409`/`4404`; race-tested) | Done (#3) | `internal/relay/registry.go` |
| Registry grace-period reclaim (`ScheduleReleaseServer`; grace-aware `ClaimServer` fast path; phones closed at expiry, inherited on reclaim; pointer-identity stale-fire defence) | Done (#20) | `internal/relay/registry.go` |
| TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` |
| `WSConn` adapter (`nhooyr.io/websocket.Conn` → registry `Conn`; per-conn write mutex; `Close`-cancelled context; 10s `Send` deadline) | Done (#15) | `internal/relay/ws_conn.go` |
| `/healthz` JSON endpoint (`status`, `version`, `connected_binaries`, `connected_phones`, `uptime_seconds`; `Cache-Control: no-store`; unauthenticated) | Done (#10) | `internal/relay/healthz.go`, `cmd/pyrycode-relay/main.go` |
Expand All @@ -35,6 +36,7 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
- **Validate adversarial input *before* allocating WS state.** Header-gate checks (presence + non-empty) run before `websocket.Accept`; a malformed request never receives an upgrade response and the registry is untouched on every error path. Adopted in `/v1/server` (#16); the same shape applies to `/v1/client` token validation (#5).
- **Application WS close codes are emitted on the underlying `*websocket.Conn`, not via `WSConn`.** `WSConn.Close()` always emits `StatusNormalClosure` — no policy parameters. Handlers that need `4409` / `4401` / `4404` call `c.Close(websocket.StatusCode(code), reason)` directly in the stillborn-WSConn window (after construction, before any `Send`). See [ADR-0005](knowledge/decisions/0005-application-close-codes-via-underlying-conn.md). Adopted in `/v1/server` (#16); applies to `/v1/client` (#5).
- **`defer { Release; Close; log }` is registered AFTER the successful claim**, never before. A conflict path (or any pre-claim error) returns before the defer is in place, so we never `ReleaseServer` on a slot we don't hold. Adopted in `/v1/server` (#16); same structural ordering applies to `/v1/client`'s phone registration.
- **Pointer-identity for stale `time.AfterFunc` fires.** `time.Timer.Stop()` returns false if the timer's func has already started executing. Wrap each pending timer in a small struct and store the wrapper pointer in a map; the `AfterFunc` closure captures the wrapper pointer and asserts `map[key] == self` under the lock before acting. If a faster goroutine replaced the entry, the pointer no longer matches and the closure no-ops. Capturing the `*time.Timer` directly forces a self-referential local var (assigned after `AfterFunc` returns) which trips the race detector under stress; the wrapper avoids that. Adopted in `Registry.ScheduleReleaseServer` (#20). Same shape applies to any "one cancellable timer per key" pattern.
- **Hold long-lived WS handlers open with `c.CloseRead(r.Context())` plus `<-readCtx.Done()` until the real read loop lands.** `CloseRead` drains-and-discards frames (including control frames — pings must be processed for the connection to observe a peer-side close). The frame loop ticket replaces `<-readCtx.Done()` with the actual read body in the same call site. Adopted in `/v1/server` (#16) pending #6.
- **Capture process-state timestamps in `main` after `flag.Parse()`, not as package-level vars.** `startedAt := time.Now()` lives inside `main` and is passed into the handler factory. A package-level `var startedAt = time.Now()` would fire at import time — before flag parsing, before `--version` early-returns — and be wrong for short-lived test binaries and any future deferred-serve setup. Adopted in #10.

Expand Down
3 changes: 2 additions & 1 deletion docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,12 +7,13 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o
- [`/v1/server` WS upgrade](features/server-endpoint.md) — binary-side ingress: validates `X-Pyrycode-Server` / `X-Pyrycode-Version` / `User-Agent` pre-upgrade, claims the slot, emits `4409` on conflict, holds the conn via `CloseRead` until #6's frame loop replaces it.
- [`/healthz` JSON endpoint](features/healthz.md) — unauthenticated `GET /healthz` returning `{status, version, connected_binaries, connected_phones, uptime_seconds}`; `Cache-Control: no-store`, body bounded ≈135 bytes.
- [WSConn adapter](features/ws-conn-adapter.md) — wraps `nhooyr.io/websocket.Conn` to satisfy the registry's `Conn`; owns the per-conn write mutex and a `Close`-cancelled context with a 10s per-`Send` deadline.
- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`.
- [Connection registry](features/connection-registry.md) — thread-safe `Registry` (server-id → binary 1:1, server-id → phones 1:N) with `Conn` interface, snapshot-returning `PhonesFor`, sentinel errors for `4409` / `4404`, deferred-release `ScheduleReleaseServer` with grace-window reclaim semantics.
- [Autocert TLS](features/autocert-tls.md) — `--domain` mode wiring: `:443` WSS via Let's Encrypt + `:80` ACME http-01, host gates, cache-dir permission policy, TLS 1.2 floor.
- [Routing envelope](features/routing-envelope.md) — typed Go wrapper (`Envelope`, `Marshal`, `Unmarshal`) for the `{conn_id, frame}` wire shape exchanged between relay and binary.

## Decisions

- [ADR-0006: Grace window IS the reclaim path](decisions/0006-grace-period-as-reclaim-path.md) — `ClaimServer` during a pending grace timer succeeds (not `4409`); pointer-identity wrapper defends against stale `time.AfterFunc` fires after `Stop()`.
- [ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`](decisions/0005-application-close-codes-via-underlying-conn.md) — `WSConn.Close()` stays minimal (`StatusNormalClosure`); handlers emit `4409` / `4401` / `4404` directly on the `*websocket.Conn` in stillborn-WSConn paths.
- [ADR-0004: WS library choice and adapter context strategy](decisions/0004-ws-library-and-adapter-context-strategy.md) — `nhooyr.io/websocket` v1.8.x; adapter-owned context cancelled by `Close`, 10s per-`Send` deadline; registry's `Send` signature stays context-free.
- [ADR-0003: Connection registry as a passive store](decisions/0003-connection-registry-passive-store.md) — single RWMutex, snapshot returns, no callbacks, narrow `ReleaseServer` so grace logic (#8) can wrap it.
Expand Down
73 changes: 73 additions & 0 deletions docs/knowledge/decisions/0006-grace-period-as-reclaim-path.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
# ADR-0006: Binary disconnect grace window IS the reclaim path

**Status:** Accepted (#20)
**Date:** 2026-05-08

## Context

The protocol spec ([`protocol-mobile.md` § Authentication → Binary → relay](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#binary--relay)) requires that when a binary's holding connection drops, its server-id slot is held for 30 seconds before being released. The motivation is operational: a transient drop followed by a quick reconnect should land the binary back in the *same* slot with its phones still attached — instead of forcing every phone to receive a no-server close and re-handshake.

The registry built in #3 ([ADR-0003](0003-connection-registry-passive-store.md)) deliberately left grace logic out. `ReleaseServer` is narrow and immediate; orphan phones survive a release. The doc-comment foreshadowed this ticket.

Three shapes were on the table for how the grace window should interact with a *competing* claim during the window:

1. **Block reclaims during grace.** A reclaim within the window returns `ErrServerIDConflict` (or a new `ErrGracePending`). The original binary must wait for the timer to fire, then re-claim cleanly.
2. **Reclaim only by the original binary, by some identity match.** Compare the new conn's identity with the prior one and accept only if they match.
3. **Grace window IS the reclaim path.** Any `ClaimServer` during the window succeeds, replaces the binary `Conn` atomically, cancels the timer, and inherits the phones registered for that slot.

## Decision

**Option 3.** During the grace window:

- The binary entry stays in the registry, pointing at the (now-closed) prior `Conn`. `BinaryFor` returns it; `Send` through it errors per the underlying impl's contract.
- `RegisterPhone` continues to succeed — the slot is still considered claimed.
- `ClaimServer(serverID, conn)` for the same `serverID` returns `nil`, replaces the binary `Conn`, and `Stop()`s the pending timer. Phones registered during the grace window are inherited by the new binary.

If the timer fires (no reclaim within the window), the binary entry is removed *and* every phone for that server-id is removed and `Close()`-called outside the registry's lock.

The new method is `ScheduleReleaseServer(serverID, d time.Duration)`. The synchronous `ReleaseServer` is unchanged. Adopted in `internal/relay/registry.go` (#20). The `/v1/server` handler call-site swap lives in #21.

## Rationale

### Why not block reclaims (option 1)

The protocol spec explicitly motivates the grace window with "a quick reconnect should land the binary back in the same slot." A 30-second hard block contradicts that goal: a binary that disconnects and immediately retries — the canonical fast-reconnect case — would receive `4409` and have to back off for 30 seconds before its slot becomes available. That's worse for the user than the no-grace baseline.

### Why not identity-match reclaims (option 2)

The relay has no durable identity for a binary. The `x-pyrycode-server` header is just a string; the relay cannot distinguish "the same binary reconnecting" from "a different binary with the same configured server-id." Any identity check would either reduce to the header (which is already the lookup key) or require new credential plumbing. The protocol spec doesn't mandate identity verification at this layer; that's the binary's auth concern.

The pragmatic answer: anyone who can pass the `/v1/server` header gate ([#16](../features/server-endpoint.md)) is already authoritative for that server-id. If a competing claim arrives during a grace window, treating it as the legitimate reclaim is the same trust model as the pre-grace baseline (where claims succeed once the slot is free). The grace window doesn't *expand* the trust surface.

### Why option 3 is structurally clean

- **No new error sentinel.** The grace path produces the same return shape as a normal claim (`nil`). Callers don't branch.
- **Phones survive a reclaim with no separate dance.** The registry already keeps phones across `ReleaseServer` (the orphan-phones invariant from ADR-0003). The grace path simply doesn't run the cleanup that the timer would have run; phones land on the new binary by virtue of never having been removed.
- **`Counts` stays meaningful.** A server-id in its grace window contributes `1` to the binary count and its live phone count — matching the operational mental model ("we still consider this binary connected; we are about to find out").

### Pointer-identity stale-fire defence

`time.AfterFunc(d, fn)`'s `Stop()` returns false if the timer has already fired *or* its `fn` is currently executing. The "currently executing" case is the race: another goroutine takes the registry lock first (e.g. `ClaimServer`'s grace fast path), `Stop()`s the timer (returns false because `fn` already started), replaces the entry, and returns. The `fn` is meanwhile blocked on the lock. When it acquires the lock, naïvely it would delete the binary entry and close the new claimant's phones.

The defence: every armed timer is wrapped in a `*graceEntry` and stored in `r.timers[serverID]`. The `time.AfterFunc` closure captures the `*graceEntry` pointer. On fire, the handler checks `r.timers[serverID] == self` under the lock — if `ClaimServer` (or a later `ScheduleReleaseServer`) replaced the entry, the pointer no longer matches and the handler returns without touching state.

The wrapper indirection (`graceEntry` rather than `*time.Timer` directly) is so the `AfterFunc` closure can capture a stable pointer at construction; capturing `*time.Timer` requires the local var to be assigned *after* `AfterFunc` returns, which both creates a self-reference and trips the race detector under stress.

### Why the duration is a parameter, not a constant

The protocol spec says 30 seconds; the handler in #21 passes `30*time.Second`. Tests use ms-scale durations to keep the suite fast. Pushing the constant into the registry would force tests to either wait 30s or expose a knob — both worse than just taking it as an argument. The registry trusts the duration; degenerate values (`d <= 0` fires immediately, `d` near `MaxInt64` never fires) are safe.

## Consequences

- **The reclaim path is opt-in for callers.** `ReleaseServer` (immediate) is still there for the "claim failed, clean up" use cases that don't want a timer. Only the disconnect path in `/v1/server` (and equivalents) calls `ScheduleReleaseServer`.
- **Phones registered during a grace window have asymmetric outcomes.** Reclaim → they survive on the new binary. Expiry → they get `Close()`d alongside the original phones. Documented in the registry doc-comments and the feature doc; bounded only by per-server caps (#5's concern).
- **The expiry handler is the only goroutine the registry spawns.** Lifetime is bounded by `d`; the `Stop()` + pointer-identity check rules out leaks. The registry itself is still passive — the goroutine is `time.AfterFunc`'s, not a long-running registry worker.
- **Phone-side close code on grace expiry is `1000`, not protocol's `1011`.** ADR-0005 fixes `WSConn.Close()` to `StatusNormalClosure`. Adopting `1011 "binary did not reconnect"` requires extending the `Conn` interface or adding a parallel close primitive. Phones treat any close as "reconnect" — user-visible behaviour is correct. Tracked as a follow-up.
- **`Counts` does not distinguish "live" from "in grace."** Operators reading `/healthz` see the binary as still claimed during the grace window. If a future ticket needs grace-window metrics, that's a separate field on `Counts` (or a sibling).
- **`ScheduleReleaseServer` on an unheld server-id is a defensive no-op.** It arms a timer that, on fire, deletes whatever happens to be present (nothing). Same shape as `ReleaseServer("unknown")` returning `false`. Documented to head off a "binary not found" return value.

## Related

- [ADR-0003: Connection registry as a passive store](0003-connection-registry-passive-store.md) — the orphan-phones invariant that makes option 3 work.
- [ADR-0005: Application WS close codes](0005-application-close-codes-via-underlying-conn.md) — why `Close()` on grace expiry emits `1000` rather than the protocol's `1011`.
- [Connection registry feature](../features/connection-registry.md) — current API surface including `ScheduleReleaseServer`.
Loading