From c1a2a62c02eedf9245a4a1ced7dd7c9f416f537c Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 10:27:17 +0300 Subject: [PATCH 1/3] spec: frame-forwarded and grace-expiry counters (#58) --- .../58-frame-and-grace-counters.md | 678 ++++++++++++++++++ 1 file changed, 678 insertions(+) create mode 100644 docs/specs/architecture/58-frame-and-grace-counters.md diff --git a/docs/specs/architecture/58-frame-and-grace-counters.md b/docs/specs/architecture/58-frame-and-grace-counters.md new file mode 100644 index 0000000..e458acd --- /dev/null +++ b/docs/specs/architecture/58-frame-and-grace-counters.md @@ -0,0 +1,678 @@ +# Spec — `frames_forwarded_total` + `grace_expiries_total` counters (#58) + +## Files to read first + +- `internal/relay/registry.go:154-189` — `ScheduleReleaseServer` arms the + timer; `handleGraceExpiry` is the body that runs on `time.AfterFunc`'s + goroutine. Line 177 is the pointer-identity guard whose success branch + this spec hooks; the stale-fire `return` at line 179 must NOT touch the + hook. +- `internal/relay/registry.go:56-79` — `Registry` struct + constructor. + The new hook fields live here; constructor doesn't change. +- `internal/relay/forward.go:32-74` — `StartPhoneForwarder` loop. The + successful-path is line 66 (`binary.Send(wrapped)` returns nil); the + increment hook fires immediately after. +- `internal/relay/forward.go:110-158` — `StartBinaryForwarder` loop. The + successful-path is line 150 (`phone.Send(env.Frame)` returns nil); the + increment hook fires immediately after. The three `continue` paths + (unmarshal err, unknown conn_id, phone send err) MUST NOT increment — + re-read each branch to confirm the placement. +- `internal/relay/forward_test.go:118-160` — `fakeBinary` + `fakePhone` + shapes the new metrics tests reuse via `package relay` access. Same + package, no import or duplication needed. +- `internal/relay/forward_test.go:195-243` — `TestStartPhoneForwarder_ForwardsFramesBytewise` + is the shape reference for the phone→binary success-path test: stage + a binary in the registry via `ClaimServer`, drive frames into the + fake phone, assert the binary received them. Adapt: also construct + `*forwardMetrics`, set the hook on the registry, scrape and assert + the counter. +- `internal/relay/forward_test.go:419-491` — three "binary forwarder + drops + continues" tests (multiple-phones, unknown conn_id, malformed + envelope, phone send err). The new test mirrors their shape but asserts + the counter increment count, not the dropped-frame count. +- `internal/relay/registry_test.go:413-451` — `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles`. + Shape reference for the new grace-stale-fire counter test: cancel-and- + replace race window, assert the counter is unchanged on the stale fire, + exactly +1 on the real eviction. The same `fakeConn` (same package) is + reused without duplication. +- `internal/relay/metrics_connections.go` (whole file, ~50 lines) — the + per-concern collector pattern this ticket follows. Constructor takes + `prometheus.Registerer` + `*Registry`; registers against the registerer; + holds no state beyond the metric descriptors. The forward + grace + counter constructors mirror this shape. +- `internal/relay/metrics_connections_test.go:241-303` — `assertGauge` + helper. The new test file's `assertCounter` helper has the same shape + (scrape, substring-match a literal text-format line, e.g. + `pyrycode_relay_grace_expiries_total 1`). +- `cmd/pyrycode-relay/main.go:97-110` — current metrics wiring block. + Two new constructor calls (`NewForwardMetrics`, `NewGraceMetrics`) land + between `NewConnectionsMetrics` (line 102) and the public-listener mux + construction. Each call BOTH registers the counter against `metricsReg` + AND wires the hook on `reg`. +- `docs/specs/architecture/61-connected-gauges.md` § *Why pull-based, not + push-based* + § *Artefacts → metrics_connections.go* — pattern reference. + This ticket's counters are push-shaped (counters of events, not + gauges-of-state); the seam admits both shapes per #59's spec. +- `docs/specs/architecture/59-metrics-scaffolding.md` § *The architect's + "package-vars-vs-struct" pick* — the per-concern collector struct pattern. + Two files (one per concern) is the precedent; this ticket follows it + with two prod files for two metrics. +- `docs/PROJECT-MEMORY.md` line 24 — *"Validate at the envelope boundary, + not deeper."* Negatively relevant: the `direction` label is hard-coded + to two strings (`phone_to_binary`, `binary_to_phone`); neither value is + attacker-influenced — load-bearing for security review § Tokens. +- `docs/threat-model.md` § *Log hygiene* (extends to metric labels per + #59's spec) — the `direction` label is the only label this ticket + introduces, and its values are constants. + +## Context + +Third slice of the metrics rollout (split from #37). #59 landed the +registry scaffolding; #57 (concurrent) adds upgrade/register counters; +#61 added the connected-* gauges; #60 wired the loopback listener. This +ticket lands the **forwarder frame counter** (label `direction`) and the +**grace-expiry counter** (no labels). + +The single load-bearing design choice the AC hands the architect: + +> The frame counter increments on the hot path inside +> `StartPhoneForwarder` / `StartBinaryForwarder` — one increment per +> forwarded frame. … the increment goes per successful `Send`, never on +> the loop iteration. +> +> The grace-expiry counter fires inside the `time.AfterFunc` callback in +> `Registry.ScheduleReleaseServer` — increment after the stale-fire +> pointer-identity guard returns true (i.e. only on a real eviction, +> never on stale fires). + +Both increment sites live in code that does NOT (and per ADR-0008 *Scope +of use* MUST NOT) import `prometheus`. So the design problem is: how to +plumb a `prometheus.Counter` to a call site inside `registry.go` and to +the forwarder body in `forward.go` without (a) importing prometheus into +those files and (b) starting a 10-call-site signature cascade through +`ClientHandler`, `ServerHandler`, `StartPhoneForwarder`, and every test +that constructs the handlers. + +This spec picks **nil-safe hooks on `*Registry`**. The justification is +the bulk of § Design. + +## Design + +### Why hooks-on-Registry, not parameter threading + +The two natural alternatives, with the cascade math each implies: + +1. **Parameter threading** (`StartPhoneForwarder` takes a + `*forwardMetrics`; `ClientHandler` takes one; `main.go` constructs + and threads it through). Concrete cascade: + - `ClientHandler` signature → 1 file + 4 call sites (3 tests + main.go) + - `ServerHandler` signature → 1 file + 4 call sites (3 tests + main.go) + - `StartPhoneForwarder` / `StartBinaryForwarder` signatures → 2 sites + in forward_test.go + - Total ≥ 10 call sites needing simultaneous updates, ≥ 7 files + modified. + - This is right at the edit-fan-out red line in + `architect/CLAUDE.md` § 1. The rationalization "they're mechanical + `, nil` appends" is exactly the smell the red line covers. Split or + redesign. + +2. **Package-level vars** (e.g. `var forwardCounter *prometheus.CounterVec` + set by `NewForwardMetrics`, read by the forwarder). #59's spec rejects + this category — a singleton-shaped seam forces one registry per + process, which forbids per-test registries and per-test isolation + under `-race`. + +The adopted shape — **nil-safe `func()` hook fields on `*Registry`** — +avoids both problems: + +- `registry.go` and `forward.go` import nothing new (the hooks are + `func()`, not `prometheus.Counter`). ADR-0008's *Scope of use* + boundary holds. +- No signature changes anywhere. Zero cascade through handlers, zero + cascade through tests. +- `*Registry` is already the dependency every forwarder, handler, and + test threads around the codebase. Adding three private function-pointer + fields adds zero new surface to the call graph. +- Tests construct their own registry, their own counter, set the hook, + and scrape. Existing tests are unaffected because all hooks default to + nil (no-op on call). +- The grace counter site is structurally identical: the hook fires from + inside `handleGraceExpiry`'s success branch, after the pointer-identity + guard, ensuring stale fires do not increment. + +The asymmetry-cost (Registry's surface gains hook fields it does not +"own") is a one-shot cost paid once for the whole metrics rollout. Any +future per-Registry-event counter (e.g. `phones_registered_total`) lands +the same way — one private hook + one setter + one nil-safe call site. + +### Artefacts + +Three modified files, three new files. Hook fields on `*Registry` are +private; the setter methods are exported because main.go (different +package) needs to call them. Counter values' `direction` label is a +hard-coded constant set { `phone_to_binary`, `binary_to_phone` } — no +external input reaches it. + +#### 1. `internal/relay/registry.go` (modified) + +- **Add three private fields** to the `Registry` struct (near the existing + `mu`, `binaries`, `phones`, `timers` block): + + - `onPhoneForwarded func()` — invoked by `StartPhoneForwarder` after + a successful `binary.Send`. Nil = no-op. + - `onBinaryForwarded func()` — invoked by `StartBinaryForwarder` after + a successful `phone.Send`. Nil = no-op. + - `onGraceExpiry func()` — invoked by `handleGraceExpiry` after the + pointer-identity guard's success branch. Nil = no-op. + + Doc comments on each: *"Set once at boot via `SetForwarderHooks` / + `SetGraceExpiryHook` before either listener starts serving. Nil-safe + by design. Called outside any lock so the hook body can do its own + locking; the hook MUST NOT acquire `r.mu` (deadlock risk). Concrete + hook bodies in `metrics_forward.go` and `metrics_grace.go` are pure + `prometheus.Counter.Inc` / `.WithLabelValues(...).Inc` calls — the + library handles its own atomicity, no relay-side lock involved."* + +- **Add two setter methods** to `Registry`: + + ```go + // SetForwarderHooks installs the two per-frame increment hooks. Either + // (or both) may be nil for no-op. Call once at boot before any forwarder + // goroutine runs; concurrent calls during serving are undefined. + func (r *Registry) SetForwarderHooks(phone, binary func()) + + // SetGraceExpiryHook installs the eviction-increment hook. May be nil + // for no-op. Call once at boot before any ScheduleReleaseServer call. + func (r *Registry) SetGraceExpiryHook(grace func()) + ``` + + Bodies are a single assignment each. No lock needed (single-goroutine + boot-time call); doc comment names the constraint. + +- **Hook invocation in `handleGraceExpiry`**: at the end of the success + branch, after `r.mu.Unlock()` and the phone-close `for` loop. The + hook fires for every real eviction, never for stale fires (the guard + returns before reaching this line). Shape: + + ``` + func (r *Registry) handleGraceExpiry(serverID string, self *graceEntry) { + r.mu.Lock() + if r.timers[serverID] != self { r.mu.Unlock(); return } + // ... existing eviction body (delete + snapshot) ... + r.mu.Unlock() + + for _, p := range snapshot { p.Close() } + + if h := r.onGraceExpiry; h != nil { h() } // <- new line + } + ``` + + The hook is the *last* thing the function does. Reading `r.onGraceExpiry` + is a single-word load: safe to read without the lock because the field + is set once at boot. A future ticket adding mid-run hook mutation would + need to revisit; the doc comment says boot-time only. + +#### 2. `internal/relay/forward.go` (modified) + +- **`StartPhoneForwarder` body**: immediately after `binary.Send(wrapped)` + returns nil (i.e. the existing `if err := binary.Send(wrapped); err != nil` + check passes), add a nil-safe hook call: + + ``` + if h := reg.onPhoneForwarded; h != nil { h() } + ``` + + Placement is critical: AFTER `Send` succeeds, BEFORE the loop's next + iteration. Not before `Send` (would over-count on Send error), not + inside the `err != nil` branch (would invert the AC). + +- **`StartBinaryForwarder` body**: immediately after `phone.Send(env.Frame)` + returns nil (i.e. the existing `if err := phone.Send(env.Frame); err != + nil` check passes), add the same nil-safe shape: + + ``` + if h := reg.onBinaryForwarded; h != nil { h() } + ``` + + Placement: AFTER the Send-success path. The three `continue` paths + (unmarshal err, unknown conn_id, phone Send err) do not reach this + line — they `continue` before it. That is the structural defence for + the AC's "no increment on … per-sink `Send` error" rule. + +- Ticket-body wording note (record only — does not change the spec): the + body says *"the loop fans out to multiple phones per envelope"*. The + current `StartBinaryForwarder` implementation routes one envelope to + at most one phone (the one matching `env.ConnID`). The AC text + ("exactly once per successful `phone.Send`") is correct; the rationale + paragraph's "multiple phones" framing is a leftover from a multi-recipient + design that did not land. Increment-per-`phone.Send` is what the spec + encodes — same answer regardless of which framing is read. + +#### 3. `cmd/pyrycode-relay/main.go` (modified) + +Two new constructor calls in the metrics-wiring block, immediately after +`relay.NewConnectionsMetrics(metricsReg, reg)` at line 102 and before +the mux construction: + +``` +relay.NewForwardMetrics(metricsReg, reg) +relay.NewGraceMetrics(metricsReg, reg) +``` + +Each constructor (a) registers its counter against `metricsReg` and +(b) calls the appropriate setter on `reg`. Boot-time ordering guarantee: +both hooks are wired before `srv.ListenAndServe()` is called and before +any forwarder goroutine launches (those start on the first connection). + +No other edits to main.go. + +#### 4. `internal/relay/metrics_forward.go` (new, ~35 lines) + +Per-concern collector struct. Constructor takes `prometheus.Registerer` +and `*Registry`. Mirrors `NewConnectionsMetrics`'s shape. + +- `forwardMetrics` struct holds a single `*prometheus.CounterVec` field. +- `NewForwardMetrics(reg prometheus.Registerer, src *Registry)` constructs + the CounterVec with `prometheus.CounterOpts{ Name: + "pyrycode_relay_frames_forwarded_total", Help: "…" }` and label set + `[]string{"direction"}`. Registers via `reg.MustRegister(m.counter)`. + Then calls `src.SetForwarderHooks(...)` passing two closures, each + pre-bound to the right label value: + - `func() { m.counter.WithLabelValues("phone_to_binary").Inc() }` + - `func() { m.counter.WithLabelValues("binary_to_phone").Inc() }` + The closures call `WithLabelValues` on every increment; CounterVec + caches by label set, so the cost is one map lookup per frame. (For + the relay's frame rate this is negligible; pre-binding to a `Counter` + in the constructor would be marginally faster but adds two more + fields. Spec lets the developer pre-bind if they prefer; either + satisfies the AC.) +- Doc comment on `forwardMetrics`: *"Push-shaped counter — the + forwarder loop is the source of truth, not a snapshot of registry + state (contrast `metrics_connections.go`'s pull-based gauges). The + hook indirection through `*Registry` keeps the prometheus dep out of + `forward.go`."* +- **Label values are hard-coded constants.** Both `phone_to_binary` and + `binary_to_phone` are string literals in this file; neither is + attacker-influenced. Load-bearing for security review § Tokens — + cardinality is exactly 2, regardless of traffic. Documented in the + constructor's doc comment so the next reader does not "fix" the + hard-coding by reading the direction from request state. + +#### 5. `internal/relay/metrics_grace.go` (new, ~25 lines) + +Per-concern collector struct, same shape as `metrics_forward.go` but +without labels. + +- `graceMetrics` struct holds a single `prometheus.Counter` field. +- `NewGraceMetrics(reg prometheus.Registerer, src *Registry)` constructs + the Counter with `prometheus.CounterOpts{ Name: + "pyrycode_relay_grace_expiries_total", Help: "…" }`, registers via + `reg.MustRegister(m.counter)`, then calls + `src.SetGraceExpiryHook(func() { m.counter.Inc() })`. +- Doc comment on `graceMetrics`: *"Push-shaped counter. The hook fires + from `Registry.handleGraceExpiry`'s success branch, structurally + ensuring stale-fire no-ops never increment. The pointer-identity guard + in `handleGraceExpiry` is the load-bearing defence for the + no-double-count invariant; the hook only fires after the guard + passes."* +- **No labels.** Spec calls out the rejected `{server=""}` shape so + the developer cannot quietly re-add a server-id label that would + carry attacker-influenced header values onto the metrics surface. + +#### 6. `internal/relay/metrics_counters_test.go` (new, ~140 lines) + +Four tests, all in `package relay` so `fakeConn`/`fakePhone`/`fakeBinary` +are reused without duplication. One shared `assertCounter` helper mirrors +`assertGauge` from `metrics_connections_test.go`. + +- **`TestForwardMetrics_PhoneToBinary_OnlyOnSuccess`** — AC for the + phone→binary increment. Scenario: + - `r := NewRegistry()`, `mreg := NewMetricsRegistry()`, + `NewForwardMetrics(mreg, r)`, `h := NewMetricsHandler(mreg)`. + - `bin := &fakeBinary{id: "bin"}`; `r.ClaimServer("s1", bin)`. + - Run `StartPhoneForwarder` in a goroutine driving 3 frames through a + `fakePhone`. Verify all 3 received by `bin`. Cancel ctx. + - Assert `frames_forwarded_total{direction="phone_to_binary"}` is 3. + Assert `{direction="binary_to_phone"}` is 0. + - Then a negative case: with `bin.sendErr = errors.New("nope")`, drive + another frame; the forwarder returns. Counter unchanged at 3 (Send + failed → no increment). The AC clause "No increment on … `Send` + error" is checked. + - The "No binary" / "marshal error" negative paths are covered by + the existing forwarder tests' coverage of return values; this + metrics test verifies them by inspection: drive a frame with + `BinaryFor` returning nil (no `ClaimServer` first), forwarder + returns nil, counter unchanged. + +- **`TestForwardMetrics_BinaryToPhone_OnlyOnSuccess`** — AC for the + binary→phone increment. Mirrors the shape of the phone→binary test + but uses `StartBinaryForwarder`. Scenarios (drive successive + envelopes through one `fakeBinarySource`, each with a different + outcome): + - 2 well-formed envelopes addressing a registered phone → counter + `{direction="binary_to_phone"}` increments by 2. + - 1 malformed envelope (not parseable by `Unmarshal`) → forwarder + `continue`s; counter unchanged. + - 1 well-formed envelope addressing an unknown conn_id → forwarder + `continue`s; counter unchanged. + - 1 well-formed envelope to a phone whose `Send` returns an error + (set `fakePhone.sendErr`) → forwarder `continue`s; counter + unchanged. + - Final assertion: total = 2, no other direction incremented. + +- **`TestGraceMetrics_OnlyOnRealEviction`** — AC for the grace counter. + Scenario: + - `r := NewRegistry()`, wire `NewGraceMetrics(mreg, r)`, handler `h`. + - `r.ClaimServer("s1", &fakeBinary{id: "b1"})`. + - `r.ScheduleReleaseServer("s1", 5*time.Millisecond)`. + - `time.Sleep(30 * time.Millisecond)` to let the timer fire and the + eviction complete. + - Assert `grace_expiries_total` is 1. + - Then a stale-fire scenario: `r.ClaimServer("s2", b)`, + `r.ScheduleReleaseServer("s2", 5*ms)`, immediately + `r.ClaimServer("s2", &fakeBinary{id: "b3"})` (cancels and replaces + the timer). Sleep 30ms past the original window. Real eviction did + NOT happen for `s2`; counter must still be 1, not 2. + +- **`TestGraceMetrics_RaceFreedom`** — race coverage that the hook plays + nicely with the stale-fire path under concurrent cycles. Shape pinned + to `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles` (registry_test.go:413): + - 16 goroutines, 200 ops each. Each op: `ClaimServer`, optionally + `ScheduleReleaseServer` with 1ms grace, sometimes immediately + `ClaimServer` again (cancel-replace), sometimes let the timer fire. + - After `wg.Wait()` + a 50ms drain sleep, scrape and assert the + counter is a non-negative integer; do NOT assert an exact value + (the cancel-replace race window means the eviction count is + schedule-dependent, between 0 and goroutines × ops). The + structural assertion is `go test -race`: no DATA RACE report + against the hook field. The pointer-identity guard's existing + coverage carries through; the new wrinkle is the hook field's + read from the timer goroutine concurrent with no writes (writes + happen only at boot). + - The counter value is bounded: invariant `0 ≤ value ≤ total schedule + calls that were NOT cancelled`. Tests assert only the + no-race-detector-finding condition. The single-eviction + + single-stale-fire correctness is the previous test's job. + +`assertCounter(t, h, "metric_name", labelStr, want)` substring-matches +`metric_name{labels} ` or `metric_name ` (no-label case) +in the scraped body. Same substring-match posture as +`metrics_connections_test.go`'s `assertGauge` (because the body also +contains `promhttp_metric_handler_*` self-instrumentation lines). +Pinned literals (`pyrycode_relay_frames_forwarded_total{direction="phone_to_binary"} 3`) +are the version-bump canaries for `client_golang` text-format changes — +same posture as the gauge tests in #61. + +### Concurrency model + +- **No new goroutines.** The hook bodies run on whatever goroutine the + caller is on: + - Phone hook → the phone forwarder goroutine (one per connected phone). + - Binary hook → the binary forwarder goroutine (one per connected + binary). + - Grace hook → the `time.AfterFunc` goroutine (one per timer firing). +- **No new locks.** The Counter / CounterVec increments are atomic per + `client_golang`'s contract. +- **Hook field reads under no lock.** Each forwarder reads `reg.onPhoneForwarded` + / `reg.onBinaryForwarded` as a function-pointer load on every iteration. + `handleGraceExpiry` reads `reg.onGraceExpiry` once after `Unlock`. + Per the doc comment, the fields are written exactly once at boot + before either listener starts, so the reads are race-free in the + happens-before sense (boot writes happen-before forwarder + goroutine starts via the standard "goroutine creation + happens-before goroutine body" rule from the Go memory model). +- **Race-detector coverage.** `TestGraceMetrics_RaceFreedom` exercises + the hook field's read under high concurrency. Existing + `TestRegistry_RaceFreedom` / `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles` + / `TestConnectionsMetrics_RaceFreedom` (#61) carry through unchanged. + +### Error handling + +| Failure mode | Surface | Recovery | +| --- | --- | --- | +| `MustRegister` collision (same metric name registered twice on `metricsReg`) | panic at boot inside `NewForwardMetrics` / `NewGraceMetrics` | impossible by construction — both names are unique across the relay's metric space; if a future sibling ticket re-uses a name the panic surfaces at boot, which is the right time | +| `Counter.Inc` on a nil pointer | impossible — the hook is registered with a closure that captures a non-nil counter constructed in the same function | n/a | +| Hook field still nil when forwarder runs | hook call is `if h := …; h != nil { h() }` no-op | n/a; tests that don't wire metrics see no-op behaviour | +| Forwarder Send returns nil but with zero bytes written | not reachable — `Send` returns nil iff the frame was written; nil is the success contract per the existing forwarder tests | n/a | +| Pointer-identity guard returns false (stale fire) | hook is never reached — guard's `return` exits before the hook call | n/a; the structural defence the AC specifies | + +### Testing strategy + +- `make vet`, `make test` (which is `-race`-wide per the Makefile), and + `make build` are the AC gates. All three must be clean before merge. +- Four new tests, listed above. All live in `internal/relay/metrics_counters_test.go`. +- Existing tests (registry, forward, connections) are NOT modified — + the hook fields default to nil and the existing tests never wire a + metric, so the new code paths are invisible to them. +- `TestGraceMetrics_RaceFreedom`'s race coverage extends the existing + `TestRegistry_RaceFreedom` proof to the new field-read path. The Go + race detector's verdict under `make test` is the structural AC gate. +- No new e2e or integration tests. `/metrics` listener wiring (#60) + is unchanged by this ticket; e2e coverage that scrapes `/metrics` + in autocert mode is out of scope. + +### Open questions + +1. **`docs/knowledge/codebase/58.md` and `docs/knowledge/INDEX.md` + updates.** AC names both. Per the architect's *Never Update* rule + and the established precedent (#59/#60/#61 each deferred these to + the documentation phase), the architect does NOT write either file + in this ticket. The developer's commit ships the code; the + documentation phase's post-merge run lands the codebase summary and + the INDEX entry. The AC is satisfied at the ticket level by the + documentation phase, not by the architect or developer phases. + Recorded so the developer does not chase an AC item that is not + theirs. +2. **Pre-bind labelled counters in the closure?** The constructor's + `WithLabelValues("phone_to_binary")` could be called once and the + resulting `prometheus.Counter` captured by the closure instead of + re-resolving on every frame. CounterVec caches by label set so the + cost difference is one map lookup per frame at relay traffic rates; + not measurable. Either implementation satisfies the AC. Developer + picks; if pre-binding, document the choice in a one-line comment. +3. **`time.AfterFunc` goroutine vs hook reentrancy.** The grace hook + could in principle call back into the registry; it does not (the + only caller is `metrics_grace.go`'s `m.counter.Inc()`). Doc comment + on the field warns against acquiring `r.mu` in a hook (deadlock + risk) so a future contributor adding a registry-state-reading hook + sees the constraint. +4. **Should `SetForwarderHooks` / `SetGraceExpiryHook` be private?** + They are exported because main.go is in a different package and + must call them. Tests that exercise the metrics directly (in + `package relay`) could use either an exported or unexported setter; + exported avoids the duplicate-shape problem. The "Set once at boot" + constraint is doc-comment-enforced, not type-enforced — consistent + with `Registry`'s other thread-safety contracts (e.g. ClaimServer + is documented as safe under concurrent calls, but the doc is the + contract, not a type-level guarantee). + +### Out of scope + +- Histogram for send-duration (`pyrycode_relay_send_duration_seconds`): + explicitly out of scope per ticket body. Its own ticket if it lands. +- Upgrade/register counters (#57's territory). #57 will land its own + per-concern collector file (`metrics_upgrade.go` or similar) using the + same registry-hook pattern this ticket establishes for the grace + counter, since the upgrade-flood reject site is similarly inside + the relay package's existing handler code. +- Touching the `/metrics` listener config (#60's territory). +- Process / Go-runtime collectors (forbidden by ADR-0008 § Scope of + use). +- Documentation phase artefacts (`docs/knowledge/codebase/58.md`, + `docs/knowledge/INDEX.md` append) — handled by the documentation + phase post-merge. + +## Security review (security-sensitive ticket) + +### Mindset + +Re-reading the spec as an adversary against the relay process. Default +verdict FAIL until each applicable category below has either a concrete +finding or an explicit design decision that makes the category not +applicable. The label is the gate; "this is just two counters" is +exactly the bias the pass exists to bypass — metric labels and metric +values are exfiltration channels regardless of how the data plane is +shaped. + +### 1. Trust boundaries + +- **Inbound trust boundary:** unchanged. The `/metrics` listener (which + terminates scrape traffic) is #60's loopback-only surface; this ticket + adds counters consumed by that listener but does not change its + network surface. +- **Internal boundary:** the relay's private registry vs. + `prometheus.DefaultRegisterer`. Enforced by both constructors taking + `prometheus.Registerer` explicitly (no default-registry shortcut); + reinforced structurally by the existing + `TestMetricsRegistry_NoGlobalRegistrarLeak` (snapshot delta is zero + after this ticket's wiring — the new counters register on a fresh + private registry, never on the default). +- **New internal boundary** introduced: the hook fields on `*Registry`. + Hook writes happen only at boot (single-goroutine, single-call); hook + reads happen from forwarder + timer goroutines. The boundary defence + is the doc-comment-enforced "set once at boot" rule. No + network-supplied input ever writes to these fields. + +### 2. Tokens, secrets, credentials + +- **`frames_forwarded_total` labels:** the `direction` label has exactly + two values, both hard-coded string literals in `metrics_forward.go` + (`phone_to_binary`, `binary_to_phone`). Neither comes from any + request header, query parameter, frame payload, or registry state. + Cardinality is exactly 2. No attacker-influenced label values. + Structurally — the spec calls out the constants in the constructor's + doc comment so a future contributor cannot quietly rewrite them to + read the direction from request state. +- **`grace_expiries_total` labels:** none. Scalar counter. +- **Metric values:** both counters are monotonic non-negative integers + derived from the count of successful per-frame Send operations + (forward counter) or actual evictions (grace counter). Neither + value carries a token, IP, server-id, or frame content. The + forwarded byte-count is NOT exposed; only the integer-valued frame + count. +- **Token-handling code paths:** unchanged. The relay's only handled + secret is `X-Pyrycode-Token` at `/v1/client` (#5). This ticket adds + no code path that touches a token; forwarder code already in scope + has no token handling. + +### 3. File operations + +- None. No new file I/O. + +### 4. Subprocess / external command execution + +- None. + +### 5. Cryptographic primitives + +- None directly. `prometheus.Counter.Inc` uses no `crypto/*` API. +- The hooks-on-Registry mechanism uses ordinary function pointers; no + reflection, no unsafe. + +### 6. Network & I/O + +- **Handler exposure surface change.** Before this ticket: `/metrics` + body contains `promhttp_metric_handler_*` self-instrumentation lines + + the two `pyrycode_relay_connected_*` gauges. After this ticket: + + two counter-vec lines (`frames_forwarded_total{direction="phone_to_binary"}`, + `frames_forwarded_total{direction="binary_to_phone"}`) + one scalar + counter line (`grace_expiries_total`). Total response size delta: + ~250 bytes (well under any scrape buffer). +- **Cardinality bound.** `frames_forwarded_total` has cardinality 2 + (hard-coded). `grace_expiries_total` is scalar. Both are constant + across all traffic patterns and load profiles. A future ticket + adding a per-server-id or per-conn-id label would re-open this + category; this ticket closes it for itself. +- **No new `Accept` content negotiation.** The collectors emit via the + existing `NewMetricsHandler` (#59), which pins text format. +- **No new listener.** The `/metrics` surface is #60's territory; this + ticket adds collectors to the registry that listener already serves. + +### 7. Error messages, logs, telemetry + +- **No new log lines.** The hook bodies are pure `Counter.Inc()` calls; + they do not log. The forwarders already log at the existing levels + (Info on success-path end, Warn on per-frame errors); this ticket + does not change that. +- **Counter values are integers.** No format-string or scientific-notation + paths through the exposition layer that could leak adjacent memory. +- **Hook write/read race:** doc-comment-bound to boot-only writes. The + race detector in `TestGraceMetrics_RaceFreedom` exercises the read + path under high concurrency. + +### 8. Concurrency + +- **No new goroutines** — the hook calls run on the forwarder / + AfterFunc goroutines that already exist. +- **No new locks.** Counter `Inc` is internally atomic; hook field + reads are single-word loads of function pointers established at + boot (happens-before per the Go memory model: goroutine creation + happens-before the goroutine body). +- **`r.onGraceExpiry` read is unlocked.** Same justification as + above: the field is set once at boot, before any timer is armed. + The read happens AFTER the eviction body completes (after + `r.mu.Unlock()` and the phone-close loop), so even a hypothetical + reordering cannot leak the lock-protected map state through the + hook. +- **Hook MUST NOT acquire `r.mu`.** Documented on the field. Violation + would deadlock (the grace hook runs after `r.mu.Unlock` but the + forwarder hooks could in principle re-enter mutator paths if a + future contributor adds one — the doc comment closes that). The + concrete `metrics_forward.go` / `metrics_grace.go` hook bodies are + pure `prometheus.Counter.Inc()` and acquire no relay-side lock. + +### 9. Threat model alignment + +- `docs/threat-model.md` § *Log hygiene* (extends to metric labels): + closed by design — `direction` is a hard-coded constant set; + `grace_expiries_total` is label-less. Recorded in the constructors' + doc comments so the boundary is visible at the call site, not just in + the spec. +- `docs/threat-model.md` § *DoS*: response size delta ~250 bytes; + cardinality fixed at 3 series total. No DoS expansion. The `/metrics` + listener's loopback gate (#60) caps the scrape population to same-host + processes. +- `docs/threat-model.md` § *Supply chain*: no new dependency. The new + prod files import `prometheus/client_golang` symbols already pulled in + by #59. `go mod tidy` after this ticket is a no-op against `go.sum`. +- Protocol spec (`pyrycode/pyrycode/docs/protocol-mobile.md`): unaffected. + No wire-protocol surface change. The forwarder's per-frame contract is + unchanged — counter increments are purely observational. + +### Findings + +- **[Trust boundaries]** No findings. Hook field writes are boot-only; + reads are race-free per the Go memory model; no network input reaches + any hook field. +- **[Tokens]** PASS — `direction` label values are hard-coded constants; + `grace_expiries_total` is label-less; values are integers carrying no + identifier. Spec calls out the rejected `{server="..."}` shape so the + developer cannot regress. +- **[File operations]** No findings. +- **[Subprocess]** No findings. +- **[Cryptographic primitives]** No findings. +- **[Network & I/O]** PASS — response size delta ~250 bytes; cardinality + fixed at 3 series; no new endpoint, no new content negotiation. +- **[Errors / logs / telemetry]** No findings — hooks are silent; values + are integers. +- **[Concurrency]** PASS — no new goroutines, no new locks; the hook + field read/write discipline is doc-comment-enforced ("set once at + boot") and the race test covers the concurrent-read path under + `-race`. +- **[Threat model]** No findings — log hygiene, DoS, supply chain all + unchanged or closed by design. + +### Verdict + +**PASS.** No MUST FIX findings. The single load-bearing security design +choice (hard-coded `direction` label values, no `server` label) is +documented both in the spec's § Design and in the production file's +doc comments, so the developer cannot regress it without first deleting +the spec's instruction. The hook-on-Registry pattern's "set once at +boot" rule is doc-comment-enforced rather than type-enforced — consistent +with the registry's other thread-safety contracts (e.g. `ClaimServer`'s +documented concurrent-safety) and the existing connections-collector +test's race coverage carries the proof. + +**Reviewer:** architect (self-review per `architect/security-review.md`) +**Date:** 2026-05-13 From f343937fca69ab11f68b42336512a8000f60d1a2 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 10:31:14 +0300 Subject: [PATCH 2/3] feat(relay): frame-forwarded and grace-expiry counters (#58) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds two Prometheus counters via nil-safe hooks on *Registry, keeping the prometheus dep out of registry.go and forward.go: - pyrycode_relay_frames_forwarded_total{direction} — incremented after a successful sink Send in StartPhoneForwarder / StartBinaryForwarder. direction is a hard-coded constant set {phone_to_binary, binary_to_phone}, cardinality 2. - pyrycode_relay_grace_expiries_total — incremented from handleGraceExpiry's success branch, after the pointer-identity guard so stale fires do not increment. Per-direction CounterVec children are pre-bound from WithLabelValues so the hot-path hook is a single atomic Inc. Co-Authored-By: Claude Opus 4.7 --- cmd/pyrycode-relay/main.go | 2 + internal/relay/forward.go | 8 + internal/relay/metrics_counters_test.go | 242 ++++++++++++++++++++++++ internal/relay/metrics_forward.go | 44 +++++ internal/relay/metrics_grace.go | 33 ++++ internal/relay/registry.go | 38 ++++ 6 files changed, 367 insertions(+) create mode 100644 internal/relay/metrics_counters_test.go create mode 100644 internal/relay/metrics_forward.go create mode 100644 internal/relay/metrics_grace.go diff --git a/cmd/pyrycode-relay/main.go b/cmd/pyrycode-relay/main.go index 923996d..f075d43 100644 --- a/cmd/pyrycode-relay/main.go +++ b/cmd/pyrycode-relay/main.go @@ -100,6 +100,8 @@ func main() { metricsReg := relay.NewMetricsRegistry() relay.NewConnectionsMetrics(metricsReg, reg) + relay.NewForwardMetrics(metricsReg, reg) + relay.NewGraceMetrics(metricsReg, reg) metricsMux := http.NewServeMux() metricsMux.Handle("/metrics", relay.NewMetricsHandler(metricsReg)) diff --git a/internal/relay/forward.go b/internal/relay/forward.go index efe0013..b23f720 100644 --- a/internal/relay/forward.go +++ b/internal/relay/forward.go @@ -70,6 +70,10 @@ func StartPhoneForwarder( "err", err) return err } + + if h := reg.onPhoneForwarded; h != nil { + h() + } } } @@ -154,5 +158,9 @@ func StartBinaryForwarder( "err", err) continue } + + if h := reg.onBinaryForwarded; h != nil { + h() + } } } diff --git a/internal/relay/metrics_counters_test.go b/internal/relay/metrics_counters_test.go new file mode 100644 index 0000000..15c29c3 --- /dev/null +++ b/internal/relay/metrics_counters_test.go @@ -0,0 +1,242 @@ +package relay + +import ( + "errors" + "fmt" + "net/http" + "net/http/httptest" + "strings" + "sync" + "testing" + "time" +) + +// assertCounter scrapes h and asserts that body contains the literal line +// ` ` in the Prometheus text format. labelStr may be +// empty for unlabelled counters. Substring-match posture mirrors +// assertGauge in metrics_connections_test.go — the body also carries +// promhttp_metric_handler_* self-instrumentation lines. +func assertCounter(t *testing.T, h http.Handler, metric, labelStr string, want int) { + t.Helper() + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + if rec.Code != http.StatusOK { + t.Fatalf("scrape status: got %d, want 200", rec.Code) + } + var wantLine string + if labelStr == "" { + wantLine = fmt.Sprintf("%s %d", metric, want) + } else { + wantLine = fmt.Sprintf("%s{%s} %d", metric, labelStr, want) + } + if !strings.Contains(rec.Body.String(), wantLine) { + t.Errorf("scrape body missing %q; got:\n%s", wantLine, rec.Body.String()) + } +} + +func TestForwardMetrics_PhoneToBinary_OnlyOnSuccess(t *testing.T) { + t.Parallel() + + reg := NewRegistry() + mreg := NewMetricsRegistry() + NewForwardMetrics(mreg, reg) + h := NewMetricsHandler(mreg) + + bin := &fakeBinary{id: "bin-s1"} + if err := reg.ClaimServer("s1", bin); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + phone := newFakePhone("client-s1-aa00bb11") + if err := reg.RegisterPhone("s1", ®istryConn{phone}); err != nil { + t.Fatalf("RegisterPhone: %v", err) + } + + done, cancel := runForwarder(reg, "s1", phone) + defer cancel() + + for i := 0; i < 3; i++ { + phone.frames <- []byte(fmt.Sprintf(`{"i":%d}`, i)) + } + waitForSent(t, bin, 3, 2*time.Second) + + assertCounter(t, h, "pyrycode_relay_frames_forwarded_total", `direction="phone_to_binary"`, 3) + // binary_to_phone direction registered (label cardinality 2) but never + // incremented — assert it stays at 0. + assertCounter(t, h, "pyrycode_relay_frames_forwarded_total", `direction="binary_to_phone"`, 0) + + // Close to drain forwarder cleanly before the next subscenario. + close(phone.frames) + <-done + + // Negative case: Send returns an error → no increment. New forwarder, + // new phone; reuse the same binary with a sticky sendErr. + bin.mu.Lock() + bin.sendErr = errors.New("nope") + bin.mu.Unlock() + phone2 := newFakePhone("client-s1-aa00bb12") + if err := reg.RegisterPhone("s1", ®istryConn{phone2}); err != nil { + t.Fatalf("RegisterPhone phone2: %v", err) + } + done2, cancel2 := runForwarder(reg, "s1", phone2) + defer cancel2() + phone2.frames <- []byte(`{"send":"will-fail"}`) + select { + case <-done2: + case <-time.After(time.Second): + t.Fatal("forwarder did not return after Send error") + } + + // Still 3; the failed Send did not increment. + assertCounter(t, h, "pyrycode_relay_frames_forwarded_total", `direction="phone_to_binary"`, 3) + + // No-binary case: no ClaimServer for s2; counter unaffected. + phone3 := newFakePhone("client-s2-cc00dd11") + done3, cancel3 := runForwarder(reg, "s2", phone3) + defer cancel3() + phone3.frames <- []byte(`{"no":"binary"}`) + select { + case err := <-done3: + if err != nil { + t.Fatalf("forwarder return = %v, want nil", err) + } + case <-time.After(time.Second): + t.Fatal("forwarder did not return on missing binary") + } + assertCounter(t, h, "pyrycode_relay_frames_forwarded_total", `direction="phone_to_binary"`, 3) +} + +func TestForwardMetrics_BinaryToPhone_OnlyOnSuccess(t *testing.T) { + t.Parallel() + + reg := NewRegistry() + mreg := NewMetricsRegistry() + NewForwardMetrics(mreg, reg) + h := NewMetricsHandler(mreg) + + p1 := newFakePhone("client-s1-aaaa1111") + p2 := newFakePhone("client-s1-bbbb2222") + pErr := newFakePhone("client-s1-cccc3333") + pErr.sendErr = errors.New("phone wedged") + claimAndRegister(t, reg, "s1", p1, p2, pErr) + + src := newFakeBinarySource("bin-s1") + done, cancel := runBinaryForwarder(reg, "s1", src) + defer cancel() + + // 2 well-formed envelopes to a registered phone → +2. + src.frames <- mustMarshal(t, p1.ConnID(), []byte(`{"i":1}`)) + src.frames <- mustMarshal(t, p2.ConnID(), []byte(`{"i":2}`)) + waitForPhoneSent(t, p1, 1, 2*time.Second) + waitForPhoneSent(t, p2, 1, 2*time.Second) + + // Malformed envelope → continue, no increment. + src.frames <- []byte("not-json") + // Unknown conn_id → continue, no increment. + src.frames <- mustMarshal(t, "client-s1-deadbeef", []byte(`{"dropped":true}`)) + // Phone send error → continue, no increment. + src.frames <- mustMarshal(t, pErr.ConnID(), []byte(`{"wedged":true}`)) + + // Send a final well-formed envelope so we can synchronise on its + // delivery — once it lands, the three preceding continue paths must + // have already been processed. + src.frames <- mustMarshal(t, p1.ConnID(), []byte(`{"i":3}`)) + waitForPhoneSent(t, p1, 2, 2*time.Second) + + close(src.frames) + <-done + + // Three real successes (i=1 to p1, i=2 to p2, i=3 to p1). + assertCounter(t, h, "pyrycode_relay_frames_forwarded_total", `direction="binary_to_phone"`, 3) + // phone_to_binary direction registered but unused. + assertCounter(t, h, "pyrycode_relay_frames_forwarded_total", `direction="phone_to_binary"`, 0) +} + +func TestGraceMetrics_OnlyOnRealEviction(t *testing.T) { + t.Parallel() + + r := NewRegistry() + mreg := NewMetricsRegistry() + NewGraceMetrics(mreg, r) + h := NewMetricsHandler(mreg) + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer s1: %v", err) + } + r.ScheduleReleaseServer("s1", 5*time.Millisecond) + + // Wait for the real eviction. Poll until BinaryFor("s1") clears. + deadline := time.Now().Add(time.Second) + for time.Now().Before(deadline) { + if _, ok := r.BinaryFor("s1"); !ok { + break + } + time.Sleep(2 * time.Millisecond) + } + if _, ok := r.BinaryFor("s1"); ok { + t.Fatal("grace timer did not fire within timeout") + } + + assertCounter(t, h, "pyrycode_relay_grace_expiries_total", "", 1) + + // Stale-fire scenario: arm + cancel-replace before timer fires. + if err := r.ClaimServer("s2", &fakeConn{id: "b-2"}); err != nil { + t.Fatalf("ClaimServer s2: %v", err) + } + r.ScheduleReleaseServer("s2", 5*time.Millisecond) + if err := r.ClaimServer("s2", &fakeConn{id: "b-2-prime"}); err != nil { + t.Fatalf("ClaimServer s2 reclaim: %v", err) + } + + // Sleep past the original grace window so the stale fire reaches the + // AfterFunc body and is then short-circuited by the pointer-identity + // guard. If the guard were broken the counter would move to 2. + time.Sleep(30 * time.Millisecond) + + assertCounter(t, h, "pyrycode_relay_grace_expiries_total", "", 1) +} + +func TestGraceMetrics_RaceFreedom(t *testing.T) { + t.Parallel() + + r := NewRegistry() + mreg := NewMetricsRegistry() + NewGraceMetrics(mreg, r) + h := NewMetricsHandler(mreg) + + const goroutines = 16 + const opsPer = 200 + + var wg sync.WaitGroup + wg.Add(goroutines) + for g := 0; g < goroutines; g++ { + g := g + go func() { + defer wg.Done() + sid := fmt.Sprintf("s-%d", g%4) + for i := 0; i < opsPer; i++ { + _ = r.ClaimServer(sid, &fakeConn{id: fmt.Sprintf("b-%d-%d", g, i)}) + r.ScheduleReleaseServer(sid, time.Millisecond) + if i%2 == 0 { + _ = r.ClaimServer(sid, &fakeConn{id: fmt.Sprintf("b2-%d-%d", g, i)}) + } + } + }() + } + wg.Wait() + + // Drain in-flight timers. + time.Sleep(50 * time.Millisecond) + + // Race detector is the real assertion. Sanity-check the body parses + // and the metric is present with a non-negative integer value (the + // exact value depends on the cancel-replace race window, between 0 + // and the total schedule count). + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + if rec.Code != http.StatusOK { + t.Fatalf("scrape status: got %d, want 200", rec.Code) + } + if !strings.Contains(rec.Body.String(), "pyrycode_relay_grace_expiries_total ") { + t.Errorf("scrape body missing grace counter; got:\n%s", rec.Body.String()) + } +} diff --git a/internal/relay/metrics_forward.go b/internal/relay/metrics_forward.go new file mode 100644 index 0000000..fc8f300 --- /dev/null +++ b/internal/relay/metrics_forward.go @@ -0,0 +1,44 @@ +package relay + +import "github.com/prometheus/client_golang/prometheus" + +// forwardMetrics emits pyrycode_relay_frames_forwarded_total as a +// Prometheus CounterVec labelled by direction +// (phone_to_binary | binary_to_phone). Push-shaped: the forwarder loop is +// the source of truth, not a snapshot of registry state (contrast +// metrics_connections.go's pull-based gauges). The hook indirection +// through *Registry keeps the prometheus dep out of forward.go. +// +// The direction label values are HARD-CODED CONSTANTS — see +// docs/specs/architecture/58-frame-and-grace-counters.md § Security review +// § Tokens: cardinality is exactly 2 and neither value is +// attacker-influenced. Do not "fix" the hard-coding by reading the +// direction from request state. +type forwardMetrics struct { + phoneToBinary prometheus.Counter + binaryToPhone prometheus.Counter +} + +// NewForwardMetrics registers the frames-forwarded counter vector against +// reg and wires the per-direction increment hooks on src. The two counters +// are pre-bound from WithLabelValues so the hook body is a single atomic +// Inc — avoids re-resolving the label set on every forwarded frame. +// +// Wiring site: cmd/pyrycode-relay/main.go, alongside NewConnectionsMetrics +// and NewGraceMetrics. Boot-time only; SetForwarderHooks is not safe to +// call after either listener starts serving. +func NewForwardMetrics(reg prometheus.Registerer, src *Registry) { + vec := prometheus.NewCounterVec( + prometheus.CounterOpts{ + Name: "pyrycode_relay_frames_forwarded_total", + Help: "Number of frames forwarded by the relay, labelled by direction. Increments once per successful sink Send; per-frame error and skip paths do not increment.", + }, + []string{"direction"}, + ) + reg.MustRegister(vec) + m := &forwardMetrics{ + phoneToBinary: vec.WithLabelValues("phone_to_binary"), + binaryToPhone: vec.WithLabelValues("binary_to_phone"), + } + src.SetForwarderHooks(m.phoneToBinary.Inc, m.binaryToPhone.Inc) +} diff --git a/internal/relay/metrics_grace.go b/internal/relay/metrics_grace.go new file mode 100644 index 0000000..716ddb9 --- /dev/null +++ b/internal/relay/metrics_grace.go @@ -0,0 +1,33 @@ +package relay + +import "github.com/prometheus/client_golang/prometheus" + +// graceMetrics emits pyrycode_relay_grace_expiries_total as a scalar +// Prometheus Counter. Push-shaped: the hook fires from +// Registry.handleGraceExpiry's success branch, structurally ensuring +// stale-fire no-ops never increment. The pointer-identity guard in +// handleGraceExpiry is the load-bearing defence for the no-double-count +// invariant; the hook only runs after the guard passes. +// +// No labels. A {server=""} shape would carry the attacker-influenced +// x-pyrycode-server header value onto the metrics surface, which the +// threat model § Log hygiene forbids — see +// docs/specs/architecture/58-frame-and-grace-counters.md § Security review. +type graceMetrics struct { + counter prometheus.Counter +} + +// NewGraceMetrics registers the grace-expiry counter against reg and wires +// the eviction-increment hook on src. Boot-time only; +// SetGraceExpiryHook is not safe to call after ScheduleReleaseServer has +// been called. +func NewGraceMetrics(reg prometheus.Registerer, src *Registry) { + m := &graceMetrics{ + counter: prometheus.NewCounter(prometheus.CounterOpts{ + Name: "pyrycode_relay_grace_expiries_total", + Help: "Number of grace-window timers that actually evicted a server-id slot. Stale fires (cancel-and-replace races) do not increment.", + }), + } + reg.MustRegister(m.counter) + src.SetGraceExpiryHook(m.counter.Inc) +} diff --git a/internal/relay/registry.go b/internal/relay/registry.go index fba7abb..3b72f68 100644 --- a/internal/relay/registry.go +++ b/internal/relay/registry.go @@ -64,6 +64,24 @@ type Registry struct { binaries map[string]Conn phones map[string][]Conn timers map[string]*graceEntry + + // onPhoneForwarded is invoked by StartPhoneForwarder after each + // successful binary.Send. Nil = no-op. Set once at boot via + // SetForwarderHooks before either listener starts serving; concurrent + // mutation during serving is undefined. Called outside any registry + // lock; the hook body MUST NOT acquire r.mu (deadlock risk). The + // production hook body (metrics_forward.go) is a pure + // prometheus.CounterVec.WithLabelValues(...).Inc call — atomic per + // client_golang's contract, no relay-side lock involved. + onPhoneForwarded func() + // onBinaryForwarded mirrors onPhoneForwarded for StartBinaryForwarder's + // per-successful-phone.Send path. Same lifecycle and locking rules. + onBinaryForwarded func() + // onGraceExpiry is invoked by handleGraceExpiry after the + // pointer-identity guard's success branch and the phone-close fan-out. + // Stale fires do not reach this site. Set once at boot via + // SetGraceExpiryHook; same locking rules as the forwarder hooks. + onGraceExpiry func() } // graceEntry wraps a pending grace-period timer. Its pointer identity @@ -84,6 +102,22 @@ func NewRegistry() *Registry { } } +// SetForwarderHooks installs the two per-frame increment hooks. Either (or +// both) may be nil for no-op. Call once at boot before any forwarder +// goroutine runs; concurrent calls during serving are undefined. Hook +// bodies MUST NOT acquire r.mu — deadlock risk. +func (r *Registry) SetForwarderHooks(phone, binary func()) { + r.onPhoneForwarded = phone + r.onBinaryForwarded = binary +} + +// SetGraceExpiryHook installs the eviction-increment hook. May be nil for +// no-op. Call once at boot before any ScheduleReleaseServer call; concurrent +// calls during serving are undefined. The hook body MUST NOT acquire r.mu. +func (r *Registry) SetGraceExpiryHook(grace func()) { + r.onGraceExpiry = grace +} + // ClaimServer registers conn as the binary for serverID. First-claim-wins: // a second concurrent caller for the same serverID receives // ErrServerIDConflict and the conflicting caller's conn is left untouched @@ -187,6 +221,10 @@ func (r *Registry) handleGraceExpiry(serverID string, self *graceEntry) { for _, p := range snapshot { p.Close() } + + if h := r.onGraceExpiry; h != nil { + h() + } } // RegisterPhone appends conn to the phones slice for serverID. Returns From 6260e71d83b77a9b3ed366b6fcaedb67f4cf6af8 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Wed, 13 May 2026 10:37:08 +0300 Subject: [PATCH 3/3] docs: frame-forwarded and grace-expiry counters (#58) Adds the per-ticket codebase note and an evergreen feature doc for the two new Prometheus counters (frames_forwarded_total{direction}, grace_expiries_total) wired via nil-safe func() hooks on *Registry to keep prometheus out of registry.go and forward.go. INDEX.md gains a single entry for the new feature doc. Co-Authored-By: Claude Opus 4.7 --- docs/knowledge/INDEX.md | 1 + docs/knowledge/codebase/58.md | 77 +++++++++++++++ .../features/frame-and-grace-counters.md | 93 +++++++++++++++++++ 3 files changed, 171 insertions(+) create mode 100644 docs/knowledge/codebase/58.md create mode 100644 docs/knowledge/features/frame-and-grace-counters.md diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index 512eafa..a60ee3e 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -10,6 +10,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o - [Linux capability allowlist (boot-time refusal)](features/capability-allowlist.md) — relay parses `/proc/self/status`'s `CapEff:` hex mask at boot and refuses to start (exit 2) if any bit is set outside `AllowedCapabilities` (currently `{CAP_NET_BIND_SERVICE}` only, motivated by autocert binding `:80`/`:443` from uid 65532 in the distroless image). Exported sentinel `ErrUnexpectedCapability` is branchable via `errors.Is`; the wrapped error names every offending bit symbolically (`CAP_SYS_ADMIN (bit 21)` or `bit 63` for unknown), lists the allowlist contents, and embeds the operator fix string. `CapEff` only — `CapPrm/CapBnd/CapInh` would broaden false-positives (legitimate K8s default policy grants wide CapBnd) without adding load-bearing protection (relay never `capset(2)`s). Linux/non-Linux split at compile time via the new `_.go` / `_other.go` build-tag convention (see ADR-0009); non-Linux GOOS logs one skip line and returns nil. Unconditional — no production-mode gating, no env-var bypass, because stray capabilities are never legitimate. Reader-boundary test seam (`func() (string, error)`) exercises the parse + mask check end-to-end without touching real `/proc`. Joins the boot-time-refusal sentinel family (#9, #77, #79; future #78) (#79). - [Production-mode contract & startup refusals](features/production-mode.md) — `PYRYCODE_RELAY_PRODUCTION=1` env-var contract (exact-string match, lazy read via injected getter, mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` shape from #64/#65) plus the boot-time checks that consume it. **#77** introduced `relay.CheckInsecureListenInProduction` + exported `ErrInsecureListenInProduction` sentinel (branchable via `errors.Is`) firing when production mode is on AND `--insecure-listen` is set. **#78** added the second consumer: `relay.CheckRunningAsRoot(geteuid, getenv)` + exported `ErrRunningAsRoot` sentinel firing when production mode is on AND `syscall.Geteuid() == 0`, closing the deploy-time gap (`docker run --user 0`, `securityContext.runAsUser: 0`, hand-edited Dockerfile dropping `USER`) that escapes the CI non-root-build contract (#32 Dockerfile, #68 Trivy). Both wired in `cmd/pyrycode-relay/main.go` after flag-parse with `os.Exit(2)` (config-rejected-at-boot, distinct from runtime-failure exit 1) and structured log fields: `env_var` carries the name only (never the value, even though `effective_uid` carries the kernel-supplied int — log-injection structurally impossible), one-line `fix` listing valid resolutions. `IsProductionMode` exported so siblings compose on the same predicate without re-reading the env var. Test seams: `func(string) string` for env, `func() int` for euid — both the smallest possible (no interface, no struct, no package-level var) and the only way to exercise the uid-0 branch in a unit test without re-execing the test binary as root. Two instances of the shape (#77, #78) now codify the "sibling boot-check" pattern; `Config.Validate()` consolidation deferred until ~5 checks exist (#77, #78). - [Fly.io deploy](features/fly-deploy.md) — production host wiring: `fly.toml` declares TCP-passthrough on `:80`/`:443` (no Fly HTTP proxy, no Fly-managed certs) so TLS keeps terminating in the relay via autocert (#9), persistent Fly volume `relay_autocert` mounted at `/var/lib/relay/autocert`, and a single-machine hard cap encoded via `min_machines_running=1` + `auto_start_machines=false` + `auto_stop_machines="off"` + `[deploy] strategy="immediate"` (Fly Apps v2 has no `max_machines` key; the in-binary `PYRYCODE_RELAY_SINGLE_INSTANCE` self-check from #65 is the backstop). CI `deploy` job in `.github/workflows/ci.yml` runs `flyctl deploy --remote-only` on push to `main`, gated by branch-condition + `needs: [test, security, image-scan]` + `permissions: contents: read` so `FLY_API_TOKEN` is structurally unreachable from PR code; `superfly/flyctl-actions/setup-flyctl` pinned by commit SHA with `# Tracks:` comment (same convention as #68 / #41). Dedicated IPv4 is required (not optional) for autocert's HTTP-01 challenge; TCP passthrough preserves the real socket peer IP that #34's rate limiter reads. `__REGION__` / `__DOMAIN__` ship as placeholders that fail loud on first deploy (#38). +- [Frame-forwarded and grace-expiry counters](features/frame-and-grace-counters.md) — three Prometheus counters wired through nil-safe `func()` hooks on `*Registry` so neither `registry.go` nor `forward.go` imports `prometheus` (ADR-0008 *Scope of use*): `pyrycode_relay_frames_forwarded_total{direction}` with `direction` hard-coded to `{phone_to_binary, binary_to_phone}` (cardinality 2; neither value is attacker-influenced) Inc'd by `StartPhoneForwarder` / `StartBinaryForwarder` exactly once per successful sink `Send` — structurally below every `continue` / error-return branch so the four miss-paths (no binary, marshal err, unknown conn_id, sink Send err) cannot over-count; and the scalar `pyrycode_relay_grace_expiries_total` Inc'd by `handleGraceExpiry` as the LAST statement of the success branch (after `r.mu.Unlock()` and the phone-close loop) — the existing pointer-identity guard's stale-fire `return` exits before the hook is reached, so stale fires never increment by construction (same structural defence #61's gauges ride for free). Three private fields on the `Registry` struct (`onPhoneForwarded`, `onBinaryForwarded`, `onGraceExpiry`) plus two exported setters (`SetForwarderHooks`, `SetGraceExpiryHook`) wired at boot from `cmd/pyrycode-relay/main.go` immediately after `NewConnectionsMetrics`; hooks are nil-safe so existing tests (which never wire metrics) see no behaviour change. Per-concern collector files `metrics_forward.go` (~45 lines) and `metrics_grace.go` (~33 lines) pre-bind `CounterVec` children via `WithLabelValues` once in the constructor and pass `Counter.Inc` method values directly into the setters — hot path is a single atomic Inc, no map lookup per frame, no anonymous-closure wrapping. Rejected alternatives: parameter threading (≥10 call-site cascade across `ClientHandler` / `ServerHandler` / `Start*Forwarder` signatures, right at the edit-fan-out red line) and package-level vars (breaks per-test isolation under `-race`, already rejected by #59's scaffolding spec). Hook field writes are boot-only, doc-comment-enforced; reads are race-free per the Go memory model's "goroutine creation happens-before goroutine body" rule; race-tested under `-race` via 16 goroutines × 200 ops in `metrics_counters_test.go`. Hooks MUST NOT acquire `r.mu` (deadlock risk if a future contributor adds a registry-mutating hook); the concrete bodies are pure `Counter.Inc` calls so the constraint holds today. Send-duration histogram (`pyrycode_relay_send_duration_seconds`) explicitly deferred — hot-path `time.Now()` × 2 + bucket bookkeeping warrants its own architect pass (#58). - [Connection-count gauges](features/connection-count-gauges.md) — `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones` exposed via a pull-based `prometheus.Collector` reading `Registry.Counts()` on each scrape; zero edits to `registry.go`; scalar (no labels) by design — `{server="..."}` would carry the attacker-influenced `x-pyrycode-server` header onto the metrics surface, which threat-model § Log hygiene forbids; stale grace-expiry fires can't move the gauge because the pointer-identity guard (ADR-0006) keeps the maps unchanged and the gauge IS the map size; race-tested against 16 mutator goroutines + a tight-loop scraper under `-race`. First collector wired into the #59 seam (#61). - [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars) — first instantiated by #61's `connectionsCollector`. Listener landed in #60 (see [Metrics listener](features/metrics-listener.md)). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59). - [Docker image](features/docker-image.md) — portable OCI artifact: multi-stage `Dockerfile` builds a fully-static binary (`CGO_ENABLED=0`, `-trimpath -s -w`) into `distroless/static-debian12:nonroot`; both base images digest-pinned with `# Tracks:` comments; exposes `:80`/`:443` and declares `/var/lib/relay/autocert` volume; host-specific wiring (TLS policy, ports, volumes, healthcheck) is #38's problem (#32). PR-time Trivy CVE scan against the just-built image lives in CI as the `image-scan` job, fails on **fixable** CRITICAL/HIGH only (`ignore-unfixed: true`), action pinned by commit SHA with `# Tracks: ` comment mirroring the Dockerfile pin convention; intentional overlap with `govulncheck` (source-reachability vs. shipped-artifact) (#68). Both scanners are also re-run daily against `main` via `.github/workflows/security-scan.yml` (cron + `workflow_dispatch`) so disclosed CVEs against unchanged deps surface within ≤24h rather than staying invisible until the next bump (#72); a red cron run also opens a `security-sensitive`-labelled GitHub issue via the workflow's `file-issue` job (artifact-handoff privilege split keeps `issues: write` off the scanners and out of workflow scope; deterministic-title dedup via `gh issue list --search 'in:title …'`) so regressions land as tracked work-items rather than passive Actions rows (#73). diff --git a/docs/knowledge/codebase/58.md b/docs/knowledge/codebase/58.md new file mode 100644 index 0000000..33084de --- /dev/null +++ b/docs/knowledge/codebase/58.md @@ -0,0 +1,77 @@ +# Ticket #58 — `frames_forwarded_total` + `grace_expiries_total` counters + +Third slice of the #56 metrics rollout (after #59 scaffolding, #61 gauges, #60 listener). Adds the two event counters — frame throughput by direction and real-eviction count — via nil-safe `func()` hooks on `*Registry`, keeping `prometheus` out of `registry.go` and `forward.go`. + +## Implementation + +- **`internal/relay/registry.go` (modified)** — three private hook fields + two setter methods, plus one new line in `handleGraceExpiry`: + - `onPhoneForwarded`, `onBinaryForwarded`, `onGraceExpiry` — private `func()` fields on the `Registry` struct, all nil-safe, all documented as boot-time-write-only (any future mid-run mutation would need to revisit the unlocked-read decision). + - `SetForwarderHooks(phone, binary func())` and `SetGraceExpiryHook(grace func())` — exported because `main.go` (different package) calls them. Single-line bodies; no lock because the "set once at boot" contract is doc-comment-enforced (consistent with `Registry`'s other thread-safety contracts). + - `handleGraceExpiry` invokes `onGraceExpiry` as the LAST statement of the function, after `r.mu.Unlock()` AND after the phone-close `for` loop, so the stale-fire `return` path (line 213) exits before the hook is reached. Stale fires never increment by construction. +- **`internal/relay/forward.go` (modified)** — one nil-safe hook call per forwarder, placed AFTER the existing Send-success path: + - `StartPhoneForwarder` — `if h := reg.onPhoneForwarded; h != nil { h() }` immediately after `binary.Send(wrapped)` returns nil. The single-sink forwarder returns on Send error so the increment is naturally below all error paths. + - `StartBinaryForwarder` — `if h := reg.onBinaryForwarded; h != nil { h() }` immediately after `phone.Send(env.Frame)` returns nil. The three `continue` branches (unmarshal err, unknown `conn_id`, phone Send err) exit before the increment, structurally guaranteeing no over-count. +- **`internal/relay/metrics_forward.go` (new, ~45 lines)** — per-concern collector mirroring `metrics_connections.go`'s shape: + - `forwardMetrics` struct holds two pre-bound `prometheus.Counter` values (NOT a `CounterVec` reference + `WithLabelValues` per Inc). The constructor calls `vec.WithLabelValues("phone_to_binary")` and `vec.WithLabelValues("binary_to_phone")` once and captures the resulting children, so the hot-path hook is a single atomic Inc — avoids the per-frame map lookup the spec listed as an optional optimisation. + - `NewForwardMetrics(reg prometheus.Registerer, src *Registry)` constructs the `CounterVec` (`Name: "pyrycode_relay_frames_forwarded_total"`, `Help`, labels `["direction"]`), `MustRegister`s it on `reg`, pre-binds the two children, and calls `src.SetForwarderHooks(m.phoneToBinary.Inc, m.binaryToPhone.Inc)` — passing the `Counter.Inc` method values directly, no anonymous-closure wrapping. + - The doc comment on the struct names the hard-coded `direction` constants as load-bearing for the security review's *Tokens* category. +- **`internal/relay/metrics_grace.go` (new, ~33 lines)** — same shape, scalar: + - `graceMetrics` struct holds one `prometheus.Counter`. `NewGraceMetrics` constructs the counter (`Name: "pyrycode_relay_grace_expiries_total"`), `MustRegister`s it, and calls `src.SetGraceExpiryHook(m.counter.Inc)`. + - Doc comment names the rejected `{server=""}` label shape and the pointer-identity guard as the structural defence for the no-double-count invariant. +- **`cmd/pyrycode-relay/main.go` (+2 lines)** — two new constructor calls in the metrics-wiring block, immediately after `NewConnectionsMetrics(metricsReg, reg)`: + + ```go + relay.NewForwardMetrics(metricsReg, reg) + relay.NewGraceMetrics(metricsReg, reg) + ``` + + Both register against the private `metricsReg` and wire the hooks on `reg`. Boot-time ordering is correct: both calls run before any listener starts serving and before any forwarder goroutine launches (those start on the first connection). +- **`internal/relay/metrics_counters_test.go` (new, 243 lines)** — four tests + an inline `assertCounter(t, h, metric, labelStr, want)` helper that mirrors `assertGauge`'s substring-match posture from #61: + - `TestForwardMetrics_PhoneToBinary_OnlyOnSuccess` — drives 3 successful frames through `fakePhone` → `fakeBinary`, asserts `direction="phone_to_binary"` is 3 and `direction="binary_to_phone"` is 0. Then a negative subscenario: sticky `bin.sendErr` on a new phone → forwarder returns, counter unchanged at 3. Then a no-binary subscenario: phone on a server-id with no `ClaimServer` → forwarder returns nil, counter unchanged. + - `TestForwardMetrics_BinaryToPhone_OnlyOnSuccess` — drives 2 well-formed envelopes (one to each of two phones), then a malformed envelope, an unknown-conn_id envelope, and a phone-send-error envelope (`pErr.sendErr` set). Each `continue` path is exercised. Synchronises on a final well-formed envelope's delivery: once it lands, the three preceding continues have already been processed. Final counter: `binary_to_phone` = 3 (i=1 to p1, i=2 to p2, i=3 to p1), `phone_to_binary` = 0. + - `TestGraceMetrics_OnlyOnRealEviction` — real-eviction path: `ClaimServer` → `ScheduleReleaseServer(5ms)` → poll `BinaryFor` until the timer fires → assert counter is 1. Then the stale-fire path: `ClaimServer` → `ScheduleReleaseServer(5ms)` → immediate `ClaimServer` reclaim → sleep 30ms past the original window → counter MUST still be 1 (not 2). If the pointer-identity guard were broken, this would catch it. + - `TestGraceMetrics_RaceFreedom` — 16 goroutines × 200 ops of claim / schedule-release(1ms) / occasional immediate cancel-replace. Asserts the counter is present and parseable; the structural assertion is `go test -race` finding no DATA RACE between hook-field reads (timer goroutines) and the boot-time write. Pinned to `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles`'s shape from `registry_test.go`. +- **Reused** — `fakeBinary`, `fakePhone`, `fakeBinarySource`, `fakeConn`, `runForwarder`, `runBinaryForwarder`, `claimAndRegister`, `mustMarshal`, `newFakePhone`, `waitForSent`, `waitForPhoneSent` from `forward_test.go` and `registry_test.go` (same package, no duplication). + +## Acceptance criteria — verification map + +- **AC-1** (`frames_forwarded_total` CounterVec defined with label `direction`, two hard-coded values): `metrics_forward.go:30-42` registers the vec; `phone_to_binary` / `binary_to_phone` are string literals at lines 40-41. +- **AC-2** (`grace_expiries_total` scalar counter, no labels): `metrics_grace.go:24-32`. +- **AC-3** (`StartPhoneForwarder` increments `phone_to_binary` exactly once per successful `binary.Send`, never on `BinaryFor` miss / marshal err / Send err): hook call at `forward.go:74-76` sits after the `Send err`-returns branch and after the `BinaryFor` no-ok-returns branch. `TestForwardMetrics_PhoneToBinary_OnlyOnSuccess` exercises all four paths (3 successes + 1 Send err + 1 missing-binary). +- **AC-4** (`StartBinaryForwarder` increments `binary_to_phone` exactly once per successful `phone.Send`, never on unknown `conn_id` / malformed envelope / per-sink Send err): hook call at `forward.go:162-164` sits below all three `continue` branches. `TestForwardMetrics_BinaryToPhone_OnlyOnSuccess` exercises each. +- **AC-5** (`handleGraceExpiry` increments `grace_expiries_total` exactly once per actual eviction, inside the pointer-identity guard's success branch — stale fires never increment): hook call at `registry.go:225-227` sits below `Unlock()` and after the phone-close loop; the stale-fire `return` exits at line 213 before the hook is reached. `TestGraceMetrics_OnlyOnRealEviction`'s stale-fire subscenario asserts the counter is unchanged after a cancel-replace. +- **AC-6 / AC-7** (tests assert success-count, not iteration-count; race coverage for the grace counter): `TestForwardMetrics_*` assertions are pinned to known-good integer values; `TestGraceMetrics_RaceFreedom` runs under the package-wide `-race` build. +- **AC-8** (`make vet`, `make test -race`, `make build` clean): verified in the developer run. +- **AC-9** (codebase summary + INDEX update): this file + the INDEX entry land in the documentation phase. + +## Patterns established + +- **Nil-safe `func()` hooks on `*Registry` as the seam for per-Registry-event counters.** When a counter's increment site lives in code that mustn't import `prometheus`, add a private nil-safe hook field, an exported boot-time setter, and a per-concern collector file that constructs the counter and passes its `Inc` method value into the setter. Any future per-Registry-event counter (e.g. upgrade rejects in #57, `phones_registered_total`) lands the same way. Avoids both (a) parameter threading through ≥10 call sites and (b) the package-level-vars singleton trap rejected by #59's spec. +- **Pre-bind `CounterVec` children in the collector constructor.** Call `WithLabelValues("…")` once per label set in the constructor, capture the resulting `prometheus.Counter`, and pass its `Inc` method value into the registry hook. Hot path becomes a single atomic Inc; no map lookup per frame. The spec lists this as optional; the developer's choice was to pre-bind because the cost is two extra struct fields and the win is a measurable hot-path simplification. +- **Pass method values directly into setters.** `src.SetForwarderHooks(m.phoneToBinary.Inc, m.binaryToPhone.Inc)` — no anonymous-closure wrapping. Go's method values bind the receiver at evaluation time; the resulting `func()` has the same calling-convention cost as a hand-written closure but reads as the bare intent. +- **Hook field reads stay unlocked; the "set once at boot" rule is doc-comment-enforced.** Per the Go memory model's "goroutine creation happens-before goroutine body" rule, fields written at boot are race-free for any reads performed by goroutines launched after the writes. The doc comment on each hook field states the constraint so a future mid-run mutator must first delete the doc. +- **Same-package fakes, no duplication.** Both the gauge tests (#61) and the counter tests (#58) live in `package relay` so they reuse `fakeConn` / `fakePhone` / `fakeBinary` / the existing forwarder runners. Resist the urge to factor a "metrics test fixtures" file — every fake the metrics tests need already exists in the sibling forwarder/registry tests. + +## Lessons learned + +- **Hook bodies via method values are the bare-minimum-cost seam.** `m.counter.Inc` (a method value) compiles to a closure carrying the receiver pointer; the hook-call cost at the increment site is the function-pointer load + indirect call already present in the nil-safe `if h := …; h != nil { h() }` shape. There is no further "pre-bind the closure" optimisation that pays for itself; reaching for explicit `func() { m.counter.Inc() }` wrappers buys nothing but a longer line. +- **The pointer-identity guard does the structural work; the hook is observational.** `handleGraceExpiry`'s pointer-identity guard (ADR-0006) is what guarantees the stale-fire AC — the hook just happens to live below the guard's success branch. The same shape paid off for #61's pull-based gauges, where the guard's structural defence meant zero stale-fire test wiring. Carry forward: any new event counter that fires from a path with an existing pointer-identity / sequence-number guard inherits the no-double-count proof for free; place the hook BELOW the guard and the AC writes itself. +- **Synchronise on a sentinel envelope's delivery, not on a `time.Sleep`.** `TestForwardMetrics_BinaryToPhone_OnlyOnSuccess` injects three `continue`-path envelopes (malformed, unknown, phone-send-err), then sends a final well-formed envelope and waits for its delivery via `waitForPhoneSent`. Because the forwarder consumes the channel sequentially, when the final envelope lands the three preceding continues have already been processed. No sleep, no flake. Reach for the sentinel-after-the-batch pattern any time a test needs to assert "the three skip paths produced no side effects" without coupling to wall-clock timing. +- **Substring-matching the metric line is robust to `client_golang` text-format minutiae.** `assertCounter` matches `fmt.Sprintf("%s{%s} %d", metric, labels, want)` as a substring of the scrape body. Full-body equality would fight (a) the `promhttp_metric_handler_*` self-instrumentation lines and (b) any future `client_golang` exposition tweaks (whitespace, label ordering inside `{}`). Same posture as `assertGauge` in #61; carry to every future metric assertion. + +## Cross-links + +- [Frame-forwarded and grace-expiry counters (feature)](../features/frame-and-grace-counters.md) — evergreen contract + design doc. +- [Connection-count gauges (feature)](../features/connection-count-gauges.md) — sibling collector wired against the same `metricsReg` (#61); pull-shaped where this ticket is push-shaped. +- [Metrics registry (feature)](../features/metrics-registry.md) — the seam (#59). +- [Metrics listener (feature)](../features/metrics-listener.md) — the `/metrics` surface that serves the new counters (#60). +- [Phone-side frame forwarder](../features/phone-forwarder.md) — `StartPhoneForwarder` call site (#25). +- [Binary-side frame forwarder](../features/binary-forwarder.md) — `StartBinaryForwarder` call site (#26). +- [Connection registry](../features/connection-registry.md) — `handleGraceExpiry` call site and pointer-identity guard. +- [ADR-0008](../decisions/0008-prometheus-client-adoption.md) — Prometheus scope-of-use rules that motivated the hook indirection. +- [ADR-0006](../decisions/0006-grace-period-as-reclaim-path.md) — pointer-identity guard the grace counter rides for free. +- [Codebase note #59](59.md) — registry + handler scaffolding. +- [Codebase note #60](60.md) — listener + wiring conventions. +- [Codebase note #61](61.md) — first collector instance; pre-set the substring-match test posture. +- [Architect spec](../../specs/architecture/58-frame-and-grace-counters.md) — full design rationale + security review. +- [#56 — parent ticket](https://github.com/pyrycode/pyrycode-relay/issues/56) — split into #59 / #60 / #61 / #57 / #58. diff --git a/docs/knowledge/features/frame-and-grace-counters.md b/docs/knowledge/features/frame-and-grace-counters.md new file mode 100644 index 0000000..e915e8a --- /dev/null +++ b/docs/knowledge/features/frame-and-grace-counters.md @@ -0,0 +1,93 @@ +# Frame-forwarded and grace-expiry counters + +Three Prometheus counter time series that let an operator see relay throughput and distinguish "binary disconnected and reconnected within the grace window" from "binary disconnected and the slot was actually reclaimed": + +- `pyrycode_relay_frames_forwarded_total{direction="phone_to_binary"}` — frames the relay successfully forwarded from a phone to the binary holding the server-id. +- `pyrycode_relay_frames_forwarded_total{direction="binary_to_phone"}` — frames the relay successfully forwarded from a binary to a registered phone. +- `pyrycode_relay_grace_expiries_total` — number of grace timers that actually evicted a server-id slot (real evictions only; stale `time.AfterFunc` fires never increment). + +All three are monotonic non-negative integer counters. The `direction` label has exactly two hard-coded values; no other labels exist on either metric. + +## API + +Package `internal/relay`: + +```go +// metrics_forward.go +func NewForwardMetrics(reg prometheus.Registerer, src *Registry) + +// metrics_grace.go +func NewGraceMetrics(reg prometheus.Registerer, src *Registry) +``` + +Each constructor (a) registers its counter against `reg` and (b) wires the per-event increment hook on `src` via `Registry.SetForwarderHooks` / `Registry.SetGraceExpiryHook`. Boot-time only — set the hooks before either listener starts serving. + +## Design — push-shaped counters via nil-safe hooks on `*Registry` + +Counters of events stay push-shaped (one Inc per event), the natural shape for `Counter`s. The wrinkle: the increment sites live in code (`registry.go`, `forward.go`) that — per ADR-0008's *Scope of use* boundary — must NOT import `prometheus`. Three private `func()` fields on `*Registry` provide the seam: + +- `onPhoneForwarded` — invoked by `StartPhoneForwarder` after a successful `binary.Send`. +- `onBinaryForwarded` — invoked by `StartBinaryForwarder` after a successful `phone.Send`. +- `onGraceExpiry` — invoked by `handleGraceExpiry` after the pointer-identity guard's success branch. + +All three fields are nil-safe (call sites are `if h := …; h != nil { h() }`); tests that don't wire metrics see no-op behaviour. The fields are exported only through `SetForwarderHooks` / `SetGraceExpiryHook` setters; hook bodies in `metrics_forward.go` / `metrics_grace.go` are pure `prometheus.Counter.Inc` calls that acquire no relay-side lock. + +The two natural alternatives were rejected: + +1. **Parameter threading** (passing `*forwardMetrics` through `ClientHandler` / `ServerHandler` / `Start*Forwarder` signatures) — measured cost was ≥10 call-site updates across ≥7 files, right at the architect's edit-fan-out red line. Rejected. +2. **Package-level vars** — forces one registry per process, breaks per-test isolation under `-race`. Already rejected by the #59 scaffolding spec. + +The hooks-on-Registry shape pays a one-shot cost (Registry's surface gains three fields it does not "own") in exchange for zero signature cascade and zero test churn. Any future per-Registry-event counter (e.g. `phones_registered_total`) lands the same way: one private hook + one setter + one nil-safe call site. + +## Increment placement — load-bearing for correctness + +**`StartPhoneForwarder`** increments `direction="phone_to_binary"` exactly once per successful `binary.Send` (after the existing `if err := binary.Send(wrapped); err != nil` check passes). The single-sink forwarder returns on Send error, so the increment is naturally below all error paths. + +**`StartBinaryForwarder`** increments `direction="binary_to_phone"` exactly once per successful `phone.Send`. The three `continue` paths — malformed envelope (`Unmarshal` err), unknown `conn_id` (no matching phone), per-phone `Send` err — branch out before the increment, structurally guaranteeing no over-count. + +**`handleGraceExpiry`** increments `grace_expiries_total` as the LAST step of the success branch, after `r.mu.Unlock()` and the phone-close loop. The pointer-identity guard's stale-fire `return` exits before the hook call, so stale fires never increment by construction — the same structural defence the connection-count gauges (#61) ride for free. + +## No attacker-influenced labels + +The `direction` label's two values (`phone_to_binary`, `binary_to_phone`) are string literals in `metrics_forward.go`; neither comes from a request header, query parameter, frame payload, or registry state. Cardinality is exactly 2 regardless of traffic. `grace_expiries_total` has no labels. + +The production files' doc comments restate the constraint so a future contributor cannot quietly re-add a `{server=""}` label that would carry the attacker-influenced `x-pyrycode-server` header value onto the metrics surface (forbidden by threat model § Log hygiene, which extends to metric labels). + +## Concurrency + +- **No new goroutines.** Hook bodies run on whichever goroutine the caller is on: the forwarder goroutine (per connected phone/binary) for the frame hooks, the `time.AfterFunc` goroutine for the grace hook. +- **No new locks.** `Counter.Inc` is internally atomic per `client_golang`'s contract. +- **Hook field reads are unlocked.** The fields are written exactly once at boot (via the setters) before any forwarder or timer goroutine launches; per the Go memory model's "goroutine creation happens-before goroutine body" rule, the reads are race-free. +- **Hooks MUST NOT acquire `r.mu`.** The doc comment on each field states the constraint — `handleGraceExpiry` already holds the lock at parts of its body; a hook re-entering registry mutator paths would deadlock. The concrete hook bodies are pure `Counter.Inc` calls and acquire no relay-side lock. +- **CounterVec children are pre-bound.** `NewForwardMetrics` calls `WithLabelValues("phone_to_binary")` and `WithLabelValues("binary_to_phone")` once in the constructor and captures the resulting `prometheus.Counter` values, so each frame's hot path is a single atomic Inc on a pre-resolved child — no map lookup per frame. + +`TestGraceMetrics_RaceFreedom` runs 16 goroutines × 200 ops of claim / schedule-release / cancel-replace cycles under `-race`; the race detector's verdict is the structural assertion that the hook field's concurrent reads against boot-time writes are clean. + +## Wiring + +Both constructors are called from `cmd/pyrycode-relay/main.go`'s metrics block, alongside `NewConnectionsMetrics`: + +```go +relay.NewConnectionsMetrics(metricsReg, reg) +relay.NewForwardMetrics(metricsReg, reg) +relay.NewGraceMetrics(metricsReg, reg) +``` + +All three register against the private `metricsReg` (no `DefaultRegisterer`) and the resulting `/metrics` response is served by the localhost-only listener wired in #60. + +## Out of scope + +- **`pyrycode_relay_send_duration_seconds` histogram** for per-frame send latency was deferred. The hot-path cost (`time.Now()` × 2 per frame + bucket bookkeeping) warrants its own architect pass; the value/cost trade-off is unsettled. Its own ticket if it lands. +- **Upgrade/register counters** for the WS-upgrade reject paths (#57) — sibling ticket, same registry-hook pattern. + +## Related + +- [Metrics registry (scaffolding)](metrics-registry.md) — the seam these counters plug into (#59). +- [Metrics listener (localhost-only)](metrics-listener.md) — the `/metrics` surface that serves them (#60). +- [Connection-count gauges](connection-count-gauges.md) — pull-based sibling collector; the seam admits both shapes (#61). +- [Phone-side frame forwarder](phone-forwarder.md) — the `StartPhoneForwarder` call site for the `phone_to_binary` increment. +- [Binary-side frame forwarder](binary-forwarder.md) — the `StartBinaryForwarder` call site for the `binary_to_phone` increment. +- [Connection registry](connection-registry.md) — the `handleGraceExpiry` call site and pointer-identity guard (ADR-0006). +- [ADR-0008: Prometheus client adoption](../decisions/0008-prometheus-client-adoption.md) — scope-of-use rules that motivated the hook indirection. +- [ADR-0006: Grace window IS the reclaim path](../decisions/0006-grace-period-as-reclaim-path.md) — pointer-identity guard the grace counter rides for free. +- [Threat model](../../threat-model.md) — log-hygiene rule extended to metric labels.