From 91c80b546d9febdd7515280f20a14fa44b41151b Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Tue, 12 May 2026 19:40:53 +0300 Subject: [PATCH 1/3] spec: pyrycode_relay_connected_{binaries,phones} gauges (#61) --- .../specs/architecture/61-connected-gauges.md | 708 ++++++++++++++++++ 1 file changed, 708 insertions(+) create mode 100644 docs/specs/architecture/61-connected-gauges.md diff --git a/docs/specs/architecture/61-connected-gauges.md b/docs/specs/architecture/61-connected-gauges.md new file mode 100644 index 0000000..63f3893 --- /dev/null +++ b/docs/specs/architecture/61-connected-gauges.md @@ -0,0 +1,708 @@ +# Spec — `pyrycode_relay_connected_{binaries,phones}` gauges (#61) + +## Files to read first + +- `internal/relay/registry.go:56-79` — `Registry` struct, lock discipline, + constructor. The collector holds a `*Registry` (same package, so no + interface needed). `mu sync.RWMutex` is the lock `Counts()` already takes. +- `internal/relay/registry.go:148-184` — `ScheduleReleaseServer` + + `handleGraceExpiry`. Read the pointer-identity guard at line 171 + (`if r.timers[serverID] != self { … return }`). This is the stale-fire + no-op the AC names; the pull-based design rides it for free (see § Design + *Why pull-based*). +- `internal/relay/registry.go:258-271` — `Counts()`. The collector's + `Collect` calls this; the gauges ARE its return values. `Counts()` already + holds `RLock` and walks the maps in a snapshot-consistent way; one call + is internally consistent (per its doc comment). +- `internal/relay/registry_test.go:19-45` — `fakeConn`. The new race test + reuses this type (same package, no duplication needed). +- `internal/relay/registry_test.go:413-451` — `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles`. + Shape reference for the new race test: 16 goroutines × 200 ops, hammering + the cancel/replace + stale-fire paths. The new test races the same + registry mutations against periodic `Collect()` calls. +- `internal/relay/metrics.go:1-52` — `NewMetricsRegistry` and + `NewMetricsHandler` from #59. `MustRegister` happens against + `prometheus.Registerer`; the constructor in this ticket takes a + `prometheus.Registerer` argument, not the concrete `*prometheus.Registry` + — same convention as `promhttp.HandlerOpts.Registry`. +- `internal/relay/metrics_test.go:64-86` — `TestMetricsHandler_RegisterAndScrape_RoundTrip`. + The basic scrape-and-assert pattern the new tests mirror: create registry, + wire collector, scrape via the existing `NewMetricsHandler`, substring-match + the expected text output. +- `docs/specs/architecture/59-metrics-scaffolding.md` § *The architect's + "package-vars-vs-struct" pick* — the per-concern collector struct pattern + this ticket follows. The seam rejects the package-level-var and + mega-struct alternatives; the gauge here is a *third* shape (pull-based + Collector wrapping external state) but still hosted by a private struct + in its own file, so it stays inside the established seam. +- `docs/knowledge/decisions/0008-prometheus-client-adoption.md` § *Scope of + use* — the "no `DefaultRegisterer`" boundary. The new constructor + enforces it by taking the registerer as an explicit argument; the test + for this ticket re-runs the snapshot-delta check inherited from #59 by + way of the existing `TestMetricsRegistry_NoGlobalRegistrarLeak` (which + still passes after this ticket: the gauges register on a fresh private + registry, never on the default). +- `docs/PROJECT-MEMORY.md` line 41 — "Interface methods called under the + lock are documented as non-blocking getters." Relevant *negatively*: + pull-based design means the collector never holds the registry lock + itself; `Counts()` already takes its own RLock. No new lock-acquisition + paths. + +## Context + +`/healthz` reports a one-shot connection-count snapshot (#10). Time-series +visibility wants the same numbers exposed as Prometheus gauges so the +operator can graph utilisation and alert on drops. #59 landed the registry ++ handler scaffolding (`NewMetricsRegistry`, `NewMetricsHandler`); this +ticket adds the first pair of metrics on top of that seam. + +The single load-bearing design choice the AC delegates to the architect: + +> Gauges are incremented and decremented at the existing register, +> unregister, and grace-expiry-eviction sites in `internal/relay/registry.go`. +> The architect picks the exact shape (counters maintained inside the +> registry's lock vs. a pull-based `prometheus.Collector` over the existing +> `Counts()`). + +This spec picks **pull-based**. The justification is the bulk of § Design. + +#60 (listener) and #61 (this ticket) are independent siblings of #59 — once +this ticket lands the collector, the gauges are live in the registry's +view of the world regardless of whether #60 has wired the listener yet. +A test scrape via `NewMetricsHandler` exercises the seam without the +listener. + +## Design + +### Why pull-based, not push-based + +The AC admits two shapes. Pull-based wins on five concrete axes: + +1. **Zero diff in `registry.go`.** The registry is internet-exposed, + security-sensitive, and already raced against under `-race`. Adding + Inc/Dec calls at five mutation sites (ClaimServer success path, + ClaimServer-during-grace replace path, ReleaseServer success path, + handleGraceExpiry post-identity-check path, RegisterPhone success + path, UnregisterPhone match path) widens the audit surface of a + sensitive file. Pull-based leaves the file untouched. +2. **The stale-fire AC is satisfied structurally, not by review.** The + AC requires that `handleGraceExpiry`'s pointer-identity guard + no-op (when `r.timers[serverID] != self`) must NOT move the gauge. + With push-based, the developer has to put the Dec calls *inside* the + `if r.timers[serverID] != self` branch, AFTER the check, and the + tests have to assert that pointer-identity-miss paths leave the + gauge unchanged. With pull-based, the gauge IS `Counts()`; the guard + already keeps the map unchanged on a stale fire, so the gauge is + structurally unaffected. One fewer thing to get wrong. +3. **`Counts()` is the authoritative live count.** The AC requires + "gauge values match the registry's live count at all observation + points." Pull-based makes this a tautology: the gauge is the live + count, by construction. Push-based introduces a second source of + truth (the AtomicInt64 inside each gauge) that must agree with the + first (the map sizes); divergence between the two is a class of bug + pull-based cannot have. +4. **No new lock-acquisition path.** `Counts()` already holds RLock + and walks the maps. The collector calls `Counts()`; that's the only + read. Push-based adds Inc/Dec calls inside the registry's Lock + regions — `prometheus.Gauge.Inc/Dec` is documented atomic and + non-blocking, so this is *safe*, but it's a new pattern the lock + discipline must continue to honour as the registry evolves. +5. **The seam pattern admits it.** #59's per-concern collector struct + pattern was illustrated with a push-shaped CounterVec. This ticket + instantiates the same pattern with a pull-shaped Collector — a + private struct in its own file, constructed via a function taking + `prometheus.Registerer`, registered against the relay's private + registry. The pattern accommodates both shapes; this ticket + documents the precedent that gauges-of-a-state are naturally + pull-shaped, while sibling tickets #57/#58 (counters-of-events) + stay push-shaped. Different semantics, different shapes; both + live in the same seam. + +Push-based has one nominal advantage worth naming: the gauge value +updates atomically with the state change, while pull-based is only as +fresh as the last scrape. For Prometheus's typical 15-60s scrape +interval and the relay's use case (operator alerts on connection +drops), that freshness gap is acceptable — the operator's alert +window dwarfs the scrape interval. The advantage does not outweigh +points 1-5. + +### Artefacts + +**Two new files, no edits to existing files.** + +#### 1. `internal/relay/metrics_connections.go` (new, ~40 lines) + +```go +package relay + +import "github.com/prometheus/client_golang/prometheus" + +// connectionsCollector emits pyrycode_relay_connected_binaries and +// pyrycode_relay_connected_phones as Prometheus gauges. The values come +// from Registry.Counts() on every scrape (pull-based); the collector +// never mutates the registry and never holds the registry lock itself. +// +// Pull-based design — rationale recorded in +// docs/specs/architecture/61-connected-gauges.md § Why pull-based: +// - registry.go is untouched (audit surface unchanged), +// - the grace-expiry stale-fire guard (registry.go handleGraceExpiry) +// keeps the maps unchanged on a stale fire, so the gauge is +// structurally unaffected — no second source of truth to keep in sync, +// - Counts() already holds the right lock and returns a +// snapshot-consistent (binaries, phones) pair. +type connectionsCollector struct { + binaries *prometheus.Desc + phones *prometheus.Desc + src *Registry +} + +// NewConnectionsMetrics registers the pair of connection-count gauges +// against reg, reading their values from src on each scrape. Returns +// nothing — the collector is held alive by the registerer. +// +// Wiring site: cmd/pyrycode-relay/main.go (the call lands with #60's +// listener wiring; until then the constructor is exercised only by +// metrics_connections_test.go). +func NewConnectionsMetrics(reg prometheus.Registerer, src *Registry) { + reg.MustRegister(&connectionsCollector{ + binaries: prometheus.NewDesc( + "pyrycode_relay_connected_binaries", + "Number of pyrycode binary connections currently held by the relay registry.", + nil, nil, + ), + phones: prometheus.NewDesc( + "pyrycode_relay_connected_phones", + "Number of mobile client connections currently held by the relay registry, summed across all server-ids.", + nil, nil, + ), + src: src, + }) +} + +func (c *connectionsCollector) Describe(ch chan<- *prometheus.Desc) { + ch <- c.binaries + ch <- c.phones +} + +func (c *connectionsCollector) Collect(ch chan<- prometheus.Metric) { + b, p := c.src.Counts() + ch <- prometheus.MustNewConstMetric(c.binaries, prometheus.GaugeValue, float64(b)) + ch <- prometheus.MustNewConstMetric(c.phones, prometheus.GaugeValue, float64(p)) +} +``` + +Notes the developer must preserve verbatim: + +- **No labels.** Both metrics are scalar gauges, not GaugeVecs. + `prometheus.NewDesc(..., nil, nil)` — empty `variableLabels` and empty + `constLabels`. The AC names two metrics with no `{server="..."}` + cardinality. This is also load-bearing for security review § Tokens: + a `server` label would carry server-ids (attacker-influenced via the + `x-pyrycode-server` header on `/v1/server`) into the metrics surface, + which the threat model § Log hygiene forbids. Keep labels empty. +- **`MustNewConstMetric`, not `NewConstMetric`.** The Desc is built + once at construction; the value comes from `Counts()`, which can + never produce values that violate the Desc contract (no labels to + mismatch, count is a non-negative int). A nil return from + `NewConstMetric` is impossible under these inputs; the `Must` form + matches the contract and saves the per-scrape `if err != nil` branch. +- **`prometheus.Registerer` (interface), not `*prometheus.Registry` + (concrete).** `Registerer` is the `MustRegister`-only sub-interface; + passing it (rather than the concrete type) keeps the constructor + composable with anything implementing the interface (e.g. + `prometheus.WrapRegistererWith` for label-decoration), even though + no caller currently uses that. Same convention as + `prometheus/client_golang`'s own constructors. +- **No package-level `var`s.** Pattern reference: #59's spec § + *package-vars-vs-struct pick* rejects them as singleton-forcing. This + ticket keeps the precedent. + +#### 2. `internal/relay/metrics_connections_test.go` (new, ~120 lines) + +Three tests, in the same `package relay` (so the `fakeConn` from +`registry_test.go` is reusable without copy-paste — same convention as +the rest of the test suite, per PROJECT-MEMORY line 28). + +```go +package relay + +import ( + "fmt" + "net/http" + "net/http/httptest" + "strings" + "sync" + "sync/atomic" + "testing" + "time" +) +``` + +**Test 1: `TestConnectionsMetrics_ReflectLiveCounts`** — AC gate. +Constructs a registry, wires the collector, mutates the registry +through the public API (ClaimServer / RegisterPhone), scrapes via the +existing `NewMetricsHandler`, asserts the body contains the expected +gauge lines. + +```go +func TestConnectionsMetrics_ReflectLiveCounts(t *testing.T) { + t.Parallel() + + r := NewRegistry() + mreg := NewMetricsRegistry() + NewConnectionsMetrics(mreg, r) + h := NewMetricsHandler(mreg) + + // Empty registry: both gauges should be 0. + assertGauge(t, h, "pyrycode_relay_connected_binaries", 0) + assertGauge(t, h, "pyrycode_relay_connected_phones", 0) + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + if err := r.RegisterPhone("s1", &fakeConn{id: "p-1"}); err != nil { + t.Fatalf("RegisterPhone: %v", err) + } + if err := r.RegisterPhone("s1", &fakeConn{id: "p-2"}); err != nil { + t.Fatalf("RegisterPhone: %v", err) + } + + assertGauge(t, h, "pyrycode_relay_connected_binaries", 1) + assertGauge(t, h, "pyrycode_relay_connected_phones", 2) + + // Unregister one phone and release the server; gauges follow. + r.UnregisterPhone("s1", "p-1") + if !r.ReleaseServer("s1") { + t.Fatal("ReleaseServer: got false, want true") + } + assertGauge(t, h, "pyrycode_relay_connected_binaries", 0) + // Phone p-2 outlives the release per ReleaseServer's contract (orphan + // phones survive until explicit unregister or grace expiry). + assertGauge(t, h, "pyrycode_relay_connected_phones", 1) +} +``` + +```go +// assertGauge scrapes h and asserts that body contains the literal line +// ` ` in the Prometheus text format. Substring-match (not +// equality on the full body) because the body also contains the +// promhttp_metric_handler_* self-instrumentation counters from +// HandlerOpts.Registry=reg. +func assertGauge(t *testing.T, h http.Handler, name string, want int) { + t.Helper() + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + if rec.Code != http.StatusOK { + t.Fatalf("scrape status: got %d, want 200", rec.Code) + } + wantLine := fmt.Sprintf("%s %d", name, want) + if !strings.Contains(rec.Body.String(), wantLine) { + t.Errorf("scrape body missing %q; got:\n%s", wantLine, rec.Body.String()) + } +} +``` + +The text-format literal `pyrycode_relay_connected_binaries 0` is the +canary the spec pins: a future bump of `client_golang` that changes +the formatting (extra whitespace, scientific notation for zero, etc.) +would surface as a test failure rather than a tautological pass — same +posture as `metrics_test.go`'s pinned `Content-Type` literal. + +**Test 2: `TestConnectionsMetrics_GraceStaleFireDoesNotMoveGauge`** — +the stale-fire AC. Reproduces the cancel/replace pointer-identity +race window: + +1. ClaimServer("s1", b1), RegisterPhone("s1", p1). +2. ScheduleReleaseServer("s1", short d) → arms entry A. +3. ClaimServer("s1", b2) → ClaimServer's grace path cancels entry A's + timer, deletes the map entry, replaces the binary. +4. Sleep past d. If timer A had managed to start its body before + Stop() observed it, the pointer-identity guard in + `handleGraceExpiry` no-ops because `r.timers["s1"]` is now nil + (entry A was deleted). +5. Scrape — gauges must reflect the live state (1 binary b2, + 1 phone p1), NOT the post-eviction state (0,0). + +```go +func TestConnectionsMetrics_GraceStaleFireDoesNotMoveGauge(t *testing.T) { + t.Parallel() + + r := NewRegistry() + mreg := NewMetricsRegistry() + NewConnectionsMetrics(mreg, r) + h := NewMetricsHandler(mreg) + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer b-1: %v", err) + } + if err := r.RegisterPhone("s1", &fakeConn{id: "p-1"}); err != nil { + t.Fatalf("RegisterPhone p-1: %v", err) + } + + // Arm grace; immediately cancel-and-replace via reclaim. + r.ScheduleReleaseServer("s1", 5*time.Millisecond) + if err := r.ClaimServer("s1", &fakeConn{id: "b-2"}); err != nil { + t.Fatalf("ClaimServer b-2 (reclaim): %v", err) + } + + // Wait past the original grace window. If the stale fire were to + // move the gauge, this is when it would. + time.Sleep(30 * time.Millisecond) + + assertGauge(t, h, "pyrycode_relay_connected_binaries", 1) + assertGauge(t, h, "pyrycode_relay_connected_phones", 1) +} +``` + +This test is the structural defence the AC demands for the +stale-fire-no-op rule. With pull-based design, the test reads as +"the registry's Counts() reflects b-2 + p-1, therefore the gauge +does too." With a hypothetical push-based design, the test would +have to additionally inspect a Dec call site for branch coverage. +Pull-based earns the simpler test. + +**Test 3: `TestConnectionsMetrics_RaceFreedom`** — the explicit AC +"race register/unregister/grace-expiry against periodic gauge reads". + +```go +func TestConnectionsMetrics_RaceFreedom(t *testing.T) { + t.Parallel() + + r := NewRegistry() + mreg := NewMetricsRegistry() + NewConnectionsMetrics(mreg, r) + h := NewMetricsHandler(mreg) + + const goroutines = 16 + const opsPer = 200 + + stopScrape := make(chan struct{}) + var scrapeCount atomic.Int64 + + // Scraper: periodic gauge reads via the public handler — the same + // path Prometheus would hit. + go func() { + for { + select { + case <-stopScrape: + return + default: + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + if rec.Code != http.StatusOK { + t.Errorf("scrape status: got %d, want 200", rec.Code) + return + } + scrapeCount.Add(1) + // Tight loop — no sleep — to maximise interleaving under + // -race. The race detector reports on memory accesses, not + // on iteration count. + } + } + }() + + var wg sync.WaitGroup + wg.Add(goroutines) + for g := 0; g < goroutines; g++ { + g := g + go func() { + defer wg.Done() + sid := fmt.Sprintf("s-%d", g%4) + for i := 0; i < opsPer; i++ { + _ = r.ClaimServer(sid, &fakeConn{id: fmt.Sprintf("b-%d-%d", g, i)}) + _ = r.RegisterPhone(sid, &fakeConn{id: fmt.Sprintf("p-%d-%d", g, i)}) + r.ScheduleReleaseServer(sid, time.Millisecond) + _ = r.ClaimServer(sid, &fakeConn{id: fmt.Sprintf("b2-%d-%d", g, i)}) + r.UnregisterPhone(sid, fmt.Sprintf("p-%d-%d", g, i)) + } + }() + } + wg.Wait() + close(stopScrape) + + // Sleep past max grace duration so any in-flight timers fire and exit. + time.Sleep(50 * time.Millisecond) + + // Final state need not be empty (some claims won the cancellation + // race); we only assert the scraper ran AND a final scrape returns 200 + // with parseable gauge lines. The race detector's verdict is the real + // assertion — this test exists to feed -race interleavings, not to + // check exact values. + if scrapeCount.Load() == 0 { + t.Error("scraper never ran") + } + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + if rec.Code != http.StatusOK { + t.Fatalf("final scrape status: got %d, want 200", rec.Code) + } + body := rec.Body.String() + if !strings.Contains(body, "pyrycode_relay_connected_binaries ") { + t.Errorf("final scrape body missing binaries gauge; got:\n%s", body) + } + if !strings.Contains(body, "pyrycode_relay_connected_phones ") { + t.Errorf("final scrape body missing phones gauge; got:\n%s", body) + } +} +``` + +Shape pinned to `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles` +(registry_test.go:419) — same goroutine count, same op shape, same +"final state need not be empty" pattern. The new wrinkle is the +concurrent scraper goroutine: it reads `Counts()` indirectly through +the collector, racing against every mutation site. + +The scraper uses a tight loop without `time.Sleep` between iterations. +The race detector reports on memory access pairs, not on iteration +count; sleeping only reduces the interleaving density. The test should +finish in well under a second (16 × 200 ops with one-millisecond grace +timers). + +**No fourth test.** A test asserting "`Counts()` and the gauge agree" +would be tautological under pull-based design (the gauge IS the +return value of `Counts()`). Skip it. + +## Concurrency model + +- The collector spawns no goroutines. +- `Collect` is called by `prometheus.Registry.Gather()`, which the + handler invokes per scrape. Concurrent scrapes are serialised inside + `prometheus.Registry.Gather()` (the library's lock); the collector + itself is stateless beyond the two Desc pointers and the `*Registry` + reference. +- `Counts()` takes the registry's RLock. The collector calls `Counts()` + exactly once per `Collect`. No new lock-acquisition path; the + registry's existing lock discipline carries through. +- The race test (Test 3) is the structural check: 16 mutator goroutines + + 1 scraper goroutine, no DATA RACE reports under `-race`. The + registry's existing race coverage + (`TestRegistry_RaceFreedom`, `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles`) + already proves `Counts()` is race-free against the mutation API; this + test extends the proof to "`Counts()` racing against scrape calls via + the collector seam." + +## Error handling + +Three concrete failure modes the spec accounts for: + +1. **Metric-name collision.** `MustRegister` panics if a collector with + either gauge name is already registered against `reg`. The panic + surfaces at process startup (in `main.go` once #60 wires the call), + not at scrape time. Desirable: the relay refuses to serve a broken + metrics surface. The metric names this ticket introduces + (`pyrycode_relay_connected_binaries`, `pyrycode_relay_connected_phones`) + are unique within the relay's metric space; sibling tickets #57/#58 + register under different names per their bodies. No conflict at the + time this ticket lands. +2. **`Collect` invoked before the registry is fully constructed.** Not + reachable: `NewConnectionsMetrics` is called once at startup, after + `NewRegistry()` and `NewMetricsRegistry()`; the registry's zero-value + maps are populated by `NewRegistry()` before the collector is + registered. No nil-deref window. +3. **`Counts()` returns negative.** Not reachable: `Counts()` returns + `len(map)` and a sum of `len(slice)` values; both are + non-negative. `float64(int)` conversion is exact for the range Go + maps occupy on the relay (tens of thousands at most). No clamp + needed; if a future ticket changes `Counts()` to return a signed + delta, that ticket revisits this design. + +## Testing strategy + +- `make vet`, `make test` (which is `-race`-wide on this repo per the + Makefile), and `make build` are the AC gates. All three must be + clean before merge. +- The three tests above. No fixtures, no new test helpers beyond + `assertGauge` defined inline. `fakeConn` is reused from + `registry_test.go` (same package; no import or duplication). +- No new e2e or integration coverage. #60's listener wiring will + exercise the gauges via a live `/metrics` scrape over HTTP — that + ticket's tests, not this one's. +- The `-race` test is the structural defence for the AC's "values + match the registry's live count at all observation points": pull-based + makes the equivalence trivially true; the race test proves no + goroutine schedule can break it. + +## Open questions + +1. **`docs/knowledge/codebase/61.md` and `docs/knowledge/INDEX.md`.** + The AC lists both. Per the architect's *Never Update* rule and #59's + precedent (its spec § Open questions #3), the documentation phase + writes both files post-merge. The architect and the developer leave + them alone. Recorded so the developer does not chase an AC item + that is not theirs. +2. **Wiring call site.** `NewConnectionsMetrics(mreg, registry)` must + be invoked from somewhere; the natural site is + `cmd/pyrycode-relay/main.go` next to `relay.NewRegistry()` and + `relay.NewMetricsRegistry()`. #60's spec owns that wiring step. + This ticket lands the constructor; #60 lands the call. Until #60 + merges, the gauges are exercised only by this ticket's tests — + that's fine: tests are the AC gate, and the production wiring is + a one-line follow-up that #60 handles inside its own spec. +3. **Should the same collector emit additional `connected_*` metrics + in the future** (e.g. `pyrycode_relay_grace_period_active` gauge + counting active timers)? Out of scope. If wanted, the natural + extension is to add a third Desc field + a third `Counts()`-style + getter on the registry. Recorded so the next sibling architect + sees the seam admits it. + +## Out of scope + +- The `/metrics` listener (`http.Server`, bind address, TLS) — #60. +- Counters and histograms for the upgrade-attempt and frame-forwarded + paths — #57, #58. +- Edits to `registry.go`. Pull-based explicitly avoids them. +- `docs/knowledge/codebase/61.md` and `docs/knowledge/INDEX.md` — + documentation phase territory. +- Process / Go-runtime collectors (`collectors.NewGoCollector` etc.) — + out of bounds per ADR-0008 § Scope of use. + +## Security review (security-sensitive ticket) + +### Mindset + +Re-reading the spec as an adversary. Default verdict FAIL until each +applicable category below has either a concrete finding or an explicit +design decision that makes the category not applicable. + +The label is the gate — not the architect's judgement of "this is +just a counter." Metrics-as-exfiltration is exactly the category +this ticket sits in. + +### 1. Trust boundaries + +- **Inbound trust boundary:** none introduced. The `/metrics` listener + (which terminates untrusted scrape traffic) is #60. This ticket + ships a Collector that reads `*Registry`'s public method `Counts()` + on each scrape — no `net.Conn`, no `http.Request`, no + user-controlled input reaches this code. +- **Internal boundary:** the relay's private registry vs. + `prometheus.DefaultRegisterer`. Enforced by the constructor taking + `prometheus.Registerer` explicitly (no default-registry shortcut) + and reinforced structurally by the existing + `TestMetricsRegistry_NoGlobalRegistrarLeak` (the snapshot-delta is + zero before and after constructing the relay's wiring; this ticket + adds collectors to a fresh private registry, not to the default). + +### 2. Tokens, secrets, credentials + +- The two metrics carry **no labels** (`prometheus.NewDesc(name, + help, nil, nil)`). This is load-bearing for the threat model § Log + hygiene rule that extends to metric labels (recorded in + `docs/knowledge/features/metrics-registry.md` § *Scope of use*). + A naive shape — `connected_phones{server=""}` — would + emit the `x-pyrycode-server` header value (attacker-influenced) + onto the metrics surface; the spec rejects that shape explicitly. +- No metric value carries a token, IP, or any other secret. Values + are non-negative integers from `len(map)` / sum of `len(slice)`. +- The collector holds a `*Registry` reference. The registry does not + store tokens (per PROJECT-MEMORY: tokens are presence-checked and + discarded at `/v1/client`). No transitive secret reach. + +### 3. File operations + +None. The collector performs no file I/O. + +### 4. Subprocess / external command execution + +None. + +### 5. Cryptographic primitives + +None directly. `prometheus.MustNewConstMetric` uses no `crypto/*` APIs. +The transitive `xxhash/v2` dependency (non-cryptographic label-set +keying) is irrelevant here — labels are empty. + +### 6. Network & I/O + +- **Handler exposure surface change.** Before this ticket: the + `/metrics` body contains only `promhttp_metric_handler_*` + self-instrumentation counters (~7 lines, ~200 bytes). After this + ticket: + two scalar gauge lines (~80 bytes). Total response size + stays well under any reasonable scrape buffer. No DoS expansion. +- **Cardinality bound is zero.** Both gauges are label-less; the metric + count this ticket adds is exactly two, regardless of how many + server-ids or phones the relay holds. A future ticket adding a + per-server-id GaugeVec would re-open this category; this ticket + closes it. +- **No new `Accept` content negotiation.** The collector emits via + the existing `NewMetricsHandler`, which pins text format + (OpenMetrics disabled). No content-type-based parser differential + exposure. + +### 7. Error messages, logs, telemetry + +- **Metric labels are empty** — see § 2. The label-exfiltration + channel the threat model § Log hygiene names is closed here by + design. +- **No log lines added.** The collector does not log; collection + errors from `Counts()` are impossible (the method returns two + ints, no error). +- **Metric values are integers.** No format-string or scientific-notation + shenanigans that could leak adjacent process memory through value + serialisation. + +### 8. Concurrency + +- The collector holds no shared mutable state of its own. The two + `*prometheus.Desc` fields and the `*Registry` reference are set + at construction and never mutated. +- `Collect` reads via `Counts()`, which takes the registry's + RLock. No new lock-acquisition path; no new ordering hazard. +- The race test (Test 3) is the structural check. 16 mutator + goroutines + 1 scraper goroutine. The Go race detector's verdict + under `-race` is the AC gate. If the test passes under `make + test`, the seam is race-free; if it ever flakes, the failure mode + is structural (Counts()'s RLock vs. mutator Lock interaction) and + shared with the existing `TestRegistry_RaceFreedom`. + +### 9. Threat model alignment + +- `docs/threat-model.md` § *Log hygiene* — extends to metric labels. + Closed by design (no labels). § 2 above. +- `docs/threat-model.md` § *DoS* — the response size delta is ~80 + bytes; the `/metrics` listener's bind-address policy and `http.Server` + timeouts are #60's territory. This ticket does not introduce a new + network-reachable endpoint and does not unbound the existing one's + response size. +- `docs/threat-model.md` § *Supply chain* — no new dependency. The + collector uses `prometheus/client_golang` symbols already pulled in + by #59 (`prometheus.Registerer`, `prometheus.Collector`, + `prometheus.NewDesc`, `prometheus.MustNewConstMetric`, + `prometheus.GaugeValue`). `go mod tidy` after this ticket should be + a no-op against `go.sum`; if it is not, the developer flags that as + a regression — likely a stale `go.sum` from #59's merge, not new + surface from this ticket. +- Protocol spec (`pyrycode/pyrycode/docs/protocol-mobile.md`) + unaffected — no wire-protocol surface change. + +### Findings + +- **[Trust boundaries]** No findings — the collector reads a + package-internal method via a `*Registry` reference; no user input + reaches this code. +- **[Tokens]** PASS — gauges are label-less by design; + spec calls out the rejected `{server="..."}` shape explicitly so + the developer cannot quietly re-add it. +- **[File operations]** No findings. +- **[Subprocess]** No findings. +- **[Cryptographic primitives]** No findings. +- **[Network & I/O]** PASS — response size delta is ~80 bytes; no new + endpoint; cardinality is constant. +- **[Errors / logs / telemetry]** No findings — no labels, no logs, + integer values only. +- **[Concurrency]** PASS — no new lock-acquisition path; race test + exercises the seam against the existing mutation API under `-race`. +- **[Threat model]** No findings — log-hygiene extension to metric + labels is closed structurally; DoS surface unchanged; no + supply-chain expansion. + +### Verdict + +**PASS.** No MUST FIX findings. The single load-bearing security +design choice (label-less gauges, no `server` label) is documented +both in the production file's comments and in this spec's § Design +"No labels" note, so the developer cannot regress it without first +deleting the spec's instruction. + +**Reviewer:** architect (self-review per `architect/security-review.md`) +**Date:** 2026-05-12 From f2e767241afe696ba2403db652c63d314a45a3b4 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Tue, 12 May 2026 19:42:42 +0300 Subject: [PATCH 2/3] feat(relay): pyrycode_relay_connected_{binaries,phones} gauges (#61) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Pull-based prometheus.Collector that reads Registry.Counts() on every scrape. No edits to registry.go — the grace-expiry pointer-identity guard keeps the maps unchanged on a stale fire, so the gauges are structurally unaffected (no second source of truth to keep in sync). Label-less by design: a {server="..."} label would carry the attacker-influenced x-pyrycode-server header value onto the metrics surface, which threat-model § Log hygiene forbids. Tests: live-count reflection, grace-stale-fire no-move, and a -race-driven scraper-vs-mutator interleaving check (16 × 200 ops). --- internal/relay/metrics_connections.go | 62 +++++++ internal/relay/metrics_connections_test.go | 178 +++++++++++++++++++++ 2 files changed, 240 insertions(+) create mode 100644 internal/relay/metrics_connections.go create mode 100644 internal/relay/metrics_connections_test.go diff --git a/internal/relay/metrics_connections.go b/internal/relay/metrics_connections.go new file mode 100644 index 0000000..15d2f8f --- /dev/null +++ b/internal/relay/metrics_connections.go @@ -0,0 +1,62 @@ +package relay + +import "github.com/prometheus/client_golang/prometheus" + +// connectionsCollector emits pyrycode_relay_connected_binaries and +// pyrycode_relay_connected_phones as Prometheus gauges. The values come +// from Registry.Counts() on every scrape (pull-based); the collector +// never mutates the registry and never holds the registry lock itself. +// +// Pull-based design — rationale recorded in +// docs/specs/architecture/61-connected-gauges.md § Why pull-based: +// - registry.go is untouched (audit surface unchanged), +// - the grace-expiry stale-fire guard (registry.go handleGraceExpiry) +// keeps the maps unchanged on a stale fire, so the gauge is +// structurally unaffected — no second source of truth to keep in sync, +// - Counts() already holds the right lock and returns a +// snapshot-consistent (binaries, phones) pair. +type connectionsCollector struct { + binaries *prometheus.Desc + phones *prometheus.Desc + src *Registry +} + +// NewConnectionsMetrics registers the pair of connection-count gauges +// against reg, reading their values from src on each scrape. Returns +// nothing — the collector is held alive by the registerer. +// +// Wiring site: cmd/pyrycode-relay/main.go (the call lands with #60's +// listener wiring; until then the constructor is exercised only by +// metrics_connections_test.go). +// +// The gauges are label-less by design — see +// docs/specs/architecture/61-connected-gauges.md § Security review § Tokens: +// a {server="..."} label would carry the attacker-influenced +// x-pyrycode-server header value onto the metrics surface, which the +// threat model § Log hygiene forbids. Do not add labels. +func NewConnectionsMetrics(reg prometheus.Registerer, src *Registry) { + reg.MustRegister(&connectionsCollector{ + binaries: prometheus.NewDesc( + "pyrycode_relay_connected_binaries", + "Number of pyrycode binary connections currently held by the relay registry.", + nil, nil, + ), + phones: prometheus.NewDesc( + "pyrycode_relay_connected_phones", + "Number of mobile client connections currently held by the relay registry, summed across all server-ids.", + nil, nil, + ), + src: src, + }) +} + +func (c *connectionsCollector) Describe(ch chan<- *prometheus.Desc) { + ch <- c.binaries + ch <- c.phones +} + +func (c *connectionsCollector) Collect(ch chan<- prometheus.Metric) { + b, p := c.src.Counts() + ch <- prometheus.MustNewConstMetric(c.binaries, prometheus.GaugeValue, float64(b)) + ch <- prometheus.MustNewConstMetric(c.phones, prometheus.GaugeValue, float64(p)) +} diff --git a/internal/relay/metrics_connections_test.go b/internal/relay/metrics_connections_test.go new file mode 100644 index 0000000..8820c32 --- /dev/null +++ b/internal/relay/metrics_connections_test.go @@ -0,0 +1,178 @@ +package relay + +import ( + "fmt" + "net/http" + "net/http/httptest" + "strings" + "sync" + "sync/atomic" + "testing" + "time" +) + +// assertGauge scrapes h and asserts that body contains the literal line +// ` ` in the Prometheus text format. Substring-match (not +// equality on the full body) because the body also contains the +// promhttp_metric_handler_* self-instrumentation counters from +// HandlerOpts.Registry=reg. +func assertGauge(t *testing.T, h http.Handler, name string, want int) { + t.Helper() + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + if rec.Code != http.StatusOK { + t.Fatalf("scrape status: got %d, want 200", rec.Code) + } + wantLine := fmt.Sprintf("%s %d", name, want) + if !strings.Contains(rec.Body.String(), wantLine) { + t.Errorf("scrape body missing %q; got:\n%s", wantLine, rec.Body.String()) + } +} + +func TestConnectionsMetrics_ReflectLiveCounts(t *testing.T) { + t.Parallel() + + r := NewRegistry() + mreg := NewMetricsRegistry() + NewConnectionsMetrics(mreg, r) + h := NewMetricsHandler(mreg) + + // Empty registry: both gauges should be 0. + assertGauge(t, h, "pyrycode_relay_connected_binaries", 0) + assertGauge(t, h, "pyrycode_relay_connected_phones", 0) + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer: %v", err) + } + if err := r.RegisterPhone("s1", &fakeConn{id: "p-1"}); err != nil { + t.Fatalf("RegisterPhone: %v", err) + } + if err := r.RegisterPhone("s1", &fakeConn{id: "p-2"}); err != nil { + t.Fatalf("RegisterPhone: %v", err) + } + + assertGauge(t, h, "pyrycode_relay_connected_binaries", 1) + assertGauge(t, h, "pyrycode_relay_connected_phones", 2) + + // Unregister one phone and release the server; gauges follow. + r.UnregisterPhone("s1", "p-1") + if !r.ReleaseServer("s1") { + t.Fatal("ReleaseServer: got false, want true") + } + assertGauge(t, h, "pyrycode_relay_connected_binaries", 0) + // Phone p-2 outlives the release per ReleaseServer's contract (orphan + // phones survive until explicit unregister or grace expiry). + assertGauge(t, h, "pyrycode_relay_connected_phones", 1) +} + +func TestConnectionsMetrics_GraceStaleFireDoesNotMoveGauge(t *testing.T) { + t.Parallel() + + r := NewRegistry() + mreg := NewMetricsRegistry() + NewConnectionsMetrics(mreg, r) + h := NewMetricsHandler(mreg) + + if err := r.ClaimServer("s1", &fakeConn{id: "b-1"}); err != nil { + t.Fatalf("ClaimServer b-1: %v", err) + } + if err := r.RegisterPhone("s1", &fakeConn{id: "p-1"}); err != nil { + t.Fatalf("RegisterPhone p-1: %v", err) + } + + // Arm grace; immediately cancel-and-replace via reclaim. + r.ScheduleReleaseServer("s1", 5*time.Millisecond) + if err := r.ClaimServer("s1", &fakeConn{id: "b-2"}); err != nil { + t.Fatalf("ClaimServer b-2 (reclaim): %v", err) + } + + // Wait past the original grace window. If the stale fire were to + // move the gauge, this is when it would. + time.Sleep(30 * time.Millisecond) + + assertGauge(t, h, "pyrycode_relay_connected_binaries", 1) + assertGauge(t, h, "pyrycode_relay_connected_phones", 1) +} + +func TestConnectionsMetrics_RaceFreedom(t *testing.T) { + t.Parallel() + + r := NewRegistry() + mreg := NewMetricsRegistry() + NewConnectionsMetrics(mreg, r) + h := NewMetricsHandler(mreg) + + const goroutines = 16 + const opsPer = 200 + + stopScrape := make(chan struct{}) + var scrapeCount atomic.Int64 + scrapeDone := make(chan struct{}) + + // Scraper: periodic gauge reads via the public handler — the same + // path Prometheus would hit. + go func() { + defer close(scrapeDone) + for { + select { + case <-stopScrape: + return + default: + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + if rec.Code != http.StatusOK { + t.Errorf("scrape status: got %d, want 200", rec.Code) + return + } + scrapeCount.Add(1) + // Tight loop — no sleep — to maximise interleaving under + // -race. The race detector reports on memory accesses, not + // on iteration count. + } + } + }() + + var wg sync.WaitGroup + wg.Add(goroutines) + for g := 0; g < goroutines; g++ { + g := g + go func() { + defer wg.Done() + sid := fmt.Sprintf("s-%d", g%4) + for i := 0; i < opsPer; i++ { + _ = r.ClaimServer(sid, &fakeConn{id: fmt.Sprintf("b-%d-%d", g, i)}) + _ = r.RegisterPhone(sid, &fakeConn{id: fmt.Sprintf("p-%d-%d", g, i)}) + r.ScheduleReleaseServer(sid, time.Millisecond) + _ = r.ClaimServer(sid, &fakeConn{id: fmt.Sprintf("b2-%d-%d", g, i)}) + r.UnregisterPhone(sid, fmt.Sprintf("p-%d-%d", g, i)) + } + }() + } + wg.Wait() + close(stopScrape) + <-scrapeDone + + // Sleep past max grace duration so any in-flight timers fire and exit. + time.Sleep(50 * time.Millisecond) + + // Final state need not be empty (some claims won the cancellation + // race); we only assert the scraper ran AND a final scrape returns 200 + // with parseable gauge lines. The race detector's verdict is the real + // assertion — this test exists to feed -race interleavings, not to + // check exact values. + if scrapeCount.Load() == 0 { + t.Error("scraper never ran") + } + rec := httptest.NewRecorder() + h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil)) + if rec.Code != http.StatusOK { + t.Fatalf("final scrape status: got %d, want 200", rec.Code) + } + body := rec.Body.String() + if !strings.Contains(body, "pyrycode_relay_connected_binaries ") { + t.Errorf("final scrape body missing binaries gauge; got:\n%s", body) + } + if !strings.Contains(body, "pyrycode_relay_connected_phones ") { + t.Errorf("final scrape body missing phones gauge; got:\n%s", body) + } +} From 0147786044a80fb99ba0bd658ba8efe7690f3fb7 Mon Sep 17 00:00:00 2001 From: Juhana Ilmoniemi Date: Tue, 12 May 2026 19:47:29 +0300 Subject: [PATCH 3/3] docs: connection-count gauges feature + codebase note for #61 Adds docs/knowledge/features/connection-count-gauges.md (evergreen design + pull-based-collector rationale), docs/knowledge/codebase/61.md (per-ticket implementation summary + lessons), updates INDEX.md, and refreshes the metrics-registry feature doc to note the first collector has landed. Co-Authored-By: Claude Opus 4.7 --- docs/knowledge/INDEX.md | 3 +- docs/knowledge/codebase/61.md | 47 +++++++++++++++ .../features/connection-count-gauges.md | 59 +++++++++++++++++++ docs/knowledge/features/metrics-registry.md | 6 +- 4 files changed, 112 insertions(+), 3 deletions(-) create mode 100644 docs/knowledge/codebase/61.md create mode 100644 docs/knowledge/features/connection-count-gauges.md diff --git a/docs/knowledge/INDEX.md b/docs/knowledge/INDEX.md index 2e28e6a..804c48e 100644 --- a/docs/knowledge/INDEX.md +++ b/docs/knowledge/INDEX.md @@ -4,7 +4,8 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o ## Features -- [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). No counters/gauges, no listener — siblings (#57, #58) and #60 fill those in. Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59). +- [Connection-count gauges](features/connection-count-gauges.md) — `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones` exposed via a pull-based `prometheus.Collector` reading `Registry.Counts()` on each scrape; zero edits to `registry.go`; scalar (no labels) by design — `{server="..."}` would carry the attacker-influenced `x-pyrycode-server` header onto the metrics surface, which threat-model § Log hygiene forbids; stale grace-expiry fires can't move the gauge because the pointer-identity guard (ADR-0006) keeps the maps unchanged and the gauge IS the map size; race-tested against 16 mutator goroutines + a tight-loop scraper under `-race`. First collector wired into the #59 seam (#61). +- [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars) — first instantiated by #61's `connectionsCollector`. Listener still pending (#60). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59). - [Docker image](features/docker-image.md) — portable OCI artifact: multi-stage `Dockerfile` builds a fully-static binary (`CGO_ENABLED=0`, `-trimpath -s -w`) into `distroless/static-debian12:nonroot`; both base images digest-pinned with `# Tracks:` comments; exposes `:80`/`:443` and declares `/var/lib/relay/autocert` volume; host-specific wiring (TLS policy, ports, volumes, healthcheck) is #38's problem (#32). - [Binary-side frame forwarder](features/binary-forwarder.md) — per-binary read pump: unwraps each inbound routing envelope, linear-scans `PhonesFor(serverID)` for `env.ConnID`, writes `env.Frame` verbatim to that phone; opaque inner bytes; synchronous (handler discards the return); diverges from #25 in error policy — unknown `conn_id`, malformed envelope, phone `Send` error all log+continue (a single bad frame never tears down the binary); replaced `/v1/server`'s `CloseRead` placeholder (#26). - [WebSocket heartbeat](features/heartbeat.md) — per-conn goroutine on both endpoints sends RFC 6455 ping every 30s; closes with `1011 "heartbeat timeout"` if no pong within 30s. Detects half-open TCP within 60s; ctx-cancel exit path leaves close to the handler defer (#7). diff --git a/docs/knowledge/codebase/61.md b/docs/knowledge/codebase/61.md new file mode 100644 index 0000000..f3c4839 --- /dev/null +++ b/docs/knowledge/codebase/61.md @@ -0,0 +1,47 @@ +# Ticket #61 — `pyrycode_relay_connected_{binaries,phones}` gauges + +First metrics collector wired into the #59 scaffolding seam. A pull-based `prometheus.Collector` exposes the relay's live connection counts (`Registry.Counts()`) as two scalar gauges, with zero edits to `registry.go`. The gauges are the time-series companion to `/healthz`'s one-shot snapshot. + +## Implementation + +- **`internal/relay/metrics_connections.go`** (new, ~60 lines) — one export, two unexported methods: + - `NewConnectionsMetrics(reg prometheus.Registerer, src *Registry)` constructs a private `connectionsCollector{binaries, phones *prometheus.Desc, src *Registry}` and `MustRegister`s it against `reg`. Takes `Registerer` (interface), not `*prometheus.Registry` (concrete), matching `client_golang`'s own constructor convention. + - `Describe` emits both `*prometheus.Desc` values; `Collect` calls `src.Counts()` once and emits two `prometheus.MustNewConstMetric(..., prometheus.GaugeValue, float64(n))` metrics. + - `prometheus.NewDesc(name, help, nil, nil)` — empty `variableLabels` and `constLabels`. **No labels by design** — the production file's docstring names the constraint and points at the spec's threat-model justification so a future contributor cannot quietly re-add `{server="..."}`. + - `MustNewConstMetric` (not `NewConstMetric`): with no labels and non-negative integer values, the error branch is unreachable; the `Must` form drops the per-scrape `if err != nil` noise. +- **`internal/relay/metrics_connections_test.go`** (new, ~180 lines), three tests + an inline `assertGauge` helper: + - `TestConnectionsMetrics_ReflectLiveCounts` — AC gate. Wires registry + collector + handler, mutates state through the public API (`ClaimServer`, `RegisterPhone`, `UnregisterPhone`, `ReleaseServer`), scrapes via `NewMetricsHandler`, substring-asserts the Prometheus text-format literal ` ` lines. + - `TestConnectionsMetrics_GraceStaleFireDoesNotMoveGauge` — the stale-fire AC. Arms a 5ms grace, then immediately reclaims; sleeps past the original window so any stale `time.AfterFunc` body runs through `handleGraceExpiry`'s pointer-identity guard; asserts gauges reflect the reclaim state (b-2 + p-1), not the post-eviction state (0, 0). With pull-based design this is structurally guaranteed: the guard keeps the maps unchanged, and the gauge is the map size. + - `TestConnectionsMetrics_RaceFreedom` — 16 mutator goroutines × 200 ops (claim / register / schedule-release / reclaim / unregister) race against a tight-loop scraper goroutine; `-race` detector's verdict is the structural assertion. Shape pinned to `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles` in `registry_test.go`. Adds a `scrapeDone` channel to deterministically join the scraper before the final assertion scrape (small deviation from the spec; same intent, no flake window). + - `assertGauge(t, h, name, want)` — scrapes via `h.ServeHTTP`, substring-matches `fmt.Sprintf("%s %d", name, want)`. Substring (not full-body equality) because the response also carries `promhttp_metric_handler_*` self-instrumentation lines from `HandlerOpts.Registry: reg` (#59). +- **Reused** — `fakeConn` from `registry_test.go` (same package, no duplication needed); `NewMetricsRegistry` + `NewMetricsHandler` from #59. +- **Unchanged** — `registry.go`, `metrics.go`, `cmd/pyrycode-relay/main.go`. The wiring call (`NewConnectionsMetrics(mreg, registry)`) lands with #60's listener wiring; until then the collector is exercised only by this ticket's tests. + +## Patterns established + +- **Pull-based collector for gauges-of-a-state.** Where the gauge value is a function of registry state already exposed through a getter (`Counts()`), prefer a `prometheus.Collector` that reads the getter on each scrape over Inc/Dec calls at every mutation site. Sibling architects: counters-of-events (#57 upgrade rejects, #58 frame counts) stay push-shaped; the seam admits both. +- **`Registerer` over `*Registry` in collector constructors.** New collector constructors take `prometheus.Registerer` so they compose with `WrapRegistererWith` if a future ticket needs label decoration. Same as `client_golang`'s own constructors. +- **Label-less for state mirrors of internet-exposed identifiers.** Any future gauge that mirrors registry state keyed by `server-id` (or any other attacker-influenced identifier) stays scalar. Per-server cardinality is a deliberate decision a future ADR would need to argue past the threat-model § Log hygiene rule's extension to metric labels. +- **`MustNewConstMetric` when value-space cannot violate the Desc.** No labels + non-negative integers means the error branch is unreachable; the `Must` form keeps the `Collect` body to one line per metric. + +## Out of scope (delegated) + +- `/metrics` listener (`http.Server`, bind address, TLS, flag plumbing) — #60. +- Counters/histograms for upgrade-attempt and frame-forward paths — #57, #58. +- A `pyrycode_relay_grace_period_active` gauge counting in-flight grace timers — would extend the collector with a third Desc + a third `Counts()`-style getter on the registry. Recorded for future siblings; no current need. + +## Lessons learned + +- **When the registry has a snapshot getter, write a pull-based collector.** The architect's spec lists five separate wins (zero diff in the sensitive file, structural stale-fire defence, single source of truth, no new lock-acquisition path, seam compatibility). The biggest in practice is #2: the grace-expiry pointer-identity guard (ADR-0006) gives you the stale-fire AC for free under pull-based, but would require careful Dec-placement-and-branch-coverage under push-based. If a future state-mirror gauge sits adjacent to a stale-fire guard, repeat the pattern. +- **`MustRegister` panics are a deploy-time safety net.** Name collisions surface at process startup (in `main.go`), not at scrape time. Sibling tickets registering against the same `mreg` must own unique metric names within the `pyrycode_relay_*` namespace; this ticket reserves `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones`. +- **Tight-loop scrape goroutines beat sleeping ones under `-race`.** The race detector reports on memory access pairs, not on iteration count; sleeping only reduces interleaving density. Pin scrapers in race tests to a select-default loop (no `time.Sleep` between iterations) — same posture as the existing `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles`. Test still finishes in well under a second on a 16×200 op schedule. +- **Substring-match the metric line, not the full body.** `HandlerOpts.Registry: reg` (load-bearing for #59's no-default-registry boundary) means the scrape body also contains `promhttp_metric_handler_*` lines. Full-body equality would fight that; `strings.Contains(body, " ")` keeps the text-format literal as the canary without coupling tests to the self-instrumentation surface. + +## Cross-links + +- [Connection-count gauges feature](../features/connection-count-gauges.md) — evergreen seam + design doc. +- [Metrics registry feature](../features/metrics-registry.md) — the #59 scaffolding this collector plugs into. +- [ADR-0008: Prometheus client adoption](../decisions/0008-prometheus-client-adoption.md) — scope-of-use rules. +- [ADR-0006: Grace window IS the reclaim path](../decisions/0006-grace-period-as-reclaim-path.md) — pointer-identity guard the pull-based design rides for free. +- [`/healthz` JSON endpoint](../features/healthz.md) — one-shot snapshot of the same two counts. +- [Architect spec](../../specs/architecture/61-connected-gauges.md) — full design rationale + security review. diff --git a/docs/knowledge/features/connection-count-gauges.md b/docs/knowledge/features/connection-count-gauges.md new file mode 100644 index 0000000..94ea54e --- /dev/null +++ b/docs/knowledge/features/connection-count-gauges.md @@ -0,0 +1,59 @@ +# Connection-count gauges + +Two Prometheus gauges expose the relay's live connection counts, the same numbers `/healthz` returns as a one-shot snapshot, but as a time series an operator can graph and alert on: + +- `pyrycode_relay_connected_binaries` — number of pyrycode binary connections currently held by the registry. +- `pyrycode_relay_connected_phones` — number of mobile-client connections currently held, summed across all server-ids. + +Both are scalar gauges with no labels. Values are non-negative integers. + +## API + +Package `internal/relay` (`metrics_connections.go`): + +```go +func NewConnectionsMetrics(reg prometheus.Registerer, src *Registry) +``` + +Registers a single private `connectionsCollector` against `reg` that reads `src.Counts()` on every scrape. Returns nothing — the registerer holds the collector alive. + +The constructor takes `prometheus.Registerer` (the `MustRegister`-only sub-interface), not the concrete `*prometheus.Registry`, so it composes with anything implementing the interface (e.g. `prometheus.WrapRegistererWith`). Same convention as `prometheus/client_golang`'s own constructors. + +## Design — pull-based collector + +The collector emits the gauges via `prometheus.Collector` / `MustNewConstMetric` on each `Collect`, reading the registry's live state through the existing `Registry.Counts()` getter. It never mutates the registry, never holds the registry lock itself, and adds no Inc/Dec call sites to `registry.go`. + +Pull-based was picked over a push-based shape (Inc/Dec at every registry mutation site) on five concrete axes — recorded in the spec, summarised here because the rationale is load-bearing for sibling tickets: + +1. **Zero diff in `registry.go`.** The registry is internet-exposed and already raced against under `-race`; widening its audit surface is expensive. +2. **The grace-expiry stale-fire AC is satisfied structurally.** `handleGraceExpiry`'s pointer-identity guard already keeps the maps unchanged on a stale fire (ADR-0006). Because the gauge IS the map size via `Counts()`, the no-op path can't move the gauge by construction — no `if r.timers[serverID] == self` branch to wire Dec into, no test required to assert the branch was taken. +3. **`Counts()` is the single source of truth.** Push-based would maintain a second source (the gauge's internal `AtomicInt64`) that must agree with the first; divergence is a class of bug pull-based cannot have. +4. **No new lock-acquisition path.** `Counts()` already holds RLock and walks the maps. The collector calls `Counts()` once per scrape — no new locking surface in `registry.go`. +5. **The metrics seam admits it.** The per-concern collector pattern from #59 was illustrated with a push-shaped `CounterVec`. This ticket instantiates the same pattern with a pull-shaped `Collector` — both shapes coexist; gauges-of-a-state are naturally pull, counters-of-events are naturally push. + +Pull-based's nominal cost is freshness: the gauge is only as current as the last scrape. For Prometheus's typical 15-60s scrape interval and the relay's alerting use case, the freshness gap is well below the operator's alert window. Acceptable. + +## No labels — load-bearing for log hygiene + +Both gauges are scalar; `prometheus.NewDesc(name, help, nil, nil)` passes empty `variableLabels` and `constLabels`. A `{server=""}` label would carry the attacker-influenced `x-pyrycode-server` header value onto the metrics surface, which the threat model § Log hygiene rule (which extends to metric labels) forbids. The production file's comments and the spec both name this constraint so a future contributor cannot quietly re-add a server label without first deleting the spec's instruction. + +The constant-cardinality property also keeps the `/metrics` response size delta small (~80 bytes) regardless of how many server-ids the relay holds. + +## Wiring + +The collector lives in the registry's state for the process lifetime once `NewConnectionsMetrics(mreg, registry)` is called at startup. The wiring call site is `cmd/pyrycode-relay/main.go` next to `relay.NewRegistry()` and `relay.NewMetricsRegistry()`, alongside the listener wiring tracked under #60. Until #60 merges, the collector is exercised only by `metrics_connections_test.go`. + +## Concurrency + +- The collector spawns no goroutines and holds no mutable state — two `*prometheus.Desc` fields and one `*Registry` pointer set at construction. +- `Collect` calls `Counts()` once per scrape. `Counts()` takes the registry's RLock; the collector itself is lock-free. Concurrent scrapes are serialised inside `prometheus.Registry.Gather()`. +- `TestConnectionsMetrics_RaceFreedom` runs 16 mutator goroutines (claim / register / schedule-release / reclaim / unregister) against a tight-loop scraper under `-race`; the race detector's verdict is the structural assertion. + +## Related + +- [Metrics registry (scaffolding)](metrics-registry.md) — the seam this collector plugs into (#59). +- [ADR-0008: Prometheus client adoption](../decisions/0008-prometheus-client-adoption.md) — scope-of-use rules (no `DefaultRegisterer`, no process/runtime collectors). +- [ADR-0006: Grace window IS the reclaim path](../decisions/0006-grace-period-as-reclaim-path.md) — pointer-identity guard the pull-based design rides for free. +- [Connection registry](connection-registry.md) — `Counts()` is the live count source the gauges read. +- [`/healthz` JSON endpoint](healthz.md) — one-shot snapshot of the same two numbers (the gauges are the time-series companion). +- [Threat model](../../threat-model.md) — log-hygiene rule that extends to metric labels. diff --git a/docs/knowledge/features/metrics-registry.md b/docs/knowledge/features/metrics-registry.md index 671d0ca..fb63043 100644 --- a/docs/knowledge/features/metrics-registry.md +++ b/docs/knowledge/features/metrics-registry.md @@ -2,7 +2,9 @@ The relay holds a private `*prometheus.Registry` and serves it over a Prometheus-format `/metrics` handler. The registry is constructed per process; sibling tickets register counters / gauges / histograms against it. `prometheus.DefaultRegisterer` is never touched. -This page documents the scaffolding only — what the seam is, why it has the shape it does, and where the rules are recorded. The first counter / gauge lands in sibling tickets; the `/metrics` listener wiring lives in #60. +This page documents the scaffolding only — what the seam is, why it has the shape it does, and where the rules are recorded. The `/metrics` listener wiring lives in #60. + +The first collector to plug into the seam landed in #61: see [Connection-count gauges](connection-count-gauges.md) for the pull-based `connectionsCollector` exposing `pyrycode_relay_connected_{binaries,phones}` over `Registry.Counts()`. Sibling counter/histogram tickets (#57, #58) follow. ## API @@ -77,7 +79,7 @@ Structurally out of bounds: ## What this deliberately does NOT do -- No counters, no gauges, no histograms — siblings (#57, #58, future) own those. +- No counters, no histograms here — siblings (#57, #58, future) own those. The first gauge collector (`connectionsCollector`) landed in #61 as a pull-based reader of `Registry.Counts()`; see [Connection-count gauges](connection-count-gauges.md). - No listener — `/metrics` is not bound to a `mux` in this ticket. #60 owns the listener (flag, bind-address validation, `http.Server` timeouts). - No process / Go-runtime collectors. - No `ErrorLog` wiring from `promhttp` into `slog`.