Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,8 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). No counters/gauges, no listener — siblings (#57, #58) and #60 fill those in. Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59).
- [Connection-count gauges](features/connection-count-gauges.md) — `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones` exposed via a pull-based `prometheus.Collector` reading `Registry.Counts()` on each scrape; zero edits to `registry.go`; scalar (no labels) by design — `{server="..."}` would carry the attacker-influenced `x-pyrycode-server` header onto the metrics surface, which threat-model § Log hygiene forbids; stale grace-expiry fires can't move the gauge because the pointer-identity guard (ADR-0006) keeps the maps unchanged and the gauge IS the map size; race-tested against 16 mutator goroutines + a tight-loop scraper under `-race`. First collector wired into the #59 seam (#61).
- [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars) — first instantiated by #61's `connectionsCollector`. Listener still pending (#60). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59).
- [Docker image](features/docker-image.md) — portable OCI artifact: multi-stage `Dockerfile` builds a fully-static binary (`CGO_ENABLED=0`, `-trimpath -s -w`) into `distroless/static-debian12:nonroot`; both base images digest-pinned with `# Tracks:` comments; exposes `:80`/`:443` and declares `/var/lib/relay/autocert` volume; host-specific wiring (TLS policy, ports, volumes, healthcheck) is #38's problem (#32).
- [Binary-side frame forwarder](features/binary-forwarder.md) — per-binary read pump: unwraps each inbound routing envelope, linear-scans `PhonesFor(serverID)` for `env.ConnID`, writes `env.Frame` verbatim to that phone; opaque inner bytes; synchronous (handler discards the return); diverges from #25 in error policy — unknown `conn_id`, malformed envelope, phone `Send` error all log+continue (a single bad frame never tears down the binary); replaced `/v1/server`'s `CloseRead` placeholder (#26).
- [WebSocket heartbeat](features/heartbeat.md) — per-conn goroutine on both endpoints sends RFC 6455 ping every 30s; closes with `1011 "heartbeat timeout"` if no pong within 30s. Detects half-open TCP within 60s; ctx-cancel exit path leaves close to the handler defer (#7).
Expand Down
47 changes: 47 additions & 0 deletions docs/knowledge/codebase/61.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
# Ticket #61 — `pyrycode_relay_connected_{binaries,phones}` gauges

First metrics collector wired into the #59 scaffolding seam. A pull-based `prometheus.Collector` exposes the relay's live connection counts (`Registry.Counts()`) as two scalar gauges, with zero edits to `registry.go`. The gauges are the time-series companion to `/healthz`'s one-shot snapshot.

## Implementation

- **`internal/relay/metrics_connections.go`** (new, ~60 lines) — one export, two unexported methods:
- `NewConnectionsMetrics(reg prometheus.Registerer, src *Registry)` constructs a private `connectionsCollector{binaries, phones *prometheus.Desc, src *Registry}` and `MustRegister`s it against `reg`. Takes `Registerer` (interface), not `*prometheus.Registry` (concrete), matching `client_golang`'s own constructor convention.
- `Describe` emits both `*prometheus.Desc` values; `Collect` calls `src.Counts()` once and emits two `prometheus.MustNewConstMetric(..., prometheus.GaugeValue, float64(n))` metrics.
- `prometheus.NewDesc(name, help, nil, nil)` — empty `variableLabels` and `constLabels`. **No labels by design** — the production file's docstring names the constraint and points at the spec's threat-model justification so a future contributor cannot quietly re-add `{server="..."}`.
- `MustNewConstMetric` (not `NewConstMetric`): with no labels and non-negative integer values, the error branch is unreachable; the `Must` form drops the per-scrape `if err != nil` noise.
- **`internal/relay/metrics_connections_test.go`** (new, ~180 lines), three tests + an inline `assertGauge` helper:
- `TestConnectionsMetrics_ReflectLiveCounts` — AC gate. Wires registry + collector + handler, mutates state through the public API (`ClaimServer`, `RegisterPhone`, `UnregisterPhone`, `ReleaseServer`), scrapes via `NewMetricsHandler`, substring-asserts the Prometheus text-format literal `<name> <int>` lines.
- `TestConnectionsMetrics_GraceStaleFireDoesNotMoveGauge` — the stale-fire AC. Arms a 5ms grace, then immediately reclaims; sleeps past the original window so any stale `time.AfterFunc` body runs through `handleGraceExpiry`'s pointer-identity guard; asserts gauges reflect the reclaim state (b-2 + p-1), not the post-eviction state (0, 0). With pull-based design this is structurally guaranteed: the guard keeps the maps unchanged, and the gauge is the map size.
- `TestConnectionsMetrics_RaceFreedom` — 16 mutator goroutines × 200 ops (claim / register / schedule-release / reclaim / unregister) race against a tight-loop scraper goroutine; `-race` detector's verdict is the structural assertion. Shape pinned to `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles` in `registry_test.go`. Adds a `scrapeDone` channel to deterministically join the scraper before the final assertion scrape (small deviation from the spec; same intent, no flake window).
- `assertGauge(t, h, name, want)` — scrapes via `h.ServeHTTP`, substring-matches `fmt.Sprintf("%s %d", name, want)`. Substring (not full-body equality) because the response also carries `promhttp_metric_handler_*` self-instrumentation lines from `HandlerOpts.Registry: reg` (#59).
- **Reused** — `fakeConn` from `registry_test.go` (same package, no duplication needed); `NewMetricsRegistry` + `NewMetricsHandler` from #59.
- **Unchanged** — `registry.go`, `metrics.go`, `cmd/pyrycode-relay/main.go`. The wiring call (`NewConnectionsMetrics(mreg, registry)`) lands with #60's listener wiring; until then the collector is exercised only by this ticket's tests.

## Patterns established

- **Pull-based collector for gauges-of-a-state.** Where the gauge value is a function of registry state already exposed through a getter (`Counts()`), prefer a `prometheus.Collector` that reads the getter on each scrape over Inc/Dec calls at every mutation site. Sibling architects: counters-of-events (#57 upgrade rejects, #58 frame counts) stay push-shaped; the seam admits both.
- **`Registerer` over `*Registry` in collector constructors.** New collector constructors take `prometheus.Registerer` so they compose with `WrapRegistererWith` if a future ticket needs label decoration. Same as `client_golang`'s own constructors.
- **Label-less for state mirrors of internet-exposed identifiers.** Any future gauge that mirrors registry state keyed by `server-id` (or any other attacker-influenced identifier) stays scalar. Per-server cardinality is a deliberate decision a future ADR would need to argue past the threat-model § Log hygiene rule's extension to metric labels.
- **`MustNewConstMetric` when value-space cannot violate the Desc.** No labels + non-negative integers means the error branch is unreachable; the `Must` form keeps the `Collect` body to one line per metric.

## Out of scope (delegated)

- `/metrics` listener (`http.Server`, bind address, TLS, flag plumbing) — #60.
- Counters/histograms for upgrade-attempt and frame-forward paths — #57, #58.
- A `pyrycode_relay_grace_period_active` gauge counting in-flight grace timers — would extend the collector with a third Desc + a third `Counts()`-style getter on the registry. Recorded for future siblings; no current need.

## Lessons learned

- **When the registry has a snapshot getter, write a pull-based collector.** The architect's spec lists five separate wins (zero diff in the sensitive file, structural stale-fire defence, single source of truth, no new lock-acquisition path, seam compatibility). The biggest in practice is #2: the grace-expiry pointer-identity guard (ADR-0006) gives you the stale-fire AC for free under pull-based, but would require careful Dec-placement-and-branch-coverage under push-based. If a future state-mirror gauge sits adjacent to a stale-fire guard, repeat the pattern.
- **`MustRegister` panics are a deploy-time safety net.** Name collisions surface at process startup (in `main.go`), not at scrape time. Sibling tickets registering against the same `mreg` must own unique metric names within the `pyrycode_relay_*` namespace; this ticket reserves `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones`.
- **Tight-loop scrape goroutines beat sleeping ones under `-race`.** The race detector reports on memory access pairs, not on iteration count; sleeping only reduces interleaving density. Pin scrapers in race tests to a select-default loop (no `time.Sleep` between iterations) — same posture as the existing `TestScheduleReleaseServer_RaceFreedomUnderRapidCycles`. Test still finishes in well under a second on a 16×200 op schedule.
- **Substring-match the metric line, not the full body.** `HandlerOpts.Registry: reg` (load-bearing for #59's no-default-registry boundary) means the scrape body also contains `promhttp_metric_handler_*` lines. Full-body equality would fight that; `strings.Contains(body, "<name> <int>")` keeps the text-format literal as the canary without coupling tests to the self-instrumentation surface.

## Cross-links

- [Connection-count gauges feature](../features/connection-count-gauges.md) — evergreen seam + design doc.
- [Metrics registry feature](../features/metrics-registry.md) — the #59 scaffolding this collector plugs into.
- [ADR-0008: Prometheus client adoption](../decisions/0008-prometheus-client-adoption.md) — scope-of-use rules.
- [ADR-0006: Grace window IS the reclaim path](../decisions/0006-grace-period-as-reclaim-path.md) — pointer-identity guard the pull-based design rides for free.
- [`/healthz` JSON endpoint](../features/healthz.md) — one-shot snapshot of the same two counts.
- [Architect spec](../../specs/architecture/61-connected-gauges.md) — full design rationale + security review.
59 changes: 59 additions & 0 deletions docs/knowledge/features/connection-count-gauges.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
# Connection-count gauges

Two Prometheus gauges expose the relay's live connection counts, the same numbers `/healthz` returns as a one-shot snapshot, but as a time series an operator can graph and alert on:

- `pyrycode_relay_connected_binaries` — number of pyrycode binary connections currently held by the registry.
- `pyrycode_relay_connected_phones` — number of mobile-client connections currently held, summed across all server-ids.

Both are scalar gauges with no labels. Values are non-negative integers.

## API

Package `internal/relay` (`metrics_connections.go`):

```go
func NewConnectionsMetrics(reg prometheus.Registerer, src *Registry)
```

Registers a single private `connectionsCollector` against `reg` that reads `src.Counts()` on every scrape. Returns nothing — the registerer holds the collector alive.

The constructor takes `prometheus.Registerer` (the `MustRegister`-only sub-interface), not the concrete `*prometheus.Registry`, so it composes with anything implementing the interface (e.g. `prometheus.WrapRegistererWith`). Same convention as `prometheus/client_golang`'s own constructors.

## Design — pull-based collector

The collector emits the gauges via `prometheus.Collector` / `MustNewConstMetric` on each `Collect`, reading the registry's live state through the existing `Registry.Counts()` getter. It never mutates the registry, never holds the registry lock itself, and adds no Inc/Dec call sites to `registry.go`.

Pull-based was picked over a push-based shape (Inc/Dec at every registry mutation site) on five concrete axes — recorded in the spec, summarised here because the rationale is load-bearing for sibling tickets:

1. **Zero diff in `registry.go`.** The registry is internet-exposed and already raced against under `-race`; widening its audit surface is expensive.
2. **The grace-expiry stale-fire AC is satisfied structurally.** `handleGraceExpiry`'s pointer-identity guard already keeps the maps unchanged on a stale fire (ADR-0006). Because the gauge IS the map size via `Counts()`, the no-op path can't move the gauge by construction — no `if r.timers[serverID] == self` branch to wire Dec into, no test required to assert the branch was taken.
3. **`Counts()` is the single source of truth.** Push-based would maintain a second source (the gauge's internal `AtomicInt64`) that must agree with the first; divergence is a class of bug pull-based cannot have.
4. **No new lock-acquisition path.** `Counts()` already holds RLock and walks the maps. The collector calls `Counts()` once per scrape — no new locking surface in `registry.go`.
5. **The metrics seam admits it.** The per-concern collector pattern from #59 was illustrated with a push-shaped `CounterVec`. This ticket instantiates the same pattern with a pull-shaped `Collector` — both shapes coexist; gauges-of-a-state are naturally pull, counters-of-events are naturally push.

Pull-based's nominal cost is freshness: the gauge is only as current as the last scrape. For Prometheus's typical 15-60s scrape interval and the relay's alerting use case, the freshness gap is well below the operator's alert window. Acceptable.

## No labels — load-bearing for log hygiene

Both gauges are scalar; `prometheus.NewDesc(name, help, nil, nil)` passes empty `variableLabels` and `constLabels`. A `{server="<server-id>"}` label would carry the attacker-influenced `x-pyrycode-server` header value onto the metrics surface, which the threat model § Log hygiene rule (which extends to metric labels) forbids. The production file's comments and the spec both name this constraint so a future contributor cannot quietly re-add a server label without first deleting the spec's instruction.

The constant-cardinality property also keeps the `/metrics` response size delta small (~80 bytes) regardless of how many server-ids the relay holds.

## Wiring

The collector lives in the registry's state for the process lifetime once `NewConnectionsMetrics(mreg, registry)` is called at startup. The wiring call site is `cmd/pyrycode-relay/main.go` next to `relay.NewRegistry()` and `relay.NewMetricsRegistry()`, alongside the listener wiring tracked under #60. Until #60 merges, the collector is exercised only by `metrics_connections_test.go`.

## Concurrency

- The collector spawns no goroutines and holds no mutable state — two `*prometheus.Desc` fields and one `*Registry` pointer set at construction.
- `Collect` calls `Counts()` once per scrape. `Counts()` takes the registry's RLock; the collector itself is lock-free. Concurrent scrapes are serialised inside `prometheus.Registry.Gather()`.
- `TestConnectionsMetrics_RaceFreedom` runs 16 mutator goroutines (claim / register / schedule-release / reclaim / unregister) against a tight-loop scraper under `-race`; the race detector's verdict is the structural assertion.

## Related

- [Metrics registry (scaffolding)](metrics-registry.md) — the seam this collector plugs into (#59).
- [ADR-0008: Prometheus client adoption](../decisions/0008-prometheus-client-adoption.md) — scope-of-use rules (no `DefaultRegisterer`, no process/runtime collectors).
- [ADR-0006: Grace window IS the reclaim path](../decisions/0006-grace-period-as-reclaim-path.md) — pointer-identity guard the pull-based design rides for free.
- [Connection registry](connection-registry.md) — `Counts()` is the live count source the gauges read.
- [`/healthz` JSON endpoint](healthz.md) — one-shot snapshot of the same two numbers (the gauges are the time-series companion).
- [Threat model](../../threat-model.md) — log-hygiene rule that extends to metric labels.
6 changes: 4 additions & 2 deletions docs/knowledge/features/metrics-registry.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,9 @@

The relay holds a private `*prometheus.Registry` and serves it over a Prometheus-format `/metrics` handler. The registry is constructed per process; sibling tickets register counters / gauges / histograms against it. `prometheus.DefaultRegisterer` is never touched.

This page documents the scaffolding only — what the seam is, why it has the shape it does, and where the rules are recorded. The first counter / gauge lands in sibling tickets; the `/metrics` listener wiring lives in #60.
This page documents the scaffolding only — what the seam is, why it has the shape it does, and where the rules are recorded. The `/metrics` listener wiring lives in #60.

The first collector to plug into the seam landed in #61: see [Connection-count gauges](connection-count-gauges.md) for the pull-based `connectionsCollector` exposing `pyrycode_relay_connected_{binaries,phones}` over `Registry.Counts()`. Sibling counter/histogram tickets (#57, #58) follow.

## API

Expand Down Expand Up @@ -77,7 +79,7 @@ Structurally out of bounds:

## What this deliberately does NOT do

- No counters, no gauges, no histograms — siblings (#57, #58, future) own those.
- No counters, no histograms here — siblings (#57, #58, future) own those. The first gauge collector (`connectionsCollector`) landed in #61 as a pull-based reader of `Registry.Counts()`; see [Connection-count gauges](connection-count-gauges.md).
- No listener — `/metrics` is not bound to a `mux` in this ticket. #60 owns the listener (flag, bind-address validation, `http.Server` timeouts).
- No process / Go-runtime collectors.
- No `ErrorLog` wiring from `promhttp` into `slog`.
Expand Down
Loading