Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). No counters/gauges, no listener — siblings (#57, #58) and #60 fill those in. Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59).
- [Docker image](features/docker-image.md) — portable OCI artifact: multi-stage `Dockerfile` builds a fully-static binary (`CGO_ENABLED=0`, `-trimpath -s -w`) into `distroless/static-debian12:nonroot`; both base images digest-pinned with `# Tracks:` comments; exposes `:80`/`:443` and declares `/var/lib/relay/autocert` volume; host-specific wiring (TLS policy, ports, volumes, healthcheck) is #38's problem (#32).
- [Binary-side frame forwarder](features/binary-forwarder.md) — per-binary read pump: unwraps each inbound routing envelope, linear-scans `PhonesFor(serverID)` for `env.ConnID`, writes `env.Frame` verbatim to that phone; opaque inner bytes; synchronous (handler discards the return); diverges from #25 in error policy — unknown `conn_id`, malformed envelope, phone `Send` error all log+continue (a single bad frame never tears down the binary); replaced `/v1/server`'s `CloseRead` placeholder (#26).
- [WebSocket heartbeat](features/heartbeat.md) — per-conn goroutine on both endpoints sends RFC 6455 ping every 30s; closes with `1011 "heartbeat timeout"` if no pong within 30s. Detects half-open TCP within 60s; ctx-cancel exit path leaves close to the handler defer (#7).
Expand All @@ -18,6 +19,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Decisions

- [ADR-0008: Adopt `github.com/prometheus/client_golang` for relay metrics](decisions/0008-prometheus-client-adoption.md) — first non-stdlib direct dep since #15 (`nhooyr.io/websocket`); alternatives (hand-rolled text format, `VictoriaMetrics/metrics`, `expvar`+sidecar) rejected; transitive set enumerated and pinned via `go.sum`; scope of use bans `DefaultRegisterer`, process/runtime collectors, `slog` adapter, Prometheus router/config helpers; pattern established: ADR-before-import for any next direct dep.
- [ADR-0007: `WSConn.CloseWithCode` for active-conn application close codes](decisions/0007-wsconn-closewithcode-for-active-conn.md) — extends ADR-0005 for the post-claim window: heartbeat (#7) needs `1011 "heartbeat timeout"` on a live WSConn; `Close()` delegates to `CloseWithCode(StatusNormalClosure, "")`, both share `closeOnce`.
- [ADR-0006: Grace window IS the reclaim path](decisions/0006-grace-period-as-reclaim-path.md) — `ClaimServer` during a pending grace timer succeeds (not `4409`); pointer-identity wrapper defends against stale `time.AfterFunc` fires after `Stop()`.
- [ADR-0005: Application WS close codes go through the underlying `*websocket.Conn`](decisions/0005-application-close-codes-via-underlying-conn.md) — `WSConn.Close()` stays minimal (`StatusNormalClosure`); handlers emit `4409` / `4401` / `4404` directly on the `*websocket.Conn` in stillborn-WSConn paths.
Expand Down
56 changes: 56 additions & 0 deletions docs/knowledge/codebase/59.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# Ticket #59 — Prometheus metrics registry scaffolding

Stands up a private `*prometheus.Registry` and an `http.Handler` factory wrapping `promhttp.HandlerFor`. First non-stdlib direct dep since #15 (`nhooyr.io/websocket`); ADR-0008 is the justification. Scaffolding only — no counters, no gauges, no listener. The seam exists so sibling tickets (#57, #58, …) attach metrics and #60 wires the listener without re-touching `metrics.go`.

## Implementation

- **`internal/relay/metrics.go`** — two exports:
- `NewMetricsRegistry() *prometheus.Registry` — thin wrapper over `prometheus.NewRegistry()`. Fresh instance per call, no singleton.
- `NewMetricsHandler(reg *prometheus.Registry) http.Handler` — `promhttp.HandlerFor(reg, HandlerOpts{Registry: reg})`. `HandlerOpts.Registry: reg` is load-bearing: it routes `promhttp`'s self-instrumentation (`promhttp_metric_handler_*`) onto the private registry instead of `DefaultRegisterer`, so the ADR-0008 § Scope of use rule holds even for the library's internals. Pattern mirrors `NewHealthzHandler` (#10).
- **`internal/relay/metrics_test.go`** — three tests:
- `TestMetricsHandler_EmptyRegistry_ResponseShape` — AC gate. Status 200, `Content-Type` prefix `text/plain; version=0.0.4; charset=utf-8`, body decodes through `expfmt.NewDecoder` to `io.EOF`. Substring (not equality) match on content-type because newer `prometheus/common` appends `; escaping=<scheme>` — additive, backward-compatible.
- `TestMetricsHandler_RegisterAndScrape_RoundTrip` — registers a trivial counter, increments to 2, asserts `relay_test_counter_total 2` in the body. Catches the cheapest possible regression where a future `client_golang` bump breaks the `Gatherer` contract for registered metrics.
- `TestMetricsRegistry_NoGlobalRegistrarLeak` — snapshots `prometheus.DefaultGatherer.Gather()` `len(mfs)` before and after `NewMetricsHandler(NewMetricsRegistry())`. Asserts delta = 0. Stable across version bumps (asserts delta, not absolute count). Structural defence for ADR-0008 against a contributor reaching for `promauto`'s default-registry shortcuts.
- **`docs/knowledge/decisions/0008-prometheus-client-adoption.md`** — new ADR. Alternatives considered (hand-rolled text format; `VictoriaMetrics/metrics`; `expvar` + exporter sidecar), supply-chain notes (transitives: `beorn7/perks`, `cespare/xxhash/v2`, `munnerz/goautoneg`, `prometheus/client_model`, `prometheus/common`, `prometheus/procfs`, `google.golang.org/protobuf`), scope-of-use boundary (no `DefaultRegisterer`, no process/runtime collectors, no `slog` adapter, no Prometheus router/config helpers).
- **`go.mod` / `go.sum`** — first new direct dep since `nhooyr.io/websocket` in #15. `make vet`, `make test -race`, `make build` clean.
- **No changes to `cmd/pyrycode-relay/main.go`** — the listener (and any mux registration) lives in #60. This ticket is scaffolding by design; both exports are dead code at merge time, brought to life by siblings.

## Patterns established

- **Per-concern collector struct, constructed by a sibling-ticket helper taking `prometheus.Registerer`.** Each sibling adds a private `*xxxMetrics` type in its own file (e.g. `metrics_upgrade.go`); the wiring site (`main.go`, #60) holds N collectors as locals. Adding a metric type is a greenfield file — no edits to `metrics.go` or to other siblings' files. Eliminates the merge-conflict surface that a growing mega-struct or package-level `var`s would create between concurrently-architected siblings (#57 and #58 were in flight at the same time).
- **Handler-factory rule extends.** `NewMetricsHandler` returns `http.Handler` (not `http.HandlerFunc`), holds no per-request state, exposes no fields — same contract as `NewHealthzHandler` (#10).
- **Per-call registry construction.** `NewMetricsRegistry()` returns a fresh registry on each call; the process holds exactly one in production but tests construct as many as they need. Matches `Registry` (#3, ADR-0003) — tests aren't forced to share global state.
- **ADR-before-import for new direct deps.** The dispatcher's documentation agent cross-links the ADR from `INDEX.md` under *Decisions*. The pattern is the new bar for any next direct dep — ADR file lands before the `go.mod` edit, justification or not, never a silent first-import.

## Seam-shape pick rationale (recorded in the spec, not in `metrics.go`)

The spec (`docs/specs/architecture/59-metrics-scaffolding.md`) is the place sibling architects read first for the registration shape. Three options were considered:

1. **Package-level `var`s.** Rejected. Package-init can only register on a package-level singleton, which forces `NewMetricsRegistry` into a singleton and breaks per-test isolation.
2. **One mega-struct (`type Metrics struct { … }`) growing fields per sibling.** Rejected. Concurrent sibling PRs all touch the same struct — guaranteed merge conflict.
3. **Per-concern collector struct (adopted).** Each sibling owns a private type in its own file; `metrics.go` stays untouched after #59.

The seam exposed (`NewMetricsRegistry`, `NewMetricsHandler`) admits the pattern without naming it. The spec is the contract.

## Out of scope (delegated)

- Counters / gauges / histograms — sibling tickets (#57 upgrade-flood rejects, #58 register failures, future throughput / grace-expiry).
- `/metrics` listener, bind-address validation, flag plumbing, TLS for the metrics port — #60.
- Process / Go-runtime collectors (`collectors.NewGoCollector`, `NewProcessCollector`).
- `ErrorLog` wiring from `promhttp` into `slog`.
- OpenMetrics content negotiation (left disabled by default).
- Metric-label exfiltration rules (sibling specs enforce; metric labels carry the same MUST-NOT-emit posture as `slog` values per threat-model § Log hygiene).

## Lessons learned

- **The `HandlerOpts.Registry` field is load-bearing, not optional.** Leaving it `nil` lands `promhttp`'s self-instrumentation counters on `DefaultRegisterer` — silently breaking the no-default-registry boundary even with otherwise correct user code. The structural test (`TestMetricsRegistry_NoGlobalRegistrarLeak`) catches the omission; the docstring on `NewMetricsHandler` explains why.
- **Substring-match the Prometheus content-type literal, don't equality-match.** `prometheus/common` appended `; escaping=<scheme>` to the text-format media type in a recent release (additive, backward-compatible per the format spec). An equality assertion against `text/plain; version=0.0.4; charset=utf-8` would flake on the next `go.mod` bump. The spec's intent — "AC literal is the canary" — is preserved by checking `strings.HasPrefix(got, textFormatPrefix)`: a release that renamed the version or dropped the charset would still fail loudly.
- **Snapshot-delta is the right shape for global-state tests.** `TestMetricsRegistry_NoGlobalRegistrarLeak` asserts `before == after` on `DefaultGatherer.Gather()` length, not `after == 0`. Stable across `client_golang` bumps that pre-register their own internals on the default gatherer. Generalises: when defending an "X should not change global state Y" invariant, snapshot Y, run X, compare — don't pin the absolute value.

## Cross-links

- [Metrics registry feature](../features/metrics-registry.md) — evergreen seam + usage doc.
- [ADR-0008: Prometheus client adoption](../decisions/0008-prometheus-client-adoption.md) — dep justification, alternatives, scope-of-use rules.
- [`/healthz` handler](../features/healthz.md) — sibling factory-shaped handler; `NewMetricsHandler` mirrors its contract.
- [Threat model § Log hygiene](../../threat-model.md) — MUST-NOT-emit list that propagates to metric label values via sibling specs.
- [Architect spec](../../specs/architecture/59-metrics-scaffolding.md) — full design + security review.
163 changes: 163 additions & 0 deletions docs/knowledge/decisions/0008-prometheus-client-adoption.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,163 @@
# ADR-0008: Adopt `github.com/prometheus/client_golang` for relay metrics

**Status:** Accepted (#59)
**Date:** 2026-05-11

## Context

`/healthz` (#10) is a point-in-time snapshot: it answers "is the relay alive
and how many things are connected right now." Production diagnosis of the
adversarial / capacity regimes the relay actually faces needs cumulative
counters — upgrade-attempt floods, header-validation rejects, register-failure
rates, frame throughput, grace-expiry rates. Those are not derivable from
snapshots without a scrape-and-diff harness; they want a real time-series
surface.

The downstream consumer is a Prometheus scraper. The chosen exposition format
must be one a stock Prometheus server accepts on a `/metrics` endpoint, with
no shim component between relay and scraper.

This is the relay's first non-stdlib direct dep since #15
(`nhooyr.io/websocket`). `docs/PROJECT-MEMORY.md` line 37 fixes the rule:
"Stdlib + `golang.org/x/crypto` + `nhooyr.io/websocket`. Direct deps arrived
in #9 and #15. Keep the dependency surface deliberate; new deps need a
justification (the ADR is the justification)." This file is that
justification.

## Alternatives considered

### 1. Hand-rolled Prometheus text exposition

The text format is documented and small. Counters are easy; gauges are easy.
Histograms and summaries (`_bucket{le=…}`, `_count`, `_sum` triplet with
ordered buckets and `+Inf` terminator) are not — escapability, label sorting,
quantile encoding, and OpenMetrics interop are each their own footgun. The
relay does not need histograms today, but the metrics surface will grow into
them (frame size distributions, grace-window latencies). Hand-rolling now
means hand-rolling the histogram code later, in a security-sensitive package,
to save one dep.

Rejected. The maintenance cost lands on the relay forever; the dep cost
is paid once and is bounded.

### 2. `github.com/VictoriaMetrics/metrics`

Drop-in alternative to `client_golang`. Smaller and fewer transitive deps —
genuinely attractive on the supply-chain axis. Rejected because:

- The ecosystem around `client_golang` (`testutil`, `promhttp`,
`expfmt`, `collectors`) is the lingua franca; rejecting it means
every future contributor reaches for the wrong package by reflex.
- VictoriaMetrics' library does not target full Prometheus parity on
exposition format edge cases (it ships a deliberate subset). The
relay does not gain anything from the subset.

### 3. `expvar` (stdlib) + exporter sidecar

`expvar` ships JSON, not text format. To feed a Prometheus scraper you bridge
through an `expvar` exporter sidecar. That is one more process to deploy and
one more failure mode to monitor (the sidecar's health is now also a
production concern). The relay is supposed to be a single binary in a
distroless container; adding an exposition sidecar undermines the deployment
posture (#32).

Rejected.

## Decision

Adopt `github.com/prometheus/client_golang` v1.x (the current stable line) as
a direct dep. Use:

- `prometheus.NewRegistry()` for the per-process registry.
- `prometheus.NewCounter` / `NewCounterVec` / `NewGauge` / `NewGaugeVec` /
`NewHistogramVec` for metric types as sibling tickets need them.
- `promhttp.HandlerFor(reg, opts)` for the exposition handler.
- `prometheus/testutil` if and when a sibling ticket needs to assert
counter increments under test.

## Scope of use

In bounds (`internal/relay/` and the listener's wiring in #60):

- Construction of the relay's private registry (`internal/relay/metrics.go`).
- Construction of counters / gauges / histograms by sibling tickets.
- Handler factory wrapping `promhttp.HandlerFor`.
- Test-time use of `prometheus/testutil` for metric assertions.

Out of bounds, structurally:

- **`prometheus.DefaultRegisterer` is never touched.** The relay's registry
is constructed with `prometheus.NewRegistry()` and held privately. A
developer who reaches for `promauto`'s default-registry shortcuts is
reaching outside the design; code review and the spec's docstring on
`NewMetricsRegistry` are the gates.
- **Process / Go-runtime collectors** (`collectors.NewGoCollector`,
`collectors.NewProcessCollector`) are **not** registered in this ticket.
They are useful but they expand the exposed surface (memory layout,
GC pacing, file descriptor counts, command line via `process_*`)
and the listener in #60 will be reachable from a more sensitive
scrape position than `/healthz`. If a future ticket adds them, it
registers them on the same private registry — never on the global default.
- **Structured logging.** The relay continues to use `slog` for logs. No
log adapter from `client_golang` is wired in. The "Log hygiene"
threat-model rule (#36's allowlist) does not extend to metric labels;
metric labels are an independent low-cardinality channel and are
governed by the threat-model § *Log hygiene* MUST-NOT-emit list
applied to label values when sibling tickets land.
- **HTTP routing.** stdlib `net/http`, not any Prometheus router.
- **Config parsing.** stdlib `flag`, not any Prometheus config helper.

## Supply-chain notes

Direct dep: `github.com/prometheus/client_golang` v1.x.

Transitive deps the module graph will pick up (verified against
`go.sum` of comparable projects on 2026-05-11; the developer's `go mod tidy`
output is the authoritative version of this list at commit time):

- `github.com/beorn7/perks` — histogram quantile estimation.
- `github.com/cespare/xxhash/v2` — fast non-cryptographic hash for
label-set keying.
- `github.com/munnerz/goautoneg` — content-type negotiation for
`Accept` headers on `/metrics`.
- `github.com/prometheus/client_model` — generated protobuf for the
Prometheus exposition model.
- `github.com/prometheus/common` — shared types across the Prometheus
Go ecosystem; pulls `expfmt` (text/OpenMetrics encoders).
- `github.com/prometheus/procfs` — only used by process / runtime
collectors. We do not register those (see *Scope of use*), but the
module is transitively present.
- `google.golang.org/protobuf` — required by `client_model`.

Maintenance status: actively maintained, hosted under the Prometheus
organisation, deployed at production scale across the CNCF graduated
ecosystem. The release cadence (multiple minor releases per year) is on
the slow-and-steady side compared to web frameworks.

Vetting posture: `go.sum` digests pin each transitive at the resolved
version. Renovate (or whatever bumper lands) is responsible for keeping
the line current. `govulncheck` (already in `make lint`) covers the
expanded surface.

## Consequences

- **Dependency surface grows.** The `go.mod` `require` block gains one
direct entry; transitives (~7 modules) are recorded in `go.sum`. This is
a known, deliberate expansion, signed off in this ADR.
- **`govulncheck` and `gosec` now scan a wider tree.** Failures in
`prometheus/*` modules land as `make lint` failures in CI; the
remediation is the same as for any dep — bump within the line, or
open an upstream issue if the bump is unavailable.
- **`make build` artifact size grows.** Indicative figure (from comparable
Go services adopting `client_golang` for the first time): the static
binary gains roughly 1–2 MB. Acceptable for the relay's distroless
deployment posture (#32); the artifact is still well under the
registry's practical ceiling.
- **The "stdlib + `x/crypto` + `nhooyr.io/websocket`" line in
`docs/PROJECT-MEMORY.md` is now stale.** Updating that line is a
documentation-phase concern, not the architect's; the documentation
agent will edit it when the codebase summary for #59 lands.
- **The pattern this establishes for any next dep:** an ADR file, before
the `go.mod` edit. The dispatcher's documentation agent then
cross-links it from `docs/knowledge/INDEX.md` under *Decisions*. No
silent first-import.
Loading