Skip to content

feat(relay): upgrade-attempt and register-failure counters (#57)#90

Merged
ilmoniemi merged 3 commits into
mainfrom
feature/57
May 13, 2026
Merged

feat(relay): upgrade-attempt and register-failure counters (#57)#90
ilmoniemi merged 3 commits into
mainfrom
feature/57

Conversation

@ilmoniemi
Copy link
Copy Markdown
Contributor

What

Adds two Prometheus CounterVecs at the WebSocket-upgrade call sites:

  • pyrycode_relay_ws_upgrade_attempts_total{endpoint, outcome} — 2 endpoints × 6 outcomes, 12 cells pre-bound (9 reachable, 3 structurally unreachable exposed at 0 for a stable scrape shape).
  • pyrycode_relay_register_failures_total{kind} — 4 cells, co-incremented in lockstep with their paired outcome=reject_* cells via composite event-named methods.

Total: 16 series, label cardinality fixed and budgeted.

Issue

Closes #57. Slice of the #56 metrics rollout (after #59 scaffolding, #61 gauges, #60 listener, #58 frame/grace counters).

Testing

  • internal/relay/metrics_upgrade_test.go — five tests:
    • TestUpgradeMetrics_ServerEndpoint_TerminalPaths table-drives the four server terminals (accept, reject_headers, reject_409, reject_rate_limit) and asserts the exercised cell at 1 plus every other endpoint=server cell at 0 (no-double-increment proof).
    • TestUpgradeMetrics_ClientEndpoint_TerminalPaths analogous over the five client terminals (accept, reject_headers, reject_404, reject_429, reject_rate_limit).
    • TestUpgradeMetrics_RegisterFailures_CoIncrement drives each composite method once and pins the "sum on kind equals sum on the matching outcome" invariant.
    • TestUpgradeMetrics_AllSixteenSeries_Exposed advances every reachable method, then asserts all 16 series exist in the scrape (9 reachable at 1, 3 unreachable at 0, 4 failure kinds at their composite totals).
    • TestUpgradeMetrics_NoGlobalRegistrarLeak pins ADR-0008 § Scope of use locally.
  • Seven mechanical , nil appends in existing handler tests, no behaviour change.
  • go test -race ./..., go vet ./..., go build ./... all clean.

Architecture compliance

  • Follows the arch spec at docs/specs/architecture/57-upgrade-and-register-failure-counters.md verbatim:
    • Status-code observer wrapper OUTSIDE the rate-limit middleware (chosen over a callback in NewRateLimitMiddleware because the latter would cross the 11-call-site edit-fan-out red line).
    • Nil-safe *UpgradeMetrics parameter on ServerHandler / ClientHandler (7 mechanical test-side appends, under the 10-call-site red line).
    • Composite event methods make the register-failure ↔ upgrade-outcome lockstep a property of the method, not of the caller.
    • All label values are hard-coded constants; never derived from attacker-influenced input (security-review § Tokens).
    • statusObserverResponseWriter implements http.Hijacker by delegation so websocket.Accept downstream can still upgrade on the success branch.
  • Codebase summary in docs/knowledge/codebase/57.md. docs/knowledge/INDEX.md left to the documentation phase per the developer-agent contract.

🤖 Generated with Claude Code

ilmoniemi and others added 2 commits May 13, 2026 11:00
Adds two Prometheus CounterVecs to the WebSocket-upgrade call sites so
operators can distinguish flood patterns (malformed headers, 4409
conflicts, 4404/4429 phone failures, rate-limit denies) without
log-grepping.

- internal/relay/metrics_upgrade.go: new UpgradeMetrics type, nine event-
  named nil-safe methods, two WrapXxxRateLimitDeny status-code observers.
  All 16 label cells pre-bound; 3 unreachable cells exposed at 0 for a
  stable scrape shape.
- internal/relay/server_endpoint.go: ServerHandler gains a *UpgradeMetrics
  parameter; three increment sites (header-reject, 4409 conflict, accept).
- internal/relay/client_endpoint.go: ClientHandler gains a *UpgradeMetrics
  parameter; four increment sites (header-reject, 4404, 4429, accept).
- cmd/pyrycode-relay/main.go: constructs UpgradeMetrics, stacks
  WrapServerRateLimitDeny / WrapClientRateLimitDeny OUTSIDE the rate-
  limit middleware so the deny observer sees the 429.
- internal/relay/metrics_upgrade_test.go: five tests covering every
  terminal path, the co-increment invariant, the full 16-series scrape,
  and the no-DefaultRegisterer-leak structural defence.
- Existing handler tests: seven mechanical `, nil` appends, no behaviour
  change.
- docs/knowledge/codebase/57.md: codebase summary.

make vet, make test -race, make build clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ilmoniemi
Copy link
Copy Markdown
Contributor Author

Code Review: #57

Decision: PASS

Findings

None blocking. Two notes:

  • [NIT] internal/relay/server_endpoint.go:90 and internal/relay/client_endpoint.go:91metrics.ServerAccept() / metrics.ClientAccept() fire before the logger.Info("server_claimed"/"phone_registered", …) line; the spec § Handler increment placement § server / client phrased the success-path placement as "after … logger.Info fires" (the conflict-path text contrasts with this by specifying "between the close-code write and the logger.Info line"). Functionally identical — both sites are post-claim / post-register, ahead of the defer block — so this isn't a finding worth churning a respin over; flagging only because the prose contrast is the kind of thing a future reader may stumble on.

Summary

Clean implementation of the architect's design. Two CounterVecs with the spec's 16-cell label surface, all pre-bound (the three unreachable cells materialised via _ = upgradeVec.WithLabelValues(...) to keep the scrape shape stable — verified by TestUpgradeMetrics_AllSixteenSeries_Exposed). Increments are structurally tight to each terminal branch; composite event-named methods bake the upgrade-outcome ↔ register-failure co-increment into the method contract so a caller cannot accidentally advance one without the other.

statusObserverResponseWriter correctly forwards http.Hijacker (load-bearing for the success path — websocket.Accept type-asserts the writer); records only the first WriteHeader code; no body buffering. The "rate-limit middleware is the only HTTP-429 source" claim holds — handlers write 400 / 101 / WS application close codes inside the upgrade, no 429.

Wire-up: cmd/pyrycode-relay/main.go constructs UpgradeMetrics against the existing metricsReg (ADR-0008 § Scope of use respected) and stacks WrapXxxRateLimitDeny OUTSIDE the rate-limit middleware so the observer sees the middleware's 429. Boot-time ordering correct.

Blast-radius check (codegraph_callers + grep): 9 prod+test call sites of ServerHandler / ClientHandler all updated (1 prod + 8 tests via mechanical , nil or , m). No missed sites.

Tests: five tests in metrics_upgrade_test.go. The closed-label inventory assertion helpers (assertServerOutcomes / assertClientOutcomes / assertFailureKinds) walk every cell and assert 0 on unstated keys — this is the no-double-increment proof, not just "the cell I exercised is 1." TestUpgradeMetrics_NoGlobalRegistrarLeak locally pins ADR-0008 § Scope of use alongside the package-level invariant. go test -race ./... -count=1 clean locally.

Security goggles (security-sensitive label):

  • Spec § Security review present, verdict PASS, all nine sections walked.
  • All three label values (endpoint, outcome, kind) are compile-time string constants set per event-named method — none derived from headers, paths, or any request state. Same load-bearing constraint as metrics_forward.go's direction.
  • No tokens logged. X-Pyrycode-Token remains unread/unlogged/uncompared on the client path.
  • No file / subprocess / crypto code. Network safe — observer doesn't buffer the body, only records one int.
  • /metrics continues to bind to the loopback metricsListen listener, not the internet-exposed listener. New counters are not reachable from outside.

go vet, go test -race, go build all clean.

…EX entry (#57)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ilmoniemi ilmoniemi merged commit b4e6315 into main May 13, 2026
2 of 3 checks passed
@ilmoniemi ilmoniemi deleted the feature/57 branch May 13, 2026 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

relay: /metrics — upgrade-attempt and register-failure counters

1 participant