relay: /metrics — frame-forwarded and grace-expiry counters #58

@ilmoniemi

Description

User Story

As a relay operator, I want time-series counters for forwarded frames (by direction) and grace-window expiries, so I can observe relay throughput and tell "binary disconnected and came back inside grace" from "binary disconnected and the slot was reclaimed" at scale.

Context

Third slice of the metrics rollout (split from #37). The metrics registry scaffolding (#59) ships the registry; #57 adds upgrade/register counters; this slice wires the forward-loop counter and the grace-expiry counter.

The frame counter increments on the hot path inside StartPhoneForwarder / StartBinaryForwarder, once per forwarded frame. Prometheus counter increments are atomic and cheap, but the architect should confirm that each increment sits immediately after a successful Send, following the forwarder error-policy pattern in PROJECT-MEMORY. The phone forwarder returns on a sink error, so its increment runs only after Send succeeds and is never reached for the failing frame. The binary forwarder continues past per-sink errors, so its increment runs once per successful Send, never once per loop iteration.
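A minimal sketch of the phone-to-binary placement, to make "increment only after a successful Send" concrete. Everything here is hypothetical: forwardPhoneToBinary, the sink interface, and the channel-based source are illustrative stand-ins, not the real StartPhoneForwarder signature. The counter interface is the one method of prometheus.Counter the sketch needs, backed by an atomic stand-in so the example runs with the standard library alone.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// counter is the subset of prometheus.Counter this sketch relies on.
type counter interface{ Inc() }

// atomicCounter is a stdlib stand-in for a real Prometheus counter.
type atomicCounter struct{ n int64 }

func (c *atomicCounter) Inc()         { atomic.AddInt64(&c.n, 1) }
func (c *atomicCounter) Value() int64 { return atomic.LoadInt64(&c.n) }

// sink is a hypothetical stand-in for the binary connection.
type sink interface{ Send(frame []byte) error }

// forwardPhoneToBinary drains frames until the source closes. A frame is
// counted only after Send succeeds; the first sink error ends the loop,
// so the failing frame is never counted.
func forwardPhoneToBinary(frames <-chan []byte, binary sink, forwarded counter) error {
	for frame := range frames {
		if err := binary.Send(frame); err != nil {
			return err // failed Send: return without incrementing
		}
		forwarded.Inc()
	}
	return nil
}

var errSink = errors.New("sink closed")

// failAfter is a fake sink that accepts `remaining` frames, then fails.
type failAfter struct{ remaining int }

func (s *failAfter) Send([]byte) error {
	if s.remaining == 0 {
		return errSink
	}
	s.remaining--
	return nil
}
```

Feeding five frames into a failAfter{remaining: 3} sink should leave the counter at 3: the fourth frame fails, the forwarder returns, and the fifth is never read.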

The grace-expiry counter fires inside the time.AfterFunc callback in Registry.ScheduleReleaseServer — increment after the stale-fire pointer-identity guard returns true (i.e. only on a real eviction, never on stale fires).
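The guard-then-increment ordering can be sketched as follows. This is a simplified model under stated assumptions, not the actual Registry code: the registry and conn types, the slot map, and the release helper are hypothetical, and the guard is assumed to compare the pointer captured when the timer was scheduled against the pointer currently held in the slot. An int64 updated with sync/atomic stands in for grace_expiries_total.

```go
package main

import (
	"sync"
	"sync/atomic"
	"time"
)

// conn is a hypothetical stand-in for a registered binary connection.
type conn struct{ id string }

// registry is a simplified, hypothetical model of the relay registry.
type registry struct {
	mu       sync.Mutex
	servers  map[string]*conn
	expiries int64 // stand-in for grace_expiries_total
}

// scheduleRelease arms the grace timer for the connection that just dropped.
func (r *registry) scheduleRelease(slot string, c *conn, grace time.Duration) *time.Timer {
	return time.AfterFunc(grace, func() { r.release(slot, c) })
}

// release is the timer callback body. It evicts only if the slot still holds
// the exact connection the timer was armed for; otherwise the fire is stale.
func (r *registry) release(slot string, c *conn) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.servers[slot] != c {
		return false // stale fire: slot was re-claimed or already freed
	}
	delete(r.servers, slot)
	atomic.AddInt64(&r.expiries, 1) // counted only on a real eviction
	return true
}

// expiryCount reads the counter stand-in.
func (r *registry) expiryCount() int64 { return atomic.LoadInt64(&r.expiries) }
```

Splitting the callback body into release lets a test drive stale and real fires directly, without sleeping on the timer.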

Acceptance Criteria

  • pyrycode_relay_frames_forwarded_total counter vector defined with label direction (phone_to_binary | binary_to_phone), registered against the metrics registry from #59.
  • pyrycode_relay_grace_expiries_total counter (no labels), registered against the same registry.
  • StartPhoneForwarder (internal/relay/forward.go) increments frames_forwarded_total{direction="phone_to_binary"} exactly once per successful binary.Send. No increment on BinaryFor miss, marshal error, or Send error.
  • StartBinaryForwarder (same file) increments frames_forwarded_total{direction="binary_to_phone"} exactly once per successful phone.Send. No increment on unknown conn_id, malformed envelope, or per-sink Send error (matches the N-sink continue-on-per-sink-error policy from PROJECT-MEMORY).
  • Registry.ScheduleReleaseServer increments grace_expiries_total exactly once per actual eviction, inside the pointer-identity guard's success branch — stale-fire no-ops do not increment.
  • Tests exercise each direction with a fake source and sink and assert the counter increment count matches the success count, not the read-loop iteration count.
  • A grace-expiry race test (matching the existing ScheduleReleaseServer race coverage) confirms the counter increments only on real evictions, never on stale fires.
  • make vet, make test -race, and make build clean.
  • docs/knowledge/codebase/<n>.md summary entry created; docs/knowledge/INDEX.md updated.

Technical Notes

  • A send-duration histogram (pyrycode_relay_send_duration_seconds, listed as "optional, expensive" in the original #37) is explicitly out of scope here. If it lands, it gets its own ticket: the cost/value tradeoff (two time.Now() calls per frame plus bucket bookkeeping in the hottest loop) warrants its own architect pass.
  • The architect should confirm that the increment site for the binary-side forwarder is per-phone.Send, not per-loop-iteration — the loop fans out to multiple phones per envelope, and we want the metric to reflect actual forwarded-to-phone frames, not envelopes consumed.
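The difference between counting envelopes and counting forwarded frames can be illustrated with a small fan-out sketch. It assumes, as the note above describes, that one envelope may be delivered to several phones; fanOut, phoneSink, and the fake sinks are hypothetical names, and fakeCounter stands in for the binary_to_phone child of the counter vector.

```go
package main

import "errors"

// counter is the subset of prometheus.Counter this sketch relies on.
type counter interface{ Inc() }

// fakeCounter is a single-goroutine stand-in for a real Prometheus counter.
type fakeCounter struct{ n int }

func (c *fakeCounter) Inc() { c.n++ }

// phoneSink is a hypothetical stand-in for one phone connection.
type phoneSink interface{ Send(frame []byte) error }

// fanOut delivers one envelope payload to every phone and returns the number
// of successful sends. A failing phone is skipped and the loop continues.
func fanOut(payload []byte, phones []phoneSink, forwarded counter) int {
	delivered := 0
	for _, p := range phones {
		if err := p.Send(payload); err != nil {
			continue // per-sink error: no increment, try the next phone
		}
		forwarded.Inc() // one increment per successful phone.Send
		delivered++
	}
	return delivered
}

// okSink always accepts the frame.
type okSink struct{}

func (okSink) Send([]byte) error { return nil }

// badSink always rejects the frame.
type badSink struct{}

func (badSink) Send([]byte) error { return errors.New("phone gone") }
```

With two healthy phones and one failing phone, two envelopes should move the counter by 4: not 2 (envelopes consumed) and not 6 (sends attempted).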

Size Estimate

XS

Split from #37. Depends on #59.

Metadata

Assignees

No one assigned

Labels

security-sensitive (touches auth, crypto, or internet-exposed input paths), size:xs (tiny ticket: <30 lines production code)
