Skip to content

Commit c46e90d

Browse files
authored
Merge pull request #88 from pyrycode/feature/58
feat(relay): frame-forwarded and grace-expiry counters (#58)
2 parents e0e21b2 + 6260e71 commit c46e90d

10 files changed

Lines changed: 1216 additions & 0 deletions

File tree

cmd/pyrycode-relay/main.go

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -100,6 +100,8 @@ func main() {
100100

101101
metricsReg := relay.NewMetricsRegistry()
102102
relay.NewConnectionsMetrics(metricsReg, reg)
103+
relay.NewForwardMetrics(metricsReg, reg)
104+
relay.NewGraceMetrics(metricsReg, reg)
103105

104106
metricsMux := http.NewServeMux()
105107
metricsMux.Handle("/metrics", relay.NewMetricsHandler(metricsReg))

docs/knowledge/INDEX.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o
1010
- [Linux capability allowlist (boot-time refusal)](features/capability-allowlist.md) — relay parses `/proc/self/status`'s `CapEff:` hex mask at boot and refuses to start (exit 2) if any bit is set outside `AllowedCapabilities` (currently `{CAP_NET_BIND_SERVICE}` only, motivated by autocert binding `:80`/`:443` from uid 65532 in the distroless image). Exported sentinel `ErrUnexpectedCapability` is branchable via `errors.Is`; the wrapped error names every offending bit symbolically (`CAP_SYS_ADMIN (bit 21)` or `bit 63` for unknown), lists the allowlist contents, and embeds the operator fix string. `CapEff` only — `CapPrm/CapBnd/CapInh` would broaden false-positives (legitimate K8s default policy grants wide CapBnd) without adding load-bearing protection (relay never `capset(2)`s). Linux/non-Linux split at compile time via the new `_<goos>.go` / `_other.go` build-tag convention (see ADR-0009); non-Linux GOOS logs one skip line and returns nil. Unconditional — no production-mode gating, no env-var bypass, because stray capabilities are never legitimate. Reader-boundary test seam (`func() (string, error)`) exercises the parse + mask check end-to-end without touching real `/proc`. Joins the boot-time-refusal sentinel family (#9, #77, #79; future #78) (#79).
1111
- [Production-mode contract & startup refusals](features/production-mode.md) — `PYRYCODE_RELAY_PRODUCTION=1` env-var contract (exact-string match, lazy read via injected getter, mirrors `PYRYCODE_RELAY_SINGLE_INSTANCE` shape from #64/#65) plus the boot-time checks that consume it. **#77** introduced `relay.CheckInsecureListenInProduction` + exported `ErrInsecureListenInProduction` sentinel (branchable via `errors.Is`) firing when production mode is on AND `--insecure-listen` is set. **#78** added the second consumer: `relay.CheckRunningAsRoot(geteuid, getenv)` + exported `ErrRunningAsRoot` sentinel firing when production mode is on AND `syscall.Geteuid() == 0`, closing the deploy-time gap (`docker run --user 0`, `securityContext.runAsUser: 0`, hand-edited Dockerfile dropping `USER`) that escapes the CI non-root-build contract (#32 Dockerfile, #68 Trivy). Both wired in `cmd/pyrycode-relay/main.go` after flag-parse with `os.Exit(2)` (config-rejected-at-boot, distinct from runtime-failure exit 1) and structured log fields: `env_var` carries the name only (never the value, even though `effective_uid` carries the kernel-supplied int — log-injection structurally impossible), one-line `fix` listing valid resolutions. `IsProductionMode` exported so siblings compose on the same predicate without re-reading the env var. Test seams: `func(string) string` for env, `func() int` for euid — both the smallest possible (no interface, no struct, no package-level var) and the only way to exercise the uid-0 branch in a unit test without re-execing the test binary as root. Two instances of the shape (#77, #78) now codify the "sibling boot-check" pattern; `Config.Validate()` consolidation deferred until ~5 checks exist (#77, #78).
1212
- [Fly.io deploy](features/fly-deploy.md) — production host wiring: `fly.toml` declares TCP-passthrough on `:80`/`:443` (no Fly HTTP proxy, no Fly-managed certs) so TLS keeps terminating in the relay via autocert (#9), persistent Fly volume `relay_autocert` mounted at `/var/lib/relay/autocert`, and a single-machine hard cap encoded via `min_machines_running=1` + `auto_start_machines=false` + `auto_stop_machines="off"` + `[deploy] strategy="immediate"` (Fly Apps v2 has no `max_machines` key; the in-binary `PYRYCODE_RELAY_SINGLE_INSTANCE` self-check from #65 is the backstop). CI `deploy` job in `.github/workflows/ci.yml` runs `flyctl deploy --remote-only` on push to `main`, gated by branch-condition + `needs: [test, security, image-scan]` + `permissions: contents: read` so `FLY_API_TOKEN` is structurally unreachable from PR code; `superfly/flyctl-actions/setup-flyctl` pinned by commit SHA with `# Tracks:` comment (same convention as #68 / #41). Dedicated IPv4 is required (not optional) for autocert's HTTP-01 challenge; TCP passthrough preserves the real socket peer IP that #34's rate limiter reads. `__REGION__` / `__DOMAIN__` ship as placeholders that fail loud on first deploy (#38).
13+
- [Frame-forwarded and grace-expiry counters](features/frame-and-grace-counters.md) — three Prometheus counters wired through nil-safe `func()` hooks on `*Registry` so neither `registry.go` nor `forward.go` imports `prometheus` (ADR-0008 *Scope of use*): `pyrycode_relay_frames_forwarded_total{direction}` with `direction` hard-coded to `{phone_to_binary, binary_to_phone}` (cardinality 2; neither value is attacker-influenced) Inc'd by `StartPhoneForwarder` / `StartBinaryForwarder` exactly once per successful sink `Send` — structurally below every `continue` / error-return branch so the four miss-paths (no binary, marshal err, unknown conn_id, sink Send err) cannot over-count; and the scalar `pyrycode_relay_grace_expiries_total` Inc'd by `handleGraceExpiry` as the LAST statement of the success branch (after `r.mu.Unlock()` and the phone-close loop) — the existing pointer-identity guard's stale-fire `return` exits before the hook is reached, so stale fires never increment by construction (same structural defence #61's gauges ride for free). Three private fields on the `Registry` struct (`onPhoneForwarded`, `onBinaryForwarded`, `onGraceExpiry`) plus two exported setters (`SetForwarderHooks`, `SetGraceExpiryHook`) wired at boot from `cmd/pyrycode-relay/main.go` immediately after `NewConnectionsMetrics`; hooks are nil-safe so existing tests (which never wire metrics) see no behaviour change. Per-concern collector files `metrics_forward.go` (~45 lines) and `metrics_grace.go` (~33 lines) pre-bind `CounterVec` children via `WithLabelValues` once in the constructor and pass `Counter.Inc` method values directly into the setters — hot path is a single atomic Inc, no map lookup per frame, no anonymous-closure wrapping. Rejected alternatives: parameter threading (≥10 call-site cascade across `ClientHandler` / `ServerHandler` / `Start*Forwarder` signatures, right at the edit-fan-out red line) and package-level vars (breaks per-test isolation under `-race`, already rejected by #59's scaffolding spec). Hook field writes are boot-only, doc-comment-enforced; reads are race-free per the Go memory model's "goroutine creation happens-before goroutine body" rule; race-tested under `-race` via 16 goroutines × 200 ops in `metrics_counters_test.go`. Hooks MUST NOT acquire `r.mu` (deadlock risk if a future contributor adds a registry-mutating hook); the concrete bodies are pure `Counter.Inc` calls so the constraint holds today. Send-duration histogram (`pyrycode_relay_send_duration_seconds`) explicitly deferred — hot-path `time.Now()` × 2 + bucket bookkeeping warrants its own architect pass (#58).
1314
- [Connection-count gauges](features/connection-count-gauges.md)`pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones` exposed via a pull-based `prometheus.Collector` reading `Registry.Counts()` on each scrape; zero edits to `registry.go`; scalar (no labels) by design — `{server="..."}` would carry the attacker-influenced `x-pyrycode-server` header onto the metrics surface, which threat-model § Log hygiene forbids; stale grace-expiry fires can't move the gauge because the pointer-identity guard (ADR-0006) keeps the maps unchanged and the gauge IS the map size; race-tested against 16 mutator goroutines + a tight-loop scraper under `-race`. First collector wired into the #59 seam (#61).
1415
- [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars) — first instantiated by #61's `connectionsCollector`. Listener landed in #60 (see [Metrics listener](features/metrics-listener.md)). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59).
1516
- [Docker image](features/docker-image.md) — portable OCI artifact: multi-stage `Dockerfile` builds a fully-static binary (`CGO_ENABLED=0`, `-trimpath -s -w`) into `distroless/static-debian12:nonroot`; both base images digest-pinned with `# Tracks:` comments; exposes `:80`/`:443` and declares `/var/lib/relay/autocert` volume; host-specific wiring (TLS policy, ports, volumes, healthcheck) is #38's problem (#32). PR-time Trivy CVE scan against the just-built image lives in CI as the `image-scan` job, fails on **fixable** CRITICAL/HIGH only (`ignore-unfixed: true`), action pinned by commit SHA with `# Tracks: <tag>` comment mirroring the Dockerfile pin convention; intentional overlap with `govulncheck` (source-reachability vs. shipped-artifact) (#68). Both scanners are also re-run daily against `main` via `.github/workflows/security-scan.yml` (cron + `workflow_dispatch`) so disclosed CVEs against unchanged deps surface within ≤24h rather than staying invisible until the next bump (#72); a red cron run also opens a `security-sensitive`-labelled GitHub issue via the workflow's `file-issue` job (artifact-handoff privilege split keeps `issues: write` off the scanners and out of workflow scope; deterministic-title dedup via `gh issue list --search 'in:title …'`) so regressions land as tracked work-items rather than passive Actions rows (#73).

0 commit comments

Comments
 (0)