feat(relay): graceful shutdown on SIGTERM — drain WS connections (#31) by ilmoniemi · Pull Request #91 · pyrycode/pyrycode-relay

ilmoniemi · 2026-05-13T08:26:21Z

What

On SIGTERM/SIGINT the relay now drains in-flight WebSocket connections to a clean close before exiting:

Listener is stopped (new /v1/server and /v1/client upgrades fail the handshake).
Every binary + phone WS conn registered at drain start gets CloseWithCode(StatusGoingAway, "shutting down").
Drain blocks up to 10s (default drainDeadline) for peers to acknowledge; conns still mid-handshake at the deadline are force-closed via http.Server.Close.
Both --insecure-listen and --domain (autocert) modes handle the signal identically.

Implements all five acceptance criteria from #31.

Issue

Closes #31.

Design notes

Registry.Snapshot() mirrors the existing PhonesFor pattern: snapshot under RLock, all close fan-out happens outside the lock. Returns one flat []Conn because the only consumer treats binaries and phones uniformly.
relay.Shutdown(ctx, logger, reg, servers...) in internal/relay/shutdown.go. Concurrent per-server http.Shutdown + concurrent per-conn CloseWithCode keep wall-clock cost bounded by the single-worst-case 5s close-handshake (per lessons.md), not linear in conn count. Returns nil on clean drain, ctx.Err() on deadline expiry.
gracefulCloser is a narrow interface internal to shutdown.go — *WSConn satisfies it without widening the registry's Conn interface or breaking existing fakeConn test mocks.
cmd/pyrycode-relay/main is split into main → run(args, sigCtx) so the e2e test can drive the shutdown path with a synthetic context cancellation instead of forking a real subprocess (architect-suggested shape). Signal-triggered drains exit 0; listener-error-triggered drains exit 1 (preserves today's os.Exit(1) semantics for process-supervisor restart).
drainDeadline = 10 * time.Second lives at the wiring site per the established #21 / #60 policy-values-in-main convention.

Testing

registry_test.go: TestSnapshot_EmptyRegistry, TestSnapshot_IncludesBinariesAndPhones, TestSnapshot_FreshSliceIsolation, TestSnapshot_RaceFreedom (16 goroutines × 200 ops under -race).
shutdown_test.go: clean drain, empty registry, deadline expiry (with a Close-blocking fake conn — asserts prompt return after srv.Close()), and idempotent close on a real *WSConn.
main_e2e_test.go: boots the relay on 127.0.0.1:0 via run(), opens a /v1/server WS, triggers shutdown, asserts the next Read surfaces websocket.CloseStatus == StatusGoingAway and run() returns 0 within the drain deadline. Companion test asserts the listener is no longer reachable after run returns.

go test -race ./...
go vet ./...
go build ./cmd/pyrycode-relay

all green.

Architecture compliance

Follows the spec at docs/specs/architecture/31-graceful-shutdown-on-sigterm.md:

✅ Registry.Snapshot() accessor with the documented signature, RLock + freshly-allocated slice, returns nil on empty.
✅ Shutdown(ctx, logger, reg, servers...) with the documented behavior (concurrent fan-out, ctx-first / done-first select, force-close on deadline, idempotency).
✅ main wiring: signal.NotifyContext(SIGTERM, SIGINT), per-server listener goroutines feeding a buffered listenerErr chan, signal-vs-error select, drainCtx with timeout, exit codes 0/1 per the architect's "Open questions" resolution.
✅ defer limiter.Close() now runs on the clean return path; updated comment points to relay: graceful shutdown on SIGTERM — drain WS connections before exit #31 instead of the stale relay: wire per-IP rate-limit middleware on /v1/server + /v1/client #47.

🤖 Generated with Claude Code

On SIGTERM or SIGINT the relay now drains in-flight WebSocket connections to a clean close before exiting: stop accepting new connections, emit StatusGoingAway (1001) on every live binary and phone conn registered at drain start, wait up to 10s for peers to acknowledge, then force-close. Both --insecure-listen and --domain (autocert) modes handle the signal identically. - Registry.Snapshot returns a freshly-allocated slice of every live Conn so the shutdown helper can fan close frames out without exposing internal maps (same shape as PhonesFor: copy under RLock, slow work outside the lock). - internal/relay/shutdown.go: concurrent per-server http.Shutdown fan-out + per-conn CloseWithCode fan-out (5s close-handshake gate caps wall-clock regardless of conn count). Force-closes via http.Server.Close on deadline expiry. Idempotency inherited from WSConn.closeOnce. - cmd/pyrycode-relay/main: factored into run(args, sigCtx) so the e2e test can drive the shutdown path without forking a real subprocess. Listener goroutines feed a buffered errors chan; signal-triggered drains exit 0, listener-error-triggered drains exit 1 (preserves today's os.Exit(1) semantics for the process-supervisor restart case).

ilmoniemi · 2026-05-13T08:31:19Z

Code Review: #31

Decision: PASS

Findings

[NIT] internal/relay/shutdown.go:45-56 — listener-close runs concurrently with reg.Snapshot(). The spec described step 1 (per-server srv.Shutdown goroutines) and step 2 (reg.Snapshot()) as if they were sequential ("listener is closed before the snapshot fires"), but in practice the goroutines may not have scheduled by the time Snapshot returns. A handshake that completes in that microsecond-wide window would register after the snapshot and miss the 1001 fan-out, degrading to a TCP-reset on process exit. The practical impact is negligible (goroutine scheduling is fast, the conn still terminates), but the strict reading of AC "listener stopped before drain begins" is not guaranteed. Not blocking — the architect signed off on this shape — but worth a follow-up if shutdown ever needs to be airtight.
[NIT] cmd/pyrycode-relay/main.go:42-46 — signal.NotifyContext's cancel function is discarded. defer cancel() would be tidier; in main it makes no practical difference since the process exits, but the convention is to release.
[NIT] internal/relay/shutdown_test.go:188-192 — comment on newTestServer says "blocks until the listener has actually accepted bind". net.Listen does synchronously bind the port, so the addr is reachable on return; the comment slightly oversells what it does (it doesn't wait for Serve to be running).
[NIT] cmd/pyrycode-relay/main_e2e_test.go:124-136 — freePort close/re-bind race is acknowledged in the comment. Fine for now; flag for revisit if these e2e tests start flaking.

CI status

The test job on the latest CI run failed on TestUpgradeMetrics_ServerEndpoint_TerminalPaths/reject_409 with reject_409=0 in the scraped metric. I cannot reproduce locally: go test -race ./... and go test -race -count=1 ./internal/relay/ -run TestUpgradeMetrics pass 5/5 times. The failing test is in metrics_upgrade_test.go (pre-existing), and the diff touches neither the upgrade pipeline nor the metrics package. This looks like a pre-existing timing race in the test itself: assertServerOutcomes scrapes after c2.Read returns the close error, but the metric increment in the server handler can land slightly after the close frame is on the wire. Recommend re-running CI before merging. Not a blocker on the review; flagging because the protocol requires CI green.

Summary

The implementation cleanly matches the architect's spec: Registry.Snapshot follows the established passive-registry / PhonesFor pattern; the Shutdown helper fans listener-close and per-conn CloseWithCode(StatusGoingAway, …) out concurrently with a unified WaitGroup, force-closes on deadline expiry, and the gracefulCloser type assertion is a nice touch for keeping fakeConn in the registry tests untouched. Wiring in main is uniform across both modes via runServers with the exit-code-1-on-listener-error / exit-code-0-on-signal split from the spec's open questions. Tests cover empty registry, full drain, deadline expiry with a stuck close, and an e2e through run() driving the actual signal-context path.

No MUST FIX or SHOULD FIX findings. Re-run CI; merge when green.

Add feature doc, codebase note, and INDEX entry for the SIGTERM/SIGINT drain path. Refresh metrics-listener / rate-limit-middleware / autocert-tls feature docs whose "no graceful shutdown" / "deferred to #31" lines are no longer accurate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ilmoniemi added 2 commits May 13, 2026 11:18

spec: graceful shutdown on SIGTERM — drain WS connections (#31)

9269aa6

ilmoniemi mentioned this pull request May 13, 2026

relay: startup self-check refuses to run as multi-instance deploy #65

Closed

5 tasks

ilmoniemi merged commit 891f7af into main May 13, 2026
2 of 3 checks passed

ilmoniemi deleted the feature/31 branch May 13, 2026 08:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(relay): graceful shutdown on SIGTERM — drain WS connections (#31)#91

feat(relay): graceful shutdown on SIGTERM — drain WS connections (#31)#91
ilmoniemi merged 3 commits into
mainfrom
feature/31

ilmoniemi commented May 13, 2026

Uh oh!

ilmoniemi commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ilmoniemi commented May 13, 2026

What

Issue

Design notes

Testing

Architecture compliance

Uh oh!

ilmoniemi commented May 13, 2026

Code Review: #31

Findings

CI status

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant