feat(relay): graceful shutdown on SIGTERM — drain WS connections (#31)#91
On SIGTERM or SIGINT the relay now drains in-flight WebSocket connections to a clean close before exiting: stop accepting new connections, emit StatusGoingAway (1001) on every live binary and phone conn registered at drain start, wait up to 10s for peers to acknowledge, then force-close. Both --insecure-listen and --domain (autocert) modes handle the signal identically.

- Registry.Snapshot returns a freshly-allocated slice of every live Conn so the shutdown helper can fan close frames out without exposing internal maps (same shape as PhonesFor: copy under RLock, slow work outside the lock).
- internal/relay/shutdown.go: concurrent per-server http.Server.Shutdown fan-out + per-conn CloseWithCode fan-out (the 5s close-handshake gate caps wall-clock time regardless of conn count). Force-closes via http.Server.Close on deadline expiry. Idempotency is inherited from WSConn.closeOnce.
- cmd/pyrycode-relay/main: factored into run(args, sigCtx) so the e2e test can drive the shutdown path without forking a real subprocess. Listener goroutines feed a buffered errors chan; signal-triggered drains exit 0, listener-error-triggered drains exit 1 (preserves today's os.Exit(1) semantics for the process-supervisor restart case).
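For readers unfamiliar with the "copy under RLock, slow work outside the lock" shape mentioned for Registry.Snapshot, here is a minimal sketch. The Registry fields, the Conn interface, and fakeConn are hypothetical stand-ins invented for illustration; the real types in the relay may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// Conn is a hypothetical stand-in for the registry's connection interface.
type Conn interface{ ID() string }

// fakeConn is an illustrative Conn used only by this sketch.
type fakeConn string

func (c fakeConn) ID() string { return string(c) }

// Registry is a simplified stand-in; the field names are assumptions.
type Registry struct {
	mu       sync.RWMutex
	binaries map[string]Conn
	phones   map[string][]Conn
}

// Snapshot copies every live Conn into a freshly-allocated slice while
// holding the read lock, and returns nil when the registry is empty.
// Callers can then do slow work (such as close handshakes) without the lock.
func (r *Registry) Snapshot() []Conn {
	r.mu.RLock()
	defer r.mu.RUnlock()

	n := len(r.binaries)
	for _, ps := range r.phones {
		n += len(ps)
	}
	if n == 0 {
		return nil
	}
	out := make([]Conn, 0, n)
	for _, c := range r.binaries {
		out = append(out, c)
	}
	for _, ps := range r.phones {
		out = append(out, ps...)
	}
	return out
}

func main() {
	r := &Registry{
		binaries: map[string]Conn{"s1": fakeConn("bin-1")},
		phones:   map[string][]Conn{"s1": {fakeConn("phone-1"), fakeConn("phone-2")}},
	}
	fmt.Println(len(r.Snapshot())) // prints 3
}
```

Because the returned slice is freshly allocated, a caller that mutates it cannot affect the registry's internal maps.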
Code Review: #31
Decision: PASS
Summary: The implementation cleanly matches the architect's spec. No MUST FIX or SHOULD FIX findings. Re-run CI; merge when green.
Add feature doc, codebase note, and INDEX entry for the SIGTERM/SIGINT drain path. Refresh metrics-listener / rate-limit-middleware / autocert-tls feature docs whose "no graceful shutdown" / "deferred to #31" lines are no longer accurate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
What
On SIGTERM/SIGINT the relay now drains in-flight WebSocket connections to a clean close before exiting:
- Stop accepting new connections (/v1/server and /v1/client upgrades fail the handshake).
- Send a close frame to every live binary and phone conn registered at drain start via CloseWithCode(StatusGoingAway, "shutting down").
- Wait up to 10s (drainDeadline) for peers to acknowledge; conns still mid-handshake at the deadline are force-closed via http.Server.Close.
- Both --insecure-listen and --domain (autocert) modes handle the signal identically.

Implements all five acceptance criteria from #31.
Issue
Closes #31.
Design notes
- Registry.Snapshot() mirrors the existing PhonesFor pattern: snapshot under RLock, all close fan-out happens outside the lock. Returns one flat []Conn because the only consumer treats binaries and phones uniformly.
- relay.Shutdown(ctx, logger, reg, servers...) lives in internal/relay/shutdown.go. Concurrent per-server http.Server.Shutdown + concurrent per-conn CloseWithCode keep wall-clock cost bounded by the single-worst-case 5s close-handshake (per lessons.md), not linear in conn count. Returns nil on clean drain, ctx.Err() on deadline expiry.
- gracefulCloser is a narrow interface internal to shutdown.go; *WSConn satisfies it without widening the registry's Conn interface or breaking existing fakeConn test mocks.
- cmd/pyrycode-relay/main is split into main → run(args, sigCtx) so the e2e test can drive the shutdown path with a synthetic context cancellation instead of forking a real subprocess (architect-suggested shape). Signal-triggered drains exit 0; listener-error-triggered drains exit 1 (preserves today's os.Exit(1) semantics for process-supervisor restart).
- drainDeadline = 10 * time.Second lives at the wiring site per the established #21/#60 policy-values-in-main convention.
Testing
- registry_test.go: TestSnapshot_EmptyRegistry, TestSnapshot_IncludesBinariesAndPhones, TestSnapshot_FreshSliceIsolation, TestSnapshot_RaceFreedom (16 goroutines × 200 ops under -race).
- shutdown_test.go: clean drain, empty registry, deadline expiry (with a Close-blocking fake conn; asserts prompt return after srv.Close()), and idempotent close on a real *WSConn.
- main_e2e_test.go: boots the relay on 127.0.0.1:0 via run(), opens a /v1/server WS, triggers shutdown, asserts the next Read surfaces websocket.CloseStatus == StatusGoingAway and run() returns 0 within the drain deadline. A companion test asserts the listener is no longer reachable after run returns.

All green.
Architecture compliance
Follows the spec at docs/specs/architecture/31-graceful-shutdown-on-sigterm.md:

- Registry.Snapshot() accessor with the documented signature, RLock + freshly-allocated slice, returns nil on empty.
- Shutdown(ctx, logger, reg, servers...) with the documented behavior (concurrent fan-out, ctx-first / done-first select, force-close on deadline, idempotency).
- main wiring: signal.NotifyContext(SIGTERM, SIGINT), per-server listener goroutines feeding a buffered listenerErr chan, signal-vs-error select, drainCtx with timeout, exit codes 0/1 per the architect's "Open questions" resolution.
- defer limiter.Close() now runs on the clean return path; the updated comment points to #31 instead of the stale #47.

🤖 Generated with Claude Code