Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 13 additions & 3 deletions cmd/pyrycode-relay/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -182,10 +182,20 @@ func run(args []string, sigCtx context.Context) int {

if *insecureListen != "" {
logger.Info("starting", "version", Version, "mode", "insecure", "listen", *insecureListen)
// ReadHeaderTimeout (5s) bounds the pre-upgrade WebSocket
// handshake window — the gap between TCP accept and full HTTP
// request-header receipt. After header parse,
// nhooyr.io/websocket.Accept hijacks the connection and
// ReadTimeout/WriteTimeout no longer apply; post-upgrade slow
// peers are covered by per-frame deadlines (#15) and heartbeat
// ping/pong (#7). Caps slow-loris exposure on the internet-
// exposed surface (docs/threat-model.md § DoS resistance).
// Lives at each wiring site per the policy-values-in-main
// convention (docs/PROJECT-MEMORY.md "Project-level conventions").
srv := &http.Server{
Addr: *insecureListen,
Handler: mux,
ReadHeaderTimeout: 10 * time.Second,
ReadHeaderTimeout: 5 * time.Second,
ReadTimeout: 60 * time.Second,
WriteTimeout: 60 * time.Second,
IdleTimeout: 120 * time.Second,
Expand Down Expand Up @@ -236,7 +246,7 @@ func run(args []string, sigCtx context.Context) int {
Addr: ":443",
Handler: relay.EnforceHost(*domain, mux),
TLSConfig: relay.TLSConfig(mgr),
ReadHeaderTimeout: 10 * time.Second,
ReadHeaderTimeout: 5 * time.Second,
ReadTimeout: 60 * time.Second,
WriteTimeout: 60 * time.Second,
IdleTimeout: 120 * time.Second,
Expand All @@ -248,7 +258,7 @@ func run(args []string, sigCtx context.Context) int {
// would 302 GET/HEAD to HTTPS; the AC requires explicit 404 for
// non-challenge traffic.
Handler: mgr.HTTPHandler(http.NotFoundHandler()),
ReadHeaderTimeout: 10 * time.Second,
ReadHeaderTimeout: 5 * time.Second,
ReadTimeout: 60 * time.Second,
WriteTimeout: 60 * time.Second,
IdleTimeout: 120 * time.Second,
Expand Down
59 changes: 59 additions & 0 deletions docs/knowledge/codebase/35.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
# Ticket #35 — WS upgrade handshake timeout: cap pre-upgrade window at 5s

`http.Server.ReadHeaderTimeout` was 10s at every relay listener. That is the only bound on a peer that opens TCP and dribbles WS-upgrade request headers byte-by-byte; `ReadTimeout` / `WriteTimeout` stop applying once `websocket.Accept` hijacks the connection, so the 10s window was the full slow-loris exposure on the internet-exposed surface. This ticket tightens the value to 5s at every wiring site — three `http.Server` literals in `cmd/pyrycode-relay/main.go` and one in `internal/relay/metrics_listen.go`'s `NewMetricsServer` — plus a behavioural test pinning the `net/http` contract the relay depends on.

## Implementation

- **Four wiring sites move together.** `ReadHeaderTimeout: 10 * time.Second` → `5 * time.Second` at:
- `cmd/pyrycode-relay/main.go:198` — insecure-listen `srv`.
- `cmd/pyrycode-relay/main.go:249` — autocert TLS `httpsSrv`.
- `cmd/pyrycode-relay/main.go:261` — autocert HTTP-01 `httpSrv`.
- `internal/relay/metrics_listen.go:90` — `NewMetricsServer` return literal.
- **Metrics listener moves with the public listeners on purpose.** `metrics_listen.go`'s function comment ("Timeouts match the public listener (cmd/pyrycode-relay/main.go) so the metrics surface has the same DoS-resistance shape … today they match because that is the safest default") makes the four-site invariant load-bearing. Loopback-only metrics traffic has no slow-loris threat, but leaving the value at 10s would silently break the invariant the comment encodes and force a future contributor to re-discover whether the divergence was intentional.
- **Documenting the bound at the wiring site.** A new comment paragraph above the first occurrence (insecure-listen `srv` in `main.go`) states the value, what it bounds (pre-upgrade window — the gap between TCP accept and full HTTP header receipt), the references to the threat model and the post-upgrade complements (#15 per-frame deadlines, #7 heartbeat), and the wiring-site convention from `docs/PROJECT-MEMORY.md`. The autocert-mode `httpsSrv` / `httpSrv` carry the literal without their own comment — they sit visibly next to the commented insecure-mode `srv` in the same file. `NewMetricsServer`'s existing function-level comment block already covers its site.
- **No factory, no shared `const`.** Each of the four sites keeps its own literal. The inline pattern is load-bearing for the wiring-site convention: an operator reading the constructor sees the policy value next to the listener that enforces it. Promoting to a shared symbol would gain DRY but lose locality — the rule the AC names ("policy values live at the wiring site") explicitly prefers locality.
- **Why 5s.** Inside the AC's "~5 seconds" band, 5s is the choice because (a) it is identical to nhooyr's per-conn close-handshake timeout already documented in `docs/lessons.md` and used as the basis for `drainDeadline` headroom — keeping the relay's slow-peer bounds at one consistent number reduces operator surprise; (b) it comfortably exceeds round-trip + header-write time on the realistic worst case (mobile peer over LTE, ~500ms-1s RTT, a few hundred bytes of headers), so legitimate clients are not at risk of intermittent boot failures; (c) 3s would be tighter but starts to feel like a tuning value, while 5s leaves observable margin without leaving the resource window meaningfully open. Not a tuning parameter for this ticket; should not be revisited unless an operational signal demands it.
- **No new code path, no new goroutines, no new logs, no new metrics.** `ReadHeaderTimeout` is enforced by `net/http`'s connection-state machine before the registered handler is invoked. When it fires, the underlying `net.Conn` is closed and no handler runs. No log line, no metric — matches today's behaviour at 10s. A peer that never sends a full request is indistinguishable from a peer that disconnected mid-write, and we already chose not to log either case. Visibility into pre-upgrade timeouts as a slow-loris indicator is deferred to a future metrics-pipeline ticket against `internal/relay/metrics_upgrade.go`.
- **Test (`internal/relay/slow_upgrade_test.go`, new, ~93 lines):** `TestHTTPServerReadHeaderTimeoutClosesSlowPeer` pins the `net/http` contract this ticket depends on. Constructs a bare `*http.Server` (not `httptest.NewServer`, which cannot set `ReadHeaderTimeout`) on a `net.Listen("tcp", "127.0.0.1:0")` listener, with `ReadHeaderTimeout: 150ms` to keep the test fast and other timeouts (`ReadTimeout` / `WriteTimeout` / `IdleTimeout`) deliberately generous so the test cannot accidentally pass by tripping the wrong timeout. A slow client `Dial`s, writes a partial request-header block (`"GET /v1/server HTTP/1.1\r\nHost: x\r\n"` — no terminating `\r\n\r\n`), then `Read`s with a deadline of `ReadHeaderTimeout + 500ms` slack. The server must close the connection: `Read` returns either `io.EOF` or a connection-reset-style `net.Error` whose `Timeout()` is false (the client-side deadline error is explicitly distinguished and fails the test loudly — "server did not close the connection"). An `atomic.Int64` guards the handler: if it ever increments, the test fails because that signals `ReadHeaderTimeout` did not fire before the handler ran.
- **What the test does and does not prove.** Does prove: the standard library's `ReadHeaderTimeout` contract — the thing the relay depends on — actually closes the connection at the configured deadline. If a future Go version regressed this, or if a future refactor accidentally zeroed the field, the test catches it. Does not prove: that `main.go` sets the value to 5s. That is verified by reading the diff (or `grep ReadHeaderTimeout cmd/pyrycode-relay/main.go internal/relay/metrics_listen.go`) at code review time, same as every other wiring-site literal (`drainDeadline`, `maxFrameBytes`, the rate-limit constants in `main.go`, the timeouts in `metrics_listen.go`).
- **`internal/relay/metrics_listen_test.go`** — one literal updated in the constructor matrix's pinned-timeout assertion (10s → 5s). The test pins the timeout values intentionally rather than reading from a shared symbol, so an accidental drift surfaces as a unit-test failure.

## Acceptance criteria — verification map

- **AC-1** (pre-upgrade stall closed within ~5s): `ReadHeaderTimeout: 5 * time.Second` at every wiring site. Behavioural test `TestHTTPServerReadHeaderTimeoutClosesSlowPeer` proves the contract at a short value; the production value is verified by code review at each literal.
- **AC-2** (applies on `/v1/server` and `/v1/client`): the bound lives on `http.Server`, before the per-route handler dispatch — it covers every path on the listener (`/healthz`, `/v1/server`, `/v1/client`) by construction.
- **AC-3** (applies in both `--insecure-listen` and `--domain` modes): three `http.Server` literals in `main.go` move together (insecure `srv`, autocert `httpsSrv`, autocert `httpSrv`); the metrics-listener literal moves with them.
- **AC-4** (focused slow-handshake test in `internal/relay`): `internal/relay/slow_upgrade_test.go` — bare `http.Server` on loopback, partial-header dial, `Read` returns within deadline + slack.
- **AC-5** (5s literal at the wiring site, not buried in handler internals): each of the four sites carries the literal directly; no helper, no shared `const`. Matches the policy-values-in-main convention documented in `docs/PROJECT-MEMORY.md`.
- **AC-6** (`make vet`, `make test -race`, `make build` clean): no new dependencies, no new exported symbols, no concurrency changes.

## Patterns reinforced (not new)

- **Policy values live at the wiring site.** Each of the three `http.Server` literals in `main.go` and the one in `metrics_listen.go` continues to carry its own value. The local-comment-+-literal shape (mirroring `drainDeadline`'s comment block at `main.go:25-31`) is the established way the codebase documents a wiring-site policy.
- **Behavioural test of a stdlib contract + code-review of the literal.** The combination is how this codebase already validates `drainDeadline` (#31), `maxFrameBytes` (#15), the rate-limit constants (#47), and the existing `metrics_listen.go` timeouts (#60). The slow-handshake test fits the same shape — it pins the `net/http` contract the production literal depends on, while the literal itself is verified at review.
- **Multi-site invariants need every site to move.** The metrics-listener literal moves with the public-listener literals because the function's own comment makes the symmetry load-bearing. Carry forward: when a comment says "this value matches site X for reason Y", changing site X mechanically obligates this site too — the alternative is silently invalidating the comment.

## Lessons learned

- **`ReadTimeout` does not cover the upgrade window the way intuition suggests.** It is tempting to read `ReadTimeout: 60s` as "the whole request must finish in 60s" — that would already bound slow-loris. But `nhooyr.io/websocket.Accept` calls `http.ResponseWriter`'s `Hijack`, and once a connection is hijacked, `net/http` stops applying `ReadTimeout` / `WriteTimeout`. So between TCP accept and the moment the upgrade completes, `ReadHeaderTimeout` is the only stdlib-level bound. The post-upgrade window is then handled by per-frame deadlines (#15) and heartbeat (#7), which are application-level concerns living above the hijacked socket. Three layers, three distinct surfaces — none of them subsumes the others.
- **A behavioural test that runs the production handler would be misleading here.** A natural first instinct is "spin up a real `relay.ServerHandler`, dial slowly, assert close" — but that path requires `websocket.Accept` against a real upgrade, and the bound this ticket tests fires *before* the upgrade even starts. Testing the production handler would exercise more code without exercising the deciding line. The bare-`http.Server` shape is the minimal coverage of the thing actually being defended.
- **Handler-never-runs is the right negative signal.** The `atomic.Int64` handler-call counter is small but load-bearing: if the test setup ever lets the handler run, the assertion ("server closed the connection") could still pass spuriously via a different code path (e.g. a handler-side `r.Body` read with its own deadline). Counting the handler invocations makes "the timeout fired before the handler ran" an explicit, testable claim rather than an implicit assumption.
- **A short test timeout exercises the same contract as the production one.** Tests do not need the production value (5s) to verify the production contract. `net/http`'s `ReadHeaderTimeout` enforcement is the same code path at any positive duration; using 150ms keeps the test under the parallel-tests budget without weakening the assertion. Same shape as `TestShutdown_DeadlineExpiryReturnsCtxErr` (#31), which uses a 50ms drain deadline against the helper's 10s production constant.
- **Non-EOF, non-timeout close errors are valid.** Different kernels surface a server-side close as `io.EOF`, `ECONNRESET`, `EPIPE`, or `read: connection reset by peer` depending on TCP race timing. The test treats EOF as the canonical signal and logs (but does not fail on) other non-timeout errors, which keeps it robust across darwin/linux/CI runners without weakening the assertion (`netErr.Timeout()` being false is the load-bearing predicate — anything else means the server, not the client, ended the read).

## Cross-links

- [Threat model § DoS resistance](../../threat-model.md) — updated to cite the new 5s value and to list all four `http.Server` instances (insecure, autocert TLS, autocert HTTP-01, metrics listener).
- [Feature: Autocert TLS](../features/autocert-tls.md) — server-timeouts table updated to 5s.
- [Feature: Metrics listener](../features/metrics-listener.md) — constructor timeout policy updated to 5s; function comment's "match the public listener" invariant preserved.
- [Codebase note #31](31.md) — `drainDeadline` is the established pattern for a policy literal documented at the wiring site; this ticket reuses the same shape.
- [Codebase note #15](15.md) — per-frame deadlines that cover the post-upgrade slow-peer window.
- [Codebase note #7](7.md) — heartbeat ping/pong that also covers the post-upgrade slow-peer window.
- [Codebase note #50](50.md) — per-IP attempt-rate cap; the layer above this one in the DoS-resistance stack.

## Out of scope (named in spec, deferred)

- Per-IP connection rate limiting — covered by #50.
- Post-upgrade slow peers — covered by #15 (per-frame deadlines) and #7 (heartbeat).
- A metric counting `ReadHeaderTimeout` firings — no operational signal currently demands it; future work belongs against `internal/relay/metrics_upgrade.go`.
- Tightening `ReadTimeout`, `WriteTimeout`, or `IdleTimeout` — those govern post-header request lifecycle and are not the slow-loris surface this ticket addresses.
4 changes: 2 additions & 2 deletions docs/knowledge/features/autocert-tls.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,12 +75,12 @@ Both `httpsSrv` and `httpSrv` mirror the insecure path's timeouts:

| Setting | Value |
|---|---|
| `ReadHeaderTimeout` | 10s |
| `ReadHeaderTimeout` | 5s |
| `ReadTimeout` | 60s |
| `WriteTimeout` | 60s |
| `IdleTimeout` | 120s |

`ReadHeaderTimeout` bounds slow-loris on both ports and keeps `gosec` G114 quiet.
`ReadHeaderTimeout` bounds slow-loris on the pre-upgrade window on both ports and keeps `gosec` G114 quiet. Tightened from 10s to 5s in #35; post-upgrade slow peers are covered by per-frame deadlines (#15) and heartbeat (#7).

## TLS version

Expand Down
2 changes: 1 addition & 1 deletion docs/knowledge/features/metrics-listener.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ func NewMetricsServer(addr string, h http.Handler) (*http.Server, error)
- `NewMetricsServer` is the opt-out-aware constructor:
- `addr == ""` → `(nil, nil)`. The caller checks `srv != nil` before launching the goroutine or adding the port to the listener allowlist.
- `CheckLoopbackBind(addr)` fails → `(nil, err)`.
- otherwise → `*http.Server` with the public listener's timeout policy (`ReadHeaderTimeout: 10s`, `ReadTimeout: 60s`, `WriteTimeout: 60s`, `IdleTimeout: 120s`) — duplicated rather than shared so either listener can drift if a future ticket has reason; today they match because that is the safest default.
- otherwise → `*http.Server` with the public listener's timeout policy (`ReadHeaderTimeout: 5s`, `ReadTimeout: 60s`, `WriteTimeout: 60s`, `IdleTimeout: 120s`) — duplicated rather than shared so either listener can drift if a future ticket has reason; today they match because that is the safest default. (`ReadHeaderTimeout` was 10s at first wiring; #35 tightened all four wiring sites to 5s together to keep the invariant the function's comment encodes.)

The handler argument is `http.Handler`, not `*http.ServeMux`: the caller chooses whether to wire a bare `NewMetricsHandler(reg)` or wrap it in a mux. `cmd/pyrycode-relay/main.go` wraps in a `ServeMux` so `/metrics` is distinguishable from a future sibling path on the same listener.

Expand Down
Loading
Loading