Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 12 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,16 +30,26 @@ make lint # gosec + govulncheck (requires both installed locally)

## Run

Production (autocert):

```bash
sudo ./bin/pyrycode-relay --domain relay.example.com
```

The relay binds `:443` (WSS) and `:80` (ACME http-01 challenge). Both ports must be reachable from the public internet — Let's Encrypt issues the cert by hitting `:80` on first request to the domain. The first WSS request after a fresh start may take ~10–20s while the cert is issued and cached to `--cert-cache`. Subsequent restarts reuse the cached cert.

Behind a reverse proxy (TLS terminated upstream):

```bash
./bin/pyrycode-relay --domain relay.example.com
./bin/pyrycode-relay --insecure-listen :8080
```

Flags:

| Flag | Default | Notes |
|---|---|---|
| `--domain` | (required for autocert) | Public domain for Let's Encrypt cert issuance. Required when `--insecure-listen` is unset. |
| `--cert-cache` | `~/.pyrycode-relay/certs` | Directory for autocert's TLS certificate cache. |
| `--cert-cache` | `~/.pyrycode-relay/certs` | Directory for autocert's TLS certificate cache. Created with `0700` if missing; refuses to start if an existing dir is world- or group-readable. |
| `--insecure-listen` | (unset) | Listen address for plain HTTP (e.g. `:8080`). Disables autocert. Use only when fronted by a reverse proxy. |
| `--version` | | Print version and exit. |

Expand Down
47 changes: 43 additions & 4 deletions cmd/pyrycode-relay/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@ import (
"net/http"
"os"
"time"

"github.com/pyrycode/pyrycode-relay/internal/relay"
)

// Version is overridden at build time via -ldflags.
Expand Down Expand Up @@ -63,11 +65,48 @@ func main() {
return
}

// TLS path (autocert) is intentionally not implemented yet — first
// real ticket wires it up. Refuse to start so misconfiguration is loud.
logger.Error("autocert TLS path not yet implemented; use --insecure-listen for now",
mgr, err := relay.NewAutocertManager(*domain, *certCache)
if err != nil {
logger.Error("autocert setup failed", "err", err)
os.Exit(1)
}

httpsSrv := &http.Server{
Addr: ":443",
Handler: relay.EnforceHost(*domain, mux),
TLSConfig: relay.TLSConfig(mgr),
ReadHeaderTimeout: 10 * time.Second,
ReadTimeout: 60 * time.Second,
WriteTimeout: 60 * time.Second,
IdleTimeout: 120 * time.Second,
}

httpSrv := &http.Server{
Addr: ":80",
// NotFoundHandler — NOT nil. autocert.Manager.HTTPHandler(nil)
// would 302 GET/HEAD to HTTPS; the AC requires explicit 404 for
// non-challenge traffic.
Handler: mgr.HTTPHandler(http.NotFoundHandler()),
ReadHeaderTimeout: 10 * time.Second,
ReadTimeout: 60 * time.Second,
WriteTimeout: 60 * time.Second,
IdleTimeout: 120 * time.Second,
}

logger.Info("starting", "version", Version, "mode", "autocert",
"domain", *domain, "cert_cache", *certCache)
os.Exit(2)

go func() {
if err := httpSrv.ListenAndServe(); err != nil {
logger.Error("http-01 listener failed", "err", err)
os.Exit(1)
}
}()

if err := httpsSrv.ListenAndServeTLS("", ""); err != nil {
logger.Error("https listener failed", "err", err)
os.Exit(1)
}
}

func defaultCertCache() string {
Expand Down
5 changes: 3 additions & 2 deletions docs/PROJECT-MEMORY.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,20 +9,21 @@ Stateless WebSocket router between mobile clients and pyry binaries. Internet-ex
| Go skeleton | Done | `cmd/pyrycode-relay/main.go`, `internal/relay/doc.go` |
| `http.Server` with explicit timeouts (gosec G114) | Done | `cmd/pyrycode-relay/main.go` |
| Routing envelope wrapper type (`Envelope`, `Marshal`, `Unmarshal`, sentinel errors) | Done (#1) | `internal/relay/envelope.go` |
| TLS via autocert in `--domain` mode (`NewAutocertManager`, `EnforceHost`, `TLSConfig`, `ErrCacheDirInsecure`) | Done (#9) | `internal/relay/tls.go`, `cmd/pyrycode-relay/main.go` |
| WS upgrade on `/v1/server` and `/v1/client` | Not started | — |
| Header validation (`x-pyrycode-server`, `x-pyrycode-token`) | Not started | — |
| Connection registry (server-id ↔ binary, server-id ↔ phones) | Not started | — |
| Frame forwarding using the routing envelope | Not started | — |
| `conn_id` generation scheme | Not started | — |
| TLS / autocert | Not started | — |
| Threat model doc | Not started | `docs/threat-model.md` planned |

## Patterns established

- **Sentinel errors, branched via `errors.Is`** for validation failures at protocol boundaries. New routing-layer code follows the `Err...` naming and wraps with `fmt.Errorf("…: %w", err, sentinel)` when adding context.
- **Opacity by type.** Inner-frame payloads are carried as `json.RawMessage`. The relay never deserialises payloads; the type makes that hard to violate accidentally.
- **Validate at the envelope boundary, not deeper.** Structural checks (presence, non-empty, JSON well-formedness) belong here; semantic checks (token validity, message kind) belong to the binary.
- **Stdlib only.** No external Go dependencies so far. `encoding/json` is sufficient for routing.
- **Stdlib + `golang.org/x/crypto`.** First non-stdlib dep arrived in #9 (`acme/autocert`). Keep the dependency surface deliberate; new deps need a justification.
- **Loud failure over silent correction.** `--domain` and `--insecure-listen` are mutually exclusive and explicit. `ErrCacheDirInsecure` refuses to start on a world/group-readable cert cache rather than re-chmoding it. `:80` returns `404` for non-challenge traffic instead of redirecting to HTTPS (ADR-0002).
- **Tests live in the same package** (`package relay`, not `relay_test`) so they can `errors.Is` against unexported sentinels if needed and reach private helpers without re-exporting.

## Conventions
Expand Down
2 changes: 2 additions & 0 deletions docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,12 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

## Features

- [Autocert TLS](features/autocert-tls.md) — `--domain` mode wiring: `:443` WSS via Let's Encrypt + `:80` ACME http-01, host gates, cache-dir permission policy, TLS 1.2 floor.
- [Routing envelope](features/routing-envelope.md) — typed Go wrapper (`Envelope`, `Marshal`, `Unmarshal`) for the `{conn_id, frame}` wire shape exchanged between relay and binary.

## Decisions

- [ADR-0002: Explicit-failure 404 on `:80` instead of HTTPS redirect](decisions/0002-autocert-explicit-failure-on-port-80.md) — pass `http.NotFoundHandler()` to `autocert.Manager.HTTPHandler`, not `nil`; loud failure over silent scheme-switching.
- [ADR-0001: Routing envelope shape and opacity](decisions/0001-routing-envelope-shape-and-opacity.md) — `json.RawMessage` for the inner frame; sentinel errors; validate at boundary, never parse payloads.

## Architecture
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# ADR-0002: Explicit-failure 404 on `:80` instead of HTTPS redirect

**Status:** Accepted (#9)
**Date:** 2026-05-08

## Context

In `--domain` mode the relay binds `:80` for the ACME http-01 challenge. Anything *other* than an ACME challenge request hitting that port is, by definition, a misconfigured client — phones and binaries are required to use WSS on `:443`. The question is how to respond to those non-challenge requests.

`golang.org/x/crypto/acme/autocert.Manager.HTTPHandler(fallback)` defines the behaviour for non-challenge traffic:

- `fallback == nil` → autocert redirects GET/HEAD to HTTPS with `302` and returns `400` for other methods.
- `fallback != nil` → autocert calls `fallback.ServeHTTP` for all non-challenge traffic.

The ticket's literal AC text said "the `nil` argument means non-challenge HTTP requests get a 404," which is incorrect about autocert's behaviour. The intent stated alongside the AC was the load-bearing part: *"we do NOT redirect to HTTPS — we want explicit-failure semantics for misconfigured clients."*

## Decision

Pass `http.NotFoundHandler()` explicitly to `manager.HTTPHandler`, not `nil`. Non-challenge port-80 traffic returns `404` with no body. ACME challenges still flow through autocert's own routing and reach the cert-issuance handler unchanged.

## Rationale

A `302` redirect is convenient for browsers but actively unhelpful for the relay's clients:

- Mobile clients (phones, binaries) speak WSS on `:443`. They have no business reaching the relay over plain HTTP. A redirect would mask the misconfiguration; a `404` surfaces it.
- Browsers shouldn't reach the relay at all — there is no HTTP UI. A `404` is the honest answer.
- The relay's posture elsewhere (`--insecure-listen` vs `--domain` are mutually exclusive and explicit; `ErrCacheDirInsecure` refuses to start instead of silently weakening) is consistent loud-failure. Redirecting plain-HTTP traffic to HTTPS would be the lone silent-correction in this binary.

Returning a body in the 404 was considered and rejected — the relay leaks no internal state in error responses anywhere else, and nothing useful would go in the body.

## Consequences

- The wiring in `cmd/pyrycode-relay/main.go` is `mgr.HTTPHandler(http.NotFoundHandler())`, with a comment explaining why `nil` is wrong here. Future contributors who "clean up" to `nil` reintroduce the redirect.
- If a future operator deployment ever wants the redirect (e.g. fronting a browser-facing healthcheck on `:80`), this is a one-line change — but it should require a fresh ticket and decision.
- The choice is independent of the `:443` `EnforceHost` middleware (which returns `421` for Host-header mismatches). Both follow the same explicit-failure principle but address different threats: `EnforceHost` pins the application-layer host on TLS connections; the `:80` 404 prevents silent scheme-switching.
124 changes: 124 additions & 0 deletions docs/knowledge/features/autocert-tls.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@
# Autocert TLS (`--domain` mode)

The relay terminates TLS itself in production via Let's Encrypt and `golang.org/x/crypto/acme/autocert`. One command brings the public listener up:

```
sudo pyrycode-relay --domain relay.example.com
```

Two listeners come up: `:443` for WSS (cert via autocert) and `:80` for the ACME http-01 challenge. No reverse proxy in the production path.

Authoritative wire spec: [`pyrycode/pyrycode/docs/protocol-mobile.md` § TLS](https://github.com/pyrycode/pyrycode/blob/main/docs/protocol-mobile.md#tls).

## Modes

| Flag | When to use |
|---|---|
| `--domain <fqdn>` | Production. Relay holds the cert. Requires `:80` and `:443` reachable from the public internet. |
| `--insecure-listen :8080` | Behind a reverse proxy that terminates TLS upstream, or local dev. Disables autocert. |

The two are mutually exclusive; one of them must be set or the binary refuses to start.

## API

Package `internal/relay` (`tls.go`):

```go
func NewAutocertManager(domain, cacheDir string) (*autocert.Manager, error)
func EnforceHost(domain string, next http.Handler) http.Handler
func TLSConfig(m *autocert.Manager) *tls.Config

var ErrCacheDirInsecure = errors.New("relay: cert cache dir has insecure permissions (must be 0700)")
```

`cmd/pyrycode-relay/main.go` does the wiring (server construction, timeouts, listener startup); `internal/relay` provides only the testable pieces.

## Configuration

| Flag | Default | Notes |
|---|---|---|
| `--domain` | (required for autocert) | Single domain. Bound via `autocert.HostWhitelist(domain)`; ACME issuance for any other host is rejected. |
| `--cert-cache` | `~/.pyrycode-relay/certs` | Created with `0700` if missing. Refuses to start if an existing dir is world- or group-readable (`ErrCacheDirInsecure`). |
| `--insecure-listen` | (unset) | Disables autocert. |

The cache dir is the only on-disk state the relay keeps. TLS private keys live there (autocert writes them with `0o600`).

## Two host gates

Single-domain enforcement happens in two places, because the TLS handshake's SNI does not bind the HTTP `Host` header — they can legally disagree on the same connection.

1. **`autocert.HostWhitelist(domain)`** — gates ACME *issuance*. Without it, the relay would attempt to fetch certs for any name a client requests via SNI, which Let's Encrypt would either rate-limit us out of or, worse, succeed at for an attacker-controlled domain pointed at our IP.
2. **`EnforceHost(domain, mux)`** — gates *application traffic* on `:443`. Any request whose `Host` header doesn't match the configured domain (case-insensitive, port-tolerant) gets `421 Misdirected Request` with no body, before reaching the routing handlers.

`:80` does not use `EnforceHost`. The autocert HTTPHandler owns the host check there via the manager's `HostPolicy`.

## Explicit-failure semantics on `:80`

`:80` is wired as `mgr.HTTPHandler(http.NotFoundHandler())` — *not* `mgr.HTTPHandler(nil)`. Passing `nil` would redirect non-challenge traffic (GET/HEAD) to HTTPS with `302`. The relay deliberately returns `404` instead so that misconfigured clients fail loudly rather than silently switching schemes. See ADR-0002.

## Cert cache permissions

`NewAutocertManager` is stricter than the ticket's AC ("existing dir → no-op"). The actual policy:

- Missing → `os.MkdirAll(dir, 0o700)`.
- Exists, is dir, mode `& 0o077 == 0` → no-op (secure).
- Exists, is dir, mode `& 0o077 != 0` → `ErrCacheDirInsecure` (refuses to start).
- Exists, not a dir → error.

Reasoning: TLS private keys live there. Silently accepting an attacker-readable cache fits neither the relay's "internet-exposed; adversarial input is the default" stance nor the `--insecure-listen` precedent of loud failure.

Per-file mode (`0o600`) is autocert's contract, not enforced here.

## Server timeouts

Both `httpsSrv` and `httpSrv` mirror the insecure path's timeouts:

| Setting | Value |
|---|---|
| `ReadHeaderTimeout` | 10s |
| `ReadTimeout` | 60s |
| `WriteTimeout` | 60s |
| `IdleTimeout` | 120s |

`ReadHeaderTimeout` bounds slow-loris on both ports and keeps `gosec` G114 quiet.

## TLS version

`TLSConfig(m)` returns `m.TLSConfig()` with `MinVersion = tls.VersionTLS12`. Centralised because:

- autocert's default config leaves `MinVersion` zero, which trips `gosec` G402 on `make lint`.
- A future "switch to 1.3-min" is a one-line change with one test to update.

Cipher selection is delegated to Go's secure-default suites for TLS 1.2; TLS 1.3 doesn't expose suite selection.

## Concurrency

Two goroutines, no shared mutable state:

1. Main — `httpsSrv.ListenAndServeTLS("", "")`. The empty cert/key paths defer to `TLSConfig.GetCertificate`, which `manager.TLSConfig()` populates. This is the documented autocert pattern.
2. Background — `httpSrv.ListenAndServe()` for ACME http-01.

Either listener failing → log + `os.Exit(1)`. No graceful shutdown (mirrors the insecure path).

The `*autocert.Manager` is constructed once and used read-only; its internal locking is autocert's contract.

## Operational notes

- `:80` and `:443` are privileged. Run with `sudo`, or grant `CAP_NET_BIND_SERVICE` via `setcap` / systemd `AmbientCapabilities`.
- First request to the domain after a fresh start may take ~10–20s while autocert issues and caches the cert. Subsequent restarts reuse the cache.
- Cert renewal is silent — autocert handles it. No metrics in v1.

## Out of scope (deferred)

- Multi-domain certs (`autocert.HostWhitelist` is variadic but the UX/ops story isn't worth solving until a second domain is needed).
- DNS-01 / wildcard certs.
- `--acme-email` for Let's Encrypt expiry mail.
- Renewal observability.
- Self-signed dev fallback (`--insecure-listen` exists for that).
- Per-IP / per-connection rate limits (separate ticket).
- Graceful shutdown / signal handling.

## Related

- [ADR-0002: Explicit-failure 404 on `:80` instead of HTTPS redirect](../decisions/0002-autocert-explicit-failure-on-port-80.md)
- [Architecture overview](../../architecture.md) — relay is the cert holder.
16 changes: 16 additions & 0 deletions docs/lessons.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,3 +13,19 @@ When unmarshalling `{"frame": null}` into a struct with `Frame json.RawMessage`,
## `json.RawMessage` round-trips are byte-stable *modulo whitespace*

`encoding/json` re-encodes `RawMessage` through the outer marshaller and may strip insignificant whitespace. Tests asserting opacity should canonicalise both sides with `json.Compact` before `bytes.Equal`, otherwise they flake on formatting. Source: routing-envelope tests (#1).

## `autocert.Manager.HTTPHandler(nil)` redirects to HTTPS — it does not 404

The autocert docs and the `nil` argument are easy to misread as "no fallback → 404." Actual behaviour: GET/HEAD get a `302` to HTTPS; other methods get `400`. To get an explicit 404 (or any other handler), pass it explicitly: `mgr.HTTPHandler(http.NotFoundHandler())`. See ADR-0002. Source: autocert TLS wiring (#9).

## TLS handshake SNI does not bind the HTTP `Host` header

A client can complete the TLS handshake using SNI for `relay.example.com` and then send `Host: somethingelse.com` in the HTTP request on the same connection. The wire spec says they must match for the relay; enforce it in application code (`EnforceHost` returns `421 Misdirected Request`). Don't assume autocert's `HostWhitelist` covers this — that gates ACME *issuance*, not request-time host validation. Source: autocert TLS (#9).

## `r.Host` may carry a `:port` suffix

RFC 7230 §5.4 allows it. When comparing to a configured hostname, strip the port via `net.SplitHostPort` first (it errors out cleanly if there's no port — use the original string in that case). Compare with `strings.EqualFold` because hostnames are case-insensitive. Source: `EnforceHost` (#9).

## `autocert.Manager.TLSConfig()` doesn't set `MinVersion`

`gosec` G402 fires on `make lint` if you use it raw. Wrap it in a helper that pins `MinVersion = tls.VersionTLS12` (or 1.3) before handing it to `http.Server`. Centralising the override means a future bump is a one-line change. Source: `relay.TLSConfig` (#9).
Loading