Skip to content

feat(relay): refuse to boot on multi-instance deploy (#65)#93

Merged
ilmoniemi merged 4 commits into
mainfrom
feature/65
May 13, 2026
Merged

feat(relay): refuse to boot on multi-instance deploy (#65)#93
ilmoniemi merged 4 commits into
mainfrom
feature/65

Conversation

@ilmoniemi
Copy link
Copy Markdown
Contributor

What

Adds CheckSingleInstance: a boot-time refusal that fires when the relay detects it is running on a multi-instance-capable platform (today: Fly.io, signalled by a non-empty FLY_APP_NAME) AND the operator has not asserted single-instance intent via PYRYCODE_RELAY_SINGLE_INSTANCE=1. On positive detection the relay logs an ERROR containing the substring multi-instance deploy detected and the literal bypass env-var name, then exits 2.

This is the deterministic backstop half of the belt-and-suspenders pair noted in docs/PROJECT-MEMORY.md. The doc half (#64) shipped earlier — prose in docs/architecture.md § Single-instance constraint that pins the literal env-var name and value used here.

Implementation follows the spec at docs/specs/architecture/65-startup-multi-instance-check.md exactly:

  • internal/relay/single_instance.go — new helper, signature mirrors CheckInsecureListenInProduction. Exports ErrMultiInstanceDeployDetected (branchable via errors.Is).
  • internal/relay/env_config.go — one new row in envContracts for the bypass env var, so a typo like PYRYCODE_RELAY_SINGLE_INSTANCE=true fails CheckEnvConfig with the structured per-key error before CheckSingleInstance reads the value.
  • cmd/pyrycode-relay/main.go — wired between CheckEnvConfig and CheckInsecureListenInProduction. Rationale per spec § Wiring order: deploy-shape misconfiguration is more fundamental than production-mode flag misconfiguration; typo'd bypass surfaces as a structured config error first.

Issue

Closes #65.

Testing

internal/relay/single_instance_test.go — table-driven matrix covering AC #5(a/b/c) plus guard-rails for the bypass-set-no-signal no-op case and the non-exact-"1" bypass values ("true", "0", " 1"). Separate assertions lock the error message contents to the AC #2 substrings (multi-instance deploy detected and PYRYCODE_RELAY_SINGLE_INSTANCE) and verify errors.Is branchability.

internal/relay/env_config_test.go — adds TestCheckEnvConfig_SingleInstanceBypassMalformedValue to lock the new envContracts row.

go test -race ./...    # pass
go vet ./...           # clean
go build ./cmd/pyrycode-relay  # builds

Architecture compliance

  • New helper signature mirrors CheckRunningAsRoot / CheckInsecureListenInProduction (injected-getenv seam).
  • Bypass env-var name lives in internal/relay (envSingleInstanceBypass), not duplicated at the call site, matching the envProductionMode convention.
  • Decision order: bypass-wins-unconditionally → platform-signal-present → nil. Matches the spec § Design steps 1–3.
  • Error-log shape mirrors the existing CheckInsecureListenInProduction block (logger.Error + "refusing to start: …" + structured fields + return 2).
  • Pure function, no goroutines, no I/O beyond env-var reads — no shutdown path needed.

ilmoniemi and others added 2 commits May 13, 2026 11:57
CheckSingleInstance is the deterministic backstop to the docs/architecture.md
§ Single-instance constraint prose half (#64): when FLY_APP_NAME is non-empty
(Fly platform signal) and PYRYCODE_RELAY_SINGLE_INSTANCE is not exactly "1",
the relay refuses to start with a structured ERROR log and exit code 2. The
in-memory connection registry is per-process; silent fly scale count > 1
would split server-id routing across replicas. Bypass env var registered in
envContracts so a typo like =true fails CheckEnvConfig first.

Wiring: between CheckEnvConfig and CheckInsecureListenInProduction in
cmd/pyrycode-relay/main.go.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ilmoniemi
Copy link
Copy Markdown
Contributor Author

Code Review: #65

Decision: FAIL

Findings

  • [MUST FIX] CI security job (gosec) is failing on internal/relay/single_instance.go:9 — gosec G101 (Potential hardcoded credentials, CWE-798) flags the constant envSingleInstanceBypass because the identifier matches gosec's default credential-name heuristic (Bypass is in the default pattern list). The literal value "PYRYCODE_RELAY_SINGLE_INSTANCE" is not a credential, so this is a false positive — but a red CI security check blocks merge regardless. Two ways to clear it:

    1. Annotate, e.g. const envSingleInstanceBypass = "PYRYCODE_RELAY_SINGLE_INSTANCE" //#nosec G101 -- env-var name, not a credential. Acceptable because the PR description / commit message can justify it (per the project's security-sensitive contract: // #nosec annotations require justification). Lowest-friction option.
    2. Rename the identifier so it no longer matches gosec's pattern (e.g. envSingleInstanceAssert, envSingleInstanceOverride). This contradicts the architect's spec (docs/specs/architecture/65-startup-multi-instance-check.md § Design pins the literal envSingleInstanceBypass) — if you go this route, also update the spec.

    Recommend option (1): minimal diff, zero behaviour change, the annotation makes the false-positive nature explicit at the line in question. Either fix is acceptable.

Summary

Implementation matches the spec faithfully: decision order, sentinel error contents, env-config registry row, wiring placement between CheckEnvConfig and CheckInsecureListenInProduction, error-log shape, exit code 2, and the test matrix covering AC #5(a–c) plus guard-rail cases (bypass-without-signal no-op, non-exact-"1" values "true"/"0"/" 1" still refuse). TestCheckEnvConfig_SingleInstanceBypassMalformedValue correctly locks the new envContracts row. Go vet and go test -race pass locally. Re-review after CI is green.

`envSingleInstanceBypass` holds the literal env-var name string
`PYRYCODE_RELAY_SINGLE_INSTANCE`, not a credential. gosec G101 flags
it at LOW confidence; suppress with an inline `#nosec` annotation
naming the rule and the reason so the suppression scope is narrow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ilmoniemi
Copy link
Copy Markdown
Contributor Author

Code Review: #65

Decision: PASS

Findings

None.

Summary

Clean implementation that matches the architect's spec exactly:

  • internal/relay/single_instance.go mirrors the CheckInsecureListenInProduction shape (injected getenv seam, exported sentinel, pure function).
  • Decision order in CheckSingleInstance follows § Design steps 1–3: bypass wins unconditionally, then platform signal, then nil.
  • Env-var name lives in single_instance.go (envSingleInstanceBypass) and is referenced from env_config.go — single-source-of-truth invariant preserved.
  • Wiring at cmd/pyrycode-relay/main.go:95-111 slots correctly between CheckEnvConfig (so a typo'd =true surfaces as the structured per-key error first) and CheckInsecureListenInProduction (deploy-shape misconfig before flag misconfig).
  • Error log shape (ERROR level, "refusing to start: multi-instance deploy detected" literal message, structured err/bypass_env_var/fix fields, return 2) matches the established Check* refusal pattern. AC feat(relay): routing-envelope wrapper type (#1) #2's substring requirement (multi-instance deploy detected and PYRYCODE_RELAY_SINGLE_INSTANCE) is met in both the message and the sentinel's Error() string.
  • Test matrix in single_instance_test.go covers AC relay: WS upgrade for /v1/client — accept phone connection, look up server-id, register #5(a/b/c) plus the bypass-no-signal no-op and three non-exact-"1" rows ("true", "0", " 1"). errors.Is branchability and AC feat(relay): routing-envelope wrapper type (#1) #2 substring asserts are separately locked.
  • env_config_test.go adds the per-key envContracts lock for the new bypass row.

Security review (security-sensitive label)

Architect's ## Security review section is present in the spec with a PASS verdict (trust boundaries, bypass abuse, log injection, resource exhaustion, DoS via misconfig, wire-protocol surface). Implementation matches the spec's findings:

  • No env-var values interpolated into log lines — only the literal bypass env-var name and the sentinel error. No log-injection vector.
  • No new network or wire-protocol surface.
  • // #nosec G101 -- env var name, not a credential annotation on envSingleInstanceBypass is justified inline (the const value is a literal env-var name, not a secret). Commit d91e7e8 was the targeted fix; CI security job is green.
  • Pure function, two env-var reads, package-init allocation only for the sentinel. No per-request cost.

CI

  • go vet ./... — clean
  • go test -race ./internal/relay/... — pass (2.831s)
  • GitHub Actions test / security / image-scan — pass

- new feature doc: single-instance startup self-check (contract,
  wiring sequence, behaviour matrix, scope/limits, threat model)
- new per-ticket note codebase/65.md (implementation, patterns,
  lessons learned including the gosec G101 false-positive)
- env-config-validator.md updated: second row in the registry,
  ordering rationale generalised to all downstream Check* helpers,
  behaviour matrix now covers both env vars
- production-mode.md wiring sequence: CheckSingleInstance inserted
  between CheckEnvConfig and CheckInsecureListenInProduction; the
  Config-struct deferral note bumped to five startup checks
- INDEX.md: one-line summary at the top of Features
@ilmoniemi ilmoniemi merged commit 1054284 into main May 13, 2026
3 checks passed
@ilmoniemi ilmoniemi deleted the feature/65 branch May 13, 2026 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

relay: startup self-check refuses to run as multi-instance deploy

1 participant