Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 37 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -38,3 +38,40 @@ jobs:
run: gosec ./...
- name: govulncheck
run: govulncheck ./...

image-scan:
runs-on: ubuntu-latest
# Belt-and-suspenders: the workflow-level permissions block above
# already grants only `contents: read`, but a future top-level
# escalation would silently widen this job's scope without the
# explicit declaration. Pin the minimum here so a regression in the
# workflow header doesn't grant the scanner write tokens.
permissions:
contents: read
steps:
- uses: actions/checkout@v6
- name: build image
# The scan operates on the same artifact a PR would publish.
run: docker build -t pyrycode-relay:${{ github.sha }} .
# Tracks: aquasecurity/trivy-action@v0.36.0
# Pinned by commit SHA so a tag-swap upstream cannot change what
# runs against our image between Renovate bumps. Refresh the
# comment in lockstep with the SHA so a reviewer can sanity-check
# which release the digest refers to. Same convention as the
# Dockerfile base-image digest pins (#32).
- name: trivy image scan
uses: aquasecurity/trivy-action@ed142fd0673e97e23eac54620cfb913e5ce36c25
with:
image-ref: pyrycode-relay:${{ github.sha }}
format: table
severity: CRITICAL,HIGH
# Only fixable CVEs fail the build. CVEs with no upstream
# patch produce no actionable work and would block every PR
# until someone unrelated to the change fixes them.
ignore-unfixed: true
# Cover both OS packages and language manifests / Go binary
# content Trivy can re-derive. Some overlap with govulncheck
# (#41) is intentional: govulncheck audits source + reachability;
# Trivy audits what physically shipped in the image.
vuln-type: os,library
exit-code: '1'
2 changes: 1 addition & 1 deletion docs/knowledge/INDEX.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ One-line pointers into the evergreen knowledge base. Newest entries at the top o

- [Connection-count gauges](features/connection-count-gauges.md) — `pyrycode_relay_connected_binaries` and `pyrycode_relay_connected_phones` exposed via a pull-based `prometheus.Collector` reading `Registry.Counts()` on each scrape; zero edits to `registry.go`; scalar (no labels) by design — `{server="..."}` would carry the attacker-influenced `x-pyrycode-server` header onto the metrics surface, which threat-model § Log hygiene forbids; stale grace-expiry fires can't move the gauge because the pointer-identity guard (ADR-0006) keeps the maps unchanged and the gauge IS the map size; race-tested against 16 mutator goroutines + a tight-loop scraper under `-race`. First collector wired into the #59 seam (#61).
- [Metrics registry (scaffolding)](features/metrics-registry.md) — private `*prometheus.Registry` + `NewMetricsHandler` factory wrapping `promhttp.HandlerFor` (text format only; OpenMetrics off; `HandlerOpts.Registry: reg` keeps `promhttp_metric_handler_*` off `DefaultRegisterer`). Seam shape for siblings: per-concern collector struct in its own file, constructed by a helper taking `prometheus.Registerer` (no mega-struct, no package-level vars) — first instantiated by #61's `connectionsCollector`. Listener still pending (#60). Structural defence against default-registry leaks via `TestMetricsRegistry_NoGlobalRegistrarLeak` (#59).
- [Docker image](features/docker-image.md) — portable OCI artifact: multi-stage `Dockerfile` builds a fully-static binary (`CGO_ENABLED=0`, `-trimpath -s -w`) into `distroless/static-debian12:nonroot`; both base images digest-pinned with `# Tracks:` comments; exposes `:80`/`:443` and declares `/var/lib/relay/autocert` volume; host-specific wiring (TLS policy, ports, volumes, healthcheck) is #38's problem (#32).
- [Docker image](features/docker-image.md) — portable OCI artifact: multi-stage `Dockerfile` builds a fully-static binary (`CGO_ENABLED=0`, `-trimpath -s -w`) into `distroless/static-debian12:nonroot`; both base images digest-pinned with `# Tracks:` comments; exposes `:80`/`:443` and declares `/var/lib/relay/autocert` volume; host-specific wiring (TLS policy, ports, volumes, healthcheck) is #38's problem (#32). PR-time Trivy CVE scan against the just-built image lives in CI as the `image-scan` job, fails on **fixable** CRITICAL/HIGH only (`ignore-unfixed: true`), action pinned by commit SHA with `# Tracks: <tag>` comment mirroring the Dockerfile pin convention; intentional overlap with `govulncheck` (source-reachability vs. shipped-artifact) (#68).
- [Binary-side frame forwarder](features/binary-forwarder.md) — per-binary read pump: unwraps each inbound routing envelope, linear-scans `PhonesFor(serverID)` for `env.ConnID`, writes `env.Frame` verbatim to that phone; opaque inner bytes; synchronous (handler discards the return); diverges from #25 in error policy — unknown `conn_id`, malformed envelope, phone `Send` error all log+continue (a single bad frame never tears down the binary); replaced `/v1/server`'s `CloseRead` placeholder (#26).
- [WebSocket heartbeat](features/heartbeat.md) — per-conn goroutine on both endpoints sends RFC 6455 ping every 30s; closes with `1011 "heartbeat timeout"` if no pong within 30s. Detects half-open TCP within 60s; ctx-cancel exit path leaves close to the handler defer (#7).
- [Phone-side frame forwarder](features/phone-forwarder.md) — per-phone read pump: wraps each inbound phone frame in the routing envelope keyed by the phone's `conn_id` and `Send`s it to the binary holding `serverID`; opaque inner bytes; synchronous (handler discards the return); replaced `/v1/client`'s `CloseRead` placeholder; added `WSConn.Read` (single-caller) (#25).
Expand Down
52 changes: 52 additions & 0 deletions docs/knowledge/codebase/68.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# Ticket #68 — PR-time Trivy image CVE scan in `ci.yml`

Adds the runtime-image half of CVE scanning to CI. A new `image-scan` job in `.github/workflows/ci.yml` builds the relay image and runs `aquasecurity/trivy-action` against it; the job fails when any **fixable** CRITICAL/HIGH CVE is detected. Complements `govulncheck` (#41, Go source + reachability) by auditing the OS packages and Go-binary content that physically shipped in the image — the gap #32's Dockerfile-hardening ticket explicitly deferred.

## Implementation

- **`.github/workflows/ci.yml`** — appended a third top-level job, `image-scan`, alongside the existing `test` and `security` jobs. Job declares its own `permissions: contents: read` block (belt-and-suspenders against a future workflow-header widening). Two steps: `docker build -t pyrycode-relay:${{ github.sha }} .` then `aquasecurity/trivy-action@ed142fd0673e97e23eac54620cfb913e5ce36c25` (tracks `v0.36.0`).
- **Trivy inputs locked to the AC's failure policy:**
- `image-ref: pyrycode-relay:${{ github.sha }}` — reads the just-built image from the local Docker daemon (no push).
- `severity: CRITICAL,HIGH` — narrows the noise surface to what the AC fails on.
- `ignore-unfixed: true` — only CVEs with a known upstream fix fail the build. AC #2 verbatim; the architect's spec lists three reasons this is the right policy and not just a literal AC read (actionability, Renovate is the right tool for unfixed CVEs, keeps `.trivyignore` pristine for genuine false-positives).
- `vuln-type: os,library` — covers distroless's small OS-package set AND language-manifest / Go-binary content Trivy re-derives. Overlap with `govulncheck` is intentional: source-reachability vs. what-actually-linked-in are two views of the same surface.
- `exit-code: '1'` — fails the job on findings.
- `format: table` — human-readable in the step log; no SARIF upload (would require `security-events: write` and is deferred per the spec's open questions).
- **No `.trivyignore`** — per AC, the allow-list is added only if a real false-positive surfaces. Initial wiring produced none, so nothing committed.
- **Action pinned by commit SHA, not tag** — parity with the Dockerfile base-image digest pins (#32). `# Tracks: aquasecurity/trivy-action@v0.36.0` comment alongside names the upstream tag the SHA was taken from so a Renovate bump's new SHA is reviewable against the new tag.
- **No Go code changes; no Makefile changes.** Local-side `make scan-image` parity is out of scope (would add a Trivy install dependency for every `make lint` contributor).

## Patterns established

- **Pinned-action + `# Tracks: <tag>` comment for any GitHub Action in CI.** Same convention as Dockerfile base digests (#32). Renovate keeps the SHA fresh; the comment lets a reviewer sanity-check that a SHA bump corresponds to the named upstream tag. New action additions should follow this two-line shape.
- **Dedicated CI job per scanner surface.** Source-side scanners (`gosec`, `govulncheck`) stay in `security`; image-side scanners get their own `image-scan` job. One red ✗ per audit surface gives an unambiguous failure signal and lets future image-side work (Cosign verification, SBOM upload, periodic re-scan) bolt on without disturbing the others. The three jobs run in parallel; total wall-clock unchanged.
- **`ignore-unfixed: true` as the default policy for CI vuln scanners.** Fail-on-unactionable is dev friction with no security benefit; the only "fix" available is suppression, which erodes the allow-list. Renovate-driven base bumps will reclassify unfixed CVEs as fixable when upstream ships a patch, then the build fails at the right moment.
- **Job-level `permissions:` block even when the workflow header already declares the minimum.** Belt-and-suspenders: a future top-level escalation (e.g. adding `id-token: write` for a Cosign step in some other job) would silently widen this job's scope without the explicit declaration.

## Verification

AC-defined verification paths (not committed as fixtures; the negative case requires a deliberately-broken Dockerfile that must not land on `main`):

1. **Known-good (positive).** Push this implementation as a normal PR — `image-scan` passes green; step log shows the Trivy table with zero entries in the CRITICAL/HIGH fixable columns. Current Dockerfile digests are recent enough that this is the expected outcome.
2. **Vulnerable (negative).** On a throwaway branch, swap the build-stage `FROM` to a known-old tag (e.g. `golang:1.16`), push, observe the `image-scan` job fail with the CVE list visible in the step log. Revert before merging.

## Out of scope (delegated)

- **Periodic re-scan** — catches CVEs disclosed *after* a PR merges against bases that haven't changed. Split out to a follow-up ticket and depends on this one landing first.
- **`govulncheck` reachable-Go-vulns scanning** — #41, already covered in the `security` job.
- **SARIF upload to GitHub's Security tab** — would widen permissions to `security-events: write`; deferred to a separate ticket with its own AC.
- **Scanning the build stage** — discarded at the multi-stage boundary; auditing it is noise.
- **`make scan-image` local target** — local convenience; can land in a follow-up if contributors ask. Would mirror this CI job's flags exactly.

## Lessons learned

- **Pin GitHub Actions by commit SHA, not version tag.** A maliciously- or accidentally-republished tag can re-point a `@v0.36.0` reference between PRs; the SHA cannot. The `# Tracks:` comment is the readable companion that keeps Renovate bumps reviewable. Same shape as `Dockerfile` base digests — repeat anywhere a third-party action joins CI.
- **`ignore-unfixed` is load-bearing, not optional polish.** Without it, the first unactionable CRITICAL upstream-with-no-patch blocks every PR until someone unrelated to the change cuts a `.trivyignore` entry. The suppression-discipline erosion that follows is the real failure mode. Default-noisy at the scanner level, default-pristine at the allow-list level.
- **Two scanners can intentionally overlap.** `govulncheck` and Trivy will sometimes flag the same Go-module vuln and sometimes only one will catch it — different vuln-DB sources, different reachability heuristics, different inclusion rules. For an internet-exposed relay, the redundancy is worth a few seconds of CI time and should not be "deduplicated."

## Cross-links

- [Architect spec](../../specs/architecture/68-trivy-image-scan.md) — full design rationale, alternatives considered, security review.
- [Feature: Docker image](../features/docker-image.md) — the artifact this job scans; the "No image scanning wiring" deferral noted in #32 closes with this ticket.
- [Ticket #32 codebase notes](32.md) — Dockerfile base hardening that established the digest-pin + `# Tracks:` convention this ticket mirrors for the action pin.
- [Threat model](../../threat-model.md) — § *Supply chain — Go dependencies*; this scan is the runtime-image complement to `govulncheck`, closing the OS-package gap the residual-risk paragraph names.
6 changes: 5 additions & 1 deletion docs/knowledge/features/docker-image.md
Original file line number Diff line number Diff line change
Expand Up @@ -70,12 +70,16 @@ The image layer contributes hardening to three operational surfaces (see `docs/t
- **No host-specific config.** No `fly.toml`, `compose.yaml`, k8s manifest, or systemd unit — those live in #38.
- **No single-instance enforcement.** The relay's binary-slot single-instance constraint is #39's problem; the image can be run N times, but only one will hold the slot.
- **No startup security-posture self-check.** #42 covers runtime self-validation.
- **No image scanning wiring.** Trivy / Grype in CI is a separate ticket, not yet filed.
- **No `--cert-cache` baked in.** The default (`/home/nonroot/.pyrycode-relay/certs`) is only relevant for `--version` smoke tests; real deployments pass `--cert-cache /var/lib/relay/autocert` via the host manifest.

## CI image scanning

PR-time scanning is wired in `.github/workflows/ci.yml` as the `image-scan` job (#68). Each PR builds the image locally as `pyrycode-relay:${{ github.sha }}` and runs `aquasecurity/trivy-action` (commit-SHA pinned, `# Tracks: <upstream-tag>` comment alongside — same convention as the Dockerfile base-image digest pins) against it. The job fails on **fixable** CRITICAL/HIGH CVEs only (`ignore-unfixed: true`); unfixed CVEs print but don't block. Covers `os,library` vuln types — distroless's OS package set plus Go-binary content Trivy re-derives, intentionally overlapping with `govulncheck`'s source-reachability view. Periodic re-scan (catches CVEs disclosed after merge against unchanged bases) is a follow-up ticket.

## Cross-links

- [Ticket #32 codebase notes](../codebase/32.md) — what landed in this ticket.
- [Ticket #68 codebase notes](../codebase/68.md) — PR-time Trivy image CVE scan in CI.
- [Spec: 32-dockerfile-base-hardening](../../specs/architecture/32-dockerfile-base-hardening.md) — architect's design and security review.
- [Threat model](../../threat-model.md) — § *Deploy security*, § *Supply chain*, § *Cert & key handling* are the surfaces this image layer hardens.
- [Autocert TLS](autocert-tls.md) — what the `:80` / `:443` exposure and `/var/lib/relay/autocert` mount feed.
Expand Down
Loading