ci: speed up PR feedback (remove -p 1, shard e2e, cache, drop multi-arch on PR) #4258

## Context

PR wall-clock to merge readiness is ~13 min, dominated by the `Go test` workflow (12m15s test step) and `E2E tests` (~10m45s). An audit of the workflows plus a local empirical pass shows that most of the saving is achievable without refactoring test code and without sharding. This issue lists the prioritized findings as discrete deliverables.

Related: PR #4250 (CircleCI → GHA migration), commit 05a70569.

## Findings

### 1. Drop `-p 1` from `go-test.yaml` (highest impact, lowest effort)

`.github/workflows/go-test.yaml:28` runs `go test -p 1 ./... -race -coverprofile=c_raw.out`. The `-p 1` was inherited verbatim from `.circleci/config.yml` where it sat *inside* `circleci tests split` (8-way shard). The GHA migration dropped the shard but kept the flag — its original purpose (per-shard serialization) no longer applies.

Audit for inter-package parallelism hazards on the current codebase:

| Hazard                              | Found              |
|-------------------------------------|--------------------|
| `net.Listen` to fixed port          | 0 (2 files call `net.Listen`: `storage/session_memcached_test.go:139` uses `:0`, `network/transport/grpc/connection_manager_test.go:63` uses `bufconn`) |
| `os.Chdir` in tests                 | 0                  |
| `TestMain` setting up shared infra  | 0                  |
| Shared filesystem paths             | 0 (the `./test/*` reads in `auth/services/selfsigned/` are package-local; Go sets cwd per package) |
| `os.MkdirTemp` to fixed path        | 0                  |

`go test -p N` runs N test *binaries* in parallel, each in its own process — so the 49 `os.Setenv`/`t.Setenv` call sites are isolated by process boundary, not a hazard either.

Empirical local run on a 12-core machine:

| Config                       | Wall-clock | Outcome |
|------------------------------|-----------|---------|
| `go test -p 1 -race ./...`   | 7m 04s    | PASS    |
| `go test -p 2 -race ./...`   | 3m 31s    | PASS    |
| `go test -race ./...` (default) | 2m 06s | PASS    |

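GHA `ubuntu-latest` has 2 vCPUs, and `-p` defaults to `GOMAXPROCS`, so dropping `-p 1` lets two test binaries run in parallel. The local `-p 2` run halved wall-clock (7m 04s → 3m 31s); applying the same ratio to CI gives an expected saving of **~6 min/PR** (12m15s → ~6m).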

**Change**: drop `-p 1`. One line.
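
A minimal sketch of the resulting step. Only the command comes from the current workflow; the step name `Go test` and the surrounding layout are assumptions:

```yaml
# .github/workflows/go-test.yaml (fragment; step name is assumed)
- name: Go test
  # No -p flag: go test falls back to its default of GOMAXPROCS
  # test binaries in parallel (2 on a 2-vCPU runner).
  run: go test ./... -race -coverprofile=c_raw.out
```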

### 2. Add Docker layer caching for the e2e build

`.github/workflows/e2e-tests.yaml:55, 67` — both `Build and push` steps lack `cache-from`/`cache-to`. Every PR rebuilds the Go binary from scratch (~1m 46s).

```yaml
- name: Build and push
  uses: docker/build-push-action@v7
  with:
    cache-from: type=gha,scope=e2e
    cache-to: type=gha,scope=e2e,mode=max
    # ... existing fields
```

Dockerfile layer order at `Dockerfile:16-21` already puts `go.mod`/`go.sum` + `go mod download` before `COPY . .`, so the cache is effective on PRs that don't touch `go.sum`. Expected saving: ~1–2 min/PR after warm-up.

### 3. Drop multi-arch on PR runs in `build-images.yaml`

`.github/workflows/build-images.yaml:80` builds `linux/amd64,linux/arm64` on every PR. `arm64` under QEMU is ~5–8× slower than native and accounts for the bulk of the 26 min master run. On PRs, `push` is `false` (line 81), so the run only verifies that the build doesn't break; the `arm64` leg is redundant there.

```yaml
platforms: ${{ github.event_name == 'pull_request' && 'linux/amd64' || 'linux/amd64,linux/arm64' }}
```

The workflow isn't a required check, but it consumes ~20 runner-min/PR. Separately, on `master`/tags, consider switching `arm64` to native `ubuntu-24.04-arm` runners (free for public repos), which would cut the 26 min master build to ~5 min as a parallel two-job matrix.
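
The native-runner idea could look roughly like the sketch below. The job name, matrix keys and the `checkout`/`setup-buildx` versions are assumptions, and the step that would push per-arch images and join them into one multi-arch manifest (for example with `docker buildx imagetools create`) is left out:

```yaml
# Sketch only: job name, matrix keys and helper action versions are assumptions.
jobs:
  build:
    strategy:
      matrix:
        include:
          - platform: linux/amd64
            runner: ubuntu-24.04
          - platform: linux/arm64
            runner: ubuntu-24.04-arm   # native arm64, no QEMU emulation
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - name: Build
        uses: docker/build-push-action@v7
        with:
          context: .
          platforms: ${{ matrix.platform }}   # one platform per job
          push: false                         # build-only in this sketch
```

Each job builds a single platform on matching hardware, so the two legs run in parallel and neither pays the emulation cost.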

### 4. Shard `e2e-tests` by suite (bigger refactor)

`e2e-tests/run-tests.sh` runs 12+ docker-compose suites sequentially (~8m 25s of the 10m 45s job). Suites are independent (separate docker-compose stacks, no shared state).

**Change**:
- Refactor `run-tests.sh` to accept a suite name (`./run-tests.sh oauth-flow`).
- Split `.github/workflows/e2e-tests.yaml` into a `build` job (produces `nuts-node-ci:$SHA`, pushes to GHCR once) → `test-<suite>` matrix jobs (each pulls the image and runs one suite).

A 4-shard split saves ~5 min/PR; the longest suite then sets the critical path.
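
A rough sketch of the split, assuming `run-tests.sh` already accepts a suite name. The job names, the image path under `ghcr.io`, and every suite name except `oauth-flow` are hypothetical placeholders. The sketch also assumes the image is publicly pullable (no registry login in the test job) and that the repository owner name is lowercase:

```yaml
# Sketch only: job names, image path and suite names (except oauth-flow) are hypothetical.
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write          # needed to push to GHCR
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v7
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository_owner }}/nuts-node-ci:${{ github.sha }}

  test:
    needs: build               # image is built once, then reused by every suite
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false         # let the other suites finish if one fails
      matrix:
        suite: [oauth-flow, suite-b, suite-c, suite-d]
    steps:
      - uses: actions/checkout@v4
      - run: docker pull ghcr.io/${{ github.repository_owner }}/nuts-node-ci:${{ github.sha }}
      - name: Run one suite
        working-directory: e2e-tests
        run: ./run-tests.sh "${{ matrix.suite }}"
```

One caveat for this layout: `GITHUB_TOKEN` is read-only on `pull_request` runs from forks, so the GHCR push would only work for branches in the same repository; fork PRs would need a different hand-off, such as passing the image as a build artifact.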

## Suggested actions

| # | Description                                  | Effort | Independent? |
|---|----------------------------------------------|--------|-------------|
| 1 | Remove `-p 1` from `go-test.yaml`            | S      | yes         |
| 2 | Add gha cache to e2e Docker build            | S      | yes         |
| 3 | Conditional `platforms` on `build-images.yaml` | S    | yes         |
| 4 | Shard `e2e-tests` by suite                   | M      | yes         |

(1), (2), (3) can land as three small PRs first; (4) is the bigger structural change.

## Out of scope

- **Fix hardcoded port strings in tests** (used in config-validation, not actual binds — see audit table above for why they're not a parallelism hazard). Refactor would let us go beyond 2-core parallelism without sharding, but isn't needed to reach the ~4–5 min/PR target.
- **Larger runners** (`ubuntu-latest-4-cores`). Linear speedup at 2× per-minute cost — net-neutral on spend, faster wall-clock. Not necessary if (1)+(4) bring PR wall-clock to ~4–5 min.
- **Skip CodeQL on dependency-only PRs**. Not in required checks; out of scope here.

## Related

- Triggering PR: #4250 (CircleCI → GHA migration that introduced the regression)
- Commit that introduced `-p 1` in the new workflow: 05a70569
