Context
PR wall-clock to merge readiness is ~13 min, dominated by the Go test workflow (12m15s test step) and E2E tests (~10m45s). An audit of the workflows plus a local empirical pass shows that the bulk of the saving is achievable without a test-code refactor and without sharding. This issue lists the prioritized findings as discrete deliverables.
Related: PR #4250 (CircleCI → GHA migration), commit 05a7056.
Findings
1. Drop -p 1 from go-test.yaml (highest impact, lowest effort)
.github/workflows/go-test.yaml:28 runs go test -p 1 ./... -race -coverprofile=c_raw.out. The -p 1 was inherited verbatim from .circleci/config.yml where it sat inside circleci tests split (8-way shard). The GHA migration dropped the shard but kept the flag — its original purpose (per-shard serialization) no longer applies.
Audit for inter-package parallelism hazards on the current codebase:
| Hazard | Found |
|---|---|
| net.Listen to fixed port | 0 (2 files call net.Listen: storage/session_memcached_test.go:139 uses :0, network/transport/grpc/connection_manager_test.go:63 uses bufconn) |
| os.Chdir in tests | 0 |
| TestMain setting up shared infra | 0 |
| Shared filesystem paths | 0 (the ./test/* reads in auth/services/selfsigned/ are package-local; Go sets cwd per package) |
| os.MkdirTemp to fixed path | 0 |
go test -p N runs up to N test binaries in parallel, each in its own process, so the 49 os.Setenv/t.Setenv call sites are isolated by the process boundary and are not a hazard either.
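To illustrate why the `:0` entry in the audit table is safe under parallel package runs, here is a minimal sketch. It is not taken from the repository: `listenEphemeral` is a hypothetical helper, and the two listeners held open at once only stand in for two test binaries running side by side.

```go
package main

import (
	"fmt"
	"net"
)

// listenEphemeral (hypothetical helper) binds to port 0 on loopback, which
// asks the kernel to pick any free port, and returns the listener together
// with the port it was given.
func listenEphemeral() (net.Listener, int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, err
	}
	return ln, ln.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	// Two listeners open at the same time stand in for two test binaries
	// started in parallel by go test -p 2.
	a, portA, err := listenEphemeral()
	if err != nil {
		panic(err)
	}
	defer a.Close()

	b, portB, err := listenEphemeral()
	if err != nil {
		panic(err)
	}
	defer b.Close()

	fmt.Println("ports differ:", portA != portB)
}
```

A fixed port such as `:8080` would make the second bind fail while the first listener is open, which is the hazard the audit looked for and did not find.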
Empirical local run on 12-core machine:
| Config | Wall-clock | Outcome |
|---|---|---|
| go test -p 1 -race ./... | 7m 04s | PASS |
| go test -p 2 -race ./... | 3m 31s | PASS |
| go test -race ./... (default) | 2m 06s | PASS |
GHA ubuntu-latest is 2 vCPUs, so dropping -p 1 lets go test fall back to its default -p value, GOMAXPROCS, which is 2 there. Expected CI saving: ~6 min/PR (12m15s → ~6m), in line with the 2× speedup measured locally between -p 1 and -p 2.
Change: drop -p 1. One line.
2. Add Docker layer caching for the e2e build
.github/workflows/e2e-tests.yaml:55, 67 — both Build and push steps lack cache-from/cache-to. Every PR rebuilds the Go binary from scratch (~1m 46s).
    - name: Build and push
      uses: docker/build-push-action@v7
      with:
        cache-from: type=gha,scope=e2e
        cache-to: type=gha,scope=e2e,mode=max
        # ... existing fields
Dockerfile layer order at Dockerfile:16-21 already puts go.mod/go.sum + go mod download before COPY . ., so the dependency layer stays cached on PRs that don't touch go.mod or go.sum. Expected saving: ~1–2 min/PR after warm-up.
3. Drop multi-arch on PR runs in build-images.yaml
.github/workflows/build-images.yaml:80 builds linux/amd64,linux/arm64 on every PR. arm64 under QEMU is ~5–8× slower than native and accounts for the bulk of the 26 min master run. On PRs, push is false (line 81), so the job only verifies that the build doesn't break; the arm64 build is redundant there.
platforms: ${{ github.event_name == 'pull_request' && 'linux/amd64' || 'linux/amd64,linux/arm64' }}
The workflow isn't a required check, but it consumes ~20 runner-min/PR. Separately, on master/tags, consider switching arm64 to native ubuntu-24.04-arm runners (free for public repos); as a parallel two-job matrix this cuts the 26 min master build to ~5 min.
4. Shard e2e-tests by suite (bigger refactor)
e2e-tests/run-tests.sh runs 12+ docker-compose suites sequentially (~8m 25s of the 10m 45s job). Suites are independent (separate docker-compose stacks, no shared state).
Change:
- Refactor run-tests.sh to accept a suite name (./run-tests.sh oauth-flow).
- Split .github/workflows/e2e-tests.yaml into a build job (produces nuts-node-ci:$SHA, pushes to GHCR once) → test-<suite> matrix jobs (each pulls the image and runs one suite).
A 4-shard split should save ~5 min/PR, with the longest suite left on the critical path.
Suggested actions
| # | Description | Effort | Independent? |
|---|---|---|---|
| 1 | Remove -p 1 from go-test.yaml | S | yes |
| 2 | Add gha cache to e2e Docker build | S | yes |
| 3 | Conditional platforms on build-images.yaml | S | yes |
| 4 | Shard e2e-tests by suite | M | yes |
(1), (2), (3) can land as three small PRs first; (4) is the bigger structural change.
Out of scope
- Fix hardcoded port strings in tests (used in config-validation, not actual binds — see audit table above for why they're not a parallelism hazard). Refactor would let us go beyond 2-core parallelism without sharding, but isn't needed to reach the ~4–5 min/PR target.
- Larger runners (ubuntu-latest-4-cores). Linear speedup at 2× per-minute cost: net-neutral on spend, faster wall-clock. Not necessary if (1)+(4) bring PR wall-clock to ~4–5 min.
- Skip CodeQL on dependency-only PRs. Not in required checks; out of scope here.
Related
- -p 1 in the new workflow: commit 05a7056