ci: speed up PR feedback (remove -p 1, shard e2e, cache, drop multi-arch on PR) #4258

@reinkrul

Context

PR wall-clock time to merge readiness is ~13 min, dominated by the Go test workflow (12m15s test step) and the E2E tests (~10m45s). An audit of the workflows plus a local empirical run shows that the bulk of the possible saving is achievable without refactoring test code and without sharding. This issue lists the findings in priority order as discrete deliverables.

Related: PR #4250 (CircleCI → GHA migration), commit 05a7056.

Findings

1. Drop -p 1 from go-test.yaml (highest impact, lowest effort)

.github/workflows/go-test.yaml:28 runs go test -p 1 ./... -race -coverprofile=c_raw.out. The -p 1 was inherited verbatim from .circleci/config.yml where it sat inside circleci tests split (8-way shard). The GHA migration dropped the shard but kept the flag — its original purpose (per-shard serialization) no longer applies.

Audit for inter-package parallelism hazards on the current codebase:

| Hazard | Found |
|---|---|
| net.Listen to fixed port | 0 (2 files call net.Listen: storage/session_memcached_test.go:139 uses :0, network/transport/grpc/connection_manager_test.go:63 uses bufconn) |
| os.Chdir in tests | 0 |
| TestMain setting up shared infra | 0 |
| Shared filesystem paths | 0 (the ./test/* reads in auth/services/selfsigned/ are package-local; Go sets cwd per package) |
| os.MkdirTemp to fixed path | 0 |

go test -p N runs up to N test binaries in parallel, each in its own process, so the 49 os.Setenv/t.Setenv call sites are isolated by the process boundary and are not a hazard either.
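The net.Listen row in the audit rests on port 0 meaning "let the kernel pick a free port", which is why a :0 bind cannot collide across parallel test binaries. A minimal sketch of that behaviour follows; the helper names listenEphemeral and distinctPorts are illustrative and are not taken from the repository.

```go
package main

import (
	"fmt"
	"net"
)

// listenEphemeral binds to port 0 on loopback, which asks the kernel for
// any free port, and returns the listener plus the port it was given.
// (Illustrative helper, not code from the repository.)
func listenEphemeral() (net.Listener, int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, err
	}
	return ln, ln.Addr().(*net.TCPAddr).Port, nil
}

// distinctPorts holds two ephemeral listeners open at the same time and
// reports whether the kernel assigned them two different, non-zero ports.
func distinctPorts() (bool, error) {
	a, portA, err := listenEphemeral()
	if err != nil {
		return false, err
	}
	defer a.Close()

	b, portB, err := listenEphemeral()
	if err != nil {
		return false, err
	}
	defer b.Close()

	return portA != 0 && portB != 0 && portA != portB, nil
}

func main() {
	ok, err := distinctPorts()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	fmt.Println("distinct ports:", ok)
}
```

A hard-coded port such as :8080 would instead fail with "address already in use" as soon as two packages bound it concurrently, which is the hazard the audit looked for and did not find.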

Empirical local run on a 12-core machine:

| Config | Wall-clock | Outcome |
|---|---|---|
| go test -p 1 -race ./... | 7m 04s | PASS |
| go test -p 2 -race ./... | 3m 31s | PASS |
| go test -race ./... (default) | 2m 06s | PASS |

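GHA ubuntu-latest has 2 vCPUs, so dropping -p 1 lets go test fall back to its default of -p GOMAXPROCS, which is 2 on that runner. Expected CI saving: ~6 min/PR (12m15s → ~6m), assuming the runner speeds up by about the same factor as the local -p 2 run.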
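The ~6 min estimate can be reproduced by scaling the current CI test step by the local -p 1 to -p 2 ratio. The sketch below does only that arithmetic; projectCI is a hypothetical helper, and the assumption that a 2-vCPU runner scales like the local -p 2 run is a simplification rather than a measurement.

```go
package main

import (
	"fmt"
	"time"
)

// projectCI scales a CI duration by the speed-up measured locally.
// Assumption: the CI runner speeds up by the same factor as the local
// machine did between the two configurations.
func projectCI(ci, localBefore, localAfter time.Duration) time.Duration {
	scaled := float64(ci) * float64(localAfter) / float64(localBefore)
	return time.Duration(scaled).Round(time.Second)
}

func main() {
	ci := 12*time.Minute + 15*time.Second     // current CI test step
	localP1 := 7*time.Minute + 4*time.Second  // local run with -p 1
	localP2 := 3*time.Minute + 31*time.Second // local run with -p 2

	projected := projectCI(ci, localP1, localP2)
	fmt.Println("projected:", projected, "saving:", ci-projected)
}
```

With the figures from the table this prints `projected: 6m6s saving: 6m9s`, in line with the ~6 min/PR estimate.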

Change: drop -p 1. One line.

2. Add Docker layer caching for the e2e build

.github/workflows/e2e-tests.yaml:55, 67 — both Build and push steps lack cache-from/cache-to. Every PR rebuilds the Go binary from scratch (~1m 46s).

- name: Build and push
  uses: docker/build-push-action@v7
  with:
    cache-from: type=gha,scope=e2e
    cache-to: type=gha,scope=e2e,mode=max
    # ... existing fields

The layer order at Dockerfile:16-21 already puts go.mod/go.sum and go mod download before COPY . ., so the cache is effective on PRs that don't touch go.mod or go.sum. Expected saving: ~1–2 min/PR after warm-up.

3. Drop multi-arch on PR runs in build-images.yaml

.github/workflows/build-images.yaml:80 builds linux/amd64,linux/arm64 on every PR. arm64 under QEMU is ~5–8× slower than native and accounts for the bulk of the 26 min master run. On PRs, push is false (line 81), so the job only verifies that the build doesn't break; the arm64 build is redundant there.

platforms: ${{ github.event_name == 'pull_request' && 'linux/amd64' || 'linux/amd64,linux/arm64' }}

The workflow isn't a required check, but it consumes ~20 runner-min/PR. Separately, for master/tags, consider building arm64 on native ubuntu-24.04-arm runners (free for public repos): as a parallel two-job matrix this would cut the 26 min master build to ~5 min.

4. Shard e2e-tests by suite (bigger refactor)

e2e-tests/run-tests.sh runs 12+ docker-compose suites sequentially (~8m 25s of the 10m 45s job). Suites are independent (separate docker-compose stacks, no shared state).

Change:

  • Refactor run-tests.sh to accept a suite name (./run-tests.sh oauth-flow).
  • Split .github/workflows/e2e-tests.yaml into a build job (produces nuts-node-ci:$SHA, pushes to GHCR once) → test-<suite> matrix jobs (each pulls the image and runs one suite).

A 4-shard split would save ~5 min/PR; the longest suite then sits on the critical path.
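To see why the slowest shard bounds the wall-clock time, the sketch below assigns suites to shards longest-first and reports the slowest shard. The suite names and durations are placeholders, not measurements: they were chosen only so that they sum to the 8m 25s sequential time quoted above. The greedy assignment is one possible balancing heuristic, not something the issue prescribes.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// suite is one independently runnable e2e suite and its duration.
type suite struct {
	name     string
	duration time.Duration
}

// shardTotals assigns suites to n shards, longest suite first, always
// placing the next suite on the currently least-loaded shard.
func shardTotals(suites []suite, n int) []time.Duration {
	if n < 1 {
		n = 1
	}
	sorted := append([]suite(nil), suites...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].duration > sorted[j].duration
	})
	totals := make([]time.Duration, n)
	for _, s := range sorted {
		least := 0
		for i := range totals {
			if totals[i] < totals[least] {
				least = i
			}
		}
		totals[least] += s.duration
	}
	return totals
}

// criticalPath returns the slowest shard, i.e. the wall-clock time of
// the parallel run (ignoring per-job setup such as pulling the image).
func criticalPath(totals []time.Duration) time.Duration {
	var longest time.Duration
	for _, t := range totals {
		if t > longest {
			longest = t
		}
	}
	return longest
}

func main() {
	// Placeholder durations (hypothetical), summing to 8m25s.
	suites := []suite{
		{"suite-a", 120 * time.Second}, {"suite-b", 90 * time.Second},
		{"suite-c", 80 * time.Second}, {"suite-d", 70 * time.Second},
		{"suite-e", 60 * time.Second}, {"suite-f", 40 * time.Second},
		{"suite-g", 30 * time.Second}, {"suite-h", 15 * time.Second},
	}
	fmt.Println("critical path:", criticalPath(shardTotals(suites, 4)))
}
```

With these placeholder inputs the slowest of four shards takes 2m15s. Real savings will be smaller than the difference to 8m 25s, because every matrix job also pays for checkout, image pull, and stack start-up.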

Suggested actions

| # | Description | Effort | Independent? |
|---|---|---|---|
| 1 | Remove -p 1 from go-test.yaml | S | yes |
| 2 | Add gha cache to e2e Docker build | S | yes |
| 3 | Conditional platforms on build-images.yaml | S | yes |
| 4 | Shard e2e-tests by suite | M | yes |

(1), (2), (3) can land as three small PRs first; (4) is the bigger structural change.

Out of scope

  • Fix hardcoded port strings in tests (used in config-validation, not actual binds — see audit table above for why they're not a parallelism hazard). Refactor would let us go beyond 2-core parallelism without sharding, but isn't needed to reach the ~4–5 min/PR target.
  • Larger runners (ubuntu-latest-4-cores). Linear speedup at 2× per-minute cost — net-neutral on spend, faster wall-clock. Not necessary if (1)+(4) bring PR wall-clock to ~4–5 min.
  • Skip CodeQL on dependency-only PRs. Not in required checks; out of scope here.
