
Commit 2781815

making deploy quicker - less intervals, observability now optional (#123)
* making deploy quicker - less intervals, observability now optional

* Changes made

  stack-tests.yml — Build, Test, Push
  - Removed dev from branch triggers (only main now)
  - Added tags: ['v*'] and cert-generator/** path triggers
  - Build job now pushes to GHCR with an immutable sha-{sha} tag (push events only)
  - Added missing pre-builds: event-replay, dlq-processor, zookeeper-certgen (these were previously being rebuilt during compose startup)
  - Added frontend-prod build (from Dockerfile.prod, pushed as frontend:sha-xxx for Trivy scanning)
  - E2E jobs pull from GHCR on push events (parallel docker pull, then retag to compose names), falling back to the artifact for PRs
  - All push/pull commands are spelled out explicitly (no for loops)
  - Added packages: write permission to the build job

  docker.yml — Scan & Promote (rewritten)
  - Trigger: workflow_run on "Stack Tests" completion (plus workflow_dispatch with an optional SHA input)
  - Only runs when Stack Tests succeed on main
  - Scan jobs: Trivy scans backend and frontend-prod from GHCR using the SHA tag
  - Promote job: crane copy sha-xxx → latest for all 12 images — a registry-level manifest copy, no rebuild
  - latest is NEVER set during build — only after all tests + scans pass

  Flow
  - Push to main:
    - stack-tests.yml: unit → build (push sha-xxx to GHCR) → E2E (pull from GHCR)
    - docker.yml: (on success) → scan → promote sha-xxx → latest
  - PR:
    - stack-tests.yml: unit → build (save artifact) → E2E (load artifact)
    - docker.yml: (skipped — only triggers on main)

* Replaced two separate scan jobs (scan-backend, scan-frontend) with a single matrix job (scan) that scans all 12 images in parallel:
  - fail-fast: false — one image's vulnerability findings don't cancel the other scans
  - Each matrix entry runs as its own parallel job on a separate runner
  - SARIF results uploaded per image with unique categories (trivy-base, trivy-backend, etc.)
  - trivyignores: 'backend/.trivyignore' applied to all images (CVE exemptions are image-agnostic)
  - checkout@v6 included so the .trivyignore file is available

  Updated promote.needs from [scan-backend, scan-frontend] to [scan] — promote now waits for all 12 matrix entries to pass before promoting anything to latest.

  Updated the summary security section to reflect that all 12 images are scanned.

* What changed
  - docker-compose.yaml (+15 lines): Every buildable service now has an image: field pointing to ghcr.io/hardmax71/integr8scode/{service}:${IMAGE_TAG:-latest} (see the sketch below). kafka-init and user-seed share the backend image. Compose now knows where to pull pre-built images from.
  - deploy.sh (+10 lines): Added a --no-build flag to cmd_dev(). It passes --no-build to compose, preventing any build fallback.
  - stack-tests.yml (-149 lines):
    - Build job: push condition changed from event_name != 'pull_request' to !github.event.pull_request.head.repo.fork (same-repo PRs can push to GHCR). Artifact save/upload removed entirely.
    - Both E2E jobs: deleted all GHCR login, parallel pull, retag, artifact download, and load steps; replaced with a single IMAGE_TAG env var on the "Start stack" step. Compose pulls SHA-tagged images from GHCR automatically using the image: fields.
    - Both E2E jobs have an if: !fork guard — fork PRs skip E2E (unit tests still run).
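For illustration, a minimal docker-compose sketch of the image: + IMAGE_TAG pattern described above. The two services and their build contexts shown here are assumptions; only the image-name pattern comes from this commit:

```yaml
# Sketch: each buildable service keeps its build: section but also names the
# GHCR image it maps to, tagged by IMAGE_TAG (CI exports IMAGE_TAG=sha-xxx,
# local runs fall back to latest).
services:
  backend:
    build: ./backend            # assumed build context
    image: ghcr.io/hardmax71/integr8scode/backend:${IMAGE_TAG:-latest}
  frontend:
    build: ./frontend           # assumed build context
    image: ghcr.io/hardmax71/integr8scode/frontend:${IMAGE_TAG:-latest}
```

With the image: fields in place, IMAGE_TAG=sha-xxx docker compose up -d --no-build pulls the CI-built images instead of rebuilding anything.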
  How it works

  | Scenario | What happens |
  |----------|--------------|
  | ./deploy.sh dev (local, first time) | Compose pulls latest from GHCR — no build needed |
  | ./deploy.sh dev --build (local, with changes) | Builds locally, tags with the GHCR name |
  | CI: IMAGE_TAG=sha-xxx ./deploy.sh dev --no-build --wait | Compose pulls sha-tagged images from GHCR |
  | ./deploy.sh prod | Helm uses GHCR images (unchanged) |

* 1. Playwright sharding (frontend-e2e)
  - Added strategy.matrix with shardIndex: [1, 2] and shardTotal: [2]
  - fail-fast: false so one shard failing doesn't cancel the other
  - Test command: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }}
  - Artifact names include the shard index to avoid collisions: playwright-report-1, playwright-report-2, frontend-e2e-logs-1, etc.
  - Job name shows the shard: Frontend E2E (1/2), Frontend E2E (2/2)

  2. GHCR pre-pull (both E2E jobs)
  - Immediately after checkout, docker compose pull --quiet starts in the background via nohup
  - While the GHCR images pull, the subsequent setup steps run in parallel:
    - backend-e2e: Docker cache load + k3s install (~85s of overlap)
    - frontend-e2e: Node setup + npm ci + Playwright install + Docker cache + k3s (~150s of overlap)
  - A "Wait for GHCR images" step before "Start stack" ensures the pull is complete
  - "Start stack" then finds the images already local — it skips pulling entirely

* Here's what this adds — infrastructure pre-warming.

  How it works: both E2E jobs now have this timeline:

  Step 2:   Pre-pull GHCR images ─────────────────────── (background)
  Step 3-7: Node/Playwright/Docker cache setup ────────── (foreground, ~50s)
  Step 8:   Docker-cache loads infra images ───────────── (~15s)
  Step 9:   Pre-warm infrastructure ───────────────────── (background, starts immediately)
            ├── mongo + redis start (~5s to healthy)
            ├── shared-ca + cert-gen + zk-certgen start (~5s)
            ├── zookeeper starts after zk-certgen (~15s)
            ├── kafka starts after zookeeper healthy (~20s)
            └── schema-registry starts after kafka (~10s)
  Step 10:  k3s install ───────────────────────────────── (~42s, OVERLAPS with infra chain)
  Step 12:  Wait for background tasks ─────────────────── (both should be done)
  Step 13:  Start stack ───────────────────────────────── (infra already healthy, only app services)

  Expected impact on "Start stack":

  | Component | Before | After |
  |-----------|--------|-------|
  | Infra initialization (zk chain) | ~50s (during Start stack) | 0s (already done during k3s) |
  | App image pull | ~60s | 0s (pre-pulled) |
  | App service startup | ~30s | ~30s |
  | Health check waits | ~20s | ~20s |
  | Total "Start stack" | ~2:20 | ~0:50 |

* The root cause: the cert-generator service in docker-compose.yaml mounts ~/.kube:/root/.kube. When Docker creates that bind-mount source directory, it creates it as root:root. Then k3s-setup's sudo k3s kubectl config view --raw > /home/runner/.kube/config fails, because the shell redirect (>) runs as the runner user, who can't write to the root-owned directory.
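The fix that matches this root cause (it later shows up in the new e2e-ready action) is to create the directory as the runner user before the redirect writes through it. A minimal step sketch; the step name is illustrative:

```yaml
- name: Write kubeconfig (excerpt)
  shell: bash
  run: |
    # Pre-create ~/.kube as the runner user so the compose bind mount cannot
    # create it root-owned first; the redirect below runs as runner, not root.
    mkdir -p /home/runner/.kube
    sudo k3s kubectl config view --raw > /home/runner/.kube/config
```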
* Backend E2E step reorder (stack-tests.yml):

  | Before | After |
  |--------|-------|
  | 1. checkout | 1. checkout |
  | 2. GHCR pre-pull (bg) | 2. GHCR pre-pull (bg) |
  | 3. docker-cache | 3. config copy (moved up) |
  | 4. infra pre-warm (bg) | 4. Install k3s (split from composite) |
  | 5. k3s-setup (composite, ~45s blocking) | 5. docker-cache (runs during k3s boot) |
  | 6. config copy | 6. infra pre-warm (bg) |
  | 7. wait for bg | 7. Finalize k3s (~25s+ after install) |
  | 8. start stack | 8. wait for bg |
  | | 9. start stack |

  Key gain: k3s boot (30s) now overlaps with docker-cache (10-18s) instead of blocking sequentially. The composite k3s-setup action is inlined as "Install k3s" + "Finalize k3s", the same pattern as frontend-e2e.

  Complete optimization summary across both files:
  1. docker-compose.yaml — Tightened health check intervals (5s → 2-3s) and start periods (10s → 3-5s) across all 7 services
  2. frontend-e2e — Inlined k3s, overlaps boot with Node + npm ci + Playwright (~50s overlap)
  3. backend-e2e — Inlined k3s, overlaps boot with docker-cache (~15s overlap)
  4. Both YAML files validated

* Before (19 steps, 6 sequential push steps = ~81s pushing):

  Build base → Push base (13s) → Build 8 workers → Push 8 workers (35s sequential) → Build cert-gen → Push cert-gen (7s) → Build zk-certgen → Push zk-certgen (8s) → Build frontend → Push frontend-dev (12s) → Build frontend-prod → Push frontend-prod (6s)

  After (14 steps, 1 parallel push step):

  Build base → Build 8 workers → Build cert-gen → Build zk-certgen → Build frontend → Build frontend-prod → Push all 13 in parallel (~15-20s)

  Expected savings: ~60s (81s sequential → ~20s parallel). The job should drop from 2m 48s to ~1m 50s. The builds all run first (same total time as before), then all 13 pushes fire concurrently. Since they share base layers, Docker deduplicates — the first push uploads the shared layers and the rest skip them.

* Two changes made:
  1. Parallel GHCR pushes (build-images job):
     - Merged 6 separate push steps into 1 step that pushes all 13 images in parallel via for ... do docker push & done; wait (sketched below)
     - Expected: ~81s sequential → ~15-20s parallel (saves ~60s)
  2. Targeted health checks (both E2E jobs):
     - Replaced deploy.sh dev --no-build --wait (which waits for ALL 15+ containers) with:
       - docker compose up -d --no-build (returns immediately, ~3s)
       - a curl loop that only waits for backend (backend-e2e) or backend + frontend (frontend-e2e)
     - Workers start in the background and become ready while the tests run their initial setup
     - Expected: "Start stack" drops from ~2:01 to ~5s, plus "Wait for health" at ~40-60s, so ~45-65s total (saves ~60s)
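A minimal sketch of the parallel push step from item 1 above. Only three image names are shown, and this version waits on each PID so a failed push fails the step, a small hardening on top of the bare wait the commit describes:

```yaml
- name: Push images to GHCR (parallel)
  run: |
    # IMAGE_TAG (e.g. sha-abc1234) is assumed to be set earlier in the job;
    # the real step lists all 13 images.
    pids=()
    for img in base backend frontend-prod; do
      docker push "ghcr.io/hardmax71/integr8scode/${img}:${IMAGE_TAG}" &
      pids+=($!)
    done
    # wait on each push individually so any non-zero exit fails the step
    for pid in "${pids[@]}"; do wait "$pid"; done
```

Because the images share base layers, the registry deduplicates: whichever push arrives first uploads the shared layers and the rest skip them.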
* Root cause analysis: docker compose up -d --no-build (even without --wait) takes 1:23 because depends_on: condition: service_healthy in docker-compose.yaml forces compose to wait for the entire dependency chain before creating dependent containers. Removing --wait only skipped the final "all healthy" check — the internal chain is the real bottleneck.

  Changes made (3 optimizations):
  1. Removed the docker-cache step (saves ~1:08 of blocking time). The docker-cache composite action was loading 5 infra images from the GHA cache in ~68s of blocking foreground time, but docker compose pull (the pre-pull) already fetches ALL images in the background. Removed the redundant step.
  2. Merged pre-pull + pre-warm into a single sequential background task. Instead of: pre-pull (bg) → docker-cache (blocking 1:08) → pre-warm (bg), it is now: docker compose pull && docker compose up -d ... infra, all in one background process. Infra starts pulling and booting immediately after checkout, overlapping with all subsequent setup steps.
  3. Pre-start cert-generator after k3s finalize. cert-generator is on the critical path: cert-gen (complete) → backend (healthy) → frontend. Starting it right after the kubeconfig exists gives it a ~15-20s head start while we wait for the pre-pull to finish.

* What changed: frontend.depends_on.backend: service_healthy → service_started (sketched below).

  Impact: Compose no longer waits for backend to pass its health check (~35s) before creating the frontend container. Backend and frontend now boot in parallel during docker compose up -d.

  For frontend-e2e: "Start stack" should drop from 1:20 to 45-50s (no backend health wait in compose), and "Wait for backend+frontend" picks up the slack but runs in parallel (45s). Net: 2:03 → ~1:30, saving ~33s → the job drops to ~5:00.

  For backend-e2e: smaller impact, since backend tests don't need the frontend. "Start stack" drops slightly (~10s) since compose returns earlier. The job should be ~5:30.

  At this point we're approaching the hard floor:
  - Backend E2E: 3:00 tests + 100s minimum setup = ~4:40 floor, currently ~5:30 (50s over)
  - Frontend E2E: 2:11 tests + 80s minimum setup = ~3:31 floor, currently ~5:00 (89s over, mostly from the depends_on chain, which is inherent to docker-compose)

* fixes

* Created 2 composite actions, deleted 2 unused ones:

  | Action | Purpose |
  |--------|---------|
  | e2e-boot (new) | GHCR login + pull/prewarm infra (bg) + k3s install |
  | e2e-ready (new) | Finalize k3s + cert-gen + config + wait + start stack + health check + seed |
  | k3s-setup (deleted) | Was inlined previously, never referenced |
  | docker-cache (deleted) | Replaced by docker compose pull, never referenced |

  Step count reduction:
  - backend-e2e: 20 steps → 8 steps (checkout + 2 actions + test + coverage + logs)
  - frontend-e2e: 20 steps → 13 steps (checkout + e2e-boot + 5 Node/Playwright + e2e-ready + test + report + logs)

* updated docs + branch = main for all CI workflows (removed dev)

* clarified 12/13 images in docs
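A sketch of the depends_on change from the "What changed" entry above; only the relevant keys are shown, and everything else about the frontend service is omitted:

```yaml
services:
  frontend:
    depends_on:
      backend:
        condition: service_started   # was service_healthy; compose no longer blocks on backend's health check
```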
1 parent f6b4add commit 2781815

13 files changed

Lines changed: 780 additions & 670 deletions

.github/actions/docker-cache/action.yml

Lines changed: 0 additions & 64 deletions
This file was deleted.
.github/actions/e2e-boot/action.yml

Lines changed: 43 additions & 0 deletions
name: 'E2E Boot'
description: 'Kick off slow background tasks: GHCR auth, image pull + infra pre-warm, k3s install'

inputs:
  image-tag:
    description: 'GHCR image tag (e.g., sha-abc1234)'
    required: true
  github-token:
    description: 'GitHub token for GHCR authentication'
    required: true

runs:
  using: 'composite'
  steps:
    - name: Log in to GHCR
      uses: docker/login-action@v3
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ inputs.github-token }}

    - name: Pull images and pre-warm infra (background)
      shell: bash
      env:
        IMAGE_TAG: ${{ inputs.image-tag }}
      run: |
        nohup bash -c '
          IMAGE_TAG='"$IMAGE_TAG"' docker compose pull --quiet 2>&1
          echo "--- pull done, starting infra ---"
          docker compose up -d --no-build \
            mongo redis shared-ca zookeeper-certgen zookeeper kafka schema-registry 2>&1
          echo $? > /tmp/infra-pull.exit
        ' > /tmp/infra-pull.log 2>&1 &
        echo $! > /tmp/infra-pull.pid

    - name: Install k3s
      shell: bash
      run: |
        K3S_TAG=$(echo "$K3S_VERSION" | sed 's/+/%2B/g')
        curl -sfL "https://raw.githubusercontent.com/k3s-io/k3s/${K3S_TAG}/install.sh" -o /tmp/k3s-install.sh
        echo "$K3S_INSTALL_SHA256  /tmp/k3s-install.sh" | sha256sum -c -
        chmod +x /tmp/k3s-install.sh
        INSTALL_K3S_VERSION="$K3S_VERSION" INSTALL_K3S_EXEC="--disable=traefik --bind-address 0.0.0.0 --tls-san host.docker.internal" /tmp/k3s-install.sh
.github/actions/e2e-ready/action.yml

Lines changed: 85 additions & 0 deletions
name: 'E2E Ready'
description: 'Finalize k3s, wait for infra, start compose stack, health-check, seed test users'

inputs:
  image-tag:
    description: 'GHCR image tag (e.g., sha-abc1234)'
    required: true
  wait-for-frontend:
    description: 'Also wait for frontend health check (default: false)'
    required: false
    default: 'false'

runs:
  using: 'composite'
  steps:
    - name: Finalize k3s
      shell: bash
      run: |
        mkdir -p /home/runner/.kube
        sudo k3s kubectl config view --raw > /home/runner/.kube/config
        sudo chmod 600 /home/runner/.kube/config
        export KUBECONFIG=/home/runner/.kube/config
        timeout 90 bash -c 'until kubectl cluster-info 2>/dev/null; do sleep 3; done'
        kubectl create namespace integr8scode --dry-run=client -o yaml | kubectl apply -f -
        sed -E 's#https://(127\.0\.0\.1|0\.0\.0\.0):6443#https://host.docker.internal:6443#g' \
          /home/runner/.kube/config > backend/kubeconfig.yaml
        chmod 644 backend/kubeconfig.yaml

    - name: Start cert-generator (background)
      shell: bash
      env:
        IMAGE_TAG: ${{ inputs.image-tag }}
      run: |
        nohup docker compose up -d --no-build cert-generator \
          > /tmp/cert-gen.log 2>&1 &

    - name: Use test environment config
      shell: bash
      run: |
        cp backend/config.test.toml backend/config.toml
        cp backend/secrets.example.toml backend/secrets.toml

    - name: Wait for image pull and infra
      shell: bash
      run: |
        if [ -f /tmp/infra-pull.pid ]; then
          PID=$(cat /tmp/infra-pull.pid)
          if kill -0 "$PID" 2>/dev/null; then
            echo "Waiting for image pull + infra startup..."
            tail --pid="$PID" -f /dev/null 2>/dev/null || true
          fi
        fi
        cat /tmp/infra-pull.log 2>/dev/null || true
        cat /tmp/cert-gen.log 2>/dev/null || true
        if [ -f /tmp/infra-pull.exit ]; then
          EXIT_CODE=$(cat /tmp/infra-pull.exit)
          if [ "$EXIT_CODE" != "0" ]; then
            echo "::error::Background image pull / infra pre-warm failed (exit $EXIT_CODE)"
            exit 1
          fi
        fi

    - name: Start stack
      shell: bash
      env:
        IMAGE_TAG: ${{ inputs.image-tag }}
      run: docker compose up -d --no-build

    - name: Wait for services
      shell: bash
      env:
        WAIT_FOR_FRONTEND: ${{ inputs.wait-for-frontend }}
      run: |
        echo "Waiting for backend health..."
        timeout 120 bash -c 'until curl -ksf https://localhost/api/v1/health/live 2>/dev/null; do sleep 2; done'
        echo "Backend ready"
        if [ "$WAIT_FOR_FRONTEND" = "true" ]; then
          echo "Waiting for frontend health..."
          timeout 60 bash -c 'until curl -ksf https://localhost:5001 2>/dev/null; do sleep 2; done'
          echo "Frontend ready"
        fi

    - name: Seed test users
      shell: bash
      run: docker compose exec -T backend uv run python scripts/seed_users.py
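For context, a sketch of how an E2E job in stack-tests.yml might call these two composite actions. The job layout, env names, checkout version, tag derivation, and test command are assumptions based on the commit message, which describes backend-e2e as "checkout + 2 actions + test + coverage + logs":

```yaml
jobs:
  backend-e2e:
    runs-on: ubuntu-latest
    env:
      REGISTRY: ghcr.io   # e2e-boot reads ${{ env.REGISTRY }}; K3S_VERSION and
                          # K3S_INSTALL_SHA256 must also be defined in env somewhere
    steps:
      - uses: actions/checkout@v4                    # checkout version assumed
      - uses: ./.github/actions/e2e-boot
        with:
          image-tag: sha-abc1234                     # in CI this would be the commit's sha-* tag
          github-token: ${{ secrets.GITHUB_TOKEN }}
      # frontend-e2e would place its Node/Playwright setup steps here
      - uses: ./.github/actions/e2e-ready
        with:
          image-tag: sha-abc1234
          wait-for-frontend: 'false'
      - name: Run backend E2E tests
        run: echo "test command goes here"           # placeholder; not from the commit
```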

.github/actions/k3s-setup/action.yml

Lines changed: 0 additions & 57 deletions
This file was deleted.
