Commit 2781815
making deploy quicker - less intervals, observability now optional (#123)
* making deploy quicker - less intervals, observability now optional
* Changes Made
stack-tests.yml — Build, Test, Push
- Removed dev from branch triggers (only main now)
- Added tags: ['v*'] and cert-generator/** path triggers
- Build job now pushes to GHCR with immutable sha-{sha} tag (push events only)
- Added missing pre-builds: event-replay, dlq-processor, zookeeper-certgen (these were being rebuilt during compose startup before)
- Added frontend-prod build (from Dockerfile.prod, pushed as frontend:sha-xxx for Trivy scanning)
- E2E jobs pull from GHCR on push events (backgrounded docker pull commands run in parallel, then images are retagged to the compose names), and fall back to the artifact for PRs
- All push/pull commands are spelled out explicitly (no for loops)
- Added packages: write permission to build job
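A rough sketch of how the reworked build job can wire this together (the login and build steps, the exact paths list, and the service set are illustrative, not copied from the workflow):

```yaml
# Hypothetical excerpt of stack-tests.yml: new triggers, GHCR permission,
# and the immutable sha-<sha> tag pushed on push events only.
on:
  push:
    branches: [main]               # dev removed
    tags: ['v*']
    paths: ['cert-generator/**']   # added alongside the existing path filters
  pull_request:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write              # required to push to GHCR with GITHUB_TOKEN
    steps:
      - uses: actions/checkout@v6
      - uses: docker/login-action@v3
        if: github.event_name != 'pull_request'
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push backend (push events only)
        if: github.event_name != 'pull_request'
        run: |
          docker build -t ghcr.io/hardmax71/integr8scode/backend:sha-${{ github.sha }} backend/
          docker push ghcr.io/hardmax71/integr8scode/backend:sha-${{ github.sha }}
```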
docker.yml — Scan & Promote (rewritten)
- Trigger: workflow_run on "Stack Tests" completion (+ workflow_dispatch with optional SHA input)
- Only runs when Stack Tests succeed on main
- Scan jobs: Trivy scans backend and frontend-prod from GHCR using SHA tag
- Promote job: crane copy sha-xxx → latest for all 12 images — registry-level manifest copy, no rebuild
- latest is NEVER set during build — only after all tests + scans pass
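Roughly, the trigger and promote side could look like this (the job wiring and the crane setup steps are assumptions; only the workflow_run trigger, the success-on-main gate, and the crane copy of sha-xxx → latest come from the description above):

```yaml
# Hypothetical excerpt of docker.yml: run after "Stack Tests", gate on success + main,
# then promote by copying manifests inside the registry (no rebuild).
on:
  workflow_run:
    workflows: ["Stack Tests"]
    types: [completed]
  workflow_dispatch:
    inputs:
      sha:
        description: 'Commit SHA to scan and promote (optional)'
        required: false

jobs:
  promote:
    if: >
      github.event.workflow_run.conclusion == 'success' &&
      github.event.workflow_run.head_branch == 'main'
    runs-on: ubuntu-latest
    permissions:
      packages: write
    steps:
      - name: Promote backend to latest
        run: |
          # crane copy is a registry-level manifest copy; nothing is rebuilt.
          # (crane install + ghcr.io auth steps omitted; one copy per image, 12 in total.)
          crane copy \
            ghcr.io/hardmax71/integr8scode/backend:sha-${{ github.event.workflow_run.head_sha }} \
            ghcr.io/hardmax71/integr8scode/backend:latest
```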
Flow
Push to main:
stack-tests.yml: unit → build (push sha-xxx to GHCR) → E2E (pull from GHCR)
docker.yml: (on success) → scan → promote sha-xxx → latest
PR:
stack-tests.yml: unit → build (save artifact) → E2E (load artifact)
docker.yml: (skipped — only triggers on main)
* Replaced two separate scan jobs (scan-backend, scan-frontend) with a single matrix job (scan) that scans all 12 images in parallel:
- fail-fast: false — one image's vulnerability findings don't cancel the other scans
- Each matrix entry runs as its own parallel job on a separate runner
- SARIF results uploaded per-image with unique categories (trivy-base, trivy-backend, etc.)
- trivyignores: 'backend/.trivyignore' applied to all images (CVE exemptions are image-agnostic)
- checkout@v6 included so the .trivyignore file is available
Updated promote.needs from [scan-backend, scan-frontend] to [scan] — waits for all 12 matrix entries to pass before promoting anything to latest.
Updated the summary security section to reflect that all 12 images are scanned.
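A sketch of the matrix shape this describes (image list abridged to three of the 12; action versions other than checkout@v6 are assumptions):

```yaml
scan:
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false                 # one image's findings don't cancel the other scans
    matrix:
      image: [base, backend, frontend-prod]
  steps:
    - uses: actions/checkout@v6      # makes backend/.trivyignore available
    - name: Trivy scan ${{ matrix.image }}
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: ghcr.io/hardmax71/integr8scode/${{ matrix.image }}:sha-${{ github.event.workflow_run.head_sha }}
        format: sarif
        output: trivy-${{ matrix.image }}.sarif
        trivyignores: 'backend/.trivyignore'   # same exemptions applied to every image
    - uses: github/codeql-action/upload-sarif@v3
      with:
        sarif_file: trivy-${{ matrix.image }}.sarif
        category: trivy-${{ matrix.image }}    # unique SARIF category per image

promote:
  needs: [scan]   # waits for every matrix entry before touching latest
```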
* What changed
docker-compose.yaml (+15 lines): Every buildable service now has an image: field pointing to ghcr.io/hardmax71/integr8scode/{service}:${IMAGE_TAG:-latest}. kafka-init and user-seed share the backend image. Compose now knows where to pull pre-built images from.
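For example (service keys from the description above; the build context path is an assumption, kept so --build still works):

```yaml
services:
  backend:
    image: ghcr.io/hardmax71/integr8scode/backend:${IMAGE_TAG:-latest}
    build:
      context: ./backend
  kafka-init:
    image: ghcr.io/hardmax71/integr8scode/backend:${IMAGE_TAG:-latest}   # shares the backend image
```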
deploy.sh (+10 lines): Added --no-build flag to cmd_dev(). Passes --no-build to compose, preventing any build fallback.
stack-tests.yml (-149 lines):
- Build job: push condition changed from event_name != 'pull_request' to !github.event.pull_request.head.repo.fork (same-repo PRs can push to GHCR). Artifact save/upload removed entirely.
- Both E2E jobs: Deleted all GHCR login, parallel pull, retag, artifact download, and load steps. Replaced with a single IMAGE_TAG env var on the "Start stack" step. Compose pulls SHA-tagged images from GHCR automatically using the image: fields.
- Both E2E jobs have if: !fork guard — fork PRs skip E2E (unit tests still run).
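The two expressions, roughly (step and job names are placeholders):

```yaml
# Push for anything that is not a fork PR, and skip E2E on fork PRs.
# (On plain push events github.event.pull_request is empty, so the expression stays true.)
- name: Push images to GHCR
  if: ${{ !github.event.pull_request.head.repo.fork }}
  run: docker push ghcr.io/hardmax71/integr8scode/backend:sha-${{ github.sha }}

backend-e2e:
  if: ${{ !github.event.pull_request.head.repo.fork }}   # fork PRs: unit tests only
```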
How it works
| Scenario | What happens |
|---------------------------------------------------------|--------------------------------------------------|
| ./deploy.sh dev (local, first time) | Compose pulls latest from GHCR — no build needed |
| ./deploy.sh dev --build (local, with changes) | Builds locally, tags with GHCR name |
| CI: IMAGE_TAG=sha-xxx ./deploy.sh dev --no-build --wait | Compose pulls sha-tagged images from GHCR |
| ./deploy.sh prod | Helm uses GHCR images (unchanged) |
* 1. Playwright Sharding (frontend-e2e)
- Added strategy.matrix with shardIndex: [1, 2] and shardTotal: [2] (a combined sketch of the sharding and pre-pull changes follows this list)
- fail-fast: false so one shard failing doesn't cancel the other
- Test command: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }}
- Artifact names include shard index to avoid collisions: playwright-report-1, playwright-report-2, frontend-e2e-logs-1, etc.
- Job name shows shard: Frontend E2E (1/2), Frontend E2E (2/2)
2. GHCR Pre-pull (both E2E jobs)
- Immediately after checkout, docker compose pull --quiet starts in the background via nohup
- While GHCR images pull, the subsequent setup steps run in parallel:
- backend-e2e: Docker cache load + k3s install (~85s of overlap)
- frontend-e2e: Node setup + npm ci + Playwright install + Docker cache + k3s (~150s of overlap)
- A "Wait for GHCR images" step before "Start stack" ensures pull is complete
- "Start stack" then finds images already local — skips pulling entirely
* Here's what this adds — infrastructure pre-warming:
How it works
Both E2E jobs now have this timeline:
Step 2: Pre-pull GHCR images ──────────────────────────────── (background)
Step 3-7: Node/Playwright/Docker cache setup ──────────────── (foreground, ~50s)
Step 8: Docker-cache loads infra images ───────────────────── (~15s)
Step 9: Pre-warm infrastructure ───────────────────────────── (background, starts immediately)
├── mongo + redis start (~5s to healthy)
├── shared-ca + cert-gen + zk-certgen start (~5s)
├── zookeeper starts after zk-certgen (~15s)
├── kafka starts after zookeeper healthy (~20s)
└── schema-registry starts after kafka (~10s)
Step 10: k3s install ──────────────────────────────────────── (~42s, OVERLAPS with infra chain)
Step 12: Wait for background tasks ────────────────────────── (both should be done)
Step 13: Start stack ──────────────────────────────────────── (infra already healthy, only app services)
Expected impact on "Start stack"
| Component | Before | After |
|---------------------------------|---------------------------|------------------------------|
| Infra initialization (zk chain) | ~50s (during Start stack) | 0s (already done during k3s) |
| App image pull | ~60s | 0s (pre-pulled) |
| App service startup | ~30s | ~30s |
| Health check waits | ~20s | ~20s |
| Total "Start stack" | ~2:20 | ~0:50 |
* The root cause: the cert-generator service in docker-compose.yaml mounts ~/.kube:/root/.kube. When Docker creates that bind-mount source directory, it creates it as root:root. Then k3s-setup's sudo k3s kubectl config view --raw > /home/runner/.kube/config fails, because the shell redirect (>) runs as the runner user, who can't write to the root-owned directory.
* Backend E2E step reorder (stack-tests.yml):
| Before | After |
|-----------------------------------------|----------------------------------------|
| 1. checkout | 1. checkout |
| 2. GHCR pre-pull (bg) | 2. GHCR pre-pull (bg) |
| 3. docker-cache | 3. config copy (moved up) |
| 4. infra pre-warm (bg) | 4. Install k3s (split from composite) |
| 5. k3s-setup (composite, ~45s blocking) | 5. docker-cache (runs during k3s boot) |
| 6. config copy | 6. infra pre-warm (bg) |
| 7. wait for bg | 7. Finalize k3s (~25s+ after install) |
| 8. start stack | 8. wait for bg |
| | 9. start stack |
Key gain: k3s boot (30s) now overlaps with docker-cache (10-18s) instead of blocking sequentially. The composite k3s-setup action is inlined as "Install k3s" + "Finalize k3s", same pattern as frontend-e2e.
Complete optimization summary across both files:
1. docker-compose.yaml - Tightened health check intervals (5s→2-3s) and start periods (10s→3-5s) across all 7 services (see the illustrative healthcheck after this list)
2. frontend-e2e — Inlined k3s, overlaps boot with Node + npm ci + Playwright (~50s overlap)
3. backend-e2e — Inlined k3s, overlaps boot with docker-cache (~15s overlap)
4. Both YAML files validated
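Illustrative healthcheck for one service, per item 1 (the test command is an assumption; the values follow the before/after ranges quoted above):

```yaml
services:
  redis:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 2s        # was 5s
      timeout: 3s
      retries: 10
      start_period: 3s    # was 10s
```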
* Before (19 steps, 6 sequential push steps = ~81s pushing):
Build base → Push base (13s) → Build 8 workers → Push 8 workers (35s sequential)
→ Build cert-gen → Push cert-gen (7s) → Build zk-certgen → Push zk-certgen (8s)
→ Build frontend → Push frontend-dev (12s) → Build frontend-prod → Push frontend-prod (6s)
After (14 steps, 1 parallel push step):
Build base → Build 8 workers → Build cert-gen → Build zk-certgen
→ Build frontend → Build frontend-prod → Push all 13 in parallel (~15-20s)
Expected savings: ~60s (81s sequential → ~20s parallel). Job should drop from 2m 48s → ~1m 50s.
The builds are all done first (same total time), then all 13 pushes fire concurrently. Since they share base layers, Docker deduplicates — the first push uploads shared layers and the rest skip them.
* Two changes made:
1. Parallel GHCR pushes (build-images job):
- Merged 6 separate push steps into 1 step that pushes all 13 images in parallel via for ... do docker push & done; wait (sketched after this list)
- Expected: ~81s sequential → ~15-20s parallel (saves ~60s)
2. Targeted health checks (both E2E jobs):
- Replaced deploy.sh dev --no-build --wait (waits for ALL 15+ containers) with:
- docker compose up -d --no-build (returns immediately, ~3s)
- curl loop that only waits for backend (backend-e2e) or backend + frontend (frontend-e2e)
- Workers start in background and become ready while tests run their initial setup
- Expected: "Start stack" drops from ~2:01 to ~5s + "Wait for health" ~40-60s = ~45-65s total (saves ~60s)
* Root cause analysis: docker compose up -d --no-build (even without --wait) takes 1:23 because depends_on: condition: service_healthy in docker-compose.yaml forces compose to wait for the entire dependency chain before
creating dependent containers. Removing --wait only skipped the final "all healthy" check — the internal chain is the real bottleneck.
Changes made (3 optimizations):
1. Removed docker-cache step (saves ~1:08 blocking time)
The docker-cache composite action was loading 5 infra images from GHA cache in ~68s of blocking foreground time. But docker compose pull (pre-pull) already fetches ALL images in background. Removed the redundant step.
2. Merged pre-pull + pre-warm into single sequential background task
Instead of: pre-pull (bg) → docker-cache (blocking 1:08) → pre-warm (bg)
Now: docker compose pull && docker compose up -d ... infra all in one background process. Infra starts pulling + booting immediately after checkout, overlapping with all subsequent setup steps.
3. Pre-start cert-generator after k3s finalize
cert-generator is on the critical path: cert-gen(complete) → backend(healthy) → frontend. Starting it right after kubeconfig exists gives it a ~15-20s head start while we wait for pre-pull to finish.
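A sketch of the merged background task and the cert-generator head start (the infra service list and file paths are illustrative):

```yaml
- name: Pre-pull and pre-warm infra (background)
  run: |
    nohup bash -c 'docker compose pull --quiet && docker compose up -d --no-build mongo redis zookeeper kafka schema-registry' \
      > /tmp/prewarm.log 2>&1 &
    echo $! > /tmp/prewarm.pid

# ... k3s install + finalize run in the foreground while the above proceeds ...

- name: Pre-start cert-generator
  run: docker compose up -d --no-build cert-generator   # kubeconfig exists now, so it gets a head start

- name: Wait for background tasks
  run: while kill -0 "$(cat /tmp/prewarm.pid)" 2>/dev/null; do sleep 2; done
```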
* What changed: frontend.depends_on.backend: service_healthy → service_started
Impact: Compose no longer waits for backend to pass its health check (~35s) before creating the frontend container. Backend and frontend now boot in parallel during docker compose up -d.
For frontend-e2e: "Start stack" should drop from 1:20 to 45-50s (no backend health wait in compose), and "Wait for backend+frontend" picks up the slack but runs in parallel (45s). Net: 2:03 → ~1:30, saving ~33s → job drops
to ~5:00.
For backend-e2e: Smaller impact since backend tests don't need frontend. "Start stack" drops slightly (~10s) since compose returns earlier. Job should be ~5:30.
At this point we're approaching the hard floor:
- Backend E2E: 3:00 tests + 100s minimum setup = ~4:40 floor, currently ~5:30 (50s over)
- Frontend E2E: 2:11 tests + 80s minimum setup = ~3:31 floor, currently ~5:00 (89s over, mostly from the depends_on chain which is inherent to docker-compose)
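The compose change itself, in context (other keys omitted):

```yaml
services:
  frontend:
    depends_on:
      backend:
        condition: service_started   # was: service_healthy
```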
* fixes
* Created 2 composite actions, deleted 2 unused ones:
| Action | Purpose |
|------------------------|-----------------------------------------------------------------------------|
| e2e-boot (new) | GHCR login + pull/prewarm infra (bg) + k3s install |
| e2e-ready (new) | Finalize k3s + cert-gen + config + wait + start stack + health check + seed |
| k3s-setup (deleted) | Was inlined previously, never referenced |
| docker-cache (deleted) | Replaced by docker compose pull, never referenced |
Step count reduction:
- backend-e2e: 20 steps → 8 steps (checkout + 2 actions + test + coverage + logs)
- frontend-e2e: 20 steps → 13 steps (checkout + e2e-boot + 5 Node/Playwright + e2e-ready + test + report + logs)
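Roughly how backend-e2e reads after the refactor (action inputs, if any, are omitted because they aren't described here):

```yaml
backend-e2e:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v6
    - uses: ./.github/actions/e2e-boot    # GHCR login + pull/prewarm infra (bg) + k3s install
    - uses: ./.github/actions/e2e-ready   # finalize k3s + cert-gen + config + wait + start stack + health check + seed
    # remaining steps: run tests, upload coverage, collect logs
```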
* updated docs + branch = main for all CI workflows (removed dev)
* clarified 12/13 images in docs

1 parent f6b4add, commit 2781815
13 files changed
Lines changed: 780 additions & 670 deletions
File tree
- .github
  - actions
    - docker-cache
    - e2e-boot
    - e2e-ready
    - k3s-setup
  - workflows
- docs/operations