
Conversation

@HardMax71 (Owner) commented Jan 31, 2026


Summary by cubic

Speeds up local deploys by making observability optional, tightening healthcheck timing, and pulling pre-built images by default. CI builds SHA-tagged images, E2E uses them, all images are scanned, and SHA tags are promoted to latest after Stack Tests succeed (manual promotion is also available).

  • New Features

    • Added flags: --observability, --debug, and --no-build; profiles apply only when passed; Kafdrop moved to the "debug" profile.
    • CI/CD: push SHA-tagged images to GHCR; E2E pulls via IMAGE_TAG (frontend tests sharded); scan all images; auto-promote SHA → latest after Stack Tests pass; manual promotion supported.
  • Migration

    • Observability no longer starts by default. Use: ./deploy.sh dev --observability or ./deploy.sh dev --debug.
    • If you depend on Kafdrop at http://localhost:9000, start with --debug (a sketch of the flag handling follows below).
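
A minimal sketch of how the flag handling could look in deploy.sh: NO_BUILD_FLAG and PROFILE_FLAGS are the variables the PR actually introduces, but the parsing details here are assumptions, not the real script.

```bash
# Hypothetical flag parsing for `deploy.sh dev`; details are illustrative.
PROFILE_FLAGS=""
NO_BUILD_FLAG=""
for arg in "$@"; do
  case "$arg" in
    --observability) PROFILE_FLAGS+=" --profile observability" ;;
    # Assumption from the migration note: --debug also enables observability.
    --debug)         PROFILE_FLAGS+=" --profile observability --profile debug" ;;
    --no-build)      NO_BUILD_FLAG="--no-build" ;;
  esac
done
# Profiles apply only when passed, so nothing optional starts by default.
docker compose $PROFILE_FLAGS up -d $NO_BUILD_FLAG
```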

Written for commit 595c223. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Dev command: added --no-build, --observability, and --debug flags with help and conditional status output for observability/debug tools.
    • Compose: services now reference registry images and a debug profile to enable optional tooling.
  • Chores

    • Faster local startup via reduced healthcheck intervals and timing adjustments.
    • CI: consolidated scan-and-promote flow to promote SHA-tagged images to latest after scans; simplified build matrix; expanded triggers.
    • CI actions: added E2E boot/ready actions; removed legacy cache and k3s-setup actions.
  • Documentation

    • Major CI/CD guide refactor documenting new workflows, scans, promotion flow, and local run guidance.


@coderabbitai bot commented Jan 31, 2026

📝 Walkthrough

Adds profile-aware flags to deploy.sh (--no-build, --observability, --debug); replaces many local compose images with GHCR image references and tightens healthchecks; restructures CI into a scan+promote workflow and an expanded build/push in stack-tests; removes two composite actions (docker-cache, k3s-setup) and adds two new e2e composite actions (e2e-boot, e2e-ready); updates docs and workflow triggers.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Deploy script: deploy.sh | Adds CLI flags --no-build, --observability, --debug; introduces NO_BUILD_FLAG and PROFILE_FLAGS; extends parsing and info logs; runs docker compose with profile/no-build options; adjusts profile-aware service status output. |
| Docker Compose: docker-compose.yaml | Replaces many local images with ghcr.io/...:${IMAGE_TAG:-latest} entries; adds profiles: ["debug"] for kafdrop; removes redundant sleeps; tightens healthcheck intervals, timeouts, retries, and start_periods; adds explicit images for many services (backend, frontend-dev, workers, processors, cert-gen, etc.). |
| CI scan & promote: .github/workflows/docker.yml | Converts per-component build jobs into a matrix-driven scan job producing SHA-tagged images and Trivy SARIF outputs; adds a promote job that promotes SHA-tagged images to latest via crane; updates triggers, summaries, and artifact naming. |
| CI build/push & stack tests: .github/workflows/stack-tests.yml | Restructures the build into "Build & Push Images" with SHA tagging and GHCR pushes for many images (coordinator, workers, frontend-prod/dev, zookeeper-certgen, etc.); updates E2E orchestration to pull GHCR images; adds Playwright sharding, phased infra startup, and enhanced log/artifact handling. |
| Removed composite actions: .github/actions/docker-cache/action.yml, .github/actions/k3s-setup/action.yml | Deletes the prior composite actions that handled Docker cache load/save and the older k3s install/kubeconfig setup. |
| New composite actions: .github/actions/e2e-boot/action.yml, .github/actions/e2e-ready/action.yml | Adds "E2E Boot" to pre-pull/pre-warm images and install k3s, and "E2E Ready" to finalize k3s, start cert-generator in the background, run docker compose without rebuild, wait for services, and seed test users. |
| Workflow triggers & CI configs: .github/workflows/frontend-ci.yml, .github/workflows/mypy.yml, .github/workflows/ruff.yml, .github/workflows/security.yml | Restricts several workflows to the main branch and adjusts triggers/path filters to align with the new pipeline. |
| Docs: docs/operations/cicd.md | Large refactor documenting the multi-workflow pipeline (Stack Tests, Docker Scan & Promote, SBOM & Supply Chain), composite actions, image lifecycle, frontend E2E sharding, and updated local-run guidance and artifacts/summary changes. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  actor Dev as "Developer (CLI)"
  participant Deploy as "deploy.sh"
  participant Compose as "docker compose"
  participant Registry as "GHCR"
  participant CI as "GitHub Actions"
  participant Services as "Containers"

  Dev->>Deploy: run with flags (--no-build / --debug / --observability)
  Deploy->>Compose: set PROFILE_FLAGS and NO_BUILD_FLAG\nrun docker compose up -d
  alt --no-build / profiles include promoted/latest
    Compose->>Registry: pull promoted/latest or IMAGE_TAG images
  else build required
    CI->>Registry: build & push SHA-tagged images
    CI->>CI: run Trivy scans (SARIF)
    CI->>Registry: promote SHA -> latest (crane)
    Compose->>Registry: pull promoted/latest images
  end
  Compose->>Services: start containers (profiles applied)
  Services-->>Compose: healthchecks (shorter intervals)
  Compose-->>Deploy: return profile-aware service statuses
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • dependencies update #58 — Related changes to CI stack startup and image handling (pre-built images, compose startup without rebuild).
  • Deploy via helm #8 — Overlaps on GHCR image switching, IMAGE_TAG/sha handling, and docker-compose / CI changes.
  • fix: #118 #120 — Similar modifications around image tagging, compose behavior, and CI promotion mechanics.

Poem

🐇 I nudged the flags and hopped the queue,
GHCR gleamed as images flew anew.
Profiles woke services, healthchecks sped,
Scans crowned shas and pushed them ahead.
🥕 A little rabbit cheers this CI thread.

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, below the required 80.00% threshold. | Write docstrings for the functions that are missing them. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes: faster deploys through shorter healthcheck intervals and optional observability via CLI flags. |




@cubic-dev-ai bot left a comment

No issues found across 2 files

  stack-tests.yml — Build, Test, Push

  - Removed dev from branch triggers (only main now)
  - Added tags: ['v*'] and cert-generator/** path triggers
  - Build job now pushes to GHCR with immutable sha-{sha} tag (push events only)
  - Added missing pre-builds: event-replay, dlq-processor, zookeeper-certgen (these were being rebuilt during compose startup before)
  - Added frontend-prod build (from Dockerfile.prod, pushed as frontend:sha-xxx for Trivy scanning)
  - E2E jobs pull from GHCR on push events (parallel docker pulls, retagged to compose names), falling back to the artifact for PRs
  - All push/pull commands are spelled out explicitly (no for loops)
  - Added packages: write permission to build job

  docker.yml — Scan & Promote (rewritten)

  - Trigger: workflow_run on "Stack Tests" completion (+ workflow_dispatch with optional SHA input)
  - Only runs when Stack Tests succeed on main
  - Scan jobs: Trivy scans backend and frontend-prod from GHCR using SHA tag
  - Promote job: crane copy sha-xxx → latest for all 12 images — registry-level manifest copy, no rebuild
  - latest is NEVER set during build — only after all tests + scans pass

  Flow

  Push to main:
    stack-tests.yml: unit → build (push sha-xxx to GHCR) → E2E (pull from GHCR)
    docker.yml:      (on success) → scan → promote sha-xxx → latest

  PR:
    stack-tests.yml: unit → build (save artifact) → E2E (load artifact)
    docker.yml:      (skipped — only triggers on main)
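
For illustration, the promote step's registry-level copy can be sketched with crane; the tag below is a placeholder and the image list is abbreviated, not the exact set in docker.yml.

```bash
# Sketch: promote SHA-tagged images to :latest without rebuilding.
REGISTRY=ghcr.io
PREFIX=hardmax71/integr8scode
TAG=sha-xxx   # the actual value comes from the Stack Tests run
for img in base backend frontend coordinator k8s-worker; do
  crane copy "$REGISTRY/$PREFIX/$img:$TAG" "$REGISTRY/$PREFIX/$img:latest"
done
```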
@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/docker.yml (1)

46-56: ⚠️ Potential issue | 🟠 Major

Add exit-code: '1' to block promotion on CRITICAL/HIGH findings.

By default, aquasecurity/trivy-action sets exit-code: 0, so the step succeeds even when CRITICAL/HIGH vulnerabilities are found. This means the subsequent promote step would still run. Add exit-code: '1' to both Trivy scan steps to fail the job on severity-matched findings.

Note: When using format: 'sarif', the action outputs all vulnerabilities regardless of the severity filter. To ensure the exit code respects only your specified severities, also add limit-severities-for-sarif: true.

Suggested change
       severity: 'CRITICAL,HIGH'
+      exit-code: '1'

Applies to lines 46-56 and 89-98.

🤖 Fix all issues with AI agents
In @.github/workflows/docker.yml:
- Around line 170-182: The summary text assumes "All Stack Tests passed" which
is incorrect for manual workflow_dispatch runs; update the step that writes to
$GITHUB_STEP_SUMMARY to branch on the event type (use $GITHUB_EVENT_NAME or
github.event_name) and emit a different header/message when event ==
"workflow_dispatch" (e.g., "Images promoted manually — Stack Tests may not have
run") while keeping the existing text for normal CI runs (when event !=
"workflow_dispatch") and reuse the same $TAG/$REGISTRY/$PREFIX pull lines;
ensure the conditional wraps only the descriptive header line(s) not the image
table so the pull commands still print in both cases (see the first sketch below).
- Around line 129-153: The promote step currently copies all images using the
TAG variable with multiple crane copy commands (e.g., crane copy
"$REGISTRY/$PREFIX/base:$TAG" ...), but only backend and frontend are
Trivy-scanned; update the promote job so it only promotes images that have been
scanned or add Trivy scans for each promoted image before running the crane copy
commands. Concretely, either (A) restrict the list of crane copy targets to the
scanned images (reference the existing crane copy entries for backend and
frontend and remove/unpromote the other image names), or (B) insert a Trivy scan
loop that iterates the same image names (use PREFIX and TAG) and fails on
detected vulnerabilities before executing the crane copy lines for base,
coordinator, k8s-worker, pod-monitor, result-processor, saga-orchestrator,
event-replay, dlq-processor, cert-generator, and zookeeper-certgen (see the second sketch below).
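
For illustration, here are minimal sketches of the two fixes described above; step contents and variable wiring are assumptions, not the exact workflow code. The first branches the summary header on the triggering event while keeping the pull lines shared:

```bash
# Sketch: different header for manual promotions vs. normal CI runs.
if [ "$GITHUB_EVENT_NAME" = "workflow_dispatch" ]; then
  echo "## Images promoted manually (Stack Tests may not have run)" >> "$GITHUB_STEP_SUMMARY"
else
  echo "## All Stack Tests passed" >> "$GITHUB_STEP_SUMMARY"
fi
# The image table prints unconditionally in both cases:
echo "docker pull $REGISTRY/$PREFIX/backend:$TAG" >> "$GITHUB_STEP_SUMMARY"
```

The second is option (B), scanning every image before any crane copy runs (standard Trivy CLI flags; the image list is taken from the comment above):

```bash
# Sketch: fail before promotion if any image has CRITICAL/HIGH findings.
for img in base coordinator k8s-worker pod-monitor result-processor \
           saga-orchestrator event-replay dlq-processor cert-generator zookeeper-certgen; do
  trivy image --severity CRITICAL,HIGH --exit-code 1 "$REGISTRY/$PREFIX/$img:$TAG"
done
```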
🧹 Nitpick comments (2)
.github/workflows/stack-tests.yml (2)

322-357: Consider extracting the GHCR pull/retag logic into a composite action.

This entire block (login, parallel pulls, retagging) is duplicated verbatim in the frontend-e2e job (lines 474-509). Extracting it into a reusable composite action under .github/actions/ would reduce maintenance burden and ensure consistency.

♻️ Suggested approach

Create a composite action at .github/actions/pull-images/action.yml:

name: 'Pull Images from GHCR'
inputs:
  sha-tag:
    required: true
  image-prefix:
    required: true
  registry:
    required: true
  github-token:
    required: true
  github-actor:
    required: true
runs:
  using: 'composite'
  steps:
    - name: Pull and retag images from GHCR
      shell: bash
      env:
        TAG: ${{ inputs.sha-tag }}
        IMG: ${{ inputs.registry }}/${{ inputs.image-prefix }}
      run: |
        echo "${{ inputs.github-token }}" | docker login ghcr.io -u ${{ inputs.github-actor }} --password-stdin
        # ... parallel pulls and retag logic

Then use it in both E2E jobs:

- name: Pull images from GHCR
  if: github.event_name != 'pull_request'
  uses: ./.github/actions/pull-images
  with:
    sha-tag: ${{ needs.build-images.outputs.sha-tag }}
    image-prefix: ${{ needs.build-images.outputs.image-prefix }}
    registry: ${{ env.REGISTRY }}
    github-token: ${{ secrets.GITHUB_TOKEN }}
    github-actor: ${{ github.actor }}

331-343: Parallel pulls are efficient but failures may be harder to diagnose.

While wait blocks until all background pulls finish, it does not propagate their exit statuses, and the logs won't clearly indicate which specific image pull failed. For better debuggability, consider capturing PIDs and checking each exit status, or using xargs -P for parallel execution with clearer error reporting.

♻️ Alternative using xargs for clearer error output
-          docker pull "$IMG/base:$TAG" &
-          docker pull "$IMG/backend:$TAG" &
-          docker pull "$IMG/frontend-dev:$TAG" &
-          docker pull "$IMG/coordinator:$TAG" &
-          docker pull "$IMG/k8s-worker:$TAG" &
-          docker pull "$IMG/pod-monitor:$TAG" &
-          docker pull "$IMG/result-processor:$TAG" &
-          docker pull "$IMG/saga-orchestrator:$TAG" &
-          docker pull "$IMG/event-replay:$TAG" &
-          docker pull "$IMG/dlq-processor:$TAG" &
-          docker pull "$IMG/cert-generator:$TAG" &
-          docker pull "$IMG/zookeeper-certgen:$TAG" &
-          wait
+          echo "base backend frontend-dev coordinator k8s-worker pod-monitor result-processor saga-orchestrator event-replay dlq-processor cert-generator zookeeper-certgen" \
+            | tr ' ' '\n' \
+            | xargs -P 12 -I{} docker pull "$IMG/{}:$TAG"

… single matrix job (scan) that scans all 12 images in parallel:

  - fail-fast: false — one image's vulnerability findings don't cancel the other scans
  - Each matrix entry runs as its own parallel job on a separate runner
  - SARIF results uploaded per-image with unique categories (trivy-base, trivy-backend, etc.)
  - trivyignores: 'backend/.trivyignore' applied to all images (CVE exemptions are image-agnostic)
  - checkout@v6 included so the .trivyignore file is available

  Updated promote.needs from [scan-backend, scan-frontend] to [scan] — waits for all 12 matrix entries to pass before promoting anything to latest.

  Updated the summary security section to reflect that all 12 images are scanned.
@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/docker.yml (1)

50-72: ⚠️ Potential issue | 🟠 Major

Add GHCR login before Trivy scans private images.

Trivy cannot pull private GHCR images without registry authentication. While packages: read permission is already set, Trivy also needs credentials to authenticate. Add docker/login-action before the scan step with GITHUB_TOKEN, or alternatively pass TRIVY_USERNAME and TRIVY_PASSWORD env vars to the Trivy step.

🛠️ Suggested fix
       - uses: actions/checkout@v6

+      - name: Log in to GHCR
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
       - name: Compute image ref
         id: ref
         run: |
           PREFIX="${GITHUB_REPOSITORY_OWNER,,}/integr8scode"

  docker-compose.yaml (+15 lines): Every buildable service now has an image: field pointing to ghcr.io/hardmax71/integr8scode/{service}:${IMAGE_TAG:-latest}. kafka-init and user-seed share the backend image. Compose now knows where to pull pre-built images from.

  deploy.sh (+10 lines): Added --no-build flag to cmd_dev(). Passes --no-build to compose, preventing any build fallback.

  stack-tests.yml (-149 lines):
  - Build job: push condition changed from event_name != 'pull_request' to !github.event.pull_request.head.repo.fork (same-repo PRs can push to GHCR). Artifact save/upload removed entirely.
  - Both E2E jobs: Deleted all GHCR login, parallel pull, retag, artifact download, and load steps. Replaced with a single IMAGE_TAG env var on the "Start stack" step. Compose pulls SHA-tagged images from GHCR automatically using the image: fields.
  - Both E2E jobs have if: !fork guard — fork PRs skip E2E (unit tests still run).

  How it works

  | Scenario                                                | What happens                                     |
  |---------------------------------------------------------|--------------------------------------------------|
  | ./deploy.sh dev (local, first time)                     | Compose pulls latest from GHCR — no build needed |
  | ./deploy.sh dev --build (local, with changes)           | Builds locally, tags with GHCR name              |
  | CI: IMAGE_TAG=sha-xxx ./deploy.sh dev --no-build --wait | Compose pulls sha-tagged images from GHCR        |
  | ./deploy.sh prod                                        | Helm uses GHCR images (unchanged)                |
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/stack-tests.yml:
- Around line 137-141: The echoes that append outputs use an unquoted
$GITHUB_OUTPUT which can cause globbing/word-splitting; update the two echo
lines to redirect into the quoted variable (use >> "$GITHUB_OUTPUT") so that the
values computed in PREFIX and SHA_TAG (and the redirection target) are safely
handled; locate the lines that set PREFIX, SHA_TAG and the two echo lines and
change only the redirection target to be quoted.
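
Concretely, the fix is just quoting the redirection target; a sketch, with output names taken from the later steps that consume them:

```bash
# Quoted $GITHUB_OUTPUT avoids globbing/word-splitting of the target.
echo "image-prefix=$PREFIX" >> "$GITHUB_OUTPUT"
echo "sha-tag=$SHA_TAG" >> "$GITHUB_OUTPUT"
```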
🧹 Nitpick comments (3)
.github/workflows/stack-tests.yml (1)

170-176: Consider consolidating repetitive tag-and-push blocks.

Each image follows the same tag → push pattern. While functional, this repetition increases maintenance burden. A matrix strategy or a reusable script/action could reduce duplication.

Example: Loop-based push script
      - name: Push all images to GHCR
        if: ${{ !github.event.pull_request.head.repo.fork }}
        env:
          TAG: ${{ steps.tags.outputs.sha-tag }}
          IMG: ${{ env.REGISTRY }}/${{ steps.tags.outputs.image-prefix }}
        run: |
          images=(
            "integr8scode-base:base"
            "integr8scode-backend:backend"
            "integr8scode-coordinator:coordinator"
            # ... add remaining images
          )
          for entry in "${images[@]}"; do
            local_name="${entry%%:*}"
            remote_name="${entry##*:}"
            docker tag "$local_name:latest" "$IMG/$remote_name:$TAG"
            docker push "$IMG/$remote_name:$TAG"
          done

Also applies to: 189-211, 224-229, 241-246, 259-264, 276-281

docker-compose.yaml (2)

78-81: Reduced start_period values may cause premature healthcheck failures.

The start_period reductions (e.g., kafka from 15s→5s, backend/frontend from 10s→5s) assume faster container startup. If services occasionally take longer to initialize (cold cache, slower CI runners, resource contention), healthchecks may fail before the service is ready, causing cascading dependency failures.

Consider whether retries × interval provides sufficient total wait time, or if a slightly longer start_period is safer for CI stability.

| Service | Old start_period | New start_period | retries × interval |
|---------|------------------|------------------|--------------------|
| kafka | 15s | 5s | 12 × 5s = 60s |
| backend | 10s | 5s | 50 × 3s = 150s |
| frontend | 10s | 5s | 30 × 3s = 90s |
| redis | 10s | 5s | 5 × 5s = 25s |
The total windows (start_period + retries × interval) still seem reasonable, but monitor CI for flakiness.

Also applies to: 129-129, 159-159, 328-328
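
As a quick worked check of that reasoning, the worst-case wait window is start_period + retries × interval; a sketch using kafka's new values:

```bash
# Worst-case healthcheck window for kafka after the change.
start_period=5; retries=12; interval=5
echo "kafka window: $((start_period + retries * interval))s"   # prints: kafka window: 65s
```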


354-354: Kafdrop moved to debug profile — document access instructions.

Kafdrop is now only started with --debug flag. The PR description mentions it's available at http://localhost:9000 with --debug, which is helpful. Ensure the README or deploy script help text reflects this change so developers know how to access Kafka debugging tools.

  1. Playwright sharding (frontend-e2e)

  - Added strategy.matrix with shardIndex: [1, 2] and shardTotal: [2]
  - fail-fast: false so one shard failing doesn't cancel the other
  - Test command: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }}
  - Artifact names include shard index to avoid collisions: playwright-report-1, playwright-report-2, frontend-e2e-logs-1, etc.
  - Job name shows shard: Frontend E2E (1/2), Frontend E2E (2/2)

  2. GHCR Pre-pull (both E2E jobs)

  - Immediately after checkout, docker compose pull --quiet starts in the background via nohup
  - While GHCR images pull, the subsequent setup steps run in parallel:
    - backend-e2e: Docker cache load + k3s install (~85s of overlap)
    - frontend-e2e: Node setup + npm ci + Playwright install + Docker cache + k3s (~150s of overlap)
  - A "Wait for GHCR images" step before "Start stack" ensures pull is complete
  - "Start stack" then finds images already local — skips pulling entirely
  How it works

  Both E2E jobs now have this timeline:

  Step 2: Pre-pull GHCR images ──────────────────────────────── (background)
  Step 3-7: Node/Playwright/Docker cache setup ──────────────── (foreground, ~50s)
  Step 8: Docker-cache loads infra images ───────────────────── (~15s)
  Step 9: Pre-warm infrastructure ───────────────────────────── (background, starts immediately)
           ├── mongo + redis start (~5s to healthy)
           ├── shared-ca + cert-gen + zk-certgen start (~5s)
           ├── zookeeper starts after zk-certgen (~15s)
           ├── kafka starts after zookeeper healthy (~20s)
           └── schema-registry starts after kafka (~10s)
  Step 10: k3s install ──────────────────────────────────────── (~42s, OVERLAPS with infra chain)
  Step 12: Wait for background tasks ────────────────────────── (both should be done)
  Step 13: Start stack ──────────────────────────────────────── (infra already healthy, only app services)

  Expected impact on "Start stack"

  | Component                       | Before                    | After                        |
  |---------------------------------|---------------------------|------------------------------|
  | Infra initialization (zk chain) | ~50s (during Start stack) | 0s (already done during k3s) |
  | App image pull                  | ~60s                      | 0s (pre-pulled)              |
  | App service startup             | ~30s                      | ~30s                         |
  | Health check waits              | ~20s                      | ~20s                         |
  | Total "Start stack"             | ~2:20                     | ~0:50                        |
@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/workflows/stack-tests.yml:
- Around line 322-335: The loop that reads PID from each pidfile (pidfile, PID)
uses tail --pid to wait but never captures the background process exit code;
update the block that handles /tmp/ghcr-pull.pid and /tmp/infra-warm.pid so that
after detecting the process (kill -0) you still wait for it with wait "$PID" (or
capture its exit status) and if the exit status is non‑zero then cat the
corresponding log (/tmp/ghcr-pull.log or /tmp/infra-warm.log) to stdout/stderr
and exit the script with that same non‑zero code; ensure this change is applied
to both wait blocks so failures surface immediately rather than being hidden by
tail --pid.
- Around line 287-299: The backend-e2e and frontend-e2e jobs need explicit GHCR
auth and workflow permissions like the build-images job: add a top-level
permissions block for packages: read (and contents: read if needed) to each job
and insert a GHCR login step before any docker-compose/docker pull (use echo
"${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}"
--password-stdin or the same pattern used in build-images). Update the Pre-pull
GHCR images step (and any other image-pull steps) in the backend-e2e and
frontend-e2e jobs to perform the docker login using the GITHUB_TOKEN environment
then run the existing docker compose pull command.

  …~/.kube:/root/.kube. When Docker creates that bind-mount source directory, it creates it as root:root. Then k3s-setup's sudo k3s kubectl config view --raw > /home/runner/.kube/config fails because the shell redirect (>) runs as the runner user, who can't write to the root-owned directory.

  | Before                                  | After                                  |
  |-----------------------------------------|----------------------------------------|
  | 1. checkout                             | 1. checkout                            |
  | 2. GHCR pre-pull (bg)                   | 2. GHCR pre-pull (bg)                  |
  | 3. docker-cache                         | 3. config copy (moved up)              |
  | 4. infra pre-warm (bg)                  | 4. Install k3s (split from composite)  |
  | 5. k3s-setup (composite, ~45s blocking) | 5. docker-cache (runs during k3s boot) |
  | 6. config copy                          | 6. infra pre-warm (bg)                 |
  | 7. wait for bg                          | 7. Finalize k3s (~25s+ after install)  |
  | 8. start stack                          | 8. wait for bg                         |
  |                                         | 9. start stack                         |

  Key gain: k3s boot (30s) now overlaps with docker-cache (10-18s) instead of blocking sequentially. The composite k3s-setup action is inlined as "Install k3s" + "Finalize k3s", same pattern as frontend-e2e.

  Complete optimization summary across both files:
  1. docker-compose.yaml — Tightened health check intervals (5s→2-3s) and start periods (10s→3-5s) across all 7 services
  2. frontend-e2e — Inlined k3s, overlaps boot with Node + npm ci + Playwright (~50s overlap)
  3. backend-e2e — Inlined k3s, overlaps boot with docker-cache (~15s overlap)
  4. Both YAML files validated
@cubic-dev-ai bot left a comment

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name=".github/workflows/stack-tests.yml">

<violation number="1" location=".github/workflows/stack-tests.yml:309">
P2: Avoid piping an unverified remote install script into `sh`. This introduces a supply‑chain risk for CI runners; prefer a pinned k3s version with checksum verification (or the existing k3s setup action) instead.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

  Before:
  Build base → Push base (13s) → Build 8 workers → Push 8 workers (35s sequential)
  → Build cert-gen → Push cert-gen (7s) → Build zk-certgen → Push zk-certgen (8s)
  → Build frontend → Push frontend-dev (12s) → Build frontend-prod → Push frontend-prod (6s)

  After (14 steps, 1 parallel push step):
  Build base → Build 8 workers → Build cert-gen → Build zk-certgen
  → Build frontend → Build frontend-prod → Push all 13 in parallel (~15-20s)

  Expected savings: ~60s (81s sequential → ~20s parallel). Job should drop from 2m 48s → ~1m 50s.

  The builds are all done first (same total time), then all 13 pushes fire concurrently. Since they share base layers, Docker deduplicates — the first push uploads shared layers and the rest skip them.
@cubic-dev-ai bot left a comment

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name=".github/workflows/stack-tests.yml">

<violation number="1" location=".github/workflows/stack-tests.yml:253">
P2: Backgrounding docker pushes without checking each exit status can let failed image pushes go unnoticed. Track PIDs and `wait` them individually so the step fails if any push fails.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/stack-tests.yml:
- Around line 246-253: The parallel docker push loop that runs docker push
"$IMG/$name:$TAG" & and then calls wait can mask failures because wait only
returns the status of the last job; fix it by capturing background PIDs (e.g.,
push PIDs array when launching jobs in the for loop), then iteratively wait on
each PID and check its exit status; if any wait returns non-zero set a failure
flag and exit non-zero at the end (or immediately exit on first failure) so any
individual docker push failure is propagated from the workflow.
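
A sketch of that fix, assuming illustrative image names:

```bash
# Track each push's PID so a single failed push fails the whole step.
pids=()
for name in base backend coordinator k8s-worker; do
  docker push "$IMG/$name:$TAG" &
  pids+=($!)
done
status=0
for pid in "${pids[@]}"; do
  wait "$pid" || status=1
done
exit "$status"
```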
🧹 Nitpick comments (3)
.github/workflows/stack-tests.yml (3)

319-319: Quote variable in basename to prevent word splitting.

The $pidfile variable should be quoted for shell safety.

Proposed fix
-                echo "Waiting for $(basename $pidfile .pid)..."
+                echo "Waiting for $(basename "$pidfile" .pid)..."

481-481: Quote variable in basename to prevent word splitting.

Same issue as in backend-e2e - the $pidfile variable should be quoted.

Proposed fix
-                echo "Waiting for $(basename $pidfile .pid)..."
+                echo "Waiting for $(basename "$pidfile" .pid)..."

299-310: Consider extracting k3s setup to a reusable composite action.

The k3s installation and finalization logic is duplicated between backend-e2e (lines 278-310) and frontend-e2e (lines 424-467). Extracting this to a composite action in .github/actions/ would reduce duplication and simplify maintenance.

  1. Parallel GHCR pushes (build-images job):
  - Merged 6 separate push steps into 1 step that pushes all 13 images in parallel via for ... do docker push & done; wait
  - Expected: ~81s sequential → ~15-20s parallel (saves ~60s)

  2. Targeted health checks (both E2E jobs):
  - Replaced deploy.sh dev --no-build --wait (waits for ALL 15+ containers) with:
    - docker compose up -d --no-build (returns immediately, ~3s)
    - curl loop that only waits for backend (backend-e2e) or backend + frontend (frontend-e2e)
  - Workers start in background and become ready while tests run their initial setup
  - Expected: "Start stack" drops from ~2:01 to ~5s + "Wait for health" ~40-60s = ~45-65s total (saves ~60s)
  …-wait) takes 1:23 because depends_on: condition: service_healthy in docker-compose.yaml forces compose to wait for the entire dependency chain before creating dependent containers. Removing --wait only skipped the final "all healthy" check — the internal chain is the real bottleneck.

  Changes made (3 optimizations):

  1. Removed docker-cache step (saves ~1:08 blocking time)

  The docker-cache composite action was loading 5 infra images from GHA cache in ~68s of blocking foreground time. But docker compose pull (pre-pull) already fetches ALL images in background. Removed the redundant step.

  2. Merged pre-pull + pre-warm into single sequential background task

  Instead of: pre-pull (bg) → docker-cache (blocking 1:08) → pre-warm (bg)
  Now: docker compose pull && docker compose up -d ... infra all in one background process. Infra starts pulling + booting immediately after checkout, overlapping with all subsequent setup steps.

  3. Pre-start cert-generator after k3s finalize

  cert-generator is on the critical path: cert-gen(complete) → backend(healthy) → frontend. Starting it right after kubeconfig exists gives it a ~15-20s head start while we wait for pre-pull to finish.
…_started

  Impact: Compose no longer waits for backend to pass its health check (~35s) before creating the frontend container. Backend and frontend now boot in parallel during docker compose up -d.

  For frontend-e2e: "Start stack" should drop from 1:20 to 45-50s (no backend health wait in compose), and "Wait for backend+frontend" picks up the slack but runs in parallel (45s). Net: 2:03 → ~1:30, saving ~33s → job drops to ~5:00.

  For backend-e2e: Smaller impact since backend tests don't need frontend. "Start stack" drops slightly (~10s) since compose returns earlier. Job should be ~5:30.

  At this point we're approaching the hard floor:
  - Backend E2E: 3:00 tests + 100s minimum setup = ~4:40 floor, currently ~5:30 (50s over)
  - Frontend E2E: 2:11 tests + 80s minimum setup = ~3:31 floor, currently ~5:00 (89s over, mostly from the depends_on chain which is inherent to docker-compose)
  | Action                 | Purpose                                                                     |
  |------------------------|-----------------------------------------------------------------------------|
  | e2e-boot (new)         | GHCR login + pull/prewarm infra (bg) + k3s install                          |
  | e2e-ready (new)        | Finalize k3s + cert-gen + config + wait + start stack + health check + seed |
  | k3s-setup (deleted)    | Was inlined previously, never referenced                                    |
  | docker-cache (deleted) | Replaced by docker compose pull, never referenced                           |

  Step count reduction:
  - backend-e2e: 20 steps → 8 steps (checkout + 2 actions + test + coverage + logs)
  - frontend-e2e: 20 steps → 13 steps (checkout + e2e-boot + 5 Node/Playwright + e2e-ready + test + report + logs)

  Performance preserved: The split point between e2e-boot and e2e-ready is exactly where frontend-e2e interposes Node/Playwright setup, so k3s still boots in the background during that work.
@cubic-dev-ai bot left a comment

1 issue found across 5 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name=".github/actions/e2e-boot/action.yml">

<violation number="1" location=".github/actions/e2e-boot/action.yml:39">
P1: Downloading and executing the k3s install script without verifying its integrity exposes CI to supply‑chain tampering. Add checksum/signature verification (or pin to a trusted artifact) before execution.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/actions/e2e-boot/action.yml:
- Around line 22-33: The background nohup block that runs the docker compose
pull/up should capture and persist its exit status so failures are not lost;
modify the nohup script (the block starting with "nohup bash -c '" that writes
to /tmp/infra-pull.log and /tmp/infra-pull.pid) to write the final exit code
into a file such as /tmp/infra-pull.exit (e.g., run the commands, capture "$?"
and echo it to /tmp/infra-pull.exit before exiting), and update the e2e-ready
step to check /tmp/infra-pull.exit after waiting (fail if the file exists and
contains non-zero, printing /tmp/infra-pull.log for diagnostics).

In @.github/actions/e2e-ready/action.yml:
- Around line 43-54: The wait step currently only uses `tail --pid` so it blocks
until the background PID in /tmp/infra-pull.pid exits but doesn't check its exit
code; change the logic in the "Wait for image pull and infra" step to poll the
background PID (e.g., check /proc/$PID) or wait for a dedicated exit-status file
written by the background job (e.g., /tmp/infra-pull.exit), and after the PID
disappears read that exit-status file (or capture the background process status)
and fail the step (exit non-zero) if the background job reported an error;
update references to /tmp/infra-pull.pid, /tmp/infra-pull.log and the chosen
exit-status file so logs are still printed but the step fails when the infra
pull/warm fails.
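
Both prompts above converge on an exit-file handshake between the two actions; a sketch under those assumptions (service names illustrative):

```bash
# Writer side (e2e-boot): persist the background job's exit status.
nohup bash -c '
  docker compose pull --quiet && docker compose up -d mongo redis
  echo $? > /tmp/infra-pull.exit
' > /tmp/infra-pull.log 2>&1 &
echo $! > /tmp/infra-pull.pid

# Reader side (e2e-ready): wait, then fail loudly if the job failed.
tail --pid="$(cat /tmp/infra-pull.pid)" -f /dev/null
if [ "$(cat /tmp/infra-pull.exit 2>/dev/null)" != "0" ]; then
  cat /tmp/infra-pull.log
  exit 1
fi
```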
🧹 Nitpick comments (2)
.github/actions/e2e-boot/action.yml (1)

35-41: K3S_VERSION is read from environment but not declared as input.

This step relies on K3S_VERSION being set in the calling workflow's env block. If the caller doesn't define it, the script will fail silently or produce an invalid URL. Consider adding it as an input with a default, or document the dependency.

Option: Add K3S_VERSION as an input
 inputs:
   image-tag:
     description: 'GHCR image tag (e.g., sha-abc1234)'
     required: true
   github-token:
     description: 'GitHub token for GHCR authentication'
     required: true
+  k3s-version:
+    description: 'k3s version to install (e.g., v1.32.11+k3s1)'
+    required: false
+    default: 'v1.32.11+k3s1'

Then use ${{ inputs.k3s-version }} in the step.

.github/actions/e2e-ready/action.yml (1)

29-35: Cert-generator background process PID not tracked.

Unlike the infra-pull process, the cert-generator's PID isn't captured, so there's no way to wait for it or check its exit status. If cert generation fails, the error surfaces only when dependent services fail to start.

Capture PID for consistency
     - name: Start cert-generator (background)
       shell: bash
       env:
         IMAGE_TAG: ${{ inputs.image-tag }}
       run: |
         nohup docker compose up -d --no-build cert-generator \
-          > /tmp/cert-gen.log 2>&1 &
+          > /tmp/cert-gen.log 2>&1 &
+          echo $! > /tmp/cert-gen.pid

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/workflows/security.yml:
- Around line 4-7: The security workflow currently restricts triggers by having
push and pull_request set to "branches: [ main ]", which skips scans on the dev
branch; update the workflow triggers (the push and pull_request entries) to
include the dev branch as well (e.g., change branches: [ main ] to branches: [
main, dev ]) so security scans run for dev PRs and pushes while retaining main
scans.

In `@docs/operations/cicd.md`:
- Line 84: Clarify the discrepancy between "13 container images" and "12 images
scanned" by adding a short explicit note in the image table or the Docker Scan &
Promote section stating which image is excluded from Trivy scanning and why
(e.g., explain that the `base` image is a build-only artifact or that
`frontend-dev` is not deployed to production and therefore not scanned);
reference the existing phrases "13 container images" and "12 deployed images" in
the note so readers can see which one is omitted and the reason for omission.
🧹 Nitpick comments (2)
docs/operations/cicd.md (2)

201-209: Add language identifier to the code fence.

The code fence at line 201 is missing a language identifier. While the content is a text-based diagram rather than code, adding text or plaintext as the language would satisfy linting rules and improve consistency.

📋 Suggested fix
-```
+```text
 e2e-boot (GHCR login + pull + k3s install)
     |
     ├── npm ci + playwright install (overlapped with k3s)

241-241: Fix grammatical structure.

The sentence lacks a subject. Adding "It" at the beginning would improve clarity.

✏️ Suggested fix
-`workflow_dispatch` with an optional SHA input to promote a specific commit.
+Runs automatically when `Stack Tests` completes successfully on `main`. It can also be triggered manually via
+`workflow_dispatch` with an optional SHA input to promote a specific commit.

@sonarqubecloud

@HardMax71 merged commit 2781815 into main on Feb 1, 2026 (23 of 24 checks passed).
@HardMax71 deleted the fix/e2e-tests-speedup branch on February 1, 2026 at 00:11.