
Conversation

@HardMax71 (Owner) commented Jan 31, 2026


Summary by cubic

Speeds up local deploys by making observability optional, tightening healthcheck timing, and pulling pre-built images by default. CI builds SHA-tagged images, E2E uses them, all images are scanned, and SHA tags are promoted to latest after Stack Tests succeed (manual promotion is also available).

  • New Features

    • Added flags: --observability, --debug, and --no-build; profiles apply only when passed; Kafdrop moved to the "debug" profile.
    • CI/CD: push SHA-tagged images to GHCR; E2E pulls via IMAGE_TAG (frontend tests sharded); scan all images; auto-promote SHA → latest after Stack Tests pass; manual promotion supported.
  • Migration

    • Observability no longer starts by default. Use: ./deploy.sh dev --observability or ./deploy.sh dev --debug.
    • If you depend on Kafdrop at http://localhost:9000, start with --debug (a sketch of the flag handling follows below).
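
A minimal sketch of how the flag handling could look in deploy.sh: NO_BUILD_FLAG and PROFILE_FLAGS are the variables the PR actually introduces, but the parsing details here are assumptions, not the real script.

```bash
# Hypothetical flag parsing for `deploy.sh dev`; details are illustrative.
PROFILE_FLAGS=""
NO_BUILD_FLAG=""
for arg in "$@"; do
  case "$arg" in
    --observability) PROFILE_FLAGS+=" --profile observability" ;;
    # Assumption from the migration note: --debug also enables observability.
    --debug)         PROFILE_FLAGS+=" --profile observability --profile debug" ;;
    --no-build)      NO_BUILD_FLAG="--no-build" ;;
  esac
done
# Profiles apply only when passed, so nothing optional starts by default.
docker compose $PROFILE_FLAGS up -d $NO_BUILD_FLAG
```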

Written for commit 595c223. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Dev command: added --no-build, --observability, and --debug flags with help and conditional status output for observability/debug tools.
    • Compose: services now reference registry images and a debug profile to enable optional tooling.
  • Chores

    • Faster local startup via reduced healthcheck intervals and timing adjustments.
    • CI: consolidated scan-and-promote flow to promote SHA-tagged images to latest after scans; simplified build matrix; expanded triggers.
    • CI actions: added E2E boot/ready actions; removed legacy cache and k3s-setup actions.
  • Documentation

    • Major CI/CD guide refactor documenting new workflows, scans, promotion flow, and local run guidance.


@coderabbitai bot commented Jan 31, 2026

📝 Walkthrough

Adds profile-aware flags to deploy.sh (--no-build, --observability, --debug); replaces many local compose images with GHCR image references and tightens healthchecks; restructures CI into a scan+promote workflow and an expanded build/push in stack-tests; removes two composite actions (docker-cache, k3s-setup) and adds two new e2e composite actions (e2e-boot, e2e-ready); updates docs and workflow triggers.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Deploy script: deploy.sh | Adds CLI flags --no-build, --observability, --debug; introduces NO_BUILD_FLAG and PROFILE_FLAGS; extends parsing and info logs; runs docker compose with profile/no-build options; adjusts profile-aware service status output. |
| Docker Compose: docker-compose.yaml | Replaces many local images with ghcr.io/...:${IMAGE_TAG:-latest} entries; adds profiles: ["debug"] for kafdrop; removes redundant sleeps; tightens healthcheck intervals, timeouts, retries, and start_periods; adds explicit images for many services (backend, frontend-dev, workers, processors, cert-gen, etc.). |
| CI scan & promote: .github/workflows/docker.yml | Converts per-component build jobs into a matrix-driven scan job producing SHA-tagged images and Trivy SARIF outputs; adds a promote job that promotes SHA-tagged images to latest via crane; updates triggers, summaries, and artifact naming. |
| CI build/push & stack tests: .github/workflows/stack-tests.yml | Restructures the build into "Build & Push Images" with SHA tagging and GHCR pushes for many images (coordinator, workers, frontend-prod/dev, zookeeper-certgen, etc.); updates E2E orchestration to pull GHCR images; adds Playwright sharding, phased infra startup, and enhanced log/artifact handling. |
| Removed composite actions: .github/actions/docker-cache/action.yml, .github/actions/k3s-setup/action.yml | Deletes the prior composite actions that handled Docker cache load/save and the older k3s install/kubeconfig setup. |
| New composite actions: .github/actions/e2e-boot/action.yml, .github/actions/e2e-ready/action.yml | Adds "E2E Boot" to pre-pull/pre-warm images and install k3s, and "E2E Ready" to finalize k3s, start cert-generator in the background, run docker compose without rebuild, wait for services, and seed test users. |
| Workflow triggers & CI configs: .github/workflows/frontend-ci.yml, .github/workflows/mypy.yml, .github/workflows/ruff.yml, .github/workflows/security.yml | Restricts several workflows to the main branch and adjusts triggers/path filters to align with the new pipeline. |
| Docs: docs/operations/cicd.md | Large refactor documenting the multi-workflow pipeline (Stack Tests, Docker Scan & Promote, SBOM & Supply Chain), composite actions, image lifecycle, frontend E2E sharding, and updated local-run guidance and artifacts/summary changes. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  actor Dev as "Developer (CLI)"
  participant Deploy as "deploy.sh"
  participant Compose as "docker compose"
  participant Registry as "GHCR"
  participant CI as "GitHub Actions"
  participant Services as "Containers"

  Dev->>Deploy: run with flags (--no-build / --debug / --observability)
  Deploy->>Compose: set PROFILE_FLAGS and NO_BUILD_FLAG\nrun docker compose up -d
  alt --no-build / profiles include promoted/latest
    Compose->>Registry: pull promoted/latest or IMAGE_TAG images
  else build required
    CI->>Registry: build & push SHA-tagged images
    CI->>CI: run Trivy scans (SARIF)
    CI->>Registry: promote SHA -> latest (crane)
    Compose->>Registry: pull promoted/latest images
  end
  Compose->>Services: start containers (profiles applied)
  Services-->>Compose: healthchecks (shorter intervals)
  Compose-->>Deploy: return profile-aware service statuses
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • dependencies update #58 — Related changes to CI stack startup and image handling (pre-built images, compose startup without rebuild).
  • Deploy via helm #8 — Overlaps on GHCR image switching, IMAGE_TAG/sha handling, and docker-compose / CI changes.
  • fix: #118 #120 — Similar modifications around image tagging, compose behavior, and CI promotion mechanics.

Poem

🐇 I nudged the flags and hopped the queue,
GHCR gleamed as images flew anew.
Profiles woke services, healthchecks sped,
Scans crowned shas and pushed them ahead.
🥕 A little rabbit cheers this CI thread.

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, below the required 80.00% threshold. | Write docstrings for the functions that are missing them. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes: faster deploys through shorter healthcheck intervals and optional observability via CLI flags. |




@cubic-dev-ai bot left a comment

No issues found across 2 files

  stack-tests.yml — Build, Test, Push

  - Removed dev from branch triggers (only main now)
  - Added tags: ['v*'] and cert-generator/** path triggers
  - Build job now pushes to GHCR with immutable sha-{sha} tag (push events only)
  - Added missing pre-builds: event-replay, dlq-processor, zookeeper-certgen (these were being rebuilt during compose startup before)
  - Added frontend-prod build (from Dockerfile.prod, pushed as frontend:sha-xxx for Trivy scanning)
  - E2E jobs pull from GHCR on push events (parallel docker pulls, retagged to compose names), falling back to the artifact for PRs
  - All push/pull commands are spelled out explicitly (no for loops)
  - Added packages: write permission to build job

  docker.yml — Scan & Promote (rewritten)

  - Trigger: workflow_run on "Stack Tests" completion (+ workflow_dispatch with optional SHA input)
  - Only runs when Stack Tests succeed on main
  - Scan jobs: Trivy scans backend and frontend-prod from GHCR using SHA tag
  - Promote job: crane copy sha-xxx → latest for all 12 images — registry-level manifest copy, no rebuild
  - latest is NEVER set during build — only after all tests + scans pass

  Flow

  Push to main:
    stack-tests.yml: unit → build (push sha-xxx to GHCR) → E2E (pull from GHCR)
    docker.yml:      (on success) → scan → promote sha-xxx → latest

  PR:
    stack-tests.yml: unit → build (save artifact) → E2E (load artifact)
    docker.yml:      (skipped — only triggers on main)
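
For illustration, the promote step's registry-level copy can be sketched with crane; the tag below is a placeholder and the image list is abbreviated, not the exact set in docker.yml.

```bash
# Sketch: promote SHA-tagged images to :latest without rebuilding.
REGISTRY=ghcr.io
PREFIX=hardmax71/integr8scode
TAG=sha-xxx   # the actual value comes from the Stack Tests run
for img in base backend frontend coordinator k8s-worker; do
  crane copy "$REGISTRY/$PREFIX/$img:$TAG" "$REGISTRY/$PREFIX/$img:latest"
done
```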
@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/docker.yml (1)

46-56: ⚠️ Potential issue | 🟠 Major

Add exit-code: '1' to block promotion on CRITICAL/HIGH findings.

By default, aquasecurity/trivy-action sets exit-code: 0, so the step succeeds even when CRITICAL/HIGH vulnerabilities are found. This means the subsequent promote step would still run. Add exit-code: '1' to both Trivy scan steps to fail the job on severity-matched findings.

Note: When using format: 'sarif', the action outputs all vulnerabilities regardless of the severity filter. To ensure the exit code respects only your specified severities, also add limit-severities-for-sarif: true.

Suggested change
       severity: 'CRITICAL,HIGH'
+      exit-code: '1'

Applies to lines 46-56 and 89-98.

🤖 Fix all issues with AI agents
In @.github/workflows/docker.yml:
- Around line 170-182: The summary text assumes "All Stack Tests passed" which
is incorrect for manual workflow_dispatch runs; update the step that writes to
$GITHUB_STEP_SUMMARY to branch on the event type (use $GITHUB_EVENT_NAME or
github.event_name) and emit a different header/message when event ==
"workflow_dispatch" (e.g., "Images promoted manually — Stack Tests may not have
run") while keeping the existing text for normal CI runs (when event !=
"workflow_dispatch") and reuse the same $TAG/$REGISTRY/$PREFIX pull lines;
ensure the conditional wraps only the descriptive header line(s) not the image
table so the pull commands still print in both cases (see the first sketch below).
- Around line 129-153: The promote step currently copies all images using the
TAG variable with multiple crane copy commands (e.g., crane copy
"$REGISTRY/$PREFIX/base:$TAG" ...), but only backend and frontend are
Trivy-scanned; update the promote job so it only promotes images that have been
scanned or add Trivy scans for each promoted image before running the crane copy
commands. Concretely, either (A) restrict the list of crane copy targets to the
scanned images (reference the existing crane copy entries for backend and
frontend and remove/unpromote the other image names), or (B) insert a Trivy scan
loop that iterates the same image names (use PREFIX and TAG) and fails on
detected vulnerabilities before executing the crane copy lines for base,
coordinator, k8s-worker, pod-monitor, result-processor, saga-orchestrator,
event-replay, dlq-processor, cert-generator, and zookeeper-certgen (see the second sketch below).
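
For illustration, here are minimal sketches of the two fixes described above; step contents and variable wiring are assumptions, not the exact workflow code. The first branches the summary header on the triggering event while keeping the pull lines shared:

```bash
# Sketch: different header for manual promotions vs. normal CI runs.
if [ "$GITHUB_EVENT_NAME" = "workflow_dispatch" ]; then
  echo "## Images promoted manually (Stack Tests may not have run)" >> "$GITHUB_STEP_SUMMARY"
else
  echo "## All Stack Tests passed" >> "$GITHUB_STEP_SUMMARY"
fi
# The image table prints unconditionally in both cases:
echo "docker pull $REGISTRY/$PREFIX/backend:$TAG" >> "$GITHUB_STEP_SUMMARY"
```

The second is option (B), scanning every image before any crane copy runs (standard Trivy CLI flags; the image list is taken from the comment above):

```bash
# Sketch: fail before promotion if any image has CRITICAL/HIGH findings.
for img in base coordinator k8s-worker pod-monitor result-processor \
           saga-orchestrator event-replay dlq-processor cert-generator zookeeper-certgen; do
  trivy image --severity CRITICAL,HIGH --exit-code 1 "$REGISTRY/$PREFIX/$img:$TAG"
done
```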
🧹 Nitpick comments (2)
.github/workflows/stack-tests.yml (2)

322-357: Consider extracting the GHCR pull/retag logic into a composite action.

This entire block (login, parallel pulls, retagging) is duplicated verbatim in the frontend-e2e job (lines 474-509). Extracting it into a reusable composite action under .github/actions/ would reduce maintenance burden and ensure consistency.

♻️ Suggested approach

Create a composite action at .github/actions/pull-images/action.yml:

name: 'Pull Images from GHCR'
inputs:
  sha-tag:
    required: true
  image-prefix:
    required: true
  registry:
    required: true
  github-token:
    required: true
  github-actor:
    required: true
runs:
  using: 'composite'
  steps:
    - name: Pull and retag images from GHCR
      shell: bash
      env:
        TAG: ${{ inputs.sha-tag }}
        IMG: ${{ inputs.registry }}/${{ inputs.image-prefix }}
      run: |
        echo "${{ inputs.github-token }}" | docker login ghcr.io -u ${{ inputs.github-actor }} --password-stdin
        # ... parallel pulls and retag logic

Then use it in both E2E jobs:

- name: Pull images from GHCR
  if: github.event_name != 'pull_request'
  uses: ./.github/actions/pull-images
  with:
    sha-tag: ${{ needs.build-images.outputs.sha-tag }}
    image-prefix: ${{ needs.build-images.outputs.image-prefix }}
    registry: ${{ env.REGISTRY }}
    github-token: ${{ secrets.GITHUB_TOKEN }}
    github-actor: ${{ github.actor }}

331-343: Parallel pulls are efficient but failures may be harder to diagnose.

While wait blocks until all background pulls finish, it does not propagate their exit statuses, and the logs won't clearly indicate which specific image pull failed. For better debuggability, consider capturing PIDs and checking each exit status, or using xargs -P for parallel execution with clearer error reporting.

♻️ Alternative using xargs for clearer error output
-          docker pull "$IMG/base:$TAG" &
-          docker pull "$IMG/backend:$TAG" &
-          docker pull "$IMG/frontend-dev:$TAG" &
-          docker pull "$IMG/coordinator:$TAG" &
-          docker pull "$IMG/k8s-worker:$TAG" &
-          docker pull "$IMG/pod-monitor:$TAG" &
-          docker pull "$IMG/result-processor:$TAG" &
-          docker pull "$IMG/saga-orchestrator:$TAG" &
-          docker pull "$IMG/event-replay:$TAG" &
-          docker pull "$IMG/dlq-processor:$TAG" &
-          docker pull "$IMG/cert-generator:$TAG" &
-          docker pull "$IMG/zookeeper-certgen:$TAG" &
-          wait
+          echo "base backend frontend-dev coordinator k8s-worker pod-monitor result-processor saga-orchestrator event-replay dlq-processor cert-generator zookeeper-certgen" \
+            | tr ' ' '\n' \
+            | xargs -P 12 -I{} docker pull "$IMG/{}:$TAG"

… single matrix job (scan) that scans all 12 images in parallel:

  - fail-fast: false — one image's vulnerability findings don't cancel the other scans
  - Each matrix entry runs as its own parallel job on a separate runner
  - SARIF results uploaded per-image with unique categories (trivy-base, trivy-backend, etc.)
  - trivyignores: 'backend/.trivyignore' applied to all images (CVE exemptions are image-agnostic)
  - checkout@v6 included so the .trivyignore file is available

  Updated promote.needs from [scan-backend, scan-frontend] to [scan] — waits for all 12 matrix entries to pass before promoting anything to latest.

  Updated the summary security section to reflect that all 12 images are scanned.
@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/docker.yml (1)

50-72: ⚠️ Potential issue | 🟠 Major

Add GHCR login before Trivy scans private images.

Trivy cannot pull private GHCR images without registry authentication. While packages: read permission is already set, Trivy also needs credentials to authenticate. Add docker/login-action before the scan step with GITHUB_TOKEN, or alternatively pass TRIVY_USERNAME and TRIVY_PASSWORD env vars to the Trivy step.

🛠️ Suggested fix
       - uses: actions/checkout@v6

+      - name: Log in to GHCR
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
       - name: Compute image ref
         id: ref
         run: |
           PREFIX="${GITHUB_REPOSITORY_OWNER,,}/integr8scode"

  docker-compose.yaml (+15 lines): Every buildable service now has an image: field pointing to ghcr.io/hardmax71/integr8scode/{service}:${IMAGE_TAG:-latest}. kafka-init and user-seed share the backend image. Compose now knows where to pull pre-built images from.

  deploy.sh (+10 lines): Added --no-build flag to cmd_dev(). Passes --no-build to compose, preventing any build fallback.

  stack-tests.yml (-149 lines):
  - Build job: push condition changed from event_name != 'pull_request' to !github.event.pull_request.head.repo.fork (same-repo PRs can push to GHCR). Artifact save/upload removed entirely.
  - Both E2E jobs: Deleted all GHCR login, parallel pull, retag, artifact download, and load steps. Replaced with a single IMAGE_TAG env var on the "Start stack" step. Compose pulls SHA-tagged images from GHCR automatically using the image: fields.
  - Both E2E jobs have if: !fork guard — fork PRs skip E2E (unit tests still run).

  How it works

  | Scenario                                                | What happens                                     |
  |---------------------------------------------------------|--------------------------------------------------|
  | ./deploy.sh dev (local, first time)                     | Compose pulls latest from GHCR — no build needed |
  | ./deploy.sh dev --build (local, with changes)           | Builds locally, tags with GHCR name              |
  | CI: IMAGE_TAG=sha-xxx ./deploy.sh dev --no-build --wait | Compose pulls sha-tagged images from GHCR        |
  | ./deploy.sh prod                                        | Helm uses GHCR images (unchanged)                |
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/stack-tests.yml:
- Around line 137-141: The echoes that append outputs use an unquoted
$GITHUB_OUTPUT which can cause globbing/word-splitting; update the two echo
lines to redirect into the quoted variable (use >> "$GITHUB_OUTPUT") so that the
values computed in PREFIX and SHA_TAG (and the redirection target) are safely
handled; locate the lines that set PREFIX, SHA_TAG and the two echo lines and
change only the redirection target to be quoted.
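
Concretely, the fix is just quoting the redirection target; a sketch, with output names taken from the later steps that consume them:

```bash
# Quoted $GITHUB_OUTPUT avoids globbing/word-splitting of the target.
echo "image-prefix=$PREFIX" >> "$GITHUB_OUTPUT"
echo "sha-tag=$SHA_TAG" >> "$GITHUB_OUTPUT"
```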
🧹 Nitpick comments (3)
.github/workflows/stack-tests.yml (1)

170-176: Consider consolidating repetitive tag-and-push blocks.

Each image follows the same tag → push pattern. While functional, this repetition increases maintenance burden. A matrix strategy or a reusable script/action could reduce duplication.

Example: Loop-based push script
      - name: Push all images to GHCR
        if: ${{ !github.event.pull_request.head.repo.fork }}
        env:
          TAG: ${{ steps.tags.outputs.sha-tag }}
          IMG: ${{ env.REGISTRY }}/${{ steps.tags.outputs.image-prefix }}
        run: |
          images=(
            "integr8scode-base:base"
            "integr8scode-backend:backend"
            "integr8scode-coordinator:coordinator"
            # ... add remaining images
          )
          for entry in "${images[@]}"; do
            local_name="${entry%%:*}"
            remote_name="${entry##*:}"
            docker tag "$local_name:latest" "$IMG/$remote_name:$TAG"
            docker push "$IMG/$remote_name:$TAG"
          done

Also applies to: 189-211, 224-229, 241-246, 259-264, 276-281

docker-compose.yaml (2)

78-81: Reduced start_period values may cause premature healthcheck failures.

The start_period reductions (e.g., kafka from 15s→5s, backend/frontend from 10s→5s) assume faster container startup. If services occasionally take longer to initialize (cold cache, slower CI runners, resource contention), healthchecks may fail before the service is ready, causing cascading dependency failures.

Consider whether retries × interval provides sufficient total wait time, or if a slightly longer start_period is safer for CI stability.

| Service | Old start_period | New start_period | retries × interval |
|---------|------------------|------------------|--------------------|
| kafka | 15s | 5s | 12 × 5s = 60s |
| backend | 10s | 5s | 50 × 3s = 150s |
| frontend | 10s | 5s | 30 × 3s = 90s |
| redis | 10s | 5s | 5 × 5s = 25s |
The total windows (start_period + retries × interval) still seem reasonable, but monitor CI for flakiness.

Also applies to: 129-129, 159-159, 328-328
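
As a quick worked check of that reasoning, the worst-case wait window is start_period + retries × interval; a sketch using kafka's new values:

```bash
# Worst-case healthcheck window for kafka after the change.
start_period=5; retries=12; interval=5
echo "kafka window: $((start_period + retries * interval))s"   # prints: kafka window: 65s
```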


354-354: Kafdrop moved to debug profile — document access instructions.

Kafdrop is now only started with --debug flag. The PR description mentions it's available at http://localhost:9000 with --debug, which is helpful. Ensure the README or deploy script help text reflects this change so developers know how to access Kafka debugging tools.

  1. Playwright sharding (frontend-e2e)

  - Added strategy.matrix with shardIndex: [1, 2] and shardTotal: [2]
  - fail-fast: false so one shard failing doesn't cancel the other
  - Test command: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }}
  - Artifact names include shard index to avoid collisions: playwright-report-1, playwright-report-2, frontend-e2e-logs-1, etc.
  - Job name shows shard: Frontend E2E (1/2), Frontend E2E (2/2)

  2. GHCR Pre-pull (both E2E jobs)

  - Immediately after checkout, docker compose pull --quiet starts in the background via nohup
  - While GHCR images pull, the subsequent setup steps run in parallel:
    - backend-e2e: Docker cache load + k3s install (~85s of overlap)
    - frontend-e2e: Node setup + npm ci + Playwright install + Docker cache + k3s (~150s of overlap)
  - A "Wait for GHCR images" step before "Start stack" ensures pull is complete
  - "Start stack" then finds images already local — skips pulling entirely
  How it works

  Both E2E jobs now have this timeline:

  Step 2: Pre-pull GHCR images ──────────────────────────────── (background)
  Step 3-7: Node/Playwright/Docker cache setup ──────────────── (foreground, ~50s)
  Step 8: Docker-cache loads infra images ───────────────────── (~15s)
  Step 9: Pre-warm infrastructure ───────────────────────────── (background, starts immediately)
           ├── mongo + redis start (~5s to healthy)
           ├── shared-ca + cert-gen + zk-certgen start (~5s)
           ├── zookeeper starts after zk-certgen (~15s)
           ├── kafka starts after zookeeper healthy (~20s)
           └── schema-registry starts after kafka (~10s)
  Step 10: k3s install ──────────────────────────────────────── (~42s, OVERLAPS with infra chain)
  Step 12: Wait for background tasks ────────────────────────── (both should be done)
  Step 13: Start stack ──────────────────────────────────────── (infra already healthy, only app services)

  Expected impact on "Start stack"

  | Component                       | Before                    | After                        |
  |---------------------------------|---------------------------|------------------------------|
  | Infra initialization (zk chain) | ~50s (during Start stack) | 0s (already done during k3s) |
  | App image pull                  | ~60s                      | 0s (pre-pulled)              |
  | App service startup             | ~30s                      | ~30s                         |
  | Health check waits              | ~20s                      | ~20s                         |
  | Total "Start stack"             | ~2:20                     | ~0:50                        |
@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/workflows/stack-tests.yml:
- Around line 322-335: The loop that reads PID from each pidfile (pidfile, PID)
uses tail --pid to wait but never captures the background process exit code;
update the block that handles /tmp/ghcr-pull.pid and /tmp/infra-warm.pid so that
after detecting the process (kill -0) you still wait for it with wait "$PID" (or
capture its exit status) and if the exit status is non‑zero then cat the
corresponding log (/tmp/ghcr-pull.log or /tmp/infra-warm.log) to stdout/stderr
and exit the script with that same non‑zero code; ensure this change is applied
to both wait blocks so failures surface immediately rather than being hidden by
tail --pid.
- Around line 287-299: The backend-e2e and frontend-e2e jobs need explicit GHCR
auth and workflow permissions like the build-images job: add a top-level
permissions block for packages: read (and contents: read if needed) to each job
and insert a GHCR login step before any docker-compose/docker pull (use echo
"${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}"
--password-stdin or the same pattern used in build-images). Update the Pre-pull
GHCR images step (and any other image-pull steps) in the backend-e2e and
frontend-e2e jobs to perform the docker login using the GITHUB_TOKEN environment
then run the existing docker compose pull command.

  …~/.kube:/root/.kube. When Docker creates that bind-mount source directory, it creates it as root:root. Then k3s-setup's sudo k3s kubectl config view --raw > /home/runner/.kube/config fails because the shell redirect (>) runs as the runner user, who can't write to the root-owned directory.

  | Before                                  | After                                  |
  |-----------------------------------------|----------------------------------------|
  | 1. checkout                             | 1. checkout                            |
  | 2. GHCR pre-pull (bg)                   | 2. GHCR pre-pull (bg)                  |
  | 3. docker-cache                         | 3. config copy (moved up)              |
  | 4. infra pre-warm (bg)                  | 4. Install k3s (split from composite)  |
  | 5. k3s-setup (composite, ~45s blocking) | 5. docker-cache (runs during k3s boot) |
  | 6. config copy                          | 6. infra pre-warm (bg)                 |
  | 7. wait for bg                          | 7. Finalize k3s (~25s+ after install)  |
  | 8. start stack                          | 8. wait for bg                         |
  |                                         | 9. start stack                         |

  Key gain: k3s boot (30s) now overlaps with docker-cache (10-18s) instead of blocking sequentially. The composite k3s-setup action is inlined as "Install k3s" + "Finalize k3s", same pattern as frontend-e2e.

  Complete optimization summary across both files:
  1. docker-compose.yaml — Tightened health check intervals (5s→2-3s) and start periods (10s→3-5s) across all 7 services
  2. frontend-e2e — Inlined k3s, overlaps boot with Node + npm ci + Playwright (~50s overlap)
  3. backend-e2e — Inlined k3s, overlaps boot with docker-cache (~15s overlap)
  4. Both YAML files validated
@cubic-dev-ai bot left a comment

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name=".github/workflows/stack-tests.yml">

<violation number="1" location=".github/workflows/stack-tests.yml:309">
P2: Avoid piping an unverified remote install script into `sh`. This introduces a supply‑chain risk for CI runners; prefer a pinned k3s version with checksum verification (or the existing k3s setup action) instead.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

  Before:
  Build base → Push base (13s) → Build 8 workers → Push 8 workers (35s sequential)
  → Build cert-gen → Push cert-gen (7s) → Build zk-certgen → Push zk-certgen (8s)
  → Build frontend → Push frontend-dev (12s) → Build frontend-prod → Push frontend-prod (6s)

  After (14 steps, 1 parallel push step):
  Build base → Build 8 workers → Build cert-gen → Build zk-certgen
  → Build frontend → Build frontend-prod → Push all 13 in parallel (~15-20s)

  Expected savings: ~60s (81s sequential → ~20s parallel). Job should drop from 2m 48s → ~1m 50s.

  The builds are all done first (same total time), then all 13 pushes fire concurrently. Since they share base layers, Docker deduplicates — the first push uploads shared layers and the rest skip them.
@cubic-dev-ai bot left a comment

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name=".github/workflows/stack-tests.yml">

<violation number="1" location=".github/workflows/stack-tests.yml:253">
P2: Backgrounding docker pushes without checking each exit status can let failed image pushes go unnoticed. Track PIDs and `wait` them individually so the step fails if any push fails.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/stack-tests.yml:
- Around line 246-253: The parallel docker push loop that runs docker push
"$IMG/$name:$TAG" & and then calls wait can mask failures because wait only
returns the status of the last job; fix it by capturing background PIDs (e.g.,
push PIDs array when launching jobs in the for loop), then iteratively wait on
each PID and check its exit status; if any wait returns non-zero set a failure
flag and exit non-zero at the end (or immediately exit on first failure) so any
individual docker push failure is propagated from the workflow.
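
A sketch of that fix, assuming illustrative image names:

```bash
# Track each push's PID so a single failed push fails the whole step.
pids=()
for name in base backend coordinator k8s-worker; do
  docker push "$IMG/$name:$TAG" &
  pids+=($!)
done
status=0
for pid in "${pids[@]}"; do
  wait "$pid" || status=1
done
exit "$status"
```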
🧹 Nitpick comments (3)
.github/workflows/stack-tests.yml (3)

319-319: Quote variable in basename to prevent word splitting.

The $pidfile variable should be quoted for shell safety.

Proposed fix
-                echo "Waiting for $(basename $pidfile .pid)..."
+                echo "Waiting for $(basename "$pidfile" .pid)..."

481-481: Quote variable in basename to prevent word splitting.

Same issue as in backend-e2e - the $pidfile variable should be quoted.

Proposed fix
-                echo "Waiting for $(basename $pidfile .pid)..."
+                echo "Waiting for $(basename "$pidfile" .pid)..."

299-310: Consider extracting k3s setup to a reusable composite action.

The k3s installation and finalization logic is duplicated between backend-e2e (lines 278-310) and frontend-e2e (lines 424-467). Extracting this to a composite action in .github/actions/ would reduce duplication and simplify maintenance.

  1. Parallel GHCR pushes (build-images job):
  - Merged 6 separate push steps into 1 step that pushes all 13 images in parallel via for ... do docker push & done; wait
  - Expected: ~81s sequential → ~15-20s parallel (saves ~60s)

  2. Targeted health checks (both E2E jobs):
  - Replaced deploy.sh dev --no-build --wait (waits for ALL 15+ containers) with:
    - docker compose up -d --no-build (returns immediately, ~3s)
    - curl loop that only waits for backend (backend-e2e) or backend + frontend (frontend-e2e)
  - Workers start in background and become ready while tests run their initial setup
  - Expected: "Start stack" drops from ~2:01 to ~5s + "Wait for health" ~40-60s = ~45-65s total (saves ~60s)
  …-wait) takes 1:23 because depends_on: condition: service_healthy in docker-compose.yaml forces compose to wait for the entire dependency chain before creating dependent containers. Removing --wait only skipped the final "all healthy" check — the internal chain is the real bottleneck.

  Changes made (3 optimizations):

  1. Removed docker-cache step (saves ~1:08 blocking time)

  The docker-cache composite action was loading 5 infra images from GHA cache in ~68s of blocking foreground time. But docker compose pull (pre-pull) already fetches ALL images in background. Removed the redundant step.

  2. Merged pre-pull + pre-warm into single sequential background task

  Instead of: pre-pull (bg) → docker-cache (blocking 1:08) → pre-warm (bg)
  Now: docker compose pull && docker compose up -d ... infra all in one background process. Infra starts pulling + booting immediately after checkout, overlapping with all subsequent setup steps.

  3. Pre-start cert-generator after k3s finalize

  cert-generator is on the critical path: cert-gen(complete) → backend(healthy) → frontend. Starting it right after kubeconfig exists gives it a ~15-20s head start while we wait for pre-pull to finish.
…_started

  Impact: Compose no longer waits for backend to pass its health check (~35s) before creating the frontend container. Backend and frontend now boot in parallel during docker compose up -d.

  For frontend-e2e: "Start stack" should drop from 1:20 to 45-50s (no backend health wait in compose), and "Wait for backend+frontend" picks up the slack but runs in parallel (45s). Net: 2:03 → ~1:30, saving ~33s → job drops to ~5:00.

  For backend-e2e: Smaller impact since backend tests don't need frontend. "Start stack" drops slightly (~10s) since compose returns earlier. Job should be ~5:30.

  At this point we're approaching the hard floor:
  - Backend E2E: 3:00 tests + 100s minimum setup = ~4:40 floor, currently ~5:30 (50s over)
  - Frontend E2E: 2:11 tests + 80s minimum setup = ~3:31 floor, currently ~5:00 (89s over, mostly from the depends_on chain which is inherent to docker-compose)
  | Action                 | Purpose                                                                     |
  |------------------------|-----------------------------------------------------------------------------|
  | e2e-boot (new)         | GHCR login + pull/prewarm infra (bg) + k3s install                          |
  | e2e-ready (new)        | Finalize k3s + cert-gen + config + wait + start stack + health check + seed |
  | k3s-setup (deleted)    | Was inlined previously, never referenced                                    |
  | docker-cache (deleted) | Replaced by docker compose pull, never referenced                           |

  Step count reduction:
  - backend-e2e: 20 steps → 8 steps (checkout + 2 actions + test + coverage + logs)
  - frontend-e2e: 20 steps → 13 steps (checkout + e2e-boot + 5 Node/Playwright + e2e-ready + test + report + logs)

  Performance preserved: The split point between e2e-boot and e2e-ready is exactly where frontend-e2e interposes Node/Playwright setup, so k3s still boots in the background during that work.
@cubic-dev-ai bot left a comment

1 issue found across 5 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name=".github/actions/e2e-boot/action.yml">

<violation number="1" location=".github/actions/e2e-boot/action.yml:39">
P1: Downloading and executing the k3s install script without verifying its integrity exposes CI to supply‑chain tampering. Add checksum/signature verification (or pin to a trusted artifact) before execution.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/actions/e2e-boot/action.yml:
- Around line 22-33: The background nohup block that runs the docker compose
pull/up should capture and persist its exit status so failures are not lost;
modify the nohup script (the block starting with "nohup bash -c '" that writes
to /tmp/infra-pull.log and /tmp/infra-pull.pid) to write the final exit code
into a file such as /tmp/infra-pull.exit (e.g., run the commands, capture "$?"
and echo it to /tmp/infra-pull.exit before exiting), and update the e2e-ready
step to check /tmp/infra-pull.exit after waiting (fail if the file exists and
contains non-zero, printing /tmp/infra-pull.log for diagnostics).

In @.github/actions/e2e-ready/action.yml:
- Around line 43-54: The wait step currently only uses `tail --pid` so it blocks
until the background PID in /tmp/infra-pull.pid exits but doesn't check its exit
code; change the logic in the "Wait for image pull and infra" step to poll the
background PID (e.g., check /proc/$PID) or wait for a dedicated exit-status file
written by the background job (e.g., /tmp/infra-pull.exit), and after the PID
disappears read that exit-status file (or capture the background process status)
and fail the step (exit non-zero) if the background job reported an error;
update references to /tmp/infra-pull.pid, /tmp/infra-pull.log and the chosen
exit-status file so logs are still printed but the step fails when the infra
pull/warm fails.
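
Both prompts above converge on an exit-file handshake between the two actions; a sketch under those assumptions (service names illustrative):

```bash
# Writer side (e2e-boot): persist the background job's exit status.
nohup bash -c '
  docker compose pull --quiet && docker compose up -d mongo redis
  echo $? > /tmp/infra-pull.exit
' > /tmp/infra-pull.log 2>&1 &
echo $! > /tmp/infra-pull.pid

# Reader side (e2e-ready): wait, then fail loudly if the job failed.
tail --pid="$(cat /tmp/infra-pull.pid)" -f /dev/null
if [ "$(cat /tmp/infra-pull.exit 2>/dev/null)" != "0" ]; then
  cat /tmp/infra-pull.log
  exit 1
fi
```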
🧹 Nitpick comments (2)
.github/actions/e2e-boot/action.yml (1)

35-41: K3S_VERSION is read from environment but not declared as input.

This step relies on K3S_VERSION being set in the calling workflow's env block. If the caller doesn't define it, the script will fail silently or produce an invalid URL. Consider adding it as an input with a default, or document the dependency.

Option: Add K3S_VERSION as an input
 inputs:
   image-tag:
     description: 'GHCR image tag (e.g., sha-abc1234)'
     required: true
   github-token:
     description: 'GitHub token for GHCR authentication'
     required: true
+  k3s-version:
+    description: 'k3s version to install (e.g., v1.32.11+k3s1)'
+    required: false
+    default: 'v1.32.11+k3s1'

Then use ${{ inputs.k3s-version }} in the step.

.github/actions/e2e-ready/action.yml (1)

29-35: Cert-generator background process PID not tracked.

Unlike the infra-pull process, the cert-generator's PID isn't captured, so there's no way to wait for it or check its exit status. If cert generation fails, the error surfaces only when dependent services fail to start.

Capture PID for consistency
     - name: Start cert-generator (background)
       shell: bash
       env:
         IMAGE_TAG: ${{ inputs.image-tag }}
       run: |
         nohup docker compose up -d --no-build cert-generator \
-          > /tmp/cert-gen.log 2>&1 &
+          > /tmp/cert-gen.log 2>&1 &
+          echo $! > /tmp/cert-gen.pid

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.github/workflows/security.yml:
- Around line 4-7: The security workflow currently restricts triggers by having
push and pull_request set to "branches: [ main ]", which skips scans on the dev
branch; update the workflow triggers (the push and pull_request entries) to
include the dev branch as well (e.g., change branches: [ main ] to branches: [
main, dev ]) so security scans run for dev PRs and pushes while retaining main
scans.

In `@docs/operations/cicd.md`:
- Line 84: Clarify the discrepancy between "13 container images" and "12 images
scanned" by adding a short explicit note in the image table or the Docker Scan &
Promote section stating which image is excluded from Trivy scanning and why
(e.g., explain that the `base` image is a build-only artifact or that
`frontend-dev` is not deployed to production and therefore not scanned);
reference the existing phrases "13 container images" and "12 deployed images" in
the note so readers can see which one is omitted and the reason for omission.
🧹 Nitpick comments (2)
docs/operations/cicd.md (2)

201-209: Add language identifier to the code fence.

The code fence at line 201 is missing a language identifier. While the content is a text-based diagram rather than code, adding text or plaintext as the language would satisfy linting rules and improve consistency.

📋 Suggested fix
-```
+```text
 e2e-boot (GHCR login + pull + k3s install)
     |
     ├── npm ci + playwright install (overlapped with k3s)

241-241: Fix grammatical structure.

The sentence lacks a subject. Adding "It" at the beginning would improve clarity.

✏️ Suggested fix
-`workflow_dispatch` with an optional SHA input to promote a specific commit.
+Runs automatically when `Stack Tests` completes successfully on `main`. It can also be triggered manually via
+`workflow_dispatch` with an optional SHA input to promote a specific commit.

@sonarqubecloud

@HardMax71 merged commit 2781815 into main on Feb 1, 2026 (23 of 24 checks passed).
@HardMax71 deleted the fix/e2e-tests-speedup branch on February 1, 2026 at 00:11.