making deploy quicker - less intervals, observability now optional #123
Conversation
📝 Walkthrough
Adds profile-aware flags to deploy.sh and the docker compose setup (--no-build, --debug, --observability).
Sequence Diagram(s)
sequenceDiagram
actor Dev as "Developer (CLI)"
participant Deploy as "deploy.sh"
participant Compose as "docker compose"
participant Registry as "GHCR"
participant CI as "GitHub Actions"
participant Services as "Containers"
Dev->>Deploy: run with flags (--no-build / --debug / --observability)
Deploy->>Compose: set PROFILE_FLAGS and NO_BUILD_FLAG\nrun docker compose up -d
alt --no-build / profiles include promoted/latest
Compose->>Registry: pull promoted/latest or IMAGE_TAG images
else build required
CI->>Registry: build & push SHA-tagged images
CI->>CI: run Trivy scans (SARIF)
CI->>Registry: promote SHA -> latest (crane)
Compose->>Registry: pull promoted/latest images
end
Compose->>Services: start containers (profiles applied)
Services-->>Compose: healthchecks (shorter intervals)
Compose-->>Deploy: return profile-aware service statuses
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
No issues found across 2 files
stack-tests.yml — Build, Test, Push
- Removed dev from branch triggers (only main now)
- Added tags: ['v*'] and cert-generator/** path triggers
- Build job now pushes to GHCR with immutable sha-{sha} tag (push events only); see the sketch after this list
- Added missing pre-builds: event-replay, dlq-processor, zookeeper-certgen (these were being rebuilt during compose startup before)
- Added frontend-prod build (from Dockerfile.prod, pushed as frontend:sha-xxx for Trivy scanning)
- E2E jobs pull from GHCR on push events (parallel `docker pull` jobs in the background, then retag to compose names) and fall back to artifacts for PRs
- All push/pull commands are spelled out explicitly (no for loops)
- Added packages: write permission to build job
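A rough sketch of what one of those explicit tag-and-push steps might look like; the registry, prefix, and image names here are assumptions based on this description, not the actual workflow:

```bash
#!/usr/bin/env bash
# Hypothetical tag-and-push for one image in the build job (names assumed).
set -euo pipefail

REGISTRY="ghcr.io"                            # assumed registry
PREFIX="hardmax71/integr8scode"               # assumed image prefix
SHA_TAG="sha-$(git rev-parse --short HEAD)"   # immutable per-commit tag

docker tag integr8scode-backend:latest "$REGISTRY/$PREFIX/backend:$SHA_TAG"
docker push "$REGISTRY/$PREFIX/backend:$SHA_TAG"
```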
docker.yml — Scan & Promote (rewritten)
- Trigger: workflow_run on "Stack Tests" completion (+ workflow_dispatch with optional SHA input)
- Only runs when Stack Tests succeed on main
- Scan jobs: Trivy scans backend and frontend-prod from GHCR using SHA tag
- Promote job: crane copy sha-xxx → latest for all 12 images — registry-level manifest copy, no rebuild (see the sketch after this list)
- latest is NEVER set during build — only after all tests + scans pass
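A minimal sketch of that promotion step (tag value and image list are illustrative; per the review below, the real workflow spells the copies out rather than looping):

```bash
#!/usr/bin/env bash
# Hypothetical promote step: registry-level manifest copy from the SHA tag to latest, no rebuild.
set -euo pipefail

REGISTRY="ghcr.io"
PREFIX="hardmax71/integr8scode"   # assumed prefix
TAG="sha-abc1234"                 # placeholder for the tag that passed Stack Tests and scans

for image in base backend frontend coordinator k8s-worker pod-monitor result-processor \
             saga-orchestrator event-replay dlq-processor cert-generator zookeeper-certgen; do
  # crane copy performs the copy inside the registry; layers are never pulled locally.
  crane copy "$REGISTRY/$PREFIX/$image:$TAG" "$REGISTRY/$PREFIX/$image:latest"
done
```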
Flow
Push to main:
stack-tests.yml: unit → build (push sha-xxx to GHCR) → E2E (pull from GHCR)
docker.yml: (on success) → scan → promote sha-xxx → latest
PR:
stack-tests.yml: unit → build (save artifact) → E2E (load artifact)
docker.yml: (skipped — only triggers on main)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/docker.yml (1)
46-56: ⚠️ Potential issue | 🟠 Major

Add `exit-code: '1'` to block promotion on CRITICAL/HIGH findings. By default, `aquasecurity/trivy-action` sets `exit-code: 0`, so the step succeeds even when CRITICAL/HIGH vulnerabilities are found. This means the subsequent `promote` step would still run. Add `exit-code: '1'` to both Trivy scan steps to fail the job on severity-matched findings.

Note: when using `format: 'sarif'`, the action outputs all vulnerabilities regardless of the `severity` filter. To ensure the exit code respects only your specified severities, also add `limit-severities-for-sarif: true`.

Suggested change:

        severity: 'CRITICAL,HIGH'
    +   exit-code: '1'

Applies to lines 46-56 and 89-98.
🤖 Fix all issues with AI agents
In @.github/workflows/docker.yml:
- Around line 170-182: The summary text assumes "All Stack Tests passed" which
is incorrect for manual workflow_dispatch runs; update the step that writes to
$GITHUB_STEP_SUMMARY to branch on the event type (use $GITHUB_EVENT_NAME or
github.event_name) and emit a different header/message when event ==
"workflow_dispatch" (e.g., "Images promoted manually — Stack Tests may not have
run") while keeping the existing text for normal CI runs (when event !=
"workflow_dispatch") and reuse the same $TAG/$REGISTRY/$PREFIX pull lines;
ensure the conditional wraps only the descriptive header line(s) not the image
table so the pull commands still print in both cases.
- Around line 129-153: The promote step currently copies all images using the
TAG variable with multiple crane copy commands (e.g., crane copy
"$REGISTRY/$PREFIX/base:$TAG" ...), but only backend and frontend are
Trivy-scanned; update the promote job so it only promotes images that have been
scanned or add Trivy scans for each promoted image before running the crane copy
commands. Concretely, either (A) restrict the list of crane copy targets to the
scanned images (reference the existing crane copy entries for backend and
frontend and remove/unpromote the other image names), or (B) insert a Trivy scan
loop that iterates the same image names (use PREFIX and TAG) and fails on
detected vulnerabilities before executing the crane copy lines for base,
coordinator, k8s-worker, pod-monitor, result-processor, saga-orchestrator,
event-replay, dlq-processor, cert-generator, and zookeeper-certgen.
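A hedged sketch of option (B): scan every image slated for promotion and abort before any crane copy if findings exist (standard Trivy CLI flags; whether the workflow uses the CLI or the action is not shown here):

```bash
#!/usr/bin/env bash
# Hypothetical pre-promotion gate: every image must pass Trivy before crane copy runs.
set -euo pipefail

REGISTRY="ghcr.io"
PREFIX="hardmax71/integr8scode"   # assumed prefix
TAG="sha-abc1234"                 # placeholder SHA tag

for image in base coordinator k8s-worker pod-monitor result-processor saga-orchestrator \
             event-replay dlq-processor cert-generator zookeeper-certgen backend frontend; do
  # --exit-code 1 makes trivy fail this step on CRITICAL/HIGH findings.
  trivy image --severity CRITICAL,HIGH --exit-code 1 "$REGISTRY/$PREFIX/$image:$TAG"
done

# Only reached when all scans pass; the existing crane copy commands would follow here.
```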
🧹 Nitpick comments (2)
.github/workflows/stack-tests.yml (2)
322-357: Consider extracting the GHCR pull/retag logic into a composite action.

This entire block (login, parallel pulls, retagging) is duplicated verbatim in the `frontend-e2e` job (lines 474-509). Extracting it into a reusable composite action under `.github/actions/` would reduce maintenance burden and ensure consistency.

♻️ Suggested approach

Create a composite action at `.github/actions/pull-images/action.yml`:

    name: 'Pull Images from GHCR'
    inputs:
      sha-tag:
        required: true
      image-prefix:
        required: true
      registry:
        required: true
      github-token:
        required: true
      github-actor:
        required: true
    runs:
      using: 'composite'
      steps:
        - name: Pull and retag images from GHCR
          shell: bash
          env:
            TAG: ${{ inputs.sha-tag }}
            IMG: ${{ inputs.registry }}/${{ inputs.image-prefix }}
          run: |
            echo "${{ inputs.github-token }}" | docker login ghcr.io -u ${{ inputs.github-actor }} --password-stdin
            # ... parallel pulls and retag logic

Then use it in both E2E jobs:

    - name: Pull images from GHCR
      if: github.event_name != 'pull_request'
      uses: ./.github/actions/pull-images
      with:
        sha-tag: ${{ needs.build-images.outputs.sha-tag }}
        image-prefix: ${{ needs.build-images.outputs.image-prefix }}
        registry: ${{ env.REGISTRY }}
        github-token: ${{ secrets.GITHUB_TOKEN }}
        github-actor: ${{ github.actor }}
331-343: Parallel pulls are efficient, but failures may be harder to diagnose.

While `wait` correctly fails if any background job fails, the logs won't clearly indicate which specific image pull failed. For better debuggability, consider capturing PIDs and checking each exit status, or using `xargs -P` for parallel execution with clearer error reporting.

♻️ Alternative using xargs for clearer error output

    - docker pull "$IMG/base:$TAG" &
    - docker pull "$IMG/backend:$TAG" &
    - docker pull "$IMG/frontend-dev:$TAG" &
    - docker pull "$IMG/coordinator:$TAG" &
    - docker pull "$IMG/k8s-worker:$TAG" &
    - docker pull "$IMG/pod-monitor:$TAG" &
    - docker pull "$IMG/result-processor:$TAG" &
    - docker pull "$IMG/saga-orchestrator:$TAG" &
    - docker pull "$IMG/event-replay:$TAG" &
    - docker pull "$IMG/dlq-processor:$TAG" &
    - docker pull "$IMG/cert-generator:$TAG" &
    - docker pull "$IMG/zookeeper-certgen:$TAG" &
    - wait
    + echo "base backend frontend-dev coordinator k8s-worker pod-monitor result-processor saga-orchestrator event-replay dlq-processor cert-generator zookeeper-certgen" \
    +   | tr ' ' '\n' \
    +   | xargs -P 12 -I{} docker pull "$IMG/{}:$TAG"
… single matrix job (scan) that scans all 12 images in parallel:
- fail-fast: false — one image's vulnerability findings don't cancel the other scans
- Each matrix entry runs as its own parallel job on a separate runner
- SARIF results uploaded per-image with unique categories (trivy-base, trivy-backend, etc.)
- trivyignores: 'backend/.trivyignore' applied to all images (CVE exemptions are image-agnostic)
- checkout@v6 included so the .trivyignore file is available

Updated promote.needs from [scan-backend, scan-frontend] to [scan] — waits for all 12 matrix entries to pass before promoting anything to latest. Updated the summary security section to reflect that all 12 images are scanned.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/docker.yml (1)
50-72: ⚠️ Potential issue | 🟠 Major

Add GHCR login before Trivy scans private images.

Trivy cannot pull private GHCR images without registry authentication. While `packages: read` permission is already set, Trivy also needs credentials to authenticate. Add `docker/login-action` before the scan step with `GITHUB_TOKEN`, or alternatively pass `TRIVY_USERNAME` and `TRIVY_PASSWORD` env vars to the Trivy step.

🛠️ Suggested fix

      - uses: actions/checkout@v6
    + - name: Log in to GHCR
    +   uses: docker/login-action@v3
    +   with:
    +     registry: ${{ env.REGISTRY }}
    +     username: ${{ github.actor }}
    +     password: ${{ secrets.GITHUB_TOKEN }}
    +
      - name: Compute image ref
        id: ref
        run: |
          PREFIX="${GITHUB_REPOSITORY_OWNER,,}/integr8scode"
docker-compose.yaml (+15 lines): Every buildable service now has an image: field pointing to ghcr.io/hardmax71/integr8scode/{service}:${IMAGE_TAG:-latest}. kafka-init and user-seed share the backend image. Compose now knows where to pull pre-built images from.
deploy.sh (+10 lines): Added --no-build flag to cmd_dev(). Passes --no-build to compose, preventing any build fallback.
stack-tests.yml (-149 lines):
- Build job: push condition changed from event_name != 'pull_request' to !github.event.pull_request.head.repo.fork (same-repo PRs can push to GHCR). Artifact save/upload removed entirely.
- Both E2E jobs: Deleted all GHCR login, parallel pull, retag, artifact download, and load steps. Replaced with a single IMAGE_TAG env var on the "Start stack" step. Compose pulls SHA-tagged images from GHCR automatically using the image: fields.
- Both E2E jobs have if: !fork guard — fork PRs skip E2E (unit tests still run).
How it works
| Scenario | What happens |
|---------------------------------------------------------|--------------------------------------------------|
| ./deploy.sh dev (local, first time) | Compose pulls latest from GHCR — no build needed |
| ./deploy.sh dev --build (local, with changes) | Builds locally, tags with GHCR name |
| CI: IMAGE_TAG=sha-xxx ./deploy.sh dev --no-build --wait | Compose pulls sha-tagged images from GHCR |
| ./deploy.sh prod | Helm uses GHCR images (unchanged) |
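A minimal sketch of how the `--no-build` path could look inside `cmd_dev()`; function and variable names are assumptions, and only the flag pass-through and the `IMAGE_TAG` default come from this PR:

```bash
#!/usr/bin/env bash
# Hypothetical excerpt of deploy.sh (not the real script).
set -euo pipefail

cmd_dev() {
  local no_build_flag=""
  for arg in "$@"; do
    case "$arg" in
      --no-build) no_build_flag="--no-build" ;;  # never fall back to building locally
    esac
  done

  # The compose file's image: fields use ${IMAGE_TAG:-latest}, so CI can pin a SHA tag.
  docker compose up -d ${no_build_flag}
}

# CI invocation from the table above:
#   IMAGE_TAG=sha-xxx ./deploy.sh dev --no-build --wait
```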
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/stack-tests.yml:
- Around line 137-141: The echoes that append outputs use an unquoted
$GITHUB_OUTPUT which can cause globbing/word-splitting; update the two echo
lines to redirect into the quoted variable (use >> "$GITHUB_OUTPUT") so that the
values computed in PREFIX and SHA_TAG (and the redirection target) are safely
handled; locate the lines that set PREFIX, SHA_TAG and the two echo lines and
change only the redirection target to be quoted.
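For reference, a quoted version of those two lines might look like this (the output names and the SHA_TAG expression are assumed from context elsewhere in this thread):

```bash
# Hypothetical tags step with the redirection target quoted.
PREFIX="${GITHUB_REPOSITORY_OWNER,,}/integr8scode"
SHA_TAG="sha-${GITHUB_SHA::7}"

# Quoting "$GITHUB_OUTPUT" avoids word splitting/globbing on the redirection target.
echo "image-prefix=$PREFIX" >> "$GITHUB_OUTPUT"
echo "sha-tag=$SHA_TAG" >> "$GITHUB_OUTPUT"
```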
🧹 Nitpick comments (3)
.github/workflows/stack-tests.yml (1)
170-176: Consider consolidating repetitive tag-and-push blocks.

Each image follows the same tag → push pattern. While functional, this repetition increases maintenance burden. A matrix strategy or a reusable script/action could reduce duplication.

Example: Loop-based push script

    - name: Push all images to GHCR
      if: ${{ !github.event.pull_request.head.repo.fork }}
      env:
        TAG: ${{ steps.tags.outputs.sha-tag }}
        IMG: ${{ env.REGISTRY }}/${{ steps.tags.outputs.image-prefix }}
      run: |
        images=(
          "integr8scode-base:base"
          "integr8scode-backend:backend"
          "integr8scode-coordinator:coordinator"
          # ... add remaining images
        )
        for entry in "${images[@]}"; do
          local_name="${entry%%:*}"
          remote_name="${entry##*:}"
          docker tag "$local_name:latest" "$IMG/$remote_name:$TAG"
          docker push "$IMG/$remote_name:$TAG"
        done

Also applies to: 189-211, 224-229, 241-246, 259-264, 276-281
docker-compose.yaml (2)
78-81: Reduced `start_period` values may cause premature healthcheck failures.

The `start_period` reductions (e.g., kafka from 15s→5s, backend/frontend from 10s→5s) assume faster container startup. If services occasionally take longer to initialize (cold cache, slower CI runners, resource contention), healthchecks may fail before the service is ready, causing cascading dependency failures. Consider whether retries × interval provides sufficient total wait time, or if a slightly longer `start_period` is safer for CI stability.

| Service  | Old start_period | New start_period | retries × interval |
|----------|------------------|------------------|--------------------|
| kafka    | 15s              | 5s               | 12 × 5s = 60s      |
| backend  | 10s              | 5s               | 50 × 3s = 150s     |
| frontend | 10s              | 5s               | 30 × 3s = 90s      |
| redis    | 10s              | 5s               | 5 × 5s = 25s       |

The total windows (start_period + retries × interval) still seem reasonable, but monitor CI for flakiness.
Also applies to: 129-129, 159-159, 328-328
354-354: Kafdrop moved to `debug` profile — document access instructions.

Kafdrop is now only started with the `--debug` flag. The PR description mentions it's available at `http://localhost:9000` with `--debug`, which is helpful. Ensure the README or deploy script help text reflects this change so developers know how to access Kafka debugging tools.
- Added strategy.matrix with shardIndex: [1, 2] and shardTotal: [2]
- fail-fast: false so one shard failing doesn't cancel the other
- Test command: npx playwright test --shard=${{ matrix.shardIndex }}/${{ matrix.shardTotal }}
- Artifact names include shard index to avoid collisions: playwright-report-1, playwright-report-2, frontend-e2e-logs-1, etc.
- Job name shows shard: Frontend E2E (1/2), Frontend E2E (2/2)
2. GHCR Pre-pull (both E2E jobs)
- Immediately after checkout, docker compose pull --quiet starts in the background via nohup
- While GHCR images pull, the subsequent setup steps run in parallel:
- backend-e2e: Docker cache load + k3s install (~85s of overlap)
- frontend-e2e: Node setup + npm ci + Playwright install + Docker cache + k3s (~150s of overlap)
- A "Wait for GHCR images" step before "Start stack" ensures pull is complete
- "Start stack" then finds images already local — skips pulling entirely
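Roughly, the pre-pull/wait pair described above could look like this (paths match the ones discussed later in this thread; note that the reviews below point out this form does not surface the pull's exit status):

```bash
# Hypothetical "Pre-pull GHCR images" step (background).
nohup docker compose pull --quiet > /tmp/ghcr-pull.log 2>&1 &
echo $! > /tmp/ghcr-pull.pid

# ...Node/Playwright/docker-cache/k3s setup runs here while the pull proceeds...

# Hypothetical "Wait for GHCR images" step, just before "Start stack".
PID="$(cat /tmp/ghcr-pull.pid)"
if kill -0 "$PID" 2>/dev/null; then
  tail --pid="$PID" -f /dev/null   # blocks until the background pull exits (GNU tail)
fi
```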
How it works
Both E2E jobs now have this timeline:
Step 2: Pre-pull GHCR images ──────────────────────────────── (background)
Step 3-7: Node/Playwright/Docker cache setup ──────────────── (foreground, ~50s)
Step 8: Docker-cache loads infra images ───────────────────── (~15s)
Step 9: Pre-warm infrastructure ───────────────────────────── (background, starts immediately)
├── mongo + redis start (~5s to healthy)
├── shared-ca + cert-gen + zk-certgen start (~5s)
├── zookeeper starts after zk-certgen (~15s)
├── kafka starts after zookeeper healthy (~20s)
└── schema-registry starts after kafka (~10s)
Step 10: k3s install ──────────────────────────────────────── (~42s, OVERLAPS with infra chain)
Step 12: Wait for background tasks ────────────────────────── (both should be done)
Step 13: Start stack ──────────────────────────────────────── (infra already healthy, only app services)
Expected impact on "Start stack"
| Component | Before | After |
|---------------------------------|---------------------------|------------------------------|
| Infra initialization (zk chain) | ~50s (during Start stack) | 0s (already done during k3s) |
| App image pull | ~60s | 0s (pre-pulled) |
| App service startup | ~30s | ~30s |
| Health check waits | ~20s | ~20s |
| Total "Start stack" | ~2:20 | ~0:50 |
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.github/workflows/stack-tests.yml:
- Around line 322-335: The loop that reads PID from each pidfile (pidfile, PID)
uses tail --pid to wait but never captures the background process exit code;
update the block that handles /tmp/ghcr-pull.pid and /tmp/infra-warm.pid so that
after detecting the process (kill -0) you still wait for it with wait "$PID" (or
capture its exit status) and if the exit status is non‑zero then cat the
corresponding log (/tmp/ghcr-pull.log or /tmp/infra-warm.log) to stdout/stderr
and exit the script with that same non‑zero code; ensure this change is applied
to both wait blocks so failures surface immediately rather than being hidden by
tail --pid.
- Around line 287-299: The backend-e2e and frontend-e2e jobs need explicit GHCR
auth and workflow permissions like the build-images job: add a top-level
permissions block for packages: read (and contents: read if needed) to each job
and insert a GHCR login step before any docker-compose/docker pull (use echo
"${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}"
--password-stdin or the same pattern used in build-images). Update the Pre-pull
GHCR images step (and any other image-pull steps) in the backend-e2e and
frontend-e2e jobs to perform the docker login using the GITHUB_TOKEN environment
then run the existing docker compose pull command.
…~/.kube:/root/.kube. When Docker creates that bind mount source directory, it creates it as root:root. Then k3s-setup's sudo k3s kubectl config view --raw > /home/runner/.kube/config fails because the shell redirect (>) runs as the runner user who can't write to the root-owned directory.
| Before | After |
|--------|-------|
| 1. checkout | 1. checkout |
| 2. GHCR pre-pull (bg) | 2. GHCR pre-pull (bg) |
| 3. docker-cache | 3. config copy (moved up) |
| 4. infra pre-warm (bg) | 4. Install k3s (split from composite) |
| 5. k3s-setup (composite, ~45s blocking) | 5. docker-cache (runs during k3s boot) |
| 6. config copy | 6. infra pre-warm (bg) |
| 7. wait for bg | 7. Finalize k3s (~25s+ after install) |
| 8. start stack | 8. wait for bg |
| | 9. start stack |

Key gain: k3s boot (30s) now overlaps with docker-cache (10-18s) instead of blocking sequentially. The composite k3s-setup action is inlined as "Install k3s" + "Finalize k3s", same pattern as frontend-e2e.

Complete optimization summary across both files:
1. docker-compose.yaml — Tightened health check intervals (5s→2-3s) and start periods (10s→3-5s) across all 7 services
2. frontend-e2e — Inlined k3s, overlaps boot with Node + npm ci + Playwright (~50s overlap)
3. backend-e2e — Inlined k3s, overlaps boot with docker-cache (~15s overlap)
4. Both YAML files validated
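One way the reordered "config copy" step can sidestep the root-owned directory (a sketch; the actual step may differ):

```bash
# Hypothetical config-copy step: create ~/.kube as the runner user before any bind mount
# or root process can create it as root:root, then write the kubeconfig via sudo tee so
# no shell redirect runs as the unprivileged runner user.
mkdir -p "$HOME/.kube"
sudo k3s kubectl config view --raw | sudo tee "$HOME/.kube/config" > /dev/null
sudo chown "$(id -un):$(id -gn)" "$HOME/.kube/config"
chmod 600 "$HOME/.kube/config"
```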
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/workflows/stack-tests.yml">
<violation number="1" location=".github/workflows/stack-tests.yml:309">
P2: Avoid piping an unverified remote install script into `sh`. This introduces a supply‑chain risk for CI runners; prefer a pinned k3s version with checksum verification (or the existing k3s setup action) instead.</violation>
</file>
Before (6 sequential push steps):

Build base → Push base (13s) → Build 8 workers → Push 8 workers (35s sequential) → Build cert-gen → Push cert-gen (7s) → Build zk-certgen → Push zk-certgen (8s) → Build frontend → Push frontend-dev (12s) → Build frontend-prod → Push frontend-prod (6s)

After (14 steps, 1 parallel push step):

Build base → Build 8 workers → Build cert-gen → Build zk-certgen → Build frontend → Build frontend-prod → Push all 13 in parallel (~15-20s)

Expected savings: ~60s (81s sequential → ~20s parallel). Job should drop from 2m 48s → ~1m 50s. The builds are all done first (same total time), then all 13 pushes fire concurrently. Since they share base layers, Docker deduplicates — the first push uploads shared layers and the rest skip them.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/workflows/stack-tests.yml">
<violation number="1" location=".github/workflows/stack-tests.yml:253">
P2: Backgrounding docker pushes without checking each exit status can let failed image pushes go unnoticed. Track PIDs and `wait` them individually so the step fails if any push fails.</violation>
</file>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/stack-tests.yml:
- Around line 246-253: The parallel docker push loop that runs docker push
"$IMG/$name:$TAG" & and then calls wait can mask failures because wait only
returns the status of the last job; fix it by capturing background PIDs (e.g.,
push PIDs array when launching jobs in the for loop), then iteratively wait on
each PID and check its exit status; if any wait returns non-zero set a failure
flag and exit non-zero at the end (or immediately exit on first failure) so any
individual docker push failure is propagated from the workflow.
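A minimal sketch of the PID-tracking fix (the image list and the `IMG`/`TAG` variables follow the naming used earlier in this thread; treat it as illustrative):

```bash
# Hypothetical parallel push with per-PID exit-status checks.
pids=()
for name in base backend frontend-dev frontend-prod coordinator k8s-worker pod-monitor \
            result-processor saga-orchestrator event-replay dlq-processor \
            cert-generator zookeeper-certgen; do
  docker push "$IMG/$name:$TAG" &
  pids+=("$!")
done

failed=0
for pid in "${pids[@]}"; do
  wait "$pid" || failed=1   # a single failed push marks the whole step as failed
done
exit "$failed"
```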
🧹 Nitpick comments (3)
.github/workflows/stack-tests.yml (3)
319-319: Quote variable in `basename` to prevent word splitting.

The `$pidfile` variable should be quoted for shell safety.

Proposed fix

    - echo "Waiting for $(basename $pidfile .pid)..."
    + echo "Waiting for $(basename "$pidfile" .pid)..."
481-481: Quote variable in `basename` to prevent word splitting.

Same issue as in backend-e2e — the `$pidfile` variable should be quoted.

Proposed fix

    - echo "Waiting for $(basename $pidfile .pid)..."
    + echo "Waiting for $(basename "$pidfile" .pid)..."
299-310: Consider extracting k3s setup to a reusable composite action.

The k3s installation and finalization logic is duplicated between `backend-e2e` (lines 278-310) and `frontend-e2e` (lines 424-467). Extracting this to a composite action in `.github/actions/` would reduce duplication and simplify maintenance.
1. Parallel GHCR pushes (build-images job):
- Merged 6 separate push steps into 1 step that pushes all 13 images in parallel via for ... do docker push & done; wait
- Expected: ~81s sequential → ~15-20s parallel (saves ~60s)
2. Targeted health checks (both E2E jobs):
- Replaced deploy.sh dev --no-build --wait (waits for ALL 15+ containers) with:
- docker compose up -d --no-build (returns immediately, ~3s)
- curl loop that only waits for backend (backend-e2e) or backend + frontend (frontend-e2e)
- Workers start in background and become ready while tests run their initial setup
- Expected: "Start stack" drops from ~2:01 to ~5s + "Wait for health" ~40-60s = ~45-65s total (saves ~60s)
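A sketch of that targeted wait loop; the health URL, scheme, and timeout are assumptions, since the endpoint isn't spelled out in this thread:

```bash
# Hypothetical "Wait for health" step: poll only the services the tests actually need.
set -euo pipefail

wait_for() {
  local url="$1" deadline=$((SECONDS + 180))
  until curl -fsk "$url" > /dev/null 2>&1; do
    if (( SECONDS >= deadline )); then
      echo "Timed out waiting for $url" >&2
      return 1
    fi
    sleep 2
  done
}

wait_for "https://localhost/api/v1/health"   # backend (endpoint assumed)
# frontend-e2e would additionally wait on the frontend URL before running Playwright.
```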
…-wait) takes 1:23 because depends_on: condition: service_healthy in docker-compose.yaml forces compose to wait for the entire dependency chain before creating dependent containers. Removing --wait only skipped the final "all healthy" check — the internal chain is the real bottleneck.

Changes made (3 optimizations):

1. Removed docker-cache step (saves ~1:08 blocking time). The docker-cache composite action was loading 5 infra images from GHA cache in ~68s of blocking foreground time. But docker compose pull (pre-pull) already fetches ALL images in background. Removed the redundant step.

2. Merged pre-pull + pre-warm into single sequential background task. Instead of: pre-pull (bg) → docker-cache (blocking 1:08) → pre-warm (bg). Now: docker compose pull && docker compose up -d ... infra all in one background process. Infra starts pulling + booting immediately after checkout, overlapping with all subsequent setup steps.

3. Pre-start cert-generator after k3s finalize. cert-generator is on the critical path: cert-gen(complete) → backend(healthy) → frontend. Starting it right after kubeconfig exists gives it a ~15-20s head start while we wait for pre-pull to finish.
…_started

Impact: Compose no longer waits for backend to pass its health check (~35s) before creating the frontend container. Backend and frontend now boot in parallel during docker compose up -d.

For frontend-e2e: "Start stack" should drop from 1:20 to 45-50s (no backend health wait in compose), and "Wait for backend+frontend" picks up the slack but runs in parallel (45s). Net: 2:03 → ~1:30, saving ~33s → job drops to ~5:00.

For backend-e2e: Smaller impact since backend tests don't need frontend. "Start stack" drops slightly (~10s) since compose returns earlier. Job should be ~5:30.

At this point we're approaching the hard floor:
- Backend E2E: 3:00 tests + 100s minimum setup = ~4:40 floor, currently ~5:30 (50s over)
- Frontend E2E: 2:11 tests + 80s minimum setup = ~3:31 floor, currently ~5:00 (89s over, mostly from the depends_on chain which is inherent to docker-compose)
| Action | Purpose |
|--------|---------|
| e2e-boot (new) | GHCR login + pull/prewarm infra (bg) + k3s install |
| e2e-ready (new) | Finalize k3s + cert-gen + config + wait + start stack + health check + seed |
| k3s-setup (deleted) | Was inlined previously, never referenced |
| docker-cache (deleted) | Replaced by docker compose pull, never referenced |

Step count reduction:
- backend-e2e: 20 steps → 8 steps (checkout + 2 actions + test + coverage + logs)
- frontend-e2e: 20 steps → 13 steps (checkout + e2e-boot + 5 Node/Playwright + e2e-ready + test + report + logs)

Performance preserved: the split point between e2e-boot and e2e-ready is exactly where frontend-e2e interposes Node/Playwright setup, so k3s still boots in the background during that work.
1 issue found across 5 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/actions/e2e-boot/action.yml">
<violation number="1" location=".github/actions/e2e-boot/action.yml:39">
P1: Downloading and executing the k3s install script without verifying its integrity exposes CI to supply‑chain tampering. Add checksum/signature verification (or pin to a trusted artifact) before execution.</violation>
</file>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.github/actions/e2e-boot/action.yml:
- Around line 22-33: The background nohup block that runs the docker compose
pull/up should capture and persist its exit status so failures are not lost;
modify the nohup script (the block starting with "nohup bash -c '" that writes
to /tmp/infra-pull.log and /tmp/infra-pull.pid) to write the final exit code
into a file such as /tmp/infra-pull.exit (e.g., run the commands, capture "$?"
and echo it to /tmp/infra-pull.exit before exiting), and update the e2e-ready
step to check /tmp/infra-pull.exit after waiting (fail if the file exists and
contains non-zero, printing /tmp/infra-pull.log for diagnostics).
In @.github/actions/e2e-ready/action.yml:
- Around line 43-54: The wait step currently only uses `tail --pid` so it blocks
until the background PID in /tmp/infra-pull.pid exits but doesn't check its exit
code; change the logic in the "Wait for image pull and infra" step to poll the
background PID (e.g., check /proc/$PID) or wait for a dedicated exit-status file
written by the background job (e.g., /tmp/infra-pull.exit), and after the PID
disappears read that exit-status file (or capture the background process status)
and fail the step (exit non-zero) if the background job reported an error;
update references to /tmp/infra-pull.pid, /tmp/infra-pull.log and the chosen
exit-status file so logs are still printed but the step fails when the infra
pull/warm fails.
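A combined sketch of both fixes: the background job records its exit code to a file, and the wait step reads it (paths are the ones named above; the surrounding action YAML and the exact service list are omitted or assumed):

```bash
# e2e-boot side (hypothetical): run pull/up in the background and persist the exit status.
nohup bash -c '
  docker compose pull --quiet && docker compose up -d --no-build mongo redis zookeeper kafka
  echo "$?" > /tmp/infra-pull.exit
' > /tmp/infra-pull.log 2>&1 &
echo $! > /tmp/infra-pull.pid

# e2e-ready side (hypothetical): wait for the PID, then fail if the recorded status is non-zero.
PID="$(cat /tmp/infra-pull.pid)"
while kill -0 "$PID" 2>/dev/null; do sleep 2; done
status="$(cat /tmp/infra-pull.exit 2>/dev/null || echo 1)"
if [ "$status" -ne 0 ]; then
  echo "Infra pull/warm failed (exit $status); log follows:" >&2
  cat /tmp/infra-pull.log >&2
  exit "$status"
fi
```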
🧹 Nitpick comments (2)
.github/actions/e2e-boot/action.yml (1)
35-41: `K3S_VERSION` is read from environment but not declared as input.

This step relies on `K3S_VERSION` being set in the calling workflow's `env` block. If the caller doesn't define it, the script will fail silently or produce an invalid URL. Consider adding it as an input with a default, or document the dependency.

Option: Add K3S_VERSION as an input

      inputs:
        image-tag:
          description: 'GHCR image tag (e.g., sha-abc1234)'
          required: true
        github-token:
          description: 'GitHub token for GHCR authentication'
          required: true
    +   k3s-version:
    +     description: 'k3s version to install (e.g., v1.32.11+k3s1)'
    +     required: false
    +     default: 'v1.32.11+k3s1'

Then use `${{ inputs.k3s-version }}` in the step.

.github/actions/e2e-ready/action.yml (1)
29-35: Cert-generator background process PID not tracked.

Unlike the infra-pull process, the cert-generator's PID isn't captured, so there's no way to wait for it or check its exit status. If cert generation fails, the error surfaces only when dependent services fail to start.

Capture PID for consistency

      - name: Start cert-generator (background)
        shell: bash
        env:
          IMAGE_TAG: ${{ inputs.image-tag }}
        run: |
          nohup docker compose up -d --no-build cert-generator \
    -       > /tmp/cert-gen.log 2>&1 &
    +       > /tmp/cert-gen.log 2>&1 &
    +     echo $! > /tmp/cert-gen.pid
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.github/workflows/security.yml:
- Around line 4-7: The security workflow currently restricts triggers by having
push and pull_request set to "branches: [ main ]", which skips scans on the dev
branch; update the workflow triggers (the push and pull_request entries) to
include the dev branch as well (e.g., change branches: [ main ] to branches: [
main, dev ]) so security scans run for dev PRs and pushes while retaining main
scans.
In `@docs/operations/cicd.md`:
- Line 84: Clarify the discrepancy between "13 container images" and "12 images
scanned" by adding a short explicit note in the image table or the Docker Scan &
Promote section stating which image is excluded from Trivy scanning and why
(e.g., explain that the `base` image is a build-only artifact or that
`frontend-dev` is not deployed to production and therefore not scanned);
reference the existing phrases "13 container images" and "12 deployed images" in
the note so readers can see which one is omitted and the reason for omission.
🧹 Nitpick comments (2)
docs/operations/cicd.md (2)
201-209: Add language identifier to the code fence.

The code fence at line 201 is missing a language identifier. While the content is a text-based diagram rather than code, adding `text` or `plaintext` as the language would satisfy linting rules and improve consistency.

📋 Suggested fix

    -```
    +```text
     e2e-boot (GHCR login + pull + k3s install)
       |
       ├── npm ci + playwright install (overlapped with k3s)
241-241: Fix grammatical structure.

The sentence lacks a subject. Adding "It" at the beginning would improve clarity.

✏️ Suggested fix

    -`workflow_dispatch` with an optional SHA input to promote a specific commit.
    +Runs automatically when `Stack Tests` completes successfully on `main`. It can also be triggered manually via
    +`workflow_dispatch` with an optional SHA input to promote a specific commit.
Summary by cubic
Make local deploys faster by making observability optional, tightening healthcheck timing, and pulling pre-built images by default. CI builds SHA-tagged images, E2E uses them, scans all images, and promotes to latest after Stack Tests succeed (manual promote available).
Written for commit 595c223. Summary will update on new commits.