feat(chart): zero-downtime UI rollout option + 409-tolerant reindex #32

Merged: Neverdecel merged 1 commit into master from chore/zero-downtime-ui-rollout on Jun 16, 2026

@Neverdecel (Owner)

Why

coderag-ui.neverdecel.com intermittently served Traefik's "no available server" page. Root cause: the UI Deployment is strategy: Recreate with a single replica (correct for the single-writer RWO index), so every image change kills the old pod before the new one is Ready. The dev deployment auto-rolls on each new beta-* image, so on a busy build day the public demo drops for the ~40-90s pull+boot window on every build.

What

  • ui.strategy is now configurable, with the default unchanged (Recreate), so existing installs see no behavior change. A deployment whose volume tolerates same-node multi-attach (k3s local-path, or ReadWriteMany) and whose UI is read-only during the overlap (demo mode, Reindex hidden) can opt into a zero-unavailable RollingUpdate (maxUnavailable: 0, maxSurge: 1), so the new pod goes Ready before the old one is removed and the ingress never loses its backend. If the volume cannot be mounted twice, the worst case is a stalled rollout with the old pod still serving, never an outage.
  • Reindex CronJob tolerates HTTP 409. A periodic full=false refresh that overlaps the per-upgrade full=true init Job got 409 ("already indexing"); curl -fsS treated that as a failure, which triggered backoff retries and left a pile of Error pods. 409 is now a benign no-op (exit 0); any other non-2xx response still fails.
  • Chart 0.1.1 → 0.1.2.
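
For readers who want to see the shape of the 409 tolerance, here is a minimal sketch of how such a reindex script could be written. It is not the script shipped in the chart: the endpoint URL, the POST method, and the function names `classify_status` and `reindex_once` are assumptions made for illustration. The key idea is to drop `-f` and let curl report the status code, so the script can decide which codes count as failure.

```shell
#!/bin/sh
# Hypothetical endpoint: the real host, port, and path are not shown in this PR.
REINDEX_URL="${REINDEX_URL:-http://coderag-ui:8080/reindex?full=false}"

# Map an HTTP status code to an exit code:
# 2xx and 409 succeed, everything else fails.
classify_status() {
  case "$1" in
    2[0-9][0-9]) echo "reindex accepted (HTTP $1)"; return 0 ;;
    409)         echo "already indexing (HTTP 409), nothing to do"; return 0 ;;
    *)           echo "reindex failed (HTTP $1)" >&2; return 1 ;;
  esac
}

# No -f here: curl prints the status code instead of failing on 4xx/5xx.
# A transport error (DNS, connection refused) still fails the call.
reindex_once() {
  status=$(curl -sS -o /dev/null -w '%{http_code}' -X POST "$REINDEX_URL") || return 1
  classify_status "$status"
}
```

Calling `reindex_once` as the last command of the CronJob container would make the Job succeed on 2xx and 409, and fail (and retry with backoff) on anything else.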

Verified with helm lint + helm template: default renders Recreate, override renders RollingUpdate{maxUnavailable:0,maxSurge:1}, reindex script renders the 409 branch.
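
The override check described above could be reproduced roughly as follows. This is a sketch under assumptions: it supposes that `ui.strategy` mirrors the Deployment's `spec.strategy` block, and the chart path, release name, and values file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical paths and names; adjust to the actual chart checkout.
VALUES_FILE="${VALUES_FILE:-values-rolling.yaml}"
CHART_DIR="${CHART_DIR:-./chart}"

# Assumed value shape: ui.strategy is passed through to spec.strategy.
cat > "$VALUES_FILE" <<'EOF'
ui:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep the old pod until the new one is Ready
      maxSurge: 1         # allow one extra pod during the overlap
EOF

# Render only when helm and the chart are available, then show the strategy block.
if command -v helm >/dev/null 2>&1 && [ -d "$CHART_DIR" ]; then
  helm template coderag "$CHART_DIR" -f "$VALUES_FILE" | grep -A 4 'strategy:'
fi
```

Rendering without `-f "$VALUES_FILE"` should show the default `Recreate` strategy instead.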

Follow-up

starnode-core flips the dev UI to RollingUpdate{0,1} once this is on master (the chart GitRepository tracks master).

The UI Deployment hardcoded strategy: Recreate. With a single replica that
is correct for the single-writer ReadWriteOnce index, but it means every image
change tears the old pod down before the new one is Ready, leaving the ingress
with no backend for the pull+boot window (surfaces as a 502 / "no available
server"). On a deployment that auto-updates on each beta image, the public UI
flaps on every build.

- Make ui.strategy configurable (default unchanged: Recreate). Operators whose
  volume tolerates same-node multi-attach (k3s local-path, ReadWriteMany) AND
  whose UI is read-only during the overlap (demo mode) can opt into a
  zero-unavailable RollingUpdate (maxUnavailable: 0, maxSurge: 1) for seamless
  rollouts. Worst case on a misjudged volume is a stalled-but-up rollout,
  never an outage.
- reindex CronJob: treat HTTP 409 (a build already in progress) as a benign
  no-op instead of a hard curl failure, so a periodic refresh overlapping the
  per-upgrade init Job no longer fails the Job and piles up Error pods via
  backoff retries. Any other non-2xx response still fails.
- Bump chart 0.1.1 -> 0.1.2.

@Neverdecel merged commit d9d50f5 into master on Jun 16, 2026 (12 checks passed)
@Neverdecel deleted the chore/zero-downtime-ui-rollout branch on June 18, 2026 08:01