feat(chart): zero-downtime UI rollout option + 409-tolerant reindex #32
Merged
The UI Deployment hardcoded strategy: Recreate. With a single replica that is correct for the single-writer ReadWriteOnce index, but it means every image change tears the old pod down before the new one is Ready, leaving the ingress with no backend for the pull+boot window (surfaces as a 502 / "no available server"). On a deployment that auto-updates on each beta image, the public UI flaps on every build.

- Make ui.strategy configurable (default unchanged: Recreate). Operators whose volume tolerates same-node multi-attach (k3s local-path, or ReadWriteMany) AND whose UI is read-only during the overlap (demo mode) can opt into a surge-first RollingUpdate (maxUnavailable: 0, maxSurge: 1) for seamless rollouts. Worst case on a misjudged volume is a stalled-but-up rollout, never an outage.
- reindex CronJob: treat HTTP 409 (a build already in progress) as a benign no-op instead of a hard curl failure, so a periodic refresh overlapping the per-upgrade init Job no longer fails the Job and piles up Error pods via backoff retries. Any other non-2xx response still fails.
- Bump chart 0.1.1 -> 0.1.2.
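The 409 handling described above could be sketched roughly as follows. This is not the chart's actual script: the function names, the `REINDEX_URL` variable, and the use of `POST` are assumptions made for illustration. The idea is to drop `-f` so curl does not abort on 409, capture the status code, and decide the exit status explicitly.

```shell
#!/bin/sh
# Hypothetical sketch; REINDEX_URL and both function names are illustrative,
# not taken from the chart.

# Map an HTTP status code to an exit status:
# any 2xx and 409 succeed, everything else fails.
reindex_exit_code() {
  case "$1" in
    2??) echo "reindex accepted (HTTP $1)" >&2; return 0 ;;
    409) echo "build already in progress (HTTP 409), skipping" >&2; return 0 ;;
    *)   echo "reindex failed (HTTP $1)" >&2; return 1 ;;
  esac
}

# Send the request and print only the status code.
# No -f here: with -f, curl would exit non-zero on 409 before we can inspect it.
# The POST method is an assumption about the reindex endpoint.
request_reindex() {
  curl -sS -o /dev/null -w '%{http_code}' -X POST "$1"
}

# Only contact the server when a URL is supplied.
if [ -n "${REINDEX_URL:-}" ]; then
  # A transport-level failure (DNS, connection refused) still fails the Job.
  status=$(request_reindex "$REINDEX_URL") || exit 1
  reindex_exit_code "$status"
fi
```

Keeping the status-to-exit-code mapping in its own function makes the 409 branch easy to check without a running server.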
Why
`coderag-ui.neverdecel.com` intermittently served Traefik's "no available server" page. Root cause: the UI Deployment is `strategy: Recreate` with a single replica (correct for the single-writer RWO index), so every image change kills the old pod before the new one is Ready. The dev deployment auto-rolls on each new `beta-*` image, so on a busy build day the public demo drops for the ~40-90s pull+boot window on every build.

What
- `ui.strategy` is now configurable, default unchanged (`Recreate`), so there is no behavior change for existing installs. A deployment whose volume tolerates same-node multi-attach (k3s local-path, or ReadWriteMany) and whose UI is read-only during the overlap (demo mode, Reindex hidden) can opt into a surge-first `RollingUpdate` (`maxUnavailable: 0`, `maxSurge: 1`) so the new pod goes Ready before the old one is removed, leaving no backend gap. Worst case if the volume can't double-mount is a stalled-but-still-up rollout, never an outage.
- Reindex CronJob: a periodic `full=false` refresh that overlaps the per-upgrade `full=true` init Job got 409 ("already indexing"); `curl -fsS` treated that as failure → backoff retries → a pile of Error pods. 409 is now a benign no-op (exit 0); any other non-2xx response still fails.
- Chart version bumped `0.1.1` → `0.1.2`.

Verified with
`helm lint` + `helm template`: default renders `Recreate`, override renders `RollingUpdate{maxUnavailable:0,maxSurge:1}`, reindex script renders the 409 branch.

Follow-up
starnode-core flips the dev UI to `RollingUpdate{0,1}` once this is on master (the chart GitRepository tracks master).
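For illustration, an opt-in override might look like the sketch below. It assumes that `ui.strategy` is passed through as the Deployment's `strategy` block, which this PR does not spell out. The file name, the `CHART_DIR` variable, and the `coderag` release name are hypothetical.

```shell
#!/bin/sh
# Assumption: ui.strategy mirrors the Kubernetes Deployment strategy block.
# OVERRIDE_FILE, CHART_DIR and the release name "coderag" are illustrative.
OVERRIDE_FILE="${OVERRIDE_FILE:-values-rolling.yaml}"

# Write a values override that keeps the old pod until the new one is Ready.
cat > "$OVERRIDE_FILE" <<'EOF'
ui:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove the old pod before the new one is Ready
      maxSurge: 1         # allow one extra pod during the overlap
EOF

# Render the chart with the override when helm and a chart directory are available.
if command -v helm >/dev/null 2>&1 && [ -d "${CHART_DIR:-}" ]; then
  helm template coderag "$CHART_DIR" -f "$OVERRIDE_FILE" | grep -A 4 'strategy:'
fi
```

This only suits volumes that tolerate two pods mounting at once on the same node. On any other volume the new pod never becomes Ready and the rollout stalls with the old pod still serving.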