
fix(deploy): isolate per-app failures so one bad app doesn't abort the batch #160

Merged
BradMclain merged 3 commits into develop from no-000/graceful-batch-fail
Jun 23, 2026

@BradMclain

Collaborator

Problem

App upgrades are deployed concurrently in Deployer.deploy() via asyncio.gather(...) without return_exceptions=True. The helm upgrade step itself is guarded with suppress_errors=True, but the steps before it are not:

  • cloning a custom git chart (temp_repo: git clone --branch ...)
  • helm dependency build

So a custom chart pointing at a git branch that doesn't exist raised an exception that escaped the per-app coroutine, propagated into the bare gather(), and cancelled every other app's deployment in the batch. The whole batch of upgrades got skipped because of one bad app.
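The failure mode described above can be reproduced in miniature. This is a sketch only: `deploy_app`, `deploy_batch` and `run_batch` are hypothetical stand-ins, not the project's real functions. It shows that with a bare `asyncio.gather()` the first exception reaches the caller, so the results of the healthy apps are never returned.

```python
import asyncio


async def deploy_app(name: str, fail: bool = False) -> dict:
    # Hypothetical stand-in for the per-app coroutine.
    # `fail` mimics a chart whose git branch does not exist.
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError(f"git clone failed for {name}")
    return {"app": name, "exit_code": 0}


async def deploy_batch() -> list:
    # Bare gather(): the first exception propagates to whoever awaits it,
    # so the caller never sees the results of the apps that succeeded.
    return await asyncio.gather(
        deploy_app("bad-app", fail=True),
        deploy_app("good-app"),
    )


def run_batch() -> str:
    try:
        asyncio.run(deploy_batch())
    except RuntimeError as exc:
        return f"batch aborted: {exc}"
    return "batch completed"
```

Strictly speaking, `gather()` does not cancel the sibling awaitables itself when one of them raises. They are abandoned once the caller unwinds, and `asyncio.run()` cancels whatever is still pending when the loop shuts down, which has the same practical effect on a long-running deployment.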

Fix

Wrap _update_app_deployment in update_app_deployment with a try/except that converts any per-app exception into a failed UpdateAppResult, routed through the existing post_result path (Slack alert, GitHub status, results summary). The broken app is now reported as failed while the rest of the batch deploys normally.
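A minimal sketch of the wrapper pattern, under stated assumptions: the `UpdateAppResult` fields other than `exit_code`, the `posted` list, and the function signatures are illustrative and do not mirror the real `deploy.py`. The point is that the public coroutine never raises, so `gather()` always receives one result per app.

```python
import asyncio
from typing import TypedDict


class UpdateAppResult(TypedDict):
    # Hypothetical shape; only "exit_code" is named in the PR.
    app_name: str
    exit_code: int
    output: str


posted: list[UpdateAppResult] = []


async def post_result(result: UpdateAppResult) -> None:
    # Stand-in for the Slack / GitHub / summary reporting path.
    posted.append(result)


async def _update_app_deployment(app_name: str) -> UpdateAppResult:
    # Stand-in for the real work; "bad-app" mimics a missing chart branch.
    if app_name == "bad-app":
        raise RuntimeError("git clone --branch failed")
    result = UpdateAppResult(app_name=app_name, exit_code=0, output="upgraded")
    await post_result(result)
    return result


async def update_app_deployment(app_name: str) -> UpdateAppResult:
    try:
        return await _update_app_deployment(app_name)
    except Exception as exc:
        # Convert the per-app failure into an ordinary failed result
        # and report it through the same path as any other result.
        result = UpdateAppResult(app_name=app_name, exit_code=1, output=str(exc))
        await post_result(result)
        return result


async def deploy(app_names: list[str]) -> list[UpdateAppResult]:
    return await asyncio.gather(*(update_app_deployment(n) for n in app_names))
```

Calling `asyncio.run(deploy(["bad-app", "good-app"]))` in this sketch returns two results, one failed and one successful, and both pass through `post_result`.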

Chose this over gather(return_exceptions=True) deliberately: the latter would push raw exception objects into update_results, which post_result_summary (r["exit_code"]) and the is not None filter don't handle — it would need extra unwrapping anyway.
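To illustrate the trade-off, here is a sketch of what `return_exceptions=True` would put into the results list. `count_failures` is a hypothetical stand-in for a summary step that indexes `r["exit_code"]`; the real `post_result_summary` may look different.

```python
import asyncio


async def good_app() -> dict:
    return {"app": "good-app", "exit_code": 0}


async def bad_app() -> dict:
    raise RuntimeError("missing branch")


async def gather_all() -> list:
    # Exceptions are returned in place of results instead of being raised.
    return await asyncio.gather(bad_app(), good_app(), return_exceptions=True)


def count_failures(results: list) -> int:
    # Hypothetical summary step that assumes every entry is dict-like.
    return sum(1 for r in results if r["exit_code"] != 0)


results = asyncio.run(gather_all())

# An exception instance is not None, so an "is not None" filter keeps it.
kept = [r for r in results if r is not None]
```

In this sketch `kept` still contains the `RuntimeError` instance, and `count_failures(kept)` fails with a `TypeError` because an exception object is not subscriptable. Every consumer of the list would need an `isinstance(r, BaseException)` check first.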

Test

Added test_one_app_failing_does_not_abort_the_batch: forces the first app's chart fetch to raise (as a missing branch would), asserts deploy() doesn't raise, the surviving app still runs its full helm flow, and both apps get a post_result (one exit_code 1, one 0). Verified it goes red with the fix reverted and green with it in place.
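The shape of such a test can be sketched as follows. `MiniDeployer` is a hypothetical miniature written only to host the assertions; the real test targets the project's `Deployer` and its own fixtures, which are not shown in this PR page.

```python
import asyncio
from unittest.mock import AsyncMock


class MiniDeployer:
    """Hypothetical miniature of the deployer, not the real class."""

    def __init__(self, apps, fetch_chart, helm_upgrade, post_result):
        self.apps = apps
        self.fetch_chart = fetch_chart
        self.helm_upgrade = helm_upgrade
        self.post_result = post_result

    async def _update(self, app):
        await self.fetch_chart(app)
        await self.helm_upgrade(app)
        return {"app": app, "exit_code": 0}

    async def update(self, app):
        try:
            result = await self._update(app)
        except Exception as exc:
            result = {"app": app, "exit_code": 1, "output": str(exc)}
        await self.post_result(result)
        return result

    async def deploy(self):
        return await asyncio.gather(*(self.update(a) for a in self.apps))


def fail_for_bad_app(app):
    # Mimics a chart fetch that fails because the branch is missing.
    if app == "bad-app":
        raise RuntimeError("branch not found")


def test_one_app_failing_does_not_abort_the_batch():
    fetch_chart = AsyncMock(side_effect=fail_for_bad_app)
    helm_upgrade = AsyncMock()
    post_result = AsyncMock()
    deployer = MiniDeployer(
        ["bad-app", "good-app"], fetch_chart, helm_upgrade, post_result
    )

    asyncio.run(deployer.deploy())  # must not raise

    # The surviving app still ran its helm step; the broken one never got there.
    helm_upgrade.assert_awaited_once_with("good-app")
    # Both apps were reported, one failed and one succeeded.
    codes = sorted(c.args[0]["exit_code"] for c in post_result.await_args_list)
    assert codes == [0, 1]
```

Removing the `try/except` in `update` makes `deployer.deploy()` raise, which is the red state the PR description refers to.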

Also included

uv.lock had drifted to 1.6.0 while pyproject.toml was already 1.6.1 — release-please's python release-type bumps version strings but doesn't regenerate uv.lock. Resynced the lock and added a uv.lock extra-files entry to release-please-config.json so future releases bump it automatically (per googleapis/release-please#2561).

🤖 Generated with Claude Code

BradMclain and others added 2 commits June 22, 2026 16:16
…e batch

App upgrades run concurrently via asyncio.gather() with no exception
isolation. The helm upgrade itself is guarded with suppress_errors=True,
but the steps before it (cloning a custom git chart, helm dependency
build) raise on failure. A custom chart pointing at a git branch that
doesn't exist therefore raised an exception that escaped the per-app
coroutine, propagated into the bare gather(), and cancelled every other
app's deployment in the batch.

Wrap _update_app_deployment in update_app_deployment with a try/except
that converts any per-app exception into a failed UpdateAppResult routed
through the existing post_result path (Slack alert, GitHub status,
results summary). The broken app is now reported as failed while the
rest of the batch deploys normally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
release-please's python release-type bumps version strings in
pyproject.toml, __init__.py and version.py, but does not regenerate
uv.lock. The lockfile embeds the project's own version in its
[[package]] entry, so it drifted behind pyproject.toml after the 1.6.1
release (lock still recorded 1.6.0).

Add uv.lock to the root package's extra-files with the toml updater
targeting the gitops package version, so future releases bump it
automatically (per googleapis/release-please#2561). Also resync the
lockfile to the current 1.6.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@BradMclain BradMclain requested review from KhueDuong and uptickmetachu and removed request for uptickmetachu June 22, 2026 06:31
@BradMclain BradMclain self-assigned this Jun 22, 2026
@BradMclain BradMclain requested a review from uptickmetachu June 22, 2026 06:31
@github-actions

github-actions Bot commented Jun 22, 2026

Contributor

Docker Images

Commit: 83dbc8e217e358ea51168eb955007ac76bceb9aa

Tag
610829907584.dkr.ecr.ap-southeast-2.amazonaws.com/gitops:test-83dbc8e

uptickmetachu previously approved these changes Jun 22, 2026

@uptickmetachu uptickmetachu left a comment

Collaborator


Good find!

I think we might want one more try/except block within post_result, as it's now another point of failure within the gather block.

Comment thread gitops_server/workers/deployer/deploy.py
…ort the batch

Addresses PR review: post_result makes Slack/GitHub network calls and runs
inside the asyncio.gather() in deploy(). It is invoked from call sites not
covered by update_app_deployment's guard (its own except handler and
uninstall_app), so a reporting failure could still cancel every other app's
deployment.

Wrap post_result's body in try/except that logs and swallows. The
success/failure bookkeeping (successful_apps/failed_apps) is recorded before
the network call, so it survives a notification failure.
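A sketch of the log-and-swallow guard, assuming a simplified signature: the `Reporter` class, the `notify` callable and the `post_result(app_name, exit_code)` parameters are hypothetical. Only `successful_apps` and `failed_apps` are names taken from the commit message.

```python
import asyncio
import logging

logger = logging.getLogger("deployer")


class Reporter:
    """Hypothetical holder for the reporting state, not the real class."""

    def __init__(self, notify):
        self.notify = notify  # async callable doing the Slack/GitHub calls
        self.successful_apps: set[str] = set()
        self.failed_apps: set[str] = set()

    async def post_result(self, app_name: str, exit_code: int) -> None:
        try:
            # Bookkeeping happens before any network call, so it is
            # already recorded if the notification below fails.
            if exit_code == 0:
                self.successful_apps.add(app_name)
            else:
                self.failed_apps.add(app_name)
            await self.notify(app_name, exit_code)
        except Exception:
            # Log and swallow: a reporting failure must not escape
            # into the surrounding asyncio.gather().
            logger.exception("Failed to post result for %s", app_name)


async def flaky_notify(app_name: str, exit_code: int) -> None:
    # Simulates a notification backend that is unreachable.
    raise ConnectionError("Slack unreachable")
```

Catching `Exception` rather than `BaseException` leaves `asyncio.CancelledError` untouched, so a deliberate cancellation of the batch still propagates.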

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@BradMclain BradMclain merged commit 29ff9c3 into develop Jun 23, 2026
3 checks passed
@BradMclain BradMclain deleted the no-000/graceful-batch-fail branch June 23, 2026 01:59