fix(deploy): isolate per-app failures so one bad app doesn't abort the batch#160
Merged
Conversation
…e batch App upgrades run concurrently via asyncio.gather() with no exception isolation. The helm upgrade itself is guarded with suppress_errors=True, but the steps before it (cloning a custom git chart, helm dependency build) raise on failure. A custom chart pointing at a git branch that doesn't exist therefore raised an exception that escaped the per-app coroutine, propagated into the bare gather(), and cancelled every other app's deployment in the batch. Wrap _update_app_deployment in update_app_deployment with a try/except that converts any per-app exception into a failed UpdateAppResult routed through the existing post_result path (Slack alert, GitHub status, results summary). The broken app is now reported as failed while the rest of the batch deploys normally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
release-please's python release-type bumps version strings in pyproject.toml, __init__.py and version.py, but does not regenerate uv.lock. The lockfile embeds the project's own version in its [[package]] entry, so it drifted behind pyproject.toml after the 1.6.1 release (lock still recorded 1.6.0). Add uv.lock to the root package's extra-files with the toml updater targeting the gitops package version, so future releases bump it automatically (per googleapis/release-please#2561). Also resync the lockfile to the current 1.6.1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Docker ImagesCommit:
|
uptickmetachu
previously approved these changes
Jun 22, 2026
uptickmetachu
left a comment
Collaborator
There was a problem hiding this comment.
Good find!
I think we might want 1 more try catch block within the post_result as now its another point of failure within the gather block.
…ort the batch Addresses PR review: post_result makes Slack/GitHub network calls and runs inside the asyncio.gather() in deploy(). It is invoked from call sites not covered by update_app_deployment's guard (its own except handler and uninstall_app), so a reporting failure could still cancel every other app's deployment. Wrap post_result's body in try/except that logs and swallows. The success/failure bookkeeping (successful_apps/failed_apps) is recorded before the network call, so it survives a notification failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
uptickmetachu
approved these changes
Jun 23, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
App upgrades are deployed concurrently in
Deployer.deploy()viaasyncio.gather(...)withoutreturn_exceptions=True. Thehelm upgradestep itself is guarded withsuppress_errors=True, but the steps before it are not:temp_repo→git clone --branch ...)helm dependency buildSo a custom chart pointing at a git branch that doesn't exist raised an exception that escaped the per-app coroutine, propagated into the bare
gather(), and cancelled every other app's deployment in the batch. The whole batch of upgrades got skipped because of one bad app.Fix
Wrap
_update_app_deploymentinupdate_app_deploymentwith atry/exceptthat converts any per-app exception into a failedUpdateAppResult, routed through the existingpost_resultpath (Slack alert, GitHub status, results summary). The broken app is now reported as failed while the rest of the batch deploys normally.Chose this over
gather(return_exceptions=True)deliberately: the latter would push raw exception objects intoupdate_results, whichpost_result_summary(r["exit_code"]) and theis not Nonefilter don't handle — it would need extra unwrapping anyway.Test
Added
test_one_app_failing_does_not_abort_the_batch: forces the first app's chart fetch to raise (as a missing branch would), assertsdeploy()doesn't raise, the surviving app still runs its full helm flow, and both apps get apost_result(one exit_code 1, one 0). Verified it goes red with the fix reverted and green with it in place.Also included
uv.lockhad drifted to1.6.0whilepyproject.tomlwas already1.6.1— release-please's python release-type bumps version strings but doesn't regenerateuv.lock. Resynced the lock and added auv.lockextra-filesentry torelease-please-config.jsonso future releases bump it automatically (per googleapis/release-please#2561).🤖 Generated with Claude Code