diego-release: bump executor to gardenhealth retry fix (tnz-96144) #1140
Open
navinms711 wants to merge 2 commits into
Points the executor submodule at the upstream-merged commit
9e97ac1b ("gardenhealth: retry initial health check until timeout")
on navinms711/executor main, replacing the old devperf proto-version
at 63bb0792 (tnz-96144-local).
The upstream version adds CellUnhealthyMetric emission on timeout
and UnrecoverableError fail-fast on top of the retry loop.
Summary
This PR fixes unnecessary rep crash-restarts during BOSH deployments, which happen because rep treats early Garden health check failures as fatal.
During BOSH starting_jobs, Monit starts Garden and rep simultaneously. However, Garden requires ~60–90 seconds to warm up on a fresh VM before it can successfully create containers. Previously, the executor's Garden health check would fail fatally on the very first transient error, causing rep to exit and enter a ~53s Monit restart cycle.
This change makes the initial health check phase resilient to transient errors. Instead of failing immediately, the runner now retries the health check in a bounded loop until the full GardenHealthcheckTimeout (default 10m) expires.
Key properties preserved:
Note
Bumps executor submodule to 9e97ac1b, which contains the gardenhealth retry fix. See: cloudfoundry/executor#130 (tnz-96144)
This bump picks up two upstream commits since the last pointer:
Backward Compatibility
Breaking Change? No
This change modifies internal startup resilience only. It does not alter the external API, metric emission behavior, or the ultimate failure conditions, so it is fully backward compatible.