
diego-release: bump executor to gardenhealth retry fix (tnz-96144)#1140

Open
navinms711 wants to merge 2 commits into cloudfoundry:develop from navinms711:develop

@navinms711
Contributor

Summary

This PR addresses an issue where rep crash-restarts unnecessarily during BOSH deployments because it treats early Garden health check failures as fatal.

During BOSH starting_jobs, Monit starts Garden and rep simultaneously. However, Garden requires ~60–90 seconds to warm up on a fresh VM before it can successfully create containers. Previously, the executor's Garden health check would fail fatally on the very first transient error, causing rep to exit and enter a ~53s Monit restart cycle.

This change makes the initial health check phase resilient to transient errors. Instead of failing immediately, the runner now retries the health check in a bounded loop until the full GardenHealthcheckTimeout (default 10m) expires.

Key properties preserved:

  • The cell is correctly marked as unhealthy during the retry phase so BBS does not schedule LRPs there.
  • An UnrecoverableError (e.g. bad TLS certs) still causes an immediate fatal exit — only transient connection errors trigger the retry.
  • A fatal error is triggered if the full timeout is reached, ensuring permanently broken Garden instances are still detected and the CellUnhealthy metric is emitted.

Note

Bumps executor submodule to 9e97ac1b, which contains the gardenhealth retry fix. See: cloudfoundry/executor#130 (tnz-96144)

This bump picks up two upstream commits since the last pointer:

  • 645a0bd2 — Fix go test failures for go1.26 (Amin Jamali)
  • 9e97ac1b — gardenhealth: retry initial health check until timeout (tnz-96144)

Backward Compatibility

Breaking Change? No

This change modifies internal startup resilience only. It does not alter the external API, metric emission behavior, or the ultimate failure conditions, and it is fully backward compatible.

Points the executor submodule at the upstream-merged commit
9e97ac1b ("gardenhealth: retry initial health check until timeout")
on navinms711/executor main, replacing the old devperf proto-version
at 63bb0792 (tnz-96144-local).

The upstream version adds CellUnhealthyMetric emission on timeout
and UnrecoverableError fail-fast on top of the retry loop.