[Update] Start clustermgtd on update-compute-fleet failure #3109
Merged
gmarciani merged 2 commits into aws:develop on Feb 13, 2026
…is is to reduce the risk of having clustermgtd not running. To this aim we made the existing UpdateFailureHandler able to execute different recovery strategies. We then use the same handler to recover from both update failures and update-compute-fleet-status failures.
…m logs can be displayed. To this aim, we needed to change the log level in the allowed Cinc command specified in sudoers configuration.
himani2411 approved these changes on Feb 13, 2026
Description of changes
Start clustermgtd when update-compute-fleet fails, to reduce the risk of having clustermgtd not running.
This is the same strategy we already applied to cluster updates.
To avoid code duplication, we adapted and reused the existing UpdateFailureHandler so that it can handle failures of both cluster updates and compute fleet status updates, applying different recovery strategies.
We also needed to set the log level of clusterstatusmgtd from auto to info so that custom logs can be displayed in chef-client.log. To change the log level, we also needed to change it in the corresponding allowed Cinc command in the sudoers config for the daemon.
UX
After injecting a synthetic failure in the recipe update_computefleet_status_head_node.rb, we see that the handler is able to restart clustermgtd.
Limits of this solution + future improvements
In this solution we control the different recovery strategies with boolean flags. This is OK for the current scenario because we have only two recovery strategies (restart clustermgtd, cleanup dna files). To be more flexible and future-proof we should apply the Strategy pattern. I preferred not to do it now because it would require more boilerplate and refactoring, which would introduce unnecessary complexity for such a simple case.
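To make the trade-off concrete, here is a minimal sketch of the two designs side by side. It is not the code in this PR: every class and method name below is hypothetical, the recovery steps only return a label instead of touching the system, and the order in which the steps run is an assumption made for illustration.

```ruby
# Illustrative sketch only: all names are hypothetical and the recovery
# steps just report what they would do.

# Flag-based variant: each recovery step is toggled by a boolean.
class FlagBasedHandler
  def initialize(restart_clustermgtd: false, cleanup_dna_files: false)
    @restart_clustermgtd = restart_clustermgtd
    @cleanup_dna_files = cleanup_dna_files
  end

  # Returns the labels of the recovery steps that ran, in order.
  def recover
    actions = []
    actions << "cleanup dna files" if @cleanup_dna_files
    actions << "restart clustermgtd" if @restart_clustermgtd
    actions
  end
end

# Strategy variant: each recovery step is its own object exposing #call.
class RestartClustermgtd
  def call
    "restart clustermgtd"
  end
end

class CleanupDnaFiles
  def call
    "cleanup dna files"
  end
end

# The handler no longer knows which steps exist; it runs what it is given.
class StrategyBasedHandler
  def initialize(strategies)
    @strategies = strategies
  end

  # Runs every strategy in the order supplied and returns their labels.
  def recover
    @strategies.map(&:call)
  end
end

# Both handlers are configured to only restart clustermgtd.
puts FlagBasedHandler.new(restart_clustermgtd: true).recover.inspect
puts StrategyBasedHandler.new([RestartClustermgtd.new]).recover.inspect
```

With two steps the flag-based version is shorter and easy to read. The Strategy version starts to pay off only when a third or fourth recovery step appears, since adding one means writing a new class rather than adding another flag and another branch to the handler.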
Tests
test_update_rollback_failure, which has been extended in aws-parallelcluster#7225 ([Test] Extend integ test test_update_rollback_failure to validate that clustermgtd gets restarted when the async flow of update-compute-fleet fails.) to cover this specific change.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.