
[Update] Start clustermgtd on update-compute-fleet failure #3109

Merged

gmarciani merged 2 commits into aws:develop from gmarciani:wip/mgiacomo/3.15.0/clustermgtd-restart-on-error-0209-1 on Feb 13, 2026

Conversation

@gmarciani (Contributor) commented Feb 9, 2026

Description of changes

Start clustermgtd when update-compute-fleet fails, to reduce the risk of leaving clustermgtd not running.
This is the same strategy we already apply to cluster updates.

To avoid code duplication, we adapted and reused the existing UpdateFailureHandler so that it can handle failures in both cluster updates and compute fleet status updates, applying a different recovery strategy in each case.
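For reference, the adapted handler has roughly this shape. This is a minimal sketch, not the actual cookbook code: the class body, the @cleanup_dna_files/@start_clustermgtd parameter names, and the supervisorctl path are inferred from the log output shown below.

require 'chef/handler'
require 'mixlib/shellout'

class UpdateFailureHandler < Chef::Handler
  SUPERVISORCTL = '/opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/supervisorctl'.freeze

  # One boolean flag per recovery strategy.
  def initialize(cleanup_dna_files: false, start_clustermgtd: false)
    super()
    @cleanup_dna_files = cleanup_dna_files
    @start_clustermgtd = start_clustermgtd
  end

  # Invoked by Chef at the end of a failed run when registered as an exception handler.
  def report
    Chef::Log.info("UpdateFailureHandler: Update failed due to: #{run_status.exception}")
    start_clustermgtd if @start_clustermgtd
    cleanup_dna_files if @cleanup_dna_files
  end

  private

  def start_clustermgtd
    # The retry loop visible in the log below (attempt 1/11) is omitted for brevity.
    cmd = Mixlib::ShellOut.new("#{SUPERVISORCTL} start clustermgtd")
    cmd.run_command
    Chef::Log.info("UpdateFailureHandler: Command stdout: #{cmd.stdout}")
  end

  def cleanup_dna_files
    # Recovery strategy applied on cluster-update failures (placeholder).
  end
end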

We also needed to change the log level of clusterstatusmgtd from auto to info so that custom log lines show up in chef-client.log. Because the daemon's Cinc invocation is whitelisted in its sudoers configuration, the same log-level change also had to be applied to the allowed Cinc command there.
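Illustratively, the sudoers change is of this shape. This is a hypothetical entry, not the real one: the user name, binary path, and other options are assumptions (only --log_level is an actual Cinc/Chef client flag). The point is that sudoers whitelists the exact command string, so changing the log level in the invocation requires the same change in sudoers:

# Hypothetical sudoers entry for the daemon; the whitelisted command must
# match the daemon's Cinc invocation, so --log_level changes in both places.
clusterstatusmgtd ALL = (root) NOPASSWD: /usr/bin/cinc-client --local-mode --log_level info *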

UX

After injecting a synthetic failure in the recipe update_computefleet_status_head_node.rb, we can see that the handler restarts clustermgtd (one way to reproduce such a failure is sketched after the log):

Running handlers:
[2026-02-10T05:03:26+00:00] ERROR: Running exception handlers
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Started with parameters @cleanup_dna_files=false, @start_clustermgtd=true
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Update failed on HeadNode due to: bash[update compute fleet] (aws-parallelcluster-slurm::update_computefleet_status_head_node line 18) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '126'
---- Begin output of "bash"  ----
STDOUT: clustermgtd: stopped
STDERR: + /opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/supervisorctl stop clustermgtd
+ /opt/parallelcluster/scripts/slurm/slurm_fleet_status_manager -cf /opt/parallelcluster/shared/computefleet-status.json
bash: line 3: /opt/parallelcluster/scripts/slurm/slurm_fleet_status_manager: Permission denied
---- End output of "bash"  ----
Ran "bash"  returned 126
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Resources that have been successfully executed before the failure:
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - ruby_block[Configure environment variable for recipes context: PATH]
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - ruby_block[load cluster configuration]
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Running recovery commands
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Executing: start clustermgtd
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Running command (attempt 1/11): /opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/supervisorctl start clustermgtd
[2026-02-10T05:03:28+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stdout: clustermgtd: started

[2026-02-10T05:03:28+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stderr:
[2026-02-10T05:03:28+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Successfully executed: start clustermgtd
[2026-02-10T05:03:28+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Completed successfully
  - ErrorHandlers::UpdateFailureHandler
Running handlers complete
[2026-02-10T05:03:28+00:00] ERROR: Exception handlers complete
...
[root@ip-27-6-32-233 ~]# /opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/supervisorctl status
cfn-hup                          RUNNING   pid 3931, uptime 0:38:54
clustermgtd                      RUNNING   pid 7293, uptime 0:00:23
clusterstatusmgtd                RUNNING   pid 6850, uptime 0:04:30
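For reference, the synthetic failure above is a permissions failure (exit code 126, Permission denied). One way to reproduce something similar, assuming this is how it was injected (the PR does not say), is to drop the execute bit on the fleet status manager before triggering the update:

# Exit code 126 means "command found but not executable", matching the log above.
chmod -x /opt/parallelcluster/scripts/slurm/slurm_fleet_status_manager
# ...trigger the compute fleet status update, then restore the bit:
chmod +x /opt/parallelcluster/scripts/slurm/slurm_fleet_status_manager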

Limits of this solution + future improvements
In this solution we control the different recovery strategies with boolean flags. This is fine for the current scenario because we have only two recovery strategies (restart clustermgtd, clean up dna files). To be more flexible and future-proof we should apply the Strategy pattern (sketched below). I preferred not to do it now because it would require boilerplate and refactoring that would introduce unnecessary complexity for such a simple case.
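For illustration, the Strategy-pattern alternative could look like this sketch (not part of this PR; all names are hypothetical). Each recovery action becomes an object with a common interface, and the handler iterates over whichever strategies it was constructed with:

require 'chef/handler'
require 'mixlib/shellout'

# Each strategy exposes a single recover(run_status) method.
class StartClustermgtdStrategy
  def recover(_run_status)
    Mixlib::ShellOut.new('supervisorctl start clustermgtd').run_command
  end
end

class CleanupDnaFilesStrategy
  def recover(_run_status)
    # delete leftover dna files here
  end
end

class UpdateFailureHandler < Chef::Handler
  def initialize(strategies: [])
    super()
    @strategies = strategies
  end

  def report
    @strategies.each { |strategy| strategy.recover(run_status) }
  end
end

# Adding a new recovery strategy then means adding a class, not a new flag:
# UpdateFailureHandler.new(strategies: [StartClustermgtdStrategy.new])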

Tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani changed the title Wip/mgiacomo/3.15.0/clustermgtd restart on error 0209 1 [Update] Start clustermgtd on update-compute-fleet failure Feb 9, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3.15.0/clustermgtd-restart-on-error-0209-1 branch 2 times, most recently from 5bf1736 to 4a5cf50 on February 9, 2026 22:23
@gmarciani gmarciani marked this pull request as ready for review February 9, 2026 22:33
@gmarciani gmarciani requested review from a team as code owners February 9, 2026 22:33
@gmarciani gmarciani force-pushed the wip/mgiacomo/3.15.0/clustermgtd-restart-on-error-0209-1 branch 6 times, most recently from b7b07de to dca5f7a on February 10, 2026 13:46
@gmarciani gmarciani added the 3.x label Feb 10, 2026
…is is to reduce the risk of having clustermgtd not running.

To this aim we made the existing UpdateFailureHandler able to execute different recovery strategies. We then use the same handler to recover from both update failures and update-compute-fleet-status failures.
…m logs can be displayed.

To this aim, we needed to change the log level in the
allowed Cinc command specified in sudoers configuration.
@gmarciani gmarciani enabled auto-merge (rebase) February 13, 2026 22:37
@gmarciani gmarciani merged commit eecd6a1 into aws:develop Feb 13, 2026
35 of 43 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3.15.0/clustermgtd-restart-on-error-0209-1 branch February 13, 2026 22:40