
[Update] Start clustermgtd on update-compute-fleet failure #3109

Merged

gmarciani merged 2 commits into aws:develop from gmarciani:wip/mgiacomo/3.15.0/clustermgtd-restart-on-error-0209-1 on Feb 13, 2026

Conversation

@gmarciani (Contributor) commented Feb 9, 2026

Description of changes

Start clustermgtd when update-compute-fleet fails, to reduce the risk of leaving clustermgtd not running.
This is the same strategy we already apply to cluster updates.

To avoid code duplication, we adapted and reused the existing UpdateFailureHandler so that it can handle failures in both cluster updates and compute fleet status updates, applying a different recovery strategy in each case.
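For reference, the adapted handler has roughly this shape. This is a minimal sketch, not the actual cookbook code: the class body, the @cleanup_dna_files/@start_clustermgtd parameter names, and the supervisorctl path are inferred from the log output shown below.

require 'chef/handler'
require 'mixlib/shellout'

class UpdateFailureHandler < Chef::Handler
  SUPERVISORCTL = '/opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/supervisorctl'.freeze

  # One boolean flag per recovery strategy.
  def initialize(cleanup_dna_files: false, start_clustermgtd: false)
    super()
    @cleanup_dna_files = cleanup_dna_files
    @start_clustermgtd = start_clustermgtd
  end

  # Invoked by Chef at the end of a failed run when registered as an exception handler.
  def report
    Chef::Log.info("UpdateFailureHandler: Update failed due to: #{run_status.exception}")
    start_clustermgtd if @start_clustermgtd
    cleanup_dna_files if @cleanup_dna_files
  end

  private

  def start_clustermgtd
    # The retry loop visible in the log below (attempt 1/11) is omitted for brevity.
    cmd = Mixlib::ShellOut.new("#{SUPERVISORCTL} start clustermgtd")
    cmd.run_command
    Chef::Log.info("UpdateFailureHandler: Command stdout: #{cmd.stdout}")
  end

  def cleanup_dna_files
    # Recovery strategy applied on cluster-update failures (placeholder).
  end
end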

We also needed to change the log level of clusterstatusmgtd from auto to info so that custom log lines show up in chef-client.log. Because the daemon's Cinc invocation is whitelisted in its sudoers configuration, the same log-level change also had to be applied to the allowed Cinc command there.
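Illustratively, the sudoers change is of this shape. This is a hypothetical entry, not the real one: the user name, binary path, and other options are assumptions (only --log_level is an actual Cinc/Chef client flag). The point is that sudoers whitelists the exact command string, so changing the log level in the invocation requires the same change in sudoers:

# Hypothetical sudoers entry for the daemon; the whitelisted command must
# match the daemon's Cinc invocation, so --log_level changes in both places.
clusterstatusmgtd ALL = (root) NOPASSWD: /usr/bin/cinc-client --local-mode --log_level info *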

UX

After injecting a synthetic failure in the recipe update_computefleet_status_head_node.rb, we can see that the handler restarts clustermgtd (one way to reproduce such a failure is sketched after the log):

Running handlers:
[2026-02-10T05:03:26+00:00] ERROR: Running exception handlers
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Started with parameters @cleanup_dna_files=false, @start_clustermgtd=true
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Update failed on HeadNode due to: bash[update compute fleet] (aws-parallelcluster-slurm::update_computefleet_status_head_node line 18) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '126'
---- Begin output of "bash"  ----
STDOUT: clustermgtd: stopped
STDERR: + /opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/supervisorctl stop clustermgtd
+ /opt/parallelcluster/scripts/slurm/slurm_fleet_status_manager -cf /opt/parallelcluster/shared/computefleet-status.json
bash: line 3: /opt/parallelcluster/scripts/slurm/slurm_fleet_status_manager: Permission denied
---- End output of "bash"  ----
Ran "bash"  returned 126
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Resources that have been successfully executed before the failure:
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - ruby_block[Configure environment variable for recipes context: PATH]
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler:   - ruby_block[load cluster configuration]
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Running recovery commands
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Executing: start clustermgtd
[2026-02-10T05:03:26+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Running command (attempt 1/11): /opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/supervisorctl start clustermgtd
[2026-02-10T05:03:28+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stdout: clustermgtd: started

[2026-02-10T05:03:28+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Command stderr:
[2026-02-10T05:03:28+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Successfully executed: start clustermgtd
[2026-02-10T05:03:28+00:00] INFO: ErrorHandlers::UpdateFailureHandler: Completed successfully
  - ErrorHandlers::UpdateFailureHandler
Running handlers complete
[2026-02-10T05:03:28+00:00] ERROR: Exception handlers complete
...
[root@ip-27-6-32-233 ~]# /opt/parallelcluster/pyenv/versions/3.14.2/envs/cookbook_virtualenv/bin/supervisorctl status
cfn-hup                          RUNNING   pid 3931, uptime 0:38:54
clustermgtd                      RUNNING   pid 7293, uptime 0:00:23
clusterstatusmgtd                RUNNING   pid 6850, uptime 0:04:30
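For reference, the synthetic failure above is a permissions failure (exit code 126, Permission denied). One way to reproduce something similar, assuming this is how it was injected (the PR does not say), is to drop the execute bit on the fleet status manager before triggering the update:

# Exit code 126 means "command found but not executable", matching the log above.
chmod -x /opt/parallelcluster/scripts/slurm/slurm_fleet_status_manager
# ...trigger the compute fleet status update, then restore the bit:
chmod +x /opt/parallelcluster/scripts/slurm/slurm_fleet_status_manager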

Limits of this solution + future improvements
In this solution we control the different recovery strategies with boolean flags. This is fine for the current scenario because we have only two recovery strategies (restart clustermgtd, clean up dna files). To be more flexible and future-proof we should apply the Strategy pattern (sketched below). I preferred not to do it now because it would require boilerplate and refactoring that would introduce unnecessary complexity for such a simple case.
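For illustration, the Strategy-pattern alternative could look like this sketch (not part of this PR; all names are hypothetical). Each recovery action becomes an object with a common interface, and the handler iterates over whichever strategies it was constructed with:

require 'chef/handler'
require 'mixlib/shellout'

# Each strategy exposes a single recover(run_status) method.
class StartClustermgtdStrategy
  def recover(_run_status)
    Mixlib::ShellOut.new('supervisorctl start clustermgtd').run_command
  end
end

class CleanupDnaFilesStrategy
  def recover(_run_status)
    # delete leftover dna files here
  end
end

class UpdateFailureHandler < Chef::Handler
  def initialize(strategies: [])
    super()
    @strategies = strategies
  end

  def report
    @strategies.each { |strategy| strategy.recover(run_status) }
  end
end

# Adding a new recovery strategy then means adding a class, not a new flag:
# UpdateFailureHandler.new(strategies: [StartClustermgtdStrategy.new])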

Tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani changed the title Wip/mgiacomo/3.15.0/clustermgtd restart on error 0209 1 [Update] Start clustermgtd on update-compute-fleet failure Feb 9, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3.15.0/clustermgtd-restart-on-error-0209-1 branch 2 times, most recently from 5bf1736 to 4a5cf50 on February 9, 2026 22:23
@gmarciani gmarciani marked this pull request as ready for review February 9, 2026 22:33
@gmarciani gmarciani requested review from a team as code owners February 9, 2026 22:33
@gmarciani gmarciani force-pushed the wip/mgiacomo/3.15.0/clustermgtd-restart-on-error-0209-1 branch 6 times, most recently from b7b07de to dca5f7a on February 10, 2026 13:46
@gmarciani gmarciani added the 3.x label Feb 10, 2026
…is is to reduce the risk of having clustermgtd not running.

To this aim we made the existing UpdateFailureHandler able to execute different recovery strategies. We then use the same handler to recover from both update failures and update-compute-fleet-status failures.
…m logs can be displayed.

To this aim, we needed to change the log level in the
allowed Cinc command specified in sudoers configuration.
@gmarciani gmarciani enabled auto-merge (rebase) February 13, 2026 22:37
@gmarciani gmarciani merged commit eecd6a1 into aws:develop Feb 13, 2026
35 of 43 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3.15.0/clustermgtd-restart-on-error-0209-1 branch February 13, 2026 22:40