[Observability] Use metric filter to generate the clustermgtd heartbeat metric.#7219

Merged
gmarciani merged 3 commits into aws:develop from gmarciani:wip/mgiacomo/3150/clustermgtd-alarm-fix-0204-1
Feb 5, 2026
@gmarciani gmarciani commented Feb 4, 2026

Description of changes

Use a metric filter to generate the clustermgtd heartbeat metric rather than a metric pushed by clustermgtd.

In #7209 we introduced a new alarm for clustermgtd not running, based on a metric pushed by clustermgtd. We decided to change the approach and replace the explicit metric publishing from clustermgtd with a metric filter.

The new approach has two advantages over the previous one:

  1. It does not require the extra permission cloudwatch:PutMetricData.
  2. It does not require an extra VPC Endpoint for CloudWatch Monitoring in air-gapped environments.
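
To make the approach concrete, here is a minimal sketch of a CloudWatch Logs metric filter written as a CloudFormation-style resource built in Python. The `AWS::Logs::MetricFilter` resource type and its properties exist; the log group name, the filter pattern, the JSON field names (`event-type`, `cluster-name`, `instance-id`), and the helper name `build_heartbeat_metric_filter` are assumptions for illustration and are not taken from this PR.

```python
import json


def build_heartbeat_metric_filter(log_group_name: str) -> dict:
    """Return a CloudFormation-style AWS::Logs::MetricFilter resource (hypothetical values)."""
    return {
        "Type": "AWS::Logs::MetricFilter",
        "Properties": {
            "LogGroupName": log_group_name,
            # Assumed JSON pattern: match heartbeat events in the clustermgtd events log.
            "FilterPattern": '{ $.event-type = "clustermgtd-heartbeat" }',
            "MetricTransformations": [
                {
                    "MetricNamespace": "ParallelCluster",
                    "MetricName": "ClustermgtdHeartbeat",
                    # Emit 1 for every matching log event.
                    "MetricValue": "1",
                    # Dimension values are read from the log event (assumed field names).
                    "Dimensions": [
                        {"Key": "ClusterName", "Value": "$.cluster-name"},
                        {"Key": "InstanceId", "Value": "$.instance-id"},
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical log group name, for demonstration only.
    resource = build_heartbeat_metric_filter("/aws/parallelcluster/example-cluster")
    print(json.dumps(resource, indent=2))
```

Because CloudWatch Logs evaluates the filter on the service side, the head node only has to write log events and never calls PutMetricData, which is why neither the extra permission nor the monitoring endpoint is needed.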

UX

Here is the alarm and widget visible on the dashboard. We did not change the user experience of the dashboard compared to what we had before. This is just to show that it is functional.

[Screenshot, 2026-02-05 9:08:44 AM: alarm and widget on the dashboard]

Notes

  1. This PR reverts 3aacea3, the commit that introduced the VPC Endpoint in the test test_cluster_in_no_internet_subnet. The new approach no longer needs that VPC Endpoint.
  2. This PR reverts 9329636, the commit that introduced the extra permission cloudwatch:PutMetricData, which the new approach no longer requires.
  3. This PR requires "[Observability] Clustermgtd to emit heartbeat into event logs rather than explicit CW metric" (aws-parallelcluster-node#687) to be merged, because that is where clustermgtd starts writing the heartbeat event into the clustermgtd events log.
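
As a rough illustration of the dependency in note 3, the sketch below writes a heartbeat as one JSON log line and checks it locally the way an equality-based JSON filter pattern would. The field names, the event type value, and both helper names are hypothetical; the actual event format is defined in aws-parallelcluster-node#687 and may differ.

```python
import json
from datetime import datetime, timezone


def make_heartbeat_event(cluster_name: str, instance_id: str) -> str:
    """Serialize a hypothetical heartbeat event as a single JSON log line."""
    event = {
        "datetime": datetime.now(timezone.utc).isoformat(),
        "event-type": "clustermgtd-heartbeat",  # assumed value
        "cluster-name": cluster_name,
        "instance-id": instance_id,
    }
    return json.dumps(event)


def matches_heartbeat(log_line: str) -> bool:
    """Mimic locally a JSON filter pattern that tests $.event-type for equality."""
    try:
        event = json.loads(log_line)
    except json.JSONDecodeError:
        return False  # non-JSON lines never match a JSON pattern
    return isinstance(event, dict) and event.get("event-type") == "clustermgtd-heartbeat"
```

Each matching line would contribute one data point to the metric, so a gap in heartbeat lines shows up as missing data for the alarm to act on.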

Q&A

  1. Why are we including InstanceId as dimension for ClustermgtdHeartbeat? Isn't ClusterName enough considering that clustermgtd only runs on the head node?
    Yes, the ClusterName dimension alone would be enough for that metric. However, our default so far is to include the instance ID in every metric filter. Sticking to this default keeps the code simpler and the user experience consistent: when users browse the metrics in the ParallelCluster namespace, they see only one metric folder, "ClusterName, InstanceId". If we departed from the default, they would see two folders, "ClusterName, InstanceId" and "ClusterName", which could be confusing. It is also more future-proof in case we later add head node high availability, where the head node could run on multiple instances.
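
The "folders" argument can be illustrated with a small sketch: metrics in a namespace are grouped by their set of dimension names, so the number of distinct sets is the number of folders the user sees. The metric names other than `ClustermgtdHeartbeat` and the helper `dimension_folders` are made up for illustration.

```python
def dimension_folders(metrics: dict[str, tuple[str, ...]]) -> set[tuple[str, ...]]:
    """Return the distinct dimension-name sets, one per console 'folder'."""
    return {tuple(sorted(dims)) for dims in metrics.values()}


# Hypothetical existing metric-filter metrics, all using the default dimensions.
existing = {
    "SomeExistingMetric": ("ClusterName", "InstanceId"),
    "AnotherExistingMetric": ("ClusterName", "InstanceId"),
}

# Keeping the default for the heartbeat metric leaves a single folder.
with_default = {**existing, "ClustermgtdHeartbeat": ("ClusterName", "InstanceId")}

# Dropping InstanceId would introduce a second folder.
with_cluster_only = {**existing, "ClustermgtdHeartbeat": ("ClusterName",)}
```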

Tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani changed the title from "Wip/mgiacomo/3150/clustermgtd alarm fix 0204 1" to "[Observability] Use metric filter to generate the clustermgtd heartbeat metric." Feb 4, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-alarm-fix-0204-1 branch 2 times, most recently from 14c5fba to cd40ce8 Compare February 4, 2026 22:30
…ndpoint for CloudWatch Metrics (com.amazonaws.$region.monitoring), which is now required by clustermgtd to put metrics."

This reverts commit 3aacea3.

We must revert this commit because the VPC Endpoint for CloudWatch Monitoring is not needed anymore.
We introduced it when we first added a call to PutMetricData from clustermgtd.
We have now moved to an approach that makes no requests to CloudWatch Monitoring, so the endpoint is not required anymore.
…the head node policy so that clustermgtd is able to emit metrics."

This reverts commit 9329636.

We must revert this commit because we do not need the permission cloudwatch:PutMetricData anymore.
We introduced such extra permissions when we introduced a PutMetricData request from the head node
to let clustermgtd publish a metric.

However, we have now changed the approach, moving to metric filters rather than explicit PutMetricData,
so the permission is not required anymore.
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-alarm-fix-0204-1 branch from cd40ce8 to ef2eab7 Compare February 4, 2026 22:32
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-alarm-fix-0204-1 branch from ef2eab7 to ba3c5e2 Compare February 5, 2026 02:08
@gmarciani gmarciani marked this pull request as ready for review February 5, 2026 02:09
@gmarciani gmarciani requested review from a team as code owners February 5, 2026 02:09
@gmarciani gmarciani enabled auto-merge (rebase) February 5, 2026 17:05
@gmarciani gmarciani merged commit f648a81 into aws:develop Feb 5, 2026
27 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3150/clustermgtd-alarm-fix-0204-1 branch February 5, 2026 17:11