[Observability] Emit metric ClustermgtdHeartbeat to signal clustermgtd heartbeat. #685

Merged
gmarciani merged 1 commit into aws:develop from gmarciani:wip/mgiacomo/3150/clustermgtd-metrics-0126-1 on Jan 27, 2026

Contributor

@gmarciani gmarciani commented Jan 26, 2026

Description of changes

Emit metric ClustermgtdHeartbeat to signal clustermgtd heartbeat.
The metric is emitted with dimensions: ClusterName and InstanceId.
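
For illustration, a datum carrying those two dimensions could be shaped roughly as below. This is a minimal sketch, not the code from this PR: the helper `build_heartbeat_datum`, the example cluster name and instance ID, and the `Count` unit are assumptions; only the metric name and the dimension names come from the description above. The dictionary layout follows the CloudWatch `MetricDatum` structure.

```python
def build_heartbeat_datum(cluster_name: str, instance_id: str) -> dict:
    """Build a MetricDatum-shaped dict for the heartbeat (hypothetical helper)."""
    return {
        "MetricName": "ClustermgtdHeartbeat",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster_name},
            {"Name": "InstanceId", "Value": instance_id},
        ],
        "Value": 1,
        "Unit": "Count",  # assumption: the unit used by the PR is not shown here
    }


# Example values are made up for illustration.
datum = build_heartbeat_datum("my-cluster", "i-0123456789abcdef0")
```

A dict like this could be passed in the `MetricData` list of boto3's `put_metric_data` call together with a namespace; the namespace used by clustermgtd is not shown in this conversation.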

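The metric is intentionally emitted at the end of the clustermgtd loop so that it reflects the real health of the daemon.
If it were emitted at the beginning of the iteration, it would be open to false negatives: a heartbeat would be reported even when the rest of the iteration fails or hangs, so an unhealthy daemon could go undetected.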
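
The ordering argument can be sketched as follows. This is a simplified illustration, not the actual clustermgtd loop: `run_iteration`, `manage_cluster`, and `publish_heartbeat` are hypothetical names, and the real daemon's error handling may differ.

```python
def run_iteration(manage_cluster, publish_heartbeat) -> bool:
    """Run one hypothetical management iteration; heartbeat only on completion."""
    try:
        manage_cluster()
    except Exception:
        # The iteration failed, so no heartbeat is published for this round.
        return False
    # Reached only if the management logic finished without raising.
    publish_heartbeat()
    return True


def failing_iteration():
    raise RuntimeError("simulated failure")


published = []
ok_healthy = run_iteration(lambda: None, lambda: published.append(1))
ok_failed = run_iteration(failing_iteration, lambda: published.append(1))
```

With the heartbeat placed last, a missing data point in CloudWatch corresponds to an iteration that did not complete, which is what an alarm on this metric would want to detect.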

This PR depends on the permissions added in aws/aws-parallelcluster#7209

Tests

  • Unit tests (extended to cover the current changes)
  • Manually verified by creating a cluster and checking that (i) the metric is pushed and (ii) when the metric push fails (after manually removing the permissions to push it), the overall clustermgtd iteration is not compromised.

References

  1. The max length for a metric name is 255 chars. See https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_MetricDatum.html
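
As a small illustration of that limit, a length check on a metric name might look like the sketch below. The helper `is_valid_metric_name` is hypothetical and is not part of this PR; it only checks the length bounds (1 to 255 characters) and no other naming rules.

```python
MAX_METRIC_NAME_LENGTH = 255  # limit cited from the MetricDatum API reference


def is_valid_metric_name(name: str) -> bool:
    """Check only the length constraint on a metric name (hypothetical helper)."""
    return 1 <= len(name) <= MAX_METRIC_NAME_LENGTH
```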

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0126-1 branch 2 times, most recently from 8b3ee75 to b9d50bb Compare January 26, 2026 22:21
@gmarciani gmarciani marked this pull request as ready for review January 26, 2026 22:26
@gmarciani gmarciani requested review from a team as code owners January 26, 2026 22:26
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0126-1 branch 3 times, most recently from 9983ec3 to af6495c Compare January 27, 2026 17:34
@gmarciani gmarciani force-pushed the wip/mgiacomo/3150/clustermgtd-metrics-0126-1 branch from af6495c to 165c3c1 Compare January 27, 2026 17:44
Contributor

@himani2411 himani2411 left a comment

LGTM!

@gmarciani gmarciani merged commit ebb742f into aws:develop Jan 27, 2026
12 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3150/clustermgtd-metrics-0126-1 branch January 27, 2026 22:54
@gmarciani gmarciani restored the wip/mgiacomo/3150/clustermgtd-metrics-0126-1 branch January 27, 2026 22:54
@gmarciani gmarciani deleted the wip/mgiacomo/3150/clustermgtd-metrics-0126-1 branch January 28, 2026 19:22
Comment on lines +591 to +593
# Publish heartbeat metric to CloudWatch
self._metrics_publisher.put_metric(metric_name=CW_METRICS_HEARTBEAT, value=1)

Contributor

How about surrounding this with a try/except and adding a warning log line? This is not critical cluster management logic, so we should avoid letting it throw an exception out of the function.

Contributor Author

I agree; this is already done inside the metrics publisher itself.
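
The pattern discussed in this thread, catching the failure inside the publisher and logging a warning, can be sketched as below. This is an assumption-laden illustration and not the publisher's actual code: the class `SafeMetricsPublisher`, its `send` callable, and the boolean return value are hypothetical; only the `put_metric(metric_name=..., value=...)` call shape comes from the diff above.

```python
import logging

logger = logging.getLogger(__name__)


class SafeMetricsPublisher:
    """Hypothetical publisher wrapper that never raises to its caller."""

    def __init__(self, send):
        # `send` is any callable that delivers the metric and may raise.
        self._send = send

    def put_metric(self, metric_name: str, value: float) -> bool:
        try:
            self._send(metric_name, value)
            return True
        except Exception as e:
            # Publishing is not critical: log a warning and carry on.
            logger.warning("Failed to publish metric %s: %s", metric_name, e)
            return False


def denied(metric_name, value):
    # Simulates the manual test where the push permission was removed.
    raise PermissionError("not authorized to put metric data")


publisher = SafeMetricsPublisher(denied)
result = publisher.put_metric(metric_name="ClustermgtdHeartbeat", value=1)
```

With the exception handled inside `put_metric`, the caller in the management loop needs no try/except of its own, which matches the behavior described in the manual test.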
