Commit f12a9f3

feat: install the Azure HPC Diagnostics script
The Azure HPC diagnostics script captures information about the system hardware and software for diagnostic purposes. It is intended to supplement the RHEL sosreport diagnostics to cover the Azure-specific hardware and software that the sosreport does not capture. This script will be used for information gathering in support contexts; it is not intended to be run on active HPC nodes.

Before we install the downloaded diagnostic script, we need to change a few things in the script:

- the output should be in {{ __hpc_azure_runtime_dir }}/diagnostics
- permanently disable the auto update code
- fix the version number instead of assuming the script is running from a local git repository
- change from defaulting to online mode (requires internet access) to offline mode. The --offline option goes away, replaced by the --online option
- indicate that the diagnostic log files should be passed on to Red Hat, not Microsoft.

To make this easy, we will add a patch file to the system role that contains the code changes we need to make to the script. This is much simpler to apply than needing to do complex parser-based matches and replacements to make the changes we need.

The resultant patch file will then need to be treated as a template to do path substitution for the runtime output directory. This will place the diagnostic output in a well-known place by default, rather than wherever the script was run from.

The script will be installed to {{ __hpc_azure_tools_dir }}. If the script is already present in this location, then we will skip over the installation entirely.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
1 parent afd3d63 commit f12a9f3
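
The "patch as a template" idea in the commit message can be illustrated with a small shell sketch. This is not how the role does it (the role renders the patch with the Ansible template module). The `render_patch` helper is hypothetical, and `/var/hpc/azure` is only an assumed value for `__hpc_azure_runtime_dir`, inferred from the comment in the patch.

```shell
#!/bin/bash
# Illustrative stand-in for the template step: substitute the one Jinja
# placeholder that the patch uses for the runtime directory.
# render_patch is a hypothetical helper, not part of the role.
render_patch() {
    local runtime_dir="$1"
    sed "s|{{ __hpc_azure_runtime_dir }}|${runtime_dir}|g"
}

# /var/hpc/azure is an assumed example value for the runtime directory.
rendered=$(printf '%s\n' '+DIAG_DIR_LOC="{{ __hpc_azure_runtime_dir }}/diagnostics"' |
    render_patch /var/hpc/azure)
echo "$rendered"
```

Once rendered, the patch contains only concrete paths, so it can be applied to the extracted script like any ordinary patch.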

2 files changed

Lines changed: 209 additions & 3 deletions

tasks/main.yml

Lines changed: 55 additions & 3 deletions
@@ -1400,19 +1400,71 @@
 - name: Install Azure HPC Diagnostics tool
   when: hpc_install_diagnostics
   block:
-    - name: Check if diagnostics are already installed
+    - name: Check if Diagnostics are already installed
       stat:
         path: "{{ __hpc_azure_tools_dir }}/gather_azhpc_vm_diagnostics.sh"
       register: __hpc_azure_diags_installed

     - name: Download and install Diagnostics
       when: not __hpc_azure_diags_installed.stat.exists
       block:
+        - name: Install dependencies
+          package:
+            name: "{{ __hpc_azure_diagnostics_packages }}"
+            state: present
+            use: "{{ (__hpc_server_is_ostree | d(false)) |
+              ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+          register: __hpc_azure_diagnostics_packages_install
+          until: __hpc_azure_diagnostics_packages_install is success

         - name: Download Diagnostics
           include_tasks: download_extract_package.yml
           vars:
-            __hpc_pkg_info: "{{ __hpc_azure_diags_info }}"
+            __hpc_pkg_info: "{{ __hpc_azhpc_diags_info }}"
+
+        # The downloaded diagnostics script needs sufficient customisation that
+        # the simplest way to do this is to patch it. However, we also have to
+        # replace hard coded paths, so we need to be able to template it as
+        # well.
+        #
+        # Templating can only occur on the control node, so we must first run
+        # the patch through the template module and use the output as the patch
+        # source. Then we can apply the patch to the remote extracted file and
+        # copy it to the install location.
+        - name: Create a temp file for the diagnostics patch
+          tempfile:
+            state: file
+            prefix: hpc_diags
+            suffix: .patch
+          register: __hpc_diags_patch_file
+
+        - name: Configure the diagnostics patch
+          template:
+            src: azhpc_vm_diagnostics.sh.patch.j2
+            dest: "{{ __hpc_diags_patch_file.path }}"
+            mode: '0644'
+
+        - name: Patch Diagnostics script
+          patch:
+            src: "{{ __hpc_diags_patch_file.path }}"
+            dest: "{{ __hpc_pkg_extracted.path }}/Linux/src/gather_azhpc_vm_diagnostics.sh"
+            remote_src: false
+            state: present
+            strip: 1
+
+        - name: Install Diagnostics script
+          copy:
+            src: "{{ __hpc_pkg_extracted.path }}/Linux/src/gather_azhpc_vm_diagnostics.sh"
+            dest: "{{ __hpc_azure_tools_dir }}/gather_azhpc_vm_diagnostics.sh"
+            remote_src: true
+            owner: root
+            group: root
+            mode: '0755'
+
+        - name: Clean up temporary patch file
+          file:
+            path: "{{ __hpc_diags_patch_file.path }}"
+            state: absent

         - name: Remove extracted temp directory
           file:
@@ -1422,4 +1474,4 @@

 - name: Clean dnf cache
   command: dnf clean all
-  changed_when: false
+  changed_when: false
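
The check-then-install flow in the tasks above (a `stat` on the destination, then a `copy` only when the file is missing) can be sketched in plain shell. The `install_once` helper is hypothetical and exists only for illustration. It also omits the `owner: root` and `group: root` settings, which would need root privileges.

```shell
#!/bin/bash
# Hypothetical helper mirroring the role's "skip if already installed" logic.
install_once() {
    local src="$1" dest="$2"
    if [ -e "$dest" ]; then
        # Destination already present: leave it untouched.
        echo "skipped"
        return 0
    fi
    # Copy the script into place with executable permissions.
    install -m 0755 "$src" "$dest" && echo "installed"
}
```

Calling `install_once` twice with the same arguments installs on the first call and reports `skipped` on the second, which is the idempotent behaviour the commit message describes.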
Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
+--- gather_azhpc_vm_diagnostics.sh.orig 2026-02-05 15:20:05.410458535 +1100
++++ gather_azhpc_vm_diagnostics.sh 2026-02-11 17:32:31.436256365 +1100
+@@ -1,4 +1,7 @@
+-#!/bin/bash
++#!/bin/bash -u
++{{ ansible_managed | comment(prefix="", postfix="") | trim }}
++{{ "system_role:hpc" | comment(prefix="", postfix="") | trim }}
++
+ # Azure HPC Diagnostics Tool
+ # Gathers Diagnostic info from guest VM
+ #
+@@ -51,16 +54,11 @@
+ # Copyright (c) Microsoft Corporation.
+ # Licensed under the MIT license.
+
+-
+-
+ ####################################################################################################
+ # Begin Constants
+ ####################################################################################################
+
+ STREAM_URL='https://azhpcstor.blob.core.windows.net/diagtool-binaries/stream.tgz'
+-LSVMBUS_URL='https://raw.githubusercontent.com/torvalds/linux/master/tools/hv/lsvmbus'
+-HPC_DIAG_URL='https://raw.githubusercontent.com/Azure/azhpc-diagnostics/main/Linux/src/gather_azhpc_vm_diagnostics.sh'
+-SCRIPT_DIR="$( cd "$( dirname "$0" )" >/dev/null 2>&1 && pwd )"
+ SYSFS_PATH=/sys # store as a variable so it is mockable
+ ETC_PATH=/etc
+ PROC_PATH=/proc
+@@ -74,15 +72,7 @@
+ CPU_LIST=(["Standard_HB120rs_v2"]="0 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109,113,117"
+           ["Standard_HB60rs"]="0 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57")
+ RELEASE_DATE=20220316 # update upon each release
+-COMMIT_HASH=$(
+-    (
+-        command -v git >/dev/null &&
+-        cd "$SCRIPT_DIR" &&
+-        git config --get remote.origin.url | grep -q 'Azure/azhpc-diagnostics.git$' &&
+-        git rev-parse HEAD 2>/dev/null
+-    ) ||
+-    echo 'Unknown')
+-VERSION_INFO="$RELEASE_DATE-$COMMIT_HASH"
++VERSION_INFO="$RELEASE_DATE-unknown"
+
+ HELP_MESSAGE="
+ Usage: $0 [OPTION]
+@@ -100,8 +90,8 @@
+ Execution Mode:
+   --gpu-level=GPU_LEVEL dcgmi run level (default is 1)
+   --mem-level=MEM_LEVEL set to 1 to run stream test (default is 0)
+-  --no-update do not prompt for auto-update
+-  --offline skips steps that require Internet access
++  --no-update Does nothing, auto-update functionality has been elided.
++  --online Run steps that require Internet access
+
+ For more information on this script and the data it gathers, visit its Github:
+
+@@ -217,10 +207,9 @@
+     echo "${CPU_LIST[$1]}"
+ }
+
++COLUMNS=80
+ if tput cols >/dev/null 2>/dev/null && (( $(tput cols) < 80 )); then
+     COLUMNS=$(tput cols)
+-else
+-    COLUMNS=80
+ fi
+
+ print_enclosed() {
+@@ -244,24 +233,6 @@
+     echo ''
+ }
+
+-check_for_updates() {
+-    local message="You are not running the latest release of this tool. Switch to latest version?"
+-
+-    local tmpfile
+-    tmpfile=$(mktemp)
+-    curl -s "$HPC_DIAG_URL" >"$tmpfile" || return 1
+-    if ! cmp --silent "$0" "$tmpfile"; then
+-        if prompt "$message"; then
+-            mv "$tmpfile" "$0"
+-            bash "$0" "$RUNTIME_OPTIONS"
+-            exit $?
+-        else
+-            return 0
+-        fi
+-    fi
+-    rm "$tmpfile"
+-}
+-
+ get_metadata() {
+     local path="$1"
+     curl -s -H Metadata:true "http://169.254.169.254/metadata/instance/$path?api-version=2021-03-01&format=text"
+@@ -876,11 +847,11 @@
+     print_divider
+     print_enclosed "NOTICES:"
+     print_divider
+-    print_enclosed This tool generates and bundles together various logs and diagnostic information. It, however, DOES NOT TRANSMIT any of said data. It is left to the user to choose to transmit this data to Microsoft.
++    print_enclosed This tool generates and bundles together various logs and diagnostic information. It, however, DOES NOT TRANSMIT any of said data. It is left to the user to choose to transmit this data to Red Hat.
+     print_divider
+-    print_enclosed Some of this info, such as IP addresses, may be Personally Identifiable Information. It is up to the user to redact any sensitive info from the output 'if' necessary before sending it to Microsoft.
++    print_enclosed Some of this info, such as IP addresses, may be Personally Identifiable Information. It is up to the user to redact any sensitive info from the output 'if' necessary before sending it to Red Hat.
+     print_divider
+-    print_enclosed This tool invokes various 3rd party tools 'if' they are present on the system Please review them and their EULAs at:
++    print_enclosed This tool invokes various 3rd party tools 'if' they are present on the system. Please review them and their EULAs at:
+     print_enclosed "https://github.com/Azure/azhpc-diagnostics"
+     print_divider
+     print_enclosed WARNING: THINK BEFORE YOU RUN THIS
+@@ -1018,9 +989,6 @@
+     print_enclosed 'Placing diagnostic files in the following location:'
+     print_enclosed "$DIAG_DIR.tar.gz"
+     print_divider
+-    print_enclosed If you have already opened a support request, you can take the tarball and follow this link to upload it:
+-    print_enclosed 'https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/managesupportrequest'
+-    print_divider
+     tar czf "$DIAG_DIR.tar.gz" -C "$DIAG_DIR_LOC" "$VM_ID.$TIMESTAMP" 2>/dev/null && rm -r "$DIAG_DIR"
+ }
+
+@@ -1030,9 +998,11 @@
+
+ GPU_LEVEL=1
+ MEM_LEVEL=0
++OFFLINE=true
+ DISPLAY_HELP=false
+-# should be /opt/azurehpc/diagnostics
+-DIAG_DIR_LOC="$SCRIPT_DIR"
++DISPLAY_VERSION=false
++# should be /var/hpc/azure/diagnostics
++DIAG_DIR_LOC="{{ __hpc_azure_runtime_dir }}/diagnostics"
+
+ # save options
+ RUNTIME_OPTIONS=$*
+@@ -1063,8 +1033,8 @@
+             validate_run_level "$1"
+             MEM_LEVEL="$1"
+             ;;
+-        --no-update) DISABLE_UPDATE=true;;
+-        --offline) OFFLINE=true;;
++        --no-update) ;; # does nothing
++        --online) OFFLINE=false;;
+         --tuning) TUNING=true;;
+         -V|--version) DISPLAY_VERSION=true;;
+     esac
+@@ -1081,10 +1051,6 @@
+ # End Option Parsing
+ ####################################################################################################
+
+-if [ "$OFFLINE" != true ] && [ "$DISABLE_UPDATE" != true ] && ! [[ $- =~ 's' ]]; then
+-    check_for_updates
+-fi
+-
+ if [ ! "${BASH_SOURCE[0]}" -ef "$0" ]; then
+     # This lets us load all functions for unit testing.
+     # We wouldn't want people sourcing this script anyway.
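
The net effect of the option-parsing hunks above (offline by default, `--online` to opt in, `--no-update` accepted but ignored) can be sketched as a standalone function. The `resolve_offline` helper is hypothetical and is not a function in the diagnostics script. It models only these two options and silently ignores anything else.

```shell
#!/bin/bash
# Hypothetical sketch of the patched mode selection.
resolve_offline() {
    local offline=true              # patched default: offline
    while [ $# -gt 0 ]; do
        case "$1" in
            --no-update) ;;         # still accepted, but now a no-op
            --online) offline=false ;;
        esac
        shift
    done
    echo "$offline"
}

resolve_offline                       # prints "true"
resolve_offline --no-update           # prints "true"
resolve_offline --no-update --online  # prints "false"
```

Defaulting to offline means a run with no arguments never needs Internet access, and the online steps only run when explicitly requested.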
