Commit bd20207

feat: install the Azure HPC Diagnostics script
The Azure HPC diagnostics script captures information about the system hardware and software for diagnostic purposes. It is intended to supplement the RHEL sosreport diagnostics to cover the Azure-specific hardware and software that sosreport does not capture. This script will be used for information gathering in support contexts; it is not intended to be run on active HPC nodes.

Before we install the downloaded diagnostic script, we need to change a few things in the script:

- the output should be in {{ __hpc_azure_runtime_dir }}/diagnostics
- permanently disable the auto-update code
- fix the version number instead of assuming the script is running from a local git repository
- change from defaulting to online mode (requires internet access) to offline mode. The --offline option goes away, replaced by an --online option
- indicate that the diagnostic log files should be passed on to Red Hat, not Microsoft.

To make this easy, we will add a patch file to the system role that contains the code changes we need to make to the script. This is much simpler to apply than doing complex parser-based matches and replacements to make the changes we need.

The resultant patch file will then need to be treated as a template to do path substitution for the runtime output directory. This will place the diagnostic output in a well-known place by default, rather than wherever the script was run from.

The script will be installed to {{ __hpc_azure_tools_dir }}. If the script is already present in this location, then we will skip over the installation entirely.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
1 parent afd3d63 commit bd20207

2 files changed

Lines changed: 208 additions & 3 deletions

tasks/main.yml

Lines changed: 55 additions & 3 deletions
@@ -1400,19 +1400,71 @@
 - name: Install Azure HPC Diagnostics tool
   when: hpc_install_diagnostics
   block:
-    - name: Check if diagnostics are already installed
+    - name: Check if Diagnostics are already installed
       stat:
         path: "{{ __hpc_azure_tools_dir }}/gather_azhpc_vm_diagnostics.sh"
       register: __hpc_azure_diags_installed
 
     - name: Download and install Diagnostics
       when: not __hpc_azure_diags_installed.stat.exists
       block:
+        - name: Install dependencies
+          package:
+            name: "{{ __hpc_azure_diagnostics_packages }}"
+            state: present
+            use: "{{ (__hpc_server_is_ostree | d(false)) |
+              ternary('ansible.posix.rhel_rpm_ostree', omit) }}"
+          register: __hpc_azure_diagnostics_packages_install
+          until: __hpc_azure_diagnostics_packages_install is success
 
         - name: Download Diagnostics
           include_tasks: download_extract_package.yml
           vars:
-            __hpc_pkg_info: "{{ __hpc_azure_diags_info }}"
+            __hpc_pkg_info: "{{ __hpc_azhpc_diags_info }}"
+
+        # The downloaded diagnostics script needs sufficient customisation that
+        # the simplest way to do this is to patch it. However, we also have to
+        # replace hard coded paths, so we need to be able to template it as
+        # well.
+        #
+        # Templating can only occur on the control node, so we must first run
+        # the patch through the template module and use the output as the patch
+        # source. Then we can apply the patch to the remote extracted file and
+        # copy it to the install location.
+        - name: Create a temp file for the diagnostics patch
+          tempfile:
+            state: file
+            prefix: hpc_diags
+            suffix: .patch
+          register: __hpc_diags_patch_file
+
+        - name: Configure the diagnostics patch
+          template:
+            src: azhpc_vm_diagnostics.sh.patch.j2
+            dest: "{{ __hpc_diags_patch_file.path }}"
+            mode: '0644'
+
+        - name: Patch Diagnostics script
+          patch:
+            src: "{{ __hpc_diags_patch_file.path }}"
+            dest: "{{ __hpc_pkg_extracted.path }}/Linux/src/gather_azhpc_vm_diagnostics.sh"
+            remote_src: false
+            state: present
+            strip: 1
+
+        - name: Install Diagnostics script
+          copy:
+            src: "{{ __hpc_pkg_extracted.path }}/Linux/src/gather_azhpc_vm_diagnostics.sh"
+            dest: "{{ __hpc_azure_tools_dir }}/gather_azhpc_vm_diagnostics.sh"
+            remote_src: true
+            owner: root
+            group: root
+            mode: '0755'
+
+        - name: Clean up temporary patch file
+          file:
+            path: "{{ __hpc_diags_patch_file.path }}"
+            state: absent
 
         - name: Remove extracted temp directory
           file:
@@ -1422,4 +1474,4 @@
 
 - name: Clean dnf cache
   command: dnf clean all
-  changed_when: false
+  changed_when: false
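The reason the patch is templated before it is applied can be illustrated with a small shell sketch. This is not how the role does it: the role renders the patch with the Ansible `template` module (Jinja2), whereas this sketch substitutes a single placeholder with `sed`. The function name `render_patch` is hypothetical, and the runtime directory `/var/hpc/azure` is an assumption inferred from the comment in the patch, not from the role's variables.

```shell
#!/bin/bash
# Hypothetical stand-in for the template step: replace the Jinja2-style
# placeholder in a patch (read from stdin) with a concrete directory.
# $1 is the runtime directory to substitute (assumed value in the example).
render_patch() {
    sed "s|{{ __hpc_azure_runtime_dir }}|$1|g"
}

# Render the one patch line that carries the placeholder.
printf '%s\n' '+DIAG_DIR_LOC="{{ __hpc_azure_runtime_dir }}/diagnostics"' |
    render_patch /var/hpc/azure
# prints: +DIAG_DIR_LOC="/var/hpc/azure/diagnostics"
```

Only after this substitution does the patch contain a literal path, which is why the rendered file, rather than the `.j2` source, is what gets handed to the `patch` task.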
Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
+--- gather_azhpc_vm_diagnostics.sh.orig 2026-02-05 15:20:05.410458535 +1100
++++ gather_azhpc_vm_diagnostics.sh 2026-02-11 18:24:23.902965560 +1100
+@@ -1,4 +1,7 @@
+ #!/bin/bash
++{{ ansible_managed | comment(prefix="", postfix="") | trim }}
++{{ "system_role:hpc" | comment(prefix="", postfix="") | trim }}
++
+ # Azure HPC Diagnostics Tool
+ # Gathers Diagnostic info from guest VM
+ #
+@@ -51,16 +54,11 @@
+ # Copyright (c) Microsoft Corporation.
+ # Licensed under the MIT license.
+ 
+-
+-
+ ####################################################################################################
+ # Begin Constants
+ ####################################################################################################
+ 
+ STREAM_URL='https://azhpcstor.blob.core.windows.net/diagtool-binaries/stream.tgz'
+-LSVMBUS_URL='https://raw.githubusercontent.com/torvalds/linux/master/tools/hv/lsvmbus'
+-HPC_DIAG_URL='https://raw.githubusercontent.com/Azure/azhpc-diagnostics/main/Linux/src/gather_azhpc_vm_diagnostics.sh'
+-SCRIPT_DIR="$( cd "$( dirname "$0" )" >/dev/null 2>&1 && pwd )"
+ SYSFS_PATH=/sys # store as a variable so it is mockable
+ ETC_PATH=/etc
+ PROC_PATH=/proc
+@@ -74,15 +72,7 @@
+ CPU_LIST=(["Standard_HB120rs_v2"]="0 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109,113,117"
+           ["Standard_HB60rs"]="0 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57")
+ RELEASE_DATE=20220316 # update upon each release
+-COMMIT_HASH=$(
+-    (
+-        command -v git >/dev/null &&
+-        cd "$SCRIPT_DIR" &&
+-        git config --get remote.origin.url | grep -q 'Azure/azhpc-diagnostics.git$' &&
+-        git rev-parse HEAD 2>/dev/null
+-    ) ||
+-    echo 'Unknown')
+-VERSION_INFO="$RELEASE_DATE-$COMMIT_HASH"
++VERSION_INFO="$RELEASE_DATE-unknown"
+ 
+ HELP_MESSAGE="
+ Usage: $0 [OPTION]
+@@ -100,8 +90,8 @@
+ Execution Mode:
+  --gpu-level=GPU_LEVEL     dcgmi run level (default is 1)
+  --mem-level=MEM_LEVEL     set to 1 to run stream test (default is 0)
+- --no-update               do not prompt for auto-update
+- --offline                 skips steps that require Internet access
++ --no-update               Does nothing, auto-update functionality has been elided.
++ --online                  Run steps that require Internet access
+ 
+ For more information on this script and the data it gathers, visit its Github:
+ 
+@@ -217,10 +207,9 @@
+     echo "${CPU_LIST[$1]}"
+ }
+ 
++COLUMNS=80
+ if tput cols >/dev/null 2>/dev/null && (( $(tput cols) < 80 )); then
+     COLUMNS=$(tput cols)
+-else
+-    COLUMNS=80
+ fi
+ 
+ print_enclosed() {
+@@ -244,24 +233,6 @@
+     echo ''
+ }
+ 
+-check_for_updates() {
+-    local message="You are not running the latest release of this tool. Switch to latest version?"
+-
+-    local tmpfile
+-    tmpfile=$(mktemp)
+-    curl -s "$HPC_DIAG_URL" >"$tmpfile" || return 1
+-    if ! cmp --silent "$0" "$tmpfile"; then
+-        if prompt "$message"; then
+-            mv "$tmpfile" "$0"
+-            bash "$0" "$RUNTIME_OPTIONS"
+-            exit $?
+-        else
+-            return 0
+-        fi
+-    fi
+-    rm "$tmpfile"
+-}
+-
+ get_metadata() {
+     local path="$1"
+     curl -s -H Metadata:true "http://169.254.169.254/metadata/instance/$path?api-version=2021-03-01&format=text"
+@@ -876,11 +847,11 @@
+     print_divider
+     print_enclosed "NOTICES:"
+     print_divider
+-    print_enclosed This tool generates and bundles together various logs and diagnostic information. It, however, DOES NOT TRANSMIT any of said data. It is left to the user to choose to transmit this data to Microsoft.
++    print_enclosed This tool generates and bundles together various logs and diagnostic information. It, however, DOES NOT TRANSMIT any of said data. It is left to the user to choose to transmit this data to Red Hat.
+     print_divider
+-    print_enclosed Some of this info, such as IP addresses, may be Personally Identifiable Information. It is up to the user to redact any sensitive info from the output 'if' necessary before sending it to Microsoft.
++    print_enclosed Some of this info, such as IP addresses, may be Personally Identifiable Information. It is up to the user to redact any sensitive info from the output 'if' necessary before sending it to Red Hat.
+     print_divider
+-    print_enclosed This tool invokes various 3rd party tools 'if' they are present on the system Please review them and their EULAs at:
++    print_enclosed This tool invokes various 3rd party tools 'if' they are present on the system. Please review them and their EULAs at:
+     print_enclosed "https://github.com/Azure/azhpc-diagnostics"
+     print_divider
+     print_enclosed WARNING: THINK BEFORE YOU RUN THIS
+@@ -1018,9 +989,6 @@
+     print_enclosed 'Placing diagnostic files in the following location:'
+     print_enclosed "$DIAG_DIR.tar.gz"
+     print_divider
+-    print_enclosed If you have already opened a support request, you can take the tarball and follow this link to upload it:
+-    print_enclosed 'https://portal.azure.com/#blade/Microsoft_Azure_Support/HelpAndSupportBlade/managesupportrequest'
+-    print_divider
+     tar czf "$DIAG_DIR.tar.gz" -C "$DIAG_DIR_LOC" "$VM_ID.$TIMESTAMP" 2>/dev/null && rm -r "$DIAG_DIR"
+ }
+ 
+@@ -1030,9 +998,11 @@
+ 
+ GPU_LEVEL=1
+ MEM_LEVEL=0
++OFFLINE=true
+ DISPLAY_HELP=false
+-# should be /opt/azurehpc/diagnostics
+-DIAG_DIR_LOC="$SCRIPT_DIR"
++DISPLAY_VERSION=false
++# should be /var/hpc/azure/diagnostics
++DIAG_DIR_LOC="{{ __hpc_azure_runtime_dir }}/diagnostics"
+ 
+ # save options
+ RUNTIME_OPTIONS=$*
+@@ -1063,8 +1033,8 @@
+             validate_run_level "$1"
+             MEM_LEVEL="$1"
+             ;;
+-        --no-update) DISABLE_UPDATE=true;;
+-        --offline) OFFLINE=true;;
++        --no-update) ;; # does nothing
++        --online) OFFLINE=false;;
+         --tuning) TUNING=true;;
+         -V|--version) DISPLAY_VERSION=true;;
+     esac
+@@ -1081,10 +1051,6 @@
+ # End Option Parsing
+ ####################################################################################################
+ 
+-if [ "$OFFLINE" != true ] && [ "$DISABLE_UPDATE" != true ] && ! [[ $- =~ 's' ]]; then
+-    check_for_updates
+-fi
+-
+ if [ ! "${BASH_SOURCE[0]}" -ef "$0" ]; then
+     # This lets us load all functions for unit testing.
+     # We wouldn't want people sourcing this script anyway.
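
The behavioural effect of the option changes in this patch (offline by default, `--online` to opt in, `--no-update` accepted but ignored) can be sketched in isolation. This is a reduced illustration, not the script's actual option loop: the function `parse_mode` is hypothetical, and only the two options touched by the patch are modelled.

```shell
#!/bin/bash
# Hypothetical helper mirroring the patched defaults; parse_mode does not
# exist in gather_azhpc_vm_diagnostics.sh.
parse_mode() {
    local offline=true              # patched default: offline unless told otherwise
    while [ $# -gt 0 ]; do
        case "$1" in
            --no-update) ;;         # still accepted, but does nothing
            --online) offline=false ;;
        esac
        shift
    done
    if [ "$offline" = true ]; then
        echo offline
    else
        echo online
    fi
}

parse_mode                # prints: offline
parse_mode --online       # prints: online
parse_mode --no-update    # prints: offline
```

Because the `--offline` case is removed, in this sketch that argument simply falls through the `case` and the default stays in effect; how the real script treats unrecognised options is not shown in the diff.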
