43 changes: 0 additions & 43 deletions Example_output

This file was deleted.

99 changes: 92 additions & 7 deletions README.md
@@ -9,21 +9,106 @@ This repository contains scripts to pull resource usage data from job logs into

### NCI Gadi HPC

**[gadi-usage-report.pl](Scripts/gadi_usage_report.pl)**
**[gadi_usage_report_v1.2.pl](Scripts/gadi_usage_report_v1.2.pl)**

Description:

This script gathers the job compute requests and usage metrics from Gadi PBS logs and summarises them into a tab-delimited output.

Efficiency/utilisation values are reported for CPU using the formula `cpu_e = cputime/walltime/cpus_used`.
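As a quick sanity check of that formula, here is a minimal shell sketch; the numbers are illustrative (in the style of the example output further down), and `awk` handles the floating-point division:

```bash
# Illustrative values pulled from a PBS log summary.
cputime_mins=130.72
walltime_mins=13.42
ncpus=64

# cpu_e = cputime / walltime / cpus_used
awk -v c="$cputime_mins" -v w="$walltime_mins" -v n="$ncpus" \
    'BEGIN { printf "CPU efficiency: %.2f\n", c / w / n }'
# Prints: CPU efficiency: 0.15
```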

GPU usage (NGPUS, memory used, and GPU utilisation) can optionally be reported by applying the `-g` flag to the run.


Options:

```
-a <dir>       Report on all .o log files in the specified directory
-l <logfile>   Report on one exact logfile
-p <pattern>   Report on .o log files matching a filename pattern
-g             Include GPU metrics
```

At least one of `-a <val>`, `-l <val>` or `-p <val>` must be supplied.

GPU metrics can be included with any of the above 3 parameters with the optional `-g` flag. Logs with no GPU usage will have `NA` for the 3 GPU output fields.

Usage examples:

```bash
perl gadi_usage_report_v1.2.pl -a /path/to/logdir   # all logs in dir
perl gadi_usage_report_v1.2.pl -l myjob.o -g        # a specific log, report GPU usage
perl gadi_usage_report_v1.2.pl -p name              # all logs with names containing 'name'
```

If no prefix is specified, a warning will be given, and the usage metrics will be reported for all job logs found within the present directory. Please see the script header for execution instructions.

Output:

Tab-delimited summary of the resources requested and used for each job will be printed to STDOUT.

Use output redirection when executing the script to save the data to a text file, eg:

**[gadi-queuetime-report.pl](Scripts/gadi_queuetime_report.pl)**
`perl <path/to/script>/gadi_usage_report_v1.2.pl <options> > resources_summary.txt`

If no prefix is specified, a warning will be given, and the usage metrics will be reported for all job logs found within the present directory.

This script reports the queue time of a collection of completed jobs with the same output log file prefix on Gadi. If no prefix is specified, a warning will be given, and the queue time will be reported for all jobs with logs found within the present directory. Please note that PBS does not preserve job history on Gadi past 24 hours post job-completion.
Example output:

To remove this time restriction, jobs can be submitted with the line `qstat -xf $PBS_JOBID` anywhere in the job script, with or without output redirection. This preserves the required record in the ".o" output log file (no output redirection) or in a separate file (with output redirection). There are three ways in which this script can be run. Please see the script header for execution instructions.
```console
perl ./HPC_usage_reports/Scripts/gadi_usage_report_v1.2.pl -a /scratch/aa00/my-pbs-logs/ -g

######
Reporting on all usage log files in /scratch/aa00/my-pbs-logs/.
######

#JobName Exit_status Service_units CPU_efficiency CPUs GPU_util NGPUS Mem_req Mem_used GPU_mem_used CPUtime_mins Walltime_req Walltime_mins JobFS_req JobFS_used Date
hg38_1140_test_three_cpu_only.o 0 8.28 0.14 12 NA NA 48.0GB 14.99GB NA 11.88 00:10:00 6.90 100.0MB 0B 2026-03-19
dgxa100_4pod5drs_2ngpu.o 0 64.40 0.15 64 0.83 4 1000.0GB 34.71GB 312.89GB 130.72 00:30:00 13.42 200.0GB 0B 2026-04-13
gpuhopper_4pod5drs_2ngpu.o 0 41.20 0.36 24 0.74 2 480.0GB 33.39GB 173.62GB 117.37 00:30:00 13.73 200.0GB 0B 2026-04-12
gpuhopper_4pod5drs_4ngpu.o 0 115.60 0.21 48 0.11 4 1.0TB 35.7GB 372.48GB 190.73 00:30:00 19.27 200.0GB 0B 2026-04-13
gpuvolta_4pod5drs_2ngpu.o 0 113.84 0.19 48 0.83 4 382.0GB 32.98GB 91.47GB 431.80 01:00:00 47.43 200.0GB 0B 2026-04-13

```
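As a sketch only, the `qstat -xf $PBS_JOBID` line described above can sit anywhere in a PBS job script. Every directive, project code, and path below is a placeholder, not a recommendation:

```bash
#!/bin/bash
#PBS -P ab12                                # placeholder project code
#PBS -q normal
#PBS -l ncpus=4,mem=16GB,walltime=01:00:00
#PBS -l wd

# Preserve the full job record in this job's .o log so usage can still
# be reported after PBS drops the job history (24 h post-completion).
qstat -xf $PBS_JOBID

./my_analysis.sh                            # placeholder workload
```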

**[gadi-nfcore-report.sh](Scripts/gadi_nfcore_report.sh)**

This script gathers the job requests and usage metrics from Gadi log files, in the same way as [gadi-queuetime-report.pl](Scripts/gadi_queuetime_report.pl). However, this script loops through the Nextflow work directory to collect `.command.log` files and prints all output to a .tsv file: `gadi-nf-core-joblogs.tsv`
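The collection step can be sketched with `find`; the directory layout shown is Nextflow's default, where each task runs in `work/<2-char>/<hash>/` and appends its PBS summary to `.command.log`:

```bash
# List every task log under the Nextflow work directory for parsing.
find work -type f -name ".command.log"
```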

**[gadi_nextflow_usage_v1.1.sh](Scripts/gadi_nextflow_usage_v1.1.sh)**

This script takes a nextflow run name (e.g. from `nextflow log`), pulls out all the task hashes from the run, and finds the relevant work directory to collect `.command.log` files from that run only. The script gathers the job requests and usage metrics from Gadi post-job files similar to [gadi-queuetime-report.pl](Scripts/gadi_queuetime_report.pl) and
[gadi-nfcore-report.sh](Scripts/gadi_nfcore_report.sh).

Results are printed to file: `resource_usage.<nextflow_run_name>.log`.

The script requires the nextflow run name as its first positional argument; an optional second argument sets the work directory (default: `work`). If you have forgotten the run name, identify it from the output of the `nextflow log` command (the most recent run name is printed closest to the command prompt):

```bash
module load nextflow
nextflow log
```

```console
TIMESTAMP DURATION RUN NAME STATUS REVISION ID SESSION ID COMMAND

2026-02-25 11:51:55 - kickass_cantor - 593881520d e2ddc027-c09f-487c-a241-be9771114df6 nextflow run main.nf ...
2026-02-25 11:54:03 50m 35s loving_boltzmann ERR 593881520d e2ddc027-c09f-487c-a241-be9771114df6 nextflow run main.nf ...
2026-02-25 13:07:06 5h 34m 53s maniac_lorenz OK 593881520d e2ddc027-c09f-487c-a241-be9771114df6 nextflow run main.nf ...
```

Run the script:

```bash
bash Scripts/gadi_nextflow_usage_v1.1.sh maniac_lorenz
```

Example output:

```console
Job_name Hash Log_path Exit_status Service_units NCPUs_requested CPU_time_used(mins) CPU_efficiency Memory_requested Memory_used Walltime_requested Walltime_used(mins) JobFS_requested JobFS_used
PREPARE_GENOME:INDEX_MINIMAP2 (T2T) 68/7bbdc7 ../work/68/7cbdc706bba77935ff576939e5478a/.command.log 0 0.19 4 2.42 0.4315 16.0GB 16.0GB 0:30:00 1.4 100.0MB 0B
PREPARE_GENOME:BUILD_BED12 (T2T) a1/bad978 ../work/a1/bad9785fc4775f110218ef1a350609/.command.log NA NA NA NA NA NA NA NA NA NA NA
PREPARE_GENOME:INDEX_SAMTOOLS (T2T) 8d/3dd9b9 ../work/8d/32d9b9ebe7d3993d8cc952f15b75e0/.command.log 0 0.09 12 0.15 0.0577 48.0GB 3.1GB 1:00:00 0.22 100.0MB 0B
MAPPING:MINIMAP2_MAP_SORT_INDEX (15022) 7a/2e38bd ../work/7a/2e38bd6b0dec1dc227f8de0b90d389/.command.log 0 33.71 24 608.33 0.6016 96.0GB 84.49GB 6:00:00 42.13 100.0MB 0B
MAPPING:MINIMAP2_MAP_SORT_INDEX (15022) f6/d611b9 ../work/f6/d611b993ae5fcf51ca6c0307425c98/.command.log 0 55.71 24 1066.82 0.6384 96.0GB 78.22GB 6:00:00 69.63 100.0MB 0B
BAM_QC (BAM QC: 15022) 1d/a2c7f5 ../work/1d/a2c7f512627c840726d4c7375df4f2/.command.log 0 7.11 12 41.68 0.1953 48.0GB 48.0GB 1:00:00 17.78 100.0MB 0B
```
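The TSV output lends itself to quick post-processing. For example, ranking tasks by CPU efficiency (column 8), skipping the header and `NA` rows; the input filename is the script's default for the run shown above:

```bash
# Lowest-efficiency tasks print first; adjust the column for other metrics.
awk -F'\t' 'NR > 1 && $8 != "NA"' resource_usage.maniac_lorenz.log \
    | sort -t$'\t' -k8,8n \
    | cut -f1,8
```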
File renamed without changes.
123 changes: 123 additions & 0 deletions Scripts/Archive/gadi_usage_report_v1.2.pl
@@ -0,0 +1,123 @@
#!/usr/bin/env perl

#------------------------------------------------------------------
# gadi_usage_report/1.2
# Platform: NCI Gadi HPC
#
# Description:
# This script gathers the job requests and usage metrics from Gadi log
# files for a collection of job log files with the same prefix within the
# same directory, and calculates efficiency values using the formula
# e = cputime/walltime/cpus_used.
# If no prefix is specified, a warning will be given, and the usage metrics
# will be reported for all job logs found within the present directory.
#
# Version 1.1 updates
# Reports usage for all logs in /path/to/dir or for logs specified
# Faster, by only checking end of log (was slow for logs with big
# stdout)
# Reports job exit status
# Reports files with no usage log
#
# Usage:
# command line, eg:
# perl gadi_usage_report_v1.2.pl /path/to/logdir
# perl gadi_usage_report_v1.2.pl myjob.o
#
# Output:
# Tab-delimited summary of the resources requested and used for each job
# will be printed to STDOUT. Use output redirection when executing the
# script to save the data to a text file, eg:
# perl <path/to/script/gadi_usage_report.pl <prefix> > resources_summary.txt
#
# Date last modified: 13/04/26
# Version 1.2 updates:
# - reorder headings now the log is so long, to bring VIP details to fore
# - remove time, to reduce log complexity
# - fixed new failure from NCI dropping redundant CPU field from PBS .o log
# - added usage option to do all logs matching pattern word:
# perl gadi_usage_report_v1.2.pl myjob # will do all logs in dir with name containing 'myjob'
#
# If you use this script towards a publication, please acknowledge the
# Sydney Informatics Hub (or co-authorship, where appropriate).
#
# Suggested acknowledgement:
# The authors acknowledge the scientific and technical assistance
# <or e.g. bioinformatics assistance of <PERSON>> of Sydney Informatics
# Hub and resources and services from the National Computational
# Infrastructure (NCI), which is supported by the Australian Government
# with access facilitated by the University of Sydney.
#------------------------------------------------------------------

use warnings;
use strict;
use POSIX;
use File::Basename;

my $dir=`pwd`;
chomp $dir;
my @logs;
my @no_report;

my $prefix = '';
if ($ARGV[0]) {
    $prefix = $ARGV[0];
    chomp $prefix;
    if ($ARGV[0] =~ m/\.o$/) { # escape the dot so only names ending in ".o" match
        @logs = (`ls "$prefix"`);
    }
    else {
        @logs = split(' ', `ls $dir\/*$prefix*.o`);
    }
}
else {
    print "\n######\nNo usage log prefix specified. Will report on all usage log files in $dir.\n######\n\n";
    @logs = split(' ', `ls $dir\/*.o`);
}

my $report = {};

if (@logs) {
    print "#JobName\tExit_status\tService_units\tCPU_efficiency\tCPUs\tMem_req\tMem_used\tCPUtime_mins\tWalltime_req\tWalltime_mins\tJobFS_req\tJobFS_used\tDate\n";

    foreach my $file (@logs) {
        chomp $file;
        my @name_fields = split('\/', $file);
        my $name = basename($file);
        my @walltime = split(' ', `tail -12 $file | grep "Walltime"`);
        if ($walltime[2]) {
            my $walltime_req = $walltime[2];
            my $walltime_used = $walltime[5];
            my ($wall_hours, $wall_mins, $wall_secs) = split('\:', $walltime_used);
            my $walltime_mins = sprintf("%.2f", (($wall_hours * 60) + $wall_mins + ($wall_secs / 60)));
            my @cpus = split(' ', `tail -12 $file | grep -i "NCPUs"`);
            my $cpus = $cpus[2];
            my @mem = split(' ', `tail -n 12 $file | grep -i "Memory"`);
            my $mem_req = $mem[2];
            my $mem_used = $mem[5];
            chomp (my $cputime = `tail -12 $file | grep -i "CPU Time Used" | awk '{print \$4}'`);
            # Initialise all four values (a bare `= 0` would only set the first).
            my ($cpu_hours, $cpu_mins, $cpu_secs, $cputime_mins) = (0, 0, 0, 0);
            my @jobFS = split(' ', `tail -12 $file | grep -i "JobFS"`);
            my $jobFS_req = $jobFS[2];
            my $jobFS_used = $jobFS[5];
            my $cpu_e = 0;
            if ($cpus !~ m/unknown/) { # not sure if this 'unknown' report ever happens on Gadi like it does on Artemis...
                $cpus = ceil($cpus);
                ($cpu_hours, $cpu_mins, $cpu_secs) = split('\:', $cputime);
                $cputime_mins = sprintf("%.2f", (($cpu_hours * 60) + $cpu_mins + ($cpu_secs / 60)));
                $cpu_e = sprintf("%.2f", ($cputime_mins / $walltime_mins / $cpus));
            }
            chomp (my $SUs = `tail -12 $file | grep -i "Service Units" | awk '{print \$3}'`);
            chomp (my $exit_status = `tail -12 $file | grep -i "Exit Status" | cut -d ":" -f2 | awk '{\$1=\$1};1' | awk '{print \$1}'`);
            chomp (my $date = `tail -12 $file | grep -i "Resource Usage on" | awk '{print \$4}'`);
            chomp (my $time = `tail -12 $file | grep -i "Resource Usage on" | awk '{print \$5}' | sed 's/:\$//'`);
            print "$name\t$exit_status\t$SUs\t$cpu_e\t$cpus\t$mem_req\t$mem_used\t$cputime_mins\t$walltime_req\t$walltime_mins\t$jobFS_req\t$jobFS_used\t$date\n";
        }
        else {
            push(@no_report, $file);
        }
    }
}
if (@no_report) {
    print "\n\n######\nWARNING: Usage metrics were not reported for: @no_report\n######\n\n";
}
104 changes: 104 additions & 0 deletions Scripts/gadi_nextflow_usage_v1.1.sh
@@ -0,0 +1,104 @@
#!/bin/bash

module load nextflow

RUN_NAME="$1"
WORKDIR="${2:-work}" # optional positional command line argument, default is './work'

if [ -z "$RUN_NAME" ]; then
    echo "No run name supplied. Exiting."
    exit 1
fi

OUTPUT="resource_usage.${RUN_NAME}.log"
TMPOUT="${OUTPUT}.tmp"

if [ -f "$TMPOUT" ]; then
    echo "Temp file ${TMPOUT} already exists. Refusing to run."
    exit 1
fi

if [ ! -d "$WORKDIR" ]; then
    echo "Cannot find work directory $WORKDIR. Exiting."
    exit 1
fi

nextflow log -f hash,name "$RUN_NAME" > "$TMPOUT"

if [[ ! -s "$TMPOUT" ]]; then
    echo "ERROR: run name $RUN_NAME not found in this directory" >&2
    rm -f "$TMPOUT"
    exit 1
fi

echo -e "Job_name\tHash\tLog_path\tExit_status\tService_units\tNCPUs_requested\tCPU_time_used(mins)\tCPU_efficiency\tMemory_requested\tMemory_used\tWalltime_requested\tWalltime_used(mins)\tJobFS_requested\tJobFS_used" > "$OUTPUT"

while read -r HASH JOBNAME; do
    LOG=$(find "$WORKDIR" -type f -path "*/${HASH}*" -name ".command.log" | head -n 1)

    if [[ -z "$LOG" ]]; then
        continue
    fi

    awk -v OFS="\t" -v logfile="$LOG" -v hash="$HASH" -v jobname="$JOBNAME" '
        # Convert hh:mm:ss to minutes; returns NA for malformed input.
        function time_to_mins(t, a, n, h, m, s, total_secs) {
            n = split(t, a, ":")
            if (n != 3) return "NA"
            h = a[1] + 0
            m = a[2] + 0
            s = a[3] + 0
            total_secs = (h * 3600) + (m * 60) + s
            return total_secs / 60
        }

        BEGIN {
            exit_status = "NA"
            service_units = "NA"
            ncpus_requested = "NA"
            cpu_time_used = "NA"
            cpu_time_used_mins = "NA"
            cpu_efficiency = "NA"
            memory_requested = "NA"
            memory_used = "NA"
            walltime_requested = "NA"
            walltime_used = "NA"
            walltime_used_mins = "NA"
            jobfs_requested = "NA"
            jobfs_used = "NA"
        }

        # The PBS usage summary follows a "====" rule and a "Resource Usage" line.
        /^=+$/ {flag1=1; next}
        flag1 && ! /Resource Usage/ {flag1=0; next}
        flag1 && /Resource Usage/ {flag2=1; next}

        flag2 {
            if ($0 ~ /Exit Status/) exit_status = $3
            if ($0 ~ /Service Units/) service_units = $3
            if ($0 ~ /NCPUs Requested/) ncpus_requested = $3
            if ($0 ~ /CPU Time Used/) cpu_time_used = $7
            if ($0 ~ /Memory Requested/) memory_requested = $3
            if ($0 ~ /Memory Used/) memory_used = $6
            if ($0 ~ /Walltime Requested/) walltime_requested = $3
            if ($0 ~ /Walltime Used/) walltime_used = $6
            if ($0 ~ /JobFS Requested/) jobfs_requested = $3
            if ($0 ~ /JobFS Used/) jobfs_used = $6
        }

        END {
            if (cpu_time_used != "NA")
                cpu_time_used_mins = sprintf("%.2f", time_to_mins(cpu_time_used))

            if (walltime_used != "NA")
                walltime_used_mins = sprintf("%.2f", time_to_mins(walltime_used))

            if (cpu_time_used != "NA" && walltime_used != "NA" && ncpus_requested != "NA" && ncpus_requested > 0) {
                cpu_efficiency = time_to_mins(cpu_time_used) / time_to_mins(walltime_used) / ncpus_requested
                cpu_efficiency = sprintf("%.4f", cpu_efficiency)
            }

            print jobname, hash, logfile, exit_status, service_units, ncpus_requested, cpu_time_used_mins, cpu_efficiency, memory_requested, memory_used, walltime_requested, walltime_used_mins, jobfs_requested, jobfs_used
        }' "$LOG"

done < "$TMPOUT" >> "$OUTPUT"

rm "$TMPOUT"
Empty file modified Scripts/gadi_nfcore_report.sh
100644 → 100755
2 changes: 1 addition & 1 deletion Scripts/gadi_queuetime_report.pl
@@ -85,7 +85,7 @@
use POSIX;
use Time::Local;

my $dir=`pwd`;
my $dir='.';
chomp $dir;

my $prefix = '';