
Commit 57b329f

update the OLCF workflow with best practices

1 parent 2b7c682 commit 57b329f

2 files changed: 68 additions & 7 deletions

job_scripts/frontier/frontier.slurm (30 additions & 6 deletions)
@@ -10,6 +10,7 @@
 #SBATCH --cpus-per-task=7
 #SBATCH --gpus-per-task=1
 #SBATCH --gpu-bind=closest
+#SBATCH --signal=B:URG@300
 
 EXEC=./Castro3d.hip.x86-trento.MPI.HIP.SMPLSDC.ex
 INPUTS=inputs_3d.N14.coarse
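As an aside, the exit-status check later in this script compares against `128 + 23`; 23 is SIGURG's number on Linux x86-64 nodes like Frontier's (signal numbers vary on some other architectures). You can confirm the mapping from the shell:

```shell
# look up which signal has number 23 (URG on Linux x86-64);
# bash's builtin kill -l maps signal numbers to names
kill -l 23
```

This prints `URG`, confirming that a `wait` interrupted by the batch-script warning signal returns status 151 (128 + 23).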
@@ -22,18 +23,13 @@ module load rocm/6.3.1
 
 export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
 
-# libfabric workaround
-export FI_MR_CACHE_MONITOR=memhooks
-
 # set the file system striping
 
 echo $SLURM_SUBMIT_DIR
 
 module load lfs-wrapper
 lfs setstripe -c 32 -S 10M $SLURM_SUBMIT_DIR
 
-module list
-
 function find_chk_file {
     # find_chk_file takes a single argument -- the wildcard pattern
     # for checkpoint files to look through
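The body of `find_chk_file` is truncated in this diff, but the idea is to scan the checkpoint directories matching a wildcard in sorted order and remember the last (most recent) one. A simplified standalone sketch of that idea follows, with a small demo; the checkpoint names here are made up, and the real function in `frontier.slurm` may handle partially written checkpoints and differ in detail:

```shell
# simplified find_chk_file-style helper: walk the sorted matches and
# keep the last one, leaving it in $restartFile
function find_chk_file {
    local chk
    for chk in $(ls -d $1 2> /dev/null | sort); do
        restartFile=$chk
    done
}

# demo with invented checkpoint directories in a scratch location
tmp=$(mktemp -d)
cd "$tmp"
mkdir chk00050 chk00075 chk00100

restartFile=""
find_chk_file "chk?????"

# build the restart argument the way the job script does
if [ -n "$restartFile" ]; then
    restartString="amr.restart=${restartFile}"
else
    restartString=""
fi
echo "$restartString"
```

With the three demo directories above, this selects `chk00100` and prints `amr.restart=chk00100`.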
@@ -76,6 +72,21 @@ else
     restartString="amr.restart=${restartFile}"
 fi
 
+
+# clean up any run management files left over from previous runs
+rm -f dump_and_stop
+
+# The `--signal=B:URG@<n>` option tells slurm to send SIGURG to this batch
+# script n seconds before the runtime limit, so we can exit gracefully.
+function sig_handler {
+    touch dump_and_stop
+    # disable this signal handler
+    trap - URG
+    echo "BATCH: allocation ending soon; telling Castro to dump a checkpoint and stop"
+}
+trap sig_handler URG
+
+
 export OMP_NUM_THREADS=1
 export NMPI_PER_NODE=8
 export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${NMPI_PER_NODE} ))
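The trap/wait pattern added here is worth seeing in isolation. The sketch below reproduces it as a standalone demo, substituting a plain `sleep` for `srun` and sending SIGURG to ourselves instead of relying on Slurm's `--signal` option (timings are shortened for the demo):

```shell
# standalone demo of the trap/wait pattern from the job script above
rm -f dump_and_stop

function sig_handler {
    touch dump_and_stop
    # disable this signal handler
    trap - URG
    echo "BATCH: allocation ending soon; telling the app to dump a checkpoint and stop"
}
trap sig_handler URG

# stand-in for the long-running srun; run in the background so the
# shell is free to handle the signal
sleep 5 &
pid=$!

# simulate Slurm delivering the warning signal one second in
(sleep 1; kill -URG $$) &

wait $pid
ret=$?

# a wait interrupted by a trapped signal returns 128 + signum
if (( ret > 128 )); then
    # received the signal; resume waiting for the app to finish
    wait $pid
    ret=$?
fi

echo "exit status: $ret"
```

Running this, the handler fires after one second, creates `dump_and_stop`, and the second `wait` then collects the stand-in application's normal exit status of 0.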
@@ -107,5 +118,18 @@ echo appending parameters: ${FILE_IO_PARAMS}
 
 (sleep 300; check_restart ) &
 
-srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS ${restartString} ${FILE_IO_PARAMS}
+# execute srun in the background then use the builtin wait so the shell can
+# handle the signal
+srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS ${restartString} ${FILE_IO_PARAMS} &
+pid=$!
+wait $pid
+ret=$?
+
+if (( ret == 128 + 23 )); then
+    # received SIGURG, keep waiting
+    wait $pid
+    ret=$?
+fi
+
+exit $ret
 
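The `(sleep 300; check_restart ) &` line above arms a watchdog; the body of `check_restart` is not shown in this diff. The standalone demo below illustrates the general idea with short timings, an invented progress-marker file, and `kill` standing in for whatever cancellation the real `check_restart` performs (which presumably inspects the run's output and cancels the allocation):

```shell
# watchdog demo: kill a background job if it has not produced a
# progress marker within a grace period (names and timings invented)
rm -f progress_marker

# stand-in for srun: a job that never writes the marker,
# simulating a hung restart
sleep 30 &
job=$!

function check_restart {
    if [ ! -f progress_marker ]; then
        echo "watchdog: no progress detected; killing job"
        kill $job 2> /dev/null
    fi
}

# arm the watchdog in the background (1 s grace period here;
# the job script waits 300 s)
(sleep 1; check_restart) &

wait $job
ret=$?
echo "job exit status: $ret"
```

Since the stand-in job never writes `progress_marker`, the watchdog kills it with SIGTERM after one second and `wait` reports status 143 (128 + 15).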

sphinx_docs/source/olcf-workflow.rst (38 additions & 1 deletion)
@@ -38,7 +38,30 @@ Submitting jobs
 
 Frontier uses SLURM.
 
-Here's a script that runs on GPUs and has the I/O fixes described above.
+Here's a script that uses our best practices on Frontier. It uses 64 nodes (512 GPUs)
+and does the following:
+
+* Sets the filesystem striping (see https://docs.olcf.ornl.gov/data/index.html#lfs-setstripe-wrapper).
+
+* Includes logic for automatically restarting from the last checkpoint file
+  (useful for job-chaining). This is done via the ``find_chk_file`` function.
+
+* Installs a signal handler to create a ``dump_and_stop`` file shortly before
+  the queue window ends. This ensures that we get a checkpoint at the very
+  end of the queue window.
+
+* Can do a special check on restart to ensure that we don't hang on
+  reading the initial checkpoint file (uncomment the line):
+
+  ::
+
+     (sleep 300; check_restart ) &
+
+  This uses the ``check_restart`` function and will kill the job if it doesn't
+  detect a successful restart within 5 minutes.
+
+* Adds special I/O parameters to the job to work around filesystem issues
+  (these are defined in ``FILE_IO_PARAMS``).
 
 .. literalinclude:: ../../job_scripts/frontier/frontier.slurm
    :language: bash
@@ -51,6 +74,20 @@ The job is submitted as:
 
 where ``frontier.slurm`` is the name of the submission script.
 
+.. note::
+
+   If the job times out before writing a checkpoint (leaving a
+   ``dump_and_stop`` file behind), you can give it more time between the
+   warning signal and the end of the allocation by adjusting the
+   ``#SBATCH --signal=B:URG@<n>`` line at the top of the script.
+
+   Also, by default, AMReX will output a plotfile at the same time as a
+   checkpoint file, which means you'll get one from the ``dump_and_stop``,
+   and it may not fall at the time intervals set by ``amr.plot_per``.
+   To suppress this, set:
+
+   ::
+
+      amr.write_plotfile_with_checkpoint = 0
 
 Also see the WarpX docs: https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html