@@ -38,7 +38,30 @@ Submitting jobs
3838
3939Frontier uses SLURM.
4040
41- Here's a script that runs on GPUs and has the I/O fixes described above.
41+ Here's a script that uses our best practices on Frontier. It uses 64 nodes (512 GPUs)
42+ and does the following:
43+
44+ * Sets the filesystem striping (see https://docs.olcf.ornl.gov/data/index.html#lfs-setstripe-wrapper)
45+
46+ * Includes logic for automatically restarting from the last checkpoint file
47+ (useful for job-chaining). This is done via the ``find_chk_file`` function.
48+
49+ * Installs a signal handler to create a ``dump_and_stop`` file shortly before
50+ the queue window ends. This ensures that we get a checkpoint at the very
51+ end of the queue window.
52+
53+ * Can optionally check on restart that the job does not hang while
54+ reading the initial checkpoint file (to enable this, uncomment the line):
55+
56+ ::
57+
58+ (sleep 300; check_restart ) &
59+
60+ This uses the ``check_restart`` function and will kill the job if it doesn't
61+ detect a successful restart within 5 minutes.
62+
63+ * Adds special I/O parameters to the job to work around filesystem issues
64+ (these are defined in ``FILE_IO_PARAMS``).
4265
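The restart logic described in the list above can be sketched as follows. This is a minimal illustration, not the contents of ``frontier.slurm``: the function name ``latest_complete_chk`` is hypothetical, and the sketch assumes that a checkpoint is a directory matching a wildcard such as ``chk*`` and that its ``Header`` file is written last, so a directory without a ``Header`` is treated as incomplete.

```shell
# Hypothetical helper (not the find_chk_file function from frontier.slurm):
# print the newest checkpoint directory that contains a Header file.
# Assumption: a missing Header means the checkpoint was not fully written.
# Prints an empty line if no complete checkpoint is found.
latest_complete_chk() {
    pattern=$1
    result=""
    # Names sort lexicographically, which matches step order only if the
    # step numbers are zero-padded to a common width.
    for d in $(find . -maxdepth 1 -type d -name "$pattern" | sort); do
        if [ -f "$d/Header" ]; then
            # Later (newer) complete checkpoints overwrite earlier ones.
            result=$d
        fi
    done
    printf '%s\n' "$result"
}
```

Calling ``latest_complete_chk "chk*"`` in the run directory prints the path of the newest complete checkpoint, or an empty line if there is none, in which case the job would start from the beginning. Falling back to an older checkpoint when the newest one is incomplete is what makes this pattern convenient for job-chaining.
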
4366.. literalinclude:: ../../job_scripts/frontier/frontier.slurm
4467 :language: bash
@@ -51,6 +74,20 @@ The job is submitted as:
5174
5275where ``frontier.slurm`` is the name of the submission script.
5376
77+ .. note::
78+
79+ If the job times out before writing out a checkpoint (leaving a
80+ ``dump_and_stop`` file behind), you can give it more time between the
81+ warning signal and the end of the allocation by increasing ``<n>`` (in seconds)
82+ in the ``#SBATCH --signal=B:URG@<n>`` line at the top of the script.
83+
84+ Also, by default, AMReX writes a plotfile whenever it writes a checkpoint file,
85+ so the ``dump_and_stop`` checkpoint also produces a plotfile, which may not fall
86+ on the interval set by ``amr.plot_per``. To suppress this, set:
87+
88+ ::
89+
90+ amr.write_plotfile_with_checkpoint = 0
5491
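The interaction between the warning signal and the ``dump_and_stop`` file can be sketched as follows. This is an illustration under stated assumptions, not the handler in ``frontier.slurm``: the function names ``request_dump`` and ``install_dump_handler`` are hypothetical, and the sketch assumes that the batch shell receives ``SIGURG`` (as requested by ``--signal=B:URG@<n>``) and that the run directory is the current directory.

```shell
# Hypothetical handler: create the dump_and_stop file, which asks the
# running code to write a checkpoint and stop.
request_dump() {
    touch dump_and_stop
}

# Install the handler for SIGURG, the signal named in
# "#SBATCH --signal=B:URG@<n>".
install_dump_handler() {
    trap request_dump URG
}

install_dump_handler
```

The shell runs a trap handler only after the current foreground command returns, so in a batch script the application launcher would need to be started in the background and followed by ``wait`` for the handler to fire while the application is still running. A larger ``<n>`` leaves more time between the creation of ``dump_and_stop`` and the end of the allocation.
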
5592Also see the WarpX docs: https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html
5693