We use lingua/stool.py to submit Slurm jobs. By revising the original code, we support unpacking the conda env onto each compute node when submitting a job.
NOTE: we can continue to use the torchrun --standalone command on a single node for quick debugging.
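For example, a single-node debug run without Slurm might look like the following; the module and config paths mirror the submission example below, and the config=... argument style is an assumption about how the train script parses its arguments:

```bash
# Single-node debug run on 8 local GPUs, bypassing Slurm entirely.
# Module path and config are taken from the submission example below;
# the config=... argument style is an assumption.
torchrun --standalone --nproc_per_node=8 \
    -m apps.main.train config=apps/main/configs/train_bucket_256_latent_code.yaml
```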
- Pack the conda environment into a .tar.gz file and put it on the shared file system. For instance,

```bash
pip install conda-pack
conda-pack -n pollux -o /jfs/shuming/code/env/pollux_env.tar.gz
```

In the above, the name of your conda env is pollux, and the packed env is saved on the jfs shared file system.
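Optionally, you can sanity-check the archive before submitting (same path as above):

```bash
# Quick sanity check: confirm the archive exists and list a few of its contents.
ls -lh /jfs/shuming/code/env/pollux_env.tar.gz
tar -tzf /jfs/shuming/code/env/pollux_env.tar.gz | head
```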
- Submit to Slurm through lingua.stool. The example config is train_bucket_256_latent_code.yaml, which uses 8 GPUs in total for training.
```bash
python -m lingua.stool script=apps.main.train config=apps/main/configs/train_bucket_256_latent_code.yaml \
    nodes=1 \
    ngpu=8 \
    ncpu=16 \
    mem=256G \
    partition=debug \
    time=72:00:00 \
    anaconda_zip=/jfs/shuming/code/env/pollux_env.tar.gz \
    anaconda=/tmp/shuming/
```
In the above, anaconda_zip is the path to the packed file from the packing step above, and anaconda is the path where it is unpacked on each compute node; we recommend a location under /tmp/ for faster local disk access (see the sketch below).
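Roughly speaking, each compute node ends up performing the standard conda-pack unpacking flow shown below. This is a sketch of the idea rather than the exact code in the revised stool.py, and the target directory name under /tmp/shuming/ is an assumption:

```bash
# Sketch of the per-node unpack step (target directory name is assumed).
mkdir -p /tmp/shuming/pollux_env
tar -xzf /jfs/shuming/code/env/pollux_env.tar.gz -C /tmp/shuming/pollux_env

# Activate the unpacked env and fix its hard-coded prefix paths,
# following the standard conda-pack workflow.
source /tmp/shuming/pollux_env/bin/activate
conda-unpack
```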
For other arguments, such as ngpu and ncpu, you can refer to the detailed explanation here.
Note that nodes and ngpu should be consistent with the settings in your yaml file.