Conversation
| Let's create `ami.yml` (see the [DLAMI release notes](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html) to get the AMI ARN (`ParentImage` in the config below)):
| ```
| Build:
|   SecurityGroupIds: [<insert your SG - it requires outbound traffic>]
It's automated with the pcluster CLI; we could add a CloudFormation template that sets up the VPC, subnet, and SG and runs the pcluster CLI via a bash runner (or, even better, triggers a Lambda to run it). I thought about creating a short bash script with templating for the config YAML file, but I don't think it's worth the effort since it doesn't abstract anything and adds boilerplate.
There's a 100% chance someone won't know how to find a security group, or even what one is. Provide the steps to create one and retrieve its ID, or provide a step to retrieve the security group ID.
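For example, the create-and-retrieve step could be sketched like this (a sketch, assuming the AWS CLI v2 is configured and a default VPC exists; the group name is a placeholder):

```shell
# Look up the default VPC, then create a security group in it.
# New security groups allow all outbound traffic by default, which is
# what pcluster build-image needs. The group name is a placeholder.
SG_NAME=pcluster-ami-build
VPC_ID=$(aws ec2 describe-vpcs --filters Name=isDefault,Values=true \
  --query 'Vpcs[0].VpcId' --output text)
SG_ID=$(aws ec2 create-security-group --group-name "$SG_NAME" \
  --description "pcluster build-image" --vpc-id "$VPC_ID" \
  --query 'GroupId' --output text)
# Paste the printed ID into the SecurityGroupIds field of ami.yml.
echo "SecurityGroupIds: [${SG_ID}]"
```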
| First let's fetch the assets required to build the image:
| ```bash
| wget https://ml.hpcworkshops.com/scripts/packer/packer.tar.gz
Explain the content of the archive before proposing to download it. For the sake of clarity, I suggest you make the reader download the 3 files separately. At least they can review the files on GitHub beforehand. That also gives a reviewer an opportunity to look at the content of the files that are part of this workshop.
@sean-smith you added this and I just moved it; any specific reason for this?
We don't need this anymore. This can just be:
git clone git@github.com:aws-samples/parallelcluster-efa-gpu-preflight-ami.git
| You can install Packer using [Brew](https://brew.sh/) on macOS or Linux as follows:
| ```bash
| brew install packer
Standardize on Cloud9 or CloudShell. If I have Windows, how do I do this?
Provide a specific version to prevent regressions in the future.
We'll standardize on Cloud9; CloudShell storage space is too limited. IMHO most ML DevOps engineers don't need instructions on how to use the CLI. This is different from HPC.
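On pinning a specific version: one way is to download the release zip directly rather than relying on Brew's latest. A sketch (the version number below is an example placeholder, not a recommendation):

```shell
# Pin Packer to an exact release by fetching the zip from HashiCorp's
# release server (version is a placeholder; pick the one you validated).
PACKER_VERSION=1.9.4
URL="https://releases.hashicorp.com/packer/${PACKER_VERSION}/packer_${PACKER_VERSION}_linux_amd64.zip"
# wget "$URL" && unzip packer_${PACKER_VERSION}_linux_amd64.zip
echo "$URL"
```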
| brew install packer
| ```
| Alternatively, you can download the Packer binary from the [tool website](https://www.packer.io/). Ensure your `PATH` is set to use the binary, or use its absolute path. Once Packer is installed, proceed to the next stage.
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
| Now run the [pcluster command](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.build-image-v3.html) that will add all the pcluster dependencies to your DLAMI of choice:
| ```
| pcluster build-image -c ami.yml -i NEW_AMI_ID -r REGION
Use shell variables here:
| pcluster build-image -c ami.yml -i NEW_AMI_ID -r REGION | |
| pcluster build-image -c ami.yml -i $NEW_AMI_ID -r $AWS_REGION |
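Those variables can be set once per session before running the command (values below are placeholders):

```shell
# Set once per shell session, then reuse in every pcluster call.
export NEW_AMI_ID=dlami-pcluster-gpu   # placeholder image name
export AWS_REGION=us-east-1            # placeholder region
# pcluster build-image -c ami.yml -i $NEW_AMI_ID -r $AWS_REGION
echo "building ${NEW_AMI_ID} in ${AWS_REGION}"
```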
| @@ -0,0 +1,10 @@
| ---
| title: "b. Download, compile and run the NCCL tests"
| title: "b. Download, compile and run the NCCL tests" | |
| title: "b. Run the NCCL tests" |
| ```bash
| cd ~
| cat > compile_nccl.sh << EOF
Use an absolute path:
| cat > compile_nccl.sh << EOF | |
| cat > ~/compile_nccl.sh << EOF |
| Create your job submission script for the *NCCL tests* and use **sbatch** to submit your job:
| ```bash
| cat > nccl_test.sbatch << \EOF
| cat > nccl_test.sbatch << \EOF | |
| cat > ~/nccl_test.sbatch << EOF |
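One thing to be aware of when picking between the two: quoting the heredoc delimiter (`\EOF`) suppresses variable expansion inside the document, while an unquoted `EOF` expands variables as the file is written. A minimal sketch:

```shell
# Unquoted delimiter: ${GREETING} is expanded while writing the file.
GREETING=hello
cat > /tmp/expanded.txt << EOF
${GREETING}
EOF

# Quoted delimiter: ${GREETING} is written literally, to be expanded
# later when the generated script runs.
cat > /tmp/literal.txt << \EOF
${GREETING}
EOF

cat /tmp/expanded.txt   # hello
cat /tmp/literal.txt    # ${GREETING}
```

So switching `\EOF` to `EOF` changes when the sbatch script's variables are resolved, not just style.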
| NCCL_TEST_PATH=${HOME}/nccl-tests/build
| MPI_PATH=/opt/amazon/openmpi
| export LD_LIBRARY_PATH=${HOME}/nccl/build/lib:${HOME}/aws-ofi-nccl/install/lib
| #SBATCH --output=nccl.out
| NCCL_TEST_PATH=${HOME}/nccl-tests/build
| MPI_PATH=/opt/amazon/openmpi
| git clone -b v2.17.1-1 https://github.com/NVIDIA/nccl.git
| cd nccl
| make -j src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE='-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80'
I hope the CUDA version doesn't change...
It's the default "system" CUDA; new AMIs will have a new CUDA at the same path. I can add a note that if they have a custom CUDA they should change this path. I assumed that if someone is advanced enough to add a specific CUDA version, they will be familiar with these parameters. But good point, I'll add the note.
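The note could sketch the override like this (the versioned path is a hypothetical example of a custom toolkit install):

```shell
# Default system CUDA lives at /usr/local/cuda on the DLAMI; override
# CUDA_HOME only if you installed a specific toolkit version yourself
# (example path below, adjust to your install).
CUDA_HOME=/usr/local/cuda-12.1
echo "make -j src.build CUDA_HOME=${CUDA_HOME} NVCC_GENCODE=..."
```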
| git clone -b v2.13.6 https://github.com/NVIDIA/nccl-tests.git
| cd nccl-tests
| make MPI=1 CUDA_HOME=/usr/local/cuda MPI_HOME=/opt/amazon/openmpi NCCL_HOME=${HOME}/nccl/build
It's using the Open MPI from the pcluster AMI; no need for Intel MPI to get correct performance.
`module load openmpi`. That's what I said.
| export NCCL_DEBUG=INFO
| export FI_LOG_LEVEL=1
| ${MPI_PATH}/bin/mpirun --map-by ppr:8:node --rank-by slot \
Once openmpi is loaded, there is no need for a path like this.
| ${MPI_PATH}/bin/mpirun --map-by ppr:8:node --rank-by slot \ | |
| ${MPI_PATH}/bin/mpirun --map-by ppr:4:socket \ |
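Once Open MPI is on `PATH` (e.g. after `module load openmpi` on the pcluster AMI), the absolute prefix can indeed be dropped. A sketch of the equivalent PATH setup:

```shell
# Prepend the pcluster AMI's Open MPI to PATH so plain `mpirun` resolves
# without the ${MPI_PATH}/bin/ prefix.
MPI_PATH=/opt/amazon/openmpi
export PATH="${MPI_PATH}/bin:${PATH}"
# mpirun --map-by ppr:4:socket ...   (no absolute prefix needed now)
echo "$PATH" | cut -d: -f1
```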
| EbsSettings:
|   VolumeType: gp3
|   Size: 200
|   Throughput: 300
| tags : ["Huggingface", "data", "ML", "srun", "slurm"]
| ---
| In this section, you will learn how to run a script from the Hugging Face examples with PyTorch FSDP and DDP.
What am I running exactly? A script? What does it do?
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.