feat(training/quickstart): add self-contained training demo #15
Conversation
I love the concept here. Is there much of a lift needed to make this work for GH200?

Could we make use of a public bucket instead?
* basic megatron container
* latest torch-extras base
* public NVIDIA/megatron-lm
* use data mocking
* auto-download tokenizer (built-in)
* wandb optional, automatic if ~/.netrc present
I'm not entirely sure where the line is between "the user literally just runs this and it'll work" and "the user is assumed to make some adjustments for this to work," but this script currently makes assumptions that won't apply generically to what I think of as a demo.
Comments enclosed. Cheers!
```
#SBATCH --ntasks-per-node 8
#SBATCH --gpus-per-node 8
```
This may not be true for all jobs with GPUs. Some have less than this allocatable.
Also, this is a personal preference thing, but I highly prefer the setup --ntasks-per-node=1 --gpus-per-task=8 whenever possible. One task running the sbatch script before fanning out in the training script lets you install things and modify any other filesystem state without running into race conditions.
The only change required for this to work is to simply wrap the entrypoint script in torchrun, which also has the added benefit of automatically assigning the correct identifying environment variables to each rank, like:
```
torchrun \
  --nproc_per_node="$SLURM_GPUS_PER_NODE" \
  --nnodes="$SLURM_NNODES" \
  --node_rank="$SLURM_NODEID" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  pretrain_gpt.py "${train_args[@]}"
```
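For the torchrun invocation above, MASTER_ADDR and MASTER_PORT still have to be derived somewhere in the sbatch script. A minimal sketch under the one-task-per-node pattern — the localhost fallback and the 29500 default port are assumptions for illustration, not from the PR:

```shell
# Derive the rendezvous endpoint for torchrun from the Slurm allocation.
# Falls back to localhost so the script still runs outside Slurm.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_NODELIST:-}" ]; then
  MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)"
else
  MASTER_ADDR=localhost
fi
MASTER_PORT="${MASTER_PORT:-29500}"
export MASTER_ADDR MASTER_PORT
echo "rendezvous at ${MASTER_ADDR}:${MASTER_PORT}"
```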
That's a great call-out and suggestion, thanks. I'm also not a huge fan of hardcoding this to 1 node. An alternative could be removing all three of those directives from the top entirely and instead modeling usage to the user with CLI args, i.e.
```
sbatch \
  --nodes=1 \
  --ntasks-per-node=1 \
  --gpus-per-task=8 \
  launch_training.sbatch
```

(note the options have to come before the script name, otherwise sbatch passes them through as script arguments)
thoughts?
```
export NCCL_SOCKET_IFNAME=eth0
export SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1
export NCCL_COLLNET_ENABLE=0
export NCCL_IB_HCA=ibp
export UCX_NET_DEVICES=ibp0:1,ibp1:1,ibp2:1,ibp3:1,ibp4:1,ibp5:1,ibp6:1,ibp7:1
```
These environment variables assume very specific things and as such aren't appropriate if this is meant to be trivially run by users. It assumes:
- An ethernet NIC for TCP/IP comms (this is probably a totally fine assumption)
- Tries to run SHARP if possible (as long as this doesn't error out if a node doesn't have SHARP this is also fine)
- Disables the NCCL CollNet plugin. This shouldn't be unconditionally disabled, and for what it's worth disabling it may not allow SHARP in the first place.
- Assumes the node running has IB and exactly 8 ports
If this is meant to be pretty well generalizable as a demo script we may as well just remove these environment variables entirely and let NCCL figure these out on its own.
Very fair, and also echoes Brad's suggestion that we keep GH200 in mind as well. Considering an environment where these settings would actually be appropriate/optimal, what would be the impact of removing them? Just a little extra setup time as NCCL probes the topology?
A simple path forward could be to just do a couple of different versions of this demo with settings pre-optimized for different topologies. A higher LOE path for the future could be a script that figures out what the environment is and chooses recommended env vars based on what it finds.
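That environment-detection idea could start as small as probing sysfs before pinning anything; a rough sketch (the sysfs path check and the choice to leave everything else to NCCL's auto-detection are assumptions, not tested on our nodes):

```shell
# Only pin NCCL to InfiniBand HCAs that actually exist on this node;
# if none are found, export nothing and let NCCL auto-detect the topology.
if [ -d /sys/class/infiniband ] && [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
  NCCL_IB_HCA="$(ls /sys/class/infiniband | paste -sd, -)"
  export NCCL_IB_HCA
fi
echo "NCCL_IB_HCA=${NCCL_IB_HCA:-<unset, NCCL auto-detects>}"
```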
```
# Use squashed container if available, otherwise download and save it.
export REMOTE_IMAGE_URI="ghcr.io#coreweave/ml-containers/megatron:dmarx-megatron-update-0c68584"
```
Last I checked, valid GitHub credentials need to be configured here via enroot to download this. Is this documented?
This container is public; does that still require GH creds?
In any event, I'll add that to the instructions. Got it written up for the megatron demo already anyway.
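If credentials do turn out to be needed, enroot reads them from a netrc-formatted `.credentials` file under `$ENROOT_CONFIG_PATH` (which defaults to `~/.config/enroot/`). A sketch of what the instructions might show — the token scope and the placeholder values are illustrative:

```
# $ENROOT_CONFIG_PATH/.credentials  (default: ~/.config/enroot/.credentials)
# netrc format; for ghcr.io a GitHub personal access token with read:packages
machine ghcr.io login <github-username> password <personal-access-token>
```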
```
export CUDA_DEVICE_MAX_CONNECTIONS=1

export WORLD_SIZE="${SLURM_NTASKS:?}"
export RANK="${SLURM_PROCID:?}"
export LOCAL_RANK="${SLURM_LOCALID:?}"
export CUDA_DEVICE_ORDER='PCI_BUS_ID'
```
These are unnecessary when using torchrun fwiw. If our container has this I strongly recommend we make this the preferred pattern. torchrun is automatically installed when installing torch, so we should be able to assume it's available.
```
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 1 \
```
Assumes the job's available VRAM can support 2 separate model replicas during training. Since we can't know the exact VRAM a generic job will have, setting the tensor parallelism to the world size is probably a good defensive option.
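A defensive default along those lines might look like this in the launcher — the variable names and the `:-8` fallback are illustrative, not from the PR:

```shell
# Default tensor parallelism to the full world size so the model is
# sharded across every GPU instead of assuming room for extra replicas.
WORLD_SIZE="${SLURM_NTASKS:-8}"
TP_SIZE="${TP_SIZE:-$WORLD_SIZE}"
train_args+=(
  --tensor-model-parallel-size "$TP_SIZE"
  --pipeline-model-parallel-size 1
)
echo "tensor parallel size: $TP_SIZE"
```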
```
#SBATCH --constraint gpu
#SBATCH --job-name test
#SBATCH --output test.%j
#SBATCH --export all
```
It's probably not a good idea to leak in all environment variables from the login node, and the job shouldn't need any of them.
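A safer default would be to export nothing and opt in explicitly if something is ever needed; a sketch (the WANDB_API_KEY opt-in is just an example):

```
#SBATCH --export NONE
## or, if the job genuinely needs something from the login shell:
# #SBATCH --export WANDB_API_KEY
```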
```
--num-layers 32 \
--hidden-size 4096 \
--seq-length 8192 \
--max-position-embeddings 8192 \
```
Again, this makes assumptions about the VRAM capacity of jobs, but you need something for these, so it's kind of unavoidable that there won't be a magic set of parameters that always works. Could maybe consider defensively reducing these by a factor of 2, but it's a nitpick.
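If we did want that defensive factor-of-2 knob, a sketch of what it could look like — `DEMO_SCALE` is an invented name, not an existing variable in the script:

```shell
# Optional size reduction: DEMO_SCALE=2 halves layer count, hidden
# size, and sequence length to cut memory use for smaller GPUs.
SCALE="${DEMO_SCALE:-1}"
NUM_LAYERS=$(( 32 / SCALE ))
HIDDEN_SIZE=$(( 4096 / SCALE ))
SEQ_LENGTH=$(( 8192 / SCALE ))
echo "layers=$NUM_LAYERS hidden=$HIDDEN_SIZE seq=$SEQ_LENGTH"
```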
This PR publishes an sbatch script which can be used to launch a "self-contained" training demo.

The main role of this demo is debugging and smoke testing. As such, it's sort of an odd fit for the reference-architecture repo because it actually illustrates an anti-pattern: shipping training data packaged into the container. This is useful for the expedient purpose this container is generally used for, but we might want to hold off on merging to main until after adding a reference example for training that doesn't demonstrate the anti-pattern.

Anti-pattern resolved: the demo now uses the megatron container from ml-containers rather than a special container prepackaged with a tokenizer and data. Also fixed it so wandb is optional, so there's no additional setup besides sunk.