Conversation
address review from previous PR
Credit co-authors for prior squash
Co-authored-by: ZixianWangAMD <zixiwang@amd.com>
Co-authored-by: Michal Marcinkiewicz <michalm@nvidia.com>
Co-authored-by: Lukasz Pierscieniewski <l.pierscieniewski@gmail.com>
disable async save and save intermediate checkpoint
…arget log perplexity to be 3.3 for consistency purposes
This reverts commit fed1bb4.
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
@mmarcinkiewicz can you please review this?
It seems the datadir needs to be writable (presumably to store the index). Can we put the index into a different dir so the datadir stays read-only?
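For illustration, a minimal sketch of that idea, not an existing Primus/Megatron option: DATA_CACHE_DIR and index_path_for are hypothetical names. The index files go into a separate writable cache directory keyed by the source path, so the data directory itself can stay read-only.

    # Sketch only: map each read-only data file to an index file in a
    # writable cache directory. DATA_CACHE_DIR is a hypothetical variable.
    import hashlib
    import os
    from pathlib import Path

    CACHE_DIR = Path(os.getenv("DATA_CACHE_DIR", "/tmp/data_cache"))

    def index_path_for(data_file: str) -> Path:
        """Return the index path for a data file inside the writable cache dir."""
        digest = hashlib.sha256(str(Path(data_file).resolve()).encode()).hexdigest()[:16]
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        return CACHE_DIR / f"{Path(data_file).name}.{digest}.idx"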
    fp8: null # Disabled - using bf16 instead

    # hyper parameters
    train_iters: ${PRIMUS_TRAIN_ITERS:20000}
we need to talk about that
Add option to run with SLURM
pbaumstarck left a comment:
Looking good overall, and I got the code running. One more minor comment: we don't keep any binary .whl files in the repo, so it would be ideal to retrieve and install that wheel dynamically.
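For example, something along these lines could fetch it at setup time (a sketch only; PRIMUS_WHEEL_URL is a placeholder for wherever the artifact would actually be hosted):

    # Sketch: install the wheel from a configurable location instead of
    # committing the binary. PRIMUS_WHEEL_URL is a hypothetical placeholder.
    import os
    import subprocess
    import sys

    wheel_url = os.environ.get("PRIMUS_WHEEL_URL", "")
    if not wheel_url:
        raise SystemExit("Set PRIMUS_WHEEL_URL to the wheel's download location")
    subprocess.check_call([sys.executable, "-m", "pip", "install", wheel_url])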
    rank = int(os.getenv("RANK", "0"))
    world_size = int(os.getenv("WORLD_SIZE", "1"))
    master_addr = os.getenv("MASTER_ADDR", "127.0.0.1")
    master_port = int(os.getenv("MASTER_PORT", "29500"))
This conflicts with the port being set to 29501 in the shell commands. Should these all be the same?
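One way to keep them consistent (just a sketch; the agreed default could equally be 29500): define the default in a single place and have both the shell launcher and the Python side read MASTER_PORT from the environment.

    # Sketch: a single source of truth for the rendezvous port. 29501 mirrors
    # the shell commands in this PR; keep it in sync with the launch script.
    import os

    DEFAULT_MASTER_PORT = 29501
    master_port = int(os.getenv("MASTER_PORT", str(DEFAULT_MASTER_PORT)))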
    # Report result
    result=$(( end - start ))
    result_name="GPT_OSS_20B"
    echo "RESULT,$result_name,,$result,AMD,$start_fmt"
The "AMD" string is hardcoded here, but this code is shared between vendors.
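As a sketch of one alternative (shown in Python for illustration, since the actual script is shell; SUBMITTER_NAME is a hypothetical environment variable, not something the shared scripts define today), the submitter field could be read from the environment instead:

    # Sketch: emit the same RESULT line with the submitter made configurable.
    import os
    import time

    submitter = os.getenv("SUBMITTER_NAME", "reference")
    start = time.time()
    start_fmt = time.strftime("%Y-%m-%d %I:%M:%S %p")
    # ... training run ...
    end = time.time()
    print(f"RESULT,GPT_OSS_20B,,{int(end - start)},{submitter},{start_fmt}")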
This PR provides the reference code for GPT-OSS-20B using the Primus framework, which can run on both AMD and NVIDIA hardware.