feat: Merge megatron checkpoints with lora adapters and convert to HF format #2173
Conversation
Signed-off-by: pengdurice <pengduhit@gmail.com>
Hi Peng, by the way, I got the following error when rerunning the code: `File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py", line 506, in _validate_global_shapes`
Thank you for your interest! Can you give me more details so that I can reproduce it on my side?
Great, and big thanks for your reply; this issue bothered me a lot. `checkpointing: loss_fn: policy: data: train: validation: null default: cluster:`
yuki-97
left a comment
@pengdurice nice work! thanks for adding this, lgtm.
/ok to test a423dd9
@hanguangmic,
Thank you! I fixed the two issues in CI/CD: the lint check, by adding the new convert file to pyrefly.toml, and the "PR out of date" error, by rebasing. The functional-test GPU failure happened during SFT training and should be unrelated to this PR. It'd be great to have another CI/CD run. Thanks!
Can you share the code you used to create the cached Megatron checkpoint? I don't think it's the training phase, or maybe there was a bug. Thanks so much again. Maybe my cached checkpoint is not right.
Simply importing the model from the HF name produces a cached Megatron checkpoint. Feel free to check your own checkpoint to see whether the missing key is there; I guess it may not be. This conversion gives you a checkpoint that is identical to the original HF model in weight values, but in Megatron's format, and it should have the key.
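To sanity-check that a converted checkpoint really is weight-identical to the original, one can compare the two state dicts directly. A minimal sketch, assuming both checkpoints can be loaded as plain name-to-array mappings (this is illustrative only, not the actual NeMo-RL verification code; numpy arrays stand in for torch tensors):

```python
import numpy as np

def state_dicts_match(sd_a, sd_b, atol=1e-6):
    """Return True if two name->array state dicts have the same keys
    and numerically matching weights."""
    if sd_a.keys() != sd_b.keys():
        return False
    return all(np.allclose(sd_a[k], sd_b[k], atol=atol) for k in sd_a)

# Toy example with two identical "checkpoints"
sd1 = {"w": np.ones((2, 2)), "b": np.zeros(2)}
sd2 = {"w": np.ones((2, 2)), "b": np.zeros(2)}
print(state_dicts_match(sd1, sd2))  # True
```

A key present in one dict but missing in the other (like the missing key discussed above) makes the check fail immediately, before any values are compared.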
Thanks so much, and may I know if it is possible to save only the LoRA adapters instead of merging anything?
/ok to test b206875
If I understand your question correctly, saving only the LoRA adapters is already supported: people can just save the trained LoRA adapters somewhere (that's one input of this PR). This PR merges the base model with the trained LoRA adapters and converts the result to HF format for downstream inference and evaluation.
@yuki-97 @terrykong, @yaoyu-33 lmk if you have additional questions! thank you! |
yuki-97
left a comment
@pengdurice sorry for late reply, lgtm, and thanks for the contribution again!
hi @hanguangmic @pengdurice, feel free to continue the discussion here or move it to #2190
@yuki-97, thank you! @hanguangmic, if you have any more questions, let's continue our discussion, either here or in issue #2190 ;-)
What does this PR do?
Adds a merge script that merges the Megatron base checkpoint with the LoRA adapter checkpoint and converts the result to HF format.
This provides a one-stop solution for training with LoRA on the Megatron backend and then merging the adapters into the base model while converting to HF for downstream inference / evaluation, eliminating the need to resort to external tools.
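For intuition, the merge step folds each LoRA adapter pair back into its corresponding base weight matrix. A minimal sketch of the standard LoRA merge math, assuming the usual `alpha / r` scaling (illustrative only, not the script's actual implementation; numpy stands in for torch tensors):

```python
import numpy as np

def merge_lora(base_w, lora_a, lora_b, alpha, r):
    """Fold one LoRA adapter pair into its base weight matrix.

    Standard LoRA update: W' = W + (alpha / r) * (B @ A),
    where A has shape (r, in_features) and B has shape (out_features, r).
    """
    return base_w + (alpha / r) * (lora_b @ lora_a)

# Toy shapes for illustration; a real checkpoint holds many such matrices
rng = np.random.default_rng(0)
base = rng.standard_normal((8, 16))
a = rng.standard_normal((4, 16))   # rank r = 4
b = rng.standard_normal((8, 4))
merged = merge_lora(base, a, b, alpha=8.0, r=4)
assert merged.shape == base.shape  # merged weight replaces the base in place
```

After every adapter pair is folded in this way, the adapters are no longer needed at inference time, which is why the merged model can be exported as a plain HF checkpoint.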
Usage
The script is run with a Python command: the base-ckpt argument is the cached Megatron-format checkpoint of the base model, and the adapter-ckpt argument is the saved LoRA adapter checkpoint.
Test
Unit tests
Tested by running it on a trained LoRA adapter for a GLM5 model; the merged and converted model was further tested with vLLM on one open benchmark.
Before your PR is "Ready for review"
Pre checks:
Issues
NA
Additional Information