
feat: Merge megatron checkpoints with lora adapters and convert to HF format#2173

Merged
yuki-97 merged 4 commits into NVIDIA-NeMo:main from pengdurice:peng-lora-merge-hf-example-v1 on Apr 8, 2026

Conversation

@pengdurice
Contributor

@pengdurice pengdurice commented Mar 30, 2026

What does this PR do ?

Adds a merge script that folds a LoRA adapter checkpoint into its Megatron base checkpoint and converts the result to HF format.
This provides a one-stop solution for training with LoRA on the Megatron backend and then merging the adapter into the base model while converting to HF for downstream inference/evaluation, eliminating the need for external tooling.
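For reference, the merge this kind of script performs is the standard LoRA fold-in: each adapted weight becomes W' = W + (alpha / r) * B @ A, where r is the LoRA rank. A minimal NumPy sketch of the arithmetic (the names and shapes here are illustrative, not the script's actual API):

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float, rank: int) -> np.ndarray:
    """Fold a LoRA adapter into its base weight.

    W: (out, in) base weight; A: (rank, in) down-projection;
    B: (out, rank) up-projection. Returns W + (alpha / rank) * B @ A.
    """
    return W + (alpha / rank) * (B @ A)
```

After the fold-in the adapter can be discarded: the merged weight produces the same outputs as running the base weight plus the adapter at inference time.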

Usage

The following Python command runs it. `--base-ckpt` is the cached Megatron-format checkpoint of the base model and `--adapter-ckpt` is the saved LoRA adapter checkpoint.

    uv run --extra mcore python examples/converters/convert_lora_to_hf.py \
        --base-ckpt ~/.cache/huggingface/nemo_rl/zai-org/GLM-5/iter_0000000 \
        --adapter-ckpt results/dpo_glm5/step_5/policy/weights/iter_0000000 \
        --hf-model-name zai-org/GLM-5 \
        --hf-ckpt-path ./merged_hf_model

Test

Unit tests
Tested by running it on a trained LoRA adapter for a GLM5 model; the merged and converted model was further validated with vLLM on one open benchmark.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Issues

NA

Additional Information

  • ...

Signed-off-by: pengdurice <pengduhit@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pengdurice pengdurice changed the title feat: Add merge script (merge the megatron checkpoint and lora checkpoint) and convert to HF format feat: Merge megatron checkpoints with lora adapters and convert to HF format in one command Mar 30, 2026
@pengdurice pengdurice changed the title feat: Merge megatron checkpoints with lora adapters and convert to HF format in one command feat: Merge megatron checkpoints with lora adapters and convert to HF format Mar 30, 2026
Signed-off-by: pengdurice <pengduhit@gmail.com>
@github-actions github-actions bot added the Documentation Improvements or additions to documentation label Apr 1, 2026
@pengdurice pengdurice marked this pull request as ready for review April 2, 2026 23:30
@pengdurice pengdurice requested review from a team as code owners April 2, 2026 23:30
@hanguangmic
Copy link
Copy Markdown

Hi Peng,
so impressed with your work. Do you know how to convert a dist-torch LoRA checkpoint into HF format without merging it into the base model?

@hanguangmic


By the way, I got the following error when rerunning the code:

    File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py", line 506, in _validate_global_shapes
        raise KeyError(
    KeyError: "decoder.layers.0.mixer.dt_bias from model not in state dict: ['decoder.layers.0.mixer.in_proj.adapter.linear_in._extra_state/shard_0_1', 'decoder.layers.0.mixer.in_proj.adapter.linear_in.weight', 'decoder.layers.0.mixer.in_proj.adapter.linear_out._extra_state/shard_0_1', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.B', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.C', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.dt', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.x', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.z', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_in._extra_state/shard_0_1', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_in.weight', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_out._extra_state/shard_0_1', 'decoderxxx]

@pengdurice
Contributor Author


Thank you for your interest! Can you give me more details so that I can reproduce this on my side?
Thanks!

@hanguangmic

hanguangmic commented Apr 3, 2026


Great, and big thanks for your reply; this issue bothered me a lot.
My docker image is https://hub.docker.com/r/hanguang02/hanguang02-nemo-rl/tags
My training config is below:

    defaults: grpo_math_1B.yaml
    grpo:
      max_num_steps: 900
      num_prompts_per_step: 16
      num_generations_per_prompt: 4
      val_period: 0
      val_at_start: false
      val_at_end: false
      async_grpo:
        enabled: true
        max_trajectory_age_steps: 1
        in_flight_weight_updates: false  # Enable for faster weight synchronization
        recompute_kv_cache_after_weight_updates: false  # Invalidates kv cache after in-flight-weight-updates

    checkpointing:
      enabled: true
      checkpoint_dir: "/mnt/workspace/NVMRC2026/grpo-nemotron-30B-lora-r32"
      metric_name: "train:reward"
      higher_is_better: false
      keep_top_k: 3
      save_optimizer: false
      #save_consolidated: false
      save_period: 4
      #v4_compatible: true

    loss_fn:
      use_importance_sampling_correction: true

    policy:
      model_name: /mnt/workspace/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
      tokenizer:
        name: /mnt/workspace/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
      train_global_batch_size: 16
      train_micro_batch_size: 1
      logprob_batch_size: 1
      max_total_sequence_length: 10240
      make_sequence_length_divisible_by: 4
      dtensor_cfg:
        enabled: false
      megatron_cfg:
        enabled: true
        bias_activation_fusion: false
        tensor_model_parallel_size: 4
        expert_model_parallel_size: 4
        sequence_parallel: true
        peft:
          enabled: true
          dim: 32
          alpha: 64
          #target_modules: ['q_proj', 'k_proj', 'v_proj', 'up_proj', 'down_proj']
          exclude_modules: ['out_proj']
      sequence_packing:
        enabled: false
      generation:
        vllm_cfg:
          async_engine: true
          precision: "bfloat16"
          max_model_len: 10240
          max_tokens: 7680
          gpu_memory_utilization: 0.8
          tensor_parallel_size: 1
          enforce_eager: false
          use_deep_gemm: true
          vllm_kwargs:
            max_num_seqs: 128
            max_num_batched_tokens: 4096
            enable_chunked_prefill: true
        colocated:
          enabled: false
          resources:
            gpus_per_node: 8
            num_nodes: 1

    data:
      max_input_seq_length: 1000
      shuffle: true
      num_workers: 8
      use_multiple_dataloader: false
      train:
        dataset_name: "ResponseDataset"
        data_path: "/mnt/workspace/NVMRC2026/data/train_0326.jsonl"
        input_key: "input"
        output_key: "output"
        split_validation_size: 0.0
      validation: null
      default:
        prompt_file: null
        system_prompt_file: null
        processor: "math_hf_data_processor"
        env_name: "math"

    cluster:
      gpus_per_node: 8
      num_nodes: 3
My LoRA is here:
https://drive.google.com/file/d/1zG1UPPWeatrKiC1tqQPBKVQCcW1nmjFQ/view?usp=sharing

yuki-97
yuki-97 previously approved these changes Apr 3, 2026
Contributor

@yuki-97 yuki-97 left a comment


@pengdurice nice work! thanks for adding this, lgtm.

@yuki-97 yuki-97 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Apr 3, 2026
@yuki-97
Contributor

yuki-97 commented Apr 3, 2026

/ok to test a423dd9

@pengdurice
Contributor Author

pengdurice commented Apr 3, 2026


@hanguangmic ,
thank you for sharing the details! I locally created a cached Megatron checkpoint, downloaded your LoRA checkpoint, and used this PR to merge the two, and everything works fine. The error you mentioned does not occur on my side and all keys match. I wonder whether the error you see is due to an issue with checkpointing during training. My test concludes that given a valid base model checkpoint and a valid LoRA checkpoint, the merge and HF conversion work. I would encourage you to start a LoRA run, confirm the checkpoint saves only the LoRA adapters (which should be the case), find the original cached Megatron checkpoint, check whether the missing key is there, and debug from there. This should work with the cached base checkpoint (including all keys) and a saved LoRA checkpoint with all corresponding adapters. Thank you! Happy to chat more!

@pengdurice
Contributor Author

pengdurice commented Apr 3, 2026

@pengdurice nice work! thanks for adding this, lgtm.

Thank you! I fixed the two CI/CD issues: the lint check (by adding the new convert file to pyrefly.toml) and the out-of-date PR error (by rebasing). The functional GPU test failure happened on SFT training and should be unrelated to this PR. It'd be great to have another CI/CD run. Thanks!

@hanguangmic



Can you share the code you used to create the cached Megatron checkpoint? I don't think the issue was in the training phase, or that there was a bug there. Thanks so much again. Maybe my cached checkpoint is not right.

@pengdurice
Contributor Author

pengdurice commented Apr 4, 2026


import gc
import os

HF_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
BASE_CKPT_DIR = "/your/path/RL/results/nemotron_nano30b_megatron_base"

def create_megatron_base_checkpoint():
    """Convert the HF model to Megatron format using community import."""
    base_iter_dir = os.path.join(BASE_CKPT_DIR, "iter_0000000")
    if os.path.exists(base_iter_dir):
        print(f"Base Megatron checkpoint already exists at {base_iter_dir}, skipping.")
        return base_iter_dir
    print(f"Converting {HF_MODEL_NAME} to Megatron format...")
    from megatron.bridge.training.model_load_save import temporary_distributed_context
    from nemo_rl.models.megatron.community_import import import_model_from_hf_name
    with temporary_distributed_context(backend="gloo"):
        import_model_from_hf_name(HF_MODEL_NAME, BASE_CKPT_DIR)
    gc.collect()
    print(f"Base Megatron checkpoint saved to: {base_iter_dir}")
    return base_iter_dir

Simply importing the model from its HF name gives you a cached Megatron checkpoint. Feel free to check your own checkpoint to see if the missing key is there; I suspect it is not. This conversion produces a checkpoint identical to the original HF model in weight values, but in Megatron's format, and it should contain the key.

@hanguangmic


thanks so much, and may i know if it is possible to save lora only instead of merge anything?

@yuki-97
Contributor

yuki-97 commented Apr 4, 2026

/ok to test b206875

@pengdurice
Contributor Author

Hi Peng, so impressed with your work! I don't know how to convert a dist-torch lora checkpoint into HF format without merging it into the base model. Is that possible?

By the way, I got the following error when rerunning the code:

    File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py", line 506, in _validate_global_shapes
        raise KeyError(
    KeyError: "decoder.layers.0.mixer.dt_bias from model not in state dict: ['decoder.layers.0.mixer.in_proj.adapter.linear_in._extra_state/shard_0_1', 'decoder.layers.0.mixer.in_proj.adapter.linear_in.weight', 'decoder.layers.0.mixer.in_proj.adapter.linear_out._extra_state/shard_0_1', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.B', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.C', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.dt', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.x', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.z', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_in._extra_state/shard_0_1', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_in.weight', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_out._extra_state/shard_0_1', 'decoderxxx]
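Note that the state dict listed in this KeyError contains only `adapter.*` entries, which is what a LoRA-only checkpoint looks like, while the loader expects full base-model keys such as `decoder.layers.0.mixer.dt_bias`. A minimal sketch of that distinction (the helper name is hypothetical, not part of this PR):

```python
# Hypothetical helper (not part of this PR): classify checkpoint keys to
# tell a LoRA-only save apart from a full base-model save.
def classify_checkpoint(keys):
    adapter = {k for k in keys if ".adapter." in k}
    extra = {k for k in keys if "_extra_state" in k}
    base = set(keys) - adapter - extra
    if adapter and not base:
        return "adapter-only"  # expected when PEFT saves just the LoRA weights
    if base and not adapter:
        return "base-only"     # e.g. the cached Megatron import of the HF model
    return "merged/full"

# The keys from the error above are all adapter weights:
keys = [
    "decoder.layers.0.mixer.in_proj.adapter.linear_in.weight",
    "decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.B",
    "decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_in.weight",
]
print(classify_checkpoint(keys))  # → adapter-only
```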

Thank you for your interest! Can you give me more details so that I can reproduce this on my side? Thanks!

Great, and big thanks for your reply; this issue has bothered me a lot. My docker image is https://hub.docker.com/r/hanguang02/hanguang02-nemo-rl/tags and my training config is as below:

    defaults: grpo_math_1B.yaml
    grpo:
      max_num_steps: 900
      num_prompts_per_step: 16
      num_generations_per_prompt: 4
      val_period: 0
      val_at_start: false
      val_at_end: false
      async_grpo:
        enabled: true
        max_trajectory_age_steps: 1
        in_flight_weight_updates: false  # Enable for faster weight synchronization
        recompute_kv_cache_after_weight_updates: false  # Invalidates kv cache after in-flight-weight-updates
    checkpointing:
      enabled: true
      checkpoint_dir: "/mnt/workspace/NVMRC2026/grpo-nemotron-30B-lora-r32"
      metric_name: "train:reward"
      higher_is_better: false
      keep_top_k: 3
      save_optimizer: false
      #save_consolidated: false
      save_period: 4
      #v4_compatible: true
    loss_fn:
      use_importance_sampling_correction: true
    policy:
      model_name: /mnt/workspace/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
      tokenizer:
        name: /mnt/workspace/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
      train_global_batch_size: 16
      train_micro_batch_size: 1
      logprob_batch_size: 1
      max_total_sequence_length: 10240
      make_sequence_length_divisible_by: 4
      dtensor_cfg:
        enabled: false
      megatron_cfg:
        enabled: true
        bias_activation_fusion: false
        tensor_model_parallel_size: 4
        expert_model_parallel_size: 4
        sequence_parallel: true
      peft:
        enabled: true
        dim: 32
        alpha: 64
        #target_modules: ['q_proj', 'k_proj', 'v_proj', 'up_proj', 'down_proj']
        exclude_modules: ['_out_proj_']
      sequence_packing:
        enabled: false
      generation:
        vllm_cfg:
          async_engine: true
          precision: "bfloat16"
          max_model_len: 10240
          max_tokens: 7680
          gpu_memory_utilization: 0.8
          tensor_parallel_size: 1
          enforce_eager: false
          use_deep_gemm: true
          vllm_kwargs:
            max_num_seqs: 128
            max_num_batched_tokens: 4096
            enable_chunked_prefill: true
        colocated:
          enabled: false
          resources:
            gpus_per_node: 8
            num_nodes: 1
    data:
      max_input_seq_length: 1000
      shuffle: true
      num_workers: 8
      use_multiple_dataloader: false
      train:
        dataset_name: "ResponseDataset"
        data_path: "/mnt/workspace/NVMRC2026/data/train_0326.jsonl"
        input_key: "input"
        output_key: "output"
        split_validation_size: 0.0
      validation: null
      default:
        prompt_file: null
        system_prompt_file: null
      processor: "math_hf_data_processor"
      env_name: "math"
    cluster:
      gpus_per_node: 8
      num_nodes: 3

My lora is here: https://drive.google.com/file/d/1zG1UPPWeatrKiC1tqQPBKVQCcW1nmjFQ/view?usp=sharing

@hanguangmic , thank you for sharing the details! I locally created a cached Megatron checkpoint, downloaded your lora checkpoint, and used this PR to merge the two, and everything works fine: the error you mentioned above does not occur and all keys matched. I wonder if the error you see means something went wrong with checkpointing during your training. My test concludes that, given a valid base model checkpoint and a valid lora checkpoint, the merge and conversion to HF works. I would encourage you to start a run with lora, have checkpointing save only the lora adapters (which should be the case), find the original cached Megatron checkpoint, check whether the missing key is there, and debug from there. It should work with the cached base checkpoint (which includes all keys) plus a saved lora checkpoint containing all the corresponding lora adapters. Thank you! Happy to chat more!

Can you share the code you used to create the cached Megatron checkpoint? I don't think the training phase had a bug. Thanks so much again; maybe my cached checkpoint is not right.

import gc
import os

HF_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
BASE_CKPT_DIR = "/your/path/RL/results/nemotron_nano30b_megatron_base"
def create_megatron_base_checkpoint():
    """Convert HF model to Megatron format using community import."""
    base_iter_dir = os.path.join(BASE_CKPT_DIR, "iter_0000000")
    if os.path.exists(base_iter_dir):
        print(f"Base Megatron checkpoint already exists at {base_iter_dir}, skipping.")
        return base_iter_dir
    print(f"Converting {HF_MODEL_NAME} to Megatron format...")
    from megatron.bridge.training.model_load_save import temporary_distributed_context
    from nemo_rl.models.megatron.community_import import import_model_from_hf_name
    with temporary_distributed_context(backend="gloo"):
        import_model_from_hf_name(HF_MODEL_NAME, BASE_CKPT_DIR)
    gc.collect()
    print(f"Base Megatron checkpoint saved to: {base_iter_dir}")
    return base_iter_dir

Simply importing the model from its HF name gives you a cached Megatron checkpoint. Feel free to check your own checkpoint to see whether the missing key is there; my guess is that it is not. This conversion produces a checkpoint whose weight values are identical to the original HF model but in Megatron's format, and it should contain the key.

Thanks so much! And may I know whether it is possible to save only the lora adapters instead of merging anything?

If I understand your question correctly, saving only the lora adapters is already supported: people can simply save the trained lora adapters somewhere (that is one input of this PR). This PR aims at merging the base model with the trained lora adapters and converting the result to HF for downstream inference and evaluation.
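Where the merge itself is concerned, it is just folding the adapter product back into the base weight under the standard LoRA convention, W' = W + (alpha / r) * B A. A minimal numpy sketch (shapes and names are illustrative, not this PR's API):

```python
import numpy as np

def merge_lora(base_weight, lora_a, lora_b, alpha, r):
    """Fold a LoRA adapter into the base weight: W' = W + (alpha / r) * B @ A.

    Standard LoRA shapes (illustrative, not this PR's API):
    base_weight (out, in), lora_a (r, in), lora_b (out, r).
    """
    return base_weight + (alpha / r) * (lora_b @ lora_a)

# After merging, the adapter path can be dropped: one fused matmul gives the
# same output as base + scaled adapter.
rng = np.random.default_rng(0)
out_dim, in_dim, rank, alpha = 8, 16, 4, 8.0
w = rng.standard_normal((out_dim, in_dim))
a = rng.standard_normal((rank, in_dim))
b = rng.standard_normal((out_dim, rank))
x = rng.standard_normal(in_dim)

y_adapter = w @ x + (alpha / rank) * (b @ (a @ x))  # base + adapter path
y_merged = merge_lora(w, a, b, alpha, rank) @ x     # merged weight only
assert np.allclose(y_adapter, y_merged)
```

With dim: 32 and alpha: 64 as in the config above, the effective scale alpha / r is 2.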

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 6, 2026
@pengdurice
Copy link
Copy Markdown
Contributor Author

@pengdurice nice work! thanks for adding this, lgtm.

Thank you! I fixed the two CI/CD issues: the lint check, by adding the new converter file to pyrefly.toml, and the out-of-date PR error, by rebasing. The functional GPU test failure happened during SFT training and should be unrelated to this PR. It'd be great to have another CI/CD run. Thanks!

@yuki-97 @terrykong @yaoyu-33 let me know if you have additional questions! Thank you!

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Apr 8, 2026
Contributor

@yuki-97 yuki-97 left a comment


@pengdurice sorry for late reply, lgtm, and thanks for the contribution again!

@yuki-97 yuki-97 merged commit c9277a3 into NVIDIA-NeMo:main Apr 8, 2026
27 of 28 checks passed
@yuki-97
Contributor

yuki-97 commented Apr 8, 2026

hi @hanguangmic @pengdurice , feel free to continue discuss here or move to #2190

@pengdurice
Contributor Author

hi @hanguangmic @pengdurice , feel free to continue discuss here or move to #2190

@yuki-97 , thank you! @hanguangmic , if you have any more questions, let's continue our discussions, either here or in the issue #2190 ;-)


Labels

CI:Lfast: Runs a fast test suite and re-uses the nightly `main` container (but syncs dependencies to the PR's version)
community-request
Documentation: Improvements or additions to documentation
