
feat: Merge megatron checkpoints with lora adapters and convert to HF format#2173

Merged
yuki-97 merged 4 commits into NVIDIA-NeMo:main from pengdurice:peng-lora-merge-hf-example-v1 on Apr 8, 2026

Conversation

@pengdurice
Contributor

@pengdurice pengdurice commented Mar 30, 2026

What does this PR do ?

Adds a merge script that folds a LoRA adapter checkpoint into its Megatron base checkpoint and converts the result to HF format.
This provides a one-stop solution for training with LoRA on the Megatron backend and then merging the adapter into the base model while converting to HF for downstream inference/evaluation, eliminating the need for external tooling.
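For reference, the merge this kind of script performs is the standard LoRA fold-in: each adapted weight becomes W' = W + (alpha / r) * B @ A, where r is the LoRA rank. A minimal NumPy sketch of the arithmetic (the names and shapes here are illustrative, not the script's actual API):

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float, rank: int) -> np.ndarray:
    """Fold a LoRA adapter into its base weight.

    W: (out, in) base weight; A: (rank, in) down-projection;
    B: (out, rank) up-projection. Returns W + (alpha / rank) * B @ A.
    """
    return W + (alpha / rank) * (B @ A)
```

After the fold-in the adapter can be discarded: the merged weight produces the same outputs as running the base weight plus the adapter at inference time.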

Usage

The following Python command runs it. `--base-ckpt` is the cached Megatron-format checkpoint of the base model and `--adapter-ckpt` is the saved LoRA adapter checkpoint.

    uv run --extra mcore python examples/converters/convert_lora_to_hf.py \
        --base-ckpt ~/.cache/huggingface/nemo_rl/zai-org/GLM-5/iter_0000000 \
        --adapter-ckpt results/dpo_glm5/step_5/policy/weights/iter_0000000 \
        --hf-model-name zai-org/GLM-5 \
        --hf-ckpt-path ./merged_hf_model

Test

Unit tests
Tested by running it on a trained LoRA adapter for a GLM5 model; the merged and converted model was further validated with vLLM on one open benchmark.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Issues

NA

Additional Information

  • ...

Signed-off-by: pengdurice <pengduhit@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pengdurice pengdurice changed the title feat: Add merge script (merge the megatron checkpoint and lora checkpoint) and convert to HF format feat: Merge megatron checkpoints with lora adapters and convert to HF format in one command Mar 30, 2026
@pengdurice pengdurice changed the title feat: Merge megatron checkpoints with lora adapters and convert to HF format in one command feat: Merge megatron checkpoints with lora adapters and convert to HF format Mar 30, 2026
Signed-off-by: pengdurice <pengduhit@gmail.com>
@github-actions github-actions bot added the Documentation Improvements or additions to documentation label Apr 1, 2026
@pengdurice pengdurice marked this pull request as ready for review April 2, 2026 23:30
@pengdurice pengdurice requested review from a team as code owners April 2, 2026 23:30
@hanguangmic
Copy link
Copy Markdown

Hi Peng,
so impressed with your work. Do you know how to convert a dist-torch LoRA checkpoint into HF format without merging it into the base model?

@hanguangmic


By the way, I got the following error when rerunning the code:

    File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py", line 506, in _validate_global_shapes
        raise KeyError(
    KeyError: "decoder.layers.0.mixer.dt_bias from model not in state dict: ['decoder.layers.0.mixer.in_proj.adapter.linear_in._extra_state/shard_0_1', 'decoder.layers.0.mixer.in_proj.adapter.linear_in.weight', 'decoder.layers.0.mixer.in_proj.adapter.linear_out._extra_state/shard_0_1', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.B', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.C', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.dt', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.x', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.z', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_in._extra_state/shard_0_1', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_in.weight', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_out._extra_state/shard_0_1', 'decoderxxx]

@pengdurice
Contributor Author


Thank you for your interest! Can you give me more details so that I can reproduce this on my side?
Thanks!

@hanguangmic

hanguangmic commented Apr 3, 2026


Great, and big thanks for your reply; this issue bothered me a lot.
My docker image is https://hub.docker.com/r/hanguang02/hanguang02-nemo-rl/tags
My training config is below:

    defaults: grpo_math_1B.yaml
    grpo:
      max_num_steps: 900
      num_prompts_per_step: 16
      num_generations_per_prompt: 4
      val_period: 0
      val_at_start: false
      val_at_end: false
      async_grpo:
        enabled: true
        max_trajectory_age_steps: 1
        in_flight_weight_updates: false  # Enable for faster weight synchronization
        recompute_kv_cache_after_weight_updates: false  # Invalidates kv cache after in-flight-weight-updates

    checkpointing:
      enabled: true
      checkpoint_dir: "/mnt/workspace/NVMRC2026/grpo-nemotron-30B-lora-r32"
      metric_name: "train:reward"
      higher_is_better: false
      keep_top_k: 3
      save_optimizer: false
      #save_consolidated: false
      save_period: 4
      #v4_compatible: true

    loss_fn:
      use_importance_sampling_correction: true

    policy:
      model_name: /mnt/workspace/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
      tokenizer:
        name: /mnt/workspace/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
      train_global_batch_size: 16
      train_micro_batch_size: 1
      logprob_batch_size: 1
      max_total_sequence_length: 10240
      make_sequence_length_divisible_by: 4
      dtensor_cfg:
        enabled: false
      megatron_cfg:
        enabled: true
        bias_activation_fusion: false
        tensor_model_parallel_size: 4
        expert_model_parallel_size: 4
        sequence_parallel: true
        peft:
          enabled: true
          dim: 32
          alpha: 64
          #target_modules: ['q_proj', 'k_proj', 'v_proj', 'up_proj', 'down_proj']
          exclude_modules: ['out_proj']
      sequence_packing:
        enabled: false
      generation:
        vllm_cfg:
          async_engine: true
          precision: "bfloat16"
          max_model_len: 10240
          max_tokens: 7680
          gpu_memory_utilization: 0.8
          tensor_parallel_size: 1
          enforce_eager: false
          use_deep_gemm: true
          vllm_kwargs:
            max_num_seqs: 128
            max_num_batched_tokens: 4096
            enable_chunked_prefill: true
        colocated:
          enabled: false
          resources:
            gpus_per_node: 8
            num_nodes: 1

    data:
      max_input_seq_length: 1000
      shuffle: true
      num_workers: 8
      use_multiple_dataloader: false
      train:
        dataset_name: "ResponseDataset"
        data_path: "/mnt/workspace/NVMRC2026/data/train_0326.jsonl"
        input_key: "input"
        output_key: "output"
        split_validation_size: 0.0
      validation: null
      default:
        prompt_file: null
        system_prompt_file: null
        processor: "math_hf_data_processor"
        env_name: "math"

    cluster:
      gpus_per_node: 8
      num_nodes: 3
My LoRA is here:
https://drive.google.com/file/d/1zG1UPPWeatrKiC1tqQPBKVQCcW1nmjFQ/view?usp=sharing

yuki-97
yuki-97 previously approved these changes Apr 3, 2026
Contributor

@yuki-97 yuki-97 left a comment


@pengdurice nice work! thanks for adding this, lgtm.

@yuki-97 yuki-97 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Apr 3, 2026
@yuki-97
Contributor

yuki-97 commented Apr 3, 2026

/ok to test a423dd9

@pengdurice
Contributor Author

pengdurice commented Apr 3, 2026


@hanguangmic ,
thank you for sharing the details! I locally created a cached Megatron checkpoint, downloaded your LoRA checkpoint, and used this PR to merge the two, and everything works fine. The error you mentioned does not occur on my side and all keys match. I wonder whether the error you see is due to an issue with checkpointing during training. My test concludes that given a valid base model checkpoint and a valid LoRA checkpoint, the merge and HF conversion work. I would encourage you to start a LoRA run, confirm the checkpoint saves only the LoRA adapters (which should be the case), find the original cached Megatron checkpoint, check whether the missing key is there, and debug from there. This should work with the cached base checkpoint (including all keys) and a saved LoRA checkpoint with all corresponding adapters. Thank you! Happy to chat more!

@pengdurice
Contributor Author

pengdurice commented Apr 3, 2026

@pengdurice nice work! thanks for adding this, lgtm.

Thank you! I fixed the two CI/CD issues: the lint check (by adding the new convert file to pyrefly.toml) and the out-of-date PR error (by rebasing). The functional GPU test failure happened on SFT training and should be unrelated to this PR. It'd be great to have another CI/CD run. Thanks!

@hanguangmic



Can you share the code you used to create the cached Megatron checkpoint? I don't think the issue was in the training phase, or that there was a bug there. Thanks so much again. Maybe my cached checkpoint is not right.

@pengdurice
Contributor Author

pengdurice commented Apr 4, 2026


import gc
import os

HF_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
BASE_CKPT_DIR = "/your/path/RL/results/nemotron_nano30b_megatron_base"

def create_megatron_base_checkpoint():
    """Convert the HF model to Megatron format using community import."""
    base_iter_dir = os.path.join(BASE_CKPT_DIR, "iter_0000000")
    if os.path.exists(base_iter_dir):
        print(f"Base Megatron checkpoint already exists at {base_iter_dir}, skipping.")
        return base_iter_dir
    print(f"Converting {HF_MODEL_NAME} to Megatron format...")
    from megatron.bridge.training.model_load_save import temporary_distributed_context
    from nemo_rl.models.megatron.community_import import import_model_from_hf_name
    with temporary_distributed_context(backend="gloo"):
        import_model_from_hf_name(HF_MODEL_NAME, BASE_CKPT_DIR)
    gc.collect()
    print(f"Base Megatron checkpoint saved to: {base_iter_dir}")
    return base_iter_dir

Simply importing the model from its HF name gives you a cached Megatron checkpoint. Feel free to check your own checkpoint to see if the missing key is there; I suspect it is not. This conversion produces a checkpoint identical to the original HF model in weight values, but in Megatron's format, and it should contain the key.

@hanguangmic


thanks so much, and may i know if it is possible to save lora only instead of merge anything?

@yuki-97
Contributor

yuki-97 commented Apr 4, 2026

/ok to test b206875

@pengdurice
Contributor Author

Hi Peng, so impressed with your work! I don't know how to convert a dist-torch lora checkpoint into HF format without merging it into the base model. Is that possible?

By the way, I got the following error when rerunning the code:

    File "/opt/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py", line 506, in _validate_global_shapes
        raise KeyError(
    KeyError: "decoder.layers.0.mixer.dt_bias from model not in state dict: ['decoder.layers.0.mixer.in_proj.adapter.linear_in._extra_state/shard_0_1', 'decoder.layers.0.mixer.in_proj.adapter.linear_in.weight', 'decoder.layers.0.mixer.in_proj.adapter.linear_out._extra_state/shard_0_1', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.B', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.C', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.dt', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.x', 'decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.z', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_in._extra_state/shard_0_1', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_in.weight', 'decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_out._extra_state/shard_0_1', 'decoderxxx]
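Note that the state dict listed in this KeyError contains only `adapter.*` entries, which is what a LoRA-only checkpoint looks like, while the loader expects full base-model keys such as `decoder.layers.0.mixer.dt_bias`. A minimal sketch of that distinction (the helper name is hypothetical, not part of this PR):

```python
# Hypothetical helper (not part of this PR): classify checkpoint keys to
# tell a LoRA-only save apart from a full base-model save.
def classify_checkpoint(keys):
    adapter = {k for k in keys if ".adapter." in k}
    extra = {k for k in keys if "_extra_state" in k}
    base = set(keys) - adapter - extra
    if adapter and not base:
        return "adapter-only"  # expected when PEFT saves just the LoRA weights
    if base and not adapter:
        return "base-only"     # e.g. the cached Megatron import of the HF model
    return "merged/full"

# The keys from the error above are all adapter weights:
keys = [
    "decoder.layers.0.mixer.in_proj.adapter.linear_in.weight",
    "decoder.layers.0.mixer.in_proj.adapter.linear_out.weight.B",
    "decoder.layers.1.mlp.shared_experts.linear_fc1.adapter.linear_in.weight",
]
print(classify_checkpoint(keys))  # → adapter-only
```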

Thank you for your interest! Can you give me more details so that I can reproduce this on my side? Thanks!

Great, and big thanks for your reply; this issue has bothered me a lot. My docker image is https://hub.docker.com/r/hanguang02/hanguang02-nemo-rl/tags and my training config is as below:

    defaults: grpo_math_1B.yaml
    grpo:
      max_num_steps: 900
      num_prompts_per_step: 16
      num_generations_per_prompt: 4
      val_period: 0
      val_at_start: false
      val_at_end: false
      async_grpo:
        enabled: true
        max_trajectory_age_steps: 1
        in_flight_weight_updates: false  # Enable for faster weight synchronization
        recompute_kv_cache_after_weight_updates: false  # Invalidates kv cache after in-flight-weight-updates
    checkpointing:
      enabled: true
      checkpoint_dir: "/mnt/workspace/NVMRC2026/grpo-nemotron-30B-lora-r32"
      metric_name: "train:reward"
      higher_is_better: false
      keep_top_k: 3
      save_optimizer: false
      #save_consolidated: false
      save_period: 4
      #v4_compatible: true
    loss_fn:
      use_importance_sampling_correction: true
    policy:
      model_name: /mnt/workspace/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
      tokenizer:
        name: /mnt/workspace/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
      train_global_batch_size: 16
      train_micro_batch_size: 1
      logprob_batch_size: 1
      max_total_sequence_length: 10240
      make_sequence_length_divisible_by: 4
      dtensor_cfg:
        enabled: false
      megatron_cfg:
        enabled: true
        bias_activation_fusion: false
        tensor_model_parallel_size: 4
        expert_model_parallel_size: 4
        sequence_parallel: true
      peft:
        enabled: true
        dim: 32
        alpha: 64
        #target_modules: ['q_proj', 'k_proj', 'v_proj', 'up_proj', 'down_proj']
        exclude_modules: ['_out_proj_']
      sequence_packing:
        enabled: false
      generation:
        vllm_cfg:
          async_engine: true
          precision: "bfloat16"
          max_model_len: 10240
          max_tokens: 7680
          gpu_memory_utilization: 0.8
          tensor_parallel_size: 1
          enforce_eager: false
          use_deep_gemm: true
          vllm_kwargs:
            max_num_seqs: 128
            max_num_batched_tokens: 4096
            enable_chunked_prefill: true
        colocated:
          enabled: false
          resources:
            gpus_per_node: 8
            num_nodes: 1
    data:
      max_input_seq_length: 1000
      shuffle: true
      num_workers: 8
      use_multiple_dataloader: false
      train:
        dataset_name: "ResponseDataset"
        data_path: "/mnt/workspace/NVMRC2026/data/train_0326.jsonl"
        input_key: "input"
        output_key: "output"
        split_validation_size: 0.0
      validation: null
      default:
        prompt_file: null
        system_prompt_file: null
      processor: "math_hf_data_processor"
      env_name: "math"
    cluster:
      gpus_per_node: 8
      num_nodes: 3

My lora is here: https://drive.google.com/file/d/1zG1UPPWeatrKiC1tqQPBKVQCcW1nmjFQ/view?usp=sharing

@hanguangmic , thank you for sharing the details! I locally created a cached Megatron checkpoint, downloaded your lora checkpoint, and used this PR to merge the two, and everything works fine: the error you mentioned above does not occur and all keys matched. I wonder if the error you see means something went wrong with checkpointing during your training. My test concludes that, given a valid base model checkpoint and a valid lora checkpoint, the merge and conversion to HF works. I would encourage you to start a run with lora, have checkpointing save only the lora adapters (which should be the case), find the original cached Megatron checkpoint, check whether the missing key is there, and debug from there. It should work with the cached base checkpoint (which includes all keys) plus a saved lora checkpoint containing all the corresponding lora adapters. Thank you! Happy to chat more!

Can you share the code you used to create the cached Megatron checkpoint? I don't think the training phase had a bug. Thanks so much again; maybe my cached checkpoint is not right.

import gc
import os

HF_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
BASE_CKPT_DIR = "/your/path/RL/results/nemotron_nano30b_megatron_base"
def create_megatron_base_checkpoint():
    """Convert HF model to Megatron format using community import."""
    base_iter_dir = os.path.join(BASE_CKPT_DIR, "iter_0000000")
    if os.path.exists(base_iter_dir):
        print(f"Base Megatron checkpoint already exists at {base_iter_dir}, skipping.")
        return base_iter_dir
    print(f"Converting {HF_MODEL_NAME} to Megatron format...")
    from megatron.bridge.training.model_load_save import temporary_distributed_context
    from nemo_rl.models.megatron.community_import import import_model_from_hf_name
    with temporary_distributed_context(backend="gloo"):
        import_model_from_hf_name(HF_MODEL_NAME, BASE_CKPT_DIR)
    gc.collect()
    print(f"Base Megatron checkpoint saved to: {base_iter_dir}")
    return base_iter_dir

Simply importing the model from its HF name gives you a cached Megatron checkpoint. Feel free to check your own checkpoint to see whether the missing key is there; my guess is that it is not. This conversion produces a checkpoint whose weight values are identical to the original HF model but in Megatron's format, and it should contain the key.

Thanks so much! And may I know whether it is possible to save only the lora adapters instead of merging anything?

If I understand your question correctly, saving only the lora adapters is already supported: people can simply save the trained lora adapters somewhere (that is one input of this PR). This PR aims at merging the base model with the trained lora adapters and converting the result to HF for downstream inference and evaluation.
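Where the merge itself is concerned, it is just folding the adapter product back into the base weight under the standard LoRA convention, W' = W + (alpha / r) * B A. A minimal numpy sketch (shapes and names are illustrative, not this PR's API):

```python
import numpy as np

def merge_lora(base_weight, lora_a, lora_b, alpha, r):
    """Fold a LoRA adapter into the base weight: W' = W + (alpha / r) * B @ A.

    Standard LoRA shapes (illustrative, not this PR's API):
    base_weight (out, in), lora_a (r, in), lora_b (out, r).
    """
    return base_weight + (alpha / r) * (lora_b @ lora_a)

# After merging, the adapter path can be dropped: one fused matmul gives the
# same output as base + scaled adapter.
rng = np.random.default_rng(0)
out_dim, in_dim, rank, alpha = 8, 16, 4, 8.0
w = rng.standard_normal((out_dim, in_dim))
a = rng.standard_normal((rank, in_dim))
b = rng.standard_normal((out_dim, rank))
x = rng.standard_normal(in_dim)

y_adapter = w @ x + (alpha / rank) * (b @ (a @ x))  # base + adapter path
y_merged = merge_lora(w, a, b, alpha, rank) @ x     # merged weight only
assert np.allclose(y_adapter, y_merged)
```

With dim: 32 and alpha: 64 as in the config above, the effective scale alpha / r is 2.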

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 6, 2026
@pengdurice
Copy link
Copy Markdown
Contributor Author

@pengdurice nice work! thanks for adding this, lgtm.

Thank you! I fixed the two CI/CD issues: the lint check, by adding the new converter file to pyrefly.toml, and the out-of-date PR error, by rebasing. The functional GPU test failure happened during SFT training and should be unrelated to this PR. It'd be great to have another CI/CD run. Thanks!

@yuki-97 @terrykong @yaoyu-33 let me know if you have additional questions! Thank you!

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Apr 8, 2026
Contributor

@yuki-97 yuki-97 left a comment


@pengdurice sorry for late reply, lgtm, and thanks for the contribution again!

@yuki-97 yuki-97 merged commit c9277a3 into NVIDIA-NeMo:main Apr 8, 2026
27 of 28 checks passed
@yuki-97
Contributor

yuki-97 commented Apr 8, 2026

hi @hanguangmic @pengdurice , feel free to continue discuss here or move to #2190

@pengdurice
Contributor Author

hi @hanguangmic @pengdurice , feel free to continue discuss here or move to #2190

@yuki-97 , thank you! @hanguangmic , if you have any more questions, let's continue our discussions, either here or in the issue #2190 ;-)


Labels

CI:Lfast: Runs a fast test suite and re-uses the nightly `main` container (but syncs dependencies to the PR's version)
community-request
Documentation: Improvements or additions to documentation
