Add Qwen3VL MCore Export support #895
📝 Walkthrough
This PR adds support for Qwen3-VL vision-language models to the Megatron Core export/import framework by introducing model mappings that convert between Hugging Face and Megatron Core structures for quantization workflows, alongside documentation updates and comprehensive unit tests.
## What does this PR do?
This PR implements the RegionSearch class. RegionSearch partitions a large ONNX model into small regions, on which QDQ autotuning is then performed.
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No, documentation updates are in Part 4.
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: The CHANGELOG will be updated when all changes are ready.
## Summary by CodeRabbit
* **Refactor**
  * Improved the ONNX quantization backend with a new optimization framework and extensive test coverage to enhance internal graph-processing capabilities.
---------
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
## What does this PR do?
This PR implements the RegionInspect tool, which visualizes the regions partitioned by the RegionSearch class and helps analyze whether the partitioned regions match the fusion patterns.
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No, documentation updates are in Part 4.
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: No, the CHANGELOG will be updated when all changes are ready.
## Summary by CodeRabbit
* **New Features**
  * Added a region inspection tool for ONNX models. It analyzes model structure and generates detailed reports, including region statistics, hierarchical relationships, node-coverage metrics, and size-distribution analysis, available through a command-line interface with configurable parameters.
---------
Signed-off-by: Will Guo <willg@nvidia.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
## What does this PR do?
**Type of change:** Bug fix
**Overview:**
1. Fix the Megatron ignored-module names carrying an extra `.` in the suffix.
2. Change the Megatron export to save each layer as its own safetensors file (avoiding ghost safetensors).
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No
## Summary by CodeRabbit
* **New Features**
  * Export workflow now supports additional model components (EAGLE/Medusa modules)
  * Per-layer model state organization for improved checkpoint management
* **Bug Fixes**
  * More robust Hugging Face configuration, tokenizer, and image processor preservation
  * Enhanced multimodal component extraction and loading
* **Refactor**
  * Optimized model export process with improved per-layer safetensors handling
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
## What does this PR do?
**Type of change:** New model support
**Overview:** Add PTQ support for https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1
## Usage
```bash
python3 hf_ptq.py --pyt_ckpt_path /home/omniml_data_3/models/NVIDIA-Nemotron-Parse-v1.1 \
    --qformat fp8 \
    --export_path /home/omniml_data_3/zhiyuc/checkpoints/NVIDIA-Nemotron-Parse-v1.1-FP8 \
    --trust_remote_code --kv_cache_qformat none --attn_implementation eager
```
By default, image-text data is used to calibrate VLMs.
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Not yet
## Summary by CodeRabbit
* **New Features**
  * Added support for Nemotron-Parse multimodal models, including proper device mapping, processor loading, and generation handling.
* **Improvements**
  * Enhanced quantization robustness with safer handling of quantization attributes and fallback logic.
  * Improved model loading with better device placement and encoder buffer management for vision-language models.
---------
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
## What does this PR do?
**Type of change:** Bug fix
## Testing
Verified that a pruned Nemotron Nano v2 model can be saved.
## Summary by CodeRabbit
* **Bug Fixes**
  * Fixed Hugging Face model loading to properly respect the `trust_remote_code` parameter during model instantiation.
* **Improvements**
  * Enhanced distributed training logging with rank-0-aware warning and logging mechanisms for cleaner, non-redundant output in multi-GPU and multi-node scenarios.
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
## What does this PR do?
**[Short term]**: Megatron-based tests take a long time, often causing CI/CD timeouts. This splits the Megatron tests into a dedicated CI/CD job for a faster overall CI/CD run.
**[Mid/long term]**: Run all Megatron GPU tests with `torchrun` instead of `pytest`, so the distributed processes are created once up front and individual tests no longer need to set up and tear down their own processes, which adds significant per-test overhead.
## Testing
- [x] 1-GPU CI/CD passing (on this PR)
- [x] 2-GPU CI/CD passing (on nightly run, manually triggered): https://github.com/NVIDIA/Model-Optimizer/actions/runs/22000517688
---------
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
…t (#884)
## What does this PR do?
**Type of change:** new feature
**Overview:** Enable full TE spec support for NemotronH (Mamba hybrid)
models during HF-to-Megatron weight import via
`import_mcore_gpt_from_hf`.
Previously, importing HF weights into a Megatron model built with the
full TE spec (`TELayerNormColumnParallelLinear`, `TEGroupedMLP`, etc.)
failed for NemotronH models due to two issues:
1. **Grouped expert prefix bug**: The `experts.linear_fc1/fc2` import
rules had a hard-coded `mtp.layers.{}` prefix, which was only correct
for MTP layers. When regular decoder MoE layers use `TEGroupedMLP` (via
the full TE spec), the importer generated incorrect HF keys (e.g.,
`mtp.layers.27.mixer.experts.0.up_proj.weight` instead of
`backbone.layers.27.mixer.experts.0.up_proj.weight`).
2. **Fused layer norm loading**: In the full TE spec, layer norms are
fused into `TELayerNormColumnParallelLinear` modules as
`layer_norm_weight`. The importer's `_name_remapping` would crash trying
to load `layer_norm_weight` from a non-existent HF path (e.g.,
`backbone.layers.X.mixer.in_proj.layer_norm_weight`), when the actual HF
norm weight lives at `backbone.layers.X.norm.weight`.
### Changes
**`mcore_nemotron.py`**:
- Fixed grouped expert prefix from `mtp.layers.{}` to
`backbone.layers.{}`. The `_grouped_mlp_merging` function already
handles `backbone` → `mtp` replacement when `is_mtp=True`, so both
decoder and MTP layers work correctly.
- Added `mapping={"layer_norm_weight": None}` to `in_proj` and
`linear_fc1` rules to skip `layer_norm_weight` during `_name_remapping`
(loaded separately via `fused_norm`).
- Added `fused_norm` rule
(`NameRemapping("backbone.layers.{}.norm.weight")`) to load HF norm
weights into fused TE modules.
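The prefix fix above can be illustrated with a small sketch. This is not the actual ModelOpt code; `expert_hf_key` is a hypothetical helper that mirrors how a single `backbone.layers.{}` base prefix can serve both decoder and MTP layers via the `backbone` → `mtp` replacement that `_grouped_mlp_merging` performs when `is_mtp=True`:

```python
# Illustrative sketch (hypothetical helper, not ModelOpt's API) of the
# grouped-expert prefix fix: decoder layers use "backbone.layers.{}" as
# the base, and the MTP variant is derived by string replacement.
def expert_hf_key(layer_idx: int, expert_idx: int, proj: str, is_mtp: bool = False) -> str:
    prefix = "backbone.layers.{}".format(layer_idx)
    if is_mtp:
        # Mirrors the backbone -> mtp replacement done for MTP layers.
        prefix = prefix.replace("backbone", "mtp")
    return f"{prefix}.mixer.experts.{expert_idx}.{proj}.weight"

# Decoder layers now resolve to backbone.* keys instead of the old,
# incorrect mtp.* keys; MTP layers still resolve to mtp.*.
decoder_key = expert_hf_key(27, 0, "up_proj")
mtp_key = expert_hf_key(27, 0, "up_proj", is_mtp=True)
```

With the old hard-coded `mtp.layers.{}` base, `decoder_key` would have been the incorrect `mtp.layers.27.mixer.experts.0.up_proj.weight` from the bug description.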
**`megatron_importer.py`**:
- Added `source_key is None` check in `_name_remapping` to skip keys
mapped to `None` in the mapping dict (keeps existing value instead of
crashing on missing HF key).
- Added fused norm loading in `_import_mamba_layer`: after loading
`in_proj`, loads `layer_norm_weight` from HF via `fused_norm` rule when
`layer.norm` is `IdentityOp`.
- Added fused norm loading in `_import_transformer_layer`: loads
`layer_norm_weight` into `linear_qkv` (when `input_layernorm` is
`IdentityOp`) and into `linear_fc1` (when `pre_mlp_layernorm` is
`IdentityOp`).
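The `source_key is None` skip can be sketched as follows. This is a simplified stand-in for `_name_remapping` (the function signature and dict-based weights are illustrative assumptions, not ModelOpt's actual implementation):

```python
# Illustrative sketch of skipping keys mapped to None in a name-remapping
# rule: the existing Megatron value is kept instead of crashing on a
# missing HF key. Not the actual ModelOpt implementation.
def name_remapping(state_dict, hf_weights, prefix, mapping):
    """Copy HF weights into state_dict, skipping keys mapped to None."""
    for param_name in list(state_dict):
        source_key = mapping.get(param_name, param_name)
        if source_key is None:
            # e.g. mapping={"layer_norm_weight": None}: the fused norm
            # is loaded separately via the dedicated "fused_norm" rule.
            continue
        state_dict[param_name] = hf_weights[prefix + source_key]
    return state_dict

# Hypothetical example: 'layer_norm_weight' has no HF counterpart under
# the in_proj prefix, so it is left untouched rather than raising KeyError.
mcore = {"weight": None, "layer_norm_weight": "existing-norm"}
hf = {"backbone.layers.0.mixer.in_proj.weight": "hf-weight"}
out = name_remapping(mcore, hf, "backbone.layers.0.mixer.in_proj.",
                     mapping={"layer_norm_weight": None})
```

Without the `None` check, the lookup for `backbone.layers.0.mixer.in_proj.layer_norm_weight` would fail, which is exactly the crash described in issue 2 above.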
## Usage
The full TE spec is enabled via the `--full-te-spec` flag on the
Megatron-LM side (separate PR). On the ModelOpt side, no user-facing
changes are needed -- the import rules automatically handle both local
spec and full TE spec models.
```bash
# Convert HF checkpoint to Megatron with full TE spec (megatron-lm side)
unset MLM_MODEL_CKPT && export MLM_MODEL_SAVE=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16_mlm && export HF_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
export PP=2
export MLM_EXTRA_ARGS="--full-te-spec"
bash convert.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
# Quantize the converted checkpoint (megatron-lm side)
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16_mlm
export MLM_MODEL_SAVE=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm
export HF_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec"
bash quantize.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 FP8_DEFAULT_CFG
# Generate
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec"
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm && ./generate.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
# MMLU
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec"
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm && export MLM_EXTRA_ARGS="--fraction 0.05 --disable-tqdm" && ./mmlu.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```
## Testing
- Tested end-to-end: HF → Megatron conversion → FP8 quantization →
inference (generate) → MMLU evaluation with
Nemotron-3-Nano-30B-A3B-BF16.
- Verified the resulting model structure matches Megatron-Bridge's TE
spec output (TELayerNormColumnParallelLinear, TEGroupedMLP, IdentityOp
norms, etc.).
- Verified quantized model produces coherent text generation outputs.
- Verified backward compatibility: all changes are no-ops for existing
local-spec pipelines (guarded by `IdentityOp` checks, `hasattr` checks,
and `"fused_norm" in self.rules` checks).
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes -- all changes are
guarded by conditions that only activate for full TE spec models. Local
spec models follow the exact same code paths as before.
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: No
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
No
## Additional Information
Companion megatron-lm changes (separate PR):
- `megatron/core/post_training/modelopt/mamba/model_specs.py`: Added
`use_full_te_spec` parameter to return canonical `mamba_stack_spec` from
`mamba_layer_specs.py`.
- `megatron/post_training/model_builder.py`: Passes
`use_full_te_spec=args.full_te_spec` to `get_mamba_stack_modelopt_spec`.
- `megatron/post_training/arguments.py`: Added `--full-te-spec` CLI
flag.
- `examples/post_training/modelopt/convert_model.py`: Skip
`moe_grouped_gemm=False` override when `--full-te-spec` is set.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Added support for loading fused normalization weights during model
import.
* **Bug Fixes**
* Improved weight mapping logic to correctly skip redundant layer norm
weights in specialized model architectures.
* **Refactor**
* Reorganized expert model parallel configuration paths for better
compatibility with mixed parallel processing settings.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: James Shen <yueshen@nvidia.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
## What does this PR do?
**Type of change:** new example <!-- Use one of the following: Bug fix,
new feature, new example, new tests, documentation. -->
**Overview:**
Adding LTX-2 distillation trainer.
## Usage
<!-- You can potentially add a usage example below. -->
```bash
accelerate launch \
--config_file configs/accelerate/fsdp.yaml \
--num_processes 8 \
distillation_trainer.py --config configs/distillation_example.yaml
```
See readme for more details.
## Testing
Run training with single/multiple nodes.
## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes <!--- If No, explain why.
-->
- **Did you write any new necessary tests?**: NA
- **Did you add or update any necessary documentation?**: Yes
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
Yes <!--- Only for new features, API changes, critical bug fixes or bw
breaking changes. -->
## Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## New Features
* Added distillation training support for LTX-2 models with quantization
integration.
* Introduced comprehensive documentation and example configurations for
distillation workflows.
* Includes multi-GPU and multi-node training setup with distributed
training support and customizable configuration templates.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Meng Xin <mxin@nvidia.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
Signed-off-by: hychiang <kenny5312012@gmail.com>
modelopt-bot
left a comment
Review Summary
Overall this is a well-structured PR adding Qwen3-VL support. The mapping logic correctly handles the different weight structure (model.language_model. prefix) compared to Qwen3. The tests are comprehensive.
However, I have a few questions and suggestions that warrant discussion before approval.
modelopt-bot
left a comment
Review Summary
Overall this is a well-structured PR adding Qwen3-VL support. The mapping logic correctly handles the different weight structure (model.language_model. prefix) compared to Qwen3. Tests are comprehensive.
Requested Changes:
- Merge conflicts - 22 files have conflicts that need resolution (as flagged by CodeRabbit)
- Missing newline - mcore_qwen3vl.py is missing a trailing newline
Questions/Clarifications:
1. Copyright year mismatch (2023-2025 vs 2024 in the test file)
2. Router mapping includes MoE but no shared_expert mappings - is this intentional?
3. Consider adding docstrings to the mapping dictionaries
Please address the merge conflicts first, then the minor formatting issue.
Is this PR still in progress?
@hychiang-git can you rebase with the latest main branch and move the test to
Since the original branch source is no longer available, I will close this PR and create a new one at PR #1482, where we can update this feature.
## What does this PR do?
**Type of change:** new feature
**Overview:** Add Qwen3-VL (Vision-Language) model support to the Megatron Core export/import plugin, enabling HuggingFace-to-mcore weight conversion for PTQ/QAT/QAD workflows.
## Details
Qwen3-VL has a different weight structure from Qwen3 text-only models: decoder weights live under the `language_model.` prefix, while `lm_head.` stays at the checkpoint root.
This PR adds:
- Export/import mappings between `Qwen3VLForConditionalGeneration` and Megatron Core, handling the `language_model` prefix for all decoder layers, QKV merging/slicing, gated MLP merging/slicing, and Q/K layer norms.
- Registration in `all_mcore_hf_export_mapping` and `all_mcore_hf_import_mapping`.
## Testing
Unit tests in `tests/unit/torch/export/test_mcore_qwen3vl.py` covering:
- the `language_model.` prefix for decoder layers, with `lm_head.` at the root
- `QKVMerging` and `GatedMLPMerging`
- `REPLICATE` for layer norms
- TP sharding configs and `parallel_config`
- export keeps the `language_model.` prefix, with `lm_head` unchanged
## Before your PR is "*Ready for review*"
Files changed include `tests/unit/torch/export/test_mcore_qwen3vl.py`, `docs/source/deployment/3_unified_hf.rst`, and `CHANGELOG.rst`.
## Additional Information
A companion Megatron-LM PR adds `Qwen3VLModel`, `Qwen3VLDataset`, and `pretrain_qwenvl.py`; see NVIDIA/Megatron-LM#3444.
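The prefix handling described above can be sketched with a minimal example. `mcore_to_hf_key` and the exact parameter names are illustrative assumptions for exposition, not ModelOpt's actual mapping API:

```python
# Hypothetical sketch of the Qwen3-VL key translation: decoder weights
# gain the "model.language_model." prefix, while lm_head stays at the
# checkpoint root. Names are illustrative, not ModelOpt's real API.
def mcore_to_hf_key(mcore_name: str) -> str:
    if mcore_name.startswith("output_layer"):
        # lm_head is not nested under the language_model prefix.
        return mcore_name.replace("output_layer", "lm_head")
    return "model.language_model." + mcore_name

decoder_key = mcore_to_hf_key("decoder.layers.0.self_attention.linear_qkv.weight")
head_key = mcore_to_hf_key("output_layer.weight")
```

This is the key difference from the text-only Qwen3 mapping, where decoder weights sit directly under `model.` with no `language_model` level.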