
Conversation


@Rohan-Bierneni Rohan-Bierneni commented Jan 20, 2026

Description

[Ckpt Conversion] Support Qwen3-Next in Unified Checkpoint Conversion Utility

This PR migrates Qwen3-Next (qwen3-next-80b-a3b) from standalone conversion scripts to the centralized `MaxText.utils.ckpt_conversion` library.

Previously, Qwen3-Next relied on ad-hoc scripts for checkpointing. Moving this to the unified utility enables:

  1. Bidirectional Conversion: Robust support for converting both HF -> MaxText and MaxText -> HF.
  2. Scanned & Unscanned Support: Native handling of scanned layers (optimized for training) and unscanned layers (optimized for decoding/inference).
  3. Maintainability: Centralizes logic for the hybrid attention architecture (interleaved Linear and Full attention layers) within the standard mapping infrastructure.

Changes

  • src/MaxText/utils/ckpt_conversion/utils/hf_model_configs.py: Added qwen3_next_80b_a3b_config using transformers.Qwen3NextConfig and registered it in HF_MODEL_CONFIGS.
  • src/MaxText/utils/ckpt_conversion/utils/param_mapping.py:
    • Implemented QWEN3_NEXT_MAXTEXT_TO_HF_PARAM_MAPPING: Handles the inhomogeneous layer cycle (mapping Full Attention vs. Linear/Hybrid Attention blocks based on layer index) and MoE components (Shared vs. Routed experts).
    • Implemented QWEN3_NEXT_MAXTEXT_TO_HF_PARAM_HOOK_FN with robust tensor handling:
      • Correct Transposition for Scanned 1D Tensors: Added specific handling (using identity hooks) for 1D parameters like A_log (shape [1]). This ensures that the scan axis is correctly handled during conversion (e.g., transforming to [1, 12] where appropriate) rather than incorrectly collapsing to [1,].
      • Preservation of Singleton Dimensions: Implemented permute_conv to correctly handle conv1d kernels (HF: [C, 1, K] <-> MT: [K, 1, C]). This prevents dimensions with value 1 from being incorrectly squeezed or flattened during the permutation process.
  • src/MaxText/utils/ckpt_conversion/utils/hf_shape.py: Added QWEN3_NEXT_HF_WEIGHTS_TO_SHAPE to calculate expected HF tensor shapes for validation.
  • end_to_end/tpu/qwen/next/...:
    • Updated 1_test_qwen3_next_80b_a3b.sh to use python3 -m MaxText.utils.ckpt_conversion.to_maxtext instead of the legacy script.
    • Added 2_test_qwen3_next_80b_a3b.sh for XLML tests to consume for forward_pass & decode verification.
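
The hybrid layer-cycle mapping described above can be sketched as follows. This is a minimal illustration, not the actual mapping code; the cycle length of 4 (three linear-attention layers followed by one full-attention layer) is an assumption about the Qwen3-Next architecture used here for concreteness:

```python
def layer_kind(layer_idx, cycle_interval=4):
  """Classify a decoder layer as full or linear attention.

  Assumes a repeating cycle in which every `cycle_interval`-th layer
  (1-indexed) is a full-attention block and the rest use linear
  (Gated DeltaNet) attention.
  """
  if (layer_idx + 1) % cycle_interval == 0:
    return "full_attention"
  return "linear_attention"

# With cycle_interval=4, 0-indexed layers 3, 7, 11, ... are full attention;
# a param mapping can branch on this to pick the right HF parameter names.
kinds = [layer_kind(i) for i in range(8)]
```

The real mapping would key the HF parameter names (e.g., `self_attn.*` vs. `linear_attn.*`) off this classification per layer index.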

FIXES: b/469445683
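
The conv1d kernel permutation mentioned in the changes above can be sketched as follows. This is a minimal NumPy sketch under the shape convention stated in the description (HF: [C, 1, K] <-> MT: [K, 1, C]); `permute_conv` here is illustrative, not the actual hook implementation:

```python
import numpy as np

def permute_conv(kernel):
  """Transpose a conv1d kernel between HF [C, 1, K] and MaxText [K, 1, C].

  Reversing the axis order swaps the channel and kernel-width dimensions
  while keeping the singleton middle dimension intact, so nothing is
  squeezed or flattened along the way.
  """
  return np.transpose(kernel, (2, 1, 0))

hf_kernel = np.zeros((512, 1, 4))    # hypothetical HF shape [C, 1, K]
mt_kernel = permute_conv(hf_kernel)  # -> [K, 1, C], i.e. (4, 1, 512)
roundtrip = permute_conv(mt_kernel)  # back to the HF layout
```

The same function serves both directions, since reversing the axes is its own inverse.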

Tests

The commands used to generate the checkpoints themselves: https://paste.googleplex.com/4921565475110912

Will run the forward-pass logit checker on the converted checkpoint (MaxText -> HF -> MaxText, scanned) and post results here:

Current status:

to_maxtext tests:

hf -> maxtext (scanned): https://paste.googleplex.com/5151438898593792
hf -> maxtext (unscanned): https://paste.googleplex.com/4721564912320512

to_huggingface tests:

Convert the scanned & unscanned MaxText checkpoints from the previous tests to HF format, then run the forward_pass check against the new HF checkpoints and the existing MaxText checkpoints.

Maxtext (scanned) -> HF: https://paste.googleplex.com/4787924765900800
Maxtext (unscanned) -> HF: https://paste.googleplex.com/5256341314732032

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.


codecov bot commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 0% with 101 lines in your changes missing coverage. Please review.

Files with missing lines                               | Patch % | Lines
...xText/utils/ckpt_conversion/utils/param_mapping.py  | 0.00%   | 71 Missing ⚠️
...rc/MaxText/utils/ckpt_conversion/utils/hf_shape.py  | 0.00%   | 28 Missing ⚠️
...xt/utils/ckpt_conversion/utils/hf_model_configs.py  | 0.00%   | 2 Missing ⚠️


is_full_attention_layer = (layer_idx + 1) % cycle_interval == 0

if is_full_attention_layer:
  # Full Attention Block

nit: Adding comments explaining how these numbers relate to the config parameters (e.g., hidden_size, num_attention_heads * head_dim, etc.) or if they are fixed architectural dimensions would greatly enhance maintainability. For example, it seems 4096 = config["num_attention_heads"] * config["head_dim"]
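
For instance, the relation suggested in this comment could be asserted directly in the mapping code. The config values below are assumptions for illustration, not taken from the actual Qwen3NextConfig:

```python
# Hypothetical config values; the real ones come from the HF model config.
config = {"num_attention_heads": 16, "head_dim": 256}

# The hard-coded 4096 in the mapping would then be derived, not literal,
# which documents where the number comes from and guards against drift.
q_proj_dim = config["num_attention_heads"] * config["head_dim"]
assert q_proj_dim == 4096
```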

Collaborator Author

Yes, I will add comments explaining how the hard-coded numbers are calculated. The Gated DeltaNet in particular has a bunch of these calculations.


@parambole parambole left a comment


I have left a couple of comments. PTAL.


@shuningjin shuningjin left a comment


Thanks for adding the model to the conversion tool, along with careful logit checks! Left a minor comment.

  • For future reference, could you also add the conversion commands to the PR description? Would be nice to also add the conversion time in description, if you have it. Thank you!

For test script 2_test_qwen3_next_80b_a3b.sh:

  • Maybe also add pre-training and finetuning (example). Training was omitted from DS3 as covered by ubench.
  • Could you test this script and attach log to description?
  • Thanks for updating the description. Maybe update PR title as well to accurately reflect the change: e.g., add "update test scripts".
  • Will this be added to XLML in the other repo?

@Rohan-Bierneni Rohan-Bierneni changed the title Add Qwen3-Next to checkpoint util Add Qwen3-Next to checkpoint util & update test scripts Jan 29, 2026
@Rohan-Bierneni Rohan-Bierneni force-pushed the rbierneni-qwen3next-chkpt-util branch from 5bafc01 to 4fd58f5 Compare January 29, 2026 18:00
@Rohan-Bierneni Rohan-Bierneni force-pushed the rbierneni-qwen3next-chkpt-util branch from 4fd58f5 to a6f97ae Compare January 30, 2026 18:58