Add checkpoint resharding script for faster loading by shuningjin · Pull Request #3801 · AI-Hypercomputer/maxtext

shuningjin · 2026-05-03T00:56:58Z

Description

This script re-shards a MaxText checkpoint on CPU. The goal is to pre-shard checkpoints (source) to accelerate loading on TPUs (target) by reducing re-sharding overhead.

FIXES: b/504714612

Introduction

Problem: In checkpoint conversion, we typically shard along the 0th dimension (usually the expert dimension for MoE). Consequently, loading is fast when the target sharding is EP (e.g., a few minutes), but noticeably slow for FSDP (e.g., an hour). This is a major bottleneck because FSDP is our most common use case.

Effectiveness: Our experiments show that pre-sharding a checkpoint to fsdp=16 reduces the loading time of DeepSeek-V3 from 60 minutes to 6 minutes on a v5p-128 cluster targeting fsdp=64. The solution also scales to v7x with 1k chips, where loading takes 10 minutes.

Generalizability: While this was built to solve the FSDP loading bottleneck, the solution generalizes to pre-sharding checkpoints into other target sharding layouts.

Method

The Orbax checkpoint is streamed from storage directly into the target sharded layout on a simulated CPU mesh, and then saved to a new checkpoint.

Key operation trace: maxengine.load_params -> maxtext_utils.setup_decode_state -> checkpointing.load_params_from_path -> orbax.checkpoint.Checkpointer.restore
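
A minimal sketch of how a simulated CPU mesh is commonly requested in JAX, assuming the script relies on the standard XLA flag `--xla_force_host_platform_device_count` (the PR text does not show the exact mechanism, and the helper name `with_simulated_cpu_devices` is hypothetical). The flag only takes effect if it is set before JAX initializes its backend, so the environment is prepared before any `import jax`.

```python
import os

_FLAG = "--xla_force_host_platform_device_count"


def with_simulated_cpu_devices(count, env=None):
    """Return a copy of `env` whose XLA_FLAGS asks XLA for `count` host (CPU) devices.

    Hypothetical helper for illustration; it is not part of MaxText.
    """
    env = dict(os.environ if env is None else env)
    # Drop any earlier setting of the same flag, keep unrelated XLA flags.
    kept = [f for f in env.get("XLA_FLAGS", "").split() if not f.startswith(_FLAG + "=")]
    env["XLA_FLAGS"] = " ".join(kept + [f"{_FLAG}={count}"])
    return env


# Set the flag first; importing jax (or a module that imports it) comes afterwards.
os.environ["XLA_FLAGS"] = with_simulated_cpu_devices(16)["XLA_FLAGS"]
# import jax  # only after XLA_FLAGS is set
```

Keeping the flag logic free of JAX imports makes the ordering explicit: nothing JAX-dependent is loaded until the environment is final.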

User Guide

Full details are in the script's docstring.

Key Parameters:

- --simulated_cpu_devices_count (defaults to 16). Examples:
  - Suitable for most cases: --simulated_cpu_devices_count=16 ici_fsdp_parallelism=16
  - More customization: --simulated_cpu_devices_count=32 ici_fsdp_parallelism=16 ici_expert_parallelism=2
- weight_dtype: The dtype used to load and save the checkpoint. We highly recommend weight_dtype=bfloat16.

Memory Requirements:

A model with X billion parameters needs slightly over 2X GB of RAM (each parameter takes 2 bytes with weight_dtype=bfloat16).
Note: Only one copy of the model is held in memory, because re-sharding happens dynamically during the read operation. Additional buffer memory is needed mainly for I/O streaming overhead, which is usually small compared to the model weights.
Example: deepseek3 with MTP layers has 685B parameters, uses 1.37 TB for weights, and hits a peak RAM of ~1.45 TB (the overhead is trivial relative to the weights).
Example: deepseek2-16b uses 32 GB for weights and hits a peak RAM of ~63 GB (the overhead seems non-trivial because the model is small).
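
The weight-memory rule above can be checked with a short calculation. This is only a sketch: the function name `estimate_weight_ram_gb` is hypothetical, 1 GB is taken as 1e9 bytes, and the result covers the weights only, not the I/O streaming overhead.

```python
def estimate_weight_ram_gb(num_params_billions, bytes_per_param=2):
    """RAM in GB (1 GB = 1e9 bytes) for one copy of the weights.

    bytes_per_param=2 corresponds to weight_dtype=bfloat16.
    Streaming/buffer overhead is deliberately not included.
    """
    return num_params_billions * bytes_per_param


# Parameter counts are the ones quoted in this PR description.
for name, params_b in [("deepseek3 with MTP", 685), ("deepseek2-16b", 16)]:
    print(f"{name}: ~{estimate_weight_ram_gb(params_b)} GB for weights")
```

Peak RAM should be budgeted above this figure, since the measured peaks reported here exceed the weight size in both cases.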

Tests

deepseek3-671b with mtp

Full test details in b/504714612 (comment3, comment8)

pre-sharded with fsdp=16
- conversion on CPU: time 134min, peak RAM 1486 GB.
- The loading time is reduced to 6min (from 1hr), target sharding is fsdp=64 on v5p-128.
- The loading time on v7x with 1k chips is 10min.

deepseek2-16b

Reshard:

# reshard CKPT1 to CKPT2 on CPU
python3 -m maxtext.checkpoint_conversion.reshard_checkpoint \
model_name=deepseek2-16b attention=dot_product mla_naive_kvcache=false \
scan_layers=True load_parameters_path=$CKPT1 \
base_output_directory=$CKPT2_DIR \
weight_dtype=bfloat16 \
checkpoint_storage_concurrent_gb=1024 checkpoint_storage_use_ocdbt=True checkpoint_storage_use_zarr3=True \
skip_jax_distributed_system=True ici_fsdp_parallelism=16 \
--simulated_cpu_devices_count=16

log: https://paste.googleplex.com/4918283263410176
Elapsed: 2 minutes, Peak Memory: 63.11 GB (32GB for weight, overhead is non-trivial as the model is small)

Inspect structure:

CKPT1 (old): shard 0th dim into 16 shards when possible
CKPT2 (new): shard with fsdp=16

# CKPT1 (old)
ArrayMetadata :  name=params.params.decoder.moe_layers.DeepSeekMoeBlock_0.MoeBlock_0.wi_0,  directory=gs://ml-auto-solutions/output/unowned/maxtext_nightly_deepseek2-16b-v5p-8-2026-04-20-06-52-15/scanned/0/items,  shape=(64, 26, 2048, 1408),  sharding=NamedShardingMetadata(shape=[16], axis_names=['checkpoint_sharding_axis'], axis_types=(Auto,), partition_spec=('checkpoint_sharding_axis',)) device_mesh=DeviceMetadataMesh(mesh=[DeviceMetadata(id=0), DeviceMetadata(id=1), DeviceMetadata(id=2), DeviceMetadata(id=3), DeviceMetadata(id=4), DeviceMetadata(id=5), DeviceMetadata(id=6), DeviceMetadata(id=7), DeviceMetadata(id=8), DeviceMetadata(id=9), DeviceMetadata(id=10), DeviceMetadata(id=11), DeviceMetadata(id=12), DeviceMetadata(id=13), DeviceMetadata(id=14), DeviceMetadata(id=15)]),  dtype=float16,  storage=StorageMetadata(chunk_shape=(4, 26, 2048, 1408), write_shape=(4, 26, 2048, 1408)),

# CKPT2 (new)
ArrayMetadata :  name=params.params.decoder.moe_layers.DeepSeekMoeBlock_0.MoeBlock_0.wi_0,  directory=gs://shuningjin-multipod-dev/conversion/ds2-fsdp-2026-05-03-10-58/0/items,  shape=(64, 26, 2048, 1408),  sharding=NamedShardingMetadata(shape=[ 1  1  1 16  1  1  1  1  1  1  1  1], axis_names=['diloco', 'data', 'stage', 'fsdp', 'fsdp_transpose', 'context', 'context_autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence', 'expert', 'autoregressive'], axis_types=(Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto, Auto), partition_spec=('expert', None, ['fsdp', 'tensor_transpose', 'context'], ['fsdp_transpose', 'tensor', 'tensor_sequence', 'autoregressive'])) device_mesh=DeviceMetadataMesh(mesh=[[[[[[[[[[[[DeviceMetadata(id=0)]]]]]]]], [[[[[[[[DeviceMetadata(id=1)]]]]]]]], [[[[[[[[DeviceMetadata(id=2)]]]]]]]], [[[[[[[[DeviceMetadata(id=3)]]]]]]]], [[[[[[[[DeviceMetadata(id=4)]]]]]]]], [[[[[[[[DeviceMetadata(id=5)]]]]]]]], [[[[[[[[DeviceMetadata(id=6)]]]]]]]], [[[[[[[[DeviceMetadata(id=7)]]]]]]]], [[[[[[[[DeviceMetadata(id=8)]]]]]]]], [[[[[[[[DeviceMetadata(id=9)]]]]]]]], [[[[[[[[DeviceMetadata(id=10)]]]]]]]], [[[[[[[[DeviceMetadata(id=11)]]]]]]]], [[[[[[[[DeviceMetadata(id=12)]]]]]]]], [[[[[[[[DeviceMetadata(id=13)]]]]]]]], [[[[[[[[DeviceMetadata(id=14)]]]]]]]], [[[[[[[[DeviceMetadata(id=15)]]]]]]]]]]]]),  dtype=bfloat16,  storage=StorageMetadata(chunk_shape=(64, 26, 128, 1408), write_shape=(64, 26, 128, 1408)),
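
The chunk shapes in the two metadata dumps follow from the mesh axis sizes and the partition spec. The sketch below reproduces them in plain Python; `chunk_shape` is a hypothetical helper written for illustration (it is not an Orbax or MaxText API) and assumes every sharded dimension divides evenly.

```python
from math import prod


def chunk_shape(global_shape, partition_spec, mesh_axis_sizes):
    """Per-shard shape: each dim is divided by the product of its mesh axis sizes."""
    # A spec shorter than the array rank leaves trailing dims unsharded.
    spec = tuple(partition_spec) + (None,) * (len(global_shape) - len(partition_spec))
    chunk = []
    for dim_size, entry in zip(global_shape, spec):
        if entry is None:
            axes = ()
        elif isinstance(entry, str):
            axes = (entry,)
        else:
            axes = tuple(entry)
        num_shards = prod(mesh_axis_sizes[a] for a in axes)  # empty product is 1
        if dim_size % num_shards:
            raise ValueError(f"dim of size {dim_size} not divisible by {num_shards}")
        chunk.append(dim_size // num_shards)
    return tuple(chunk)


shape = (64, 26, 2048, 1408)

# CKPT1 (old): a single 16-way axis on the 0th dimension.
old = chunk_shape(shape, ("checkpoint_sharding_axis",), {"checkpoint_sharding_axis": 16})

# CKPT2 (new): only 'fsdp' has size 16, every other mesh axis has size 1.
names = ["diloco", "data", "stage", "fsdp", "fsdp_transpose", "context",
         "context_autoregressive", "tensor", "tensor_transpose",
         "tensor_sequence", "expert", "autoregressive"]
sizes = dict(zip(names, [1, 1, 1, 16, 1, 1, 1, 1, 1, 1, 1, 1]))
spec = ("expert", None, ["fsdp", "tensor_transpose", "context"],
        ["fsdp_transpose", "tensor", "tensor_sequence", "autoregressive"])
new = chunk_shape(shape, spec, sizes)

print(old)  # (4, 26, 2048, 1408)
print(new)  # (64, 26, 128, 1408)
```

The old layout splits the expert dimension, while the new layout splits the third dimension 16 ways, matching the chunk_shape values reported in the StorageMetadata above.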

forward_pass_logit_checker, load with target sharding fsdp=16:

CKPT1 (old): Finished load in 85.74 seconds
CKPT2 (new): Finished load in 44.90 seconds

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

github-actions · 2026-05-03T23:36:59Z

🤖 Hi @shuningjin, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

This Pull Request introduces a new script reshard_checkpoint.py designed to re-shard MaxText checkpoints on CPU. This utility is highly effective for reducing checkpoint loading times on TPUs, as demonstrated by the significant performance gains reported for DeepSeek-V3. The PR also includes minor robustness improvements and bug fixes in llama_or_mistral_ckpt.py.

🔍 General Feedback

Performance: The reported 10x reduction in loading time (from 60 min to 6 min) for DeepSeek-V3 is a major improvement for large-scale model training and inference.
Initialization Timing: A key concern is the timing of JAX initialization in the new script. Setting environment variables like XLA_FLAGS after importing JAX-dependent modules may lead to them being ignored if the XLA backend has already been initialized.
Flexibility: Adding a way to specify or preserve the step_number would enhance the utility of the resharding script.

github-actions

## 📋 Review Summary

Supplementing the previous review with the missed comment on JAX initialization timing. Overall, the PR is very valuable for optimizing large model checkpoints.

🔍 General Feedback

Initialization Timing: Setting XLA_FLAGS before JAX imports ensures the simulated CPU mesh is correctly established.

RissyRan · 2026-05-05T01:49:40Z

+- The Orbax checkpoint is streamed from storage directly into the target sharded layout on a simulated CPU mesh,
+  and then saved to a new checkpoint.
+- The goal is to pre-shard checkpoints (source) to accelerate loading on TPUs (target) by reducing re-sharding overhead.
+  E.g., when target sharding is fsdp=64, checkpoint loading time varies across source sharding (fsdp=64 < fsdp=16 < ep=16)


Have you tried fsdp=64 < fsdp=16?

I only tried fsdp=16. Just removed fsdp=64 from comment for brevity.

codecov · 2026-05-05T21:59:35Z

Codecov Report

❌ Patch coverage is 0% with 49 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...axtext/checkpoint_conversion/reshard_checkpoint.py	0.00%	49 Missing ⚠️

shuningjin force-pushed the shuningjin-reshard branch from d549c06 to fa776ba Compare May 3, 2026 22:26

shuningjin changed the title ~~reshard checkpoint~~ Add checkpoint resharding script for faster loading May 3, 2026

shuningjin marked this pull request as ready for review May 3, 2026 23:22

shuningjin requested review from NicoGrande, RissyRan, bvandermoon, gagika, gobbleturk, hengtaoguo, jiangjy1982, parambole, richjames0, shralex and suexu1025 as code owners May 3, 2026 23:22

shuningjin assigned RissyRan and hengtaoguo May 3, 2026

shuningjin added the gemini-review label May 3, 2026

github-actions Bot reviewed May 3, 2026

Comment thread src/maxtext/checkpoint_conversion/reshard_checkpoint.py

Comment thread src/maxtext/checkpoint_conversion/standalone_scripts/llama_or_mistral_ckpt.py

Comment thread src/maxtext/checkpoint_conversion/standalone_scripts/llama_or_mistral_ckpt.py

github-actions Bot reviewed May 3, 2026

Comment thread src/maxtext/checkpoint_conversion/reshard_checkpoint.py

RissyRan approved these changes May 5, 2026

shuningjin force-pushed the shuningjin-reshard branch from 510b209 to fa776ba Compare May 5, 2026 23:19

Add checkpoint resharding script for faster loading

3b125fc

shuningjin force-pushed the shuningjin-reshard branch from 0986380 to 3b125fc Compare May 5, 2026 23:21

shuningjin wants to merge 1 commit into main from shuningjin-reshard