
[Feature] Offload optimizer states to CPU to reduce NPU memory with minimal performance impact#1524

Open
tina-wen wants to merge 1 commit into InternLM:main from tina-wen:swap_optimizer

Conversation


@tina-wen commented Mar 3, 2026

Description

This PR adds CPU offloading for optimizer states to reduce NPU memory usage. Optimizer states stay in host memory and are transferred to the device only during optimizer.step(), via host-to-device (h2d) and device-to-host (d2h) copies.

Changes

  • Offload optimizer states to CPU memory
  • Transfer to device only during optimizer.step()
  • Resolve conflicts with DCP.save and RL offload_optimizer
  • Trade a small amount of optimizer-step latency for significant memory savings
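As a rough illustration of the offload pattern described above (a framework-free sketch with hypothetical names; the actual PR moves torch tensors between host memory and the NPU), optimizer state lives on the host and visits the device only inside step():

```python
# Minimal sketch of CPU-offloaded optimizer state (illustrative only; the
# real PR performs h2d/d2h tensor copies on NPU streams).
class OffloadedSGD:
    def __init__(self, params, lr=0.1, momentum=0.9):
        self.params = params        # stands in for device-resident parameters
        self.lr = lr
        self.momentum = momentum
        # Optimizer state stays in host memory between steps.
        self.host_state = [0.0] * len(params)

    def step(self, grads):
        for i, g in enumerate(grads):
            m = self.host_state[i]          # "h2d": fetch state for the update
            m = self.momentum * m + g       # momentum accumulation
            self.params[i] -= self.lr * m   # apply the update on the "device"
            self.host_state[i] = m          # "d2h": state returns to the host

opt = OffloadedSGD([1.0, 2.0])
opt.step([0.5, -0.5])
```

Between steps, no optimizer state occupies device memory, which is where the reported NPU savings come from; the cost is the transfer time paid inside each step().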

Testing

Verified with:

  • Memory reduction tests
  • DCP checkpoint compatibility
  • RL optimization workflows

Collaborator

@HAOCHENYE left a comment


  1. Please

self.optimizer = optimizer
self.swap_optimizer_times = swap_optimizer_times
if SwapOptimizerOperate.swap_to_device_stream is None:
    SwapOptimizerOperate.swap_to_device_stream = torch.npu.Stream()
Collaborator


Please use get_torch_device_module() to obtain DEVICE_MODULE instead of referencing torch.npu directly.
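To make the suggestion concrete (a minimal sketch; xtuner's actual get_torch_device_module may differ in signature and behavior, and the stub objects below stand in for the real torch backend modules), the idea is to resolve the backend module once and then write DEVICE_MODULE.Stream() rather than torch.npu.Stream():

```python
from types import SimpleNamespace

# Stand-ins for torch.npu / torch.cuda so the dispatch logic is runnable
# here; in real code these are the actual torch backend modules.
torch_stub = SimpleNamespace(
    npu=SimpleNamespace(name="npu", is_available=lambda: True),
    cuda=SimpleNamespace(name="cuda", is_available=lambda: True),
)

def get_torch_device_module(torch_mod):
    """Prefer the NPU backend when present and available, else fall back to CUDA."""
    npu = getattr(torch_mod, "npu", None)
    if npu is not None and npu.is_available():
        return npu
    return torch_mod.cuda

DEVICE_MODULE = get_torch_device_module(torch_stub)
# Call sites then use DEVICE_MODULE.Stream(), DEVICE_MODULE.current_stream(), ...
```

This keeps the swap code portable: the same call sites work whether the job runs on NPU or CUDA hardware.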

self.optimizer.swap_numel = swap_numel

swap_memory = swap_num * 8 / 1024 / 1024
print('[Rank {}] swap optimizer param num: {}, param size: {}MB\n'.format(torch.npu.current_device(), swap_num, swap_memory), end='')
Collaborator


Please use the logger defined in xtuner instead of print.
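A hedged sketch of what that change could look like (the logger name and the helper below are illustrative, not the PR's or xtuner's actual code):

```python
import logging

# Assumed logger name for illustration; real code should use the logger
# instance that xtuner itself exposes.
logger = logging.getLogger("xtuner")

def report_swap(rank, swap_num, bytes_per_param=8):
    """Log the swapped parameter count and size instead of print()-ing it."""
    swap_mb = swap_num * bytes_per_param / 1024 / 1024
    logger.info("[Rank %s] swap optimizer param num: %s, param size: %.2fMB",
                rank, swap_num, swap_mb)
    return swap_mb

report_swap(0, 1024 * 1024)  # 1M params at 8 bytes each -> 8.00MB, at INFO level
```

Routing through the logger keeps the message subject to rank-aware filtering and log-level control instead of unconditionally writing to stdout.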

cls.swap_to_host_events_map[param] = None

@classmethod
def swap_all_to_device(cls):
Collaborator


Should swap_to_device_stream wait for the main device stream, to avoid the memory peak caused by backward computation overlapping with swap_all_to_device?
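The concern can be illustrated with a toy two-stream model (illustrative only; real code would synchronize via stream wait_stream/event APIs on the device runtime):

```python
# Toy model: streams just record ops, and wait_stream records an ordering
# dependency. Without the wait, swap_all_to_device could overlap backward
# and raise peak device memory.
class ToyStream:
    def __init__(self, name, timeline):
        self.name, self.timeline = name, timeline

    def run(self, op):
        self.timeline.append((self.name, op))

    def wait_stream(self, other):
        # A real runtime inserts an event dependency; here we only record intent.
        self.timeline.append((self.name, f"wait({other.name})"))

timeline = []
main = ToyStream("main", timeline)
swap = ToyStream("swap_to_device", timeline)

main.run("backward")           # frees activation memory as it completes
swap.wait_stream(main)         # ensure backward has finished before h2d copies
swap.run("swap_all_to_device")
```

The ordering dependency guarantees the h2d copies begin only after backward's memory pressure has subsided, trading a little overlap for a lower peak.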

Collaborator


Is swap_to_device_stream unused here?

cls.swap_to_device_events_map[param] = torch.npu.current_stream().record_event()

@classmethod
def wait_swap_to_device_event(cls, param):
Collaborator


unused function?

[group['step']], amsgrad=amsgrad, lr=group['lr'], beta1=beta1, beta2=beta2, weight_decay=group['weight_decay'],
eps=group['eps'], maximize=group['maximize'])

# it may be removed
Collaborator


Why is swap_all_to_host not called here?



2 participants