During NPU SFT training, the following error is reported:
[rank1139]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78] Timed out waiting 1800000ms for recv operation to complete
[rank1136]: Traceback (most recent call last):
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/scripts/training_scripts/accelerate_train.py", line 160, in <module>
[rank1136]:     main()
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/scripts/training_scripts/accelerate_train.py", line 156, in main
[rank1136]:     trainer.train()
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/mova/engine/trainer/accelerate/accelerate_trainer.py", line 509, in train
[rank1136]:     self._save_checkpoint()
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/mova/engine/trainer/accelerate/accelerate_trainer.py", line 572, in _save_checkpoint
[rank1136]:     self.accelerator.save_state(os.path.join(step_dir, "accelerator"))
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/accelerate/accelerator.py", line 3662, in save_state
[rank1136]:     save_fsdp_optimizer(self.state.fsdp_plugin, self, opt, self._models[i], output_dir, i)
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/accelerate/utils/fsdp_utils.py", line 273, in save_fsdp_optimizer
[rank1136]:     dist_cp.save(
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/logger.py", line 87, in wrapper
[rank1136]:     result = func(*args, **kwargs)
[rank1136]:              ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 475, in inner_func
[rank1136]:     return func(*args, **kwargs)
[rank1136]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 176, in save
[rank1136]:     return _save_state_dict(
[rank1136]:            ^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 362, in _save_state_dict
[rank1136]:     central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step)
[rank1136]:                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 217, in reduce_scatter
[rank1136]:     result = self.scatter_object(all_results)
[rank1136]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 162, in scatter_object
[rank1136]:     dist.scatter_object_list(
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1136]:     return func(*args, **kwargs)
[rank1136]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3762, in scatter_object_list
[rank1136]:     broadcast(max_tensor_size, group_src=group_src, group=group)
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1136]:     return func(*args, **kwargs)
[rank1136]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2830, in broadcast
[rank1136]:     work.wait()
[rank1136]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78] Timed out waiting 1800000ms for recv operation to complete
The error occurred on a single machine with 8 NPU cards. The open-source Accelerate library's checkpoint save goes through a gloo TCP collective; the ranks deadlocked waiting on the recv, the 1800000 ms (30 minute) timeout expired, and the main process was killed.
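A common first mitigation, assuming the trainer constructs its own Accelerator (the call site and the 2-hour value below are assumptions, not taken from accelerate_train.py), is to raise the process-group timeout so the save-plan exchange is not aborted at the default 30 minutes. A minimal sketch:

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the default 1800 s collective timeout so the checkpoint
# save-plan exchange (reduce_scatter / scatter_object_list over the
# gloo TCP transport) gets more time before recv gives up.
# The 2 h value is an assumption; tune it to the observed save time.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))

accelerator = Accelerator(kwargs_handlers=[pg_kwargs])

Note this only helps when a rank is merely slow (for example, flushing a large FSDP optimizer state to shared storage); if one rank has already died before reaching save_state, the survivors will still time out, only later. Checking whether every rank actually entered _save_checkpoint is worth doing first.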