During NPU SFT training, the following error is reported:
[rank1139]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78] Timed out waiting 1800000ms for recv operation to complete
[rank1136]: Traceback (most recent call last):
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/scripts/training_scripts/accelerate_train.py", line 160, in <module>
[rank1136]:     main()
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/scripts/training_scripts/accelerate_train.py", line 156, in main
[rank1136]:     trainer.train()
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/mova/engine/trainer/accelerate/accelerate_trainer.py", line 509, in train
[rank1136]:     self._save_checkpoint()
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/mova/engine/trainer/accelerate/accelerate_trainer.py", line 572, in _save_checkpoint
[rank1136]:     self.accelerator.save_state(os.path.join(step_dir, "accelerator"))
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/accelerate/accelerator.py", line 3662, in save_state
[rank1136]:     save_fsdp_optimizer(self.state.fsdp_plugin, self, opt, self._models[i], output_dir, i)
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/accelerate/utils/fsdp_utils.py", line 273, in save_fsdp_optimizer
[rank1136]:     dist_cp.save(
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/logger.py", line 87, in wrapper
[rank1136]:     result = func(*args, **kwargs)
[rank1136]:              ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 475, in inner_func
[rank1136]:     return func(*args, **kwargs)
[rank1136]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 176, in save
[rank1136]:     return _save_state_dict(
[rank1136]:            ^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 362, in _save_state_dict
[rank1136]:     central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step)
[rank1136]:                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 217, in reduce_scatter
[rank1136]:     result = self.scatter_object(all_results)
[rank1136]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 162, in scatter_object
[rank1136]:     dist.scatter_object_list(
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1136]:     return func(*args, **kwargs)
[rank1136]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3762, in scatter_object_list
[rank1136]:     broadcast(max_tensor_size, group_src=group_src, group=group)
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1136]:     return func(*args, **kwargs)
[rank1136]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2830, in broadcast
[rank1136]:     work.wait()
[rank1136]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78] Timed out waiting 1800000ms for recv operation to complete
The error occurred on a single machine with 8 NPU cards. The open-source Accelerate library's checkpoint save goes through a gloo TCP collective; the ranks deadlocked waiting on the recv, the 1800000 ms (30 minute) timeout expired, and the main process was killed.
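A common first mitigation, assuming the trainer constructs its own Accelerator (the call site and the 2-hour value below are assumptions, not taken from accelerate_train.py), is to raise the process-group timeout so the save-plan exchange is not aborted at the default 30 minutes. A minimal sketch:

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the default 1800 s collective timeout so the checkpoint
# save-plan exchange (reduce_scatter / scatter_object_list over the
# gloo TCP transport) gets more time before recv gives up.
# The 2 h value is an assumption; tune it to the observed save time.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))

accelerator = Accelerator(kwargs_handlers=[pg_kwargs])

Note this only helps when a rank is merely slow (for example, flushing a large FSDP optimizer state to shared storage); if one rank has already died before reaching save_state, the survivors will still time out, only later. Checking whether every rank actually entered _save_checkpoint is worth doing first.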