RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78] Timed out waiting 1800000ms for recv operation to complete #62

@leild

The following error is reported during NPU SFT training:
[rank1139]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78] Timed out waiting 1800000ms for recv operation to complete
[rank1136]: Traceback (most recent call last):
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/scripts/training_scripts/accelerate_train.py", line 160, in <module>
[rank1136]:     main()
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/scripts/training_scripts/accelerate_train.py", line 156, in main
[rank1136]:     trainer.train()
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/mova/engine/trainer/accelerate/accelerate_trainer.py", line 509, in train
[rank1136]:     self._save_checkpoint()
[rank1136]:   File "/root/work/filestorage/gh/code/MOVA-feat-npu/mova/engine/trainer/accelerate/accelerate_trainer.py", line 572, in _save_checkpoint
[rank1136]:     self.accelerator.save_state(os.path.join(step_dir, "accelerator"))
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/accelerate/accelerator.py", line 3662, in save_state
[rank1136]:     save_fsdp_optimizer(self.state.fsdp_plugin, self, opt, self._models[i], output_dir, i)
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/accelerate/utils/fsdp_utils.py", line 273, in save_fsdp_optimizer
[rank1136]:     dist_cp.save(
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/logger.py", line 87, in wrapper
[rank1136]:     result = func(*args, **kwargs)
[rank1136]:              ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 475, in inner_func
[rank1136]:     return func(*args, **kwargs)
[rank1136]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 176, in save
[rank1136]:     return _save_state_dict(
[rank1136]:            ^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 362, in _save_state_dict
[rank1136]:     central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step)
[rank1136]:                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 217, in reduce_scatter
[rank1136]:     result = self.scatter_object(all_results)
[rank1136]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 162, in scatter_object
[rank1136]:     dist.scatter_object_list(
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1136]:     return func(*args, **kwargs)
[rank1136]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3762, in scatter_object_list
[rank1136]:     broadcast(max_tensor_size, group_src=group_src, group=group)
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1136]:     return func(*args, **kwargs)
[rank1136]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1136]:   File "/opt/mamba/envs/ascend-torch/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2830, in broadcast
[rank1136]:     work.wait()
[rank1136]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78] Timed out waiting 1800000ms for recv operation to complete
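The error is reported on one machine with 8 NPU cards. The open-source accelerate library calls into Gloo over TCP while saving the checkpoint, the wait deadlocks, and the main process ends up being killed.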
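
For reference, the 1800000 ms in the message is exactly 30 minutes, which as far as I know matches the default process-group timeout in PyTorch for Gloo. Below is a minimal sketch of how a longer timeout could be passed through accelerate, using `InitProcessGroupKwargs` from `accelerate.utils`. The helper names `timeout_from_ms` and `build_accelerator` and the 2-hour value are my own illustrative assumptions, not part of the MOVA code. I have not verified that this setting reaches the Gloo group used by `dist_cp.save` in this NPU setup, and a longer timeout only helps if a rank is slow rather than truly deadlocked.

```python
from datetime import timedelta


def timeout_from_ms(ms: int) -> timedelta:
    """Convert the millisecond value printed by Gloo into a timedelta."""
    return timedelta(milliseconds=ms)


# Value taken from the error message above.
observed_timeout = timeout_from_ms(1_800_000)

# Hypothetical longer timeout to try; the 2-hour figure is an assumption.
proposed_timeout = timedelta(hours=2)


def build_accelerator(timeout: timedelta):
    """Create an Accelerator whose process group is initialised with `timeout`.

    accelerate is imported lazily so the rest of this snippet runs
    without it installed.
    """
    from accelerate import Accelerator
    from accelerate.utils import InitProcessGroupKwargs

    return Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timeout)])


if __name__ == "__main__":
    print(f"observed timeout: {observed_timeout}")
    print(f"proposed timeout: {proposed_timeout}")
```

In the training script, `build_accelerator(proposed_timeout)` would replace the existing `Accelerator(...)` construction, with any other arguments that call already passes carried over.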
