Conversation
Greptile Summary

Fixes a multi-GPU race condition where all ranks simultaneously attempted to delete and rebuild the visualize binary. Now only rank 0 performs the build, with the other ranks waiting at a synchronization barrier.
Confidence Score: 2/5
Important Files Changed
Last reviewed commit: a43a592
    if is_distributed:
        torch.distributed.barrier()
If the build fails on the main rank (lines 1499/1501/1503), the exception is raised before this barrier is reached, causing the non-main ranks to hang indefinitely. Catch the build error and re-raise it after the barrier, so that every rank always reaches the barrier:

    build_error = None
    if is_main_rank:
        try:
            # build code
        except Exception as e:
            build_error = e
    if is_distributed:
        torch.distributed.barrier()
    if build_error:
        raise build_error
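The deferred-raise pattern suggested above can be sketched end to end. The snippet below is an illustration only: it models `torch.distributed.barrier()` with `threading.Barrier` and uses a hypothetical `failing_build` stand-in for the visualize-binary rebuild, so it runs without a GPU cluster. The point it demonstrates is that even when the rank-0 build raises, every rank still reaches the barrier, and the error is re-raised only afterwards.

```python
import threading

def run_ranks(n_ranks, build):
    """Simulate the rank-0-builds pattern with one thread per rank.

    `build` stands in for deleting and rebuilding the visualize binary;
    torch.distributed.barrier() is modeled with threading.Barrier so the
    sketch runs locally. Returns {rank: exception} for ranks that failed.
    """
    barrier = threading.Barrier(n_ranks)
    errors = {}

    def worker(rank):
        build_error = None
        if rank == 0:  # only the main rank performs the build
            try:
                build()
            except Exception as e:
                build_error = e  # defer raising until after the barrier
        barrier.wait()  # reached by every rank, even if the build failed
        if build_error is not None:
            errors[rank] = build_error

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return errors

def failing_build():
    raise RuntimeError("build failed")

# A failing build no longer hangs the non-main ranks: all four threads
# pass the barrier, and only rank 0 records the error.
errs = run_ranks(4, failing_build)
```

Without the deferred raise, rank 0 would exit before `barrier.wait()`, and the remaining ranks would block forever, which is exactly the hang described in the review comment.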
LGTM. Would you like to merge this? @eugenevinitsky @riccardosavorgnan
It seems the new 2.0 removed the function entirely, though, due to the new changes in render.
Oh huh, okay, we'll make the appropriate change when multi-GPU support is added in 3.0