Problem
Training fails with a ValueError when using FSDP with the HYBRID_SHARD sharding strategy on systems where nproc_per_node is smaller than the number of GPUs available on the node.
Error
ValueError: The arg 'group_size' (8) must not exceed the world size (2)
Stack Trace
File ".../torch/distributed/fsdp/fully_sharded_data_parallel.py", line 439, in __init__
_init_process_group_state(
File ".../torch/distributed/fsdp/_init_utils.py", line 113, in _init_process_group_state
state = _init_process_group_state_for_hybrid_shard(
File ".../torch/distributed/fsdp/_init_utils.py", line 160, in _init_process_group_state_for_hybrid_shard
intra_node_group, inter_node_group = _init_intra_and_inter_node_groups(
File ".../torch/distributed/fsdp/_init_utils.py", line 266, in _init_intra_and_inter_node_groups
_init_intra_node_process_group(num_devices_per_node),
File ".../torch/distributed/fsdp/_init_utils.py", line 211, in _init_intra_node_process_group
intra_node_subgroup, _ = dist.new_subgroups(num_devices_per_node)
File ".../torch/distributed/distributed_c10d.py", line 5500, in new_subgroups
raise ValueError(
ValueError: The arg 'group_size' (8) must not exceed the world size (2)
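The guard that raises, reduced to its essentials (a sketch for illustration, not the actual PyTorch source):

```python
def check_group_size(group_size: int, world_size: int) -> None:
    # Mirrors the validation in dist.new_subgroups: an intra-node
    # subgroup cannot be larger than the total number of ranks.
    if group_size > world_size:
        raise ValueError(
            f"The arg 'group_size' ({group_size}) must not exceed "
            f"the world size ({world_size})"
        )

check_group_size(2, 2)  # fine: 2 ranks split into groups of 2
# check_group_size(8, 2) would raise, matching the trace above
```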
Cause
When HYBRID_SHARD is used without an explicit device_mesh, FSDP1 auto-detects num_devices_per_node, which defaults to 8. It then attempts to create intra-node process groups of size 8, which fails when world_size < 8.
FSDP1 does not provide a straightforward way to configure the intra-node group size when using HYBRID_SHARD without a device_mesh.
Reproduction
Run training with:
- distributed_training_framework: fsdp
- fsdp_sharding_strategy: HYBRID_SHARD
- Fewer than 8 GPUs (e.g., 2 GPUs)
Environment
- PyTorch with FSDP1 (torch.distributed.fsdp.FullyShardedDataParallel)
- Accelerate
- Any system with < 8 GPUs