Problem
Training fails with a ValueError when using FSDP with the HYBRID_SHARD sharding strategy on systems where nproc_per_node is smaller than the number of GPUs available on the node.
Error
ValueError: The arg 'group_size' (8) must not exceed the world size (2)
Stack Trace
File ".../torch/distributed/fsdp/fully_sharded_data_parallel.py", line 439, in __init__
_init_process_group_state(
File ".../torch/distributed/fsdp/_init_utils.py", line 113, in _init_process_group_state
state = _init_process_group_state_for_hybrid_shard(
File ".../torch/distributed/fsdp/_init_utils.py", line 160, in _init_process_group_state_for_hybrid_shard
intra_node_group, inter_node_group = _init_intra_and_inter_node_groups(
File ".../torch/distributed/fsdp/_init_utils.py", line 266, in _init_intra_and_inter_node_groups
_init_intra_node_process_group(num_devices_per_node),
File ".../torch/distributed/fsdp/_init_utils.py", line 211, in _init_intra_node_process_group
intra_node_subgroup, _ = dist.new_subgroups(num_devices_per_node)
File ".../torch/distributed/distributed_c10d.py", line 5500, in new_subgroups
raise ValueError(
ValueError: The arg 'group_size' (8) must not exceed the world size (2)
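The guard that raises, reduced to its essentials (a sketch for illustration, not the actual PyTorch source):

```python
def check_group_size(group_size: int, world_size: int) -> None:
    # Mirrors the validation in dist.new_subgroups: an intra-node
    # subgroup cannot be larger than the total number of ranks.
    if group_size > world_size:
        raise ValueError(
            f"The arg 'group_size' ({group_size}) must not exceed "
            f"the world size ({world_size})"
        )

check_group_size(2, 2)  # fine: 2 ranks split into groups of 2
# check_group_size(8, 2) would raise, matching the trace above
```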
Cause
When HYBRID_SHARD is used without an explicit device_mesh, FSDP1 auto-detects num_devices_per_node, which defaults to 8. It then attempts to create intra-node process groups of size 8, which fails when world_size < 8.
FSDP1 does not provide a straightforward way to configure the intra-node group size when using HYBRID_SHARD without a device_mesh.
Reproduction
Run training with:
- distributed_training_framework: fsdp
- fsdp_sharding_strategy: HYBRID_SHARD
- Fewer than 8 GPUs (e.g., 2 GPUs)
Environment
- PyTorch with FSDP1 (torch.distributed.fsdp.FullyShardedDataParallel)
- Accelerate
- Any system with < 8 GPUs