
Conversation

@neoblizz
Member

This refactor reorders parameters and adds support for process groups:

API Changes:

  • all_reduce: (out, in, op=SUM, group=None, async_op=False, config=None, workspace=None)
  • reduce_scatter: (out, in, op=SUM, group=None, async_op=False, config=None)
  • all_gather: (out, in, group=None, async_op=False, config=None)
  • all_to_all: (out, in, group=None, async_op=False, config=None)

New:

  • Add ReduceOp enum (SUM, PRODUCT, MIN, MAX, etc.) matching torch.distributed
  • Add extract_group_info() helper to extract rank_start/rank_stride from ProcessGroup
  • Support strided process groups (e.g., TP groups [0,1,2,3] or DP groups [0,4,8,12])
  • The op parameter is validated so that only SUM is currently accepted (other ops to be added later)

Kernel Changes:

  • All CCL kernels now accept rank_start and rank_stride constexpr parameters
  • Kernel loops updated to iterate using group-aware rank calculation
  • Ring all-reduce computes next_rank on host side for group support
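
The rank arithmetic behind these kernel changes can be illustrated host-side (plain Python, not the actual Triton kernels): group loops visit global ranks rank_start + i * rank_stride, and the ring all-reduce picks each rank's successor within the group.

```python
def group_global_ranks(rank_start, rank_stride, group_size):
    # Global ranks a group-aware kernel loop would iterate over.
    return [rank_start + i * rank_stride for i in range(group_size)]

def ring_next_rank(cur_rank, rank_start, rank_stride, group_size):
    # Position of cur_rank within the group, then its ring successor,
    # mapped back to a global rank.
    local = (cur_rank - rank_start) // rank_stride
    return rank_start + ((local + 1) % group_size) * rank_stride
```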

Backward Compatibility:

  • Existing code using keyword arguments (config=...) continues to work
  • torch.distributed-compatible parameter ordering (group before config)
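
A stand-in with the all_reduce signature listed above (the body is a stub, not the real implementation) shows both compatibility claims: the old keyword style (config=...) still works, and group precedes config so torch.distributed-style positional calls line up.

```python
def all_reduce(out, inp, op="SUM", group=None, async_op=False,
               config=None, workspace=None):
    # Stub: just echo how the arguments were bound.
    return {"op": op, "group": group, "async_op": async_op,
            "config": config, "workspace": workspace}

# Old-style call: keyword config continues to work.
old_style = all_reduce([0.0], [1.0], config={"block_size": 256})
# New-style call: op and group passed positionally, before config.
new_style = all_reduce([0.0], [1.0], "SUM", "tp_group")
```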
Copilot AI review requested due to automatic review settings January 23, 2026 05:33
@github-actions github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels Jan 23, 2026
Contributor

Copilot AI left a comment


Pull request overview

This pull request refactors the CCL (Collective Communication Library) APIs to align with torch.distributed conventions by reordering parameters and adding support for process groups. However, the implementation contains several critical bugs that prevent process groups from working correctly.

Changes:

  • Adds ReduceOp enum matching torch.distributed semantics
  • Reorders API parameters to match torch.distributed: (out, in, op, group, async_op, config)
  • Adds extract_group_info() helper to extract rank/stride information from ProcessGroup
  • Updates all CCL kernels to accept rank_start and rank_stride parameters for group-aware rank calculation

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 14 comments.

| File | Description |
| ---- | ----------- |
| iris/iris.py | Updated all CCL method signatures to add op and group parameters with reordered arguments |
| iris/experimental/iris_gluon.py | Updated CCL method signatures (missing group parameter in all_gather) |
| iris/ccl/__init__.py | Added ReduceOp to exports |
| iris/ccl/utils.py | Added ReduceOp enum and extract_group_info() helper function |
| iris/ccl/all_reduce.py | Updated kernels and function to support group parameters with rank_start/rank_stride |
| iris/ccl/reduce_scatter.py | Updated kernel and function to support group parameters |
| iris/ccl/all_gather.py | Updated kernel and function to support group parameters |
| iris/ccl/all_to_all.py | Updated Triton and Gluon kernels and function to support group parameters |

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

@neoblizz neoblizz requested a review from sgudapar January 26, 2026 20:02
Collaborator

@mawad-amd mawad-amd left a comment


Can we find better names for cur_rank and cur_rank_global? They are really confusing throughout the code.

Comment on lines +29 to +30
rank_start: tl.constexpr,
rank_stride: tl.constexpr,
Collaborator

This is a valid concern. Is an iris instance over the entire world or just the GPUs belonging to the group?

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>
@neoblizz
Member Author

Can we find better names for cur_rank and cur_rank_global? They are really confusing throughout the code.

Let me think about that.

Collaborator

@mawad-amd mawad-amd left a comment


Looks good to me. Thanks!

Add optional  parameter to barrier methods to support
process group-specific synchronization, aligning with the
torch.distributed.barrier(group=None) API convention.
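
The group-scoped barrier convention from the commit message above can be sketched as follows; this is illustrative only (the participant computation, not the real blocking synchronization).

```python
def barrier_participants(group, world_size):
    # Following torch.distributed.barrier(group=None) semantics:
    # group=None means every rank in the world synchronizes; otherwise
    # only the ranks in the given process group do. The real barrier
    # would block until all of these ranks arrive.
    return list(range(world_size)) if group is None else sorted(group)
```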
@neoblizz neoblizz merged commit 3a008de into main Jan 28, 2026
48 checks passed
@neoblizz neoblizz deleted the neoblizz/ccl-groups branch January 28, 2026 20:47