Refactor CCL APIs to align with torch.distributed conventions #326
Conversation
This refactor reorders parameters and adds support for process groups.

API Changes:
- all_reduce: (out, in, op=SUM, group=None, async_op=False, config=None, workspace=None)
- reduce_scatter: (out, in, op=SUM, group=None, async_op=False, config=None)
- all_gather: (out, in, group=None, async_op=False, config=None)
- all_to_all: (out, in, group=None, async_op=False, config=None)

New Features:
- Add ReduceOp enum (SUM, PRODUCT, MIN, MAX, etc.) matching torch.distributed
- Add extract_group_info() helper to extract rank_start/rank_stride from ProcessGroup
- Support strided process groups (e.g., TP groups [0, 1, 2, 3] or DP groups [0, 4, 8, 12])
- The op parameter currently validates that only SUM is used (other ops to be added later)

Kernel Changes:
- All CCL kernels now accept rank_start and rank_stride constexpr parameters
- Kernel loops iterate using group-aware rank calculation
- Ring all-reduce computes next_rank on the host side for group support

Backward Compatibility:
- Existing code using keyword arguments (config=...) continues to work
- torch.distributed-compatible parameter ordering (group before config)
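A minimal usage sketch of the reordered signatures (hedged: the wrapper function, the `ctx` handle, and the `tp_group` argument are placeholders for illustration, not code from this PR):

```python
from iris.ccl import ReduceOp  # newly exported alongside the CCL ops

def allreduce_on_group(ctx, out, inp, tp_group=None):
    """Sketch only: `ctx` stands for an initialized Iris instance whose CCL
    methods follow the reordered, torch.distributed-style signatures."""
    # op and group now come before async_op/config, mirroring torch.distributed.
    ctx.all_reduce(out, inp, op=ReduceOp.SUM, group=tp_group)
    # Existing keyword-argument callers keep working, e.g. passing config by name:
    # ctx.all_reduce(out, inp, config=my_config)
    return out
```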
Pull request overview
This pull request refactors the CCL (Collective Communication Library) APIs to align with torch.distributed conventions by reordering parameters and adding support for process groups. However, the implementation contains several critical bugs that prevent process groups from working correctly.
Changes:
- Adds ReduceOp enum matching torch.distributed semantics
- Reorders API parameters to match torch.distributed: (out, in, op, group, async_op, config)
- Adds extract_group_info() helper to extract rank/stride information from ProcessGroup (see the sketch after this list)
- Updates all CCL kernels to accept rank_start and rank_stride parameters for group-aware rank calculation
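A hedged sketch of how a helper like extract_group_info() could derive rank_start/rank_stride from a strided group (the body below is an assumption for illustration; the PR's actual implementation may differ):

```python
import torch.distributed as dist

def extract_group_info_sketch(group=None):
    """Illustrative only: return (rank_start, rank_stride, group_size) for a
    group whose global ranks are evenly strided, e.g. a TP group [0, 1, 2, 3]
    or a DP group [0, 4, 8, 12]."""
    if group is None:
        # Default: the whole world, starting at rank 0 with stride 1.
        return 0, 1, dist.get_world_size()
    ranks = dist.get_process_group_ranks(group)
    rank_start = ranks[0]
    rank_stride = ranks[1] - ranks[0] if len(ranks) > 1 else 1
    # Only evenly strided groups are representable as (start, stride).
    assert all(r == rank_start + i * rank_stride for i, r in enumerate(ranks)), \
        "process group ranks must form an evenly strided sequence"
    return rank_start, rank_stride, len(ranks)
```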
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| iris/iris.py | Updated all CCL method signatures to add op and group parameters with reordered arguments |
| iris/experimental/iris_gluon.py | Updated CCL method signatures (missing group parameter in all_gather) |
| iris/ccl/__init__.py | Added ReduceOp to exports |
| iris/ccl/utils.py | Added ReduceOp enum and extract_group_info() helper function |
| iris/ccl/all_reduce.py | Updated kernels and function to support group parameters with rank_start/rank_stride |
| iris/ccl/reduce_scatter.py | Updated kernel and function to support group parameters |
| iris/ccl/all_gather.py | Updated kernel and function to support group parameters |
| iris/ccl/all_to_all.py | Updated Triton and Gluon kernels and function to support group parameters |
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
mawad-amd left a comment:
Can we find better names for cur_rank and cur_rank_global? They are really confusing throughout the code.
    rank_start: tl.constexpr,
    rank_stride: tl.constexpr,
This is a valid concern. Is an iris instance over the entire world or just the GPUs belonging to the group?
Let me think about that.
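For context on the rank_start/rank_stride constexpr parameters discussed above, a hedged Triton sketch of a group-aware rank loop (the kernel name, the flat per-rank buffer layout, and the reduce body are placeholders, not the PR's actual kernels):

```python
import triton
import triton.language as tl

@triton.jit
def group_aware_reduce_sketch(
    in_ptr,            # flat buffer with one slice per *global* rank (sketch assumption)
    out_ptr,           # output slice for this rank
    n_elements,
    rank_start: tl.constexpr,
    rank_stride: tl.constexpr,
    group_size: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    acc = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)
    # Group-aware iteration: member i of the group has global rank
    # rank_start + i * rank_stride (e.g. 0, 4, 8, 12 for a strided DP group).
    for i in range(group_size):
        remote_rank = rank_start + i * rank_stride
        # A real CCL kernel would translate remote_rank into a remote buffer
        # address; this sketch only reads per-rank slices of one local buffer.
        acc += tl.load(in_ptr + remote_rank * n_elements + offsets, mask=mask, other=0.0)
    tl.store(out_ptr + offsets, acc, mask=mask)
```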
mawad-amd left a comment:
Looks good to me. Thanks!
Add an optional group parameter to barrier methods to support process-group-specific synchronization, aligning with the torch.distributed.barrier(group=None) API convention.
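For comparison, a sketch of the torch.distributed convention being mirrored (the ctx.barrier calls in the comments are assumptions about the Iris-side API):

```python
import torch.distributed as dist

def sync_group(group=None):
    """group=None waits on the whole world; otherwise only the ranks that
    belong to `group` synchronize at this point."""
    dist.barrier(group=group)

# Assumed Iris-side equivalent (placeholder `ctx` for an Iris instance):
#   ctx.barrier()                # world-wide, unchanged behavior
#   ctx.barrier(group=tp_group)  # synchronize only the group's ranks
```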