
feat: add collective communication primitives for CP #182

Open
Chamberlain0w0 wants to merge 1 commit into master from feat/cp_comm


@Chamberlain0w0

Contributor

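Adds foundational communication-layer capabilities, providing the low-level communication operators that context parallel and other parallel strategies will build on.

Main changes:

  • AlltoAll support: adds an AlltoAll interface to CclImpl / NcclImpl, adds AlltoAll to ProcessGroup, and adds an AlltoAll functional wrapper in parallel::function.
    • NcclImpl::AlltoAll has two implementation paths: NCCL 2.28.3 and later use the native ncclAlltoAll; older NCCL versions fall back to grouped ncclSend / ncclRecv.
  • BatchSendRecv support (modeled on torch.distributed.batch_isend_irecv, for submitting a group of p2p send/recv ops in one call): adds the P2POpType / P2POp definitions and ProcessGroup::BatchSendRecv, reuses the existing compute/comm stream event synchronization pattern, and submits the batched p2p operations with CCL group semantics.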
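
For reviewers who want the AlltoAll data movement spelled out, here is a minimal host-only sketch; it is not the code in this PR. It simulates the grouped send/recv fallback in a single process, with no NCCL calls and no GPUs. `EncodeNcclVersion`, `UseNativeAlltoAll`, and `SimulateGroupedAlltoAll` are hypothetical names for illustration. The version encoding follows the way `nccl.h` packs versions 2.9 and later into `NCCL_VERSION_CODE`, and the 2.28.3 threshold comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: mirrors how nccl.h packs versions >= 2.9 into
// NCCL_VERSION_CODE (major * 10000 + minor * 100 + patch).
constexpr int EncodeNcclVersion(int major, int minor, int patch) {
    return major * 10000 + minor * 100 + patch;
}

// Hypothetical gate: true when the native ncclAlltoAll path would be taken,
// false when the grouped ncclSend / ncclRecv fallback would be used.
constexpr bool UseNativeAlltoAll(int version_code) {
    return version_code >= EncodeNcclVersion(2, 28, 3);
}

using Buffer = std::vector<int>;

// Simulates the fallback for all ranks at once. send[r] is rank r's send
// buffer and must hold world * count elements, one chunk per peer.
// Each (rank, peer) pair models one send on `rank` matched by one recv on
// `peer`: chunk `peer` of rank's send buffer lands in chunk `rank` of
// peer's receive buffer.
std::vector<Buffer> SimulateGroupedAlltoAll(const std::vector<Buffer>& send,
                                            std::size_t count) {
    const std::size_t world = send.size();
    std::vector<Buffer> recv(world, Buffer(world * count));
    for (std::size_t rank = 0; rank < world; ++rank) {
        for (std::size_t peer = 0; peer < world; ++peer) {
            for (std::size_t i = 0; i < count; ++i) {
                recv[peer][rank * count + i] = send[rank][peer * count + i];
            }
        }
    }
    return recv;
}
```

With two ranks and two elements per chunk, send buffers `{0, 1, 2, 3}` and `{10, 11, 12, 13}` become receive buffers `{0, 1, 10, 11}` and `{2, 3, 12, 13}`. In the real fallback each rank would issue its sends and receives inside one group, so the pairwise exchanges cannot deadlock on ordering.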

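The batched p2p pattern can be illustrated the same way. The sketch below is an assumption-laden model, not the PR's definitions: the field layout of `P2POp`, the `BuildRingOps` helper, and `SimulateBatchSendRecv` are all hypothetical, and everything runs in one process without streams or events. It shows a ring exchange, where each rank sends to its next neighbor and receives from its previous one, and why submitting the whole batch before matching makes the order of ops within a batch irrelevant.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed shapes for illustration only; the PR's actual P2POpType / P2POp
// definitions may differ.
enum class P2POpType { kSend, kRecv };

struct P2POp {
    P2POpType type;
    std::vector<float>* buffer;  // source for kSend, destination for kRecv
    int peer;                    // rank on the other end of the transfer
};

// Hypothetical helper: builds one recv-from-previous and one send-to-next op
// per rank. The recv is listed first on purpose to show order is harmless.
std::vector<std::vector<P2POp>> BuildRingOps(
    std::vector<std::vector<float>>& send_bufs,
    std::vector<std::vector<float>>& recv_bufs) {
    const int world = static_cast<int>(send_bufs.size());
    std::vector<std::vector<P2POp>> ops(send_bufs.size());
    for (int rank = 0; rank < world; ++rank) {
        const int next = (rank + 1) % world;
        const int prev = (rank + world - 1) % world;
        ops[rank].push_back(P2POp{P2POpType::kRecv, &recv_bufs[rank], prev});
        ops[rank].push_back(P2POp{P2POpType::kSend, &send_bufs[rank], next});
    }
    return ops;
}

// Models group semantics: all ops are known before any transfer happens, so
// each send is paired with the matching recv on its peer regardless of order.
// Assumes at most one send per (rank, peer) pair. Returns the matched count.
int SimulateBatchSendRecv(const std::vector<std::vector<P2POp>>& ops_per_rank) {
    int matched = 0;
    const int world = static_cast<int>(ops_per_rank.size());
    for (int rank = 0; rank < world; ++rank) {
        for (const P2POp& send : ops_per_rank[rank]) {
            if (send.type != P2POpType::kSend) continue;
            const std::size_t peer = static_cast<std::size_t>(send.peer);
            for (const P2POp& recv : ops_per_rank[peer]) {
                if (recv.type == P2POpType::kRecv && recv.peer == rank) {
                    *recv.buffer = *send.buffer;  // deliver the payload
                    ++matched;
                    break;
                }
            }
        }
    }
    return matched;
}
```

With three ranks whose send buffers hold `0`, `1`, and `2`, all three transfers match and the receive buffers end up holding `2`, `0`, and `1`. In a real backend the batch would presumably be bracketed by the library's group start/end calls on the comm stream, as the description indicates.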
