Reduce and Allreduce NVLS implementations for the cuda backend #6038
nsarka wants to merge 24 commits into NVIDIA:main
Greptile Summary

This PR adds Reduce and Allreduce implementations for the CUDA/NVLS backend, built on top of the existing NVLink SHARP (NVLS) support.

Confidence Score: 3/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant R0 as Rank 0
    participant R1 as Rank 1
    participant Sym as Symmetric Buffer (NVLS Multicast VA)
    Note over R0,R1: postAllreduceWithCudaBackend / postReduceWithCudaBackend
    R0->>Sym: cudaMemcpyAsync(inputBuffer[0] ← input)
    R1->>Sym: cudaMemcpyAsync(inputBuffer[1] ← input)
    R0->>R1: cuStreamBatchMemOp WRITE kInProgress → R1.semaphore[0]
    R1->>R0: cuStreamBatchMemOp WRITE kInProgress → R0.semaphore[1]
    R0->>R0: cuStreamBatchMemOp WAIT kInProgress on R1.semaphore[0]
    R1->>R1: cuStreamBatchMemOp WAIT kInProgress on R0.semaphore[1]
    R0->>Sym: launchMulticastReduceKernel(mc_ptr → output)
    R1->>Sym: launchMulticastReduceKernel(mc_ptr → output)
    Note over Sym: multimem.ld_reduce aggregates all ranks' inputBuffers
    R0->>R1: cuStreamBatchMemOp WRITE kIdle → R1.semaphore[0]
    R1->>R0: cuStreamBatchMemOp WRITE kIdle → R0.semaphore[1]
    Note over R0,R1: waitAllreduceWithCudaBackend / waitReduceWithCudaBackend
    R0->>R0: cuStreamBatchMemOp WAIT kIdle on R0.semaphore[1]
    R1->>R1: cuStreamBatchMemOp WAIT kIdle on R1.semaphore[0]
```
```cuda
size_t n_vec = n_bytes / 16;

for (size_t i = idx; i < n_vec; i += stride) {
  float r0, r1, r2, r3;
  const void* addr = mc_src_c + i * 16;
  asm volatile(
      "multimem.ld_reduce.global.add.v4.f32 {%0,%1,%2,%3}, [%4];"
      : "=f"(r0), "=f"(r1), "=f"(r2), "=f"(r3)
      : "l"(addr)
      : "memory");
  float4 out;
  out.x = r0;
  out.y = r1;
  out.z = r2;
  out.w = r3;
  ((float4*)dst_c)[i] = out;
}
```
Tail elements silently dropped when `n_bytes` is not a 16-byte multiple
The kernel computes `n_vec = n_bytes / 16` using integer division, so any bytes in the range `[n_vec*16, n_bytes)` are silently ignored and never reduced. The caller (`launchMulticastReduceKernelImpl`) does check `size % 16 == 0` via `NVF_CHECK`, so this cannot be triggered through the normal code path. However, the public `launchMulticastReduceKernel` wrapper (used by tests, per the header comment) does not perform that check, leaving callers free to pass a size that is not a multiple of 16 and silently get wrong results.
Consider adding the alignment assertion inside the kernel or inside `launchMulticastReduceKernel` itself so test callers also get a clear error.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
!test
Built on top of #5620. Adds reduce and allreduce NVLS implementations. Both use the same `ld_reduce` kernel and synchronize using a symmetric integer tensor as a semaphore.