A thread block cluster is a newer CUDA concept (introduced in Hopper SM90, also in Blackwell SM100). It's a grouping of thread blocks that can cooperate more tightly.
```
Grid
└── Cluster (new!)            ← Group of thread blocks that can sync & share memory
    └── Thread Block (CTA)
        └── Warp
            └── Thread
```
Within a cluster, thread blocks can:

1. Synchronize - cluster-wide barriers let all blocks in a cluster wait on each other
2. Share memory - Distributed Shared Memory (DSMEM) allows one block to read another block's shared memory within the cluster
3. Coordinate MMA - Multiple CTAs can cooperate on large matrix multiplies
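A minimal sketch of these cluster capabilities (assumes CUDA 12+ and an sm_90 or newer GPU; the grid/block shapes and the neighbor-exchange pattern are illustrative, not a recommended design):

```cpp
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void cluster_kernel(float* out) {
    __shared__ float smem[128];
    cg::cluster_group cluster = cg::this_cluster();

    smem[threadIdx.x] = static_cast<float>(cluster.block_rank());
    cluster.sync();  // cluster-wide barrier: every CTA's smem is now populated

    // DSMEM: map our smem address into the neighboring CTA's shared memory
    unsigned peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    float* peer_smem = cluster.map_shared_rank(smem, peer);
    out[blockIdx.x * blockDim.x + threadIdx.x] = peer_smem[threadIdx.x];
    cluster.sync();  // keep peer smem alive until all remote reads complete
}

void launch(float* out) {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3(8, 1, 1);
    cfg.blockDim = dim3(128, 1, 1);

    // Request 2 CTAs per cluster via a launch attribute
    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 2;
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;

    cudaLaunchKernelEx(&cfg, cluster_kernel, out);
}
```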
## Memory types
<img src="../../images/cuda_memory.png" alt="CUDA memory hierarchy showing global, shared, and local memory" style="max-width: 550px; display: block; margin: 0 auto;">
For Ping-Pong, each warp group takes on a specialized role of either data producer or data consumer. The producer feeds data to the consumers' Tensor Cores. While one consumer uses the Tensor Cores for the main loop (MMA), the other works on the epilogue, which uses the CUDA cores, thereby maximizing Tensor Core utilization. -->
## GEMM flow in Blackwell
```
Full GEMM: (Gemm_M × Gemm_N) output, iterating over Gemm_K
        │
        ▼
Cluster Tile: Multiple CTAs in a cluster TOGETHER compute a larger tile
// Each CTA in the cluster handles: 128 × 256 (half the M dimension)
```
The cluster doesn't work on ONE MMA tile together - rather, multiple CTAs in a cluster each handle their own MMA tile, but they can share data via distributed shared memory and synchronize.
## CuTe
Function from Coordinate to Index: `idx = inner_product(coord, stride)`
`cute::cosize_v<CuteLayout>`: Compile-time function that returns the cosize of a layout. Cosize is the minimum number of elements needed to store all elements addressed by the layout, accounting for potential non-contiguous access patterns (strides > 1). For contiguous layouts, cosize equals size.
`cute::ArrayEngine<Type, N>`: Fixed-size array storage class
`CUTE_DEVICE` - Macro that expands to `__forceinline__ __device__` for CUDA, marking the function as callable only from GPU code.
`make_tensor`: Creates a tensor view. A tensor in CuTe is a pointer + layout pair. It doesn't own the memory; it just views it.
Parameters:
- `ptr`: A pointer (raw or CuTe smart pointer) to the data
- `layout`: A CuTe Layout describing the shape and memory access pattern
`make_smem_ptr(ptr)`: SMEM requires a special pointer type, so this function wraps a raw pointer to indicate it points to shared memory (SMEM). This lets CuTe select optimal copy operations and generate SMEM-specific PTX. Returns a pointer type that carries SMEM address-space information.
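A minimal device-side sketch of how these pieces combine (assumes the CUTLASS/CuTe headers are on the include path; the 128 × 8 shape is arbitrary):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

__global__ void kernel() {
    __shared__ float smem_buf[128 * 8];

    // make_tensor does not allocate; it wraps pointer + layout into a view.
    // make_smem_ptr tags the raw pointer as SMEM so CuTe can emit
    // SMEM-specific copy instructions.
    auto layout = make_layout(make_shape(Int<128>{}, Int<8>{}));  // column-major by default
    auto s_tensor = make_tensor(make_smem_ptr(smem_buf), layout);

    s_tensor(threadIdx.x, 0) = 0.0f;  // (row, col) coordinate access
}
```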