
Commit d3c18f9

Author: Basu Jindal (committed)

add better preview for links
1 parent 386a2fb commit d3c18f9

5 files changed

Lines changed: 392 additions & 12 deletions


blogs/cuda.md

Lines changed: 111 additions & 1 deletion
@@ -4,7 +4,7 @@ date: 2026-01-19
show: false
---

-## Threads, Warps, Thread Blocks and Grid
+## Threads, Warps, Thread Blocks, Thread Block cluster and Grid

<img src="../../images/SM.png" alt="CUDA Streaming Multiprocessor architecture showing threads, warps, and thread blocks" style="max-width: 600px; display: block; margin: 0 auto;">
@@ -34,6 +34,46 @@ blockIdx.x=0, blockIdx.x=1 (2 threadblocks)
threadIdx.x = 0~3 (4 threads/threadblock)
```

A thread block cluster is a newer CUDA concept, introduced with Hopper (SM90) and also supported on Blackwell (SM100): a grouping of thread blocks that can cooperate more tightly than ordinary blocks in a grid.

```cpp
Grid
 └── Cluster (new!)     ← Group of thread blocks that can sync & share memory
      └── Thread Block  ← blockDim.x * blockDim.y * blockDim.z threads
           └── Warp     ← 32 threads

┌────────────┬─────────────────────────────────────┬─────────────────────────────┐
│ Concept    │ What it represents                  │ Size variable               │
├────────────┼─────────────────────────────────────┼─────────────────────────────┤
│ gridDim    │ Number of thread blocks in the grid │ gridDim.x/y/z               │
├────────────┼─────────────────────────────────────┼─────────────────────────────┤
│ blockDim   │ Number of threads per block         │ blockDim.x/y/z              │
├────────────┼─────────────────────────────────────┼─────────────────────────────┤
│ clusterDim │ Number of thread blocks per cluster │ cluster_shape (e.g., 2×2×1) │
└────────────┴─────────────────────────────────────┴─────────────────────────────┘
```
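
To make the hierarchy concrete, here is a minimal sketch of a kernel that fixes a 2×2×1 cluster shape at compile time; the kernel name and body are hypothetical, and clusters require compiling for sm_90 or newer:

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical example: 2×2×1 cluster shape fixed at compile time.
// Build with: nvcc -arch=sm_90 (thread block clusters need Hopper+).
__global__ void __cluster_dims__(2, 2, 1) cluster_kernel(float* out) {
    cg::cluster_group cluster = cg::this_cluster();

    // Rank of this thread block within its cluster (0..3 for 2×2×1)
    unsigned int rank = cluster.block_rank();

    unsigned int bid = blockIdx.y * gridDim.x + blockIdx.x;
    out[bid * blockDim.x + threadIdx.x] = static_cast<float>(rank);

    // Barrier across ALL thread blocks in the cluster,
    // unlike __syncthreads(), which covers only one block
    cluster.sync();
}
```
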
#### Example

cluster_shape = (2, 2, 1) // 2×2×1 = 4 thread blocks per cluster
gridDim = (8, 4, 1) // 32 total thread blocks
blockDim = (128, 1, 1) // 128 threads per block

- Total clusters = 32 / 4 = 8 clusters
- Each cluster has 4 thread blocks that can:
  1. Synchronize with each other (`cluster.sync()`)
  2. Access each other's shared memory (distributed shared memory)
  3. Coordinate on tensor core operations
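
Alternatively, the cluster shape can be chosen at launch time instead of in the kernel attribute. A minimal host-side sketch for exactly this configuration, assuming a hypothetical `cluster_kernel(float*)` compiled without `__cluster_dims__` (CUDA 12+):

```cpp
// Host-side launch with a runtime cluster shape (CUDA 12+, sm_90+).
cudaLaunchConfig_t config = {};
config.gridDim  = dim3(8, 4, 1);    // 32 thread blocks total
config.blockDim = dim3(128, 1, 1);  // 128 threads per block

cudaLaunchAttribute attr = {};
attr.id = cudaLaunchAttributeClusterDimension;
attr.val.clusterDim.x = 2;          // 2×2×1 = 4 thread blocks per cluster
attr.val.clusterDim.y = 2;
attr.val.clusterDim.z = 1;
config.attrs    = &attr;
config.numAttrs = 1;

float* d_out = nullptr;
cudaMalloc(&d_out, 32 * 128 * sizeof(float));
cudaLaunchKernelEx(&config, cluster_kernel, d_out);
```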

#### Why Clusters?

Thread blocks in the same cluster can:

1. Synchronize - `__cluster_barrier_arrive()` / `__cluster_barrier_wait()`
2. Share memory - Distributed Shared Memory (DSMEM) allows one block to read another block's shared memory within the cluster (see the sketch below)
3. Coordinate MMA - multiple CTAs can cooperate on large matrix multiplies
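
Here is a minimal sketch of point 2, distributed shared memory; the neighbor-exchange logic is illustrative only and assumes a 2×1×1 cluster:

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Illustrative only: each block reads the OTHER block's shared
// memory inside a 2-block cluster (sm_90 or newer).
__global__ void __cluster_dims__(2, 1, 1) dsmem_kernel(float* out) {
    __shared__ float smem[128];
    cg::cluster_group cluster = cg::this_cluster();

    smem[threadIdx.x] = blockIdx.x * 1000.0f + threadIdx.x;

    // Make every block's shared memory visible cluster-wide
    cluster.sync();

    // Map our smem address into the peer block's address space (DSMEM)
    unsigned int peer = cluster.block_rank() ^ 1;
    float* peer_smem  = cluster.map_shared_rank(smem, peer);

    out[blockIdx.x * blockDim.x + threadIdx.x] = peer_smem[threadIdx.x];

    // Keep smem alive until all peers have finished reading it
    cluster.sync();
}
```
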
## Memory types

<img src="../../images/cuda_memory.png" alt="CUDA memory hierarchy showing global, shared, and local memory" style="max-width: 550px; display: block; margin: 0 auto;">
@@ -232,6 +272,41 @@ For Ping-Pong, each warp group takes on a specialized role of either Data produc
The producer can feed data to the Tensor cores of the Consumers. While one consumer is using the Tensor cores for the Main Loop (MMA), the other can work on the Epilogue, which uses the CUDA cores, thereby maximizing the utilization of the Tensor cores -->

## GEMM flow in Blackwell

Full GEMM: (Gemm_M × Gemm_N) output, iterating over Gemm_K

Cluster Tile: Multiple CTAs in a cluster TOGETHER compute a larger tile
│   Size: (cluster_M × MmaTile_M) × (cluster_N × MmaTile_N)

CTA Tile: Each CTA within the cluster computes its portion
│   Size: MmaTile_M × MmaTile_N (one CTA's responsibility)

MMA Atom: The hardware instruction (tcgen05.mma)
    Size: e.g., 64×256×16 for SM100

So the relationship is:

┌──────────────┬───────────────────────────┬───────────────────────────────────────────────────┐
│ Level        │ What computes it          │ Size                                               │
├──────────────┼───────────────────────────┼───────────────────────────────────────────────────┤
│ Full output  │ Entire grid               │ Gemm_M × Gemm_N                                    │
├──────────────┼───────────────────────────┼───────────────────────────────────────────────────┤
│ Cluster tile │ 1 cluster (multiple CTAs) │ (cluster_M × MmaTile_M) × (cluster_N × MmaTile_N)  │
├──────────────┼───────────────────────────┼───────────────────────────────────────────────────┤
│ CTA tile     │ 1 CTA (thread block)      │ MmaTile_M × MmaTile_N                              │
├──────────────┼───────────────────────────┼───────────────────────────────────────────────────┤
│ MMA atom     │ 1 MMA instruction         │ ~64×256×16                                         │
└──────────────┴───────────────────────────┴────────────────────────────────────────────────────┘

#### Example

cluster_shape = (2, 1, 1) // 2 CTAs per cluster in M
MmaTile_M = 128, MmaTile_N = 256

// One CLUSTER handles: (2 × 128) × (1 × 256) = 256 × 256 output tile
// Each CTA in the cluster handles: 128 × 256 (half the M dimension)

The cluster doesn't work on ONE MMA tile together; rather, each CTA in a cluster handles its own MMA tile, but the CTAs can share data via distributed shared memory and synchronize.
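
As a sanity check on these levels, a small host-side sketch of the tile arithmetic (all sizes are hypothetical and match the example above):

```cpp
// Tile bookkeeping only; no kernel launch. All sizes hypothetical.
int Gemm_M = 1024, Gemm_N = 1024;           // full GEMM output
int MmaTile_M = 128, MmaTile_N = 256;       // one CTA tile
int cluster_M = 2, cluster_N = 1;           // CTAs per cluster, (2, 1, 1)

int clusterTile_M = cluster_M * MmaTile_M;  // 256
int clusterTile_N = cluster_N * MmaTile_N;  // 256

int clusters_M = Gemm_M / clusterTile_M;    // 4 cluster tiles along M
int clusters_N = Gemm_N / clusterTile_N;    // 4 cluster tiles along N

int num_clusters = clusters_M * clusters_N;          // 16 clusters
int num_ctas = num_clusters * cluster_M * cluster_N; // 32 CTAs in the grid
```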

## CuTe

@@ -253,6 +328,41 @@ Function from Coordinate to Index: `idx = inner_product(coord, stride)`
| 2D Grid<br>`[[a, b, c],`<br>` [d, e, f]]` | Padded Col-major<br>Shape: `(2,3)`<br>Stride: `(1,4)` | `[a, d, _, _, b, e, _, _, c, f, _, _]`<br>*(Includes gaps/padding)* | `idx = i*1 + j*4` |
| 3D Tensor<br>Layer 0: `[[a, b], [c, d]]`<br>Layer 1: `[[e, f], [g, h]]` | Tensor layout<br>Shape: `(2,2,2)`<br>Stride: `(4,1,2)` | `[a, b, e, f, c, d, g, h]` | `idx = inner_product(coord, stride)` |

## Functions

`cute::cosize_v<CuteLayout>`: Compile-time constant giving the cosize of a layout. Cosize is the minimum number of elements needed to store all elements addressed by the layout, accounting for potentially non-contiguous access patterns (strides > 1). For contiguous layouts, cosize equals size.
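
For example, the padded column-major layout from the table above (shape `(2,3)`, stride `(1,4)`) has size 6, but its largest reachable index is `1*1 + 2*4 = 9`, so its cosize is 10. A compile-time sketch, assuming the CuTe headers from CUTLASS are on the include path:

```cpp
#include <cute/layout.hpp>
using namespace cute;

// Padded column-major layout: shape (2,3), stride (1,4)
using PaddedLayout = Layout<Shape<_2, _3>, Stride<_1, _4>>;

// 6 logical elements, but the largest reachable index is 9,
// so 10 contiguous slots are needed to back the layout
static_assert(size(PaddedLayout{}) == 6);
static_assert(cosize(PaddedLayout{}) == 10);
static_assert(cosize_v<PaddedLayout> == 10);
```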

`cute::ArrayEngine<Type, N>`: Fixed-size array storage class.

`CUTE_DEVICE` - Macro that expands to `__device__` for CUDA, marking the function as callable only from GPU code.

`make_tensor`: Creates a tensor view. A tensor in CuTe is a pointer + layout pair. It doesn't own memory; it just views it.
Parameters:
- `ptr`: A pointer (raw or CuTe smart pointer) to the data
- `layout`: A CuTe Layout describing the shape and memory access pattern

`make_smem_ptr(ptr)`: SMEM requires a special pointer type, so this function wraps a raw pointer to indicate that it points to shared memory (SMEM). This enables CuTe to select optimal copy operations and generate SMEM-specific PTX. It returns a special pointer type that carries SMEM address-space information.

```cpp
make_tensor(make_smem_ptr(A.begin()), ASmemLayout{});
```

`tiled_divide`:

![tiled_divide of a 4×6 layout into 2×3 tiles](../../images/cute_tiled_divide.png)

```cpp
// Example: divide a 4x6 layout
auto layout_4x6 = make_layout(make_shape(Int<4>{}, Int<6>{})); // (4, 6)
auto tile_2x3   = make_tile(Int<2>{}, Int<3>{});               // Tile: (2, 3)

auto result_2d = tiled_divide(layout_4x6, tile_2x3);
// Result shape: ((2, 3), 2, 2)
// - Inner mode (2,3): elements within each tile
// - Outer modes 2,2 : grid of tiles (4/2=2 tiles in M, 6/3=2 tiles in N)
```

### Examples

Compile on DGX B200

css/thoughts.css

Lines changed: 135 additions & 1 deletion
@@ -1159,14 +1159,15 @@
   display: flex;
   align-items: center;
   gap: 0.75rem;
-  margin: 0.25rem 0;
+  margin: 0.5rem 0;
   padding: 0.75rem 1rem;
   background: var(--code-bg);
   border: 1px solid var(--border);
   border-radius: 10px;
   text-decoration: none;
   color: var(--text);
   transition: border-color 0.2s, background 0.2s;
+  overflow: hidden;
 }

 .link-preview:hover {
@@ -1175,6 +1176,53 @@
   text-decoration: none;
 }

+/* Rich link preview with image */
+.link-preview.rich {
+  flex-direction: column;
+  align-items: stretch;
+  padding: 0;
+  gap: 0;
+}
+
+.link-preview-image {
+  width: 100%;
+  height: 160px;
+  object-fit: cover;
+  border-radius: 9px 9px 0 0;
+}
+
+.link-preview-body {
+  display: flex;
+  align-items: center;
+  gap: 0.75rem;
+  padding: 0.75rem 1rem;
+}
+
+.link-preview.rich .link-preview-content {
+  gap: 0.25rem;
+}
+
+.link-preview-title {
+  font-weight: 600;
+  font-size: 0.95rem;
+  color: var(--text);
+  display: -webkit-box;
+  -webkit-line-clamp: 2;
+  -webkit-box-orient: vertical;
+  overflow: hidden;
+  line-height: 1.3;
+}
+
+.link-preview-description {
+  font-size: 0.85rem;
+  color: var(--text-secondary);
+  display: -webkit-box;
+  -webkit-line-clamp: 2;
+  -webkit-box-orient: vertical;
+  overflow: hidden;
+  line-height: 1.4;
+}
+
 .link-preview-favicon {
   width: 24px;
   height: 24px;
@@ -1196,6 +1244,14 @@
   color: var(--text);
 }

+.link-preview-site {
+  font-size: 0.8rem;
+  color: var(--text-secondary);
+  display: flex;
+  align-items: center;
+  gap: 0.5rem;
+}
+
 .link-preview-url {
   font-size: 0.8rem;
   color: var(--text-secondary);
@@ -1216,6 +1272,45 @@
   color: var(--accent);
 }

+/* Embed containers (YouTube, Twitter) */
+.embed-container {
+  margin: 0.5rem 0;
+  border-radius: 10px;
+  overflow: hidden;
+}
+
+.youtube-embed {
+  position: relative;
+  padding-bottom: 56.25%; /* 16:9 aspect ratio */
+  height: 0;
+  background: var(--code-bg);
+}
+
+.youtube-embed iframe {
+  position: absolute;
+  top: 0;
+  left: 0;
+  width: 100%;
+  height: 100%;
+  border-radius: 10px;
+}
+
+.twitter-embed {
+  background: var(--code-bg);
+  border: 1px solid var(--border);
+  border-radius: 10px;
+  padding: 1rem;
+  min-height: 100px;
+}
+
+.twitter-embed .twitter-tweet {
+  margin: 0 !important;
+}
+
+.twitter-embed a {
+  color: var(--accent);
+}
+
 /* Code blocks */
 .post-content .code-block {
   margin: 0.75rem 0 0.25rem 0;
@@ -1419,11 +1514,33 @@
   gap: 0.5rem;
 }

+.link-preview.rich {
+  padding: 0;
+}
+
+.link-preview-image {
+  height: 140px;
+}
+
+.link-preview-body {
+  padding: 0.6rem 0.75rem;
+  gap: 0.5rem;
+}
+
 .link-preview-favicon {
   width: 20px;
   height: 20px;
 }

+.link-preview-title {
+  font-size: 0.9rem;
+}
+
+.link-preview-description {
+  font-size: 0.8rem;
+  -webkit-line-clamp: 1;
+}
+
 .link-preview-domain {
   font-size: 0.85rem;
 }
@@ -1432,6 +1549,23 @@
   font-size: 0.75rem;
 }

+/* Embeds on mobile */
+.embed-container {
+  margin-left: -1rem;
+  margin-right: -1rem;
+  border-radius: 0;
+}
+
+.youtube-embed iframe {
+  border-radius: 0;
+}
+
+.twitter-embed {
+  border-radius: 0;
+  border-left: none;
+  border-right: none;
+}
+
 /* Code blocks on mobile */
 .post-content .code-block {
   padding: 0.75rem;

images/cute_tiled_divide.png

187 KB
