+++
date = '2026-02-24T11:06:26+01:00'
draft = false
title = 'Avoid the RAM Latency: Keeping the Cache Hot and on Linear Access is the Ultimate C++ Optimization'
summary = 'In this benchmark, we explore the importance of keeping data within the CPU cache to avoid expensive retrieval from RAM. By simply ensuring linear data access and accessing by blocks that fit in L1 and L2, we can achieve massive performance gains without changing the underlying algorithm.'
tags = ["advanced-level", "HPC", "cache-locality", "performance", "blocking", "tiling", "simd", "DOD", "AoS", "perf"]
+++

In this benchmark, we explore the importance of keeping data within the CPU cache to avoid expensive retrieval from RAM. By simply ensuring **linear data access**, we can achieve massive performance gains without changing the underlying algorithm. The same principle drives **Data Oriented Design (DOD)** with its **Array of Structures (AoS)** and **Structure of Arrays (SoA)** layouts, since we lay out all our data so that it is traversed linearly. As a bonus, **DOD** designs also avoid the runtime cost of dynamic dispatch that comes with polymorphism. Here, however, we will focus only on the benefit of keeping the cache hot, and we will demonstrate the performance gain on a simple matrix multiplication example using the **perf** tool and **google-benchmark**.
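To make the **AoS** vs **SoA** distinction concrete, here is a minimal sketch (the `Particle` names are hypothetical, not from the benchmark below): with SoA, a loop that reads only one field streams through memory linearly, so every cache line fetched is fully used.

```cpp
#include <vector>

// AoS: the fields of each element are interleaved in memory.
// A loop that reads only `x` drags y, z and mass through the cache too.
struct ParticleAoS { float x, y, z, mass; };

// SoA: every field is its own contiguous array.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// Summing one field over SoA touches only the `x` array,
// so the access pattern is perfectly linear.
float sum_x(const ParticlesSoA& p) {
    float s = 0.0f;
    for (float v : p.x) s += v;
    return s;
}
```

With AoS, the same sum would use only a quarter of every 64-byte line it pulls in.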

We will have 3 scenarios:

1. a naive multiplication that does not access the data linearly,
2. one that does access the data linearly; we will see how much speed we gain from this small change alone,
3. and finally an improved version that accesses the data in **blocks** sized to fit in the L1 cache (**tiling**).

(Note that similar techniques are used to fit data into a single cache line when reading or modifying it, aligning to 64 bytes, the typical cache-line size. Since C++17 we also have `std::hardware_constructive_interference_size`, but on most machines this is simply 64 bytes anyway.)

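The note above can be made concrete with a small sketch; the literal 64 is an assumption about the target machine, and you could substitute the interference-size constants from `<new>` where your standard library defines them:

```cpp
// Align the struct to a 64-byte cache line so that two adjacent
// instances never share a line (avoids false sharing between threads).
struct alignas(64) PaddedCounter {
    long value = 0;
};

// sizeof is rounded up to the alignment, so each instance
// occupies exactly one cache line.
static_assert(alignof(PaddedCounter) == 64, "aligned to a cache line");
static_assert(sizeof(PaddedCounter) == 64, "padded to a full line");
```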
@@ -315,18 +314,21 @@ When we just move data from RAM to cache and do not perform any calculations on |
---

### Deeper Dive with the perf Tool

Let's see what is going on for the scenario with the big N=2048 matrices that will not fit in our cache at all. With the flags below we can see the loads, hits, and misses for the L1, L2, and L3 caches. We will analyse the three implementations.

The command below enables kernel-level profiling and access to the hardware counters:

```bash
sudo sysctl -w kernel.perf_event_paranoid=-1
```

### First, the Naive Version

```bash
@@ -364,7 +366,11 @@ LLC-load-misses: 8403483781 96969310198 48489258590 |
```

LLC-load-misses are the L3 misses, and here the miss rate is 95%. This is HUGE: almost every single time the CPU looks for data in the L1, L2, or L3 cache, it isn't there. The CPU has to stop everything and wait for RAM, which also yields a really bad instructions-per-cycle (IPC) figure of 0.19.

### Linear Access

```bash
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,\
@@ -405,9 +411,18 @@ LLC-load-misses: 147854096 5791414256 2892376125 |

```

- L1 miss rate: dropped from 95% to 26%.
- LLC miss rate: dropped from 94% (naive) to 21%.

Now that we access memory in a straight line, the prefetcher can guess what we need next; it starts pulling data from RAM before we even ask for it.

The instructions per cycle (IPC) jumped to 0.59. The code is about 3x faster, but the CPU is still stalling, for the reason described above: when we move on to multiply the next row, we need the data from the start of that row in cache AGAIN. They were there earlier, but we ran out of L3 space, so they were evicted in LRU order, and the same data has to come from RAM once more. That is what keeps the IPC mediocre.

### Blocking/Tiling


```bash
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,\
@@ -449,25 +464,29 @@ LLC-load-misses: 3111522 4600547279 2298895236 |
```

- LLC-load-misses are now down to 3.5%. The CPU rarely needs to go to RAM; what it needs is almost always already in cache.
- For every clock cycle we now get more than one instruction: the IPC went from 0.59 to 1.17. **Excellent!**

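For reference, the blocked traversal measured above can be sketched as follows (my own minimal version, not the exact benchmark code; `BS = 32` assumes a 32 KiB L1d, since three 32x32 tiles of doubles take 24 KiB):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t BS = 32;  // tile edge: 3 tiles of 32x32 doubles = 24 KiB

// Blocked (tiled) multiplication over flat row-major matrices:
// the three active BS x BS tiles stay resident in L1/L2 while we
// finish all the work on them, instead of being evicted mid-row.
// C must be zero-initialized, since we accumulate into it.
void multiply_tiled(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, std::size_t N) {
    for (std::size_t ii = 0; ii < N; ii += BS)
        for (std::size_t kk = 0; kk < N; kk += BS)
            for (std::size_t jj = 0; jj < N; jj += BS)
                // multiply the (ii,kk) tile of A by the (kk,jj) tile of B,
                // accumulating into the (ii,jj) tile of C
                for (std::size_t i = ii; i < std::min(ii + BS, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, N); ++k) {
                        const double a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + BS, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

The `std::min` guards handle matrix sizes that are not multiples of the tile edge.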
|
455 | 473 | --- |
456 | 474 |
|
457 | 475 |
|
458 | 476 |
|
## Conclusion

When writing high-performance code, how you traverse your data is often more important than the algorithm itself. Always benchmark: imagine a huge computational system where these iterations run many times on different data; all that extra waiting time adds up to a big difference. Also, **DO NOT guess, MEASURE directly**. `perf` is an excellent tool for seeing cache misses and hits and for **understanding the hardware**. Keep the data linear, and in cache, as much as possible.

### Notes

- `std::mdspan`, introduced in C++23, provides a multidimensional view over flat storage and makes the blocked/tiled indexing we implemented manually much cleaner to express.
- **Data Oriented Design (DOD)** aims to benefit exactly from this. We might see such an example in a future article.