
Commit 3434868

Author: Konstantinos Diamantis
Commit message: Analysis of cache hits in blocking
1 parent 199bc9d commit 3434868

1 file changed

Lines changed: 35 additions & 16 deletions

File tree

content/posts/hpc-cache-locality.md

@@ -1,21 +1,20 @@
+++
-date = '2026-02-12T11:06:26+01:00'
+date = '2026-02-24T11:06:26+01:00'
draft = false
title = 'Avoid the RAM Latency: Keeping the Cache Hot and on Linear Access is the Ultimate C++ Optimization'
summary = 'In this benchmark, we explore the importance of keeping data within the CPU cache to avoid expensive retrieval from RAM. By simply ensuring linear data access and accessing by blocks that fit in L1 and L2, we can achieve massive performance gains without changing the underlying algorithm.'
tags = ["advanced-level", "HPC", "cache-locality", "performance", "blocking", "tiling", "simd", "DOD", "AoS", "perf"]
+++

-TODO:
-// std::span
+

In this benchmark, we explore the importance of keeping data within the CPU cache to avoid expensive retrieval from RAM. By simply ensuring **linear data access**, we can achieve massive performance gains without changing the underlying algorithm. This principle is also used in **Data-Oriented Design (DOD)** with **Array of Structures (AoS)** or **Structure of Arrays (SoA)** layouts, since we lay out all our data so it can be traversed linearly. A further benefit of **DOD** designs is that we also avoid the runtime dynamic dispatch of polymorphism. Here, though, we will focus just on the benefit of keeping the cache hot, and we will demonstrate the performance gain on a simple matrix multiplication example using the **perf** tool and **google-benchmark**.


We will have 3 scenarios:

1. one bad multiplication where we do not access the data linearly,
2. then one where we do access the data linearly. We will notice how much speed we can gain just from this small change.
-3. Then we will try to improve it even more, accessing in **blocks** of size that fit in cache L1/L2 (**tiling**).
+3. Then we will try to improve it even more, accessing in **blocks** of a size that fits in the L1 cache (**tiling**).

(Note that similar techniques are used to fit data within the cache line when reading or modifying data, aligning to 64 bytes, which is the cache line size. Since C++17 we also have `std::hardware_constructive_interference_size`, but on most machines this is 64 bytes anyway.)
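The kernels themselves are not shown in this hunk, so here is a minimal sketch of what the three variants typically look like. The function names and the layout are my assumptions, not the commit's actual code: square N×N row-major matrices stored in flat `std::vector<float>` buffers.

``` cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// 1. Naive (i-j-k): the inner loop walks B down a column, i.e. stride-N
//    jumps through memory on every iteration. Cache-hostile.
void multiply_naive(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < N; ++k)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

// 2. Linear (i-k-j): every access in the inner loop is stride-1, so the
//    prefetcher can stream B and C. C must start zero-initialized.
void multiply_linear(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            const float a = A[i * N + k];
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
}

// 3. Tiled/blocked: same i-k-j order, but restricted to BS x BS sub-blocks
//    so the working set stays resident in L1. C must start zero-initialized.
void multiply_tiled(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, std::size_t N, std::size_t BS = 64) {
    for (std::size_t ii = 0; ii < N; ii += BS)
        for (std::size_t kk = 0; kk < N; kk += BS)
            for (std::size_t jj = 0; jj < N; jj += BS)
                for (std::size_t i = ii; i < std::min(ii + BS, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, N); ++k) {
                        const float a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + BS, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

All three compute the same product; the only difference between the linear and the tiled version is the extra block loops that bound the working set.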

@@ -315,18 +314,21 @@ When we just move data from RAM to cache and do not perform any calculations on
---


-### Deeper Dive with perf
+### Deeper Dive with the perf Tool


-Let's see what is going on for the scenario with matrixes of the big N=2048 that will not fit in our cache at all.
+Let's see what is going on in the scenario with the big N=2048 matrices that will not fit in our cache at all. With the flags below we can see all the cache references and hits on L1, L2 and L3. We will analyse the 3 implementations:


+The command below enables kernel-level profiling and access to the hardware counters.


-With the below flags we can see all the cache and hits on L1, L2 and L3. We will analyse the 3 implementations:
+``` bash
+sudo sysctl -w kernel.perf_event_paranoid=-1
+```


-First the naive:
+### First, the Naive:


``` bash
@@ -364,7 +366,11 @@ LLC-load-misses: 8403483781 96969310198 48489258590
```


-The Perf - Linear Access:
+LLC-load-misses are the L3 misses, and here the miss rate is 95%. This is HUGE. Almost every single time the CPU looks for data in the L1, L2 or L3 cache, it isn't there. The CPU has to stop everything and wait for the RAM, and this also gives a really, really bad instructions per cycle (IPC) of 0.19.
+
+
+
+### The Perf - Linear Access:

```bash
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,\
@@ -405,9 +411,18 @@ LLC-load-misses: 147854096 5791414256 2892376125

```

+- L1 Miss Rate: Dropped from 95% to 26%.
+
+- LLC Miss Rate: Dropped from 94% (Naive) to 21%.
+
+Now that we are accessing memory in a straight line, the prefetcher can guess what we need next. It starts pulling data from RAM before we even ask for it.
+
+The instructions per cycle (IPC) jumped to 0.59. It is 3x faster, but the CPU is still stalling, for the reason described above: once we need data for the next row multiplication, we need the start of that row in cache again. Previously it was already there, but we were out of L3 space, so it was evicted from L3 in LRU order. This means we fetch the SAME data from RAM AGAIN, and that is also why the IPC is still not great.


-The Blocking/Tilling:
+
+
+### The Blocking/Tiling:

``` bash
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,\
@@ -449,25 +464,29 @@ LLC-load-misses: 3111522 4600547279 2298895236
```


+- LLC-load-misses are now down to 3.5%. The CPU rarely needs to go to RAM; what it needs is almost always already in cache.
+- Now we get more than one instruction per clock cycle: the IPC went from 0.59 to 1.17. **Excellent!**
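How large should the tiles be? A minimal back-of-the-envelope sketch, assuming a typical 32 KiB L1 data cache and `float` elements (both are assumptions, not measurements from the post's machine; check yours with `getconf -a | grep CACHE`): pick the largest power-of-two tile size such that three tiles, one each from A, B and C, still fit in L1.

``` cpp
#include <cstddef>

// ASSUMPTION: a typical 32 KiB L1 data cache; not measured on the post's CPU.
constexpr std::size_t kL1DataCacheBytes = 32 * 1024;

// Largest power-of-two tile size BS such that three BS x BS float tiles
// (one each from A, B and C) fit in the assumed L1 data cache.
constexpr std::size_t pick_block_size() {
    std::size_t bs = 1;
    while (3 * (2 * bs) * (2 * bs) * sizeof(float) <= kL1DataCacheBytes)
        bs *= 2;
    return bs;
}
```

For these assumptions this lands on a 32×32 tile (3 × 32 × 32 × 4 bytes = 12 KiB), leaving headroom in L1 for the stack and other hot data.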



---



+## Conclusion:

+When writing High Performance code, how you traverse your data is often more important than the algorithm itself. Benchmark always: imagine a huge computational system where these iterations run many times on different data; summing all that extra waiting time up makes a big difference. Also, **DO NOT guess, MEASURE directly**. `perf` is an excellent tool to see the cache misses and hits and to **understand the hardware**. Keep the data linear, and in cache, as much as possible.

-## Conclusion:

-When writing HPC code, how you traverse your data is often more important than the algorithm itself. Keep them linear, and in cache. **DOD** aims to benefit exactly from this. We might see such an example in a future article.


-We do in order to..
+### Notes
+
+
+- `std::mdspan`, introduced in C++23, can take over the tiling/blocking index arithmetic that we implemented manually
+- **Data Oriented Design (DOD)** aims to benefit exactly from this. We might see such an example in a future article.


-``` bash
-sudo sysctl -w kernel.perf_event_paranoid=-1
-```