+++
date = '2026-02-12T11:06:26+01:00'
draft = false
title = 'Avoid the RAM Latency: Keeping the Cache Hot and on Linear Access is the Ultimate C++ Optimization'
summary = 'In this benchmark, we explore the importance of keeping data within the CPU cache to avoid expensive retrieval from RAM. By simply ensuring linear data access and accessing by blocks that fit in L1 and L2, we can achieve massive performance gains without changing the underlying algorithm.'
tags = ["advanced-level", "HPC", "cache-locality", "performance", "tiling", "simd"]
+++

In this benchmark, we explore the importance of keeping data within the CPU cache to avoid expensive retrieval from RAM. By simply ensuring **linear data access**, we can achieve massive performance gains without changing the underlying algorithm. We will demonstrate this effect in a simple matrix multiplication example.

## The Core Concept: CPU Cache vs. RAM

Data access speed is largely determined by physical distance and the memory hierarchy. When we walk through arrays linearly, the CPU can effectively "predict" which data we need next and load it ahead of time.

| Memory Level | Latency (approx.) | CPU Cycles (approx.) |
| :--- | :--- | :--- |
| L1 Cache | ~1 ns | 4–5 cycles |
| L2 Cache | ~4 ns | 12–15 cycles |
| L3 Cache | ~10–40 ns | 40–60 cycles |
| Main RAM | ~100 ns+ | 200–300+ cycles |
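
To see the effect in isolation before touching any matrices, here is a minimal, self-contained sketch. It is not part of the benchmark code below; the array size and the plain `std::chrono` timing are arbitrary choices for illustration. It sums the same N×N array once row by row (linear access) and once column by column (a stride of N elements):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Sum the same N*N doubles twice: linearly (row by row) and with a stride
// of N elements (column by column). Only the traversal order changes.
int main() {
    const int N = 2048;  // 2048 * 2048 doubles = 32 MB, larger than a typical L3
    std::vector<double> data(static_cast<std::size_t>(N) * N, 1.0);

    auto time_sum = [&](bool linear) {
        double sum = 0.0;
        const auto start = std::chrono::steady_clock::now();
        for (int outer = 0; outer < N; ++outer)
            for (int inner = 0; inner < N; ++inner)
                sum += linear ? data[outer * N + inner]   // consecutive addresses
                              : data[inner * N + outer];  // jumps of N * sizeof(double) bytes
        const auto stop = std::chrono::steady_clock::now();
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
        std::printf("%s traversal: sum = %.0f, %lld ms\n",
                    linear ? "linear " : "strided", sum, static_cast<long long>(ms));
    };

    time_sum(true);   // prefetcher-friendly
    time_sum(false);  // cache-hostile
}
```

Both runs perform exactly the same additions; only the memory access pattern differs, and on a typical machine the strided traversal is several times slower.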

## The Code: Naive vs. Optimized Matrix Multiplication

We are comparing two versions of a matrix multiplication. The only difference is the order of the nested loops, which dictates how we traverse memory.

```cpp
#include <algorithm>
#include <vector>
#include <benchmark/benchmark.h>

// Simple and naive
template<typename T>
void multiply_naive(const std::vector<T>& a, const std::vector<T>& b, std::vector<T>& result, int const N) {
    std::fill(result.begin(), result.end(), 0);

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            for (int k = 0; k < N; ++k) {
                // b[k * N + j] is accessed with a stride of N, while vector "a" has linear access
                result[i * N + j] += a[i * N + k] * b[k * N + j]; // The prefetcher cannot really help with "b" here, since it prefetches linearly
            }
        }
    }
}
```

Now the improved version, where we only change the loop order so that the innermost loop walks memory with a stride of 1:

```cpp
// Performance version
template<typename T>
void multiply_performance(const std::vector<T>& a, const std::vector<T>& b, std::vector<T>& result, int const N) {
    std::fill(result.begin(), result.end(), 0);

    for (int i = 0; i < N; ++i) {
        for (int k = 0; k < N; ++k) {
            // Cache a[i][k] in a register since it's constant for the j-loop
            auto temp = a[i * N + k];
            for (int j = 0; j < N; ++j) {
                // Now accessing result[i][j] and b[k][j] linearly
                result[i * N + j] += temp * b[k * N + j]; // The prefetcher can load the next elements ahead of time
            }
        }
    }
}
```

Now we can use tiling to split the work into blocks that fit in L1 and L2 and save some extra CPU cycles:

```cpp
template<typename T>
void multiply_performance_tiling(const std::vector<T>& a, const std::vector<T>& b, std::vector<T>& result, int const N) {
    std::fill(result.begin(), result.end(), 0);

    // Choose a block size. 32 or 64 is often a sweet spot for modern CPUs.
    const int BLOCK_SIZE = 32;

    // Outer loops: iterate over tiles
    for (int i_tile = 0; i_tile < N; i_tile += BLOCK_SIZE) {
        for (int k_tile = 0; k_tile < N; k_tile += BLOCK_SIZE) {
            for (int j_tile = 0; j_tile < N; j_tile += BLOCK_SIZE) {

                // Inner loops: perform the multiplication within the tiles
                // Note: std::min handles cases where N is not perfectly divisible by BLOCK_SIZE
                for (int i = i_tile; i < std::min(i_tile + BLOCK_SIZE, N); ++i) {
                    for (int k = k_tile; k < std::min(k_tile + BLOCK_SIZE, N); ++k) {

                        auto temp = a[i * N + k];
                        int row_i = i * N;
                        int row_k = k * N;

                        for (int j = j_tile; j < std::min(j_tile + BLOCK_SIZE, N); ++j) {
                            result[row_i + j] += temp * b[row_k + j];
                        }
                    }
                }

            }
        }
    }
}
```
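
How big should `BLOCK_SIZE` be? A reasonable rule of thumb is that the three tiles touched by the inner loops (one each from `a`, `b` and `result`) should fit in the L1 data cache at the same time. Here is a minimal compile-time sanity check of that assumption; the 48 KiB figure is the per-core L1 data cache of the machine used for the benchmarks below, so adjust it for yours:

```cpp
#include <cstddef>

// Three BLOCK_SIZE x BLOCK_SIZE tiles of doubles should fit in L1d together.
constexpr std::size_t kL1DataBytes = 48 * 1024;  // per-core L1 data cache
constexpr int kBlockSize = 32;                   // same value as BLOCK_SIZE above

static_assert(3u * kBlockSize * kBlockSize * sizeof(double) <= kL1DataBytes,
              "three tiles should fit in the L1 data cache");
```

With `kBlockSize = 32` the three tiles take 3 × 32 × 32 × 8 B = 24 KiB, comfortably below the 48 KiB limit; 64 would already need 96 KiB and spill out of L1 on this machine (though it may still fit in L2).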

```cpp
// Now the benchmarks for all of the above:
template <typename T>
static void BM_Multiply_Perf_Tiling_Template(benchmark::State& state) {
    int N = state.range(0);

    std::vector<T> m1(N * N, 10.41);
    std::vector<T> m2(N * N, 20.09);
    std::vector<T> m3(N * N);

    for (auto _ : state) {
        multiply_performance_tiling(m1, m2, m3, N);
        benchmark::DoNotOptimize(m3.data());
    }
}

template <typename T>
static void BM_Multiply_Naive_Template(benchmark::State& state) {
    int N = state.range(0);

    std::vector<T> m1(N * N, 10.41);
    std::vector<T> m2(N * N, 20.09);
    std::vector<T> m3(N * N);

    for (auto _ : state) {
        multiply_naive(m1, m2, m3, N);
        benchmark::DoNotOptimize(m3.data());
    }
}

template <typename T>
static void BM_Multiply_Perf_Template(benchmark::State& state) {
    int N = state.range(0);

    std::vector<T> m1(N * N, 10.41);
    std::vector<T> m2(N * N, 20.09);
    std::vector<T> m3(N * N);

    for (auto _ : state) {
        multiply_performance(m1, m2, m3, N);
        benchmark::DoNotOptimize(m3.data());
    }
}

// Tests with N = 1024
//
// One 1024 x 1024 matrix of doubles is 1024 * 1024 * 8 B = 8 MB, so it fits in L3 (12 MB on my machine)
BENCHMARK_TEMPLATE(BM_Multiply_Naive_Template, double)->Arg(1024);
BENCHMARK_TEMPLATE(BM_Multiply_Perf_Template, double)->Arg(1024);
BENCHMARK_TEMPLATE(BM_Multiply_Perf_Tiling_Template, double)->Arg(1024);
BENCHMARK_MAIN();
```
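
If you want to see tiling pay off more clearly, you can also register larger sizes, where even a single matrix no longer fits in L3. These extra registrations are only a suggestion and were not part of the measured run below:

```cpp
// At N = 4096 one matrix of doubles is 128 MB, far larger than L3,
// so the tiled version has much more room to shine.
BENCHMARK_TEMPLATE(BM_Multiply_Naive_Template, double)->Arg(2048)->Arg(4096);
BENCHMARK_TEMPLATE(BM_Multiply_Perf_Template, double)->Arg(2048)->Arg(4096);
BENCHMARK_TEMPLATE(BM_Multiply_Perf_Tiling_Template, double)->Arg(2048)->Arg(4096);
```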

---

Compile with:

```zsh
g++ -O3 cache_locality_matrix.cpp -o cache_perf_test.exe -lbenchmark -lpthread
```

We use the aggressive `-O3` optimization level so that, on top of the hardware prefetcher, the compiler can apply SIMD vectorization and give us even better results.

## Benchmark Results

The following hardware was used to run these benchmarks. You can find the code in my GitHub repo and run it on your machine; of course, results will vary per hardware.

| Component | Specification |
| :--- | :--- |
| **CPU** | 12 X 2712 MHz |
| **L1 Data Cache** | 48 KiB (x6) |
| **L1 Instruction Cache** | 32 KiB (x6) |
| **L2 Unified Cache** | 512 KiB (x6) |
| **L3 Unified Cache** | 12288 KiB (x1) |

| Benchmark | Time (Wall) | CPU Time | Iterations |
| :--- | :--- | :--- | :--- |
| `BM_Multiply_Naive<double>/1024` | 1,946,863,116 ns | 1,946,797,038 ns | 1 |
| `BM_Multiply_Perf<double>/1024` | 379,033,700 ns | 379,018,676 ns | 2 |

We compared a **Naive** matrix multiplication (`i-j-k` loops) **vs.** an **Optimized** version (`i-k-j` loops).

---

## Analysis: Why the 5.1x Speedup?

We got **~5.1x faster** calculations just by changing the loop order.

### 1. Linear Access vs. Strided Access

In the **Naive** implementation we iterate in `i-j-k` order. In this pattern, the access to `b` jumps ahead by `N` elements every time `k` is incremented. This is "strided" access, the enemy of the CPU cache: the prefetcher loads the elements that follow the current one, but they are useless to us, because the next element we actually need is a whole row further away. With doubles and `N = 1024`, consecutive accesses to `b` are 8 KiB apart, so practically every access lands on a new cache line. The cache cannot hold all of them, so the CPU keeps fetching from RAM, slowing us down.

In the **Performance** version we iterate in `i-k-j` order, so `b` is accessed **linearly**. The hardware prefetcher can now do its job: while we operate on element `j`, the CPU is already loading elements `j+1` and `j+2` into the cache before they are even requested. This saves the extra cycles of "walking" all the way to RAM.

### 2. SIMD Vectorization

Because the data is contiguous, the compiler (especially with `-O3`) can use **AVX instructions** for **SIMD (Single Instruction, Multiple Data)** vectorization, performing several multiplications and additions with a single instruction.
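
As a rough illustration of what that auto-vectorization boils down to, here is a hand-written AVX sketch of the inner `j`-loop of `multiply_performance`. This is not what the compiler literally emits; it assumes `T = double`, an AVX-capable CPU (e.g. compiled with `-march=native`), and the helper name `row_update_avx` is mine:

```cpp
#include <immintrin.h>

// AVX sketch of: for (j) result[i*N + j] += temp * b[k*N + j];
// Four doubles are processed per iteration; a scalar loop handles the tail.
inline void row_update_avx(double* result_row, const double* b_row, double temp, int N) {
    const __m256d temp_vec = _mm256_set1_pd(temp);        // broadcast a[i*N + k] into all 4 lanes
    int j = 0;
    for (; j + 4 <= N; j += 4) {
        __m256d b_vec = _mm256_loadu_pd(b_row + j);       // 4 contiguous doubles of b
        __m256d r_vec = _mm256_loadu_pd(result_row + j);  // 4 contiguous doubles of result
        r_vec = _mm256_add_pd(r_vec, _mm256_mul_pd(temp_vec, b_vec));
        _mm256_storeu_pd(result_row + j, r_vec);
    }
    for (; j < N; ++j) {                                  // tail when N is not a multiple of 4
        result_row[j] += temp * b_row[j];
    }
}
```

None of this would be possible with strided access to `b`: the vector loads above only work because the four doubles sit next to each other in memory.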

## Conclusion

When writing HPC code, how you traverse your data is often more important than the algorithm itself. Keep your accesses linear, and your data in the cache.
