
Commit 4b59e56

Merge pull request #2763 from kieranhejmadi01/pinning-threads
LP - Getting Started with CPU affinity
2 parents 33ad1a3 + 58cb1bc commit 4b59e56

10 files changed

Lines changed: 654 additions & 0 deletions


Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
---
title: Getting Started with CPU Affinity

minutes_to_complete: 30

who_is_this_for: Developers, performance engineers, and system administrators looking to fine-tune the performance of their workloads on many-core Arm-based systems.

learning_objectives:
- Create CPU sets and apply them directly in source code
- Understand the performance tradeoffs of pinning threads with CPU affinity masks

prerequisites:
- An intermediate understanding of multi-threaded, object-oriented programming in C++ and Python
- A foundational understanding of build systems and computer architecture

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
tools_software_languages:
- C++
- Python
operatingsystems:
- Linux

further_reading:
- resource:
    title: Taskset Manual
    link: https://man7.org/linux/man-pages/man1/taskset.1.html
    type: documentation


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
---
title: Background Information
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction

CPU affinity is the practice of binding a process or thread to a specific CPU core or set of cores, telling the operating system scheduler where that work is allowed to run. By default, the Linux scheduler dynamically migrates threads across cores to balance load and maximize overall throughput. Pinning overrides this behavior by constraining execution to a chosen set of cores.

Pinning is most often used as a fine-tuning technique for workloads that aim to consume as many CPU cycles as possible while running alongside other demanding applications. Scientific computing pipelines and real-time analytics frequently fall into this category. Applications that pin processes to specific cores are typically sensitive to latency variation rather than just average throughput, or have intricate memory access patterns. Pinning can reduce this noise and provide more consistent execution behavior and better memory locality under load.

Another important motivation is memory locality. On modern systems with Non-Uniform Memory Access (NUMA) architectures, cores have different memory access times and characteristics depending on where the data is fetched from. For example, a server with 2 CPU sockets may appear to the programmer as a single processor, yet memory access latency differs depending on which core issues the access and which socket's memory holds the data. By pinning threads to cores that are close to the memory they use, and allocating memory accordingly, an application can reduce memory access latency and improve bandwidth.

Developers can set affinity directly in source code using system calls. Many parallel frameworks expose higher-level controls, such as OpenMP affinity settings, that manage thread placement automatically. Alternatively, at runtime, system administrators can pin existing processes using utilities like `taskset`, or launch applications with `numactl` to control both CPU and memory placement without modifying code.

Pinning is a tradeoff. It can improve determinism and locality, but it can also reduce flexibility and hurt performance if the chosen layout is suboptimal or if system load changes. Over-constraining the scheduler may leave cores idle while pinned threads contend unnecessarily. As a general rule, rely on the operating system scheduler as a first pass and introduce pinning only when you are looking to fine-tune performance.
Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
---
title: Setup
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Setup

In this example we will be using an AWS Graviton3 `m7g.4xlarge` instance running Ubuntu 22.04 LTS, based on the Arm Neoverse V1 architecture. If you are unfamiliar with creating a cloud instance, please refer to our [getting started learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/).

This learning path is expected to work on any Linux-based Arm instance with 4 or more CPU cores. The `m7g.4xlarge` instance has a uniform processor architecture, so there is negligible difference in memory or CPU core performance across the cores. On Linux, this can easily be checked with the following command.

```bash
lscpu | grep -i numa
```

For our `m7g.4xlarge`, all 16 cores are in the same NUMA (Non-Uniform Memory Access) node.

```out
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
```

First, we will demonstrate how to pin threads easily using the `taskset` utility available on Linux. It sets or retrieves the CPU affinity of a running process, or sets the affinity of a process about to be launched, without requiring any modifications to the source code.
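A couple of illustrative invocations (`./prog` stands in for the example program built later in this learning path):

```shell
# Launch a command restricted to cores 0-3
taskset -c 0-3 ./prog

# Show the affinity mask of an existing process (here, the current shell)
taskset -p $$
```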

## Install Prerequisites

Run the following commands:

```bash
sudo apt update && sudo apt install g++ cmake python3.12-venv -y
```

Install Google's microbenchmarking support library.

```bash
# Check out the library.
git clone https://github.com/google/benchmark.git
# Go to the library root directory.
cd benchmark
# Make a build directory to place the build output.
cmake -E make_directory "build"
# Generate build system files with cmake, and download any dependencies.
cmake -E chdir "build" cmake -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release ../
# Build and install the library.
sudo cmake --build "build" --config Release --target install -j $(nproc)
```

If you have issues building and installing, please refer to the [official installation guide](https://github.com/google/benchmark).

Finally, you will need to install the Linux `perf` utility for measuring performance. We recommend following our [install guide](https://learn.arm.com/install-guides/perf/), as you may need to build `perf` from source.

## Example 1

To demonstrate a use case for CPU affinity, we will create a program that heavily utilizes all the available CPU cores. In this example, we repeatedly evaluate the [Leibniz formula](https://en.wikipedia.org/wiki/Leibniz_formula_for_%CF%80) to compute the value of Pi. This is a computationally inefficient algorithm for calculating Pi, and we split the work across many threads. Create a file named `use_all_cores.cpp` and paste in the source code below.
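For reference, the series itself is simple; a minimal single-threaded Python sketch of the same computation (the function name is illustrative):

```python
def leibniz_pi(terms: int) -> float:
    """Approximate Pi with the Leibniz series: 4 * (1 - 1/3 + 1/5 - 1/7 + ...)."""
    total = 0.0
    for i in range(terms):
        total += (-1) ** i / (2 * i + 1)
    return 4 * total

print(leibniz_pi(1_000_000))  # slowly converges toward 3.14159...
```

Because each term depends only on its index `i`, the sum is trivially divided across threads, which is what the C++ version below does.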

```bash
cd ~
touch use_all_cores.cpp && chmod 755 use_all_cores.cpp
```

```cpp
#include <vector>
#include <iostream>
#include <chrono>
#include <thread>

using namespace std;

double multiplethreaded_leibniz(int terms, bool use_all_cores){

    int NUM_THREADS = 2; // use 2 cores by default
    if (use_all_cores){
        NUM_THREADS = std::thread::hardware_concurrency(); // e.g., 16 on a 16-core processor without SMT
    }
    std::vector<double> partial_results(NUM_THREADS);

    auto calculation = [&](int thread_id){
        // Lambda that computes this thread's share of the Leibniz series
        double denominator = 0.0;

        for (int i = thread_id; i < terms; i += NUM_THREADS){
            if (i % 32768 == 0){
                this_thread::sleep_for(std::chrono::nanoseconds(20));
            }
            denominator = (2*i) + 1;
            if (i % 2 == 0){
                partial_results[thread_id] += (1/denominator);
            } else {
                partial_results[thread_id] -= (1/denominator);
            }
        }
    };

    std::vector<thread> threads;
    for (int i = 0; i < NUM_THREADS; i++){
        threads.push_back(std::thread(calculation, i));
    }

    for (auto& thread: threads){
        thread.join();
    }

    // Accumulate and return the final result
    double final_result = 0.0;
    for (auto& partial_result: partial_results){
        final_result += partial_result;
    }
    final_result = final_result * 4;

    return final_result;
}

int main(){

    double result = 0.0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 5; i++){
        result = multiplethreaded_leibniz((1<<29), true);
        std::cout << "iteration\t" << i << std::endl;
    }
    auto end = std::chrono::steady_clock::now();

    auto duration = std::chrono::duration_cast<chrono::milliseconds>(end-start);
    std::this_thread::sleep_for(chrono::seconds(5)); // Wait until the Python script has finished before printing the answer
    std::cout << "Answer = " << result << "\t5 iterations took " << duration.count() << " milliseconds" << std::endl;

    return 0;
}
```

Compile the program with the following command.

```bash
g++ -O2 -std=c++11 -pthread use_all_cores.cpp -o prog
```

In a separate terminal, use the `top` utility to quickly view the utilization of each core: run the following command and press the number `1`. Then run the program by entering `./prog`.

```bash
top -d 0.1 # then press 1 to view per-core utilization
```

![CPU-utilization](./CPU-util.jpg)

As the screenshot above shows, you should observe all cores on your system being periodically utilized up to 100% and then dropping back to idle until the program exits. In the next section we will look at how to bind this program to specific CPU cores while it runs alongside a single-threaded Python script.
