
Commit 4b59e56

Merge pull request #2763 from kieranhejmadi01/pinning-threads
LP - Getting Started with CPU affinity
2 parents 33ad1a3 + 58cb1bc commit 4b59e56

10 files changed

Lines changed: 654 additions & 0 deletions


Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
---
title: Getting Started with CPU Affinity

minutes_to_complete: 30

who_is_this_for: Developers, performance engineers, and system administrators looking to fine-tune the performance of their workloads on many-core Arm-based systems.

learning_objectives:
- Create CPU sets and apply them directly in source code
- Understand the performance tradeoffs of pinning threads with CPU affinity masks

prerequisites:
- An intermediate understanding of multi-threaded, object-oriented programming in C++ and Python
- A foundational understanding of build systems and computer architecture

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
tools_software_languages:
- C++
- Python
operatingsystems:
- Linux

further_reading:
- resource:
    title: Taskset Manual
    link: https://man7.org/linux/man-pages/man1/taskset.1.html
    type: documentation


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
---
title: Background Information
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction

CPU affinity is the practice of binding a process or thread to a specific CPU core or set of cores, telling the operating system scheduler where that work is allowed to run. By default, the Linux scheduler dynamically migrates threads across cores to balance load and maximize overall throughput. Pinning overrides this behavior by constraining execution to a chosen set of cores.

Pinning is most often used as a fine-tuning technique for workloads that aim to consume as many CPU cycles as possible while running alongside other demanding applications. Scientific computing pipelines and real-time analytics frequently fall into this category. Applications that pin processes to specific cores are typically sensitive to latency variation rather than just average throughput, or have intricate memory access patterns. Pinning can reduce this noise and provide more consistent execution behavior and better memory locality under load.

Another important motivation is memory locality. On modern systems with Non-Uniform Memory Access (NUMA) architectures, cores have different memory access times and characteristics depending on where the data is fetched from. For example, a server with 2 CPU sockets may appear to the programmer as a single processor, yet memory access latency differs depending on which core issues the access and which socket's memory holds the data. By pinning threads to cores that are close to the memory they use, and allocating memory accordingly, an application can reduce memory access latency and improve bandwidth.

Developers can set affinity directly in source code using system calls. Many parallel frameworks expose higher-level controls, such as OpenMP affinity settings, that manage thread placement automatically. Alternatively, at runtime, system administrators can pin existing processes using utilities like `taskset`, or launch applications with `numactl` to control both CPU and memory placement without modifying code.

Pinning is a tradeoff. It can improve determinism and locality, but it can also reduce flexibility and hurt performance if the chosen layout is suboptimal or if system load changes. Over-constraining the scheduler may leave cores idle while pinned threads contend unnecessarily. As a general rule, rely on the operating system scheduler as a first pass and introduce pinning only when you are looking to fine-tune performance.
Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
---
title: Setup
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Setup

In this example we will be using an AWS Graviton3 `m7g.4xlarge` instance running Ubuntu 22.04 LTS, based on the Arm Neoverse V1 architecture. If you are unfamiliar with creating a cloud instance, please refer to our [getting started learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/).

This learning path is expected to work on any Linux-based Arm instance with 4 or more CPU cores. The `m7g.4xlarge` instance has a uniform processor architecture, so there is negligible difference in memory or CPU core performance across the cores. On Linux, this can easily be checked with the following command.

```bash
lscpu | grep -i numa
```

For our `m7g.4xlarge`, all 16 cores are in the same NUMA (Non-Uniform Memory Access) node.

```out
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
```

First, we will demonstrate how to pin threads easily using the `taskset` utility available on Linux. It sets or retrieves the CPU affinity of a running process, or sets the affinity of a process about to be launched, without requiring any modifications to the source code.
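A couple of illustrative invocations (`./prog` stands in for the example program built later in this learning path):

```shell
# Launch a command restricted to cores 0-3
taskset -c 0-3 ./prog

# Show the affinity mask of an existing process (here, the current shell)
taskset -p $$
```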

## Install Prerequisites

Run the following commands:

```bash
sudo apt update && sudo apt install g++ cmake python3.12-venv -y
```

Install Google's microbenchmarking support library.

```bash
# Check out the library.
git clone https://github.com/google/benchmark.git
# Go to the library root directory.
cd benchmark
# Make a build directory to place the build output.
cmake -E make_directory "build"
# Generate build system files with cmake, and download any dependencies.
cmake -E chdir "build" cmake -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release ../
# Build and install the library.
sudo cmake --build "build" --config Release --target install -j $(nproc)
```

If you have issues building and installing, please refer to the [official installation guide](https://github.com/google/benchmark).

Finally, you will need to install the Linux `perf` utility for measuring performance. We recommend following our [install guide](https://learn.arm.com/install-guides/perf/), as you may need to build `perf` from source.

## Example 1

To demonstrate a use case for CPU affinity, we will create a program that heavily utilizes all the available CPU cores. In this example, we repeatedly evaluate the [Leibniz formula](https://en.wikipedia.org/wiki/Leibniz_formula_for_%CF%80) to compute the value of Pi. This is a computationally inefficient algorithm for calculating Pi, and we split the work across many threads. Create a file named `use_all_cores.cpp` and paste in the source code below.
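For reference, the series itself is simple; a minimal single-threaded Python sketch of the same computation (the function name is illustrative):

```python
def leibniz_pi(terms: int) -> float:
    """Approximate Pi with the Leibniz series: 4 * (1 - 1/3 + 1/5 - 1/7 + ...)."""
    total = 0.0
    for i in range(terms):
        total += (-1) ** i / (2 * i + 1)
    return 4 * total

print(leibniz_pi(1_000_000))  # slowly converges toward 3.14159...
```

Because each term depends only on its index `i`, the sum is trivially divided across threads, which is what the C++ version below does.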

```bash
cd ~
touch use_all_cores.cpp && chmod 755 use_all_cores.cpp
```

```cpp
#include <vector>
#include <iostream>
#include <chrono>
#include <thread>

using namespace std;

double multiplethreaded_leibniz(int terms, bool use_all_cores){

    int NUM_THREADS = 2; // use 2 cores by default
    if (use_all_cores){
        NUM_THREADS = std::thread::hardware_concurrency(); // e.g., 16 on a 16-core processor without SMT
    }
    std::vector<double> partial_results(NUM_THREADS);

    auto calculation = [&](int thread_id){
        // Lambda that computes this thread's share of the Leibniz series
        double denominator = 0.0;

        for (int i = thread_id; i < terms; i += NUM_THREADS){
            if (i % 32768 == 0){
                this_thread::sleep_for(std::chrono::nanoseconds(20));
            }
            denominator = (2*i) + 1;
            if (i % 2 == 0){
                partial_results[thread_id] += (1/denominator);
            } else {
                partial_results[thread_id] -= (1/denominator);
            }
        }
    };

    std::vector<thread> threads;
    for (int i = 0; i < NUM_THREADS; i++){
        threads.push_back(std::thread(calculation, i));
    }

    for (auto& thread: threads){
        thread.join();
    }

    // Accumulate and return the final result
    double final_result = 0.0;
    for (auto& partial_result: partial_results){
        final_result += partial_result;
    }
    final_result = final_result * 4;

    return final_result;
}

int main(){

    double result = 0.0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 5; i++){
        result = multiplethreaded_leibniz((1<<29), true);
        std::cout << "iteration\t" << i << std::endl;
    }
    auto end = std::chrono::steady_clock::now();

    auto duration = std::chrono::duration_cast<chrono::milliseconds>(end-start);
    std::this_thread::sleep_for(chrono::seconds(5)); // Wait until the Python script has finished before printing the answer
    std::cout << "Answer = " << result << "\t5 iterations took " << duration.count() << " milliseconds" << std::endl;

    return 0;
}
```

Compile the program with the following command.

```bash
g++ -O2 -std=c++11 -pthread use_all_cores.cpp -o prog
```

In a separate terminal, use the `top` utility to quickly view the utilization of each core: run the following command and press the number `1`. Then run the program by entering `./prog`.

```bash
top -d 0.1 # then press 1 to view per-core utilization
```

![CPU-utilization](./CPU-util.jpg)

As the screenshot above shows, you should observe all cores on your system being periodically utilized up to 100% and then dropping back to idle until the program exits. In the next section we will look at how to bind this program to specific CPU cores while it runs alongside a single-threaded Python script.
