---
title: Setup
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Setup

In this example we will be using an AWS Graviton3 `m7g.4xlarge` instance running Ubuntu 22.04 LTS, based on the Arm Neoverse V1 core. If you are unfamiliar with creating a cloud instance, please refer to our [getting started learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/).

This learning path is expected to work on any Linux-based Arm instance with 4 or more CPU cores. The `m7g.4xlarge` instance has a uniform processor architecture, so there is negligible difference in memory or CPU core performance across the cores. On Linux, this can easily be checked with the following command.

```bash
lscpu | grep -i numa
```

For our `m7g.4xlarge`, all 16 cores are in the same NUMA (non-uniform memory access) node.

```out
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
```

First, we will demonstrate how to pin threads using the `taskset` utility available on Linux. `taskset` sets or retrieves the CPU affinity of a running process, or launches a new process with a given affinity. It does not require any modifications to the source code.

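As a quick sketch of the three ways `taskset` is typically used, the commands below launch a process with a given affinity, query the affinity of a running process, and change it on the fly. A short `sleep` is used here as a placeholder workload.

```bash
# Launch a command pinned to core 0 (no source-code changes needed)
taskset -c 0 sleep 2 &
PID=$!

# Query the affinity mask of the running process
taskset -p "$PID"

# Change the affinity of the running process on the fly (here, back to core 0)
taskset -cp 0 "$PID"

wait "$PID"
```

We will use the same launch form later to pin our example program to a subset of cores.
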
## Install Prerequisites

Run the following commands:

```bash
sudo apt update && sudo apt install -y g++ cmake python3-venv
```

Install Google Benchmark, Google's microbenchmarking support library.

```bash
# Check out the library
git clone https://github.com/google/benchmark.git
# Go to the library root directory
cd benchmark
# Make a build directory to place the build output
cmake -E make_directory "build"
# Generate build system files with cmake, and download any dependencies
cmake -E chdir "build" cmake -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release ../
# Build and install the library
sudo cmake --build "build" --config Release --target install -j $(nproc)
```
If you have issues building and installing, please refer to the [official installation guide](https://github.com/google/benchmark).

Finally, you will need to install the Linux `perf` utility for measuring performance. We recommend following our [install guide](https://learn.arm.com/install-guides/perf/), as you may need to build `perf` from source.

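As a quick sanity check that `perf` is working, you can count a couple of software events for a short stand-in workload. This is only a sketch: the command skips gracefully if `perf` is not yet on your `PATH`, and hardware counters may be restricted in some virtualized environments.

```bash
if command -v perf >/dev/null 2>&1; then
  # Count software events for a one-second stand-in workload
  perf stat -e task-clock,context-switches sleep 1
else
  echo "perf is not installed yet"
fi
```
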
## Example 1

To demonstrate a use case for CPU affinity, we will create a program that heavily utilizes all the available CPU cores. Create a file named `use_all_cores.cpp` and paste in the source code below. In this example, we repeatedly evaluate the [Leibniz formula](https://en.wikipedia.org/wiki/Leibniz_formula_for_%CF%80) to compute the value of Pi. This is a computationally inefficient way to calculate Pi, and we split the work across many threads.

```bash
cd ~
touch use_all_cores.cpp
```

```cpp
#include <vector>
#include <iostream>
#include <chrono>
#include <thread>

using namespace std;

double multithreaded_leibniz(int terms, bool use_all_cores){

    int NUM_THREADS = 2; // use 2 cores by default
    if (use_all_cores){
        NUM_THREADS = std::thread::hardware_concurrency(); // e.g., 16 on a 16-vCPU m7g.4xlarge
    }
    std::vector<double> partial_results(NUM_THREADS, 0.0);

    // Lambda that computes this thread's share of the Leibniz series
    auto calculation = [&](int thread_id){
        double partial = 0.0; // accumulate locally to avoid false sharing
        double denominator = 0.0;

        for (int i = thread_id; i < terms; i += NUM_THREADS){
            if (i % 32768 == 0){
                this_thread::sleep_for(std::chrono::nanoseconds(20));
            }
            denominator = (2.0 * i) + 1;
            if (i % 2 == 0){
                partial += (1 / denominator);
            } else {
                partial -= (1 / denominator);
            }
        }
        partial_results[thread_id] = partial;
    };

    std::vector<thread> threads;
    for (int i = 0; i < NUM_THREADS; i++){
        threads.push_back(std::thread(calculation, i));
    }

    for (auto& thread : threads){
        thread.join();
    }

    // Accumulate and return the final result
    double final_result = 0.0;
    for (auto& partial_result : partial_results){
        final_result += partial_result;
    }
    final_result = final_result * 4;

    return final_result;
}

int main(){

    double result = 0.0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 5; i++){
        result = multithreaded_leibniz((1 << 29), true);
        std::cout << "iteration\t" << i << std::endl;
    }
    auto end = std::chrono::steady_clock::now();

    auto duration = std::chrono::duration_cast<chrono::milliseconds>(end - start);
    std::this_thread::sleep_for(chrono::seconds(5)); // Wait until the Python script (used later) has finished before printing the answer
    std::cout << "Answer = " << result << "\t5 iterations took " << duration.count() << " milliseconds" << std::endl;

    return 0;
}
```

Compile the program with the following command. The `-pthread` flag links the threading support library.

```bash
g++ -O2 -std=c++11 -pthread use_all_cores.cpp -o prog
```

In a separate terminal, we can use the `top` utility to view the utilization of each core. Run the following command and press `1` to switch to the per-core view, then run the program with `./prog`.

```bash
top -d 0.1 # then press 1 to view per-core utilization
```

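If you prefer a non-interactive view, the per-core figures that `top` displays are derived from `/proc/stat`: each `cpuN` line lists cumulative clock ticks spent in user, nice, system, idle, and other states, so sampling it twice shows where time is going.

```bash
# Print one line of cumulative counters per core
grep '^cpu[0-9]' /proc/stat
```
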
You should observe all cores on your system being periodically utilized up to 100% and then dropping back to idle until the program exits. In the next section, we will look at how to bind this program to specific CPU cores while it runs alongside a single-threaded Python script.