ArmDeveloperEcosystem · pareenaverma · Mar 12, 2026 · Mar 3, 2026 · Mar 12, 2026
diff --git a/.wordlist.txt b/.wordlist.txt
@@ -573,6 +573,7 @@ BMS
 BoardRenderer
 BoatAttack
 Bolt
+BOLT
 BOLT's
 bonza
 bool
@@ -601,6 +602,7 @@ brian
 brianfrankcooper
 Broadcom
 Brossard
+BRBE
 brstack
 BSON
 bsp
@@ -701,6 +703,7 @@ CDE
 CDH
 CDK
 cdn
+cdsort
 ce
 cea
 cebbb

diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-demo/_index.md b/content/learning-paths/servers-and-cloud-computing/bolt-demo/_index.md
@@ -0,0 +1,65 @@
+---
+title: "Get started with BOLT"
+
+draft: true
+cascade:
+    draft: true
+
+minutes_to_complete: 20
+
+who_is_this_for: This is an introductory topic for performance‑minded developers
+    who have a compiled aarch64 Linux program and want to see if BOLT can make it run faster.
+
+learning_objectives:
+    - Identify whether a program is a good candidate for code layout optimization
+    - Apply BOLT to optimize a small program with poor spatial locality
+    - Use different profiling techniques, including BRBE, instrumentation, SPE, and PMU events
+    - Verify the impact of BOLT optimization using performance metrics
+
+
+prerequisites:
+    - An AArch64 system running Linux with [Perf](/install-guides/perf/) installed
+    - Linux kernel version 6.17 or later for [BRBE](./brbe) profiling
+    - Linux kernel version 6.14 or later for [SPE](./spe) profiling
+    - GCC version 13.3 or later to compile the demo program (see the [GCC install guide](/install-guides/gcc/))
+    - BOLT version [21.1.8](https://github.com/llvm/llvm-project/releases/tag/llvmorg-21.1.8) or later (download the [tarball](https://github.com/llvm/llvm-project/releases/download/llvmorg-21.1.8/LLVM-21.1.8-Linux-ARM64.tar.xz))
+    - A system with enough performance counters for the [TopDown](/install-guides/topdown-tool) methodology, typically a non-virtualized instance
+
+
+author: Paschalis Mpeis
+
+### Tags
+skilllevels: Introductory
+subjects: Performance and Architecture
+armips:
+    - Neoverse
+    - Cortex-A
+tools_software_languages:
+    - BOLT
+    - perf
+
+operatingsystems:
+    - Linux
+
+further_reading:
+    - resource:
+        title: BOLT README
+        link: https://github.com/llvm/llvm-project/tree/main/bolt
+        type: documentation
+    - resource:
+        title: Arm Statistical Profiling Extension Whitepaper
+        link: https://developer.arm.com/documentation/109429/latest/
+        type: documentation
+    - resource:
+        title: Arm Topdown Methodology
+        link: https://developer.arm.com/documentation/109542/02/Arm-Topdown-methodology
+        type: documentation
+
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1                       # _index.md always has weight of 1 to order correctly
+layout: "learningpathall"       # All files under learning paths have this same wrapper
+learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-demo/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/bolt-demo/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+#       FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21                  # Set to always be larger than the content in this path to be at the end of the navigation.
+title: "Next Steps"         # Always the same, html page title.
+layout: "learningpathall"   # All files under learning paths have this same wrapper for Hugo processing.
+---
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-demo/brbe.md b/content/learning-paths/servers-and-cloud-computing/bolt-demo/brbe.md
@@ -0,0 +1,62 @@
+---
+title: "BOLT with BRBE"
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+### What is BRBE
+BRBE stands for Branch Record Buffer Extension. It is an Arm hardware unit with a circular buffer that captures the most recent 32 or 64 taken branches. The exact size depends on the hardware implementation.
+
+For BOLT, BRBE provides an effective, low-overhead sampling mechanism that records taken branches directly in hardware without frequent interruptions. Each recorded taken branch represents a control-flow edge, which makes BRBE an edge-based profiling method.
+
+Taken branches are continuously added to the circular buffer, and the buffer is periodically sampled to keep overheads low.
+Recording only taken branches is an efficient use of the buffer, since fall-through paths do not need to be captured at runtime.
+During post-processing, fall-through edges between the recorded taken branches are reconstructed, extending the effective branch history beyond what is stored in the buffer. BOLT performs this reconstruction automatically.
+
+### When to use BRBE
+When available, BRBE is the preferred profiling option for BOLT.
+It is expected to have the lowest runtime overhead while still providing near-optimal profiles, close to those obtained with instrumentation.
+
+### Optimizing with BRBE
+We check [BRBE availability](#availability) before recording a profile.
+We then record a BRBE profile by running our workload under perf, convert it into a format that BOLT understands, and run the BOLT optimization.
+
+```bash { line_numbers=true }
+mkdir -p prof
+perf record -j any,u -o prof/brbe.data -- ./out/bsort
+perf2bolt -p prof/brbe.data -o prof/brbe.fdata out/bsort
+llvm-bolt out/bsort -o out/bsort.opt.brbe --data prof/brbe.fdata \
+        -reorder-blocks=ext-tsp -reorder-functions=cdsort -split-functions \
+        --dyno-stats
+```
+
+
+### Availability
+BRBE is an optional feature in processors that implement [Armv9.1](https://developer.arm.com/documentation/109697/2025_09/Feature-descriptions/The-Armv9-2-architecture-extension#extension__feat_FEAT_BRBE) or later. To check availability, we record a trace.
+
+On a successful recording we see:
+```bash { command_line="user@host | 2-5"}
+perf record -j any,u -o prof/brbe.data -- ./out/bsort
+Bubble sorting 10000 elements
+421 ms (first=100669 last=2147469841)
+[ perf record: Woken up 161 times to write data ]
+[ perf record: Captured and wrote 40.244 MB brbe.data (26662 samples) ]
+```
+
+When unavailable:
+```bash { command_line="user@host | 2-3"}
+perf record -j any,u -o prof/brbe.data -- ./out/bsort
+Error:
+cycles:P: PMU Hardware or event type doesn't support branch stack sampling.
+```
+
+To record a BRBE trace we need Linux kernel version 6.17 or later. We can check the kernel version using:
+```bash
+uname -r
+```
+
+
+### Further Reading
+- [Arm Architecture Reference Manual for A-profile architecture](https://developer.arm.com/documentation/ddi0487/latest)
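The kernel-version prerequisites for BRBE (6.17) and SPE (6.14) can be checked in one step. The following is a minimal sketch, assuming GNU `sort -V` is available; `version_ge` is a hypothetical helper name and is not part of perf or BOLT. A new enough kernel does not prove that the hardware implements BRBE, so the `perf record -j any,u` check in brbe.md remains the definitive test.

```bash
#!/usr/bin/env bash
# version_ge A B: succeed when version A >= version B (hypothetical helper).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip any distribution suffix, for example 6.17.0-1004-aws -> 6.17.0
kernel="$(uname -r | cut -d- -f1)"

if version_ge "$kernel" 6.17; then
    echo "kernel $kernel: new enough for BRBE and SPE profiling"
elif version_ge "$kernel" 6.14; then
    echo "kernel $kernel: new enough for SPE, too old for BRBE"
else
    echo "kernel $kernel: too old for BRBE and SPE profiling"
fi
```

The comparison sorts the two version strings and checks that the required version comes first, which avoids parsing major and minor numbers by hand.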
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-demo/bsort.cpp b/content/learning-paths/servers-and-cloud-computing/bolt-demo/bsort.cpp
@@ -0,0 +1,113 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <time.h>
+
+#define ARRAY_LEN 10000
+#define FUNC_COPIES 5
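+volatile bool Cond = false;  // Always false at run time; volatile keeps the checks in the binary.
+#define COND() (__builtin_expect(Cond, true))  // Hint says "likely true", so the compiler tends to lay out the never-executed NOP blocks inline.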
+
+#define NOPS(N) \
+  asm volatile( \
+      ".rept %0\n" \
+      "nop\n" \
+      ".endr\n" \
+      : : "i"(N) : "memory")
+
+// Swap functionality plus some cold blocks.
+#define SWAP_FUNC(ID) \
+    static __attribute__((noinline)) \
+    void swap##ID(int *left, int *right) { \
+        if (COND()) NOPS(300); \
+        int tmp = *left; \
+        if (COND()) NOPS(300); else *left = *right; \
+        if (COND()) NOPS(300); else *right = tmp; \
+    }
+
+// Cold filler function aligned to 16 KiB, which spreads the swap copies far apart.
+#define COLD_FUNC(ID) \
+    static __attribute__((noinline, aligned(16384), used)) \
+    void cold_func##ID(void) { \
+        asm volatile("nop"); \
+    }
+
+// Create copies of swap, and interleave with big chunks of cold code.
+SWAP_FUNC(1) COLD_FUNC(1)
+SWAP_FUNC(2) COLD_FUNC(2)
+SWAP_FUNC(3) COLD_FUNC(3)
+SWAP_FUNC(4) COLD_FUNC(4)
+SWAP_FUNC(5) COLD_FUNC(5)
+
+typedef void (*swap_fty)(int *, int *);
+static swap_fty const swap_funcs[FUNC_COPIES] = {
+    swap1, swap2, swap3, swap4, swap5
+};
+
+
+/* Sorting Logic */
+void bubble_sort(int *a, int n) {
+    if (n <= 1)
+        return;
+
+    int end = n - 1;
+    int swapped = 1;
+    unsigned idx = 0;
+
+    while (swapped && end > 0) {
+        swapped = 0;
+        // pick a different copy of the swap function, in a round-robin fashion
+        // and call it.
+        for (int i = 1; i <= end; ++i) {
+            if (a[i] < a[i - 1]) {
+                auto swap_func = swap_funcs[idx++];
+                idx %= FUNC_COPIES;
+                swap_func(&a[i - 1], &a[i]);
+                swapped = 1;
+            }
+        }
+        --end;
+    }
+}
+
+void sort_array(int *data) {
+    for (int i = 0; i < ARRAY_LEN; ++i) {
+        data[i] = rand();
+    }
+    bubble_sort(data, ARRAY_LEN);
+}
+
+/* Timers, helpers, and main */
+static struct timespec timer_start;
+static inline void start_timer(void) {
+    clock_gettime(CLOCK_MONOTONIC, &timer_start);
+}
+
+static inline void stop_timer(void) {
+    struct timespec timer_end;
+    clock_gettime(CLOCK_MONOTONIC, &timer_end);
+    long long ms = (timer_end.tv_sec - timer_start.tv_sec) * 1000LL +
+                   (timer_end.tv_nsec - timer_start.tv_nsec) / 1000000LL;
+    printf("%lld ms ", ms);
+}
+
+static void print_first_last(const int *data, int n) {
+    if (n <= 0)
+        return;
+
+    const int first = data[0];
+    const int last = data[n - 1];
+    printf("(first=%d last=%d)\n", first, last);
+}
+
+int main(void) {
+    srand(0);
+    printf("Bubble sorting %d elements\n", ARRAY_LEN);
+    int data[ARRAY_LEN];
+
+    start_timer();
+    sort_array(data);
+    stop_timer();
+
+    print_first_last(data, ARRAY_LEN);
+    return 0;
+}
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-demo/good-candidates.md b/content/learning-paths/servers-and-cloud-computing/bolt-demo/good-candidates.md
@@ -0,0 +1,81 @@
+---
+title: Good BOLT Candidates
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Which code is a good BOLT candidate?
+A few hardware metrics can indicate whether a program is a good candidate for code-layout optimization.
+These metrics are commonly analyzed using general methodologies such as the [Arm TopDown methodology](https://developer.arm.com/documentation/109542/02/Arm-Topdown-methodology).
+
+Here, we focus on a small set of TopDown indicators related to instruction delivery and code locality.
+These indicators describe how efficiently the processor can fetch instructions and keep its execution pipeline busy.
+When instruction delivery is inefficient, the workload is said to be **front-end bound**, meaning the CPU often waits for instructions instead of executing them.
+This usually points to instruction fetch or code layout issues, where improving code layout can help.
+
+The L1 instruction cache (L1 I-cache) is the first and fastest cache used to store instructions close to the CPU.
+When instructions are not found there, the CPU must fetch them from slower memory, which can stall execution.
+MPKI, short for misses per kilo instructions, counts the misses of a given event per 1,000 executed instructions. Normalizing by instruction count makes the metric comparable across programs and workloads.
+A high **L1 I-cache MPKI** usually indicates poor instruction locality in the binary.
+
+Based on these observations, the BOLT community suggests the following two indicators of a good candidate:
+- Front-End bound workload above 10%.
+- More than 30 L1 I-cache misses per kilo instructions (MPKI).
+
+High branch misprediction or I-TLB miss rates can also indicate that layout optimization may help.
+
+We can collect these metrics with `topdown-tool` (see the [installation guide](/install-guides/topdown-tool)), which is built on the Linux [perf](/install-guides/perf/) tool.
+Alternatively, we can compute only the L1 I-cache MPKI metric manually using plain Linux perf stat.
+
+{{< tabpane code=true >}}
+  {{< tab header="topdown-tool" language="bash" output_lines="2-21">}}
+    topdown-tool ./out/bsort
+      CPU Neoverse V1 metrics
+      ├── Stage 1 (Topdown metrics)
+      │   └── Topdown Level 1 (Topdown_L1)
+      │       └── ┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┓
+      │           ┃ Metric          ┃ Value ┃ Unit ┃
+      │           ┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━┩
+      │           │ Backend Bound   │ 11.77 │ %    │
+      │           │ Bad Speculation │ 17.92 │ %    │
+      │         » │ Frontend Bound  │ 55.73 │ %    │ «
+      │           │ Retiring        │ 14.88 │ %    │
+      │           └─────────────────┴───────┴──────┘
+      └── Stage 2 (uarch metrics)
+          ├── Misses Per Kilo Instructions (MPKI)
+          │   └── ┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+          │       ┃ Metric                  ┃ Value  ┃ Unit                          ┃
+          │       ┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+          │       │ Branch MPKI             │ 16.583 │ misses per 1,000 instructions │
+          │     » │ L1I Cache MPKI          │ 60.408 │ misses per 1,000 instructions │ «
+          │       └─────────────────────────┴────────┴───────────────────────────────┘
+          ...
+  {{< /tab >}}
+  {{< tab header="perf stat" language="bash" output_lines="2-10">}}
+    perf stat -e instructions,L1-icache-misses:u ./out/bsort
+      Performance counter stats for './out/bsort':
+
+          957828603 instructions
+           58003648 L1-icache-misses
+
+        0.282472631 seconds time elapsed
+
+        0.282541000 seconds user
+        0.000000000 seconds sys
+  {{< /tab >}}
+{{< /tabpane >}}
+
+We see that the program is about **56%** front-end bound, well above the 10% threshold.
+At Stage 2, the micro-architectural metrics report about **60 L1I MPKI**, twice the 30 MPKI threshold, which indicates a good candidate for layout optimization.
+The branch MPKI of about **17** is also relatively high.
+
+Under the hood, the `topdown-tool` collects perf counters and applies formulas to derive these metrics.
+To compute the L1 I-cache MPKI manually from the `perf stat` output, we apply:
+$$\frac{(\text{L1-icache-misses} \times 1000)}{\text{instructions}}$$
+
+### Further Reading
+- [Arm Topdown methodology](https://developer.arm.com/documentation/109542/02/Arm-Topdown-methodology)
+- [Optimizing Clang : A Practical Example of Applying BOLT](https://github.com/llvm/llvm-project/blob/main/bolt/docs/OptimizingClang.md)
+- [Metrics by metric group in Neoverse V2](https://developer.arm.com/documentation/109528/0200/Metrics-by-metric-group-in-Neoverse-V2?lang=en)
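The MPKI formula in good-candidates.md can be applied directly to the two counter values in the `perf stat` output. The following is a minimal sketch using `awk` for the arithmetic. The helper names `mpki` and `is_good_candidate` are hypothetical, and treating the two community thresholds as a joint condition (both must hold) is an assumption made for illustration.

```bash
#!/usr/bin/env bash
# mpki MISSES INSTRUCTIONS: misses per 1,000 instructions, two decimals.
mpki() {
    local misses=$1 instructions=$2
    awk -v m="$misses" -v i="$instructions" \
        'BEGIN { printf "%.2f\n", (m * 1000) / i }'
}

# is_good_candidate FRONTEND_BOUND_PCT L1I_MPKI (hypothetical helper):
# succeeds when front-end bound > 10% and L1 I-cache MPKI > 30.
is_good_candidate() {
    local frontend_bound=$1 l1i_mpki=$2
    awk -v f="$frontend_bound" -v m="$l1i_mpki" \
        'BEGIN { exit !(f > 10 && m > 30) }'
}

# Counter values copied from the perf stat tab; 55.73 is from topdown-tool.
l1i_mpki=$(mpki 58003648 957828603)
echo "L1 I-cache MPKI: $l1i_mpki"   # prints 60.56

if is_good_candidate 55.73 "$l1i_mpki"; then
    echo "good BOLT candidate"
fi
```

The manual result of 60.56 is close to the 60.408 reported by `topdown-tool`; the two come from separate runs, so a small difference is expected.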
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-demo/instrumentation.md b/content/learning-paths/servers-and-cloud-computing/bolt-demo/instrumentation.md
@@ -0,0 +1,28 @@
+---
+title: "BOLT with Instrumentation"
+weight: 6
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+### What is instrumentation
+
+Instrumentation is a profiling method, not specific to BOLT, that augments code with counters to record exact execution counts.
+
+For BOLT, instrumentation provides complete execution counts for the paths that run. This gives a near-optimal profile for code-layout optimization and therefore the highest optimization potential, without requiring special hardware.
+
+Instrumentation can increase binary size and add significant runtime overhead, making it less attractive for production use. It is mainly used when other profiling methods, such as BRBE, are unavailable, or for comparison to understand the maximum optimization potential.
+
+### Optimizing with instrumentation
+We first build an instrumented binary and then execute the workload to generate a profile.
+By default, BOLT writes the profile to `/tmp/prof.fdata`, unless a path is specified using the `--instrumentation-file` flag.
+Finally, we use the generated profile to optimize the binary with BOLT.
+
+```bash
+llvm-bolt --instrument out/bsort -o out/bsort.instr
+./out/bsort.instr
+llvm-bolt out/bsort -o out/bsort.opt.instr --data /tmp/prof.fdata \
+        -reorder-blocks=ext-tsp -reorder-functions=cdsort -split-functions \
+        --dyno-stats
+```
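The instrumentation page mentions `--instrumentation-file` without showing it in use. The following is a minimal sketch that keeps the profile under `prof/` instead of `/tmp/prof.fdata`. The path `prof/instr.fdata` and the helper name `instr_steps` are assumptions for illustration. The script only prints the three commands, so it can be reviewed on a machine without BOLT installed.

```bash
#!/usr/bin/env bash
# Hypothetical profile location; the default would be /tmp/prof.fdata.
PROFILE="${PROFILE:-prof/instr.fdata}"

# instr_steps PROFILE: print the instrument, run, and optimize commands.
instr_steps() {
    local profile=$1
    cat <<EOF
llvm-bolt --instrument --instrumentation-file=$profile out/bsort -o out/bsort.instr
./out/bsort.instr
llvm-bolt out/bsort -o out/bsort.opt.instr --data $profile -reorder-blocks=ext-tsp -reorder-functions=cdsort -split-functions --dyno-stats
EOF
}

instr_steps "$PROFILE"
```

To run the steps rather than print them, create the `prof` directory first and pipe the output to `bash`. Writing the profile next to the other profiles makes it harder to pick up a stale `/tmp/prof.fdata` from an earlier run.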
diff --git a/content/learning-paths/servers-and-cloud-computing/bolt-demo/orderfile.txt b/content/learning-paths/servers-and-cloud-computing/bolt-demo/orderfile.txt
@@ -0,0 +1,10 @@
+_ZL5swap1PiS_
+_ZL10cold_func1v
+_ZL5swap2PiS_
+_ZL10cold_func2v
+_ZL5swap3PiS_
+_ZL10cold_func3v
+_ZL5swap4PiS_
+_ZL10cold_func4v
+_ZL5swap5PiS_
+_ZL10cold_func5v
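The order file lists the mangled names of the static `swapN` and `cold_funcN` functions from bsort.cpp, interleaved in source order. The following is a minimal sketch that regenerates the same ten lines; `gen_orderfile` is a hypothetical helper name. How the file is consumed by the build is not shown in this part of the diff, so the sketch only reproduces its contents.

```bash
#!/usr/bin/env bash
# gen_orderfile N: print order-file entries for N swap/cold pairs.
# The length prefixes (5 and 10) in the mangled names are only valid
# for single-digit N, which covers FUNC_COPIES = 5.
gen_orderfile() {
    local copies=${1:-5} i
    for ((i = 1; i <= copies; i++)); do
        printf '_ZL5swap%dPiS_\n' "$i"       # static void swapN(int *, int *)
        printf '_ZL10cold_func%dv\n' "$i"    # static void cold_funcN(void)
    done
}

gen_orderfile 5
```

Generating the list keeps the order file in step with `FUNC_COPIES` if the number of swap copies changes.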