Merged
3 changes: 3 additions & 0 deletions .wordlist.txt
Original file line number Diff line number Diff line change
@@ -573,6 +573,7 @@ BMS
BoardRenderer
BoatAttack
Bolt
BOLT
BOLT's
bonza
bool
@@ -601,6 +602,7 @@ brian
brianfrankcooper
Broadcom
Brossard
BRBE
brstack
BSON
bsp
@@ -701,6 +703,7 @@ CDE
CDH
CDK
cdn
cdsort
ce
cea
cebbb
@@ -0,0 +1,65 @@
---
title: "Get started with BOLT"

draft: true
cascade:
    draft: true

minutes_to_complete: 20

who_is_this_for: This is an introductory topic for performance-minded developers who have a compiled AArch64 Linux program and want to see if BOLT can make it run faster.

learning_objectives:
- Identify whether a program is a good candidate for code layout optimization
- Apply BOLT to optimize a small program with poor spatial locality
- Use different profiling techniques, including BRBE, instrumentation, SPE, and PMU events
- Verify the impact of BOLT optimization using performance metrics


prerequisites:
- An AArch64 system running Linux with [Perf](/install-guides/perf/) installed
- Linux kernel version 6.17 or later for [BRBE](./brbe) profiling
- Linux kernel version 6.14 or later for [SPE](./spe) profiling
- GCC version 13.3 or later to compile the demo program ([GCC](/install-guides/gcc/))
- BOLT version [21.1.8](https://github.com/llvm/llvm-project/releases/tag/llvmorg-21.1.8) or later (download the [tar.xz archive](https://github.com/llvm/llvm-project/releases/download/llvmorg-21.1.8/LLVM-21.1.8-Linux-ARM64.tar.xz))
- A system with enough performance counters for the [TopDown](/install-guides/topdown-tool) methodology, typically a non-virtualized instance


author: Paschalis Mpeis

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
- Cortex-A
tools_software_languages:
- BOLT
- perf

operatingsystems:
- Linux

further_reading:
    - resource:
        title: BOLT README
        link: https://github.com/llvm/llvm-project/tree/main/bolt
        type: documentation
    - resource:
        title: Arm Statistical Profiling Extension Whitepaper
        link: https://developer.arm.com/documentation/109429/latest/
        type: documentation
    - resource:
        title: Arm Topdown Methodology
        link: https://developer.arm.com/documentation/109542/02/Arm-Topdown-methodology
        type: documentation



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
@@ -0,0 +1,62 @@
---
title: "BOLT with BRBE"
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

### What is BRBE
BRBE stands for Branch Record Buffer Extension. It is an Arm hardware unit with a circular buffer that captures the most recent 32 or 64 taken branches. The exact size depends on the hardware implementation.

For BOLT, BRBE provides an effective, low-overhead sampling mechanism that records taken branches directly in hardware without frequent interruptions. Each recorded taken branch represents a control-flow edge, which makes BRBE an edge-based profiling method.

Taken branches are continuously added to the circular buffer, and the buffer is periodically sampled to keep overheads low.
Recording only taken branches is an efficient use of the buffer, since fall-through paths do not need to be captured at runtime.
During post-processing, fall-through edges between the recorded taken branches are reconstructed, extending the effective branch history beyond what is stored in the buffer. BOLT performs this reconstruction automatically.
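The reconstruction idea described above can be sketched in a few lines. This is an illustrative model only, not BOLT's implementation: the `BranchRecord` type, the `reconstruct_fall_throughs` function, and the assumption that records arrive ordered from oldest to newest are all hypothetical. The key observation is that between two consecutive taken branches, execution must have run sequentially from the older branch's target up to the newer branch's source address.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One taken-branch record (hypothetical layout for this sketch).
struct BranchRecord {
  std::uint64_t from;  // address of the taken branch instruction
  std::uint64_t to;    // address the branch jumped to
};

// A straight-line address range [start, end] with no taken branch inside it.
using FallThroughRange = std::pair<std::uint64_t, std::uint64_t>;

// Assumes `records` is ordered oldest to newest.
// For each consecutive pair, the code between the older branch's target and
// the newer branch's source must have executed sequentially.
std::vector<FallThroughRange>
reconstruct_fall_throughs(const std::vector<BranchRecord> &records) {
  std::vector<FallThroughRange> ranges;
  for (std::size_t i = 0; i + 1 < records.size(); ++i) {
    const std::uint64_t start = records[i].to;
    const std::uint64_t end = records[i + 1].from;
    // Skip pairs that cannot describe a sequential run, for example when
    // the trace has a gap between the two records.
    if (start <= end)
      ranges.emplace_back(start, end);
  }
  return ranges;
}
```

With N records in the buffer, this yields up to N - 1 fall-through ranges in addition to the N taken edges, which is how the effective history grows beyond the raw buffer contents.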

### When to use BRBE
When available, BRBE is the preferred profiling option for BOLT.
It is expected to have the lowest runtime overhead while still providing near-optimal profiles, close to those obtained with instrumentation.

### Optimizing with BRBE
We check [BRBE availability](#availability) before recording a profile.
We then record a BRBE profile by running our workload under perf, convert it into a format that BOLT understands, and run the BOLT optimization.

```bash { line_numbers=true }
mkdir -p prof
perf record -j any,u -o prof/brbe.data -- ./out/bsort
perf2bolt -p prof/brbe.data -o prof/brbe.fdata out/bsort
llvm-bolt out/bsort -o out/bsort.opt.brbe --data prof/brbe.fdata \
    --reorder-blocks=ext-tsp --reorder-functions=cdsort --split-functions \
    --dyno-stats
```


### Availability
BRBE is an optional feature in processors that implement [Armv9.1](https://developer.arm.com/documentation/109697/2025_09/Feature-descriptions/The-Armv9-2-architecture-extension#extension__feat_FEAT_BRBE) or later. To check availability, we record a trace.

On a successful recording we see:
```bash { command_line="user@host | 2-5"}
perf record -j any,u -o prof/brbe.data -- ./out/bsort
Bubble sorting 10000 elements
421 ms (first=100669 last=2147469841)
[ perf record: Woken up 161 times to write data ]
[ perf record: Captured and wrote 40.244 MB brbe.data (26662 samples) ]
```

When unavailable:
```bash { command_line="user@host | 2-3"}
perf record -j any,u -o prof/brbe.data -- ./out/bsort
Error:
cycles:P: PMU Hardware or event type doesn't support branch stack sampling.
```

To record a BRBE trace we need a Linux kernel that is version 6.17 or later. We can check the running kernel version and the perf version using:
```bash
uname -r
perf --version
```


### Further Reading
- [Arm Architecture Reference Manual for A-profile architecture](https://developer.arm.com/documentation/ddi0487/latest)
113 changes: 113 additions & 0 deletions content/learning-paths/servers-and-cloud-computing/bolt-demo/bsort.cpp
@@ -0,0 +1,113 @@
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_LEN 10000
#define FUNC_COPIES 5
// Always false at run time. `volatile` stops the compiler from folding the
// checks away, while the `true` hint marks the NOP blocks below as likely,
// which encourages the compiler to place this cold code on the hot path.
volatile bool Cond = false;
#define COND() (__builtin_expect(Cond, true))

// Emit N `nop` instructions.
#define NOPS(N) \
  asm volatile( \
      ".rept %0\n" \
      "nop\n" \
      ".endr\n" \
      : : "i"(N) : "memory")

// Swap functionality plus some cold blocks.
#define SWAP_FUNC(ID) \
  static __attribute__((noinline)) \
  void swap##ID(int *left, int *right) { \
    if (COND()) NOPS(300); \
    int tmp = *left; \
    if (COND()) NOPS(300); else *left = *right; \
    if (COND()) NOPS(300); else *right = tmp; \
  }

// Aligned at 16KiB
#define COLD_FUNC(ID) \
  static __attribute__((noinline, aligned(16384), used)) \
  void cold_func##ID(void) { \
    asm volatile("nop"); \
  }

// Create copies of swap, and interleave with big chunks of cold code.
SWAP_FUNC(1) COLD_FUNC(1)
SWAP_FUNC(2) COLD_FUNC(2)
SWAP_FUNC(3) COLD_FUNC(3)
SWAP_FUNC(4) COLD_FUNC(4)
SWAP_FUNC(5) COLD_FUNC(5)

typedef void (*swap_fty)(int *, int *);
static swap_fty const swap_funcs[FUNC_COPIES] = {
  swap1, swap2, swap3, swap4, swap5
};


/* Sorting Logic */
void bubble_sort(int *a, int n) {
  if (n <= 1)
    return;

  int end = n - 1;
  int swapped = 1;
  unsigned idx = 0;

  while (swapped && end > 0) {
    swapped = 0;
    // pick a different copy of the swap function, in a round-robin fashion
    // and call it.
    for (int i = 1; i <= end; ++i) {
      if (a[i] < a[i - 1]) {
        auto swap_func = swap_funcs[idx++];
        idx %= FUNC_COPIES;
        swap_func(&a[i - 1], &a[i]);
        swapped = 1;
      }
    }
    --end;
  }
}

void sort_array(int *data) {
  for (int i = 0; i < ARRAY_LEN; ++i) {
    data[i] = rand();
  }
  bubble_sort(data, ARRAY_LEN);
}

/* Timers, helpers, and main */
static struct timespec timer_start;
static inline void start_timer(void) {
  clock_gettime(CLOCK_MONOTONIC, &timer_start);
}

static inline void stop_timer(void) {
  struct timespec timer_end;
  clock_gettime(CLOCK_MONOTONIC, &timer_end);
  long long ms = (timer_end.tv_sec - timer_start.tv_sec) * 1000LL +
                 (timer_end.tv_nsec - timer_start.tv_nsec) / 1000000LL;
  printf("%lld ms ", ms);
}

static void print_first_last(const int *data, int n) {
  if (n <= 0)
    return;

  const int first = data[0];
  const int last = data[n - 1];
  printf("(first=%d last=%d)\n", first, last);
}

int main(void) {
  srand(0);
  printf("Bubble sorting %d elements\n", ARRAY_LEN);
  int data[ARRAY_LEN];

  start_timer();
  sort_array(data);
  stop_timer();

  print_first_last(data, ARRAY_LEN);
  return 0;
}
@@ -0,0 +1,81 @@
---
title: Good BOLT Candidates
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Which code is a good BOLT candidate?
A few hardware metrics can indicate whether a program is a good candidate for code-layout optimization.
These metrics are commonly analyzed using general methodologies such as the [Arm TopDown methodology](https://developer.arm.com/documentation/109542/02/Arm-Topdown-methodology).

Here, we focus on a small set of TopDown indicators related to instruction delivery and code locality.
These indicators describe how efficiently the processor can fetch instructions and keep its execution pipeline busy.
When instruction delivery is inefficient, the workload is said to be **front-end bound**, meaning the CPU often waits for instructions instead of executing them.
This usually points to instruction fetch or code layout issues, where improving code layout can help.

The L1 instruction cache (L1 I-cache) is the first and fastest cache used to store instructions close to the CPU.
When instructions are not found there, the CPU must fetch them from slower memory, which can stall execution.
MPKI, short for misses per kilo instructions, measures how often an event misses per 1,000 executed instructions, which makes it easier to compare across programs and workloads.
A high **L1 I-cache MPKI** usually indicates poor instruction locality in the binary.

Based on these observations, the BOLT community suggests the following two indicators of a good candidate:
- A front-end bound value above 10%.
- More than 30 L1 I-cache misses per kilo instructions (MPKI).

High branch misprediction or I-TLB miss rates can also indicate that layout optimization may help.
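The two thresholds above can be written down as a small check. This is a minimal sketch: the struct, constants, and function names are hypothetical, and the inputs are assumed to be metrics that were already measured elsewhere. Each indicator is reported separately, since the text does not say how the two should be combined.

```cpp
// Metrics assumed to be measured elsewhere; names are illustrative only.
struct LayoutMetrics {
  double frontend_bound_percent;  // Topdown Level 1 "Frontend Bound", in %
  double l1i_cache_mpki;          // L1 I-cache misses per 1,000 instructions
};

// Thresholds quoted in the list above.
constexpr double kFrontendBoundThresholdPercent = 10.0;
constexpr double kL1iCacheMpkiThreshold = 30.0;

// True when the workload is more front-end bound than the suggested threshold.
bool exceeds_frontend_bound_threshold(const LayoutMetrics &m) {
  return m.frontend_bound_percent > kFrontendBoundThresholdPercent;
}

// True when the L1 I-cache MPKI is above the suggested threshold.
bool exceeds_l1i_mpki_threshold(const LayoutMetrics &m) {
  return m.l1i_cache_mpki > kL1iCacheMpkiThreshold;
}
```

Keeping the two checks separate makes it easy to see which indicator a workload trips, rather than collapsing them into a single yes/no answer.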

We can collect these metrics with `topdown-tool` (see the [installation guide](/install-guides/topdown-tool)), which is built on the Linux [perf](/install-guides/perf/) tool.
Alternatively, we can compute only the L1 I-cache MPKI metric manually using plain `perf stat`.

{{< tabpane code=true >}}
{{< tab header="topdown-tool" language="bash" output_lines="2-21">}}
topdown-tool ./out/bsort
CPU Neoverse V1 metrics
├── Stage 1 (Topdown metrics)
│   └── Topdown Level 1 (Topdown_L1)
│       └── ┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┓
│           ┃ Metric          ┃ Value ┃ Unit ┃
│           ┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━┩
│           │ Backend Bound   │ 11.77 │ %    │
│           │ Bad Speculation │ 17.92 │ %    │
│         » │ Frontend Bound  │ 55.73 │ %    │ «
│           │ Retiring        │ 14.88 │ %    │
│           └─────────────────┴───────┴──────┘
└── Stage 2 (uarch metrics)
    ├── Misses Per Kilo Instructions (MPKI)
    │   └── ┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
    │       ┃ Metric                  ┃ Value  ┃ Unit                          ┃
    │       ┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
    │       │ Branch MPKI             │ 16.583 │ misses per 1,000 instructions │
    │     » │ L1I Cache MPKI          │ 60.408 │ misses per 1,000 instructions │ «
    │       └─────────────────────────┴────────┴───────────────────────────────┘
...
{{< /tab >}}
{{< tab header="perf stat" language="bash" output_lines="2-10">}}
perf stat -e instructions,L1-icache-misses:u ./out/bsort
 Performance counter stats for './out/bsort':

         957828603      instructions
          58003648      L1-icache-misses

       0.282472631 seconds time elapsed

       0.282541000 seconds user
       0.000000000 seconds sys
{{< /tab >}}
{{< /tabpane >}}

We see that the program is about **56%** front-end bound.
At Stage 2, the micro-architectural metrics report about **60 L1I MPKI**, which indicates a good candidate for layout optimization.
The branch MPKI of about **17** is also relatively high.

Under the hood, the `topdown-tool` collects perf counters and applies formulas to derive these metrics.
To compute the L1 I-cache MPKI manually from the `perf stat` output, we apply:
$$\frac{(\text{L1-icache-misses} \times 1000)}{\text{instructions}}$$
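As a worked instance of this formula, plugging in the two counts from the `perf stat` tab gives:

$$\frac{58003648 \times 1000}{957828603} \approx 60.6$$

This is close to the 60.408 reported by `topdown-tool`. The two values come from separate runs, so an exact match is not expected.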

### Further Reading
- [Arm Topdown methodology](https://developer.arm.com/documentation/109542/02/Arm-Topdown-methodology)
- [Optimizing Clang: A Practical Example of Applying BOLT](https://github.com/llvm/llvm-project/blob/main/bolt/docs/OptimizingClang.md)
- [Metrics by metric group in Neoverse V2](https://developer.arm.com/documentation/109528/0200/Metrics-by-metric-group-in-Neoverse-V2?lang=en)
@@ -0,0 +1,28 @@
---
title: "BOLT with Instrumentation"
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

### What is instrumentation

Instrumentation is a profiling method, not specific to BOLT, that augments code with counters to record exact execution counts.

For BOLT, instrumentation provides complete execution counts for the paths that run. This gives a near-optimal profile for code-layout optimization and therefore the highest optimization potential, without requiring special hardware.

Instrumentation can increase binary size and add significant runtime overhead, making it less attractive for production use. It is mainly used when other profiling methods, such as BRBE, are unavailable, or for comparison to understand the maximum optimization potential.
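To make the idea of counter-based instrumentation concrete, the sketch below instruments a single branch by hand at source level. It is an illustration only: BOLT inserts its counters into the binary automatically, and the `BranchCounters` type and `instrumented_pass` function are hypothetical names introduced for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical counters for the two outcomes of one conditional branch.
struct BranchCounters {
  std::uint64_t taken = 0;      // the "out of order" path ran
  std::uint64_t not_taken = 0;  // the elements were already in order
};

// One bubble-sort pass with the comparison branch instrumented by hand.
// Returns true if at least one swap happened.
bool instrumented_pass(std::vector<int> &a, BranchCounters &c) {
  bool swapped = false;
  for (std::size_t i = 1; i < a.size(); ++i) {
    if (a[i] < a[i - 1]) {
      ++c.taken;  // exact count, updated on every execution
      std::swap(a[i], a[i - 1]);
      swapped = true;
    } else {
      ++c.not_taken;
    }
  }
  return swapped;
}
```

Every execution of the branch updates a counter, which is why the resulting counts are exact and also why the extra work adds runtime overhead.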

### Optimizing with instrumentation
We first build an instrumented binary and then execute the workload to generate a profile.
By default, BOLT writes the profile to `/tmp/prof.fdata`, unless a path is specified using the `--instrumentation-file` flag.
Finally, we use the generated profile to optimize the binary with BOLT.

```bash
llvm-bolt --instrument out/bsort -o out/bsort.instr
./out/bsort.instr
llvm-bolt out/bsort -o out/bsort.opt.instr --data /tmp/prof.fdata \
    --reorder-blocks=ext-tsp --reorder-functions=cdsort --split-functions \
    --dyno-stats
```
@@ -0,0 +1,10 @@
_ZL5swap1PiS_
_ZL10cold_func1v
_ZL5swap2PiS_
_ZL10cold_func2v
_ZL5swap3PiS_
_ZL10cold_func3v
_ZL5swap4PiS_
_ZL10cold_func4v
_ZL5swap5PiS_
_ZL10cold_func5v