LP - Profile and Accelerate GPT2 Inference with Performix #3368
Open
kieranhejmadi01 wants to merge 4 commits into ArmDeveloperEcosystem:main from kieranhejmadi01:performix-instruction-mix
62a23a8 initial commit before technical review (kieranhejmadi01)
27fba09 Fix - replaced *.png to *.webp to reduce file size (kieranhejmadi01)
f64a1ae incorporate technical feedback into PR before merge (kieranhejmadi01)
3c83ce3 revert accidental error in baseline gif description (kieranhejmadi01)
68 changes: 68 additions & 0 deletions
.../learning-paths/servers-and-cloud-computing/performix-instruction-mix/_index.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| --- | ||
| title: Profile GPT-2 instruction mix with Arm Performix | ||
|
|
||
| description: Learn how to profile GPT-2 inference on Arm Neoverse with the Arm Performix Instruction Mix recipe, identify scalar versus vector execution patterns, and improve throughput with NEON, SVE, and KleidiAI kernels. | ||
|
|
||
| minutes_to_complete: 45 | ||
|
|
||
| who_is_this_for: This is an introductory topic for developers who want to get started with the Instruction Mix recipe in Arm Performix through a practical example. | ||
|
|
||
| learning_objectives: | ||
| - Explain how the Instruction Mix recipe combines static disassembly with runtime sampling to show execution behavior | ||
| - Build and run the GPT-2 inference example on an Arm Linux server | ||
| - Identify why matrix multiplication dominates runtime and how vectorization changes the instruction mix | ||
| - Compare throughput and instruction mix across scalar, NEON, SVE, and KleidiAI implementations | ||
|
|
||
| prerequisites: | ||
| - Access to Arm Performix configured with a remote Arm Linux target. For setup, see the [Arm Performix install guide](/install-guides/performix/) | ||
| - Basic understanding of C++ and compiler optimization | ||
| - Basic understanding of matrix multiplication | ||
| - Basic understanding of writing SIMD code with NEON and/or SVE | ||
|
|
||
| author: | ||
| - Kieran Hejmadi | ||
| - Oliver Grainge | ||
|
|
||
| ### Tags | ||
| skilllevels: Introductory | ||
| subjects: Performance and Architecture | ||
| armips: | ||
| - Neoverse | ||
| tools_software_languages: | ||
| - Arm Performix | ||
| - C++ | ||
| - LLM | ||
| - NEON | ||
| - SVE | ||
| operatingsystems: | ||
| - Linux | ||
| further_reading: | ||
| - resource: | ||
| title: Arm Performix User Guide | ||
| link: https://developer.arm.com/documentation/110163/latest | ||
| type: documentation | ||
| - resource: | ||
| title: Find code hotspots with Arm Performix | ||
| link: /learning-paths/servers-and-cloud-computing/cpu_hotspot_performix/ | ||
| type: learning-path | ||
| - resource: | ||
| title: Identify code hotspots using Arm Performix through the Arm MCP Server | ||
| link: /learning-paths/servers-and-cloud-computing/performix-mcp-agent/ | ||
| type: learning-path | ||
| - resource: | ||
| title: Arm MCP Server GitHub Repository | ||
| link: https://github.com/arm/mcp | ||
| type: website | ||
| - resource: | ||
| title: GPT-2 Example repository | ||
| link: https://github.com/arm-education/GPT-2-Example | ||
| type: website | ||
|
|
||
|
|
||
|
|
||
| ### FIXED, DO NOT MODIFY | ||
| # ================================================================================ | ||
| weight: 1 # _index.md always has weight of 1 to order correctly | ||
| layout: "learningpathall" # All files under learning paths have this same wrapper | ||
| learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. | ||
| --- |
8 changes: 8 additions & 0 deletions
...ning-paths/servers-and-cloud-computing/performix-instruction-mix/_next-steps.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| --- | ||
| # ================================================================================ | ||
| # FIXED, DO NOT MODIFY THIS FILE | ||
| # ================================================================================ | ||
| weight: 21 # The weight controls the order of the pages. _index.md always has weight 1. | ||
| title: "Next Steps" # Always the same, html page title. | ||
| layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. | ||
| --- |
Binary file added
BIN
+63.7 KB
...g-paths/servers-and-cloud-computing/performix-instruction-mix/code_hotspot.webp
Binary file added
BIN
+44.1 KB
...servers-and-cloud-computing/performix-instruction-mix/code_hotspot_results.webp
Binary file added
BIN
+63.8 KB
...ervers-and-cloud-computing/performix-instruction-mix/configuring-performix.webp
Binary file added
BIN
+64.8 KB
...hs/servers-and-cloud-computing/performix-instruction-mix/dynamic-functions.webp
Binary file added
BIN
+85 KB
...g-paths/servers-and-cloud-computing/performix-instruction-mix/gpt2-baseline.gif
Binary file added
BIN
+88.9 KB
...arning-paths/servers-and-cloud-computing/performix-instruction-mix/gpt_neon.gif
Binary file added
BIN
+49.9 KB
...arning-paths/servers-and-cloud-computing/performix-instruction-mix/hotspot.webp
44 changes: 44 additions & 0 deletions
...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-1.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| --- | ||
| title: Background | ||
| weight: 2 | ||
|
|
||
| ### FIXED, DO NOT MODIFY | ||
| layout: learningpathall | ||
| --- | ||
|
|
||
| ## What the instruction mix recipe shows | ||
|
|
||
| The Arm Performix Instruction Mix recipe shows the types and proportions of machine instructions your workload executes at runtime and in static analysis, so you can see how efficiently your code uses Arm CPU hardware resources. | ||
|
|
||
| The Instruction Mix recipe classifies each instruction into a group. The available groups depend on the Arm Neoverse architecture version you are profiling, so the categories you see can vary between targets. Typical categories include: | ||
|
|
||
| - integer and floating-point arithmetic | ||
| - memory loads and stores (including exclusive operations) | ||
| - control flow instructions, such as branches and loops | ||
| - specialized instructions, such as cryptographic operations | ||
| - SIMD (Single Instruction, Multiple Data) instructions, including NEON (fixed 128-bit) and SVE (scalable vector length) | ||
|
|
||
| The instruction mix result gives you two complementary views: | ||
|
|
||
| - static analysis, which inspects compiled machine code without running it | ||
| - dynamic analysis, which measures instruction usage during real execution | ||
|
|
||
| Together, these views help you verify whether architecture-specific features are actually active in hot code paths. | ||
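The difference between the two views can be sketched in a few lines of C++. This is purely illustrative and is not how Performix is implemented: the `InstrRecord` type, the category strings, and both function names are hypothetical. The static view counts every disassembled instruction once, while the dynamic view weights each instruction by the number of runtime samples attributed to it.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: one disassembled instruction and the number of
// runtime samples that landed on it. Not a Performix data structure.
struct InstrRecord {
    std::string category;   // for example "simd", "scalar_fp", "load_store"
    std::uint64_t samples;  // runtime samples attributed to this instruction
};

// Static view: every instruction in the binary counts once.
std::map<std::string, double> static_mix(const std::vector<InstrRecord>& recs) {
    std::map<std::string, double> mix;
    if (recs.empty()) return mix;
    for (const auto& r : recs) mix[r.category] += 1.0;
    for (auto& kv : mix) kv.second = 100.0 * kv.second / static_cast<double>(recs.size());
    return mix;
}

// Dynamic view: each instruction is weighted by its sample count.
std::map<std::string, double> dynamic_mix(const std::vector<InstrRecord>& recs) {
    std::map<std::string, double> mix;
    std::uint64_t total = 0;
    for (const auto& r : recs) {
        mix[r.category] += static_cast<double>(r.samples);
        total += r.samples;
    }
    if (total == 0) return {};
    for (auto& kv : mix) kv.second = 100.0 * kv.second / static_cast<double>(total);
    return mix;
}
```

A binary can contain many SIMD instructions in cold code and still spend nearly all of its samples in a scalar loop. In that case the static percentages look healthy while the dynamic percentages reveal the problem.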
|
|
||
| ## Why instruction mix is useful | ||
|
|
||
| Instruction mix is useful when you need to confirm that performance-critical code uses Arm CPU features effectively. It is especially helpful for tasks such as validating the effectiveness of compiler autovectorization. | ||
|
|
||
| For example, if a hot function is mostly scalar at runtime when you expected NEON or SVE activity, that often indicates missed vectorization opportunities. You can then focus optimization work on compiler flags, data layout, loop structure, and kernel implementation to improve throughput where it matters most. | ||
|
|
||
| ## Why use a GPT-2 workload | ||
|
|
||
| In this Learning Path, you run the [GPT-2 Medium](https://huggingface.co/openai-community/gpt2-medium) model on a minimal C++ inference engine to analyze instruction mix and throughput. This model is available under a [modified MIT License](https://github.com/openai/gpt-2/blob/master/LICENSE). You will confirm that matrix multiplication (`matmul`) is the hot path, then compare how scalar, NEON, and SVE implementations change instruction behavior and token generation speed. | ||
|
|
||
| This example implements only the forward inference path, with no back propagation or training. You do not need to understand the full transformer architecture to complete this Learning Path. Familiarity with matrix multiplication is enough. For background on GPT-2, see the original 2019 paper, [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). | ||
|
|
||
| You will also try implementing your own `matmul` kernels that target NEON and SVE, then use instruction mix data to verify that these vector paths are active and improving throughput. | ||
|
|
||
| ## What you've learned and what's next | ||
|
|
||
| In this section, you learned what instruction mix represents and why it is useful for LLM inference optimization on Arm. Next, you will set up the GPT-2 example, build the binaries, and run a baseline test. | ||
112 changes: 112 additions & 0 deletions
...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-2.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,112 @@ | ||
| --- | ||
| title: Set up and run GPT-2 baseline | ||
| weight: 3 | ||
|
|
||
| ### FIXED, DO NOT MODIFY | ||
| layout: learningpathall | ||
| --- | ||
|
|
||
| ## Prepare the environment | ||
|
|
||
| Use an Arm Linux target, such as an Arm Neoverse cloud instance. The results in this Learning Path were collected on a Graviton 3 instance based on Neoverse V1 running Ubuntu 24.04 LTS. If you have not configured Arm Performix yet, complete setup and target connection using the [Arm Performix install guide](/install-guides/performix/). | ||
|
|
||
| Install build prerequisites and clone the GPT-2 example repository: | ||
|
|
||
| ```bash | ||
| sudo apt update | ||
| sudo apt install -y git g++ cmake python3 python3-venv | ||
| git clone --recurse-submodules https://github.com/arm-education/GPT-2-Example.git | ||
| cd GPT-2-Example | ||
| git checkout tags/v0.0.2 | ||
| ``` | ||
|
|
||
| ## Export GPT-2 model assets | ||
|
|
||
| The C++ runtime expects exported model binaries. Create a Python virtual environment, install dependencies, and export GPT-2 Medium weights and vocabulary: | ||
|
|
||
| This Learning Path uses [openai-community/gpt2-medium on Hugging Face](https://huggingface.co/openai-community/gpt2-medium), which corresponds to the GPT-2 Medium model from the original OpenAI GPT-2 release in 2019. The model has 355 million parameters, and in this workflow it runs with unquantized FP32 (32-bit floating-point) weights. | ||
|
|
||
| ```bash | ||
| python3 -m venv venv | ||
| source venv/bin/activate | ||
| pip install -r src/requirements.txt | ||
| python3 src/export_gpt2.py --model gpt2-medium | ||
| ``` | ||
|
|
||
| This creates: | ||
|
|
||
| - `models/gpt2-medium/weights.bin` | ||
| - `models/gpt2-medium/vocab.bin` | ||
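As a rough size check for `weights.bin`, the arithmetic below estimates the storage needed for 355 million FP32 parameters. It assumes the file holds little beyond the raw parameters, which this Learning Path does not state, so treat the result as an estimate rather than the exact file size. The function names are illustrative.

```cpp
#include <cstdint>

// Bytes needed to store n_params values as 32-bit floats.
constexpr std::uint64_t fp32_weight_bytes(std::uint64_t n_params) {
    return n_params * sizeof(float);  // sizeof(float) is 4 on Arm Linux targets
}

// Convert a byte count to GiB (2^30 bytes).
constexpr double bytes_to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}

// 355,000,000 parameters * 4 bytes = 1,420,000,000 bytes, about 1.32 GiB.
```

Every generated token streams through most of these weights, which is one reason memory loads appear prominently alongside arithmetic in the instruction mix.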
|
|
||
| ## Review the source code | ||
|
|
||
| The `src/gpt2.cpp` file implements the end-to-end GPT-2 inference loop. Each generated token triggers a forward pass over all 24 transformer layers. Inside each layer, `matmul` is called multiple times: for the query/key/value projection, the attention output projection, and both feed-forward layers. It is called once more at the end for logits projection over the vocabulary: | ||
|
|
||
| ```cpp | ||
| // Attention QKV projection | ||
| matmul(s.qkv.data(), s.xb.data(), | ||
| w.c_attn_w.data()+(size_t)l*3*E*E, | ||
| w.c_attn_b.data()+(size_t)l*3*E, E, 3*E); | ||
|
|
||
| // FFN expand | ||
| matmul(s.mlp_h.data(), s.xb.data(), | ||
| w.mlp_fc_w.data()+(size_t)l*4*E*E, | ||
| w.mlp_fc_b.data()+(size_t)l*4*E, E, 4*E); | ||
|
|
||
| // Logits projection (vocab_size x n_embd) | ||
| matmul(s.logits.data(), s.x.data(), w.wte.data(), nullptr, E, cfg.vocab_size); | ||
| ``` | ||
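To get a feel for how much work these calls represent, the sketch below counts `matmul` calls and multiply-accumulate operations per generated token. It assumes the published GPT-2 Medium configuration (24 layers, embedding width `E` = 1024, vocabulary size 50257) and counts only the four projections per layer plus the logits projection described above. Attention score computation and other operations are deliberately left out. The struct and function names are illustrative and do not come from the repository.

```cpp
#include <cstdint>

// Model shape; values for GPT-2 Medium are supplied by the caller.
struct Gpt2Shape {
    std::uint64_t n_layer;
    std::uint64_t n_embd;      // E
    std::uint64_t vocab_size;
};

// Four projections per layer (QKV, attention output, FFN expand, FFN contract)
// plus one logits projection at the end.
inline std::uint64_t matmul_calls_per_token(const Gpt2Shape& s) {
    return 4 * s.n_layer + 1;
}

// Each matmul(out, x, W, b, n_in, n_out) performs n_in * n_out multiply-accumulates.
inline std::uint64_t macs_per_token(const Gpt2Shape& s) {
    const std::uint64_t E = s.n_embd;
    const std::uint64_t per_layer = E * 3 * E   // QKV projection
                                  + E * E       // attention output projection
                                  + E * 4 * E   // FFN expand
                                  + 4 * E * E;  // FFN contract
    return per_layer * s.n_layer + E * s.vocab_size;
}
```

With `Gpt2Shape{24, 1024, 50257}` this gives 97 calls and 353,453,056 multiply-accumulates per token, which is close to the model's parameter count because each weight is used once per token.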
|
|
||
| The `matmul` dispatch in `gpt2.cpp` selects a kernel at compile time based on a preprocessor flag: | ||
|
|
||
| ```cpp | ||
| static void matmul(float *out, const float *x, const float *W, const float *b, | ||
|                    int n_in, int n_out) { | ||
| #if defined(GPT2_KERNEL_NEON) | ||
|     kernels::matmul_neon(out, x, W, b, n_in, n_out); | ||
| #elif defined(GPT2_KERNEL_SVE) | ||
|     kernels::matmul_sve(out, x, W, b, n_in, n_out); | ||
| #elif defined(GPT2_KERNEL_USER) | ||
|     kernels::matmul_user(out, x, W, b, n_in, n_out); | ||
| #else | ||
|     kernels::matmul_ref(out, x, W, b, n_in, n_out); | ||
| #endif | ||
| } | ||
| ``` | ||
|
|
||
| The baseline kernel (`src/kernels/matmul_ref.cpp`) is a straightforward scalar nested for loop: for each output row, it walks the weight matrix row and accumulates a dot product with the input vector: | ||
|
|
||
| ```cpp | ||
| void matmul_ref(float *out, const float *x, const float *W, const float *b, | ||
|                 int n_in, int n_out) { | ||
|     for (int i = 0; i < n_out; i++) { | ||
|         float acc = b ? b[i] : 0.f; | ||
|         const float *row = W + (size_t)i * n_in; | ||
|         for (int j = 0; j < n_in; j++) acc += row[j] * x[j]; | ||
|         out[i] = acc; | ||
|     } | ||
| } | ||
| ``` | ||
|
|
||
| This scalar implementation can leave NEON and SVE vector units underused if the compiler cannot efficiently autovectorize it. Because `matmul` is called in every layer for every generated token (four projections in each of the 24 layers, plus the final logits projection), explicitly vectorizing this kernel puts SIMD execution where most of the compute time is spent. | ||
|
|
||
| ## Build and run the baseline | ||
|
|
||
| Configure and build the project with CMake. The project uses `-O2 -g`, which keeps optimization enabled while preserving debug symbols for profiling. | ||
|
|
||
| ```bash | ||
| cmake -S . -B build -DBUILD_USER_MATMUL=ON | ||
| cmake --build build --parallel | ||
| ``` | ||
|
|
||
| Run the scalar baseline binary: | ||
|
|
||
| ```bash | ||
| ./build/gpt2 --model gpt2-medium "Once upon a time" -n 20 | ||
| ``` | ||
|
|
||
|  | ||
|
|
||
| ## What you've learned and what's next | ||
|
|
||
| You now have a working baseline binary and model files. Next, you will use the Instruction Mix recipe in Arm Performix to inspect static disassembly and dynamic runtime behavior. |
68 changes: 68 additions & 0 deletions
...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-3.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| --- | ||
| title: Profile with instruction mix | ||
| weight: 4 | ||
|
|
||
| ### FIXED, DO NOT MODIFY | ||
| layout: learningpathall | ||
| --- | ||
|
|
||
| ## Find the code hotspot | ||
|
|
||
| Before you optimize, identify where the application spends most of its time. Use the Code Hotspots recipe to periodically sample the running application and build a profile of the functions that execute most often. | ||
|
|
||
| Open Arm Performix and select the **Code Hotspots** recipe. If this is your first run on the target, complete tool deployment as prompted. | ||
|
|
||
| Set the launch command to your baseline binary with the number of tokens (`-n`) set to 150, as shown in the command below. Generating 150 tokens keeps startup overhead, such as loading the model weights, small compared to inference time, so the profile is dominated by inference: | ||
|
|
||
| ```output | ||
| <path to repository>/build/gpt2 --model gpt2-medium "Once upon a time" -n 150 | ||
| ``` | ||
|
|
||
|  | ||
|
|
||
| The results show that `kernels::matmul_ref()` is the hottest function. Double-click the function to see which lines of source code the samples are attributed to. Most samples fall on the accumulate step of `kernels::matmul_ref()`. | ||
|
|
||
|  | ||
|
|
||
| This confirms that matrix multiplication is the highest-impact optimization target. | ||
|
|
||
| ## Assess compiler output | ||
|
|
||
| You can use online tools such as [Compiler Explorer](https://godbolt.org/) to see how this function is compiled with the `-O2 -g` flags. The example below uses `GCC 12.1.0`. You can check your installed compiler version with the `g++ --version` command and select the corresponding version from the Compiler Explorer drop-down menu. The generated assembly may differ slightly across compiler versions. | ||
|
|
||
| {{< godbolt width="100%" height="400px" mode="assembly" opt="-O2 -g" src="void matmul_ref(float *out, const float *x, const float *W, const float *b, int n_in, int n_out)\n{\n for (int i = 0; i < n_out; i++) {\n float acc = b ? b[i] : 0.f;\n const float *row = W + (unsigned long long)i * (unsigned long long)n_in;\n for (int j = 0; j < n_in; j++) {\n acc += row[j] * x[j];\n }\n out[i] = acc;\n }\n}" >}} | ||
|
|
||
| This view helps you spot missed vectorization opportunities. In a vectorized build, you would expect the accumulation step to use SIMD instructions, for example `fmla v0.4s, v3.4s, v2.4s`, which operates on 128-bit vector registers (`v0`, `v2`, and `v3` here) that each hold four FP32 lanes. However, assembly inspection has limitations. First, you need familiarity with SIMD mnemonics to recognize vectorized code. Second, this narrow snippet does not show whether changing compiler flags introduces regressions in other parts of the codebase. Third, and most importantly, this static view does not show which instructions in this function run most often on the CPU. | ||
|
|
||
| The Instruction Mix recipe helps fill this gap. | ||
|
|
||
| ## Configure the Instruction Mix recipe | ||
|
|
||
| Open Arm Performix and select the **Instruction Mix** recipe. If this is your first run on the target, complete tool deployment as prompted. | ||
| Set the launch command to your baseline binary with the same runtime arguments used for baseline testing: | ||
|
|
||
| ```output | ||
| </path/to/GPT-2-Example>/build/gpt2 --model gpt2-medium "Once upon a time" -n 150 | ||
| ``` | ||
|
|
||
| Use the same model and prompt arguments as your baseline terminal run so the measurements are comparable. | ||
|
|
||
|  | ||
|
|
||
| ### Analyze static disassembly | ||
|
|
||
| After the run completes, review static disassembly first. This view is ordered by percentage contribution and provides a high-level profile of the application’s generated instruction stream. It can help you identify broad characteristics, such as whether the code is branch-heavy, dominated by memory operations, or making effective use of SIMD instructions. Use this static view to understand overall code generation patterns rather than to attribute performance to specific functions or source lines. Dynamic analysis is typically more relevant for optimization because it reflects the instructions that are actually executed at runtime. | ||
|
|
||
|  | ||
|
|
||
| ### Dynamic analysis | ||
|
|
||
| Then inspect the dynamic analysis bar chart to see where sampled runtime work is concentrated. Dynamic data is typically more useful for optimization because it reflects actual execution behavior for your input, runtime settings, and call frequencies. | ||
|
|
||
|  | ||
|
|
||
| Finally, in the dynamic functions view, you can break down operation types by individual function. This is particularly useful when no single function dominates the profile, because it lets you inspect dynamic instruction patterns for specific functions. | ||
|
|
||
| ## What you've learned and what's next | ||
|
|
||
| You used Instruction Mix to confirm that baseline runtime is dominated by scalar-heavy `matmul` execution. Next, you will compare updated instruction mix and throughput across scalar, NEON, SVE, and KleidiAI variants. |
71 changes: 71 additions & 0 deletions
...earning-paths/servers-and-cloud-computing/performix-instruction-mix/how-to-4.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| --- | ||
| title: Optimize | ||
| weight: 5 | ||
|
|
||
| ### FIXED, DO NOT MODIFY | ||
| layout: learningpathall | ||
| --- | ||
|
|
||
| ## Complete the challenge (optional) | ||
|
|
||
| In this project, `src/kernels/matmul_user.cpp` is your editable implementation file. The baseline behavior in this file is scalar, and the build uses `-O2 -g`, so compiler optimization is enabled but vector hardware is still underused in the hot loop. | ||
|
|
||
| Use the profiling evidence from Performix to implement your own NEON or SVE intrinsics in `src/kernels/matmul_user.cpp`, then rebuild and profile `gpt2_user`. | ||
|
|
||
| {{% notice Hint %}} | ||
|
|
||
| Focus on the accumulation loop in `matmul_user` (`acc += row[j] * x[j];`). Think about lane utilization, loop unrolling, and handling the tail when the input width is not an exact multiple of the vector width. | ||
|
|
||
| {{% /notice %}} | ||
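One possible starting point for the NEON path is sketched below. It is a minimal illustration, not the repository's `matmul_neon` solution, and the name `matmul_neon_sketch` is hypothetical. It processes four FP32 lanes per iteration with `vfmaq_f32`, reduces the lanes with `vaddvq_f32`, and finishes the tail with the scalar loop. On non-AArch64 hosts the guarded section is skipped, so the function falls back to the scalar loop and still compiles.

```cpp
#include <cstddef>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical name. Same argument order as matmul_ref in this Learning Path.
void matmul_neon_sketch(float *out, const float *x, const float *W, const float *b,
                        int n_in, int n_out) {
    for (int i = 0; i < n_out; i++) {
        const float *row = W + (size_t)i * n_in;
        float acc = b ? b[i] : 0.f;
        int j = 0;
#if defined(__aarch64__)
        // Four partial sums, one per 32-bit lane of a 128-bit register.
        float32x4_t vacc = vdupq_n_f32(0.f);
        for (; j + 4 <= n_in; j += 4) {
            // vacc += row[j..j+3] * x[j..j+3], lane by lane.
            vacc = vfmaq_f32(vacc, vld1q_f32(row + j), vld1q_f32(x + j));
        }
        // Horizontal add of the four lanes into the scalar accumulator.
        acc += vaddvq_f32(vacc);
#endif
        // Tail: remaining elements when n_in is not a multiple of 4
        // (or the whole row on non-AArch64 hosts).
        for (; j < n_in; j++) acc += row[j] * x[j];
        out[i] = acc;
    }
}
```

This version uses a single vector accumulator. Unrolling the loop with several independent accumulators is the natural next step, because it lets consecutive fused multiply-adds overlap instead of waiting on one another.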
|
|
||
| Rebuild after your edits: | ||
|
|
||
| ```bash | ||
| cmake -S . -B build -DBUILD_USER_MATMUL=ON | ||
| cmake --build build --parallel | ||
| ``` | ||
|
|
||
| Then profile the `build/gpt2_user` binary with the same runtime arguments and compare the Instruction Mix and throughput against baseline. | ||
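Before comparing throughput, it can be worth checking that your kernel still produces the same results as the scalar reference. Vectorized code adds values in a different order, so small floating-point differences are expected and an exact comparison is too strict. The helper below is a hypothetical sketch that is not part of the repository: it fills deterministic inputs, runs two kernels with the `matmul` signature used in this Learning Path, and returns the largest absolute difference.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Function pointer type matching the matmul kernels shown in this Learning Path.
using MatmulFn = void (*)(float *, const float *, const float *, const float *, int, int);

// Hypothetical helper: largest absolute output difference between two kernels.
float max_abs_diff(MatmulFn ref, MatmulFn candidate, int n_in, int n_out) {
    std::vector<float> x(n_in), W((size_t)n_in * n_out), b(n_out);
    // Deterministic, small-magnitude inputs so runs are repeatable.
    for (int j = 0; j < n_in; j++) x[j] = 0.01f * (float)((j % 7) - 3);
    for (size_t k = 0; k < W.size(); k++) W[k] = 0.001f * (float)((int)(k % 11) - 5);
    for (int i = 0; i < n_out; i++) b[i] = 0.1f * (float)(i % 3);

    std::vector<float> out_ref(n_out), out_cand(n_out);
    ref(out_ref.data(), x.data(), W.data(), b.data(), n_in, n_out);
    candidate(out_cand.data(), x.data(), W.data(), b.data(), n_in, n_out);

    float worst = 0.f;
    for (int i = 0; i < n_out; i++)
        worst = std::max(worst, std::fabs(out_ref[i] - out_cand[i]));
    return worst;
}
```

Assuming both kernels are declared in scope, a call such as `max_abs_diff(kernels::matmul_ref, kernels::matmul_user, 1024, 4096)` should return a value close to zero. A large value suggests a bug in the tail handling or the lane reduction.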
|
|
||
| Example solutions are available in: | ||
|
|
||
| - `src/kernels/matmul_neon.cpp` | ||
| - `src/kernels/matmul_sve.cpp` | ||
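Before opening the reference solutions, it may help to see how SVE predication changes the shape of the loop. The sketch below is an illustration under stated assumptions, not the code in `matmul_sve.cpp`, and the name `matmul_sve_sketch` is hypothetical. The SVE path is compiled only when the compiler defines `__ARM_FEATURE_SVE` (for example, when building with an `-march` value that enables SVE). Otherwise the function falls back to the scalar loop.

```cpp
#include <cstddef>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical name. Same argument order as matmul_ref in this Learning Path.
void matmul_sve_sketch(float *out, const float *x, const float *W, const float *b,
                       int n_in, int n_out) {
    for (int i = 0; i < n_out; i++) {
        const float *row = W + (size_t)i * n_in;
        float acc = b ? b[i] : 0.f;
#if defined(__ARM_FEATURE_SVE)
        svfloat32_t vacc = svdup_n_f32(0.f);
        // svcntw() is the number of 32-bit lanes, which depends on the hardware.
        for (int j = 0; j < n_in; j += (int)svcntw()) {
            // The predicate is true only for lanes with j + lane < n_in,
            // so the final partial vector needs no separate tail loop.
            svbool_t pg = svwhilelt_b32(j, n_in);
            svfloat32_t r = svld1_f32(pg, row + j);
            svfloat32_t v = svld1_f32(pg, x + j);
            // Inactive lanes keep their previous vacc value.
            vacc = svmla_f32_m(pg, vacc, r, v);
        }
        acc += svaddv_f32(svptrue_b32(), vacc);
#else
        // Scalar fallback when SVE is not enabled at compile time.
        for (int j = 0; j < n_in; j++) acc += row[j] * x[j];
#endif
        out[i] = acc;
    }
}
```

Because the loop never hard-codes a vector width, the same binary adapts to the vector length of the hardware it runs on. This is the main structural difference from the fixed 128-bit NEON approach.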
|
|
||
| You can use `AGENTS.md` in the GPT-2 example repository for guided learning support. | ||
|
|
||
| ### Use the Arm MCP Server with Performix (optional) | ||
|
|
||
| You can also use an MCP-compatible coding assistant, such as GitHub Copilot or Codex, with the Arm MCP Server. This gives the assistant direct tool access to run Performix recipes on your remote Arm target and create a faster feedback loop while you iterate on `matmul_user`. | ||
|
|
||
| For setup details, see [Automate x86-to-Arm application migration using Arm MCP Server](/learning-paths/servers-and-cloud-computing/arm-mcp-server/). | ||
|
|
||
| Install Docker if needed, then pull the MCP server image: | ||
|
|
||
| ```bash | ||
| docker pull armlimited/arm-mcp:latest | ||
| ``` | ||
|
|
||
| To allow Performix access to remote targets from inside the container, mount your workspace plus SSH key and known hosts in your Codex MCP configuration (example `~/.codex/config.toml`): | ||
|
|
||
| ```output | ||
| [mcp_servers.arm-mcp] | ||
| command = "docker" | ||
| args = [ | ||
| "run", | ||
| "--rm", | ||
| "-i", | ||
| "-v", "/path/to/your/workspace:/workspace", | ||
| "-v", "/path/to/your/ssh/private_key:/run/keys/ssh-key.pem:ro", | ||
| "-v", "/path/to/your/ssh/known_hosts:/run/keys/known_hosts:ro", | ||
| "armlimited/arm-mcp" | ||
| ] | ||
| ``` | ||
|
|
||
| Restart your coding assistant, then prompt it to run Performix Instruction Mix and Code Hotspots on your `gpt2_user` binary and suggest Arm intrinsics improvements. | ||
|
|
||
|  | ||
|
|
||
| ## What you've learned and what's next | ||
|
|
||
| In this optional section, you implemented and profiled a custom `matmul_user` kernel using the same workflow you used for baseline analysis. Next, you will compare instruction mix and throughput across scalar, NEON, SVE, and KleidiAI variants. |