diff --git a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/1-overview.md b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/1-overview.md
index ab0f43dd92..8ef8f8f9e0 100644
--- a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/1-overview.md
+++ b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/1-overview.md
@@ -1,5 +1,5 @@
---
-title: Overview of the Adler-32 algorithm and optimization approach
+title: Understand the Adler-32 algorithm and optimization approach
weight: 2

### FIXED, DO NOT MODIFY
@@ -8,24 +8,24 @@ layout: learningpathall

## The optimization task

-You'll take a simple scalar implementation of the Adler-32 checksum algorithm written in C and incrementally optimize it to use Arm Scalable Vector Extension (SVE) intrinsics. The final SVE version runs significantly faster than the original scalar code on Neoverse processors.
+In this Learning Path, you'll take a simple scalar implementation of the Adler-32 checksum algorithm written in C and incrementally optimize it to use Arm Scalable Vector Extension (SVE) intrinsics. The final SVE version runs significantly faster than the original scalar code on Neoverse processors.

-What makes this Learning Path different from a typical optimization tutorial is how you'll get there. Rather than being handed a finished SVE implementation, you'll use an AI coding assistant connected to the Arm MCP server to guide you through each step. You'll ask questions, look up intrinsics, understand the algorithm's constraints, and build the solution piece by piece.
+This Learning Path is different from a typical optimization tutorial. Rather than starting with a finished SVE implementation, you'll use an AI coding assistant connected to the Arm MCP server to guide you through each step. You'll ask questions, look up intrinsics, understand the algorithm's constraints, and build the solution piece by piece.

-AI coding assistants are not yet able to automatically generate optimized code, but you can use them to guide your learning and the implementation details. This way, you can maintain and explain the code and arrive at optimized solutions. This process mirrors what you'd do on your own projects.
+AI coding assistants are not yet able to reliably generate optimized code on their own, but you can use them to guide your learning and to work through the implementation details. By working this way, you can maintain and explain the code and arrive at optimized solutions, mirroring what you'd do on your own projects.

## The Adler-32 algorithm

-Adler-32 is a checksum algorithm used to verify data integrity. It is used in the zlib compression format. It's fast, simple, and a good candidate for vectorization because its inner loop processes one byte at a time.
+Adler-32 is a checksum algorithm used to verify data integrity. It's used in the zlib compression format. The algorithm is fast, simple, and a good candidate for vectorization because its inner loop processes one byte at a time.

The algorithm maintains two 16-bit accumulators, `A` and `B`:

- `A` starts at 1 and accumulates the sum of all input bytes
- `B` accumulates the running sum of all `A` values

-Both are taken modulo 65521, the largest prime smaller than 2^16. The final checksum is `(B << 16) | A`.
+Both are taken modulo 65521, the largest prime number smaller than 2^16. The final checksum is `(B << 16) | A`.
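+
+For example, for the 9-byte ASCII string `Wikipedia`, the byte values sum to 919, so `A = 1 + 919 = 920` (0x0398). The running values of `A` after each byte (88, 193, 300, 405, 517, 618, 718, 823, 920) sum to `B = 4582` (0x11E6), giving a final checksum of `(0x11E6 << 16) | 0x0398 = 0x11E60398`. Neither sum reaches 65521 for an input this short, so the modulo never triggers; working a small input through by hand like this is a handy cross-check for the optimized versions you build later.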
-The scalar implementation is straightforward:
+The scalar implementation is as follows:

```c
#define MOD_ADLER 65521

uint32_t adler32(const uint8_t *data, size_t len)
@@ -46,28 +46,20 @@ This loop has two characteristics that make it interesting to vectorize:

-- The `a` accumulator is a simple sum that parallelizes well
+- The `a` accumulator is a sum that parallelizes well
- The `b` accumulator depends on the running value of `a` after each byte, which makes it harder to vectorize

You'll learn how SVE intrinsics solve both of these challenges.

## The role of the Arm MCP server

-The Arm MCP server gives your AI coding assistant access to Arm-specific knowledge, including the full SVE intrinsics reference. When you ask about specific intrinsics like `svdot` or `svwhilelt`, the assistant queries the MCP server and returns the exact function signature, pseudocode, and required compiler flags.
+The Arm MCP server gives your AI coding assistant access to Arm-specific knowledge, including the full SVE intrinsics reference. When you ask about specific intrinsics such as `svdot` or `svwhilelt`, the assistant queries the MCP server and returns the exact function signature, pseudocode, and required compiler flags.

-This means you don't need to keep opening the intrinsics reference material. You can ask questions in plain language and get precise, actionable answers grounded in the actual Arm documentation.
+This means you don't need to keep referring to the intrinsics reference material. You can ask questions in plain language and get precise, actionable answers grounded in Arm documentation.

-## Outline of each section

-Each section follows a consistent pattern:
+## What you've learned and what's next

-1. A short explanation of what you need to understand at this stage
-2. Suggested prompts to ask your AI assistant
-3. An explanation of what to look for in the response
-4. The code or configuration changes that result from the conversation
+You now understand the Adler-32 algorithm and how you can use an AI assistant with the Arm MCP server to optimize it with SVE intrinsics.

-You can follow along exactly, or adapt the prompts to your own style. The goal is to learn the process of using an AI assistant to apply SVE and achieve improved performance.

-## What's next
-
-Start by setting up the project and establishing a performance baseline for the scalar implementation. The baseline is required before you can measure any improvement.
+Next, you'll set up the project and establish a performance baseline for the scalar implementation.

diff --git a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/2-baseline.md b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/2-baseline.md
index c41a9ed1ce..7a638009ee 100644
--- a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/2-baseline.md
+++ b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/2-baseline.md
@@ -8,15 +8,15 @@ layout: learningpathall

## Before you begin

-To get started, you need an Arm Linux system with SVE support. Suitable cloud instances include AWS Graviton3 or Graviton4, Microsoft Cobalt 100, or Google Axion. The examples in this Learning Path were tested on Ubuntu 26.04.
+To get started, you need an Arm Linux system with SVE support. Suitable cloud instances include those built on AWS Graviton3 or Graviton4, Microsoft Cobalt 100, or Google Axion. The examples in this Learning Path were tested on Ubuntu 26.04.

-You also need an AI coding assistant with the Arm MCP server configured.
Supported assistants include [GitHub Copilot](/install-guides/github-copilot/), [Kiro CLI](/install-guides/kiro-cli/), [Claude Code](/install-guides/claude-code/), [Gemini CLI](/install-guides/gemini/), and [Codex CLI](/install-guides/codex-cli/). See the [Arm MCP server Learning Path](/learning-paths/servers-and-cloud-computing/arm-mcp-server/) for setup instructions.
+You also need an AI coding assistant with the Arm MCP server configured. Supported assistants include [GitHub Copilot](/install-guides/github-copilot/), [Kiro CLI](/install-guides/kiro-cli/), [Claude Code](/install-guides/claude-code/), [Gemini CLI](/install-guides/gemini/), and [Codex CLI](/install-guides/codex-cli/). For setup instructions, see the [Arm MCP server Learning Path](/learning-paths/servers-and-cloud-computing/arm-mcp-server/).

{{< notice Note >}}
-The AI responses shown are samples. Your AI assistant may word responses differently, include more or less detail, or structure the output differently depending on the tool and model you are using. Focus on the key concepts rather than the exact wording.
+The AI responses shown are samples. Your AI assistant's responses will vary in wording, detail, and structure depending on the tool and model you use. Focus on the key concepts rather than the exact wording.
{{< /notice >}}

-Start by installing the required software and check your system includes SVE.
+Start by installing the required software and checking that your system includes SVE.

Install GCC and GNU Make:

@@ -40,7 +40,7 @@ If `sve` does not appear, the system does not support SVE and the final implemen

## Create the project files

-On your Arm Neoverse system, create a working directory and add the source files.
+On your Arm Neoverse system, create a working directory and add the source files:

```bash
mkdir adler32-sve && cd adler32-sve
@@ -177,10 +177,12 @@ clean:

The `-mcpu=native` flag tells GCC to optimize for the exact CPU you're running on, which enables SVE code generation on Neoverse processors that have SVE.

-### ASK AI: about compiler flags
+### Ask AI about compiler flags

Before running anything, ask your AI assistant to confirm that your build setup is correct for SVE.

+Your prompt can be similar to:
+
```text
My Makefile uses `-O3 -mcpu=native`. Does this enable SVE code generation on a Neoverse processor? Do I need any special flags for SVE intrinsics?
```
@@ -213,11 +215,9 @@ For more on SVE programming, Arm has a good learning path: Port Code to Arm SVE
(https://learn.arm.com/learning-paths/servers-and-cloud-computing/sve/).
```

-The response explains that `-mcpu=native` enables SVE. It also provides useful info about running on other systems and confirm special flags, such as `-march=armv8-a+sve` are not needed. The response also tells you to include `<arm_sve.h>`.
-
-You also notice the reference to a Learning Path about SVE at the end. This confirms the Arm MCP server is consulted on answering the question.
+The response explains that `-mcpu=native` enables SVE and provides useful information about running on other systems. It confirms that special flags, such as `-march=armv8-a+sve`, are not needed and also tells you to include `<arm_sve.h>`. All of this is information you'll need when you create the SVE source later.

-This is good information you'll need when you create the SVE source later.
+The SVE Learning Path reference at the end of the response confirms that the assistant used the Arm MCP server to answer your question.
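+
+Before building, you can optionally confirm from code that the compiler sees SVE. The short probe below is a supplementary sketch (it isn't one of the Learning Path's project files); it checks the ACLE feature macro `__ARM_FEATURE_SVE`, which GCC defines when SVE code generation is enabled:
+
+```c
+#include <stdio.h>
+
+int main(void)
+{
+#ifdef __ARM_FEATURE_SVE
+    /* Defined by the compiler when SVE is enabled, for example by -mcpu=native on an SVE-capable Neoverse core */
+    printf("SVE is enabled by the compiler flags\n");
+#else
+    printf("SVE is not enabled; revisit your -mcpu or -march flags\n");
+#endif
+    return 0;
+}
+```
+
+Compile the probe with the same `-O3 -mcpu=native` flags the Makefile uses; if the second message prints, fix your flags before continuing.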
## Build and run the baseline @@ -240,11 +240,11 @@ Performance: 10 MB 10485760 bytes 10 iters 262.388 ms 381.1 MB/s checksum=0x285FF1B1 ``` -Your numbers will differ depending on your specific Neoverse processor and memory configuration. Make a note of the MB/s values for the 1 MB and 10 MB cases, as these are your baseline numbers to compare against after each optimization. +Your numbers will differ depending on your specific Neoverse processor and memory configuration. Note the MB/s values for the 1 MB and 10 MB cases, as these are your baseline numbers to compare against after each optimization. -### ASK AI: about auto-vectorization +### Ask AI about auto-vectorization -Now ask your AI assistant a question that many developers wonder about: +Ask your AI assistant about auto-vectorization. Your prompt can be similar to: ```text Can GCC auto-vectorize my adler32 function with SVE if I just use `-mcpu=native`? What would prevent auto-vectorization? @@ -277,16 +277,12 @@ No, GCC cannot auto-vectorize your adler32 function. It tried every vector mode - Break the dependency — use vector lanes to accumulate a and b contributions independently, then reduce at the end. ``` -The response explains that the modulo operation in every iteration (`% MOD_ADLER`) is the main blocker. The compiler can't easily prove that the intermediate values won't overflow in a way that changes the result when operations are reordered. The loop-carried dependency between iterations also makes it difficult. - -Since auto-vectorization won't work, you need to restructure the algorithm before SVE can be applied effectively. The restructuring is explained in the next two sections. +The response explains that the modulo operation in every iteration (`% MOD_ADLER`) is the main blocker. The compiler can't easily prove that the intermediate values won't overflow in a way that changes the result when operations are reordered. The loop-carried dependency between iterations also makes it difficult to auto-vectorize. -## What you've learned and what's next +Because auto-vectorization won't work, you need to restructure the algorithm before you can apply SVE effectively. You'll learn more about the restructuring in the next two sections. -In this section: +## What you've accomplished and what's next -- You created the scalar Adler-32 implementation and benchmark harness -- You recorded your baseline performance numbers -- You learned that auto-vectorization won't work +You've now created the scalar Adler-32 implementation and benchmark harness, and recorded your baseline performance numbers. Using the Arm MCP server and your AI assistant of choice, you've learned that auto-vectorization won't work. -In the next section, you'll use the Arm MCP server to learn the core SVE concepts you need before writing any intrinsics code. +In the next section, you'll use your AI assistant and the Arm MCP server to learn core SVE concepts before writing intrinsics code. diff --git a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/3-sve-concepts.md b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/3-sve-concepts.md index ceb7705876..6dd34e23d9 100644 --- a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/3-sve-concepts.md +++ b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/3-sve-concepts.md @@ -8,23 +8,23 @@ layout: learningpathall ## SVE concepts you need before writing code -SVE is different from fixed width SIMD like Neon. The vector length is not fixed at compile time. 
It is determined at runtime by the hardware. This means you can't write `for (i = 0; i < n; i += 16)` and assume you're processing 16 bytes per iteration. SVE code must be vector length agnostic (VLA) to work correctly on any processor with SVE support.
+SVE is different from fixed-width SIMD such as Neon. The vector length is not fixed at compile time and is determined at runtime by the hardware. This means you can't write `for (i = 0; i < n; i += 16)` and assume you're processing 16 bytes per iteration. SVE code must be Vector Length Agnostic (VLA) to work correctly on any processor with SVE support.

Before writing SVE intrinsics, it's helpful to understand three things:

-1. How SVE predicates control which elements are active
-2. How to handle loop tails when data length isn't a multiple of the vector length
-3. How to widen narrow data types for accumulation
+- How SVE predicates control which elements are active
+- How to handle loop tails when data length isn't a multiple of the vector length
+- How to widen narrow data types for accumulation

-The Arm MCP server is the right tool for this. Ask your AI assistant the questions below and read the responses. The goal isn't to memorize intrinsic names, but to understand the concepts well enough to recognize when you need each one.
+Use the Arm MCP server with your AI assistant to ask the following questions and read the responses. By doing this, you can understand the concepts well enough to recognize when you need each one, without needing to memorize intrinsic names.

## Comparing SVE and Neon

Start with understanding the big picture about SVE.

-### ASK AI: about SVE versus Neon
+### Ask AI about SVE versus Neon

-Ask your assistant:
+Ask your assistant the following question. Your prompt can be similar to:

```text
Ask the Arm MCP server what is SVE and how does it differ from Neon? My Makefile targets the native CPU on a Neoverse processor.
```
@@ -80,19 +80,17 @@ Here's what the Arm knowledge base says:
(https://learn.arm.com/learning-paths/servers-and-cloud-computing/bitmap_scan_sve2/)
```

-The response explains that Neon uses fixed 128 bit vectors, while SVE uses vectors of variable length (a multiple of 128 bits, from 128 to 2048 bits). Neoverse N2 and Neoverse V2 support SVE with 128 bit vectors. Neoverse V1 supports SVE with 256 bit vectors. The key point is that your code doesn't need to know the vector length at compile time. SVE intrinsics handle it at runtime.
+The response explains that Neon uses fixed 128-bit vectors, while SVE uses vectors of variable length (a multiple of 128 bits, from 128 to 2048 bits). Neoverse N2 and Neoverse V2 support SVE with 128-bit vectors. Neoverse V1 supports SVE with 256-bit vectors. The key point is that your code doesn't need to know the vector length at compile time. SVE intrinsics handle it at runtime.

## Predicates and loop tails

-Predicates and loop tails may be new to software developers.
+In fixed-width SIMD such as Neon, every lane in a vector always participates in every operation. This works fine when your data length is a multiple of the vector width. However, it forces you to write special-case scalar code to handle the leftover elements at the end of a loop. With SVE's variable vector length, you don't know the vector width at compile time, so that approach breaks down entirely.

-In fixed-width SIMD like Neon, every lane in a vector always participates in every operation.
That works fine when your data length is a multiple of the vector width, but it forces you to write special-case scalar code to handle the leftover elements at the end of a loop. With SVE's variable vector length, you don't even know the vector width at compile time, so that approach breaks down entirely.

+SVE solves this with predicate registers. A predicate is a bitmask with one bit per vector element. Each bit controls whether the corresponding lane is active or inactive for a given operation. Inactive lanes are ignored: they don't load memory, don't compute, and don't write results. This lets you run the same vector code on the final partial chunk of data as on every full chunk before it. You don't need a special-case tail loop.

-SVE solves this with predicate registers. A predicate is a bitmask with one bit per vector element. Each bit controls whether the corresponding lane is active or inactive for a given operation. Inactive lanes are ignored: they don't load memory, don't compute, and don't write results. This lets you run the same vector code on the final partial chunk of data as on every full chunk before it, there is no special-case tail loop needed.

+### Ask AI how predicate registers work

-### ASK AI: about how predicate registers work
-
-Ask your assistant:
+Ask your assistant the following question. Your prompt can be similar to:

```text
Masking with predicate registers seems like a key concept in SVE. How does it work to handle loops when my data length isn't a multiple of the vector length?
```
@@ -165,7 +163,7 @@ A sample response is:
The response explains that the `svwhilelt_b8(i, n)` intrinsic creates a predicate where element `k` is active if `i + k < n`. This handles the loop tail automatically. When you're near the end of the data, the predicate deactivates the elements that would go out of bounds.

-A typical SVE loop looks like this:
+An SVE loop usually looks like this:

```c
uint64_t vl = svcntb(); // vector length in bytes, determined at runtime
@@ -182,9 +180,9 @@ The loop body runs even for the final partial vector. The predicate ensures only
For Adler-32, you're loading `uint8_t` bytes but accumulating into `uint32_t` sums.

-### ASK AI: about widening
+### Ask AI about widening

-Ask your assistant:
+Ask your assistant the following question. Your prompt can be similar to:

```text
The adler32 loop accumulates uint8_t values into a uint32_t sum. How does SVE handle widening from 8 bit to 32 bit elements?
```
@@ -256,7 +254,7 @@ Loading uint8_t values
trickiest part.
```

-The response introduces `svld1_u8` for loading bytes, and explains that SVE doesn't have a single "load and widen" intrinsic. Instead, you use `svld1ub_u32` to load bytes and extend them with zeroes directly into a 32 bit vector. This is the right approach for Adler-32 because your accumulators are 32 bit.
+The response introduces `svld1_u8` for loading bytes, and explains that SVE doesn't have a single "load and widen" intrinsic. Instead, you use `svld1ub_u32` to load bytes and extend them with zeroes directly into a 32-bit vector. This is the right approach for Adler-32 because your accumulators are 32-bit.

The Arm MCP server will return the exact signature:

```c
svuint32_t svld1ub_u32(svbool_t pg, const uint8_t *base);
```

-This loads one byte per active 32 bit lane and extends each byte with zeroes to 32 bits. On a processor with 256 bit SVE vectors, this loads 8 bytes per iteration (8 lanes × 32 bits = 256 bits).
+This loads one byte per active 32-bit lane and extends each byte with zeroes to 32 bits. On a processor with 256-bit SVE vectors, this loads 8 bytes per iteration (8 lanes × 32 bits = 256 bits).

## The dot product intrinsic

-You may see things in the AI assistant responses you don't understand. You can continue to ask for more explanation until you are totally comfortable. One of the responses above implies svdot is commonly used for arithmetic. You can ask more for information about what it does.
+If the AI assistant response includes something you don't understand, you can continue to ask for more explanation. For example, one of the previous responses implies `svdot` is commonly used for arithmetic. You can ask for more information about what it does.

-### ASK AI: about the svdot instruction
+### Ask AI about the svdot intrinsic

-Ask your assistant:
+Ask your assistant the following questions. Your prompt can be similar to:

```text
What is the svdot intrinsic? How does it differ from a simple multiply and accumulate operation?
```
@@ -333,7 +331,7 @@ A sample output is:
just acc[i] += bytes[4i+0] + bytes[4i+1] + bytes[4i+2] + bytes[4i+3].
```

-The response explains that `svdot` computes a dot product between two vectors of narrow elements and accumulates the result into a vector of wider elements. For example, `svdot_u32` takes two `svuint8_t` vectors and a `svuint32_t` accumulator, multiplying corresponding 8 bit elements and adding the products into 32 bit lanes.
+The response explains that `svdot` computes a dot product between two vectors of narrow elements and accumulates the result into a vector of wider elements. For example, `svdot_u32` takes two `svuint8_t` vectors and a `svuint32_t` accumulator, multiplying corresponding 8-bit elements and adding the products into 32-bit lanes.

The signature is:
@@ -344,15 +342,11 @@ svuint32_t svdot_u32(svuint32_t op1, svuint8_t op2, svuint8_t op3);

You'll use this to compute the weighted sum for the `B` accumulator. The reason will become clear in the SVE implementation section.

{{< notice Note >}}
-You may notice your AI assistant asking to just create the code for you. Resist the urge to say yes and continue to ask questions and understand the theory of operation. You need to do this to get a functional result with the best performance. You also need to be able to explain and maintain the code so it's worth the extra time to learn how SVE works.
+You might notice your AI assistant offering to create the code for you. Resist the urge to say yes and continue to ask questions and understand the theory of operation. You need to do this to get a functional result with the best performance. You also need to be able to explain and maintain the code, so it's worth the extra time to learn how SVE works.
{{< /notice >}}

## What you've learned and what's next

-In this section:
-
-1. You learned how SVE predicates handle loop tails without special case code
-2. You investigated intrinsics for loading bytes and dot product and learned how to ask for more details
-3. You understand that SVE code must be vector length agnostic
+You've now learned how SVE predicates handle loop tails without special-case code. You've investigated intrinsics for loading bytes and computing dot products, and learned how to ask for more details. You also understand that SVE code must be Vector Length Agnostic.

-Before you can use these intrinsics effectively, you need to restructure the Adler-32 algorithm to remove the modulo operation on each byte. That's the subject of the next section.
+Next, you'll restructure the Adler-32 algorithm to remove the modulo operation on each byte. This is necessary to use these intrinsics effectively.
diff --git a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/4-nmax-optimization.md b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/4-nmax-optimization.md
index f84e3733bb..55173d1813 100644
--- a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/4-nmax-optimization.md
+++ b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/4-nmax-optimization.md
@@ -19,11 +19,11 @@ for (size_t i = 0; i < len; i++) {
}
```

The `% MOD_ADLER` operation runs on every single byte. Division is expensive, and doing it 10 million times for a 10 MB buffer is a significant cost. More importantly, it prevents vectorization because each iteration depends on the modulo-reduced result of the previous one.

-The standard solution is to defer the modulo. Of course, you might not see this immediately, but may be able to ask a question about optimizing Adler-32.
+The standard solution is to defer the modulo. You can ask your AI assistant how to optimize Adler-32.

-### ASK AI: about the cost of modulo operations
+### Ask AI about the cost of modulo operations

-Ask your assistant:
+Ask your assistant the following question. Your prompt can be similar to:

```text
Are there any common techniques to optimize adler-32 and reduce modulo operations?
```
@@ -133,11 +133,11 @@ uint32_t adler32(const uint8_t *data, size_t len)
}
```

-The structure is now an outer loop that processes NMAX-byte blocks, and an inner loop with no modulo at all. The modulo only runs once per 5552 bytes instead of once per byte.
+The structure is now an outer loop that processes NMAX-byte blocks, and an inner loop with no modulo at all. The modulo runs only once per 5552 bytes instead of once per byte.

## Update the Makefile to test the NMAX version

-Update your `Makefile` to make it easy to switch between implementations:
+Update your `Makefile` to make it easy to switch from `adler32-simple.c` to `adler32-nmax.c`:

```makefile
CC = gcc
@@ -161,7 +161,7 @@ clean:
.PHONY: run clean
```

-Edit the Makefile to use `adler32-nmax.c` and build and run with the NMAX version:
+Build and run with the NMAX version:

```bash
make clean && make run
```
@@ -180,12 +180,10 @@ Performance:
10 MB 10485760 bytes 10 iters 50.097 ms 1996.1 MB/s checksum=0x285FF1B1
```

-This is a substantial improvement over the original scalar version, achieved simply by removing the per-byte modulo. Make a note of these numbers as your new intermediate baseline.
+This is a substantial improvement over the original scalar version, achieved by removing the per-byte modulo. Make a note of these numbers as your intermediate baseline.

-In this section:
+## What you've accomplished and what's next

-- You learned why deferring the modulo is safe and how to calculate the NMAX bound
-- You implemented the scalar NMAX optimization and measured a significant speedup
-- You now have a clean inner loop with no modulo which is the right structure for SVE vectorization
+You now have a clean inner loop with no modulo, which is the right structure for SVE vectorization. You learned why deferring the modulo is safe and how to calculate the NMAX bound. You then implemented the scalar NMAX optimization and measured a significant speedup.

-The inner loop of the NMAX version is now a simple accumulation loop. In the next section, you'll vectorize it with SVE intrinsics.
+The inner loop of the NMAX version is now a simple accumulation loop. Next, you'll vectorize it with SVE intrinsics.
diff --git a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/5-sve-implementation.md b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/5-sve-implementation.md
index b2bc59fde9..f8533e0c1d 100644
--- a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/5-sve-implementation.md
+++ b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/5-sve-implementation.md
@@ -17,13 +17,13 @@ for (size_t i = 0; i < n; i++) {
}
```

-Vectorizing the `a` accumulator is straightforward: load a vector of bytes, sum them all, add to `a`. The `b` accumulator is harder. Each byte's contribution to `b` depends on how many bytes come after it in the block. If you process N bytes at once, `data[0]` contributes N times to `b`, `data[1]` contributes N-1 times, and so on.
+Vectorizing the `a` accumulator is straightforward: load a vector of bytes, sum them all, add to `a`. The `b` accumulator is harder to vectorize. Each byte's contribution to `b` depends on how many bytes come after it in the block. If you process N bytes at once, `data[0]` contributes N times to `b`, `data[1]` contributes N-1 times, and so on.

Ask your AI assistant to help you think through this.

-### ASK AI: about how to vectorize the loop
+### Ask AI how to vectorize the loop

-Ask your assistant:
+Ask your assistant the following question. Your prompt can be similar to:

```text
How can I vectorize the inner loop of the NMAX version using SVE? Provide a detailed explanation for how to do it and teach me about the intrinsics used.
```
@@ -171,7 +171,7 @@ This is the skeleton. The full implementation requires careful handling of the w
You can continue learning by asking questions and coding. You can also use your AI assistant to check your code and explain it.

-It's unlikely that just asking your assistant to write the code using SVE intrinsics will function correctly with best performance.
+Asking your assistant to generate the full SVE implementation directly, without the guided learning steps, is unlikely to produce correct or well-optimized code.

## The complete SVE implementation
@@ -290,12 +290,8 @@ Performance:
10 MB 10485760 bytes 10 iters 4.743 ms 21084.6 MB/s checksum=0x649EF1B1
```

-## What you've learned and what's next
+## What you've accomplished and what's next

-In this section:
+You've now learned how to use `svindex_u32` to create position-weight vectors. You used `svdot` to compute the weighted sum for the `b` accumulator and built a complete, vector-length-agnostic SVE implementation.

-- You learned how to use `svindex_u32` to create position-weight vectors
-- You used `svdot` to compute the weighted sum for the `b` accumulator
-- You built a complete, vector-length-agnostic SVE implementation
-
-In the final section, you'll benchmark the SVE version against the scalar and NMAX baselines, and look at the generated assembly to understand what the CPU is actually executing.
+Next, you'll benchmark the SVE version against the scalar and NMAX baselines, and look at the generated assembly to understand what the CPU is actually executing.
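+
+One more check before you benchmark: confirm that the SVE version returns exactly the same checksum as the scalar reference for every length, including lengths that end in a partial vector. The harness below is a supplementary sketch, not one of the Learning Path's project files; the names `adler32` and `adler32_sve` are assumptions, so adjust them to match how you link the two implementations:
+
+```c
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+/* Assumed names for the scalar reference and the SVE version, linked into one binary */
+uint32_t adler32(const uint8_t *data, size_t len);
+uint32_t adler32_sve(const uint8_t *data, size_t len);
+
+int main(void)
+{
+    static uint8_t buf[8192];
+    srand(42);
+    for (size_t i = 0; i < sizeof(buf); i++)
+        buf[i] = (uint8_t)(rand() & 0xFF);
+
+    /* Step by an odd amount so many lengths exercise the predicated tail */
+    for (size_t len = 0; len <= sizeof(buf); len += 37) {
+        uint32_t ref = adler32(buf, len);
+        uint32_t sve = adler32_sve(buf, len);
+        if (ref != sve) {
+            printf("Mismatch at len=%zu: scalar=0x%08X sve=0x%08X\n",
+                   len, (unsigned)ref, (unsigned)sve);
+            return 1;
+        }
+    }
+    printf("All lengths match\n");
+    return 0;
+}
+```
+
+If both of your source files define a function named `adler32`, rename one of them before linking them into the same test program.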
diff --git a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/6-results.md b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/6-results.md
index 2114001fb8..53b49d81a7 100644
--- a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/6-results.md
+++ b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/6-results.md
@@ -16,21 +16,19 @@ Your numbers will vary by processor. Compare against the two baselines you recor
| Scalar NMAX | ~2,000 MB/s | ~5x |
| SVE | ~21,000 MB/s | ~55x |

-The SVE version is roughly 10x faster than the NMAX scalar version, and about 55x faster than the original. The exact ratio depends on your SVE vector length. You can also use a Graviton3 instance to try on a processor with 256-bit SVE vectors and compare the results. The 256-bit vector length on Graviton3 shows faster performance than the 128-bit vector length on Graviton4, but Graviton3 is slower than Graviton4 on the scalar versions.
+The SVE version is roughly 10x faster than the NMAX scalar version, and about 55x faster than the original. The exact ratio depends on your SVE vector length. You can also use a Graviton3-based instance to try the code on a processor with 256-bit SVE vectors and compare the results. The 256-bit vector length on Graviton3 shows faster performance than the 128-bit vector length on Graviton4, but Graviton3 is slower than Graviton4 on the scalar versions.

-## Ask about the assembly
+## Ask AI about the inner loop assembly code

-Understanding the generated assembly helps you verify that the compiler is producing the instructions you expect.
+By understanding the generated assembly, you can verify that the compiler is producing the instructions you expect.

-### ASK AI: about the inner loop assembly code
-
-Ask your assistant:
+Ask your assistant to explain the inner loop assembly code. Your prompt can be similar to:

```text
disassemble ~/adler32-sve/adler32-test and explain the assembly code for the inner loop.
```

-The response explains the mapping of the C code to the assembly instructions, explains the intrinsics used.
+The response explains the mapping of the C code to the assembly instructions and the intrinsics used.

A partial example response is:
@@ -60,23 +58,16 @@ objdump -d adler32-test | grep -A 40 "<adler32>"
Look for the `WHILELT` and `UDOT` instructions in the inner loop. If you see them, the SVE code path is active.

-## Ask about debugging and performance tuning
-
-You can also use your AI assistant to debug any issues or clarify performance, but be careful, it is easy to divert into an endless loop of trial and error as today's assistants can easily make things worse.
-
-## What you've accomplished
+{{< notice Note >}}
+You can use your AI assistant to debug any issues or to clarify performance results. However, it is easy to fall into an endless loop of trial and error as today's assistants can easily make things worse.
+{{< /notice >}}

-You've completed the full optimization journey for Adler-32 on Arm Neoverse using an AI Assistant and the Arm MCP server:
-1. You started with a simple scalar implementation and measured its baseline performance
-2. You used the Arm MCP server to learn SVE concepts such as predicates, widening loads, dot products, and reductions without looking up documentation
-3. You applied the NMAX modulo-deferral technique to restructure the algorithm for vectorization
-4. You built a vector-length-agnostic SVE implementation that works correctly on any SVE-capable processor
-5.
You measured a performance improvement and learned how to read the generated assembly

+## What you've accomplished and what's next

-## Apply this process to your own code
+You've now completed the full optimization journey for Adler-32 on Arm Neoverse using an AI assistant and the Arm MCP server. You started with a simple scalar implementation, measured its baseline performance, and used the Arm MCP server to learn SVE concepts. You then applied the NMAX modulo-deferral technique to prepare the algorithm for vectorization. From there, you built a vector-length-agnostic SVE implementation, verified its correctness, and measured the resulting performance improvement. Finally, you learned how to read the generated assembly to see what the CPU actually executes.

-The process you followed here applies directly to other scalar loops in your own projects:
+You can apply the process you followed in this Learning Path directly to other scalar loops in your own projects:

1. Establish a correctness test and a performance baseline before changing anything
2. Ask your AI assistant to guide you and keep explaining along the way
diff --git a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/_index.md b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/_index.md
index 2e16bdc5c6..a1c0919d11 100644
--- a/content/learning-paths/servers-and-cloud-computing/adler32-kiro/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/adler32-kiro/_index.md
@@ -1,9 +1,6 @@
---
title: Optimize an Adler-32 checksum function with SVE intrinsics using the Arm MCP server
-draft: true
-cascade:
-  draft: true

description: Use the Arm MCP server with an AI coding assistant to incrementally optimize a scalar C Adler-32 checksum function using SVE intrinsics on Arm Neoverse servers.
@@ -19,7 +16,7 @@ learning_objectives:
- Validate correctness and measure the performance improvement of the SVE implementation

prerequisites:
-  - An AI coding assistant configured with the Arm MCP server, such as Kiro CLI, GitHub Copilot, or Gemini CLI. See the [Arm MCP server Learning Path](/learning-paths/servers-and-cloud-computing/arm-mcp-server/) for setup instructions.
+  - An AI coding assistant configured with the Arm MCP server, such as Kiro CLI, GitHub Copilot, or Gemini CLI. For setup instructions, see the [Arm MCP server Learning Path](/learning-paths/servers-and-cloud-computing/arm-mcp-server/).
  - An Arm Neoverse server running Ubuntu 26.04 with SVE support (for example, AWS Graviton3 or later, Google Axion, or Microsoft Cobalt 100)
  - Basic familiarity with C programming