From 1bc4c14284523eaeb4f0df99b0b44f6357566639 Mon Sep 17 00:00:00 2001
From: Yiannis Papadopoulos
Date: Tue, 24 Mar 2026 15:31:40 -0400
Subject: [PATCH 1/3] Update README with improved content and structure
- Update copyright year range to 2025-2026
- Add key features section highlighting NPU capabilities
- Add comprehensive LLM inference section with Llama 3.2 1B example
- Add detailed contributing guide (CONTRIBUTING.md) with code style,
testing, and PR guidelines
- Update contribution guide link from docs/contribute.md to CONTRIBUTING.md
- Improve formatting consistency (bullet style)
---
README.md | 134 +++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 123 insertions(+), 11 deletions(-)
diff --git a/README.md b/README.md
index c833eb40..f507786d 100755
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
@@ -13,7 +13,7 @@ SPDX-License-Identifier: Apache-2.0
-
+
@@ -24,8 +24,15 @@ SPDX-License-Identifier: Apache-2.0
-IRON is an open-source & close-to-metal Python API enabling fast and efficient execution on [AMD Ryzen™ AI NPUs](https://www.amd.com/en/products/processors/consumer/ryzen-ai.html). It relies on language bindings around the [MLIR-AIE](https://github.com/Xilinx/mlir-aie) dialect.
+IRON is an open-source & close-to-metal Python API enabling fast and efficient execution on [AMD Ryzen™ AI NPUs](https://www.amd.com/en/products/processors/consumer/ryzen-ai.html). It relies on language bindings around the [MLIR-AIE](https://github.com/Xilinx/mlir-aie) dialect.
+**Key Features:**
+
+- Close-to-metal NPU programming via MLIR-AIE Python bindings
+- Pre-built operator library (GEMM, MHA, RMSNorm, RoPE, activations, etc.)
+- Operator fusion for optimal performance
+- Extensible architecture for custom operators
+- End-to-end LLM inference (Llama 3.2 1B example included)
The IRON Python API for Ryzen™ AI NPUs is described in the following paper:
@@ -78,7 +85,7 @@ These instructions will guide you through everything required for building and e
### Initial Setup
- > Be sure you have the latest BIOS on your laptop or mini-PC that enables the NPU. See [here](#update-bios).
+ > **Important**: Ensure your system has the latest BIOS version that enables NPU support. Check your laptop/mini-PC manufacturer's support website for BIOS updates.
If starting from `Ubuntu 24.04` you may need to update the Linux kernel to 6.11+ by installing the Hardware Enablement (HWE) stack:
@@ -126,26 +133,29 @@ If starting from `Ubuntu 24.04` you may need to update the Linux kernel to 6.11+
All available operators can be found in `iron/operators`. These each contain:
-* `op.py`: The Python operator interface -- an easy access point to integrate operators into your project that prescribes how to compile the operator (build artifacts) and how to call it at runtime (buffer sizes, etc.)
-* `design.py`: The implementation of the operator's NPU code. Often references a kernel in `aie_kernels` for the compute core code and describes the data movement using ObjectFIFOs.
-* `reference.py`: A reference CPU implementation to validate the correctness of the NPU implementation.
-* `test.py`: An end-to-end test that instantiates and builds the operator, runs it and verifies its outputs against the reference.
+- `op.py`: The Python operator interface -- an easy access point to integrate operators into your project that prescribes how to compile the operator (build artifacts) and how to call it at runtime (buffer sizes, etc.)
+- `design.py`: The implementation of the operator's NPU code. Often references a kernel in `aie_kernels` for the compute core code and describes the data movement using ObjectFIFOs.
+- `reference.py`: A reference CPU implementation to validate the correctness of the NPU implementation.
+- `test.py`: An end-to-end test that instantiates and builds the operator, runs it and verifies its outputs against the reference.
-> NOTE: Be sure the XRT setup script has been sourced and the Python environment is activated:
+> NOTE: Be sure the XRT setup script has been sourced and the Python environment is activated:
> `source /opt/xilinx/xrt/setup.sh`
> `source /path/to/ironenv/bin/activate`
To build and test all the operators:
+
``` bash
pytest iron/operators/ -m "not extensive"
-```
+```
To run the extensive test suite:
+
``` bash
pytest iron/operators/
```
To run a specific operator's tests:
+
``` bash
pytest iron/operators/axpy/
```
@@ -160,12 +170,114 @@ chmod +x .git/hooks/pre-push
```
The hook will run the same linting checks as CI:
+
- License checks (reuse)
- Python formatting (black)
- C++ formatting (clang-format)
To bypass the hook if needed: `git push --no-verify`
+## Quick Start Example
+
+Here's a simple example using the AXPY operator (Y = aX + Y):
+
+```python
+#!/usr/bin/env python3
+from iron.operators.axpy import AIEAXPY
+import numpy as np
+from ml_dtypes import bfloat16
+
+# Define operator parameters
+size = 2048
+num_aie_columns = 4
+num_channels = 2
+tile_size = 512
+scalar_factor = 3.0
+
+# Create and compile the operator
+operator = AIEAXPY(
+ size=size,
+ num_aie_columns=num_aie_columns,
+ num_channels=num_channels,
+ tile_size=tile_size,
+ scalar_factor=scalar_factor,
+)
+operator.compile()
+
+# Prepare input data
+x = np.random.rand(size).astype(bfloat16)
+y = np.random.rand(size).astype(bfloat16)
+output = np.zeros(size, dtype=bfloat16)
+
+# Run on NPU
+operator(x, y, output)
+
+# Verify: output = scalar_factor * x + y
+expected = scalar_factor * x.astype(np.float32) + y.astype(np.float32)
+print(f"Result matches expected: {np.allclose(output, expected, rtol=0.04)}")
+```
+
+Run this example:
+
+```bash
+python example.py
+```
+
+## Applications
+
+### Llama 3.2 1B Inference
+
+IRON includes a complete LLM inference example demonstrating NPU acceleration:
+
+- **Location**: `iron/applications/llama_3.2_1b/`
+- **Model**: Meta Llama 3.2 1B
+- **Features**: Multi-head attention, fused operators, bfloat16 quantization
+
+See [iron/applications/llama_3.2_1b/README.md](./iron/applications/llama_3.2_1b/README.md) for setup and usage instructions.
+
+## Architecture
+
+IRON uses a three-layer architecture:
+
+1. **Operators** (`iron/operators/`): High-level Python API for NPU operations
+ - Each operator has: `op.py` (interface), `design.py` (MLIR-AIE implementation), `reference.py` (CPU reference), `test.py` (validation)
+
+2. **AIE Kernels** (`aie_kernels/`): Low-level C++ compute kernels
+ - Organized by architecture: `generic/`, `aie2/`, `aie2p/`
+ - Vectorized using AIE API for optimal performance
+
+3. **Common Infrastructure** (`iron/common/`): Compilation, device management, and utilities
+ - MLIR-AIE compilation pipeline
+ - XRT runtime integration
+ - Operator fusion framework
+
+## Performance
+
+IRON operators are designed for maximum NPU utilization:
+
+- Parallel execution across multiple AIE columns
+- Optimized data movement via ObjectFIFOs
+- Fused operations to minimize host-NPU transfers
+- Vectorized kernels using AIE intrinsics
+
+Run benchmarks:
+
+```bash
+# Run all operators with performance metrics
+pytest iron/operators/ -m "not extensive" -v
+```
+
+## Community and Support
+
+- 💬 **Discord**: Join our [Discord server](https://discord.gg/cW99Ds85e8) for discussions and support
+- 🐛 **Issues**: Report bugs and request features via [GitHub Issues](https://github.com/amd/iron/issues)
+- 🤝 **Contributing**: See [CONTRIBUTING.md](./CONTRIBUTING.md) for development guidelines
+- 📚 **Documentation**: Operator examples in `iron/operators/`, kernel docs in `aie_kernels/README.md`
+
+## License
+
+IRON is licensed under the Apache License 2.0. See [LICENSE](./LICENSE) for details.
+
-----
-Copyright © 2025 Advanced Micro Devices, Inc
+Copyright © 2025-2026 Advanced Micro Devices, Inc
From 4f0878394ff60ff92006866d92e9e02f6f60fef1 Mon Sep 17 00:00:00 2001
From: Yiannis Papadopoulos
Date: Tue, 24 Mar 2026 15:32:20 -0400
Subject: [PATCH 2/3] docs: add AGENTS.md with instructions for AI coding
agents
---
AGENTS.md | 501 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 501 insertions(+)
create mode 100644 AGENTS.md
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 00000000..3a53f0ab
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,501 @@
+# AGENTS.md
+
+This file provides guidance to AI coding agents when working with code in this repository.
+
+## Overview
+
+IRON is a close-to-metal Python API for AMD Ryzen™ AI NPUs (XDNA architecture). It provides language bindings around the MLIR-AIE dialect to enable fast and efficient execution on NPU hardware.
+
+**Key Technologies:**
+
+- **MLIR-AIE**: Dialect for programming AMD AI Engines (AIE) array architectures
+- **XRT (Xilinx Runtime)**: Low-level runtime for interfacing with NPU hardware
+- **Target Hardware**: AMD Ryzen AI NPUs (AIE2/AIE2P architectures - NPU1/NPU2)
+- **Primary Datatype**: bfloat16
+
+## Environment Setup
+
+```bash
+# 1. Source XRT (required for all operations)
+source /opt/xilinx/xrt/setup.sh
+
+# 2. Activate virtual environment
+source ironenv/bin/activate
+
+# 3. Install dependencies
+pip install -r requirements.txt
+```
+
+**Note:** XRT must be sourced before running any tests or operators.
+
+### Build Directory
+
+Compiled artifacts (`.xclbin`, `.bin`, `.o` files) are stored in the `build/` directory by default. The build directory can be customized via `AIEContext(build_dir="path/to/build")`.
+
+### Environment Variables
+
+- `IRON_EXAMPLE_WEIGHTS_DIR`: Path to model weights for applications (default: `/srv`)
+
+## Building and Testing
+
+### Run All Operators (non-extensive tests)
+
+```bash
+pytest iron/operators/ -m "not extensive"
+```
+
+### Run Extensive Test Suite
+
+```bash
+pytest iron/operators/
+```
+
+### Run Single Operator Test
+
+```bash
+pytest iron/operators/axpy/
+# OR directly execute the test.py
+./iron/operators/axpy/test.py
+```
+
+### Run Application Tests
+
+```bash
+pytest iron/applications/
+```
+
+### Run Specific Test Function
+
+```bash
+pytest iron/operators/gemm/test.py::test_gemm_basic
+```
+
+### Parallel Testing (faster)
+
+```bash
+pytest iron/operators/ -n auto -m "not extensive"
+```
+
+## Code Style and Linting
+
+### Python (Black)
+
+```bash
+# Check formatting
+black --check .
+
+# Auto-format
+black .
+```
+
+### C++ (clang-format)
+
+```bash
+# Check C++ formatting
+python scripts/clang-format-wrapper.py --check
+
+# Show differences
+python scripts/clang-format-wrapper.py --diff
+
+# Auto-format all
+python scripts/clang-format-wrapper.py --fix
+
+# Format specific directory
+python scripts/clang-format-wrapper.py --fix --path aie_kernels/
+```
+
+### License Compliance (REUSE)
+
+```bash
+# Check all files have proper license headers
+reuse lint
+```
+
+## Architecture
+
+### Three-Layer Structure
+
+1. **Operators** (`iron/operators/`)
+ - Each operator directory contains:
+ - `op.py`: Python interface (inherits from `MLIROperator`) - defines operator parameters, compilation artifacts, and runtime argument specs
+ - `design.py`: NPU implementation using MLIR-AIE Python API - defines ObjectFIFOs, Workers, and Runtime sequences
+ - `reference.py`: CPU reference implementation for validation
+ - `test.py`: End-to-end test (build, run, verify against reference)
+
+2. **AIE Kernels** (`aie_kernels/`)
+ - Architecture-specific C++ compute kernels:
+ - `generic/`: Works on both AIE2 and AIE2P
+ - `aie2/`: AIE2-specific (NPU1)
+ - `aie2p/`: AIE2P-specific (NPU2)
+ - Use AIE API for vectorization (e.g., `aie::mmul`, `aie::add`, `aie::mul`)
+ - Compiled to `.o` files and linked into operator `.xclbin`
+
+3. **Common Infrastructure** (`iron/common/`)
+ - `base.py`: Base classes (`AIEOperatorBase`, `MLIROperator`, `CompositeOperator`)
+  - `compilation/`: Compilation artifact system (MLIR → xclbin)
+ - `fusion.py`: Operator fusion framework (`FusedMLIROperator`)
+ - `device_manager.py`: XRT device initialization and management (singleton pattern)
+ - `context.py`: `AIEContext` for operator compilation/execution
+ - `utils.py`: Helper functions (`torch_to_numpy`, `numpy_to_torch`)
+ - `test_utils.py`: Test utilities (`verify_buffer`, `nearly_equal`)
+
+### Key Concepts
+
+**ObjectFIFO**: Data movement primitive in MLIR-AIE
+
+- Connects producers and consumers (shim DMA ↔ compute tiles)
+- Uses `acquire()` to get buffer access, `release()` to free it
+- Pattern: always pair acquire with release in loops
+
+**Worker**: Compute tile task
+
+- Wraps a Python function that runs on AIE compute core
+- Function uses `range_()` for loops (not Python `range`)
+- Calls compiled C++ kernels via `Kernel` objects
+
+**TensorAccessPattern (TAP)**: Describes how data is sliced and distributed
+
+- Used to parallelize work across multiple columns
+- Format: `(tensor_shape, offset, dimensions, strides)`
+
+**Runtime Sequence**: Host-side control flow
+
+- `rt.fill()`: DMA data from host → NPU (shim → L2/L1)
+- `rt.drain()`: DMA data from NPU → host
+- `rt.start()`: Launch workers
+- `rt.task_group()`: Coordinate parallel DMA operations
+
+**Compilation Flow**:
+
+```text
+design.py (Python MLIR-AIE API)
+  ↓
+PythonGeneratedMLIRArtifact
+  ↓
+MLIR (.mlir file)
+  ↓  (aie-opt + aie-translate via Peano toolchain)
+xclbin (NPU binary) + insts.bin (instruction sequence)
+```
+
+**AIEContext**: Manages compilation and runtime state
+
+- Default build directory: `build/` in current working directory
+- Compilation rules: Defines pipeline from Python → MLIR → xclbin
+- Device manager: Singleton for XRT resource sharing
+- Use `AIEContext(build_dir="...", mlir_verbose=True)` for custom settings
+
+**Device Manager**: Singleton that manages XRT resources
+
+- Automatically initializes `pyxrt.device(0)`
+- Caches contexts and kernels per xclbin path
+- Shared across all operators to avoid resource conflicts
+
+## Hardware Constraints
+
+### NPU Architecture Limits
+
+- **NPU1 (AIE2)**: 4 rows × 1-4 columns (AMD Ryzen AI Phoenix/Hawk Point)
+- **NPU2 (AIE2P)**: 4 rows × 8 columns (AMD Ryzen AI 300 Series "Strix Point", Ryzen AI Max "Strix Halo", and "Krackan Point")
+
+### Tile and Dimension Constraints
+
+Common operator parameters and their constraints:
+
+- `tile_size`: Typically 64, 128, 256, or 4096 (depends on operator and data type)
+- `num_aie_columns`: Must match hardware (1-4 for NPU1, up to 8 for NPU2)
+- `num_aie_rows`: Always 4 for current NPU architectures
+
+**GEMM-specific**:
+
+- `tile_m`, `tile_k`, `tile_n`: Matrix tile dimensions (typically 64)
+- Minimum tile sizes depend on `emulate_bf16_mmul_with_bfp16` flag:
+  - `True` (default): 8×8×8 minimum
+  - `False`: 4×8×8 minimum
+- Matrix dimensions must be multiples of `tile × num_rows/columns`
+ - `M % (tile_m * 4) == 0`
+ - `K % tile_k == 0`
+ - `N % (tile_n * num_aie_columns) == 0`
+
+**Element-wise ops** (add, mul, relu, gelu, etc.):
+
+- `size % (num_aie_columns * tile_size) == 0`
+- `size % tile_size == 0`
+
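The constraints above can be checked up front in plain Python. This sketch encodes the GEMM and element-wise rules from the lists above; the helper names and default values are illustrative, not part of the IRON API:

```python
# Illustrative constraint checkers (names are hypothetical, not IRON API).
def gemm_dims_ok(M, K, N, tile_m=64, tile_k=64, tile_n=64,
                 num_aie_columns=4, num_aie_rows=4):
    """GEMM tiling constraints from the list above."""
    return (M % (tile_m * num_aie_rows) == 0
            and K % tile_k == 0
            and N % (tile_n * num_aie_columns) == 0)


def elementwise_size_ok(size, tile_size=512, num_aie_columns=4):
    """Element-wise operator constraint from the list above."""
    return size % (num_aie_columns * tile_size) == 0


print(gemm_dims_ok(512, 256, 512))   # True
print(gemm_dims_ok(512, 100, 512))   # False: K not a multiple of tile_k
print(elementwise_size_ok(2048))     # True: 2048 % (4 * 512) == 0
```

Running such checks before `compile()` turns a late NPU-side failure into an immediate, readable error.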
+### Memory Hierarchy
+
+- **L3**: Host memory (DDR)
+- **L2**: Shared memory tiles (MemTiles in AIE-ML)
+- **L1**: Per-core local memory (limited, ~32-64 KB per tile)
+
+Data movement pattern: L3 → Shim DMA → L2 → L1 (tile local) → Compute
+
+## Adding a New Operator
+
+1. Create a directory in `iron/operators/<op_name>/`
+2. Implement `op.py`:
+ - Subclass `MLIROperator`
+ - Implement `get_operator_name()`, `get_mlir_artifact()`, `get_kernel_artifacts()`, `get_arg_spec()`
+ - Add validation for dimension constraints (assert statements)
+ - Define tile sizes and column counts
+3. Implement `design.py`:
+ - Import from `aie.iron` (Program, Runtime, Worker, ObjectFifo, Kernel)
+ - Define function that builds MLIR-AIE design
+ - Use `range_()` for loops (not Python `range`)
+ - Handle device-specific logic (NPU1 vs NPU2) if needed
+4. Implement C++ kernel in `aie_kernels/<arch>/` if needed
+ - Choose appropriate directory: `generic/`, `aie2/`, or `aie2p/`
+ - Use AIE API for portable vectorization when possible
+ - Add `event0()` and `event1()` for performance profiling
+5. Implement `reference.py` with CPU reference
+6. Implement `test.py` with pytest tests
+ - Use `@pytest.mark.extensive` for slower/larger tests
+ - Use `verify_buffer()` from `iron.common.test_utils`
+7. Register operator in `iron/operators/__init__.py`
+
+## Operator Fusion
+
+IRON supports fusing multiple operators into a single xclbin for improved performance:
+
+```python
+from iron.common.fusion import FusedMLIROperator
+
+# Define individual operators
+gemm1 = AIEGEMM(...)
+relu = AIERELU(...)
+gemm2 = AIEGEMM(...)
+
+# Create fused operator with runlist
+# Intermediate buffers are automatically managed
+fused_op = FusedMLIROperator(
+ name="fused_gemm_relu_gemm",
+ runlist=[
+ (gemm1, "in", "temp1"), # (operator, input_buffers, output_buffers)
+ (relu, "temp1", "temp2"),
+ (gemm2, "temp2", "out"),
+ ],
+ input_args={"in": size_in},
+ output_args={"out": size_out},
+ context=ctx
+)
+```
+
+Benefits of fusion:
+
+- Reduces host ↔ NPU data transfers
+- Shares compiled resources (kernels.a)
+- Eliminates intermediate buffer round-trips
+
+## Common Patterns
+
+### Multi-Column Parallelism
+
+Distribute work across NPU columns using TensorAccessPattern:
+
+```python
+num_columns = 4
+chunk = total_elements // num_columns
+
+taps = [
+ TensorAccessPattern(
+ (1, total_elements),
+ chunk * i, # offset for column i
+ [1, 1, 1, chunk],
+ [0, 0, 0, 1],
+ )
+ for i in range(num_columns)
+]
+```
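The offsets in the example above can be sanity-checked without any IRON imports. This plain-Python sketch (the tensor size is illustrative) verifies that the four per-column chunks tile the tensor exactly once:

```python
# Plain-Python sanity check of the per-column offsets (no IRON imports).
total_elements = 2048            # illustrative size, divisible by num_columns
num_columns = 4
chunk = total_elements // num_columns

offsets = [chunk * i for i in range(num_columns)]
covered = set()
for off in offsets:
    covered.update(range(off, off + chunk))

# Every element is covered exactly once across the four columns.
assert covered == set(range(total_elements))
print(offsets)  # [0, 512, 1024, 1536]
```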
+
+### ObjectFIFO Acquire/Release Pattern
+
+```python
+def core_body(of_in, of_out, kernel_fn):
+ for _ in range_(num_iterations):
+ elem_in = of_in.acquire(1)
+ elem_out = of_out.acquire(1)
+ kernel_fn(elem_in, elem_out, size)
+ of_in.release(1)
+ of_out.release(1)
+```
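To see why the acquire/release pairing matters, here is a toy stand-in (NOT the IRON `ObjectFifo` API) in which the buffer pool is bounded, so a forgotten `release()` would eventually starve the loop:

```python
# Toy stand-in for the acquire/release discipline (not the IRON ObjectFifo
# API): with a bounded pool, acquires without matching releases deadlock.
class ToyFifo:
    def __init__(self, depth):
        self.depth = depth   # number of buffers in the pool
        self.held = 0        # buffers currently acquired

    def acquire(self, n):
        if self.held + n > self.depth:
            raise RuntimeError("deadlock: all buffers held, none released")
        self.held += n

    def release(self, n):
        self.held -= n


fifo = ToyFifo(depth=2)
for _ in range(8):       # many iterations over a small pool work fine...
    fifo.acquire(1)
    fifo.release(1)      # ...only because each acquire is paired
print("ok")
```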
+
+### Using `range_()` vs `range`
+
+- **Always use `range_()`** in Worker functions (NPU-side code)
+- Use Python `range` only in Runtime sequences (host-side code)
+
+### Vectorized Kernel Template
+
+```cpp
+#include <aie_api/aie.hpp>
+
+void my_kernel(bfloat16* in, bfloat16* out, int32_t size) {
+  event0(); // Start performance counter
+  aie::vector<bfloat16, 32> vec_in = aie::load_v<32>(in);
+  // ... vectorized operations producing vec_out ...
+  aie::store_v(out, vec_out);
+  event1(); // Stop performance counter
+}
+```
+
+**Note**: `event0()` and `event1()` are performance profiling markers.
+
+### Test Verification Pattern
+
+```python
+from iron.common.test_utils import verify_buffer
+
+# Compare NPU output against CPU reference
+errors = verify_buffer(
+ output=npu_output,
+ buf_name="output",
+ reference=cpu_reference,
+ rel_tol=0.04, # 4% relative tolerance
+ abs_tol=1e-6, # Absolute tolerance for small values
+ max_error_rate=0.0 # 0% of elements can fail (strict)
+)
+assert len(errors) == 0, f"Found {len(errors)} mismatches"
+```
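The tolerance logic can be sketched in plain Python. This is an assumption about the kind of check `verify_buffer` performs (a mixed absolute/relative tolerance, as in `numpy.allclose`), not its actual implementation:

```python
# Hypothetical sketch of a mixed absolute/relative tolerance check; the
# real verify_buffer implementation may differ.
def find_mismatches(output, reference, rel_tol=0.04, abs_tol=1e-6):
    errors = []
    for i, (o, r) in enumerate(zip(output, reference)):
        if abs(o - r) > abs_tol + rel_tol * abs(r):
            errors.append((i, o, r))
    return errors


ref = [1.0, 2.0, 0.0]
out = [1.02, 2.5, 1e-7]      # middle element is off by 25%
errors = find_mismatches(out, ref)
print(errors)  # [(1, 2.5, 2.0)]
```

The `abs_tol` term is what lets near-zero reference values pass despite a large *relative* deviation.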
+
+### Datatype Conversion Helpers
+
+```python
+from iron.common.utils import torch_to_numpy, numpy_to_torch
+
+# Convert torch tensor to numpy (preserves bfloat16)
+np_array = torch_to_numpy(torch_tensor)
+
+# Convert numpy array to torch (preserves bfloat16)
+torch_tensor = numpy_to_torch(np_array)
+```
+
+These utilities handle bfloat16 conversion correctly (avoiding float32 intermediate).
+
+## Debugging and Performance
+
+### Debug Mode
+
+Disable XRT runlist for easier debugging (executes kernels individually):
+
+```python
+context = AIEContext(use_runlist=False)
+```
+
+This sacrifices performance but makes it easier to identify which kernel fails.
+
+### Verbose MLIR Output
+
+Enable verbose MLIR compilation output:
+
+```python
+context = AIEContext(mlir_verbose=True)
+```
+
+### Performance Profiling
+
+C++ kernels use `event0()` and `event1()` markers for performance profiling. These can be analyzed with AIE trace tools to measure cycle counts.
+
+### Logging
+
+The codebase uses Python's standard `logging` module. Enable debug logging:
+
+```python
+import logging
+logging.basicConfig(level=logging.DEBUG)
+```
+
+## CI and PR Workflow
+
+### GitHub Actions Workflows
+
+- **small.yml**: Fast operator tests (non-extensive, runs on every PR)
+- **extensive.yml**: Full test suite (all operators with extensive tests)
+- **test-examples.yml**: Application tests (e.g., Llama inference)
+- **ci-lint.yml**: Linting checks (black, clang-format, reuse)
+
+### Workflow Requirements
+
+- **Target Branch**: Always submit PRs to `devel`
+- **CI Tests**: Run on self-hosted runners with NPU hardware
+- **All CI must pass**: Including linting and formatting checks
+- **Pre-Push Hook** (optional but recommended):
+
+ ```bash
+ cp scripts/hooks/pre-push .git/hooks/pre-push
+ chmod +x .git/hooks/pre-push
+ ```
+
+- **PR Prefixes**: Use "DRAFT:" for work-in-progress, "REFACTOR:" for refactoring
+
+## Troubleshooting
+
+### Common Issues
+
+**"No XRT device found"**
+
+- Ensure `source /opt/xilinx/xrt/setup.sh` was run
+- Check XDNA driver is installed: `lsmod | grep amdxdna`
+
+**"Kernel not found" or "Symbol not defined"**
+
+- Verify kernel `.cc` file is in the correct `aie_kernels/<arch>/` directory
+- Check `get_kernel_artifacts()` in `op.py` references correct kernel path
+- Ensure kernel function signature matches `Kernel()` declaration in `design.py`
+
+**Compilation hangs or fails**
+
+- Check MLIR-AIE is installed: `python -c "import aie.iron"`
+- Verify `llvm-aie` is available: `which aie-opt`
+- Look for syntax errors in `design.py` (common: using `range` instead of `range_()`)
+
+**Test failures with numerical differences**
+
+- Check datatype consistency (bfloat16 has limited precision)
+- Verify reference implementation matches NPU kernel exactly
+- Look for memory alignment issues in C++ kernel
+- Adjust tolerances in `verify_buffer()` if needed (`rel_tol`, `abs_tol`)
+
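bfloat16 keeps only 8 mantissa bits, so roughly 2-3 decimal digits survive a round trip. This plain-Python illustration (bit-level, no `ml_dtypes` required) shows the rounding a value undergoes:

```python
import struct

def to_bfloat16_bits(x):
    """Float32 -> bfloat16 bit pattern (round-to-nearest, ties to even)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    rounding = 0x7FFF + ((bits >> 16) & 1)  # tie goes to the even mantissa
    return ((bits + rounding) >> 16) & 0xFFFF

def from_bfloat16_bits(b):
    """bfloat16 bit pattern -> float (float32 with zeroed low 16 bits)."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

x = 3.14159
print(from_bfloat16_bits(to_bfloat16_bits(x)))  # 3.140625
```

This is why the tests above use a relative tolerance of a few percent rather than exact equality.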
+**Dimension mismatch errors**
+
+- Check operator constraints (e.g., `M % (tile_m * 4) == 0` for GEMM)
+- Verify `tile_size`, `num_aie_columns`, and total size are compatible
+- Ensure tensor dimensions are multiples of required alignment
+
+**"Invalid configuration: NPU2 has 8 columns"**
+
+- NPU1 supports 1-4 columns only
+- NPU2 supports up to 8 columns
+- Device type is auto-detected via XRT
+
+**Kernel compilation failures**
+
+- Check kernel is in correct architecture directory (`generic/`, `aie2/`, `aie2p/`)
+- Verify `#include <aie_api/aie.hpp>` for AIE API kernels
+- Ensure template parameters match function signature
+- Check for syntax errors in vectorization code
+
+## Applications
+
+### Llama 3.2 1B Inference
+
+Full LLM inference example at `iron/applications/llama_3.2_1b/`:
+
+- **Required files**: `model.safetensors`, `tokenizer.model` from Hugging Face
+- **Default location**: `/srv/llama3.2-1b/` (configurable via `IRON_EXAMPLE_WEIGHTS_DIR`)
+- **Additional deps**: `pip install -r requirements_examples.txt`
+- **Run**: `pytest iron/applications/llama_3.2_1b/`
+
+### AIE Kernel Reference
+
+See `aie_kernels/README.md` for catalog of available kernels:
+
+- Element-wise ops (add, mul, scale)
+- Matrix operations (mm, mv)
+- Reductions (add, max, min)
+- ML ops (conv2d, relu, exp)
+- Vision ops (rgba2gray, filter2d)
+
+Kernels are organized by coding style:
+
+- **AIE API**: Portable C++ template library (recommended)
+- **Intrinsics**: Architecture-specific low-level intrinsics (max performance)
+- **Generic C**: Works on any AIE family (basic functionality)
From 32d43c4c7328eb8d4bfc5eb8257f1a803b64f1f1 Mon Sep 17 00:00:00 2001
From: Yiannis Papadopoulos
Date: Wed, 25 Mar 2026 14:25:48 -0400
Subject: [PATCH 3/3] docs: remove Quick Start Example section from README
---
README.md | 46 ----------------------------------------------
1 file changed, 46 deletions(-)
diff --git a/README.md b/README.md
index f507786d..3f2f9744 100755
--- a/README.md
+++ b/README.md
@@ -177,52 +177,6 @@ The hook will run the same linting checks as CI:
To bypass the hook if needed: `git push --no-verify`
-## Quick Start Example
-
-Here's a simple example using the AXPY operator (Y = aX + Y):
-
-```python
-#!/usr/bin/env python3
-from iron.operators.axpy import AIEAXPY
-import numpy as np
-from ml_dtypes import bfloat16
-
-# Define operator parameters
-size = 2048
-num_aie_columns = 4
-num_channels = 2
-tile_size = 512
-scalar_factor = 3.0
-
-# Create and compile the operator
-operator = AIEAXPY(
- size=size,
- num_aie_columns=num_aie_columns,
- num_channels=num_channels,
- tile_size=tile_size,
- scalar_factor=scalar_factor,
-)
-operator.compile()
-
-# Prepare input data
-x = np.random.rand(size).astype(bfloat16)
-y = np.random.rand(size).astype(bfloat16)
-output = np.zeros(size, dtype=bfloat16)
-
-# Run on NPU
-operator(x, y, output)
-
-# Verify: output = scalar_factor * x + y
-expected = scalar_factor * x.astype(np.float32) + y.astype(np.float32)
-print(f"Result matches expected: {np.allclose(output, expected, rtol=0.04)}")
-```
-
-Run this example:
-
-```bash
-python example.py
-```
-
## Applications
### Llama 3.2 1B Inference