From 1bc4c14284523eaeb4f0df99b0b44f6357566639 Mon Sep 17 00:00:00 2001
From: Yiannis Papadopoulos
Date: Tue, 24 Mar 2026 15:31:40 -0400
Subject: [PATCH 1/3] Update README with improved content and structure

- Update copyright year range to 2025-2026
- Add key features section highlighting NPU capabilities
- Add comprehensive LLM inference section with Llama 3.2 1B example
- Add detailed contributing guide (CONTRIBUTING.md) with code style, testing, and PR guidelines
- Update contribution guide link from docs/contribute.md to CONTRIBUTING.md
- Improve formatting consistency (bullet style)
---
 README.md | 134 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 123 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index c833eb40..f507786d 100755
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@ @@ -13,7 +13,7 @@ SPDX-License-Identifier: Apache-2.0 GitHub downloads Iron Tests - + PRs Welcome license: Apache @@ -24,8 +24,15 @@ SPDX-License-Identifier: Apache-2.0 IRONCLAD Logo

-IRON is an open-source & close-to-metal Python API enabling fast and efficient execution on [AMD Ryzen™ AI NPUs](https://www.amd.com/en/products/processors/consumer/ryzen-ai.html). It relies on language bindings around the [MLIR-AIE](https://github.com/Xilinx/mlir-aie) dialect.
+IRON is an open-source & close-to-metal Python API enabling fast and efficient execution on [AMD Ryzen™ AI NPUs](https://www.amd.com/en/products/processors/consumer/ryzen-ai.html). It relies on language bindings around the [MLIR-AIE](https://github.com/Xilinx/mlir-aie) dialect.
+**Key Features:**
+
+- Close-to-metal NPU programming via MLIR-AIE Python bindings
+- Pre-built operator library (GEMM, MHA, RMSNorm, RoPE, activations, etc.)
+- Operator fusion for optimal performance
+- Extensible architecture for custom operators
+- End-to-end LLM inference (Llama 3.2 1B example included)
 The IRON Python API for Ryzen™ AI NPUs is described in the following paper:
@@ -78,7 +85,7 @@ These instructions will guide you through everything required for building and e
 ### Initial Setup
- > Be sure you have the latest BIOS on your laptop or mini-PC that enables the NPU. See [here](#update-bios).
+ > **Important**: Ensure your system has the latest BIOS version that enables NPU support. Check your laptop/mini-PC manufacturer's support website for BIOS updates.
 If starting from `Ubuntu 24.04` you may need to update the Linux kernel to 6.11+ by installing the Hardware Enablement (HWE) stack:
@@ -126,26 +133,29 @@ If starting from `Ubuntu 24.04` you may need to update the Linux kernel to 6.11+
 All available operators can be found in `iron/operators`. These each contain:
-* `op.py`: The Python operator interface -- an easy access point to integrate operators into your project that prescribes how to compile the operator (build artifacts) and how to call it at runtime (buffer sizes, etc.)
-* `design.py`: The implementation of the operator's NPU code.
Often references a kernel in `aie_kernels` for the compute core code and describes the data movement using ObjectFIFOs. -* `reference.py`: A reference CPU implementation to validate the correctness of the NPU implementation. -* `test.py`: An end-to-end test that instantiates and builds the operator, runs it and verifies its outputs against the reference. +- `op.py`: The Python operator interface -- an easy access point to integrate operators into your project that prescribes how to compile the operator (build artifacts) and how to call it at runtime (buffer sizes, etc.) +- `design.py`: The implementation of the operator's NPU code. Often references a kernel in `aie_kernels` for the compute core code and describes the data movement using ObjectFIFOs. +- `reference.py`: A reference CPU implementation to validate the correctness of the NPU implementation. +- `test.py`: An end-to-end test that instantiates and builds the operator, runs it and verifies its outputs against the reference. -> NOTE: Be sure the XRT setup script has been sourced and the Python environment is activated: +> NOTE: Be sure the XRT setup script has been sourced and the Python environment is activated: > `source /opt/xilinx/xrt/setup.sh` > `source /path/to/ironenv/bin/activate` To build and test all the operators: + ``` bash pytest iron/operators/ -m "not extensive" -``` +``` To run the extensive test suite: + ``` bash pytest iron/operators/ ``` To run a specific operator's tests: + ``` bash pytest iron/operators/axpy/ ``` @@ -160,12 +170,114 @@ chmod +x .git/hooks/pre-push ``` The hook will run the same linting checks as CI: + - License checks (reuse) - Python formatting (black) - C++ formatting (clang-format) To bypass the hook if needed: `git push --no-verify` +## Quick Start Example + +Here's a simple example using the AXPY operator (Y = aX + Y): + +```python +#!/usr/bin/env python3 +from iron.operators.axpy import AIEAXPY +import numpy as np +from ml_dtypes import bfloat16 + +# Define 
operator parameters +size = 2048 +num_aie_columns = 4 +num_channels = 2 +tile_size = 512 +scalar_factor = 3.0 + +# Create and compile the operator +operator = AIEAXPY( + size=size, + num_aie_columns=num_aie_columns, + num_channels=num_channels, + tile_size=tile_size, + scalar_factor=scalar_factor, +) +operator.compile() + +# Prepare input data +x = np.random.rand(size).astype(bfloat16) +y = np.random.rand(size).astype(bfloat16) +output = np.zeros(size, dtype=bfloat16) + +# Run on NPU +operator(x, y, output) + +# Verify: output = scalar_factor * x + y +expected = scalar_factor * x.astype(np.float32) + y.astype(np.float32) +print(f"Result matches expected: {np.allclose(output, expected, rtol=0.04)}") +``` + +Run this example: + +```bash +python example.py +``` + +## Applications + +### Llama 3.2 1B Inference + +IRON includes a complete LLM inference example demonstrating NPU acceleration: + +- **Location**: `iron/applications/llama_3.2_1b/` +- **Model**: Meta Llama 3.2 1B +- **Features**: Multi-head attention, fused operators, bfloat16 quantization + +See [iron/applications/llama_3.2_1b/README.md](./iron/applications/llama_3.2_1b/README.md) for setup and usage instructions. + +## Architecture + +IRON uses a three-layer architecture: + +1. **Operators** (`iron/operators/`): High-level Python API for NPU operations + - Each operator has: `op.py` (interface), `design.py` (MLIR-AIE implementation), `reference.py` (CPU reference), `test.py` (validation) + +2. **AIE Kernels** (`aie_kernels/`): Low-level C++ compute kernels + - Organized by architecture: `generic/`, `aie2/`, `aie2p/` + - Vectorized using AIE API for optimal performance + +3. 
**Common Infrastructure** (`iron/common/`): Compilation, device management, and utilities + - MLIR-AIE compilation pipeline + - XRT runtime integration + - Operator fusion framework + +## Performance + +IRON operators are designed for maximum NPU utilization: + +- Parallel execution across multiple AIE columns +- Optimized data movement via ObjectFIFOs +- Fused operations to minimize host-NPU transfers +- Vectorized kernels using AIE intrinsics + +Run benchmarks: + +```bash +# Run all operators with performance metrics +pytest iron/operators/ -m "not extensive" -v +``` + +## Community and Support + +- πŸ’¬ **Discord**: Join our [Discord server](https://discord.gg/cW99Ds85e8) for discussions and support +- πŸ› **Issues**: Report bugs and request features via [GitHub Issues](https://github.com/amd/iron/issues) +- πŸ“– **Contributing**: See [CONTRIBUTING.md](./CONTRIBUTING.md) for development guidelines +- πŸ“š **Documentation**: Operator examples in `iron/operators/`, kernel docs in `aie_kernels/README.md` + +## License + +IRON is licensed under the Apache License 2.0. See [LICENSE](./LICENSE) for details. + ----- -

Copyright© 2025 Advanced Micro Devices, Inc

+

Copyright © 2025-2026 Advanced Micro Devices, Inc

From 4f0878394ff60ff92006866d92e9e02f6f60fef1 Mon Sep 17 00:00:00 2001
From: Yiannis Papadopoulos
Date: Tue, 24 Mar 2026 15:32:20 -0400
Subject: [PATCH 2/3] docs: add AGENTS.md with instructions for AI coding agents
---
 AGENTS.md | 501 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 501 insertions(+)
 create mode 100644 AGENTS.md

diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 00000000..3a53f0ab
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,501 @@
+# AGENTS.md
+
+This file provides guidance to AI coding agents when working with code in this repository.
+
+## Overview
+
+IRON is a close-to-metal Python API for AMD Ryzen™ AI NPUs (XDNA architecture). It provides language bindings around the MLIR-AIE dialect to enable fast and efficient execution on NPU hardware.
+
+**Key Technologies:**
+
+- **MLIR-AIE**: Dialect for programming AMD AI Engines (AIE) array architectures
+- **XRT (Xilinx Runtime)**: Low-level runtime for interfacing with NPU hardware
+- **Target Hardware**: AMD Ryzen AI NPUs (AIE2/AIE2P architectures - NPU1/NPU2)
+- **Primary Datatype**: bfloat16
+
+## Environment Setup
+
+```bash
+# 1. Source XRT (required for all operations)
+source /opt/xilinx/xrt/setup.sh
+
+# 2. Activate virtual environment
+source ironenv/bin/activate
+
+# 3. Install dependencies
+pip install -r requirements.txt
+```
+
+**Note:** XRT must be sourced before running any tests or operators.
+
+### Build Directory
+
+Compiled artifacts (`.xclbin`, `.bin`, `.o` files) are stored in the `build/` directory by default. The build directory can be customized via `AIEContext(build_dir="path/to/build")`.
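As a plain-Python sketch of that default-with-override behavior (a hypothetical helper for illustration only; the actual `AIEContext` implementation may differ):

```python
import os

def resolve_build_dir(build_dir=None):
    """Hypothetical sketch: where compiled artifacts (.xclbin, .bin, .o) land.

    Mirrors the behavior described above: ./build in the current working
    directory by default, overridable per context via an explicit build_dir.
    """
    return os.path.abspath(build_dir if build_dir is not None else "build")

resolve_build_dir()                # <cwd>/build
resolve_build_dir("out/npu2_run")  # <cwd>/out/npu2_run
```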
+ +### Environment Variables + +- `IRON_EXAMPLE_WEIGHTS_DIR`: Path to model weights for applications (default: `/srv`) + +## Building and Testing + +### Run All Operators (non-extensive tests) + +```bash +pytest iron/operators/ -m "not extensive" +``` + +### Run Extensive Test Suite + +```bash +pytest iron/operators/ +``` + +### Run Single Operator Test + +```bash +pytest iron/operators/axpy/ +# OR directly execute the test.py +./iron/operators/axpy/test.py +``` + +### Run Application Tests + +```bash +pytest iron/applications/ +``` + +### Run Specific Test Function + +```bash +pytest iron/operators/gemm/test.py::test_gemm_basic +``` + +### Parallel Testing (faster) + +```bash +pytest iron/operators/ -n auto -m "not extensive" +``` + +## Code Style and Linting + +### Python (Black) + +```bash +# Check formatting +black --check . + +# Auto-format +black . +``` + +### C++ (clang-format) + +```bash +# Check C++ formatting +python scripts/clang-format-wrapper.py --check + +# Show differences +python scripts/clang-format-wrapper.py --diff + +# Auto-format all +python scripts/clang-format-wrapper.py --fix + +# Format specific directory +python scripts/clang-format-wrapper.py --fix --path aie_kernels/ +``` + +### License Compliance (REUSE) + +```bash +# Check all files have proper license headers +reuse lint +``` + +## Architecture + +### Three-Layer Structure + +1. **Operators** (`iron/operators/`) + - Each operator directory contains: + - `op.py`: Python interface (inherits from `MLIROperator`) - defines operator parameters, compilation artifacts, and runtime argument specs + - `design.py`: NPU implementation using MLIR-AIE Python API - defines ObjectFIFOs, Workers, and Runtime sequences + - `reference.py`: CPU reference implementation for validation + - `test.py`: End-to-end test (build, run, verify against reference) + +2. 
**AIE Kernels** (`aie_kernels/`)
+   - Architecture-specific C++ compute kernels:
+     - `generic/`: Works on both AIE2 and AIE2P
+     - `aie2/`: AIE2-specific (NPU1)
+     - `aie2p/`: AIE2P-specific (NPU2)
+   - Use AIE API for vectorization (e.g., `aie::mmul`, `aie::add`, `aie::mul`)
+   - Compiled to `.o` files and linked into operator `.xclbin`
+
+3. **Common Infrastructure** (`iron/common/`)
+   - `base.py`: Base classes (`AIEOperatorBase`, `MLIROperator`, `CompositeOperator`)
+   - `compilation/`: Compilation artifact system (MLIR → xclbin)
+   - `fusion.py`: Operator fusion framework (`FusedMLIROperator`)
+   - `device_manager.py`: XRT device initialization and management (singleton pattern)
+   - `context.py`: `AIEContext` for operator compilation/execution
+   - `utils.py`: Helper functions (`torch_to_numpy`, `numpy_to_torch`)
+   - `test_utils.py`: Test utilities (`verify_buffer`, `nearly_equal`)
+
+### Key Concepts
+
+**ObjectFIFO**: Data movement primitive in MLIR-AIE
+
+- Connects producers and consumers (shim DMA ↔ compute tiles)
+- Uses `acquire()` to get buffer access, `release()` to free it
+- Pattern: always pair acquire with release in loops
+
+**Worker**: Compute tile task
+
+- Wraps a Python function that runs on AIE compute core
+- Function uses `range_()` for loops (not Python `range`)
+- Calls compiled C++ kernels via `Kernel` objects
+
+**TensorAccessPattern (TAP)**: Describes how data is sliced and distributed
+
+- Used to parallelize work across multiple columns
+- Format: `(tensor_shape, offset, dimensions, strides)`
+
+**Runtime Sequence**: Host-side control flow
+
+- `rt.fill()`: DMA data from host → NPU (shim → L2/L1)
+- `rt.drain()`: DMA data from NPU → host
+- `rt.start()`: Launch workers
+- `rt.task_group()`: Coordinate parallel DMA operations
+
+**Compilation Flow**:
+
+```text
+design.py (Python MLIR-AIE API)
+    ↓
+PythonGeneratedMLIRArtifact
+    ↓
+MLIR (.mlir file)
+    ↓ (aie-opt + aie-translate via Peano toolchain)
+xclbin (NPU binary) + insts.bin (instruction sequence)
+```
+
+**AIEContext**: Manages compilation and runtime state
+
+- Default build directory: `build/` in current working directory
+- Compilation rules: Defines pipeline from Python → MLIR → xclbin
+- Device manager: Singleton for XRT resource sharing
+- Use `AIEContext(build_dir="...", mlir_verbose=True)` for custom settings
+
+**Device Manager**: Singleton that manages XRT resources
+
+- Automatically initializes `pyxrt.device(0)`
+- Caches contexts and kernels per xclbin path
+- Shared across all operators to avoid resource conflicts
+
+## Hardware Constraints
+
+### NPU Architecture Limits
+
+- **NPU1 (AIE2)**: 4 rows × 1-4 columns (AMD Ryzen AI Phoenix/Hawk Point)
+- **NPU2 (AIE2P)**: 4 rows × 8 columns (AMD Ryzen AI 300 Series: "Strix Point" (e.g., Ryzen AI 9 HX 370), "Strix Halo", and "Krackan")
+
+### Tile and Dimension Constraints
+
+Common operator parameters and their constraints:
+
+- `tile_size`: Typically 64, 128, 256, or 4096 (depends on operator and data type)
+- `num_aie_columns`: Must match hardware (1-4 for NPU1, up to 8 for NPU2)
+- `num_aie_rows`: Always 4 for current NPU architectures
+
+**GEMM-specific**:
+
+- `tile_m`, `tile_k`, `tile_n`: Matrix tile dimensions (typically 64)
+- Minimum tile sizes depend on `emulate_bf16_mmul_with_bfp16` flag:
+  - `True` (default): 8×8×8 minimum
+  - `False`: 4×8×8 minimum
+- Matrix dimensions must be multiples of `tile × num_rows/columns`
+  - `M % (tile_m * 4) == 0`
+  - `K % tile_k == 0`
+  - `N % (tile_n * num_aie_columns) == 0`
+
+**Element-wise ops** (add, mul, relu, gelu, etc.):
+
+- `size % (num_aie_columns * tile_size) == 0`
+- `size % tile_size == 0`
+
+### Memory Hierarchy
+
+- **L3**: Host memory (DDR)
+- **L2**: Shared memory tiles (MemTiles in AIE-ML)
+- **L1**: Per-core local memory (limited, ~32-64 KB per tile)
+
+Data movement pattern: L3 → Shim DMA → L2 → L1 (tile local) → Compute
+
+## Adding a New Operator
+
+1. Create directory in `iron/operators/<operator_name>/`
+2. Implement `op.py`:
+   - Subclass `MLIROperator`
+   - Implement `get_operator_name()`, `get_mlir_artifact()`, `get_kernel_artifacts()`, `get_arg_spec()`
+   - Add validation for dimension constraints (assert statements)
+   - Define tile sizes and column counts
+3. Implement `design.py`:
+   - Import from `aie.iron` (Program, Runtime, Worker, ObjectFifo, Kernel)
+   - Define function that builds MLIR-AIE design
+   - Use `range_()` for loops (not Python `range`)
+   - Handle device-specific logic (NPU1 vs NPU2) if needed
+4. Implement C++ kernel in `aie_kernels/<arch>/` if needed
+   - Choose appropriate directory: `generic/`, `aie2/`, or `aie2p/`
+   - Use AIE API for portable vectorization when possible
+   - Add `event0()` and `event1()` for performance profiling
+5. Implement `reference.py` with CPU reference
+6. Implement `test.py` with pytest tests
+   - Use `@pytest.mark.extensive` for slower/larger tests
+   - Use `verify_buffer()` from `iron.common.test_utils`
+7. Register operator in `iron/operators/__init__.py`
+
+## Operator Fusion
+
+IRON supports fusing multiple operators into a single xclbin for improved performance:
+
+```python
+from iron.common.fusion import FusedMLIROperator
+
+# Define individual operators
+gemm1 = AIEGEMM(...)
+relu = AIERELU(...)
+gemm2 = AIEGEMM(...)
+
+# Create fused operator with runlist
+# Intermediate buffers are automatically managed
+fused_op = FusedMLIROperator(
+    name="fused_gemm_relu_gemm",
+    runlist=[
+        (gemm1, "in", "temp1"),  # (operator, input_buffers, output_buffers)
+        (relu, "temp1", "temp2"),
+        (gemm2, "temp2", "out"),
+    ],
+    input_args={"in": size_in},
+    output_args={"out": size_out},
+    context=ctx
+)
+```
+
+Benefits of fusion:
+
+- Reduces host ↔ NPU data transfers
+- Shares compiled resources (kernels.a)
+- Eliminates intermediate buffer round-trips
+
+## Common Patterns
+
+### Multi-Column Parallelism
+
+Distribute work across NPU columns using TensorAccessPattern:
+
+```python
+num_columns = 4
+chunk = total_elements // num_columns
+
+taps = [
+    TensorAccessPattern(
+        (1, total_elements),
+        chunk * i,  # offset for column i
+        [1, 1, 1, chunk],
+        [0, 0, 0, 1],
+    )
+    for i in range(num_columns)
+]
+```
+
+### ObjectFIFO Acquire/Release Pattern
+
+```python
+def core_body(of_in, of_out, kernel_fn):
+    for _ in range_(num_iterations):
+        elem_in = of_in.acquire(1)
+        elem_out = of_out.acquire(1)
+        kernel_fn(elem_in, elem_out, size)
+        of_in.release(1)
+        of_out.release(1)
+```
+
+### Using `range_()` vs `range`
+
+- **Always use `range_()`** in Worker functions (NPU-side code)
+- Use Python `range` only in Runtime sequences (host-side code)
+
+### Vectorized Kernel Template
+
+```cpp
+#include <aie_api/aie.hpp>
+
+void my_kernel(bfloat16* in, bfloat16* out, int32_t size) {
+  event0();  // Start performance counter
+  aie::vector<bfloat16, 32> vec_in = aie::load_v<32>(in);
+  // ... vectorized operations producing vec_out ...
+  aie::store_v(out, vec_out);
+  event1();  // Stop performance counter
+}
+```
+
+**Note**: `event0()` and `event1()` are performance profiling markers.
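Since NPU kernels compute in bfloat16 (only 8 significand bits), outputs are checked against CPU references with tolerances rather than exact equality. The comparison semantics can be sketched in plain Python (an illustrative re-implementation; the real helper is `verify_buffer` in `iron.common.test_utils`, whose behavior may differ):

```python
def count_mismatches(output, reference, rel_tol=0.04, abs_tol=1e-6):
    """Illustrative sketch: indices where NPU output and CPU reference disagree.

    An element passes if it is within abs_tol of the reference, OR within
    rel_tol relative error -- a few percent of relative tolerance is typical
    for bfloat16 NPU-vs-CPU comparisons.
    """
    errors = []
    for i, (o, r) in enumerate(zip(output, reference)):
        diff = abs(o - r)
        if diff > abs_tol and diff > rel_tol * abs(r):
            errors.append(i)
    return errors

# 2.05 vs 2.0 is within 4% relative tolerance; 2.2 vs 2.0 is not.
assert count_mismatches([1.0, 2.05, 3.0], [1.0, 2.0, 3.0]) == []
assert count_mismatches([1.0, 2.2, 3.0], [1.0, 2.0, 3.0]) == [1]
```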
+ +### Test Verification Pattern + +```python +from iron.common.test_utils import verify_buffer + +# Compare NPU output against CPU reference +errors = verify_buffer( + output=npu_output, + buf_name="output", + reference=cpu_reference, + rel_tol=0.04, # 4% relative tolerance + abs_tol=1e-6, # Absolute tolerance for small values + max_error_rate=0.0 # 0% of elements can fail (strict) +) +assert len(errors) == 0, f"Found {len(errors)} mismatches" +``` + +### Datatype Conversion Helpers + +```python +from iron.common.utils import torch_to_numpy, numpy_to_torch + +# Convert torch tensor to numpy (preserves bfloat16) +np_array = torch_to_numpy(torch_tensor) + +# Convert numpy array to torch (preserves bfloat16) +torch_tensor = numpy_to_torch(np_array) +``` + +These utilities handle bfloat16 conversion correctly (avoiding float32 intermediate). + +## Debugging and Performance + +### Debug Mode + +Disable XRT runlist for easier debugging (executes kernels individually): + +```python +context = AIEContext(use_runlist=False) +``` + +This sacrifices performance but makes it easier to identify which kernel fails. + +### Verbose MLIR Output + +Enable verbose MLIR compilation output: + +```python +context = AIEContext(mlir_verbose=True) +``` + +### Performance Profiling + +C++ kernels use `event0()` and `event1()` markers for performance profiling. These can be analyzed with AIE trace tools to measure cycle counts. + +### Logging + +The codebase uses Python's standard `logging` module. 
Enable debug logging:
+
+```python
+import logging
+logging.basicConfig(level=logging.DEBUG)
+```
+
+## CI and PR Workflow
+
+### GitHub Actions Workflows
+
+- **small.yml**: Fast operator tests (non-extensive, runs on every PR)
+- **extensive.yml**: Full test suite (all operators with extensive tests)
+- **test-examples.yml**: Application tests (e.g., Llama inference)
+- **ci-lint.yml**: Linting checks (black, clang-format, reuse)
+
+### Workflow Requirements
+
+- **Target Branch**: Always submit PRs to `devel`
+- **CI Tests**: Run on self-hosted runners with NPU hardware
+- **All CI must pass**: Including linting and formatting checks
+- **Pre-Push Hook** (optional but recommended):
+
+  ```bash
+  cp scripts/hooks/pre-push .git/hooks/pre-push
+  chmod +x .git/hooks/pre-push
+  ```
+
+- **PR Prefixes**: Use "DRAFT:" for work-in-progress, "REFACTOR:" for refactoring
+
+## Troubleshooting
+
+### Common Issues
+
+**"No XRT device found"**
+
+- Ensure `source /opt/xilinx/xrt/setup.sh` was run
+- Check XDNA driver is installed: `lsmod | grep amdxdna`
+
+**"Kernel not found" or "Symbol not defined"**
+
+- Verify kernel `.cc` file is in correct `aie_kernels/<arch>/` directory
+- Check `get_kernel_artifacts()` in `op.py` references correct kernel path
+- Ensure kernel function signature matches `Kernel()` declaration in `design.py`
+
+**Compilation hangs or fails**
+
+- Check MLIR-AIE is installed: `python -c "import aie.iron"`
+- Verify `llvm-aie` is available: `which aie-opt`
+- Look for syntax errors in `design.py` (common: using `range` instead of `range_()`)
+
+**Test failures with numerical differences**
+
+- Check datatype consistency (bfloat16 has limited precision)
+- Verify reference implementation matches NPU kernel exactly
+- Look for memory alignment issues in C++ kernel
+- Adjust tolerances in `verify_buffer()` if needed (`rel_tol`, `abs_tol`)
+
+**Dimension mismatch errors**
+
+- Check operator constraints (e.g., `M % (tile_m * 4) == 0` for GEMM)
+- Verify `tile_size`, `num_aie_columns`, and total size are compatible
+- Ensure tensor dimensions are multiples of required alignment
+
+**"Invalid configuration: NPU2 has 8 columns"**
+
+- NPU1 supports 1-4 columns only
+- NPU2 supports up to 8 columns
+- Device type is auto-detected via XRT
+
+**Kernel compilation failures**
+
+- Check kernel is in correct architecture directory (`generic/`, `aie2/`, `aie2p/`)
+- Verify `#include <aie_api/aie.hpp>` for AIE API kernels
+- Ensure template parameters match function signature
+- Check for syntax errors in vectorization code
+
+## Applications
+
+### Llama 3.2 1B Inference
+
+Full LLM inference example at `iron/applications/llama_3.2_1b/`:
+
+- **Required files**: `model.safetensors`, `tokenizer.model` from Hugging Face
+- **Default location**: `/srv/llama3.2-1b/` (configurable via `IRON_EXAMPLE_WEIGHTS_DIR`)
+- **Additional deps**: `pip install -r requirements_examples.txt`
+- **Run**: `pytest iron/applications/llama_3.2_1b/`
+
+### AIE Kernel Reference
+
+See `aie_kernels/README.md` for a catalog of available kernels:
+
+- Element-wise ops (add, mul, scale)
+- Matrix operations (mm, mv)
+- Reductions (add, max, min)
+- ML ops (conv2d, relu, exp)
+- Vision ops (rgba2gray, filter2d)
+
+Kernels are organized by coding style:
+
+- **AIE API**: Portable C++ template library (recommended)
+- **Intrinsics**: Architecture-specific low-level intrinsics (max performance)
+- **Generic C**: Works on any AIE family (basic functionality)

From 32d43c4c7328eb8d4bfc5eb8257f1a803b64f1f1 Mon Sep 17 00:00:00 2001
From: Yiannis Papadopoulos
Date: Wed, 25 Mar 2026 14:25:48 -0400
Subject: [PATCH 3/3] docs: remove Quick Start Example section from README
---
 README.md | 46 ----------------------------------------------
 1 file changed, 46 deletions(-)

diff --git a/README.md b/README.md
index f507786d..3f2f9744 100755
--- a/README.md
+++ b/README.md
@@ -177,52 +177,6 @@ The hook will run the same linting checks as CI:
 To bypass the hook if needed: `git push 
--no-verify` -## Quick Start Example - -Here's a simple example using the AXPY operator (Y = aX + Y): - -```python -#!/usr/bin/env python3 -from iron.operators.axpy import AIEAXPY -import numpy as np -from ml_dtypes import bfloat16 - -# Define operator parameters -size = 2048 -num_aie_columns = 4 -num_channels = 2 -tile_size = 512 -scalar_factor = 3.0 - -# Create and compile the operator -operator = AIEAXPY( - size=size, - num_aie_columns=num_aie_columns, - num_channels=num_channels, - tile_size=tile_size, - scalar_factor=scalar_factor, -) -operator.compile() - -# Prepare input data -x = np.random.rand(size).astype(bfloat16) -y = np.random.rand(size).astype(bfloat16) -output = np.zeros(size, dtype=bfloat16) - -# Run on NPU -operator(x, y, output) - -# Verify: output = scalar_factor * x + y -expected = scalar_factor * x.astype(np.float32) + y.astype(np.float32) -print(f"Result matches expected: {np.allclose(output, expected, rtol=0.04)}") -``` - -Run this example: - -```bash -python example.py -``` - ## Applications ### Llama 3.2 1B Inference