Commit ac79202

Browse files
jgmelberclaude
andcommitted
Add AIE development tips to CLAUDE.md
Hardware constraints, kernel development patterns, design composition techniques, and testing/debugging lessons learned from the SwiGLU fusion work. Covers DMA channel limits, bf16 precision, static buffer alignment, build caching pitfalls, and tolerance tuning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 88af3db commit ac79202

1 file changed: CLAUDE.md (130 additions & 0 deletions)
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

IRON is a Python API for programming AMD Ryzen AI NPUs (Neural Processing Units). It wraps the MLIR-AIE dialect to enable writing high-performance ML operators targeting NPU hardware. The project includes 22 pre-built operators and a Llama 3.2 1B inference application.
## Common Commands

### Setup

```bash
python3 -m venv ironenv && source ironenv/bin/activate
source /opt/xilinx/xrt/setup.sh
pip install -r requirements.txt
```
### Testing

```bash
# Run quick test suite (smoke tests only)
pytest iron/operators/ -m "not extensive"

# Run full test suite (all parameter sweeps)
pytest iron/operators/

# Run a single operator's tests
pytest iron/operators/gemm/test.py

# Run a single test function
pytest iron/operators/gemm/test.py::test_gemm

# Run application tests
pytest iron/applications/

# Run all tests
pytest
```
### Linting & Formatting

```bash
# Python formatting (black, default 88-char line width)
black --check .  # check
black .          # fix

# C++ formatting (clang-format, LLVM-based, 120-char column limit)
python scripts/clang-format-wrapper.py --check  # check
python scripts/clang-format-wrapper.py --fix    # fix

# License compliance (SPDX headers required on all files)
reuse lint
```
## Architecture

### Operator Pattern

Every operator in `iron/operators/<name>/` follows a 4-file convention:
- **`op.py`** — Python operator class (extends `AIEOperatorBase`), the user-facing API
- **`design.py`** — NPU hardware design using the MLIR-AIE DSL (tile mapping, data movement, kernel binding)
- **`reference.py`** — CPU reference implementation for golden-value testing
- **`test.py`** — Parametrized pytest tests comparing NPU output against the reference
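The reference/test halves of this convention can be sketched in plain NumPy. The operator (`scale`) and both function names below are hypothetical illustrations, and the tolerance check stands in for the repo's `verify_buffer()` helper rather than reproducing its actual signature:

```python
import numpy as np

# Hypothetical reference.py for an imagined elementwise "scale" operator;
# the operator and function names are illustrative, not from the repo.
def scale_reference(x: np.ndarray, alpha: float) -> np.ndarray:
    """CPU golden reference that test.py compares NPU output against."""
    return (alpha * x).astype(x.dtype)

# test.py would wrap this in parametrized pytest sweeps; the comparison
# itself reduces to an elementwise mixed-tolerance check:
def within_tolerance(actual, expected, rel_tol=0.04, abs_tol=1e-6):
    err = np.abs(np.asarray(actual) - np.asarray(expected))
    return bool(np.all(err <= abs_tol + rel_tol * np.abs(expected)))
```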
### Core Infrastructure (`iron/common/`)

- **`aie_base.py`** — `AIEOperatorBase` abstract class: kernel/buffer registration, runlist execution, buffer I/O
- **`aie_context.py`** — `AIEContext` singleton orchestrating the operator lifecycle: compilation, runtime prep, device management
- **`compilation.py`** — Artifact-based build system with dependency tracking and caching (singleton per absolute path). Key classes: `SourceArtifact`, `XclbinArtifact`, `InstsBinArtifact`, `KernelObjectArtifact`
- **`aie_device_manager.py`** — XRT runtime integration for hardware device management
- **`test_utils.py`** — `run_test()` and `verify_buffer()` helpers used by all operator tests
### AIE Kernels (`aie_kernels/`)

Low-level C/C++ compute kernels organized by hardware generation:
- `generic/` — Architecture-agnostic implementations
- `aie2/` — AIE2 optimized
- `aie2p/` — AIE2+ optimized
### Applications (`iron/applications/`)

End-to-end applications (e.g., Llama 3.2 1B) that compose multiple operators.
### Key Dependencies

- **mlir-aie** — MLIR dialect bindings (`from aie.iron import Kernel, ObjectFifo, Program, Runtime, Worker`)
- **llvm-aie (Peano)** — Low-level compiler backend
- **XRT** — Xilinx Runtime for hardware execution (loaded via `source /opt/xilinx/xrt/setup.sh`)
- **torch, numpy, ml_dtypes** — Data types and tensor operations (primary dtype: `bfloat16`)
## CI

Three CI pipelines run on GitHub Actions:
1. **ci-lint** (every push/PR) — black, clang-format, reuse lint
2. **small** (PRs + devel/main) — quick operator tests (`-m "not extensive"`)
3. **extensive** (daily + manual) — full parameter-sweep tests
## Code Conventions

- All files require SPDX license headers:
  ```
  SPDX-FileCopyrightText: Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.
  SPDX-License-Identifier: Apache-2.0
  ```
- PR target branch is `devel`
- C++ style: LLVM-based, 120-char columns, 4-space indent, Linux brace style
- Python style: black defaults (88-char lines)
- Test metrics are collected via `conftest.py` into CSV format for regression tracking
## AIE Development Tips

### Hardware Constraints

- **AIE tile L1 memory: 64 KB.** ObjectFIFO buffers, stack, and static data all share this. Budget carefully when designing multi-FIFO operators. Use `dequant/design.py` as a reference: 16384 bf16 elements (32 KB) is half of L1.
- **DMA channels per compute tile: 2 input + 2 output.** Each ObjectFIFO consumer or producer counts as one channel. A design with 3 input FIFOs per tile will fail with `'aie.tile' op number of input DMA channel exceeded!`. Plan FIFO count around this limit.
- **One `.o` binary per Worker.** MLIR-AIE enforces `ValueError: Currently, only one binary per works is supported`. If you need functions from multiple source files, combine them into a single `.cc` file.
- **ShimDMA channels are limited.** The `SequentialPlacer` will fail with `Failed to find a tile matching column N` if the total FIFO count across columns exceeds available shim resources. Reduce `num_aie_columns` if this happens.
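A back-of-envelope check against the 64 KB limit can be scripted before committing to a FIFO layout. This is a hypothetical helper, assuming a 4 KB stack reservation (the real stack size is design-dependent):

```python
L1_BYTES = 64 * 1024  # per-tile L1 on AIE compute tiles

def l1_usage(fifos, stack_bytes=4 * 1024):
    """Sum L1 consumption for (depth, elems_per_buffer, bytes_per_elem)
    ObjectFIFO descriptors plus a hypothetical stack reservation."""
    return stack_bytes + sum(d * n * b for d, n, b in fifos)

# The dequant-style buffer from the tip above: 16384 bf16 elements is
# 16384 * 2 = 32 KB, half of L1 as a single depth-1 buffer -- and
# double-buffering that same FIFO alone already blows the budget.
assert l1_usage([(1, 16384, 2)]) <= L1_BYTES
assert l1_usage([(2, 16384, 2)]) > L1_BYTES
```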
### Kernel Development

- **AIE2 vs AIE2+ tanh:** AIE2 (`aie_kernels/aie2/`) uses LUT-based `getTanhBf16()` from `lut_based_ops.h`. AIE2+ (`aie_kernels/aie2p/`) uses hardware `aie::tanh<bfloat16>(float_vec)`. Always implement both variants.
- **Static buffers need alignment.** If a kernel uses `aie::load_v<N>()` or `aie::begin_restrict_vector<N>()` on static arrays, add `__attribute__((aligned(64)))`. Misaligned vector access reads from wrong addresses silently.
- **bf16 tanh saturates for |x| > 8.** `tanh(x/2)` in bf16 returns exactly -1.0 or 1.0, making `sigmoid(x) = 0` or `1`. This means `silu(x) = 0` for x < -8. This is expected — match tolerances to the standalone SiLU operator (`abs_tol=1.0` for composed operators).
- **Static kernel buffers work** for scratch space that doesn't need DMA. Use them to hold intermediate results between kernel phases within the same core body. They avoid ObjectFIFO overhead but don't show up in the MLIR buffer address map — verify no overlap by checking total L1 usage.
- **`matvec_vectorized` uses `r=64` SIMD lanes** with `aie::mac` + `aie::reduce_add`. The inner loop requires `k >= 2*r = 128` for pipelining (see `AIE_LOOP_MIN_ITERATION_COUNT(2)`).
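The saturation behaviour is easy to reproduce on the CPU. A sketch that emulates bf16 round-to-nearest-even with NumPy bit tricks (a stand-in for `ml_dtypes.bfloat16`), using the `sigmoid(x) = (1 + tanh(x/2)) / 2` identity behind the tip above:

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to bfloat16 (round-to-nearest-even) and
    return them widened back to float32 -- a stand-in for ml_dtypes."""
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    u = (u + 0x7FFF + ((u >> 16) & 1)) & 0xFFFF0000
    return u.astype(np.uint32).view(np.float32)

x = np.array([-9.0, 9.0], dtype=np.float32)
t = to_bf16(np.tanh(x / 2))   # bf16 tanh saturates to exactly [-1.0, 1.0]
sigmoid = 0.5 * (1.0 + t)     # exactly [0.0, 1.0]
silu = x * sigmoid            # exactly [0.0, 9.0]: silu(x) == 0 for x < -8
```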
### Design Patterns

- **Composing operators in SwiGLU-style pipelines:** Use `get_artifacts(prefix=)` to create uniquely-named artifacts. Chain xclbins via `xclbin.xclbin_input` and `xclbin.depends`. Assign unique kernel IDs (`--xclbin-kernel-id=0x90N`).
- **Pre-interleaving weights for shared FIFOs:** When a single A FIFO must carry data from two DDR buffers (e.g., W1 and W2 for dual-GEMV), pre-interleave in Python at setup time. The core body consumes tiles in FIFO order — structure the interleaving to match. Use a `phase` parameter in the kernel to select the destination buffer.
- **Two `rt.fill()` calls to the same FIFO in a `task_group` have NO ordering guarantee.** The task_group only controls await/free bookkeeping. If you need ordered fills, either pre-interleave in DDR or use separate FIFOs (subject to DMA channel limits).
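The pre-interleaving step can be sketched in NumPy. This assumes row-tiled weights and a hypothetical `tile_rows` granularity; the real tiling must follow the design's FIFO geometry:

```python
import numpy as np

def interleave_row_tiles(w1, w2, tile_rows):
    """Interleave row-tiles of two DDR weight buffers so that a single
    FIFO delivers them in the order the core body consumes them:
    W1 tile 0, W2 tile 0, W1 tile 1, W2 tile 1, ...
    (tile_rows is an illustrative tiling parameter.)"""
    assert w1.shape == w2.shape and w1.shape[0] % tile_rows == 0
    tiles = []
    for r in range(0, w1.shape[0], tile_rows):
        tiles.append(w1[r:r + tile_rows])
        tiles.append(w2[r:r + tile_rows])
    return np.concatenate(tiles, axis=0)
```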
### Testing & Debugging

- **Always `rm -rf build/<operator>_*` when changing kernels or designs.** The build system caches `.xclbin` and `.bin` artifacts by filename. Stale artifacts cause confusing failures including `ERT_CMD_STATE_TIMEOUT` hangs.
- **Verify the reference separately.** Before debugging hardware mismatches, check that the Python reference produces correct values at the failing indices. Use `torch.manual_seed` for reproducibility.
- **Isolate components.** If a fused operator fails, test each component standalone first (e.g., test GEMV alone with the same data, test SiLU+Mul alone). The standard GEMV test at `iron/operators/gemv/test.py` is a reliable baseline.
- **Tolerance ladder for composed operators:** Standalone elementwise ops use `rel_tol=0.04, abs_tol=1e-6`. Multi-kernel pipelines (SwiGLU) use `rel_tol=0.07, abs_tol=0.7` to `1.0` because bf16 rounding errors accumulate through GEMV + activation + multiply.
- **`ERT_CMD_STATE_TIMEOUT`** usually means the core body is deadlocked waiting on a FIFO acquire that will never be filled (DMA misconfiguration) or the design doesn't terminate. Check that fill/drain counts match the core body's acquire/release counts.
- **XRT may be a system package** (no `/opt/xilinx/xrt/setup.sh`). If `pyxrt` isn't found in a venv, symlink it: `ln -s /usr/lib/python3/dist-packages/pyxrt.*.so ironenv/lib/python3.12/site-packages/`
- **Inspect generated MLIR** by calling the design function directly: `python3 -c "from iron.operators.foo.design import my_foo; print(my_foo('npu2', ...))"`. Check ObjectFIFO declarations, tile assignments, and DMA channel usage before running on hardware.
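For the "verify the reference separately" and tolerance-ladder steps, a small helper that reports the first out-of-tolerance element is often enough to decide which side to debug. This is a hypothetical helper, not the repo's `verify_buffer()`:

```python
import numpy as np

def first_mismatch(actual, expected, rel_tol=0.04, abs_tol=1e-6):
    """Return (index, actual, expected) for the first element outside
    abs_tol + rel_tol * |expected|, or None if everything passes."""
    a = np.asarray(actual, dtype=np.float64).ravel()
    e = np.asarray(expected, dtype=np.float64).ravel()
    bad = np.nonzero(np.abs(a - e) > abs_tol + rel_tol * np.abs(e))[0]
    if bad.size == 0:
        return None
    i = int(bad[0])
    return i, float(a[i]), float(e[i])
```

Checking the reported index against the Python reference first (with a fixed `torch.manual_seed`) distinguishes a bad reference from a bad design.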
