Beta Support for XDNA2 Platform #179
Victor-Jung wants to merge 16 commits into pulp-platform:devel from
Conversation
📝 Walkthrough (Summary by CodeRabbit)
This PR introduces comprehensive XDNA2 (AIE2p NPU) platform support to Deeploy, including an MLIR-based code generation framework, a platform-specific deployer and type system, AIE kernel implementations, a Docker container setup, GitHub Actions CI workflows, and test infrastructure with XRT host integration.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 7
🧹 Nitpick comments (16)
DeeployTest/testUtils/testRunner.py (1)
95-114: Validate tiling CLI inputs at parse-time.
`--l1`/`--l3` currently accept non-positive values, and `--defaultMemLevel` accepts arbitrary strings. Adding argument-level validation avoids late failures in generation passes.
Suggested refactor

```diff
+def _positive_int(value: str) -> int:
+    ivalue = int(value)
+    if ivalue <= 0:
+        raise argparse.ArgumentTypeError("Value must be a positive integer")
+    return ivalue
+
 class TestGeneratorArgumentParser(argparse.ArgumentParser):
@@
         self.add_argument('--l1',
                           metavar = '<size>',
                           dest = 'l1',
-                          type = int,
+                          type = _positive_int,
                           default = None,
                           help = 'Set L1 memory size in bytes (enables tiling if specified).\n')
         self.add_argument('--l3',
                           metavar = '<size>',
                           dest = 'l3',
-                          type = int,
+                          type = _positive_int,
                           default = None,
                           help = 'Set L3 memory size in bytes.\n')
         self.add_argument('--defaultMemLevel',
                           metavar = '<level>',
                           dest = 'defaultMemLevel',
                           type = str,
+                          choices = ["L1", "L2", "L3"],
                           default = "L3",
                           help = 'Set default memory level (default: L3)\n')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/testRunner.py` around lines 95 - 114, The CLI accepts invalid tiling inputs; validate them at argument-parse time by replacing the raw int/str types with validation helpers: for '--l1' and '--l3' use a positive-integer validator (e.g., positive_int) that raises argparse.ArgumentTypeError for <=0 values, and for '--defaultMemLevel' restrict values via choices (e.g., ['L1','L3']) or a validator that normalizes and rejects unknown values; update the self.add_argument calls for l1, l3, and defaultMemLevel in testRunner.py accordingly so parsing fails fast with clear messages.

DeeployTest/test_platforms.py (1)
128-150: Consider updating the markers summary to include `xdna2`.
The markers summary comment documents the available platform markers but doesn't include the newly added `xdna2` marker.
Suggested addition

```diff
 # snitch: tests from the Snitch platform (untiled)
 # snitch_tiled: tests from the Snitch platform (tiled)
 # siracusa: tests from the Siracusa platform (untiled)
 # siracusa_tiled: tests from the Siracusa platform (tiled)
 # siracusa_neureka_tiled: tests from the Siracusa + Neureka platform (tiled)
 # gap9: tests from the GAP9 platform (untiled)
 # gap9_tiled: tests from the GAP9 platform (tiled)
+# xdna2: tests from the XDNA2 platform
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/test_platforms.py` around lines 128 - 150, Update the markers summary comment in the Platform markers section of DeeployTest/test_platforms.py to include the new platform marker `xdna2`; specifically add `xdna2: tests from the XDNA2 platform` (or a similarly descriptive phrase) alongside the existing platform marker lines so the comment accurately documents the new `xdna2` marker.

CMakeLists.txt (1)
23-24: Add "XDNA2" to the platform property strings list.
The `XDNA2` platform is handled in the build logic (lines 49-50 and 304-317), but it's not included in the `set_property(CACHE platform PROPERTY STRINGS ...)` list. This affects CMake GUI/ccmake users who rely on the dropdown for valid platform options.
Proposed fix

```diff
 set(platform MemPool CACHE STRING "Platform (MemPool, SoftHier, QEMU, Siracusa, Siracusa_w_neureka, PULP-Open, GAP9, Generic, Snitch)")
-set_property(CACHE platform PROPERTY STRINGS MemPool SoftHier QEMU Siracusa Siracusa_w_neureka PULP-Open GAP9 Generic Snitch)
+set_property(CACHE platform PROPERTY STRINGS MemPool SoftHier QEMU Siracusa Siracusa_w_neureka PULP-Open GAP9 Generic Snitch XDNA2)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CMakeLists.txt` around lines 23 - 24, The CMake cache property list for the platform option is missing "XDNA2", causing GUI/ccmake users to not see it as a valid choice; update the set_property(CACHE platform PROPERTY STRINGS ...) call to include "XDNA2" (and keep it consistent with the initial set(platform ...) default if you want it selectable there too) so the platform dropdown contains XDNA2 alongside MemPool, SoftHier, QEMU, Siracusa, Siracusa_w_neureka, PULP-Open, GAP9, Generic, and Snitch.

Deeploy/Targets/XDNA2/CodeTransformationPasses/__init__.py (1)
5-7: Consider explicit imports for better IDE support and static analysis.
While wildcard imports in `__init__.py` are a common re-export pattern, explicit imports with `__all__` provide better discoverability and make the public API explicit. This is a minor maintainability suggestion.
Optional: Explicit imports alternative

```diff
-from Deeploy.Targets.XDNA2.CodeTransformationPasses.MLIRComputeCorePass import *
-from Deeploy.Targets.XDNA2.CodeTransformationPasses.MLIRObjectFifoPass import *
-from Deeploy.Targets.XDNA2.CodeTransformationPasses.MLIRRuntimeSequencePass import *
+from Deeploy.Targets.XDNA2.CodeTransformationPasses.MLIRComputeCorePass import MLIRComputeCorePass
+from Deeploy.Targets.XDNA2.CodeTransformationPasses.MLIRObjectFifoPass import MLIRObjectFifoPass
+from Deeploy.Targets.XDNA2.CodeTransformationPasses.MLIRRuntimeSequencePass import MLIRRuntimeSequencePass
+
+__all__ = ['MLIRComputeCorePass', 'MLIRObjectFifoPass', 'MLIRRuntimeSequencePass']
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/XDNA2/CodeTransformationPasses/__init__.py` around lines 5 - 7, Replace the three wildcard imports in __init__.py with explicit imports of the public symbols exported by MLIRComputeCorePass, MLIRObjectFifoPass, and MLIRRuntimeSequencePass (e.g., import the specific classes/functions you intend to re-export) and then define an explicit __all__ list containing those symbol names; update the import lines referencing Deeploy.Targets.XDNA2.CodeTransformationPasses.MLIRComputeCorePass, MLIRObjectFifoPass, and MLIRRuntimeSequencePass to import only the public symbols and set __all__ = [...names...] so IDEs and static analyzers can discover the package API.

DeeployTest/testUtils/platformMapping.py (1)
286-287: Chain the exception for better debugging.
When re-raising an exception, use `from` to preserve the original traceback. This helps with debugging when the import fails for unexpected reasons.
🔧 Proposed fix

```diff
-    except ImportError:
-        raise RuntimeError(f"Deployer for platform {platform} is not implemented")
+    except ImportError as err:
+        raise RuntimeError(f"Deployer for platform {platform} is not implemented") from err
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/platformMapping.py` around lines 286 - 287, The except ImportError handler in the function that loads deployers currently raises RuntimeError(f"Deployer for platform {platform} is not implemented") without chaining the original ImportError; modify the except block to capture the ImportError (e.g., except ImportError as e:) and re-raise the RuntimeError using exception chaining (raise RuntimeError(f"Deployer for platform {platform} is not implemented") from e) so the original traceback is preserved for debugging.

Container/Dockerfile.deeploy-xdna (2)
14-35: Consider adding `--no-install-recommends` to reduce image size.
Adding `--no-install-recommends` can significantly reduce the Docker image size by avoiding installation of suggested but non-essential packages.
📦 Proposed fix

```diff
 RUN apt-get update && apt-get install -y \
+    --no-install-recommends \
     software-properties-common \
     && add-apt-repository -y ppa:amd-team/xrt \
-    && apt-get update && apt-get install -y \
+    && apt-get update && apt-get install -y --no-install-recommends \
     cmake \
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Container/Dockerfile.deeploy-xdna` around lines 14 - 35, Update the RUN layer that invokes apt-get to add --no-install-recommends to both apt-get install -y invocations in the multi-step RUN command (the line starting with "RUN apt-get update && apt-get install -y \" and the subsequent "&& apt-get update && apt-get install -y \" sequence) so that unnecessary recommended packages are not installed and image size is reduced; keep the existing rm -rf /var/lib/apt/lists/* cleanup as-is.
12-12: Clarify the purpose of `LLVM_INSTALL_DIR="nope"`.
This environment variable with a placeholder value may cause confusion. Consider adding a comment explaining why it is set, or whether it's a workaround for a specific issue.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Container/Dockerfile.deeploy-xdna` at line 12, The ENV line setting LLVM_INSTALL_DIR="nope" is ambiguous; update the Dockerfile to either set a meaningful path or add an inline comment explaining why "nope" is used (e.g., as a deliberate sentinel to disable LLVM lookup or to work around a specific build issue). Locate the ENV LLVM_INSTALL_DIR declaration and either replace "nope" with the correct install path or append a short comment describing the reason and any related workarounds or downstream expectations so future maintainers aren’t confused.

README_XDNA.md (2)
21-21: Consider clarifying the temporary IRON dependency.
This line references a specific developer's path (`/scratch/jungvi/IRON`) and an internal workaround. Consider either removing this line for public documentation or adding context about when this workaround is needed vs. the standard flow.
Verify each finding against the current code and only fix it if needed. In `@README_XDNA.md` at line 21, Update the README_XDNA.md sentence that embeds a personal developer path by either removing the specific `/scratch/jungvi/IRON` example or replacing it with a generic, documented workaround: describe that mounting an IRON repo is a temporary step only required until Deeploy midend/backend support MLIR, show a generic volume example like `-v /path/to/IRON:/opt/IRON`, and explain the corresponding env var `IRON_OPERATORS_DIR=/opt/IRON/iron/operators` and when users should use it versus the standard flow; ensure the text clearly labels it as a temporary, developer-only workaround.
6-8: Add language specifiers to fenced code blocks.
The code blocks are missing language specifiers, which affects syntax highlighting and accessibility. Add `bash` or `shell` to the command blocks.
📝 Proposed fix

````diff
-```
+```bash
 docker build -f Container/Dockerfile.deeploy-xdna -t deeploy-xdna:local .
 ```
````

Apply similar changes to the other code blocks at lines 11, 24, 38, and 44.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README_XDNA.md` around lines 6 - 8, Add language specifiers (e.g., bash or shell) to the fenced code blocks that contain shell commands so they get proper syntax highlighting; for example, change the block around the docker build command (`docker build -f Container/Dockerfile.deeploy-xdna -t deeploy-xdna:local .`) to start with ```bash and similarly update the other command/code blocks referenced (lines with commands at the other snippets) to use ```bash or ```shell so all shell snippets in README_XDNA.md are consistently labeled.

DeeployTest/generateNetwork_xdna2.py (2)
191-196: Hardcoded Add operation limits reusability.
The golden output computation hardcodes elementwise addition, which will break for other operations. The TODO comment acknowledges this should be moved to ONNX4Deeploy.
Consider either:
- Running ONNX inference with BF16-quantized inputs to compute operation-agnostic golden outputs
- Extracting the operation type from the graph and dispatching to the appropriate NumPy function
Would you like me to help design a more generic golden output computation approach?
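One possible shape for the second option — dispatching on the graph's operator type — is sketched below. The `GOLDEN_OPS` table and `compute_golden` helper are made-up names for illustration (plain Python lists stand in for NumPy arrays), not the actual Deeploy or ONNX4Deeploy API:

```python
# Hypothetical dispatch table mapping an ONNX op_type to a golden-output handler.
GOLDEN_OPS = {
    "Add": lambda a, b: [x + y for x, y in zip(a, b)],
    "Mul": lambda a, b: [x * y for x, y in zip(a, b)],
}

def compute_golden(op_type, inputs):
    """Dispatch the golden-output computation on the graph's root operator."""
    try:
        handler = GOLDEN_OPS[op_type]
    except KeyError:
        raise NotImplementedError(f"No golden handler for op '{op_type}'")
    return handler(*inputs)

print(compute_golden("Add", [[1.0, 2.0], [3.0, 4.0]]))  # [4.0, 6.0]
```

Unsupported operators then fail loudly with `NotImplementedError` instead of silently producing an Add-shaped golden output.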
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/generateNetwork_xdna2.py` around lines 191 - 196, The golden-output computation currently hardcodes elementwise addition using bf16_inputs -> bf16_input_f32 -> golden_f32 and will fail for non-add ops; replace this with an operation-agnostic approach: either run the ONNX graph (using the BF16-quantized inputs bf16_inputs) to produce test_outputs_bf16, or inspect the model graph to extract the root operator(s) and dispatch to the corresponding NumPy routine (e.g., use op_type from nodes and call the appropriate handler) so that test_outputs_bf16 is produced generically instead of summing elements.
129-137: Fix unused loop variable and add `strict=True` for safety.
The `name` variable is unused, and `zip()` without `strict=` can silently produce incorrect results if the iterables have different lengths.
🔧 Proposed fix

```diff
-    for index, (name, values) in enumerate(zip(inputs_npz.files, test_inputs_f32)):
+    for index, (_name, values) in enumerate(zip(inputs_npz.files, test_inputs_f32, strict=True)):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/generateNetwork_xdna2.py` around lines 129 - 137, In the loop that currently reads "for index, (name, values) in enumerate(zip(inputs_npz.files, test_inputs_f32))", mark the unused filename variable as _ and make zip strict to avoid length-mismatch bugs: change the header to "for index, (_, values) in enumerate(zip(inputs_npz.files, test_inputs_f32, strict=True))" so you still populate inputTypes/inputOffsets with PointerClass(bfloat16_t) as before; no other logic changes needed.

Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRObjectFifoPass.py (1)
47-47: Use `next(iter(...))` instead of `list(...)[0]`.
Per the static analysis suggestion, this is more efficient as it avoids creating an intermediate list.
♻️ Proposed fix

```diff
-    firstOutputName = list(outputConstraints.keys())[0]
+    firstOutputName = next(iter(outputConstraints.keys()))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRObjectFifoPass.py` at line 47, Replace the inefficient list(...) indexing when obtaining the first key from outputConstraints: locate the assignment to firstOutputName in MLIRObjectFifoPass.py and use next(iter(outputConstraints)) (or next(iter(outputConstraints.keys()))) instead of list(outputConstraints.keys())[0] so you avoid creating an intermediate list; update the code where firstOutputName is set accordingly.

Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRComputeCorePass.py (1)
68-88: Assumes uniform tile type for all input/output tensors.
Line 70 uses `firstKey`'s type as `tileTy`, and line 74 creates a single `subviewTy` used for all FIFO acquisitions. This works for the current Add operation, where all tensors have identical shapes and types, but will need generalization for operators with mixed tensor types.
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRComputeCorePass.py` around lines 68 - 88, The code assumes a single representative tile type (tileTy derived from firstKey) and reuses one subviewTy for all FIFO acquisitions, which breaks when tensors have different types; update the loop so you look up the per-key memref type via mlirBlock.fifoTypes[key] and create a matching subview type for each acquisition (i.e., compute subviewTyPerKey from mlirBlock.fifoTypes[key] before calling aie_d.objectfifo_acquire/objectfifo_subview_access), handling inputs (self.inputTensorKeys) and outputs (self.outputTensorKeys) independently rather than reusing tileTy/subviewTy for every key.

Deeploy/Targets/XDNA2/Platform.py (1)
126-138: Duplicated `getTargetMemoryLevel` logic.
Both `MemoryXDNA2Platform` and `MemoryXDNA2PlatformWrapper` implement identical logic for checking `_engine_assignment`. Consider extracting it into a shared helper function to maintain consistency.
♻️ Proposed refactor

```python
def _get_xdna2_target_memory_level(node: gs.Node, default_level_name: str) -> str:
    """Shared logic for XDNA2 memory level selection."""
    if hasattr(node, '_engine_assignment'):
        engine = node._engine_assignment
        if isinstance(engine, XDNA2AIECoreEngine) and hasattr(engine, 'preferredMemoryLevel'):
            return engine.preferredMemoryLevel
    return default_level_name
```

Then both classes can call this helper in their `getTargetMemoryLevel` methods.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/XDNA2/Platform.py` around lines 126 - 138, Extract the duplicated engine-assignment logic from both MemoryXDNA2Platform.getTargetMemoryLevel and MemoryXDNA2PlatformWrapper.getTargetMemoryLevel into a shared helper (e.g., _get_xdna2_target_memory_level) that accepts the node and the default level name; have each getTargetMemoryLevel call this helper instead of repeating the hasattr/instanceof checks against XDNA2AIECoreEngine and preferredMemoryLevel, returning engine.preferredMemoryLevel when present or defaultTargetMemoryLevel.name otherwise.

Deeploy/Targets/XDNA2/Deployer.py (1)
159-171: Tensor type derived from first node only.
Line 163 derives `tensorTy` from `firstEb.numElements` and hardcodes `BF16Type`. This assumes all runtime sequence arguments have the same element count and type. For the current Add-only scope this is fine, but consider documenting this assumption or adding validation when extending to multi-node graphs with varying tensor sizes.
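Such a validation could look roughly like the sketch below. The `numElements` field mirrors the review comment; `check_uniform_num_elements` and the `FakeBlock` stand-in are made up for illustration, not the actual Deployer API:

```python
def check_uniform_num_elements(mlir_blocks):
    """Fail fast if runtime-sequence arguments disagree on element count."""
    counts = {blk.numElements for blk in mlir_blocks}
    if len(counts) != 1:
        raise ValueError(f"Runtime sequence expects uniform tensor sizes, got {sorted(counts)}")
    return counts.pop()

class FakeBlock:  # minimal stand-in for an MLIR block descriptor
    def __init__(self, n):
        self.numElements = n

print(check_uniform_num_elements([FakeBlock(1024), FakeBlock(1024)]))  # 1024
```

A mismatched graph then raises a clear `ValueError` before `aiex_d.runtime_sequence` is ever built with inconsistent shapes.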
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/XDNA2/Deployer.py` around lines 159 - 171, The code derives tensorTy from the first block (mlirBlocks[0]) using firstEb.numElements and a hardcoded ir.BF16Type, which assumes every runtime-sequence arg shares the same element count and type; update the runtime-sequence setup in Deployer.py to validate that all blocks in mlirBlocks have matching numElements (and, if applicable, matching element types) before creating tensorTy (or document this assumption clearly), and emit a clear error or fallback if a mismatch is detected so aiex_d.runtime_sequence(_seq) is not created with inconsistent tensor shapes.

DeeployTest/Platforms/XDNA2/CMakeLists.txt (1)
29-69: LGTM! Toolchain resolution is well-structured.
The fallback chain (cache → environment → Python query) for `LLVM_AIE_INSTALL_DIR` and `MLIR_AIE_INSTALL_DIR` provides good flexibility.
One minor inconsistency: `XRT_INSTALL_DIR` isn't declared as a cache variable like the others. Consider adding `CACHE PATH` for consistency:

```diff
-if(NOT XRT_INSTALL_DIR)
+set(XRT_INSTALL_DIR "" CACHE PATH "XRT install dir")
+if(NOT XRT_INSTALL_DIR)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/Platforms/XDNA2/CMakeLists.txt` around lines 29 - 69, The XRT_INSTALL_DIR variable should be declared as a cache variable like LLVM_AIE_INSTALL_DIR and MLIR_AIE_INSTALL_DIR for consistent configuration and overrides; change the declaration so XRT_INSTALL_DIR is initialized with set(XRT_INSTALL_DIR "$ENV{XRT_INSTALL_DIR}" CACHE PATH "XRT install dir") (or similar) and preserve the existing fallback logic in the if(NOT XRT_INSTALL_DIR) block that checks ENV{XILINX_XRT} and defaults to "/opt/xilinx/xrt", ensuring the cache variable is used/checked instead of an undeclared plain variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Deeploy/MLIRDataTypes.py`:
- Around line 187-201: The generate method uses a mutable default
operatorRepresentation = {} which can lead to shared-state bugs; change the
signature of generate (the method named generate in MLIRDataTypes.py) to accept
operatorRepresentation = None (or Optional[Dict]) and inside the method do a
guarded assignment like "if operatorRepresentation is None:
operatorRepresentation = {}" before using it, ensuring you preserve the existing
behavior and return value while avoiding the mutable default argument.
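The mutable-default pitfall the prompt refers to, in miniature — `generate_buggy` shares one dict across all calls, while `generate_fixed` applies the suggested `None` guard (both functions are illustrative stand-ins, not the actual `generate` method):

```python
def generate_buggy(operatorRepresentation={}):
    # The SAME dict object is reused across every call.
    operatorRepresentation.setdefault("calls", 0)
    operatorRepresentation["calls"] += 1
    return operatorRepresentation

def generate_fixed(operatorRepresentation=None):
    # Fresh dict per call, unless the caller provides one.
    if operatorRepresentation is None:
        operatorRepresentation = {}
    operatorRepresentation.setdefault("calls", 0)
    operatorRepresentation["calls"] += 1
    return operatorRepresentation

print(generate_buggy()["calls"], generate_buggy()["calls"])  # 1 2  <- state leaks
print(generate_fixed()["calls"], generate_fixed()["calls"])  # 1 1
```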
In `@Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRRuntimeSequencePass.py`:
- Around line 85-89: The current sequence awaits outputTasks but never frees
them and frees inputTasks before ensuring completion; change the order so both
inputTasks and outputTasks are awaited before calling aiex_d.dma_free_task on
them. Specifically, use aiex_d.dma_await_task(...) for inputTasks as well (or
otherwise ensure their completion), then call aiex_d.dma_free_task(...) on each
input task, and after awaiting each output task call aiex_d.dma_free_task(...)
on each output task; update the block handling outputTasks/inputTasks (and
consider the origin of issue_token) so freeing only happens after dma_await_task
confirms completion.
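The await-before-free ordering, abstracted away from the aiex dialect into a generic sketch — the `Task` class and `drain` helper here are made-up stand-ins, not the AIE runtime API:

```python
class Task:
    """Made-up stand-in for a DMA task handle."""
    def __init__(self, name):
        self.name = name
        self.done = False
        self.freed = False
    def await_completion(self):
        self.done = True
    def free(self):
        if not self.done:
            raise RuntimeError(f"freeing {self.name} before completion")
        self.freed = True

def drain(input_tasks, output_tasks):
    # Await BOTH directions first, then free: releasing an in-flight
    # input task would free buffers the DMA may still be reading.
    for t in input_tasks + output_tasks:
        t.await_completion()
    for t in input_tasks + output_tasks:
        t.free()

ins, outs = [Task("in0")], [Task("out0")]
drain(ins, outs)
print(all(t.freed for t in ins + outs))  # True
```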
In `@DeeployTest/Platforms/XDNA2/main.cpp`:
- Around line 79-112: Wrap the main logic of main (the code that calls
read_instr_binary(), xrt::device, xrt::xclbin, device.register_xclbin,
xrt::hw_context and xrt::kernel) in a try-catch block that catches
std::exception (and optionally a catch-all) and prints a clear error message
including ex.what() to stderr (or via your logging mechanism) and returns a
non-zero exit code; ensure any thrown exceptions from read_instr_binary() or XRT
constructors are handled so the program exits gracefully with an informative
message.
- Around line 65-77: In read_instr_binary, validate the file size and read
result: use std::streampos/istream::pos_type for tellg, check for error (e.g.,
tellg()==-1) and throw if so, verify byte_size is a multiple of sizeof(uint32_t)
and throw a descriptive std::runtime_error if not aligned, and after file.read
confirm gcount/streams state to ensure the full byte_size was read and throw on
short reads or IO errors; update uses of byte_size to a safely cast size_t only
after validation.
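The same validation pattern, sketched in Python rather than C++ for brevity (the assumption, as in the review comment, is a flat binary of little-endian `uint32` instruction words; `read_instr_binary` here is an illustrative analogue, not the host code):

```python
import os
import struct
import tempfile

def read_instr_binary(path):
    """Read a binary of uint32 words, validating alignment and completeness."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % 4 != 0:
        raise RuntimeError(f"{path}: size {len(data)} is not a multiple of 4 bytes")
    # '<' = little-endian, 'I' = uint32; struct.unpack enforces the exact length.
    return list(struct.unpack(f"<{len(data) // 4}I", data))

# Demo: round-trip three words through a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack("<3I", 1, 2, 0xDEADBEEF))
    path = tmp.name
print(read_instr_binary(path))  # [1, 2, 3735928559]
os.unlink(path)
```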
In `@DeeployTest/testUtils/core/execution.py`:
- Around line 171-173: The current code appends "-v" whenever config.verbose >=
1 which breaks the Generic/host binary that doesn't accept flags; change the
condition to only add "-v" when the run is for the XDNA2 binary (e.g. check the
simulator/platform identifier used in this module — e.g. simulator == "XDNA2" or
platform == "XDNA2") so replace the existing guard with something like: only
append "-v" when config.verbose >= 1 AND the simulator/platform equals the XDNA2
identifier (leave the cmd and config.verbose symbols intact and reference
XDNA2/main.cpp as the binary that accepts -v).
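A minimal sketch of that guard — `config.verbose` and the `"XDNA2"` simulator name come from the comment above, while `build_cmd` is an illustrative helper, not the actual execution.py code:

```python
from types import SimpleNamespace

def build_cmd(binary, config, simulator):
    """Only the XDNA2 host binary understands -v; the Generic binary takes no flags."""
    cmd = [binary]
    if config.verbose >= 1 and simulator == "XDNA2":
        cmd.append("-v")
    return cmd

cfg = SimpleNamespace(verbose=2)
print(build_cmd("./deeploytest", cfg, "XDNA2"))    # ['./deeploytest', '-v']
print(build_cmd("./deeploytest", cfg, "Generic"))  # ['./deeploytest']
```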
In `@requirements-xdna.txt`:
- Around line 6-10: requirements-xdna.txt currently lists llvm-aie without a
version, causing nondeterministic installs from the nightly index; update the
dependency by replacing the unpinned "llvm-aie" entry with a specific pinned
version compatible with mlir_aie (for example
"llvm-aie==18.0.0.2024050100+59d494d3" or a released tag you verified), so the
file contains the exact version string for llvm-aie alongside mlir_aie==v1.2.1
to ensure reproducible, compatible builds.
In `@TargetLibraries/XDNA2/kernels/add.cc`:
- Around line 20-40: The vector loop in eltwise_vadd uses F = size / vec_factor
and drops trailing elements when size isn't a multiple of vec_factor; add a
scalar tail loop after the existing vector loop that processes indices from F *
vec_factor up to size, performing element-wise addition using the same types
(T_in/T_out) and either the pointer cursors (pA1/pB1/pC1) or direct indexing
(a[i], b[i], c[i]) so all remaining elements are handled.
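The vector-plus-tail structure described above can be sketched language-agnostically; here in Python, with `vec_factor` standing in for the AIE vector width (an illustrative model of the kernel's control flow, not the C++ intrinsics themselves):

```python
def eltwise_vadd(a, b, vec_factor=16):
    """Elementwise add with a 'vectorized' main loop and a scalar tail."""
    size = len(a)
    c = [0] * size
    F = size // vec_factor
    # Main loop: whole vectors of vec_factor elements each.
    for i in range(F):
        base = i * vec_factor
        for j in range(vec_factor):
            c[base + j] = a[base + j] + b[base + j]
    # Scalar tail: the remaining size % vec_factor elements.
    for i in range(F * vec_factor, size):
        c[i] = a[i] + b[i]
    return c

# 37 is not a multiple of 16, so 5 elements land in the tail loop.
print(eltwise_vadd(list(range(37)), list(range(37)))[-1])  # 72
```

Without the tail loop, the last `size % vec_factor` elements of `c` would keep their initial values, which is exactly the bug the comment flags.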
---
Nitpick comments:
In `@CMakeLists.txt`:
- Around line 23-24: The CMake cache property list for the platform option is
missing "XDNA2", causing GUI/ccmake users to not see it as a valid choice;
update the set_property(CACHE platform PROPERTY STRINGS ...) call to include
"XDNA2" (and keep it consistent with the initial set(platform ...) default if
you want it selectable there too) so the platform dropdown contains XDNA2
alongside MemPool, SoftHier, QEMU, Siracusa, Siracusa_w_neureka, PULP-Open,
GAP9, Generic, and Snitch.
In `@Container/Dockerfile.deeploy-xdna`:
- Around line 14-35: Update the RUN layer that invokes apt-get to add
--no-install-recommends to both apt-get install -y invocations in the multi-step
RUN command (the line starting with "RUN apt-get update && apt-get install -y \"
and the subsequent "&& apt-get update && apt-get install -y \" sequence) so that
unnecessary recommended packages are not installed and image size is reduced;
keep the existing rm -rf /var/lib/apt/lists/* cleanup as-is.
- Line 12: The ENV line setting LLVM_INSTALL_DIR="nope" is ambiguous; update the
Dockerfile to either set a meaningful path or add an inline comment explaining
why "nope" is used (e.g., as a deliberate sentinel to disable LLVM lookup or to
work around a specific build issue). Locate the ENV LLVM_INSTALL_DIR declaration
and either replace "nope" with the correct install path or append a short
comment describing the reason and any related workarounds or downstream
expectations so future maintainers aren’t confused.
In `@Deeploy/Targets/XDNA2/CodeTransformationPasses/__init__.py`:
- Around line 5-7: Replace the three wildcard imports in __init__.py with
explicit imports of the public symbols exported by MLIRComputeCorePass,
MLIRObjectFifoPass, and MLIRRuntimeSequencePass (e.g., import the specific
classes/functions you intend to re-export) and then define an explicit __all__
list containing those symbol names; update the import lines referencing
Deeploy.Targets.XDNA2.CodeTransformationPasses.MLIRComputeCorePass,
MLIRObjectFifoPass, and MLIRRuntimeSequencePass to import only the public
symbols and set __all__ = [...names...] so IDEs and static analyzers can
discover the package API.
In `@Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRComputeCorePass.py`:
- Around line 68-88: The code assumes a single representative tile type (tileTy
derived from firstKey) and reuses one subviewTy for all FIFO acquisitions, which
breaks when tensors have different types; update the loop so you look up the
per-key memref type via mlirBlock.fifoTypes[key] and create a matching subview
type for each acquisition (i.e., compute subviewTyPerKey from
mlirBlock.fifoTypes[key] before calling
aie_d.objectfifo_acquire/objectfifo_subview_access), handling inputs
(self.inputTensorKeys) and outputs (self.outputTensorKeys) independently rather
than reusing tileTy/subviewTy for every key.
In `@Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRObjectFifoPass.py`:
- Line 47: Replace the inefficient list(...) indexing when obtaining the first
key from outputConstraints: locate the assignment to firstOutputName in
MLIRObjectFifoPass.py and use next(iter(outputConstraints)) (or
next(iter(outputConstraints.keys()))) instead of
list(outputConstraints.keys())[0] so you avoid creating an intermediate list;
update the code where firstOutputName is set accordingly.
In `@Deeploy/Targets/XDNA2/Deployer.py`:
- Around line 159-171: The code derives tensorTy from the first block
(mlirBlocks[0]) using firstEb.numElements and a hardcoded ir.BF16Type, which
assumes every runtime-sequence arg shares the same element count and type;
update the runtime-sequence setup in Deployer.py to validate that all blocks in
mlirBlocks have matching numElements (and, if applicable, matching element
types) before creating tensorTy (or document this assumption clearly), and emit
a clear error or fallback if a mismatch is detected so
aiex_d.runtime_sequence(_seq) is not created with inconsistent tensor shapes.
In `@Deeploy/Targets/XDNA2/Platform.py`:
- Around line 126-138: Extract the duplicated engine-assignment logic from both
MemoryXDNA2Platform.getTargetMemoryLevel and
MemoryXDNA2PlatformWrapper.getTargetMemoryLevel into a shared helper (e.g.,
_get_xdna2_target_memory_level) that accepts the node and the default level
name; have each getTargetMemoryLevel call this helper instead of repeating the
hasattr/instanceof checks against XDNA2AIECoreEngine and preferredMemoryLevel,
returning engine.preferredMemoryLevel when present or
defaultTargetMemoryLevel.name otherwise.
In `@DeeployTest/generateNetwork_xdna2.py`:
- Around line 191-196: The golden-output computation currently hardcodes
elementwise addition using bf16_inputs -> bf16_input_f32 -> golden_f32 and will
fail for non-add ops; replace this with an operation-agnostic approach: either
run the ONNX graph (using the BF16-quantized inputs bf16_inputs) to produce
test_outputs_bf16, or inspect the model graph to extract the root operator(s)
and dispatch to the corresponding NumPy routine (e.g., use op_type from nodes
and call the appropriate handler) so that test_outputs_bf16 is produced
generically instead of summing elements.
- Around line 129-137: In the loop that currently reads "for index, (name,
values) in enumerate(zip(inputs_npz.files, test_inputs_f32))", mark the unused
filename variable as _ and make zip strict to avoid length-mismatch bugs: change
the header to "for index, (_, values) in enumerate(zip(inputs_npz.files,
test_inputs_f32, strict=True))" so you still populate inputTypes/inputOffsets
with PointerClass(bfloat16_t) as before; no other logic changes needed.
In `@DeeployTest/Platforms/XDNA2/CMakeLists.txt`:
- Around line 29-69: The XRT_INSTALL_DIR variable should be declared as a cache
variable like LLVM_AIE_INSTALL_DIR and MLIR_AIE_INSTALL_DIR for consistent
configuration and overrides; change the declaration so XRT_INSTALL_DIR is
initialized with set(XRT_INSTALL_DIR "$ENV{XRT_INSTALL_DIR}" CACHE PATH "XRT
install dir") (or similar) and preserve the existing fallback logic in the
if(NOT XRT_INSTALL_DIR) block that checks ENV{XILINX_XRT} and defaults to
"/opt/xilinx/xrt", ensuring the cache variable is used/checked instead of an
undeclared plain variable.
In `@DeeployTest/test_platforms.py`:
- Around line 128-150: Update the markers summary comment in the Platform
markers section of DeeployTest/test_platforms.py to include the new platform
marker `xdna2`; specifically add `xdna2: tests from the XDNA2 platform` (or a
similarly descriptive phrase) alongside the existing platform marker lines so
the comment accurately documents the new `xdna2` marker.
In `@DeeployTest/testUtils/platformMapping.py`:
- Around line 286-287: The except ImportError handler in the function that loads
deployers currently raises RuntimeError(f"Deployer for platform {platform} is
not implemented") without chaining the original ImportError; modify the except
block to capture the ImportError (e.g., except ImportError as e:) and re-raise
the RuntimeError using exception chaining (raise RuntimeError(f"Deployer for
platform {platform} is not implemented") from e) so the original traceback is
preserved for debugging.
In `@DeeployTest/testUtils/testRunner.py`:
- Around line 95-114: The CLI accepts invalid tiling inputs; validate them at
argument-parse time by replacing the raw int/str types with validation helpers:
for '--l1' and '--l3' use a positive-integer validator (e.g., positive_int) that
raises argparse.ArgumentTypeError for <=0 values, and for '--defaultMemLevel'
restrict values via choices (e.g., ['L1','L3']) or a validator that normalizes
and rejects unknown values; update the self.add_argument calls for l1, l3, and
defaultMemLevel in testRunner.py accordingly so parsing fails fast with clear
messages.
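A minimal sketch of such parse-time validation (the flag names mirror the ones above; the validator itself is illustrative):

```python
import argparse

def positive_int(value: str) -> int:
    """Reject zero and negative sizes at argument-parse time."""
    ivalue = int(value)
    if ivalue <= 0:
        raise argparse.ArgumentTypeError("Value must be a positive integer")
    return ivalue

parser = argparse.ArgumentParser()
parser.add_argument("--l1", type = positive_int, default = None)
parser.add_argument("--l3", type = positive_int, default = None)
parser.add_argument("--defaultMemLevel", choices = ["L1", "L2", "L3"], default = "L3")

args = parser.parse_args(["--l1", "65536"])
```

With this in place, `--l1 -4` fails immediately with a clear parser error instead of surfacing later inside a generation pass.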
In `@README_XDNA.md`:
- Line 21: Update the README_XDNA.md sentence that embeds a personal developer
path by either removing the specific `/scratch/jungvi/IRON` example or replacing
it with a generic, documented workaround: describe that mounting an IRON repo is
a temporary step only required until Deeploy midend/backend support MLIR, show a
generic volume example like `-v /path/to/IRON:/opt/IRON`, and explain the
corresponding env var `IRON_OPERATORS_DIR=/opt/IRON/iron/operators` and when
users should use it versus the standard flow; ensure the text clearly labels it
as a temporary, developer-only workaround.
- Around line 6-8: Add language specifiers (e.g., bash or shell) to the fenced
code blocks that contain shell commands so they get proper syntax highlighting;
for example, change the block around the docker build command (`docker build -f
Container/Dockerfile.deeploy-xdna -t deeploy-xdna:local .`) to start with
```bash and similarly update the other command/code blocks referenced (lines
with commands at the other snippets) to use ```bash or ```shell so all shell
snippets in README_XDNA.md are consistently labeled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e910d040-c1e1-4193-9705-285039a5ece9
📒 Files selected for processing (36)
- .github/workflows/_runner-xdna2.yml
- .github/workflows/ci-platform-xdna2.yml
- .gitignore
- CMakeLists.txt
- Container/Dockerfile.deeploy-xdna
- Deeploy/MLIRDataTypes.py
- Deeploy/Targets/XDNA2/Bindings.py
- Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRComputeCorePass.py
- Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRObjectFifoPass.py
- Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRRuntimeSequencePass.py
- Deeploy/Targets/XDNA2/CodeTransformationPasses/__init__.py
- Deeploy/Targets/XDNA2/Deployer.py
- Deeploy/Targets/XDNA2/Parsers.py
- Deeploy/Targets/XDNA2/Platform.py
- Deeploy/Targets/XDNA2/Templates/AddTemplate.py
- Deeploy/Targets/XDNA2/Templates/__init__.py
- Deeploy/Targets/XDNA2/Tiler.py
- Deeploy/Targets/XDNA2/TypeCheckers.py
- DeeployTest/Platforms/XDNA2/CMakeLists.txt
- DeeployTest/Platforms/XDNA2/main.cpp
- DeeployTest/Tests/Kernels/BF16/Add/Regular/inputs.npz
- DeeployTest/Tests/Kernels/BF16/Add/Regular/network.onnx
- DeeployTest/Tests/Kernels/BF16/Add/Regular/outputs.npz
- DeeployTest/conftest.py
- DeeployTest/deeployRunner_xdna2.py
- DeeployTest/generateNetwork_xdna2.py
- DeeployTest/testUtils/core/execution.py
- DeeployTest/testUtils/deeployRunner.py
- DeeployTest/testUtils/platformMapping.py
- DeeployTest/testUtils/testRunner.py
- DeeployTest/test_platforms.py
- DeeployTest/test_xdna2_config.py
- README_XDNA.md
- TargetLibraries/XDNA2/CMakeLists.txt
- TargetLibraries/XDNA2/kernels/add.cc
- requirements-xdna.txt
```python
def generate(self, operatorRepresentation = {}, **kwargs) -> str:
    """Generate an MLIR string for this node.

    This default implementation is a thin wrapper: it delegates to
    :meth:`emit`. Deployers that need to build a single module from
    multiple nodes should call :meth:`emit` directly with the shared
    module context and then stringify the complete module themselves.

    Returns
    -------
    str
        MLIR text (printable module or fragment).
    """
    self.emit(operatorRepresentation, **kwargs)
    return ""
```
Fix mutable default argument.
Using {} as a default value is flagged by static analysis (B006). Mutable defaults are shared across calls and can lead to subtle bugs.
🐛 Proposed fix

```diff
-def generate(self, operatorRepresentation = {}, **kwargs) -> str:
+def generate(self, operatorRepresentation = None, **kwargs) -> str:
     """Generate an MLIR string for this node.
     ...
     """
+    if operatorRepresentation is None:
+        operatorRepresentation = {}
     self.emit(operatorRepresentation, **kwargs)
     return ""
```
return ""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def generate(self, operatorRepresentation = {}, **kwargs) -> str: | |
| """Generate an MLIR string for this node. | |
| This default implementation is a thin wrapper: it delegates to | |
| :meth:`emit`. Deployers that need to build a single module from | |
| multiple nodes should call :meth:`emit` directly with the shared | |
| module context and then stringify the complete module themselves. | |
| Returns | |
| ------- | |
| str | |
| MLIR text (printable module or fragment). | |
| """ | |
| self.emit(operatorRepresentation, **kwargs) | |
| return "" | |
| def generate(self, operatorRepresentation = None, **kwargs) -> str: | |
| """Generate an MLIR string for this node. | |
| This default implementation is a thin wrapper: it delegates to | |
| :meth:`emit`. Deployers that need to build a single module from | |
| multiple nodes should call :meth:`emit` directly with the shared | |
| module context and then stringify the complete module themselves. | |
| Returns | |
| ------- | |
| str | |
| MLIR text (printable module or fragment). | |
| """ | |
| if operatorRepresentation is None: | |
| operatorRepresentation = {} | |
| self.emit(operatorRepresentation, **kwargs) | |
| return "" |
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 187-187: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/MLIRDataTypes.py` around lines 187 - 201, The generate method uses a
mutable default operatorRepresentation = {} which can lead to shared-state bugs;
change the signature of generate (the method named generate in MLIRDataTypes.py)
to accept operatorRepresentation = None (or Optional[Dict]) and inside the
method do a guarded assignment like "if operatorRepresentation is None:
operatorRepresentation = {}" before using it, ensuring you preserve the existing
behavior and return value while avoiding the mutable default argument.
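The pitfall is easy to reproduce in plain Python (the function names here are illustrative, not the actual Deeploy API):

```python
def bad_generate(operator_representation={}):
    # The same dict object is reused on every call without an argument.
    operator_representation["emitted"] = True
    return operator_representation

def good_generate(operator_representation=None):
    # A fresh dict is created per call, as the fix suggests.
    if operator_representation is None:
        operator_representation = {}
    operator_representation["emitted"] = True
    return operator_representation
```

Two no-argument calls to the bad variant return the very same dict object, so state leaks across calls; the guarded variant does not.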
```python
# Await output tasks, then free input tasks
for task in outputTasks:
    aiex_d.dma_await_task(task)
for task in inputTasks:
    aiex_d.dma_free_task(task)
```
🧩 Analysis chain
🌐 Web query:
mlir-aie dma_await_task dma_free_task semantics
💡 Result:
In mlir-aie's AIEX dialect:

- `aiex.dma_await_task(%task)`: Blocks the enclosing `aie.runtime_sequence` until the referenced, previously-started DMA task has completed. Waiting is only possible if the task was configured to emit a task completion token (TCT), i.e., the task's configuration/start must have `issue_token = true`. [1]
- `aiex.dma_free_task(%task)`: Does not wait or synchronize. It is a compiler-visible "lifetime end" marker that tells the static buffer-descriptor (BD) allocator pass that any BD IDs allocated for BDs inside `%task` may be reused after this point. (A future dynamic allocator could lower it to an actual runtime `free`-like instruction.) Practically, you typically place it after an `aiex.dma_await_task` (or otherwise after you can prove the DMA is done) if you intend to reuse BD IDs. [1]

Source(s):
[1] Xilinx/AMD MLIR-AIE docs – AIEX dialect ops (`aiex.dma_await_task`, `aiex.dma_free_task`). (xilinx.github.io)
🏁 Script executed:

```shell
cat -n Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRRuntimeSequencePass.py | sed -n '60,95p'
```
Output tasks must also be freed after awaiting, and input tasks should be awaited before freeing.
The code awaits output tasks but doesn't free them—a resource leak. Additionally, input tasks are freed without being awaited first, which violates MLIR-AIE semantics: dma_free_task should only be called after task completion is guaranteed. Since only output tasks have issue_token = true (line 71), they emit completion tokens and can be awaited. Both output and input tasks should be freed after their completion is verified.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRRuntimeSequencePass.py`
around lines 85 - 89, The current sequence awaits outputTasks but never frees
them and frees inputTasks before ensuring completion; change the order so both
inputTasks and outputTasks are awaited before calling aiex_d.dma_free_task on
them. Specifically, use aiex_d.dma_await_task(...) for inputTasks as well (or
otherwise ensure their completion), then call aiex_d.dma_free_task(...) on each
input task, and after awaiting each output task call aiex_d.dma_free_task(...)
on each output task; update the block handling outputTasks/inputTasks (and
consider the origin of issue_token) so freeing only happens after dma_await_task
confirms completion.
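The required ordering can be illustrated with a toy Python model (this is not the MLIR-AIE API, only a sketch of the await-before-free invariant under the assumption that every task carries a completion token):

```python
class DmaTask:
    """Toy stand-in for an AIEX DMA task handle."""

    def __init__(self, name: str, issue_token: bool):
        self.name = name
        self.issue_token = issue_token
        self.completed = False
        self.freed = False

    def await_task(self):
        # Awaiting is only legal when the task emits a completion token.
        assert self.issue_token, f"{self.name}: cannot await without issue_token"
        self.completed = True

    def free_task(self):
        # Freeing marks BD IDs reusable, so completion must be guaranteed first.
        assert self.completed, f"{self.name}: freed before completion"
        self.freed = True

# Corrected emission order: await every task, then free every task.
tasks = [DmaTask("in0", issue_token=True), DmaTask("out0", issue_token=True)]
for t in tasks:
    t.await_task()
for t in tasks:
    t.free_task()
```

Freeing a task that was never awaited trips the assertion, mirroring the BD-ID reuse hazard described above.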
```cpp
static std::vector<uint32_t> read_instr_binary(const std::string &path) {
  std::ifstream file(path, std::ios::binary);
  if (!file.is_open()) {
    throw std::runtime_error("Cannot open instruction file: " + path);
  }
  file.seekg(0, std::ios::end);
  size_t byte_size = file.tellg();
  file.seekg(0, std::ios::beg);

  std::vector<uint32_t> instr(byte_size / sizeof(uint32_t));
  file.read(reinterpret_cast<char *>(instr.data()), byte_size);
  return instr;
}
```
Add validation for instruction file size alignment.
The code assumes byte_size is a multiple of sizeof(uint32_t). If the file has trailing bytes, they'll be silently dropped. Also, file.tellg() returns std::streampos which may not directly convert to size_t on all platforms.
🛡️ Proposed fix to add validation

```diff
 static std::vector<uint32_t> read_instr_binary(const std::string &path) {
   std::ifstream file(path, std::ios::binary);
   if (!file.is_open()) {
     throw std::runtime_error("Cannot open instruction file: " + path);
   }
   file.seekg(0, std::ios::end);
-  size_t byte_size = file.tellg();
+  std::streampos pos = file.tellg();
+  if (pos < 0) {
+    throw std::runtime_error("Cannot determine file size: " + path);
+  }
+  size_t byte_size = static_cast<size_t>(pos);
+  if (byte_size % sizeof(uint32_t) != 0) {
+    throw std::runtime_error("Instruction file size not aligned to 4 bytes: " + path);
+  }
   file.seekg(0, std::ios::beg);
```
Verify each finding against the current code and only fix it if needed.
In `@DeeployTest/Platforms/XDNA2/main.cpp` around lines 65 - 77, In
read_instr_binary, validate the file size and read result: use
std::streampos/istream::pos_type for tellg, check for error (e.g., tellg()==-1)
and throw if so, verify byte_size is a multiple of sizeof(uint32_t) and throw a
descriptive std::runtime_error if not aligned, and after file.read confirm
gcount/streams state to ensure the full byte_size was read and throw on short
reads or IO errors; update uses of byte_size to a safely cast size_t only after
validation.
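The validated read can be modeled in Python to make the alignment and short-read checks concrete (a sketch of the checks, not the host code itself):

```python
import os
import struct

def read_instr_binary(path: str) -> list:
    """Read a file as little-endian uint32 words, rejecting misaligned sizes."""
    byte_size = os.path.getsize(path)
    if byte_size % 4 != 0:
        raise ValueError(f"Instruction file size not aligned to 4 bytes: {path}")
    with open(path, "rb") as f:
        data = f.read()
    if len(data) != byte_size:
        raise IOError(f"Short read on instruction file: {path}")
    return list(struct.unpack(f"<{byte_size // 4}I", data))
```

A file with a trailing byte is rejected up front instead of having the byte silently dropped by integer division.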
```cpp
int main(int argc, char **argv) {
  // Paths to the compiled artefacts: default to the directory containing
  // this binary so the test works regardless of the working directory or
  // whether it is run inside a container.
  std::string bin_dir;
  {
    std::string argv0(argv[0]);
    auto sep = argv0.rfind('/');
    bin_dir = (sep == std::string::npos) ? "." : argv0.substr(0, sep);
  }
  std::string xclbin_path = bin_dir + "/network.xclbin";
  std::string instr_path = bin_dir + "/npu_insts.bin";

  bool verbose = false;
  for (int i = 1; i < argc; ++i) {
    std::string arg = argv[i];
    if (arg == "-v" || arg == "--verbose" || arg == "-vv") {
      verbose = true;
    }
  }
  if (argc >= 2 && argv[1][0] != '-')
    xclbin_path = argv[1];
  if (argc >= 3 && argv[2][0] != '-')
    instr_path = argv[2];

  // -----------------------------------------------------------------------
  // 1. Open XRT device, register xclbin, create hw_context
  //    (matches mlir-aie test_utils::init_xrt_load_kernel pattern)
  // -----------------------------------------------------------------------
  auto device = xrt::device(0);
  auto xclbin = xrt::xclbin(xclbin_path);
  device.register_xclbin(xclbin);
  xrt::hw_context context(device, xclbin.get_uuid());
  auto kernel = xrt::kernel(context, "MLIR_AIE");
```
Add exception handling for clearer error messages.
read_instr_binary() and XRT constructors can throw. Without a try-catch, exceptions will terminate with an unhelpful message. This was also flagged by static analysis.
🛡️ Proposed fix to wrap main logic in try-catch

```diff
 int main(int argc, char **argv) {
+  try {
   // Paths to the compiled artefacts...
   // ... (existing code) ...
+  } catch (const std::exception &e) {
+    std::cerr << "Error: " << e.what() << "\n";
+    return 1;
+  }
 }
```

Wrap the main body (lines 83-209) in a try-catch block.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@DeeployTest/Platforms/XDNA2/main.cpp` around lines 79 - 112, Wrap the main
logic of main (the code that calls read_instr_binary(), xrt::device,
xrt::xclbin, device.register_xclbin, xrt::hw_context and xrt::kernel) in a
try-catch block that catches std::exception (and optionally a catch-all) and
prints a clear error message including ex.what() to stderr (or via your logging
mechanism) and returns a non-zero exit code; ensure any thrown exceptions from
read_instr_binary() or XRT constructors are handled so the program exits
gracefully with an informative message.
```python
# Propagate verbosity to the host binary (e.g. XDNA2 main.cpp uses -v)
if config.verbose >= 1:
    cmd.append("-v")
```
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "Host-simulator platform entries:"
rg -n -C2 '"simulator"\s*:\s*"host"' DeeployTest/test_platforms.py || true
echo
echo "Potential host binaries and verbosity-flag handling:"
fd -i 'main.cpp' DeeployTest/Platforms | while read -r f; do
  echo "---- $f ----"
  rg -n -C2 '(-v|--verbose|getopt|argc|argv)' "$f" || true
done
```

🏁 Script executed:

```shell
# Check if there's a generic platform binary and what simulator it uses
rg -A 10 '"generic"' DeeployTest/test_platforms.py | head -20

# Search for any main.cpp or binary in the Generic directory
find DeeployTest/Platforms/Generic -type f -name 'main.*' -o -name 'sim*' 2>/dev/null | head -20

# Check how execution.py calls the host binary
grep -n "simulator" DeeployTest/testUtils/core/execution.py | head -20

# Check whether Generic/main.c accepts a -v flag
cat -n DeeployTest/Platforms/Generic/main.c | head -150

# Check all platforms using the "host" simulator
rg -B 3 '"simulator"\s*:\s*"host"' DeeployTest/test_platforms.py
```
Gate -v flag to XDNA2 platform only.
The Generic platform's binary has int main() with no parameters and cannot accept command-line flags. Passing -v unconditionally for all simulator == "host" runs will break Generic platform tests when config.verbose >= 1. Only XDNA2's binary (XDNA2/main.cpp) explicitly handles -v flags.
Suggested fix

```diff
-    if config.verbose >= 1:
+    if config.platform == "XDNA2" and config.verbose >= 1:
         cmd.append("-v")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@DeeployTest/testUtils/core/execution.py` around lines 171 - 173, The current
code appends "-v" whenever config.verbose >= 1 which breaks the Generic/host
binary that doesn't accept flags; change the condition to only add "-v" when the
run is for the XDNA2 binary (e.g. check the simulator/platform identifier used
in this module — e.g. simulator == "XDNA2" or platform == "XDNA2") so replace
the existing guard with something like: only append "-v" when config.verbose >=
1 AND the simulator/platform equals the XDNA2 identifier (leave the cmd and
config.verbose symbols intact and reference XDNA2/main.cpp as the binary that
accepts -v).
```text
--extra-index-url https://github.com/Xilinx/llvm-aie/releases/expanded_assets/nightly
--extra-index-url https://pypi.org/simple

mlir_aie==v1.2.1
llvm-aie
```
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
for url in \
  "https://github.com/Xilinx/mlir-aie/releases/expanded_assets/v1.2.1" \
  "https://github.com/Xilinx/llvm-aie/releases/expanded_assets/nightly"
do
  echo "=== $url ==="
  curl -fsSL "$url" \
    | rg -o '([A-Za-z0-9._-]+)-(v?[0-9][A-Za-z0-9._-]*)-cp[0-9]+[^"]*\.whl' \
    | sort -u || true
  echo
done
```

🏁 Script executed:

```shell
# Verify the nightly URL is accessible and check its structure
curl -fsSL -I "https://github.com/Xilinx/llvm-aie/releases/expanded_assets/nightly" | head -20

# Get the raw HTML to see what's actually available (no regex filtering)
curl -fsSL "https://github.com/Xilinx/llvm-aie/releases/expanded_assets/nightly" 2>&1 | head -100

# Also check the requirements file to understand the full context
cat requirements-xdna.txt
```
Pin llvm-aie to a specific version; nightly sourcing with no version constraint creates reproducibility risk.
The dependency llvm-aie is currently unpinned while pulled from the nightly feed. The nightly repository contains multiple builds of the same base version with different git hashes, meaning each dependency resolution can pull a different build. Combined with the pinned mlir_aie==v1.2.1, this creates a mismatch: the MLIR dialect version is locked, but the LLVM AIE backend is floating, risking incompatibility and non-reproducible builds.
Identify a compatible llvm-aie version (e.g., llvm-aie==18.0.0.2024050100+59d494d3 or equivalent from a released tag if available), and update the requirement to explicitly pin it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@requirements-xdna.txt` around lines 6 - 10, requirements-xdna.txt currently
lists llvm-aie without a version, causing nondeterministic installs from the
nightly index; update the dependency by replacing the unpinned "llvm-aie" entry
with a specific pinned version compatible with mlir_aie (for example
"llvm-aie==18.0.0.2024050100+59d494d3" or a released tag you verified), so the
file contains the exact version string for llvm-aie alongside mlir_aie==v1.2.1
to ensure reproducible, compatible builds.
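A lightweight guard against this class of problem is to check that every requirement line carries an exact pin (a sketch of such a CI check, not a full PEP 508 parser):

```python
def unpinned(requirements_text: str) -> list:
    """Return requirement lines without an exact '==' pin.

    Comment lines and pip options such as --extra-index-url are skipped.
    """
    bad = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "--")):
            continue
        if "==" not in line:
            bad.append(line)
    return bad

sample = """\
--extra-index-url https://pypi.org/simple
mlir_aie==v1.2.1
llvm-aie
"""
```

Running this over the current requirements-xdna.txt would flag `llvm-aie` as the only floating dependency.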
```cpp
template <typename T_in, typename T_out>
void eltwise_vadd(T_in *a, T_in *b, T_out *c, int size) {
  constexpr int vec_factor = 16;
  event0();
  T_in *__restrict pA1 = a;
  T_in *__restrict pB1 = b;
  T_out *__restrict pC1 = c;
  const int F = size / vec_factor;
  AIE_PREPARE_FOR_PIPELINING
  AIE_LOOP_MIN_ITERATION_COUNT(16)
  for (int i = 0; i < F; i++) {
    aie::vector<T_in, vec_factor> A0 = aie::load_v<vec_factor>(pA1);
    pA1 += vec_factor;
    aie::vector<T_in, vec_factor> B0 = aie::load_v<vec_factor>(pB1);
    pB1 += vec_factor;
    aie::vector<T_out, vec_factor> cout = aie::add(A0, B0);
    aie::store_v(pC1, cout);
    pC1 += vec_factor;
  }
  event1();
}
```
🧩 Analysis chain

🏁 Script executed:

```shell
# Search for tile size configuration in XDNA2 code
rg -n -A5 'tileSize|tile_size' --type=py Deeploy/Targets/XDNA2/

# Check MAX_TILE_SIZE value
rg -n "MAX_TILE_SIZE" Deeploy/Targets/XDNA2/

# Look for vec_factor or vectorization constraints in the codebase
rg -n -A3 "vec_factor\|vector.*16\|alignment.*16" --type=py --type=cc Deeploy/Targets/XDNA2/

# Check the full _deriveTileSize function
sed -n '30,65p' Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRObjectFifoPass.py

# Search for constraint definitions related to tile sizing
rg -n -B2 -A2 "l1Constraint\|tileSize.*constraint" Deeploy/Targets/XDNA2/

# Check if there's any handling for remainders or tail loops in kernels
rg -n "remainder\|tail.*loop\|% vec_factor\|% 16" TargetLibraries/XDNA2/kernels/

# Search for _legalizeTransfers and normalization functions
rg -n "_legalizeTransfers|legalize.*[Tt]ransfer|normalize.*[Tt]ransfer" Deeploy/

# Check the full add.cc kernel file
cat -n TargetLibraries/XDNA2/kernels/add.cc

# Search for any constraints that mention BF16 and alignment
rg -n -B3 -A3 "BF16.*align|align.*BF16|bf16.*multiple|multiple.*bf16" Deeploy/ --type=py

# Look for DMA transfer handling code
rg -n -B2 -A4 "transfer.*size|tileSize.*transfer" Deeploy/Targets/XDNA2/ --type=py

# Find and read the _legalizeTransfers function
rg -n -A 50 "def _legalizeTransfers" Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py

# Also check if there's any explicit vec_factor or 16-element alignment logic
rg -n "16\|vec_factor\|alignment" Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py | head -30

# Check how tileSize is used after _legalizeTransfers
rg -n -B5 -A5 "modifiedOpRepr.*size.*tileSize" Deeploy/Targets/XDNA2/CodeTransformationPasses/MLIRComputeCorePass.py
```
Add a scalar tail loop to handle remainders when size is not a multiple of vec_factor (16).
The vectorized kernel processes size / vec_factor iterations using integer division, which silently drops trailing elements if size is not a multiple of 16. The tiling infrastructure does not guarantee tile sizes are multiples of 16 (it only ensures tileSize divides numElements). Add a scalar loop after the vectorized loop to process any remainder:
Suggested fix

```cpp
  for (int i = 0; i < F; i++) {
    aie::vector<T_in, vec_factor> A0 = aie::load_v<vec_factor>(pA1);
    pA1 += vec_factor;
    aie::vector<T_in, vec_factor> B0 = aie::load_v<vec_factor>(pB1);
    pB1 += vec_factor;
    aie::vector<T_out, vec_factor> cout = aie::add(A0, B0);
    aie::store_v(pC1, cout);
    pC1 += vec_factor;
  }
  // Scalar tail loop for remainders
  for (int i = F * vec_factor; i < size; i++) {
    c[i] = a[i] + b[i];
  }
```
Verify each finding against the current code and only fix it if needed.
In `@TargetLibraries/XDNA2/kernels/add.cc` around lines 20 - 40, The vector loop
in eltwise_vadd uses F = size / vec_factor and drops trailing elements when size
isn't a multiple of vec_factor; add a scalar tail loop after the existing vector
loop that processes indices from F * vec_factor up to size, performing
element-wise addition using the same types (T_in/T_out) and either the pointer
cursors (pA1/pB1/pC1) or direct indexing (a[i], b[i], c[i]) so all remaining
elements are handled.
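The vector-plus-tail structure can be checked against a reference in plain Python (a model of the loop bounds only, not the AIE intrinsics):

```python
def eltwise_vadd(a, b, vec_factor=16):
    """Model of the kernel's loop structure: full vectors, then a scalar tail."""
    size = len(a)
    c = [0] * size
    full = size // vec_factor
    for i in range(full):
        base = i * vec_factor
        for j in range(vec_factor):  # stands in for one vector add
            c[base + j] = a[base + j] + b[base + j]
    for i in range(full * vec_factor, size):  # scalar tail for the remainder
        c[i] = a[i] + b[i]
    return c
```

With a size such as 37 (not a multiple of 16), the tail loop is what makes the result match an element-wise reference; dropping it reproduces the truncation bug described above.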
This PR adds beta support for the XDNA2 platform. It is also the first attempt at adding an MLIR backend to Deeploy.

Currently, it passes a single Add-node ONNX graph. On this platform, no "untiled" execution is possible, since the AIE cores cannot read outside their local L1 memory; the XDNA2 platform is therefore inherently tiled.

The CI uses self-hosted runners to run directly on the XDNA2 NPU of a Minisforum X1 Pro.

Added

- `README_XDNA.md`.

Changed

Fixed

PR Merge Checklist

- The PR is rebased on the latest `devel` commit and pointing to `devel`.
- The `CHANGELOG.md` file has been updated.