
[NE16] Add GAP9_w_NE16 platform: NE16 accelerator Engine on GAP9 #183

Open
runwangdl wants to merge 6 commits into pulp-platform:devel from runwangdl:gap9-ne16

Conversation

@runwangdl
Contributor

@runwangdl runwangdl commented Apr 13, 2026

Adds the NE16 neural engine as an accelerator engine on top of the existing GAP9 platform, registered as a new composite platform, GAP9_w_NE16, that mirrors the Siracusa_w_neureka pattern.

Added

  • Deeploy/Targets/NE16/ — full target implementation: Platform.py (NE16Platform extends GAP9Platform, engines=[NE16Engine, GAP9ClusterEngine]), Engine.py, Bindings.py, Parsers.py, Tiler.py, Deployer.py (extends GAP9Deployer to reuse the ClDma transformers), Templates/{Allocate,Conv}Template.py, TileConstraints/{NE16Pointwise,NE16Depthwise,NE16Dense}Constraint.py, and TopologyOptimizationPasses/Passes.py (incl. _weightEncode ported from pulp-nnx/test/Ne16Weight.py).
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py and DeeployTest/test_gap9_ne16_tiled_config.py — new runner + minimal test config covering PW 1x1, DW 3x3, Dense 3x3 RQ Conv.
  • TargetLibraries/GAP9/CMakeLists.txt — for GAP9_w_NE16 platform, add_subdirectory on pulp-nnx with USE_NE16=ON and link it into deeploygap9.

Changed

  • DeeployTest/testUtils/platformMapping.py — register GAP9_w_NE16 in the platforms list, mapPlatform, setupMemoryPlatform, and mapDeployer.
  • DeeployTest/testMVP.py — include GAP9_w_NE16 in the EngineColoringDeployerWrapper branch (without it, NE16AdjustWeightMemoryLayoutPass never fires and parsing backtracks).
  • DeeployTest/testUtils/core/execution.py — build the GAP9 SDK image target for GAP9_w_NE16 too (so chip.soc.mram.bin is produced before gvsoc run).
  • CMakeLists.txt, DeeployTest/CMakeLists.txt — accept GAP9_w_NE16 alongside GAP9 in the platform branches.
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py — swap Neureka-inherited subtile constants for NE16 spec per ne16_task_defs.h: CIN_SUBTILE 32/28 → 16 (single mode), output 63, per-(cout,cinMajor) weight bytes PW 16, DW/Dense 144.

Fixed

  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py — work around a pre-existing ImportError: cannot import name 'float32_tPtr' from 'Deeploy.AbstractDataTypes' by defining it locally via PointerClass(float32_t).
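
The workaround above amounts to constructing the pointer type locally instead of importing it. As a standalone sketch of that pattern — the stand-ins below for `PointerClass` and `float32_t` are hypothetical minimal mocks, not Deeploy's actual `AbstractDataTypes` machinery:

```python
# Hypothetical minimal stand-ins, for illustration only.
class float32_t:
    typeName = "float32_t"

def PointerClass(baseType):
    # Build a pointer type for the given base type on the fly.
    return type(baseType.typeName + "Ptr", (), {
        "typeName": baseType.typeName + " *",
        "referencedType": baseType,
    })

# Instead of `from Deeploy.AbstractDataTypes import float32_tPtr`
# (which raises ImportError), construct the pointer type locally:
float32_tPtr = PointerClass(float32_t)
print(float32_tPtr.typeName)  # float32_t *
```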

Test plan

Ran on gvsoc gap9.evk (inside ghcr.io/pulp-platform/deeploy-gap9:devel), full pipeline gen -> parse -> lower -> codegen -> CMake -> build -> gapy image -> gvsoc flash run:

| Test | Errors | Runtime (cycles) |
| --- | --- | --- |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 0 / 1152 | 901 917 |
| Kernels/Integer/Conv/DW_2D_RQ (--enable-3x3) | 0 / 1280 | 27 339 |
| Kernels/Integer/Conv/Regular_2D_RQ (--enable-3x3) | 0 / 6372 | 244 595 |

Follow-up (out of scope for this PR): PW Unsigned_RQ variant currently produces non-zero errors — likely unsigned weight-offset or out_type conf0 handling; tracked separately.

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and points to devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

Warning

Rate limit exceeded

@runwangdl has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 57 minutes and 48 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 57 minutes and 48 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8acf93fa-a1ac-4403-aafc-838bbb97880b

📥 Commits

Reviewing files that changed from the base of the PR and between 748707a and 2373ff9.

📒 Files selected for processing (55)
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/_runner-snitch-tiled-sequential.yml
  • .github/workflows/ci-platform-gap9-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9.yml
  • CMakeLists.txt
  • Deeploy/Targets/GAP9/Bindings.py
  • Deeploy/Targets/GAP9/DMA/L3Dma.py
  • Deeploy/Targets/GAP9/Parsers.py
  • Deeploy/Targets/GAP9/Platform.py
  • Deeploy/Targets/GAP9/Templates/GAP9SDKDequantQuantTemplate.py
  • Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py
  • Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py
  • Deeploy/Targets/GAP9/Tiler.py
  • Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/GAP9/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/Generic/Bindings.py
  • Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/Generic/TypeCheckers.py
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Bindings.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/Platforms/GAP9/CMakeLists.txt
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/inputs.npz
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/network.onnx
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/outputs.npz
  • DeeployTest/conftest.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • DeeployTest/test_platforms.py
  • TargetLibraries/GAP9/CMakeLists.txt
  • TargetLibraries/GAP9/inc/ne16_utils.h
  • TargetLibraries/GAP9/src/ne16_utils.c
📝 Walkthrough

Walkthrough

Adds comprehensive NE16 accelerator support and a new platform variant GAP9_w_NE16: new NE16 engine/deployer/platform types, parsers, templates, bindings, tiler constraints, optimization passes, packaging, build/CMake integration, tests, and CI/workflow entries for tiled NE16-enabled execution.

Changes

Cohort / File(s) Summary
Top-level CMake & GAP9 build
CMakeLists.txt, DeeployTest/CMakeLists.txt, TargetLibraries/GAP9/CMakeLists.txt
Introduce platform=GAP9_w_NE16, extend GAP9 build branches to include it, enable USE_NE16 and add pulp-nnx linking and compile options when building the NE16 variant.
Platform/Engine/Deployer wiring
Deeploy/Targets/NE16/Platform.py, Deeploy/Targets/NE16/Engine.py, Deeploy/Targets/NE16/Deployer.py
Add NE16Engine, NE16Platform/memory wrappers, NE16Optimizer, and NE16Deployer with engine flags (enable3x3, enableStrides) and lowering-pass adjustments.
Node bindings & Tiler-ready bindings
Deeploy/Targets/NE16/Bindings.py, Deeploy/Targets/NE16/Tiler.py
Add NodeBinding lists for requantized/non-requantized PW/DW/Dense conv variants and expose TilingReadyNodeBindings pairing bindings with tile constraints.
Parsers
Deeploy/Targets/NE16/Parsers.py
Add NE16-specific Conv2D parsers (base + PW/DW/Dense and their RQ variants) enforcing layout, dilations, channels_first, and weight_offset requirements and remapping operatorRepresentation for RQ flows.
Templates & codegen
Deeploy/Targets/NE16/Templates/ConvTemplate.py, Deeploy/Targets/NE16/Templates/AllocateTemplate.py, Deeploy/Targets/NE16/Templates/__init__.py
Add NE16 convolution NodeTemplates (init + exec), helpers for counters/strides/conf0, and a memory-level-aware global allocation template for weights.
Tile constraints & helpers
Deeploy/Targets/NE16/TileConstraints/..., Deeploy/Targets/NE16/TileConstraints/__init__.py
Add pointwise, depthwise, dense tiling constraints (and requant variants), serialization of tiling solutions into replacement schemes/schedules, and requant helper utilities.
Topology optimization passes
Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py, Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py, Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py, Deeploy/Targets/NE16/OptimizationPasses/__init__.py
Add NE16-specific topology passes: weight layout encoding, pointwise reshape pass, weight memory-level annotation pass, discoloration and composed NE16 optimization sequence.
Templates package & package inits
Deeploy/Targets/NE16/__init__.py, .../Templates/__init__.py, .../TileConstraints/__init__.py, .../TopologyOptimizationPasses/__init__.py, .../OptimizationPasses/__init__.py
Add package init files that wildcard re-export submodule symbols.
Tests, test tooling & CI
DeeployTest/deeployRunner_tiled_gap9_w_ne16.py, DeeployTest/test_gap9_ne16_tiled_config.py, DeeployTest/testMVP.py, DeeployTest/testUtils/platformMapping.py, DeeployTest/testUtils/core/execution.py, DeeployTest/CMakeLists.txt, DeeployTest/test_platforms.py, DeeployTest/conftest.py, .github/workflows/_runner-gap9-w-ne16-tiled.yml, .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
Add NE16 test configs, runner script, pytest marker, platform mapping entries, test cases, CMake/test adjustments, and CI workflows/runners to exercise GAP9+NE16 tiled kernels.
Misc small change
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
Minor refactor: construct float32_tPtr locally via PointerClass(float32_t) instead of importing float32_tPtr.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Graph as gs.Graph
    participant Parser as NE16 Parsers
    participant Tiler as NE16 Tiler/TileConstraints
    participant Optimizer as NE16 Optimization Passes
    participant Templates as Conv/Allocate Templates
    participant Engine as NE16Engine/Deployer
    participant Build as CMake/pulp-nnx
    Graph->>Parser: parse nodes -> operatorRepresentation
    Parser->>Tiler: provide parse metadata & constraints
    Tiler->>Optimizer: produce tiling schedules & annotations
    Optimizer->>Templates: inject replacements, compute conf/strides
    Templates->>Engine: emit init/exec task code + bindings
    Engine->>Build: integrate code, link pulp-nnx, and prepare image
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Feature

Suggested reviewers

  • Xeratec
  • Victor-Jung
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Linked Issues check — ⚠️ Warning: The PR does not address the linked issue #39 about I/O tensors being dynamically allocated; it focuses entirely on NE16 platform integration, which is unrelated to the issue's objectives. Resolution: link this PR to a different issue tracking NE16 platform integration, or remove the current link to #39.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 3.45%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Title check — ✅ Passed: The title clearly summarizes the main change: adding a GAP9_w_NE16 composite platform integrating NE16 as an accelerator engine on GAP9.
  • Description check — ✅ Passed: The description is well-detailed and directly related to the changeset, explaining the added files, modifications, and test results.
  • Out of Scope Changes check — ✅ Passed: All changes are directly related to NE16 platform integration and testing. The only tangential change is the float32_tPtr fix in FloatGemmTemplate.py, which is a pre-existing bug fix necessary for the NE16 target to work.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (4)
Deeploy/Targets/NE16/TileConstraints/__init__.py (1)

5-5: Replace self-referential wildcard import with explicit exports.

from . import * on Line 5 makes exported symbols unclear. Replace with explicit re-exports plus __all__ to clarify the public API.

Proposed fix

```diff
-from . import *
+from .NE16DenseConstraint import NE16DenseConv2DTileConstraint, NE16RQSDenseConv2DTileConstraint
+from .NE16DepthwiseConstraint import NE16DWConv2DTileConstraint, NE16RQSDWConv2DTileConstraint
+from .NE16PointwiseConstraint import NE16PWConv2DTileConstraint, NE16RQSPWConv2DTileConstraint
+
+__all__ = [
+    "NE16DenseConv2DTileConstraint",
+    "NE16RQSDenseConv2DTileConstraint",
+    "NE16DWConv2DTileConstraint",
+    "NE16RQSDWConv2DTileConstraint",
+    "NE16PWConv2DTileConstraint",
+    "NE16RQSPWConv2DTileConstraint",
+]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/__init__.py` at line 5, Replace the
self-referential wildcard import in __init__.py with explicit re-exports: remove
"from . import *" and import the specific public symbols (e.g., classes,
functions, constants) from the package submodules using explicit imports (for
example: from .submodule import ClassX, func_y), then declare __all__ =
["ClassX", "func_y", ...] listing those exported names; ensure you reference the
actual symbol names defined in the package (replace placeholders with the real
identifiers) and avoid introducing circular imports when reorganizing the
imports.
Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py (1)

161-162: Prefix unused unpacked variables with underscore.

The BatchOffset, HOffset, WOffset, and BatchSize variables are unpacked for structural clarity but never used. Prefix them with _ to indicate intentional disuse.

🔧 Proposed fix

```diff
         for cube in outputCubes:
-            (BatchOffset, HOffset, WOffset, COffset) = cube.offset
-            (BatchSize, HSize, WSize, CSize) = cube.dims
+            (_BatchOffset, _HOffset, _WOffset, COffset) = cube.offset
+            (_BatchSize, HSize, WSize, CSize) = cube.dims
```

Similarly at line 173:

```diff
-            inBSize, inHSize, inWSize, inCSize = InCube.dims
+            _inBSize, _inHSize, inWSize, inCSize = InCube.dims
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 161
- 162, The tuple unpackings in NE16DenseConstraint.py currently bind unused
names BatchOffset, HOffset, WOffset, COffset and BatchSize (and similarly at the
later unpacking around line 173); update those unpackings to prefix unused
variables with an underscore (e.g., _BatchOffset, _HOffset, _WOffset,
_BatchSize) so the intent of intentional non-use is clear; locate the unpacking
expressions that assign to cube.offset and cube.dims and rename only the unused
identifiers accordingly while keeping used names (COffset, HSize, WSize, CSize)
unchanged.
Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py (1)

125-136: Use integer division to avoid floating-point precision issues.

Line 131 uses float division (/), which can cause precision loss for large integers. Use integer division (//) instead.

🔧 Proposed fix

```diff
 def _findAllMultiplicands(x: int) -> List[int]:
     multiplicands = []
     tmpX = x
     for i in range(2, math.ceil(math.sqrt(x))):  # Ceil cause range doesn't include the last number
         while tmpX % i == 0:
             multiplicands.append(i)
-            tmpX = tmpX / i
+            tmpX = tmpX // i

     if x // math.prod(multiplicands) > 1:
         multiplicands.append(x // math.prod(multiplicands))

     return multiplicands
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 125 -
136, In _findAllMultiplicands, replace the floating-point division on tmpX (tmpX
= tmpX / i) with integer division (tmpX = tmpX // i) and ensure tmpX remains an
int so subsequent modulus and math.prod operations are integer-safe; update any
related expressions that rely on tmpX's type (e.g., the while loop condition
tmpX % i == 0 and the final math.prod usage) to work with the integer result.
Deeploy/Targets/NE16/Parsers.py (1)

55-62: Use not channels_first instead of == False.

Per PEP 8 and Ruff E712, avoid direct comparisons to False.

🔧 Proposed fix

```diff
         if not all([
-                channels_first == False,
+                not channels_first,
                 len(data_in.shape) == 4,
                 # LMACAN: weight shape should be equal to 3 because we have to do the ne16's
                 #         special weight encoding
                 len(weight.shape) == 3,
         ]):
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` around lines 55 - 62, In the conditional
inside Parsers.py (the if checking channels_first, data_in.shape, weight.shape)
replace the direct boolean comparison channels_first == False with the idiomatic
not channels_first (i.e., use not channels_first in the all([...]) list) so the
condition follows PEP8/Ruff E712; adjust the list element in the same if that
currently reads channels_first == False to read not channels_first.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/NE16/__init__.py`:
- Line 5: The __init__.py currently uses explicit imports but omitted the Tiler
module, causing Engine.py's internal imports to fail; update
Deeploy.Targets.NE16.__init__.py to explicitly import or expose the Tiler module
(the Tiler.py symbol) alongside the other exported modules so it is available
when the package is imported, ensuring Engine.py's references to Tiler resolve
correctly (alternatively, use a package import loop such as using
pkgutil.iter_modules to dynamically import all submodules if you prefer
automatic inclusion).

In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 28-30: The constructor currently uses a mutable default
inputOffsets = {} which causes shared state; change the signature for the
relevant classes (SignPropDeployer.__init__ and GAP9Deployer.__init__) to use
inputOffsets=None and inside each __init__ set inputOffsets = {} if inputOffsets
is None before passing it to super().__init__ (or before any mutations), so each
instance gets its own dict and avoids state leakage when SignPropDeployer
populates the keys.
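
The pitfall and the suggested fix, reduced to a hypothetical minimal class (the real `__init__` signatures take several more arguments):

```python
class SignPropDeployer:
    # Suggested pattern: default to None, create a fresh dict per instance.
    def __init__(self, inputOffsets = None):
        if inputOffsets is None:
            inputOffsets = {}
        self.inputOffsets = inputOffsets

a = SignPropDeployer()
b = SignPropDeployer()
a.inputOffsets["x"] = 128
print(b.inputOffsets)  # {} -- no state leaks between instances
```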

In `@Deeploy/Targets/NE16/Engine.py`:
- Around line 62-81: The conv-classification is too loose: tighten isPWConv to
reject grouped 1x1 convolutions by adding a check that node.attrs['group'] == 1,
and tighten isDWConv to only accept true depthwise convolutions by inspecting
the conv weight tensor (node.inputs[1].values) and requiring the per-group
in_channels == 1 and that the number of groups equals the number of output
channels (e.g., weights.shape[1] == 1 and weights.shape[0] ==
node.attrs['group'] or group == out_channels); update isPWConv, isDWConv (and
thereby canExecute) accordingly so only bona fide pointwise (group==1) and true
depthwise convs are classified for NE16 coloring.
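
The tightened classification could look roughly like this sketch. The `Node` class below is a hypothetical stand-in for the graphsurgeon nodes the real code inspects, assuming the standard ONNX Conv weight layout (cout, cin_per_group, kH, kW) before NE16's weight re-encoding:

```python
import numpy as np

class Node:
    # Hypothetical minimal node: attrs dict plus the conv weight tensor
    # (standing in for node.inputs[1].values).
    def __init__(self, attrs, weights):
        self.attrs = attrs
        self.weights = weights

def isPWConv(node) -> bool:
    # Pointwise: 1x1 kernel AND ungrouped, per the suggestion above.
    return node.attrs["kernel_shape"] == [1, 1] and node.attrs["group"] == 1

def isDWConv(node) -> bool:
    # True depthwise: one input channel per group, groups == out channels.
    w = node.weights
    return w.shape[1] == 1 and w.shape[0] == node.attrs["group"]

pw = Node({"kernel_shape": [1, 1], "group": 1}, np.zeros((64, 32, 1, 1)))
grouped = Node({"kernel_shape": [1, 1], "group": 4}, np.zeros((64, 8, 1, 1)))
dw = Node({"kernel_shape": [3, 3], "group": 16}, np.zeros((16, 1, 3, 3)))
print(isPWConv(pw), isPWConv(grouped), isDWConv(dw))  # True False True
```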

In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`:
- Around line 27-44: The loop that sums and assigns weight memory uses
weightMemoryOccupation and then re-adds the same buffer when multiple NE16 nodes
reference it and uses a strict "<" so exact-fit buffers are skipped; change the
logic so before calling _ne16WeightBufferSize(buffer) or adding to
weightMemoryOccupation you check whether buffer._memoryLevel is already equal to
self._weightMemoryLevel.name (skip counting/annotating if already placed) and
use a non-strict comparison (<=) against self._weightMemoryLevel.size when
deciding to assign buffer._memoryLevel; update references in this block
(ne16Nodes iteration, ctxt.lookup(node.inputs[1].name), buffer._memoryLevel,
_ne16WeightBufferSize, and self._weightMemoryLevel.size) accordingly so buffers
aren’t double-counted and exact-fit placements are allowed.
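
The corrected accounting might look roughly like the following sketch (names and data model hypothetical; the real pass iterates NE16 nodes and looks buffers up via `ctxt.lookup`):

```python
def placeWeights(buffers, levelName, levelSize, bufferSize):
    # Assign weight buffers to a memory level without double-counting
    # shared buffers, and allow exact fits via <=.
    occupation = 0
    for buf in buffers:
        if buf.get("memoryLevel") == levelName:
            continue  # already placed; skip repeated references
        size = bufferSize(buf)
        if occupation + size <= levelSize:  # <= permits exact fits
            buf["memoryLevel"] = levelName
            occupation += size
    return occupation

w0 = {"name": "w0", "size": 96}
w1 = {"name": "w1", "size": 32}
# Two NE16 nodes referencing the same weight buffer w0:
used = placeWeights([w0, w1, w0], "wmem", 128, lambda b: b["size"])
print(used)  # 128 -- w0 counted once, w1 exactly fills the level
```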

In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 118-121: The parsers map only 4 inputs into
self.operatorRepresentation without bounds checking and always set shift to
'NULL', causing IndexError and mismatched scale_shift_addr; update parseNode()
in the affected parsers (the current loop that assigns inputs via inputs =
['data_in','weight','mul','add'] and self.operatorRepresentation[inputs[idx]] =
ctxt.lookup(inputNode.name).name) to perform explicit validation like
len(node.inputs) == 4 (following RQMatMulParser) OR handle a fifth input by
mapping node.inputs[4] into e.g. self.operatorRepresentation['shift'] (or read
shift from node.attrs['shift'] like PULPOpen) and stop hardcoding 'NULL' so the
template variable ${shift} used by ConvTemplate.py:310 gets a proper value;
apply the same change to NE16RQSPWConv2DParser and NE16RQSDenseConv2DParser.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 141-147: The conf0 computation treats bias presence
incorrectly—change the bias check to treat the "NULL" sentinel as absent: when
calling getNormQuantConf0 in the block guarded by operatorRepresentation["mul"]
!= "NULL", replace the argument "add" in operatorRepresentation with an explicit
boolean like ('add' in operatorRepresentation and operatorRepresentation["add"]
!= "NULL") so bit 25 isn't set when add == "NULL"; keep the existing mul !=
"NULL" guard as-is and pass the adjusted boolean into getNormQuantConf0
(symbols: operatorRepresentation, getNormQuantConf0, conf0, "mul", "add").
- Around line 323-327: The compound literal for ne16_cfg_t incorrectly assigns
weights_stride via an equals expression instead of a designated initializer;
change the block that sets task.data.cfg.weights_stride to use a designated
field (e.g., add a .weights_stride = { .d0 = weightStrideD0, .d1 =
weightStrideD1, .d2 = weightStrideD2 } entry) inside the ne16_cfg_t initializer
so it matches the surrounding fields like .input_stride, .output_stride and
.subtile and produces valid C code.
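
For the bias-sentinel point in the first ConvTemplate comment above, a minimal sketch of the suggested check. The helper here is a hypothetical reduction of `getNormQuantConf0` to the single bias bit (bit 25) named in the comment; the real helper computes more fields:

```python
NE16_FLAG_NORM_BIAS = 1 << 25  # bias bit position named in the review

def getNormQuantConf0(hasBias: bool) -> int:
    # Hypothetical reduction of the real helper to the bias bit only.
    return NE16_FLAG_NORM_BIAS if hasBias else 0

op = {"mul": "mul_tensor", "add": "NULL"}
conf0 = 0
if op["mul"] != "NULL":
    # Treat the "NULL" sentinel as "no bias", per the suggested fix.
    conf0 |= getNormQuantConf0("add" in op and op["add"] != "NULL")
print(hex(conf0))  # 0x0 -- bit 25 stays clear when add == "NULL"
```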

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 41-44: The three unused weight-dimension variable declarations in
NE16DenseConstraint.py (weightInChannelMajorVar, weightBitsVar,
weightBandwidthVar) attempt to access non-existent dims and cause a runtime
KeyError; remove the lines that create weightInChannelMajorVar, weightBitsVar
and weightBandwidthVar and keep only weightOutChannelVar =
tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0) which is
the only weight dim used in this function (adjust any references to those
removed symbols if present elsewhere in the function).

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 43-46: The current constraints in NE16PointwiseConstraint.py bind
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar which breaks
strided 1x1 convs allowed by NE16Engine.canExecute when enableStrides is true;
replace those identity constraints (the tilerModel.addConstraint calls for
outputHeightVar/outputWidthVar) with the proper stride-aware relation using the
stride variables (e.g., strideH/strideW or the stride parameter used by
NE16Engine) so outputDim = (effective_input_dim - 1) // stride + 1 for height
and width respectively, preserving the existing batch equality (outputBatchVar
== inputBatchVar).
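
The stride-aware relation suggested above can be checked as a standalone formula. For a 1x1 kernel with no padding, the effective input dimension is just the input dimension, so the identity constraint only holds at stride 1:

```python
def convOutputDim(inputDim: int, kernel: int, stride: int, pad: int = 0) -> int:
    # Standard strided-conv output size; with kernel=1 and pad=0 this is
    # exactly the review's (inputDim - 1) // stride + 1.
    effective = inputDim + 2 * pad - (kernel - 1)
    return (effective - 1) // stride + 1

print(convOutputDim(16, 1, 1))  # 16 -- identity holds only for stride 1
print(convOutputDim(16, 1, 2))  # 8
print(convOutputDim(15, 1, 2))  # 8
```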

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 81-101: The subtraction of weight_offset from values in Passes.py
can overflow if values is int8; convert values to a wider integer dtype (e.g.,
np.int16 or np.int32) before computing weight_offset and performing values =
values - weight_offset, then ensure the resulting values are clipped to the
0–255 range and finally cast to np.uint8 just before calling _weightEncode
(refer to weightTensor.values, weight_offset, values, and _weightEncode) so the
transpose and depthwise logic remain unchanged.
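
The widening issue can be demonstrated directly with NumPy (the offset value -128 here is illustrative for signed 8-bit weights, not taken from the PR):

```python
import numpy as np

values = np.array([-128, -1, 0, 127], dtype = np.int8)
weight_offset = -128  # illustrative offset for signed int8 weights

wrapped = values - np.int8(weight_offset)          # stays int8: wraps around
widened = values.astype(np.int32) - weight_offset  # safe in int32

# Clip to the 0-255 range and cast to uint8, as suggested above:
encoded = np.clip(widened, 0, 255).astype(np.uint8)
print(wrapped.tolist())  # [0, 127, -128, -1] -- overflow wrapped
print(encoded.tolist())  # [0, 127, 128, 255]
```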

In `@DeeployTest/deeployRunner_tiled_gap9_w_ne16.py`:
- Around line 15-16: The two parsed flags --ne16-wmem and --enable-3x3 are never
consumed; update main() to translate args.ne16_wmem and args.enable_3x3 into
CMake defines (e.g., add "-DENABLE_NE16_WMEM=ON" when args.ne16_wmem is true and
"-DENABLE_3X3_CONV=ON" when args.enable_3x3 is true) and append them to the same
cmake argument list used for building (the variable that assembles cmake args /
options or the call that runs cmake/configure in main()); if the project expects
OFF values when flags are absent, explicitly add =OFF otherwise. Ensure you
reference the existing parser flags (--ne16-wmem / --enable-3x3) and modify the
code path that constructs cmake arguments in main() so these options are passed
to CMake.

---

Nitpick comments:
In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 55-62: In the conditional inside Parsers.py (the if checking
channels_first, data_in.shape, weight.shape) replace the direct boolean
comparison channels_first == False with the idiomatic not channels_first (i.e.,
use not channels_first in the all([...]) list) so the condition follows
PEP8/Ruff E712; adjust the list element in the same if that currently reads
channels_first == False to read not channels_first.

In `@Deeploy/Targets/NE16/TileConstraints/__init__.py`:
- Line 5: Replace the self-referential wildcard import in __init__.py with
explicit re-exports: remove "from . import *" and import the specific public
symbols (e.g., classes, functions, constants) from the package submodules using
explicit imports (for example: from .submodule import ClassX, func_y), then
declare __all__ = ["ClassX", "func_y", ...] listing those exported names; ensure
you reference the actual symbol names defined in the package (replace
placeholders with the real identifiers) and avoid introducing circular imports
when reorganizing the imports.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 161-162: The tuple unpackings in NE16DenseConstraint.py currently
bind unused names BatchOffset, HOffset, WOffset, COffset and BatchSize (and
similarly at the later unpacking around line 173); update those unpackings to
prefix unused variables with an underscore (e.g., _BatchOffset, _HOffset,
_WOffset, _BatchSize) so the intent of intentional non-use is clear; locate the
unpacking expressions that assign to cube.offset and cube.dims and rename only
the unused identifiers accordingly while keeping used names (COffset, HSize,
WSize, CSize) unchanged.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 125-136: In _findAllMultiplicands, replace the floating-point
division on tmpX (tmpX = tmpX / i) with integer division (tmpX = tmpX // i) and
ensure tmpX remains an int so subsequent modulus and math.prod operations are
integer-safe; update any related expressions that rely on tmpX's type (e.g., the
while loop condition tmpX % i == 0 and the final math.prod usage) to work with
the integer result.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3a121db6-6486-4738-9ded-bbe6729e8727

📥 Commits

Reviewing files that changed from the base of the PR and between 3b011bb and e41800e.

📒 Files selected for processing (28)
  • CMakeLists.txt
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • TargetLibraries/GAP9/CMakeLists.txt

```python
#
# SPDX-License-Identifier: Apache-2.0

from . import *
```
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify self-referential wildcard imports in NE16 package initializers.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16

Repository: pulp-platform/Deeploy

Length of output: 391


🏁 Script executed:

find Deeploy/Targets/NE16 -maxdepth 1 -type d | sort

Repository: pulp-platform/Deeploy

Length of output: 241


🏁 Script executed:

ls -la Deeploy/Targets/NE16/ | head -30

Repository: pulp-platform/Deeploy

Length of output: 878


🏁 Script executed:

cat -n Deeploy/Targets/NE16/__init__.py

Repository: pulp-platform/Deeploy

Length of output: 224


🏁 Script executed:

rg "from.*NE16.*import.*Tiler|from.*NE16.*Tiler" --type py

Repository: pulp-platform/Deeploy

Length of output: 205


🏁 Script executed:

rg "\.Tiler" Deeploy/Targets/NE16 --type py

Repository: pulp-platform/Deeploy

Length of output: 836


🏁 Script executed:

find Deeploy/Targets -maxdepth 2 -name "__init__.py" -type f | head -10

Repository: pulp-platform/Deeploy

Length of output: 374


🏁 Script executed:

for dir in Deeploy/Targets/*/; do echo "=== $(basename $dir) ==="; cat "$dir/__init__.py" 2>/dev/null | head -20; done

Repository: pulp-platform/Deeploy

Length of output: 1218


🏁 Script executed:

rg "__all__" Deeploy/Targets/NE16 --type py -A 5

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

head -20 Deeploy/Targets/NE16/Tiler.py

Repository: pulp-platform/Deeploy

Length of output: 1270


🏁 Script executed:

rg "import.*Tiler" Deeploy --type py | grep -v "TilerExtension\|TilerModel" | head -20

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

rg "from.*NE16" Deeploy --type py | grep -v "^Binary" | head -30

Repository: pulp-platform/Deeploy

Length of output: 2287


The proposed fix is incomplete: Tiler.py exists in the NE16 directory but is missing from the proposed imports.

While the original concern is valid (from . import * does not expose submodules or subpackages as intended), the proposed fix lists only 9 modules and omits Tiler.py, which exists as a module in the directory and is imported internally by Engine.py.

Corrected fix
-from . import *
+from . import Bindings, Deployer, Engine, OptimizationPasses, Parsers, Platform, Tiler, Templates, TileConstraints, TopologyOptimizationPasses
+
+__all__ = [
+    "Bindings",
+    "Deployer",
+    "Engine",
+    "OptimizationPasses",
+    "Parsers",
+    "Platform",
+    "Tiler",
+    "Templates",
+    "TileConstraints",
+    "TopologyOptimizationPasses",
+]
🧰 Tools
🪛 Ruff (0.15.9)

[error] 5-5: from . import * used; unable to detect undefined names

(F403)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/__init__.py` at line 5, the __init__.py currently uses
`from . import *`, which does not import or expose any of the package's
submodules; update Deeploy.Targets.NE16.__init__.py to explicitly import the
submodules (Bindings, Deployer, Engine, OptimizationPasses, Parsers, Platform,
Tiler, Templates, TileConstraints, TopologyOptimizationPasses) and list them in
__all__, making sure Tiler is included so Engine.py's references to it resolve
correctly (alternatively, use pkgutil.iter_modules to import all submodules
dynamically if you prefer automatic inclusion).
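
The root cause is easy to reproduce outside Deeploy. The sketch below (package and module names are invented for the demo) shows that a package __init__.py containing only `from . import *` attaches no submodules:

```python
import os
import sys
import tempfile

# Build a throwaway package on disk whose __init__.py is just `from . import *`.
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "demo_pkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("from . import *\n")
with open(os.path.join(pkg, "mod.py"), "w") as f:
    f.write("VALUE = 42\n")

sys.path.insert(0, tmp)
import demo_pkg

# Star-import never triggers submodule imports, so `mod` is absent.
print(hasattr(demo_pkg, "mod"))  # False
```

An explicit `from . import mod` (the pattern the corrected fix applies to Tiler and the other modules) is what actually binds each submodule as an attribute of the package.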

Comment on lines +28 to +30
inputOffsets = {}):
super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
default_channels_first, deeployStateDir, inputOffsets)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether inputOffsets is mutated downstream (which would make shared default state dangerous).
rg -nP --type=py -C3 'def __init__\(|inputOffsets\s*=' Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py
rg -nP --type=py -C2 'inputOffsets\[[^]]+\]\s*=|inputOffsets\.update\(|setdefault\(' Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py

Repository: pulp-platform/Deeploy

Length of output: 12773


🏁 Script executed:

# Find the parent class of GAP9Deployer
rg -n "class GAP9Deployer" Deeploy/Targets/GAP9/Deployer.py -A 5

Repository: pulp-platform/Deeploy

Length of output: 312


🏁 Script executed:

# Get the full NE16Deployer.__init__ context and check parent
head -n 35 Deeploy/Targets/NE16/Deployer.py

Repository: pulp-platform/Deeploy

Length of output: 1693


🏁 Script executed:

# Get the full GAP9Deployer.__init__ context to see parent class definition
head -n 70 Deeploy/Targets/GAP9/Deployer.py

Repository: pulp-platform/Deeploy

Length of output: 2726


🏁 Script executed:

# Search for any mutation of inputOffsets using broader patterns
rg -n "inputOffsets" Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py -A 2 -B 2

Repository: pulp-platform/Deeploy

Length of output: 1535


🏁 Script executed:

# Find PULPDeployer class and its __init__
fd "Deployer.py" Deeploy/Targets/PULPOpen/

Repository: pulp-platform/Deeploy

Length of output: 101


🏁 Script executed:

# Search for PULPDeployer definition
rg -n "class PULPDeployer" Deeploy/ -A 10

Repository: pulp-platform/Deeploy

Length of output: 974


🏁 Script executed:

# Search for how inputOffsets is used in the parent class hierarchy
rg -n "self.inputOffsets" Deeploy/ -B 2 -A 2

Repository: pulp-platform/Deeploy

Length of output: 3019


🏁 Script executed:

# Read SignPropDeployer to see the full __init__ and how inputOffsets is mutated
head -n 55 Deeploy/CommonExtensions/NetworkDeployers/SignPropDeployer.py

Repository: pulp-platform/Deeploy

Length of output: 2075


🏁 Script executed:

# Check if inputOffsets is mutated elsewhere besides the initial population
rg -n "inputOffsets\[" Deeploy/ -B 1 -A 1

Repository: pulp-platform/Deeploy

Length of output: 1113


Use None as default for inputOffsets parameter.

The {} default on line 28 creates a shared mutable object that gets mutated in SignPropDeployer.__init__() (which populates it with keys). Multiple instances created without providing inputOffsets will share the same dict, causing unintended state leakage.

Proposed fix
-                 inputOffsets = {}):
+                 inputOffsets = None):
+        if inputOffsets is None:
+            inputOffsets = {}
         super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)

Note: GAP9Deployer has the same issue on line 54 and should be fixed identically.

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 28-28: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Deployer.py` around lines 28 - 30, The constructor
currently uses a mutable default inputOffsets = {} which causes shared state;
change the signature for the relevant classes (SignPropDeployer.__init__ and
GAP9Deployer.__init__) to use inputOffsets=None and inside each __init__ set
inputOffsets = {} if inputOffsets is None before passing it to super().__init__
(or before any mutations), so each instance gets its own dict and avoids state
leakage when SignPropDeployer populates the keys.
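
The shared-state failure mode is the standard Python mutable-default pitfall; `make_deployer` below is a toy stand-in for the deployer constructors, not Deeploy code:

```python
def make_deployer(offsets = {}):           # one dict created at `def` time
    offsets.setdefault("input_0", 0)       # mimics SignPropDeployer populating keys
    return offsets

a = make_deployer()
a["input_0"] = 128                         # mutate through the first "instance"
b = make_deployer()                        # second call reuses the same dict
print(b["input_0"])                        # 128: state leaked across calls

def make_deployer_safe(offsets = None):
    if offsets is None:                    # fresh dict per call, as proposed
        offsets = {}
    offsets.setdefault("input_0", 0)
    return offsets

print(make_deployer_safe()["input_0"])     # 0
```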

Comment on lines +62 to +81
def isPWConv(self, node) -> bool:
return node.op in ["Conv", "RequantizedConv"] and \
isinstance(node.inputs[1], gs.Constant) and \
node.attrs['kernel_shape'] == [1, 1] and \
node.attrs['dilations'] == [1, 1] and \
(node.attrs['strides'] == [1, 1] or self.enableStrides)

def isDWConv(self, node) -> bool:
return node.op in ["Conv", "RequantizedConv"] and \
isinstance(node.inputs[1], gs.Constant) and \
node.attrs['kernel_shape'] == [3, 3] and \
node.attrs['dilations'] == [1, 1] and \
node.attrs['group'] != 1 and \
(node.attrs['strides'] == [1, 1] or self.enableStrides)

def canExecute(self, node: gs.Node) -> bool:
if self.enable3x3:
return self.isPWConv(node) or self.isDWConv(node) or self.isDenseConv(node)
else:
return self.isPWConv(node)

⚠️ Potential issue | 🟠 Major

Tighten grouped-convolution classification before coloring.

isPWConv() currently accepts grouped 1x1 convs, and isDWConv() accepts any 3x3 conv with group != 1. The downstream DW path assumes true depthwise semantics, so generic grouped convolutions can be colored for NE16 and then fail later in parsing/tiling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Engine.py` around lines 62 - 81, The conv-classification
is too loose: tighten isPWConv to reject grouped 1x1 convolutions by adding a
check that node.attrs['group'] == 1, and tighten isDWConv to only accept true
depthwise convolutions by inspecting the conv weight tensor
(node.inputs[1].values) and requiring the per-group in_channels == 1 and that
the number of groups equals the number of output channels (e.g.,
weights.shape[1] == 1 and weights.shape[0] == node.attrs['group'] or group ==
out_channels); update isPWConv, isDWConv (and thereby canExecute) accordingly so
only bona fide pointwise (group==1) and true depthwise convs are classified for
NE16 coloring.
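
A shape-based depthwise test along the lines the prompt suggests can be sketched as follows (the helper name and zero-filled tensors are illustrative only):

```python
import numpy as np

def is_true_depthwise(weights: np.ndarray, group: int) -> bool:
    # ONNX Conv weights have shape (cout, cin // group, kH, kW): a genuine
    # depthwise conv carries one input channel per group, with group == cout.
    return weights.shape[1] == 1 and group == weights.shape[0]

dw = np.zeros((16, 1, 3, 3))   # depthwise: 16 groups of a single channel
gp = np.zeros((16, 4, 3, 3))   # generic grouped conv: 4 channels per group
print(is_true_depthwise(dw, 16))  # True
print(is_true_depthwise(gp, 4))   # False
```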

Comment on lines +27 to +44
weightMemoryOccupation = 0

# Current weight memory occupation
for buffer in {**ctxt.globalObjects, **ctxt.localObjects}.values():
if hasattr(buffer, "_memoryLevel") and buffer._memoryLevel == self._weightMemoryLevel.name:
weightMemoryOccupation += _ne16WeightBufferSize(buffer)

ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
for node in ne16Nodes:
if node.op in ["Conv", "RequantizedConv"]:

if not (ctxt.is_local(node.inputs[1].name) or ctxt.is_global(node.inputs[1].name)):
continue

buffer = ctxt.lookup(node.inputs[1].name)
if weightMemoryOccupation + _ne16WeightBufferSize(buffer) < self._weightMemoryLevel.size:
buffer._memoryLevel = self._weightMemoryLevel.name
weightMemoryOccupation += _ne16WeightBufferSize(buffer)

⚠️ Potential issue | 🟠 Major

Avoid double-counting reused weights and allow exact-fit placement.

If two NE16 nodes share the same weight buffer, this loop counts the buffer again for every node that references it, even after it has already been annotated in this pass. The strict < check also rejects buffers that would exactly fill the remaining weight memory. Both cases leave feasible placements unassigned.

Suggested fix
         ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
         for node in ne16Nodes:
             if node.op in ["Conv", "RequantizedConv"]:

                 if not (ctxt.is_local(node.inputs[1].name) or ctxt.is_global(node.inputs[1].name)):
                     continue

                 buffer = ctxt.lookup(node.inputs[1].name)
-                if weightMemoryOccupation + _ne16WeightBufferSize(buffer) < self._weightMemoryLevel.size:
-                    buffer._memoryLevel = self._weightMemoryLevel.name
-                    weightMemoryOccupation += _ne16WeightBufferSize(buffer)
+                if getattr(buffer, "_memoryLevel", None) == self._weightMemoryLevel.name:
+                    continue
+
+                bufferSize = _ne16WeightBufferSize(buffer)
+                if weightMemoryOccupation + bufferSize <= self._weightMemoryLevel.size:
+                    buffer._memoryLevel = self._weightMemoryLevel.name
+                    weightMemoryOccupation += bufferSize
         return ctxt, graph
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`
around lines 27 - 44, The loop that sums and assigns weight memory uses
weightMemoryOccupation and then re-adds the same buffer when multiple NE16 nodes
reference it and uses a strict "<" so exact-fit buffers are skipped; change the
logic so before calling _ne16WeightBufferSize(buffer) or adding to
weightMemoryOccupation you check whether buffer._memoryLevel is already equal to
self._weightMemoryLevel.name (skip counting/annotating if already placed) and
use a non-strict comparison (<=) against self._weightMemoryLevel.size when
deciding to assign buffer._memoryLevel; update references in this block
(ne16Nodes iteration, ctxt.lookup(node.inputs[1].name), buffer._memoryLevel,
_ne16WeightBufferSize, and self._weightMemoryLevel.size) accordingly so buffers
aren’t double-counted and exact-fit placements are allowed.
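
The corrected bookkeeping can be sketched independently of Deeploy's context objects; `place_weights` and the dict-based buffers below are stand-ins for the pass's buffer annotations:

```python
def place_weights(buffers, capacity, level = "WMEM"):
    # Corrected placement loop: skip buffers already annotated with the target
    # memory level, and use <= so an exact fit is still placed.
    used = 0
    for buf in buffers:
        if buf.get("level") == level:       # already placed: don't count twice
            continue
        if used + buf["size"] <= capacity:  # <= allows exact-fit placement
            buf["level"] = level
            used += buf["size"]
    return used

shared = {"size": 64}
bufs = [shared, shared, {"size": 64}]       # same buffer referenced by two nodes
print(place_weights(bufs, 128))             # 128: shared counted once, exact fit
```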

Comment on lines +41 to +44
weightOutChannelVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0)
weightInChannelMajorVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 1)
weightBitsVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 2)
weightBandwidthVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 3)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify weight tensor shape after encoding and check if dimIdx 3 causes issues
echo "=== Check TilerModel.getTensorDimVar implementation ==="
ast-grep --pattern 'def getTensorDimVar($$$):
  $$$
'

echo -e "\n=== Check if weight shape assertions exist in tiler ==="
rg -n "getTensorDimVar.*weight" --type py -C 2

Repository: pulp-platform/Deeploy

Length of output: 17734


🏁 Script executed:

# Check if the variables declared at lines 41-44 are used later in the function
sed -n '1,100p' Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py | cat -n

Repository: pulp-platform/Deeploy

Length of output: 6259


🏁 Script executed:

# Check if similar variables are used in the comparable Neureka file
sed -n '40,80p' Deeploy/Targets/Neureka/TileConstraints/NeurekaDenseConstraint.py | cat -n

Repository: pulp-platform/Deeploy

Length of output: 2692


🏁 Script executed:

# Find where NE16AdjustWeightMemoryLayoutPass is defined and check weight shape
find . -name "*NE16AdjustWeightMemoryLayoutPass*" -o -name "*Parsers.py" | head -5

Repository: pulp-platform/Deeploy

Length of output: 246


🏁 Script executed:

# Search for all uses of the suspect variables in NE16DenseConstraint.py
rg -n "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar" Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py

Repository: pulp-platform/Deeploy

Length of output: 370


🏁 Script executed:

# Check NE16 Parsers.py for weight shape information
cat -n Deeploy/Targets/NE16/Parsers.py | head -100

Repository: pulp-platform/Deeploy

Length of output: 4765


🏁 Script executed:

# Compare with Neureka - check if those variables are used there
rg -n "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar" Deeploy/Targets/Neureka/TileConstraints/NeurekaDenseConstraint.py

Repository: pulp-platform/Deeploy

Length of output: 370


🏁 Script executed:

# Check if getTensorDimVar creates variables on-demand or retrieves them
cat -n Deeploy/TilingExtension/TilerModel.py | sed -n '125,145p'

Repository: pulp-platform/Deeploy

Length of output: 1096


🏁 Script executed:

# Check addTensorDimToModel to see what dimensions it adds to the model
rg -n "def addTensorDimToModel" Deeploy/TilingExtension/TilerModel.py -A 15

Repository: pulp-platform/Deeploy

Length of output: 762


🏁 Script executed:

# Verify the runtime error scenario: check if weight is 3D and dimIdx=3 would fail
python3 << 'EOF'
# Simulate the getTensorDimVar call
weight_shape = (64, 32*64, 128)  # Example 3D weight: (cout, cinMajor, bits*spatial)
num_dims = len(weight_shape)

print(f"Weight shape: {weight_shape}")
print(f"Number of dimensions: {num_dims}")
print(f"Valid dimIdx range: 0 to {num_dims - 1}")

# Try to access dimIdx=3
for dimIdx in [0, 1, 2, 3]:
    varName = f"weight_dim_{dimIdx}"
    if dimIdx < num_dims:
        print(f"✓ dimIdx={dimIdx}: {varName} would exist (size: {weight_shape[dimIdx]})")
    else:
        print(f"✗ dimIdx={dimIdx}: {varName} would NOT exist (KeyError)")
EOF

Repository: pulp-platform/Deeploy

Length of output: 344


🏁 Script executed:

# Double-check the complete addGeometricalConstraint function to confirm these variables aren't used
wc -l Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
sed -n '24,90p' Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py | grep -E "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar"

Repository: pulp-platform/Deeploy

Length of output: 425


Remove unused weight dimension variables that cause runtime KeyError.

Lines 42-44 declare variables that are never used and access a non-existent dimension. Per Parsers.py line 60, NE16 weight tensors are 3D (cout, cinMajor, bits*spatial) after encoding, with valid dimensions 0–2. Attempting to access dimIdx=3 via getTensorDimVar will fail at runtime with a KeyError.

Remove these three lines:

        weightInChannelMajorVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 1)
        weightBitsVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 2)
        weightBandwidthVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 3)

Keep only line 41: weightOutChannelVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0), which is actually used in the function.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 41
- 44, The three unused weight-dimension variable declarations in
NE16DenseConstraint.py (weightInChannelMajorVar, weightBitsVar,
weightBandwidthVar) attempt to access non-existent dims and cause a runtime
KeyError; remove the lines that create weightInChannelMajorVar, weightBitsVar
and weightBandwidthVar and keep only weightOutChannelVar =
tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0) which is
the only weight dim used in this function (adjust any references to those
removed symbols if present elsewhere in the function).
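
The failure mode follows from how per-dimension variables are keyed by index, one per tensor dimension; the dict below is a minimal stand-in for the tiler's variable table (the key format is illustrative):

```python
import numpy as np

weights = np.zeros((64, 2, 128), dtype = np.uint8)  # 3D encoded NE16-style weights
dim_vars = {f"w_dim_{i}": weights.shape[i] for i in range(weights.ndim)}

print(dim_vars["w_dim_0"])    # 64: dims 0..2 exist for a 3D tensor
print("w_dim_3" in dim_vars)  # False: looking up dimIdx = 3 raises KeyError
```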

Comment on lines +43 to +46
# Map output dims to inputs dims
tilerModel.addConstraint(outputBatchVar == inputBatchVar)
tilerModel.addConstraint(outputHeightVar == inputHeightVar)
tilerModel.addConstraint(outputWidthVar == inputWidthVar)

⚠️ Potential issue | 🟠 Major

Model pointwise output geometry with stride, not identity.

This forces outputHeightVar == inputHeightVar and outputWidthVar == inputWidthVar, but NE16Engine.canExecute() explicitly allows strided 1x1 convs when enableStrides is set. Those tiles need the usual (effective_input - 1) // stride + 1 relation, otherwise the model is wrong as soon as stride > 1.

Suggested fix
-        # Map output dims to inputs dims
+        # Map output dims to input dims
         tilerModel.addConstraint(outputBatchVar == inputBatchVar)
-        tilerModel.addConstraint(outputHeightVar == inputHeightVar)
-        tilerModel.addConstraint(outputWidthVar == inputWidthVar)
+        strides = parseDict["strides"]
+        padding = parseDict["pads"]
+        inputBuffer = ctxt.lookup(inputBufferName)
+        effectiveHeight = inputHeightVar + ((padding[0] + padding[2]) * (inputHeightVar == inputBuffer.shape[1]))
+        effectiveWidth = inputWidthVar + ((padding[1] + padding[3]) * (inputWidthVar == inputBuffer.shape[2]))
+        tilerModel.addConstraint(outputHeightVar == (effectiveHeight - 1) // strides[0] + 1)
+        tilerModel.addConstraint(outputWidthVar == (effectiveWidth - 1) // strides[1] + 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
43 - 46, The current constraints in NE16PointwiseConstraint.py bind
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar which breaks
strided 1x1 convs allowed by NE16Engine.canExecute when enableStrides is true;
replace those identity constraints (the tilerModel.addConstraint calls for
outputHeightVar/outputWidthVar) with the proper stride-aware relation using the
stride variables (e.g., strideH/strideW or the stride parameter used by
NE16Engine) so outputDim = (effective_input_dim - 1) // stride + 1 for height
and width respectively, preserving the existing batch equality (outputBatchVar
== inputBatchVar).
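
The stride-aware relation the review asks for is the standard convolution output geometry; a minimal check (the helper name is illustrative):

```python
def conv_out_dim(in_dim: int, kernel: int, stride: int, pad_lo: int, pad_hi: int) -> int:
    # For a 1x1 kernel this reduces to (in + pads - 1) // stride + 1,
    # which equals in_dim only when stride == 1 and padding is zero.
    return (in_dim + pad_lo + pad_hi - kernel) // stride + 1

print(conv_out_dim(32, 1, 1, 0, 0))  # 32: the identity constraint holds
print(conv_out_dim(32, 1, 2, 0, 0))  # 16: identity would over-constrain
```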

Comment on lines +81 to +101
values = weightTensor.values

# Extract weight offset and translate weights by the offset
weight_offset = values.min()
values = values - weight_offset
node.attrs["weight_offset"] = weight_offset

if "channels_first" in node.attrs:
channels_first = node.attrs["channels_first"]
else:
channels_first = default_channels_first

# Weight encode expects channels first
if not channels_first:
values = values.transpose(0, 3, 1, 2)

bits = 8 # Support only 8 bit weights for now
if node.attrs['group'] == 1:
weightTensor.values = _weightEncode(values.astype(np.uint8), bits, depthwise = False)
else:
weightTensor.values = _weightEncode(values.astype(np.uint8), bits, depthwise = True)

⚠️ Potential issue | 🟠 Major

Potential integer overflow when subtracting weight offset.

If values has dtype int8 (range -128 to 127), the subtraction at line 85 can overflow before the astype(np.uint8) at line 99. For example, with weight_offset = -128 and a value of 127:

127 - (-128) = 255  # overflows int8, wraps to -1

Cast to a wider type before subtraction to avoid wraparound:

🐛 Proposed fix
     # Adjust N-EUREKA's weights
     values = weightTensor.values

     # Extract weight offset and translate weights by the offset
     weight_offset = values.min()
-    values = values - weight_offset
+    values = values.astype(np.int32) - int(weight_offset)
     node.attrs["weight_offset"] = weight_offset

     if "channels_first" in node.attrs:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 81 -
101, The subtraction of weight_offset from values in Passes.py can overflow if
values is int8; convert values to a wider integer dtype (e.g., np.int16 or
np.int32) before computing weight_offset and performing values = values -
weight_offset, then ensure the resulting values are clipped to the 0–255 range
and finally cast to np.uint8 just before calling _weightEncode (refer to
weightTensor.values, weight_offset, values, and _weightEncode) so the transpose
and depthwise logic remain unchanged.
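
The wraparound and the widened fix can be checked directly with NumPy:

```python
import numpy as np

w = np.array([-128, 0, 127], dtype = np.int8)
offset = int(w.min())                     # -128

wrapped = w - np.int8(offset)             # arithmetic stays int8 and wraps
print(wrapped.tolist())                   # [0, -128, -1]

safe = w.astype(np.int32) - offset        # widen before subtracting
print(safe.tolist())                      # [0, 128, 255]
```

Note that a later reinterpreting astype(np.uint8) maps the wrapped int8 values back into the 0–255 range modulo 256; widening first nonetheless makes the intent explicit instead of relying on wraparound semantics.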

Comment on lines +15 to +16
parser.add_argument('--ne16-wmem', action = 'store_true', help = 'Enable NE16 weight memory\n')
parser.add_argument('--enable-3x3', action = 'store_true', help = 'Enable 3x3 convolutions\n')

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify end-to-end wiring for NE16-specific runner flags.
rg -nP --type=py -C3 'ne16[_-]wmem|enable[_-]3x3|hasattr\(args,\s*.\s*ne16|hasattr\(args,\s*.\s*enable' DeeployTest/testUtils/deeployRunner.py DeeployTest/deeployRunner_tiled_gap9_w_ne16.py DeeployTest/testUtils/core/execution.py
rg -nP --type=cmake -C3 'NE16|3x3|WMEM|NUM_CORES' CMakeLists.txt DeeployTest/CMakeLists.txt TargetLibraries/GAP9/CMakeLists.txt

Repository: pulp-platform/Deeploy

Length of output: 6815


🏁 Script executed:

# Search for main() function and how it handles arguments
rg -n 'def main' DeeployTest/testUtils/deeployRunner.py -A 20

# Look for args usage with underscore conversion (--ne16-wmem → args.ne16_wmem)
rg -n 'args\.ne16|args\.enable' DeeployTest/testUtils/deeployRunner.py DeeployTest/deeployRunner_tiled_gap9_w_ne16.py

# Check the entire flow in the runner file
cat -n DeeployTest/deeployRunner_tiled_gap9_w_ne16.py

Repository: pulp-platform/Deeploy

Length of output: 2136


🏁 Script executed:

# Get more of the main() function to see how it uses parser_setup_callback
sed -n '311,450p' DeeployTest/testUtils/deeployRunner.py | cat -n

# Also search for any reference to ne16_wmem or enable_3x3 after variable names are converted
rg -n 'ne16_wmem|enable_3x3' DeeployTest/testUtils/deeployRunner.py DeeployTest/testUtils/core/execution.py

Repository: pulp-platform/Deeploy

Length of output: 5539


Add handling for --ne16-wmem and --enable-3x3 flags in main().

The flags are defined on Lines 15-16, but unlike --cores (which is properly handled at deeployRunner.py:90-91), there is no downstream consumption of --ne16-wmem or --enable-3x3. These arguments are parsed but never converted to CMake options or used anywhere in execution. Either wire them into CMake configuration (e.g., -DENABLE_NE16_WMEM, -DENABLE_3X3_CONV), or remove them if not yet implemented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DeeployTest/deeployRunner_tiled_gap9_w_ne16.py` around lines 15 - 16, The two
parsed flags --ne16-wmem and --enable-3x3 are never consumed; update main() to
translate args.ne16_wmem and args.enable_3x3 into CMake defines (e.g., add
"-DENABLE_NE16_WMEM=ON" when args.ne16_wmem is true and "-DENABLE_3X3_CONV=ON"
when args.enable_3x3 is true) and append them to the same cmake argument list
used for building (the variable that assembles cmake args / options or the call
that runs cmake/configure in main()); if the project expects OFF values when
flags are absent, explicitly add =OFF otherwise. Ensure you reference the
existing parser flags (--ne16-wmem / --enable-3x3) and modify the code path that
constructs cmake arguments in main() so these options are passed to CMake.
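
A minimal wiring sketch (the CMake define names are assumptions taken from the suggestion above, not existing Deeploy options):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--ne16-wmem', action = 'store_true')
parser.add_argument('--enable-3x3', action = 'store_true')
args = parser.parse_args(['--ne16-wmem'])

cmake_args = []
if args.ne16_wmem:                        # argparse maps '-' to '_' in dest names
    cmake_args.append('-DENABLE_NE16_WMEM=ON')
if args.enable_3x3:
    cmake_args.append('-DENABLE_3X3_CONV=ON')
print(cmake_args)                         # ['-DENABLE_NE16_WMEM=ON']
```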

@runwangdl runwangdl force-pushed the gap9-ne16 branch 9 times, most recently from 1153077 to 4edb011 on April 13, 2026 at 23:43

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
Deeploy/Targets/NE16/Templates/ConvTemplate.py (1)

352-356: ⚠️ Potential issue | 🔴 Critical

Fix invalid designated initializer syntax for weights_stride.

The line task.data.cfg.weights_stride = (ne16_stride_t) {...} is inside the .cfg = (ne16_cfg_t) { ... } designated initializer block. Within designated initializers, you must use .fieldname = value syntax. The current syntax is invalid C and will cause compilation errors.

🐛 Proposed fix
-            task.data.cfg.weights_stride = (ne16_stride_t) {
+            .weights_stride = (ne16_stride_t) {
                 .d0 = ${weightStrideD0},
                 .d1 = ${weightStrideD1},
                 .d2 = ${weightStrideD2}
             },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 352 - 356,
Inside the designated initializer for task.data.cfg (the .cfg = (ne16_cfg_t) {
... } block) the assignment to weights_stride uses an invalid nested C
cast-style initializer; replace it with a proper designated-field initializer by
setting .weights_stride = { .d0 = ${weightStrideD0}, .d1 = ${weightStrideD1},
.d2 = ${weightStrideD2} } (or include an explicit .weights_stride =
(ne16_stride_t){ .d0 = ..., .d1 = ..., .d2 = ... } if you prefer the cast) so
the field name .weights_stride is used with a brace-enclosed initializer
compatible with ne16_stride_t within the ne16_cfg_t designated initializer.
🧹 Nitpick comments (8)
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (1)

7-11: Minor style nit: Consider grouping all imports before module-level assignments.

The assignment float32_tPtr = PointerClass(float32_t) on line 10 appears between import statements. Per PEP 8 conventions, imports should typically be grouped together at the top of the file before other statements.

♻️ Suggested reordering
 from Deeploy.AbstractDataTypes import PointerClass
 from Deeploy.CommonExtensions.DataTypes import float32_t
-
-float32_tPtr = PointerClass(float32_t)
 from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation
+
+float32_tPtr = PointerClass(float32_t)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py` around lines 7 - 11,
Move the module-level assignment float32_tPtr = PointerClass(float32_t) so all
import statements are grouped first: keep the imports for PointerClass,
float32_t, and DeeployTypes together at the top, then place the float32_tPtr
assignment immediately after those imports; ensure you reference PointerClass
and float32_t exactly as they are imported and leave NetworkContext,
NodeTemplate, OperatorRepresentation imports unchanged.
DeeployTest/test_platforms.py (1)

1004-1006: Unused config_name variable (consistent with existing tests).

The config_name variable is unpacked but never used. This matches the pattern in all other tiled test functions in this file (e.g., lines 317, 346, 808, etc.). For consistency, you could prefix it with an underscore to signal intent:

-    test_name, l1, config_name = test_params
+    test_name, l1, _config_name = test_params

This is a low-priority style nit since the existing codebase follows the same pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DeeployTest/test_platforms.py` around lines 1004 - 1006, In
test_gap9_w_ne16_tiled_kernels_l2_singlebuffer, the unpacked variable
config_name is unused; rename it to _config_name (or prefix with an underscore)
in the function signature to indicate it's intentionally unused and match the
existing tiled test patterns (refer to function
test_gap9_w_ne16_tiled_kernels_l2_singlebuffer).
.github/workflows/_runner-gap9-w-ne16-tiled.yml (1)

36-36: Consider logging pip install failures instead of silently ignoring them.

The || true suppresses all pip install failures silently. If the install genuinely fails, subsequent test runs will fail with confusing errors. Consider logging the failure while still allowing the workflow to proceed:

-          pip install -e . || true
+          pip install -e . || echo "WARNING: pip install failed, tests may not work correctly"

Alternatively, if the install is expected to always succeed in CI, remove || true to fail fast on broken dependencies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/_runner-gap9-w-ne16-tiled.yml at line 36, The workflow
step currently runs "pip install -e . || true", which silently swallows
installation failures; change this to either remove the "|| true" so the job
fails fast on pip install errors, or replace it with logic that captures the pip
exit status and logs the failure before continuing (e.g., run pip install -e .
and on failure emit a clear error log message referencing the pip install step);
locate the exact step containing "pip install -e . || true" in the workflow and
update it accordingly so failures are not silently ignored.
Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py (1)

161-162: Prefix unused unpacked variables with underscore.

Per static analysis hints, several unpacked variables are never used. Prefix them with underscore to indicate intentional non-use.

♻️ Suggested fix
-            (BatchOffset, HOffset, WOffset, COffset) = cube.offset
-            (BatchSize, HSize, WSize, CSize) = cube.dims
+            (_BatchOffset, _HOffset, _WOffset, COffset) = cube.offset
+            (_BatchSize, HSize, WSize, CSize) = cube.dims
-            inBSize, inHSize, inWSize, inCSize = InCube.dims
+            _inBSize, _inHSize, inWSize, inCSize = InCube.dims

Also applies to: 173-173

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 161
- 162, The tuple unpackings in NE16DenseConstraint where you extract cube.offset
into (BatchOffset, HOffset, WOffset, COffset) and cube.dims into (BatchSize,
HSize, WSize, CSize) (and the similar unpacking around line 173) include
variables that are never used; update these unpackings to prefix the unused
names with an underscore (e.g., _BatchOffset, _HOffset, etc.) so static analysis
recognizes intentional non-use while keeping any used variables unchanged and
preserving the tuple structure in the methods within the NE16DenseConstraint
class.
Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py (1)

79-80: Remove unused padding variable.

The padding variable is read from parseDict but never used in addPolicyConstraint.

♻️ Suggested fix
         strides = parseDict["strides"]
-        padding = parseDict["pads"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
79 - 80, In addPolicyConstraint (class NE16PointwiseConstraint) the local
variable padding is assigned from parseDict["pads"] but never used; remove the
unused assignment (the line `padding = parseDict["pads"]`) so there is no dead
variable left, or if padding is intended to be used, instead reference it where
needed—locate the parseDict usage in addPolicyConstraint and either delete the
padding assignment or integrate padding into the constraint logic.
Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py (2)

29-31: Remove unused dilation variable.

The dilation variable is extracted but never used in the function. Since NE16 doesn't support dilation (enforced by parser's dilations == [1, 1] check), this variable is unnecessary.

♻️ Suggested fix
         strides = parseDict["strides"]
         padding = parseDict["pads"]
-        dilation = parseDict["dilations"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
29 - 31, Remove the unused local variable `dilation` that is assigned from
`parseDict["dilations"]`; since the parser enforces `dilations == [1, 1]` and
NE16 does not support dilation, delete the line `dilation =
parseDict["dilations"]` (and any dead code referencing `dilation`) from the
function so only `strides = parseDict["strides"]` and `padding =
parseDict["pads"]` remain, ensuring no other code expects `dilation`.

160-162: Use already-looked-up outputBuffer instead of redundant lookup.

outputBuffer was already retrieved at line 153. Line 162 performs an unnecessary second lookup.

♻️ Suggested fix
             InCube, padding_tuple = Conv2DTileConstraint.computeInputCube((weightH, weightW), pads, strides, weightC,
                                                                           cube,
-                                                                          ctxt.lookup(varOut).shape)
+                                                                          outputBuffer.shape)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
160 - 162, Replace the redundant ctxt.lookup(varOut).shape call with the
already-fetched outputBuffer.shape in the NE16DepthwiseConstraint code: when
calling Conv2DTileConstraint.computeInputCube (which returns InCube and
padding_tuple), pass outputBuffer.shape instead of performing a second
ctxt.lookup(varOut) to avoid the duplicate lookup and use the previously
retrieved outputBuffer variable.
Deeploy/Targets/NE16/Parsers.py (1)

56-56: Use idiomatic boolean check instead of equality comparison to False.

Per Ruff E712, avoid explicit equality comparisons to False.

♻️ Suggested fix
         if not all([
-                channels_first == False,
+                not channels_first,
                 len(data_in.shape) == 4,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` at line 56, Replace the explicit equality
check "channels_first == False" with the idiomatic boolean negation "not
channels_first" wherever it appears (e.g., in the call/config constructing the
parser using the channels_first variable) to comply with Ruff E712 and avoid
direct comparison to False.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`:
- Line 34: The list comprehension that builds ne16Nodes uses
node.attrs["engine"] directly and can raise KeyError for nodes without an
"engine"; change it to safely check for the key (e.g., use
node.attrs.get("engine") == self.ne16EngineName or "engine" in node.attrs and
node.attrs["engine"] == self.ne16EngineName) so only nodes with an engine
attribute are compared against self.ne16EngineName; update the expression that
defines ne16Nodes accordingly in MemoryLevelAnnotationPasses.py.

---

Duplicate comments:
In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 352-356: Inside the designated initializer for task.data.cfg (the
.cfg = (ne16_cfg_t) { ... } block) the assignment to weights_stride uses an
invalid nested C cast-style initializer; replace it with a proper
designated-field initializer by setting .weights_stride = { .d0 =
${weightStrideD0}, .d1 = ${weightStrideD1}, .d2 = ${weightStrideD2} } (or
include an explicit .weights_stride = (ne16_stride_t){ .d0 = ..., .d1 = ..., .d2
= ... } if you prefer the cast) so the field name .weights_stride is used with a
brace-enclosed initializer compatible with ne16_stride_t within the ne16_cfg_t
designated initializer.

---

Nitpick comments:
In @.github/workflows/_runner-gap9-w-ne16-tiled.yml:
- Line 36: The workflow step currently runs "pip install -e . || true", which
silently swallows installation failures; change this to either remove the "||
true" so the job fails fast on pip install errors, or replace it with logic that
captures the pip exit status and logs the failure before continuing (e.g., run
pip install -e . and on failure emit a clear error log message referencing the
pip install step); locate the exact step containing "pip install -e . || true"
in the workflow and update it accordingly so failures are not silently ignored.

In `@Deeploy/Targets/NE16/Parsers.py`:
- Line 56: Replace the explicit equality check "channels_first == False" with
the idiomatic boolean negation "not channels_first" wherever it appears (e.g.,
in the call/config constructing the parser using the channels_first variable) to
comply with Ruff E712 and avoid direct comparison to False.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 161-162: The tuple unpackings in NE16DenseConstraint where you
extract cube.offset into (BatchOffset, HOffset, WOffset, COffset) and cube.dims
into (BatchSize, HSize, WSize, CSize) (and the similar unpacking around line
173) include variables that are never used; update these unpackings to prefix
the unused names with an underscore (e.g., _BatchOffset, _HOffset, etc.) so
static analysis recognizes intentional non-use while keeping any used variables
unchanged and preserving the tuple structure in the methods within the
NE16DenseConstraint class.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py`:
- Around line 29-31: Remove the unused local variable `dilation` that is
assigned from `parseDict["dilations"]`; since the parser enforces `dilations ==
[1, 1]` and NE16 does not support dilation, delete the line `dilation =
parseDict["dilations"]` (and any dead code referencing `dilation`) from the
function so only `strides = parseDict["strides"]` and `padding =
parseDict["pads"]` remain, ensuring no other code expects `dilation`.
- Around line 160-162: Replace the redundant ctxt.lookup(varOut).shape call with
the already-fetched outputBuffer.shape in the NE16DepthwiseConstraint code: when
calling Conv2DTileConstraint.computeInputCube (which returns InCube and
padding_tuple), pass outputBuffer.shape instead of performing a second
ctxt.lookup(varOut) to avoid the duplicate lookup and use the previously
retrieved outputBuffer variable.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 79-80: In addPolicyConstraint (class NE16PointwiseConstraint) the
local variable padding is assigned from parseDict["pads"] but never used; remove
the unused assignment (the line `padding = parseDict["pads"]`) so there is no
dead variable left, or if padding is intended to be used, instead reference it
where needed—locate the parseDict usage in addPolicyConstraint and either delete
the padding assignment or integrate padding into the constraint logic.

In `@Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py`:
- Around line 7-11: Move the module-level assignment float32_tPtr =
PointerClass(float32_t) so all import statements are grouped first: keep the
imports for PointerClass, float32_t, and DeeployTypes together at the top, then
place the float32_tPtr assignment immediately after those imports; ensure you
reference PointerClass and float32_t exactly as they are imported and leave
NetworkContext, NodeTemplate, OperatorRepresentation imports unchanged.

In `@DeeployTest/test_platforms.py`:
- Around line 1004-1006: In test_gap9_w_ne16_tiled_kernels_l2_singlebuffer, the
unpacked variable config_name is unused; rename it to _config_name (or prefix
with an underscore) in the function signature to indicate it's intentionally
unused and match the existing tiled test patterns (refer to function
test_gap9_w_ne16_tiled_kernels_l2_singlebuffer).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0e7e2d4c-444f-4b78-b2b5-22aef5e9a86f

📥 Commits

Reviewing files that changed from the base of the PR and between e41800e and 38f10f2.

📒 Files selected for processing (32)
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • CMakeLists.txt
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/conftest.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • DeeployTest/test_platforms.py
  • TargetLibraries/GAP9/CMakeLists.txt
✅ Files skipped from review due to trivial changes (5)
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/CMakeLists.txt
  • Deeploy/Targets/NE16/Tiler.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • TargetLibraries/GAP9/CMakeLists.txt
  • CMakeLists.txt
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/Platform.py

if hasattr(buffer, "_memoryLevel") and buffer._memoryLevel == self._weightMemoryLevel.name:
    weightMemoryOccupation += _ne16WeightBufferSize(buffer)

ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
⚠️ Potential issue | 🟡 Minor

Potential KeyError if node lacks engine attribute.

The list comprehension accesses node.attrs["engine"] without checking if the key exists. Nodes not assigned to an engine may lack this attribute.

🛡️ Proposed defensive fix
-        ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
+        ne16Nodes = [node for node in graph.nodes if node.attrs.get("engine") == self.ne16EngineName]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py` at
line 34, The list comprehension that builds ne16Nodes uses node.attrs["engine"]
directly and can raise KeyError for nodes without an "engine"; change it to
safely check for the key (e.g., use node.attrs.get("engine") ==
self.ne16EngineName or "engine" in node.attrs and node.attrs["engine"] ==
self.ne16EngineName) so only nodes with an engine attribute are compared against
self.ne16EngineName; update the expression that defines ne16Nodes accordingly in
MemoryLevelAnnotationPasses.py.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (5)
Deeploy/Targets/NE16/Deployer.py (1)

19-30: ⚠️ Potential issue | 🟠 Major

Avoid shared inputOffsets state across deployer instances.

inputOffsets = {} still creates one dict shared by every NE16Deployer built without an explicit argument, so per-instance state can leak between deployments.

Suggested fix
-                 inputOffsets = {}):
+                 inputOffsets = None):
+        if inputOffsets is None:
+            inputOffsets = {}
         super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)
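
The leak the comment describes is easy to reproduce outside Deeploy. A minimal standalone sketch (hypothetical classes, not the actual NE16Deployer):

```python
# Minimal reproduction of the mutable-default pitfall (hypothetical classes,
# not the actual NE16Deployer):
class Deployer:

    def __init__(self, inputOffsets = {}):  # one dict shared by every instance
        self.inputOffsets = inputOffsets


a = Deployer()
a.inputOffsets["x"] = 0x1000
b = Deployer()
assert "x" in b.inputOffsets  # state from `a` leaked into `b`


class SafeDeployer:

    def __init__(self, inputOffsets = None):
        self.inputOffsets = {} if inputOffsets is None else inputOffsets


c = SafeDeployer()
c.inputOffsets["x"] = 0x1000
assert "x" not in SafeDeployer().inputOffsets  # fresh dict per instance
```

The default expression `{}` is evaluated once at function-definition time, which is why the `None` sentinel pattern is the standard fix.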
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Deployer.py` around lines 19 - 30, The constructor
__init__ currently uses a mutable default inputOffsets = {} which shares state
across instances; change the parameter default to inputOffsets: Optional[Dict] =
None and inside __init__ (before calling super().__init__) set inputOffsets = {}
if it's None, then pass that local dict into super().__init__; update the
function signature and ensure any references to inputOffsets in
NE16Deployer.__init__ use the per-instance dict created when None is supplied.
Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py (1)

90-106: ⚠️ Potential issue | 🟠 Major

Widen before subtracting weight_offset.

If weightTensor.values is int8, values - weight_offset happens in the narrow dtype and can wrap before the later cast to uint8, corrupting the encoded weights.

Suggested fix
-    weight_offset = values.min()
-    values = values - weight_offset
+    weight_offset = int(values.min())
+    values = values.astype(np.int32) - weight_offset
     node.attrs["weight_offset"] = weight_offset
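
The wraparound can be demonstrated with a standalone NumPy sketch (illustrative values, not the actual weight tensors):

```python
import numpy as np

# Worst case: an int8 weight tensor spanning the full range.
w = np.array([-128, -1, 127], dtype = np.int8)
weight_offset = int(w.min())                    # -128

wrapped = w - np.int8(weight_offset)            # stays int8: 127 - (-128) wraps to -1
widened = w.astype(np.int32) - weight_offset    # safe: [0, 127, 255]

assert int(wrapped[2]) == -1                    # corrupted encoded weight
assert widened.astype(np.uint8).tolist() == [0, 127, 255]
```

Widening to int32 first keeps every shifted value in [0, 255], so the final cast to uint8 is lossless.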
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 90 -
106, The subtraction values = values - weight_offset can underflow when
weightTensor.values is int8; before computing the offset and subtracting, widen
values to a safe integer type (e.g., int16/int32) so the subtraction is
performed without wraparound, then clip/normalize and finally cast to np.uint8
before calling _weightEncode; update the operations around weight_offset,
values, and the subsequent transpose/astype steps (referencing weight_offset,
values, weightTensor.values, and _weightEncode) to perform widening,
subtraction, any needed clamping, and only then cast to uint8 for encoding.
Deeploy/Targets/NE16/Engine.py (1)

62-75: ⚠️ Potential issue | 🟠 Major

Grouped convolutions are still being classified as NE16 PW/DW.

isPWConv() still accepts grouped 1x1 convs, and isDWConv() still accepts any grouped 3x3 conv. That will color generic grouped convolutions for NE16 even though the downstream paths only handle regular pointwise (group == 1) or true depthwise weights.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Engine.py` around lines 62 - 75, isPWConv currently
allows grouped 1x1 convs and isDWConv accepts any grouped 3x3 conv; change the
predicates to only mark true pointwise (group == 1) and true depthwise (group ==
number of input channels) respectively: update isPWConv to include
node.attrs['group'] == 1 in its condition, and update isDWConv to require
node.attrs['group'] equal the convolution's input channel count (i.e., only
treat it as depthwise when group == input_channels) while keeping the existing
kernel_shape, dilation and stride checks.
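
The tightened predicates can be sketched against plain attribute dicts (hypothetical helpers, not the real Engine.py methods):

```python
def is_pw_conv(attrs: dict) -> bool:
    # True pointwise: 1x1 kernel and no grouping
    return attrs["kernel_shape"] == [1, 1] and attrs.get("group", 1) == 1

def is_dw_conv(attrs: dict, in_channels: int) -> bool:
    # True depthwise: 3x3 kernel and one group per input channel
    return attrs["kernel_shape"] == [3, 3] and attrs.get("group", 1) == in_channels

assert not is_pw_conv({"kernel_shape": [1, 1], "group": 4})      # grouped 1x1 rejected
assert not is_dw_conv({"kernel_shape": [3, 3], "group": 4}, 16)  # generic grouped conv
assert is_dw_conv({"kernel_shape": [3, 3], "group": 16}, 16)     # true depthwise
```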
Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py (1)

43-46: ⚠️ Potential issue | 🟠 Major

Model pointwise output geometry from stride, not identity.

These constraints only describe stride-1 pointwise conv. If NE16Engine.enableStrides is ever enabled, H_out/W_out become wrong immediately because the tiler still forces outputHeightVar == inputHeightVar and outputWidthVar == inputWidthVar.

Suggested fix
         # Map output dims to inputs dims
         tilerModel.addConstraint(outputBatchVar == inputBatchVar)
-        tilerModel.addConstraint(outputHeightVar == inputHeightVar)
-        tilerModel.addConstraint(outputWidthVar == inputWidthVar)
+        strides = parseDict["strides"]
+        padding = parseDict["pads"]
+        inputBuffer = ctxt.lookup(inputBufferName)
+        effectiveHeight = inputHeightVar + ((padding[0] + padding[2]) * (inputHeightVar == inputBuffer.shape[1]))
+        effectiveWidth = inputWidthVar + ((padding[1] + padding[3]) * (inputWidthVar == inputBuffer.shape[2]))
+        tilerModel.addConstraint(outputHeightVar == (effectiveHeight - 1) // strides[0] + 1)
+        tilerModel.addConstraint(outputWidthVar == (effectiveWidth - 1) // strides[1] + 1)
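
The ceil-division formula in the suggestion can be checked in isolation (hypothetical helper, not part of the tiler API):

```python
def conv1x1_out(extent: int, stride: int) -> int:
    # Integer ceil division: output extent of a 1x1 conv along one axis
    return (extent + stride - 1) // stride

assert conv1x1_out(16, 1) == 16  # stride 1 reduces to the current identity mapping
assert conv1x1_out(16, 2) == 8
assert conv1x1_out(17, 2) == 9   # odd extents round up
```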
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
43 - 46, The current constraints in NE16PointwiseConstraint map outputHeightVar
and outputWidthVar directly to inputHeightVar/inputWidthVar which only works for
stride=1; update the constraints to compute output dims from the pointwise
stride when NE16Engine.enableStrides is used (use the convolution output formula
for 1x1: H_out = ceil(inputHeight / strideH), W_out = ceil(inputWidth /
strideW)). Replace the equalities on outputHeightVar and outputWidthVar with
expressions that use the pointwise stride variables (e.g., strideH/strideW or
the engine's stride config) and implement the integer ceil division ((input +
stride - 1) / stride) so tilerModel.addConstraint uses outputHeightVar ==
(inputHeightVar + strideHVar - 1) / strideHVar and similarly for width; keep the
batch mapping (outputBatchVar == inputBatchVar) as-is.
Deeploy/Targets/NE16/Templates/ConvTemplate.py (1)

352-356: ⚠️ Potential issue | 🔴 Critical

Use a designated field initializer for weights_stride.

task.data.cfg.weights_stride = ... is not valid inside a C compound literal. This needs to be .weights_stride = ... like the neighboring fields, otherwise the generated kernel code will not compile.

Proposed fix
-            task.data.cfg.weights_stride = (ne16_stride_t) {
+            .weights_stride = (ne16_stride_t) {
                 .d0 = ${weightStrideD0},
                 .d1 = ${weightStrideD1},
                 .d2 = ${weightStrideD2}
             },
Is `task.data.cfg.weights_stride = ...` valid inside a C designated initializer, or must it be written as `.weights_stride = ...`?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 352 - 356,
Replace the invalid in-line assignment inside the C compound literal with a
designated initializer: change the occurrence of "task.data.cfg.weights_stride =
(ne16_stride_t) { ... }" so that within the compound literal for task.data.cfg
you use ".weights_stride = (ne16_stride_t) { .d0 = ..., .d1 = ..., .d2 = ... }"
(i.e., reference the struct member as .weights_stride) so the generated kernel
compiles; update the initializer near the other designated fields in the same
compound literal where task.data.cfg is constructed.
🧹 Nitpick comments (1)
Deeploy/Targets/NE16/TileConstraints/__init__.py (1)

5-5: Consider using explicit submodule imports for clarity.

The from . import * statement in a package __init__.py doesn't automatically import submodules—it only imports names from __all__ (if defined) in the current package's namespace. For a package that simply exposes its submodules, explicit imports are clearer:

from . import NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint, RequantHelpers

This same pattern appears in several other new NE16 __init__.py files. If the current behavior is intentional (relying on consumers to import submodules directly), this can remain as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/__init__.py` at line 5, Replace the
wildcard import in the package __init__ (from . import *) with explicit
submodule imports to make exported symbols clear: import the submodules
NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint and
RequantHelpers (e.g., from . import NE16DenseConstraint,
NE16DepthwiseConstraint, NE16PointwiseConstraint, RequantHelpers) so consumers
get those modules via the package namespace; alternatively, if you intended to
expose names rather than modules, define an explicit __all__ listing the public
symbols instead of using a star import.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 32-40: The constructor is mutating the shared lowering optimizer
in place (self.loweringOptimizer.passes) which causes cross-deployer pollution;
instead clone the optimizer or at least its passes list before modifying it
(e.g., deep-copy the optimizer or assign a new passes list based on
list(self.loweringOptimizer.passes)) and then perform the PULPNCHWtoNHWCPass ->
NCHWtoNHWCPass replacement and append ConvEngineDiscolorationPass() and
NE16OptimizationPass(self.default_channels_first, "NE16") to the cloned
optimizer so the module-scoped NE16Optimizer from Platform.py is not mutated
across instances.

In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 37-40: The NE16 parser's parseNodeCtxt currently defaults
channels_first=True but the implementation only accepts False; change the
default parameter in parseNodeCtxt to channels_first: bool = False and update
the three RQS override method signatures to also default channels_first to False
so callers use NHWC by default and the method signatures match the only
supported layout (refer to parseNodeCtxt and the three RQS override methods to
locate and update their signatures).

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py`:
- Around line 52-58: The current constraint only pins weightShape[0] via
weightOutChannelVar.Max(), but the DW packed weight tensor is emitted/loaded as
a full encoded block (so all dims must be pinned); update the tilerModel
constraints for the packed DW weight (located where weightBuffer =
ctxt.lookup(weightBufferName) and tilerModel.addConstraint(...) is called) to
set every shape dimension of that buffer to its Max() instead of only
weightOutChannelVar—e.g., iterate over the buffer's shape variables (weightShape
or equivalent per-dimension vars obtained from weightBuffer) and call
tilerModel.addConstraint(dimVar == dimVar.Max()) for each dimension so the tiler
cannot shrink any axis while codegen still DMA-loads the full packed block.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 139-150: The _findAllMultiplicands function currently uses float
division and an exclusive upper bound which breaks integer factorization; change
the loop to iterate i from 2 through int(math.sqrt(x)) inclusive (or use
math.isqrt(x)+1) and use integer division (tmpX //= i) when extracting factors
so tmpX stays an int; ensure the final remaining factor appended for x //
math.prod(multiplicands) is computed with integer arithmetic and that
multiplicands contains ints so callers like _bestReshapeOption receive correct
balanced factor pairs.
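
The corrected factorization can be sketched as a standalone helper (hypothetical name, not the actual _findAllMultiplicands):

```python
import math

def find_all_multiplicands(x: int) -> list:
    # Prime factorization with an inclusive isqrt bound and integer division
    factors = []
    tmp = x
    for i in range(2, math.isqrt(x) + 1):  # inclusive upper bound catches i == sqrt(x)
        while tmp % i == 0:
            factors.append(i)
            tmp //= i                      # integer division keeps tmp an int
    if tmp > 1:
        factors.append(tmp)                # remaining prime factor
    return factors

assert find_all_multiplicands(36) == [2, 2, 3, 3]
assert find_all_multiplicands(49) == [7, 7]  # fails with an exclusive sqrt bound
assert math.prod(find_all_multiplicands(97)) == 97
```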

---

Duplicate comments:
In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 19-30: The constructor __init__ currently uses a mutable default
inputOffsets = {} which shares state across instances; change the parameter
default to inputOffsets: Optional[Dict] = None and inside __init__ (before
calling super().__init__) set inputOffsets = {} if it's None, then pass that
local dict into super().__init__; update the function signature and ensure any
references to inputOffsets in NE16Deployer.__init__ use the per-instance dict
created when None is supplied.

In `@Deeploy/Targets/NE16/Engine.py`:
- Around line 62-75: isPWConv currently allows grouped 1x1 convs and isDWConv
accepts any grouped 3x3 conv; change the predicates to only mark true pointwise
(group == 1) and true depthwise (group == number of input channels)
respectively: update isPWConv to include node.attrs['group'] == 1 in its
condition, and update isDWConv to require node.attrs['group'] equal the
convolution's input channel count (i.e., only treat it as depthwise when group
== input_channels) while keeping the existing kernel_shape, dilation and stride
checks.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 352-356: Replace the invalid in-line assignment inside the C
compound literal with a designated initializer: change the occurrence of
"task.data.cfg.weights_stride = (ne16_stride_t) { ... }" so that within the
compound literal for task.data.cfg you use ".weights_stride = (ne16_stride_t) {
.d0 = ..., .d1 = ..., .d2 = ... }" (i.e., reference the struct member as
.weights_stride) so the generated kernel compiles; update the initializer near
the other designated fields in the same compound literal where task.data.cfg is
constructed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 43-46: The current constraints in NE16PointwiseConstraint map
outputHeightVar and outputWidthVar directly to inputHeightVar/inputWidthVar
which only works for stride=1; update the constraints to compute output dims
from the pointwise stride when NE16Engine.enableStrides is used (use the
convolution output formula for 1x1: H_out = ceil(inputHeight / strideH), W_out =
ceil(inputWidth / strideW)). Replace the equalities on outputHeightVar and
outputWidthVar with expressions that use the pointwise stride variables (e.g.,
strideH/strideW or the engine's stride config) and implement the integer ceil
division ((input + stride - 1) / stride) so tilerModel.addConstraint uses
outputHeightVar == (inputHeightVar + strideHVar - 1) / strideHVar and similarly
for width; keep the batch mapping (outputBatchVar == inputBatchVar) as-is.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 90-106: The subtraction values = values - weight_offset can
underflow when weightTensor.values is int8; before computing the offset and
subtracting, widen values to a safe integer type (e.g., int16/int32) so the
subtraction is performed without wraparound, then clip/normalize and finally
cast to np.uint8 before calling _weightEncode; update the operations around
weight_offset, values, and the subsequent transpose/astype steps (referencing
weight_offset, values, weightTensor.values, and _weightEncode) to perform
widening, subtraction, any needed clamping, and only then cast to uint8 for
encoding.

---

Nitpick comments:
In `@Deeploy/Targets/NE16/TileConstraints/__init__.py`:
- Line 5: Replace the wildcard import in the package __init__ (from . import *)
with explicit submodule imports so the exported symbols are clear, e.g.
from . import NE16DenseConstraint, NE16DepthwiseConstraint,
NE16PointwiseConstraint, RequantHelpers; consumers then get those modules via
the package namespace. Alternatively, if the intent is to expose names rather
than modules, define an explicit __all__ listing the public symbols instead of
using a star import.

📥 Commits

Reviewing files that changed from the base of the PR and between 38f10f2 and 748707a.

📒 Files selected for processing (35)
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • CMakeLists.txt
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/inputs.npz
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/network.onnx
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/outputs.npz
  • DeeployTest/conftest.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • DeeployTest/test_platforms.py
  • TargetLibraries/GAP9/CMakeLists.txt
✅ Files skipped from review due to trivial changes (4)
  • DeeployTest/conftest.py
  • Deeploy/Targets/NE16/Tiler.py
  • DeeployTest/testUtils/core/execution.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
🚧 Files skipped from review as they are similar to previous changes (9)
  • DeeployTest/testUtils/platformMapping.py
  • TargetLibraries/GAP9/CMakeLists.txt
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • Deeploy/Targets/NE16/Bindings.py
  • DeeployTest/test_gap9_ne16_tiled_config.py

Comment on lines +37 to +40
def parseNodeCtxt(self,
ctxt: NetworkContext,
node: gs.Node,
channels_first: bool = True) -> Tuple[NetworkContext, bool]:
⚠️ Potential issue | 🟠 Major

Make the NE16 parser default to NHWC.

parseNodeCtxt() defaults channels_first to True, but Line 56 rejects anything except False. Any caller that uses the default path will fail to parse NE16 convolutions. Flip this default here and mirror it in the three RQS overrides so the method signature matches the only supported layout.

Proposed fix
-    def parseNodeCtxt(self,
-                      ctxt: NetworkContext,
-                      node: gs.Node,
-                      channels_first: bool = True) -> Tuple[NetworkContext, bool]:
+    def parseNodeCtxt(self,
+                      ctxt: NetworkContext,
+                      node: gs.Node,
+                      channels_first: bool = False) -> Tuple[NetworkContext, bool]:
@@
-                channels_first == False,
+                not channels_first,
-    def parseNodeCtxt(self,
-                      ctxt: NetworkContext,
-                      node: gs.Node,
-                      channels_first: bool = True) -> Tuple[NetworkContext, bool]:
+    def parseNodeCtxt(self,
+                      ctxt: NetworkContext,
+                      node: gs.Node,
+                      channels_first: bool = False) -> Tuple[NetworkContext, bool]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` around lines 37 - 40, The NE16 parser's
parseNodeCtxt currently defaults channels_first=True but the implementation only
accepts False; change the default parameter in parseNodeCtxt to channels_first:
bool = False and update the three RQS override method signatures to also default
channels_first to False so callers use NHWC by default and the method signatures
match the only supported layout (refer to parseNodeCtxt and the three RQS
override methods to locate and update their signatures).

Comment on lines +52 to +58
weightBuffer = ctxt.lookup(weightBufferName)
# NE16 DW weight is packed as a single (1, 1, packed_bytes) block
# containing all output channels (up to NE16_SUBTILE_INPUT_CHANNEL=16).
# Keep the outermost dim fixed at its full (=1) value regardless of
# the output channel tiling.
tilerModel.addConstraint(weightOutChannelVar == weightOutChannelVar.Max())

⚠️ Potential issue | 🟠 Major

Pin the whole packed DW weight tensor, not just dim 0.

Line 57 fixes only weightShape[0], but Lines 227-228 always emit a load for the full encoded weight block. That lets the tiler under-account weight memory by shrinking the packed-bytes axis, then codegen still DMA-loads the entire tensor. Constrain every dimension of the encoded DW weight buffer to Max() for this path.

Proposed fix
         weightBuffer = ctxt.lookup(weightBufferName)
         # NE16 DW weight is packed as a single (1, 1, packed_bytes) block
         # containing all output channels (up to NE16_SUBTILE_INPUT_CHANNEL=16).
-        # Keep the outermost dim fixed at its full (=1) value regardless of
-        # the output channel tiling.
-        tilerModel.addConstraint(weightOutChannelVar == weightOutChannelVar.Max())
+        # Keep the full encoded block resident regardless of output-channel tiling.
+        for dimIdx in range(len(weightBuffer.shape)):
+            weightDimVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = dimIdx)
+            tilerModel.addConstraint(weightDimVar == weightDimVar.Max())

Also applies to: 225-228

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
52 - 58, The current constraint only pins weightShape[0] via
weightOutChannelVar.Max(), but the DW packed weight tensor is emitted/loaded as
a full encoded block (so all dims must be pinned); update the tilerModel
constraints for the packed DW weight (located where weightBuffer =
ctxt.lookup(weightBufferName) and tilerModel.addConstraint(...) is called) to
set every shape dimension of that buffer to its Max() instead of only
weightOutChannelVar—e.g., iterate over the buffer's shape variables (weightShape
or equivalent per-dimension vars obtained from weightBuffer) and call
tilerModel.addConstraint(dimVar == dimVar.Max()) for each dimension so the tiler
cannot shrink any axis while codegen still DMA-loads the full packed block.

Comment on lines +139 to +150
def _findAllMultiplicands(x: int) -> List[int]:
multiplicands = []
tmpX = x
for i in range(2, math.ceil(math.sqrt(x))): # Ceil cause range doesn't include the last number
while tmpX % i == 0:
multiplicands.append(i)
tmpX = tmpX / i

if x // math.prod(multiplicands) > 1:
multiplicands.append(x // math.prod(multiplicands))

return multiplicands
⚠️ Potential issue | 🟠 Major

Use integer factorization here.

tmpX = tmpX / i turns the accumulator into a float, and the current loop bound skips exact square roots. For dimensions like 4 or 49 this misses balanced factor pairs, so _bestReshapeOption() falls back to (dim, 1) and defeats the pointwise-reshape heuristic.

Suggested fix
 def _findAllMultiplicands(x: int) -> List[int]:
     multiplicands = []
     tmpX = x
-    for i in range(2, math.ceil(math.sqrt(x))):  # Ceil cause range doesn't include the last number
+    for i in range(2, math.isqrt(x) + 1):
         while tmpX % i == 0:
             multiplicands.append(i)
-            tmpX = tmpX / i
+            tmpX //= i

-    if x // math.prod(multiplicands) > 1:
-        multiplicands.append(x // math.prod(multiplicands))
+    if tmpX > 1:
+        multiplicands.append(tmpX)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 139 -
150, The _findAllMultiplicands function currently uses float division and an
exclusive upper bound which breaks integer factorization; change the loop to
iterate i from 2 through int(math.sqrt(x)) inclusive (or use math.isqrt(x)+1)
and use integer division (tmpX //= i) when extracting factors so tmpX stays an
int; ensure the final remaining factor appended for x //
math.prod(multiplicands) is computed with integer arithmetic and that
multiplicands contains ints so callers like _bestReshapeOption receive correct
balanced factor pairs.
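For reference, a self-contained version of the corrected factorization (renamed find_all_multiplicands here for illustration) that keeps everything in integer arithmetic and includes exact square roots:

```python
import math
from typing import List

def find_all_multiplicands(x: int) -> List[int]:
    # Trial division with a bound of math.isqrt(x) + 1 so exact
    # square roots (4, 49, ...) are not skipped; tmp stays an int
    # because we use floor division instead of '/'.
    multiplicands: List[int] = []
    tmp = x
    for i in range(2, math.isqrt(x) + 1):
        while tmp % i == 0:
            multiplicands.append(i)
            tmp //= i
    if tmp > 1:
        multiplicands.append(tmp)
    return multiplicands
```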

runwangdl and others added 2 commits April 14, 2026 10:43
Mirrors the Siracusa_w_neureka pattern. NE16Platform extends GAP9Platform
with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends
GAP9Deployer (reuses ClDma transformers via GAP9Bindings).

New Target: Deeploy/Targets/NE16/ (Platform, Engine, Bindings, Parsers,
Tiler, Deployer, Templates, TileConstraints, TopologyOptimizationPasses).
The _weightEncode function is ported from pulp-nnx/test/Ne16Weight.py
(single CIN_SUBTILE=16 mode, no 1x1 vs 3x3 split). ConvTemplate subtile
constants set per ne16_task_defs.h (output 3x3, weight stride bytes
PW=16 DW/Dense=144).

New test infrastructure:
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/test_gap9_ne16_tiled_config.py (PW/DW/Dense RQ Conv)

DeeployTest wiring:
- testUtils/platformMapping.py: register GAP9_w_NE16 in the platforms
  list, mapPlatform, setupMemoryPlatform, mapDeployer.
- testMVP.py: include GAP9_w_NE16 in the EngineColoringDeployerWrapper
  branch (without it NE16AdjustWeightMemoryLayoutPass never fires and
  parsing backtracks to exhaustion).
- testUtils/core/execution.py: build the GAP9 SDK 'image' target for
  GAP9_w_NE16 too (so chip.soc.mram.bin is produced before gvsoc run).
- CMakeLists.txt, DeeployTest/CMakeLists.txt: accept GAP9_w_NE16
  alongside GAP9 in the platform branches.
- TargetLibraries/GAP9/CMakeLists.txt: for GAP9_w_NE16 platform,
  add_subdirectory on pulp-nnx with USE_NE16=ON and link it into
  deeploygap9.

Fix: Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py referenced
an undefined symbol float32_tPtr from Deeploy.AbstractDataTypes; define
it locally via PointerClass(float32_t) to unblock the import chain
reached by NE16Platform.

Verified on gvsoc gap9.evk:
  PW 1x1 RQ  (Regular_RQ):    0/1152 errors, 901917 cycles
  DW 3x3 RQ  (DW_2D_RQ):      0/1280 errors, 27339  cycles  (--enable-3x3)
  Dense 3x3  (Regular_2D_RQ): 0/6372 errors, 244595 cycles  (--enable-3x3)
- Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings
- The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format
- The template detects whether the input is signed; if so, it adds a +128 offset to the input during C runtime and compensates via the bias
- Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates
- Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift
- Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer
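The signed-input compensation from the third bullet above can be sanity-checked with a small NumPy identity; shapes and variable names here are illustrative only:

```python
import numpy as np

# Identity behind the +128 input offset with bias compensation:
#   W @ (x + 128) + (b - 128 * W.sum(axis=1)) == W @ x + b
rng = np.random.default_rng(0)
W = rng.integers(-8, 8, size=(4, 16)).astype(np.int32)
x = rng.integers(-128, 128, size=16).astype(np.int32)
b = rng.integers(-100, 100, size=4).astype(np.int32)

unsigned_x = x + 128                  # input offset applied at runtime
comp_bias = b - 128 * W.sum(axis=1)   # bias absorbs the offset term
assert np.array_equal(W @ unsigned_x + comp_bias, W @ x + b)
```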

Bug fixes:
- Add output signedness check in QuantChecker
- Fix L3 DMA template (add proper casts) and remove the blocking L3 DMA hack
- Isolate dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3
- Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers

Co-authored-by: runwangdl <samanthawangdl@gmail.com>
@runwangdl runwangdl force-pushed the gap9-ne16 branch 2 times, most recently from b8087fc to b3f40e5 Compare April 14, 2026 10:50
- TargetLibraries/GAP9/CMakeLists.txt: rename CNN_Libraries_NE16 →
  CNN_Libraries_HWPE (the actual gap9-sdk path); skip SDK
  CNN_BasicKernels_NE16.c source for GAP9_w_NE16 platform (it uses the
  pulp-nnx ne16 stack, so the SDK NE16 kernels are not needed).
- Deeploy/Targets/NE16/Platform.py: instantiate the GAP9ClusterEngine
  with a trimmed includeList (no CNN_BasicKernels_NE16.h /
  ne16_utils.h / CNN_Copy.h) so the generated Network.c does not pull
  in the SDK NE16 header alongside pulp-nnx ne16_task_defs.h — the
  NE16_REG_* macros are defined in both and trigger -Werror redefs.
ghcr.io/pulp-platform/deeploy-gap9:* is hosted in pulp-platform's
private GitHub Container Registry. Only upstream's self-hosted
runners have credentials to pull it; on fork CI runs (ubuntu-latest)
the docker pull fails with 'Error response from daemon: denied' and
the whole job is reported as failure.

Guard the select-env entry of all three gap9 workflows
(ci-platform-gap9.yml, -tiled.yml, -w-ne16-tiled.yml) so they SKIP
cleanly on forks instead of FAILING. Upstream behaviour is unchanged.
QuantChecker.checkOutputType (added by the NE16-Linear PR) requires
opSigned == outputTypeSigned. Existing Generic and PULPOpen bindings
only registered the signed-int8 output variant, so any Quant pattern
with signed=0 (e.g. 4-bit unsigned quantization in
Models/Transformer_DeepQuant) had no candidate and parsing exhausted
backtracking.

Add uint8 output to BasicQuantBindings and uint8 input to
BasicDequantBindings in both Targets/Generic/Bindings.py and
Targets/PULPOpen/Bindings.py.

Verified: Models/Transformer_DeepQuant network gen now succeeds for
both Generic and Siracusa platforms.
The Snitch FP32 GEMM/TransB-5000 build OOMs the GitHub-hosted runner
('std::bad_alloc' from the C compiler driver) when 4 pytest-xdist
workers compile in parallel. Two workers leave enough headroom on
the standard 7-GB runner.

(Pre-existing flake; surfaced as a hard fail in CI runs that happen
to land both heavy FP32 GEMM compilations on adjacent workers.)