Sync with Microsoft ONNX Runtime - 20062026 by ai-fw-intg · Pull Request #1154 · intel/onnxruntime

ai-fw-intg · 2026-06-19T20:33:57Z

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

…9067)

## Description

The symmetric INT4/INT8 MoE decode GEMV could emit `NaN`/garbage when a GEMM reduction dimension was not a whole multiple of the 64-element interleaved-weight K tile (e.g. `intermediate_size` such as 544). The interleaved weight layout's CUTLASS K iterator reads K in whole tiles of 64; a partial final tile makes threads read past the valid activation range.

This PR fixes the decode GEMV selection gate to reject such shapes, adds an explicit up-front validation in the QMoE op so the grouped GEMM path fails with a clear error instead of silently producing wrong results, and folds in several QMoE review-feedback cleanups (checked size arithmetic, env-var parsing, and documentation).

## Summary of Changes

### NaN fix and hardening (INT weight-only path)

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu` | `is_moe_gemv_supported` now rejects `k % kTileSizeK != 0` (64), so the decode GEMV is not selected for a partial final K tile and the path falls back to the grouped GEMM. |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` | Added an up-front guard for `quant_type == "int"`: `hidden_size` (fc1.K) and `inter_size` (fc2.K) must be multiples of 64 (the interleaved-weight K tile); otherwise return `INVALID_ARGUMENT` with a clear message instead of computing garbage. |

### Review-feedback cleanups

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/moe/moe.cc` | Use `SafeInt<size_t>` for scratch byte-count arithmetic (expanded rows × element sizes) feeding a single allocation. |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` | Same `SafeInt<size_t>` scratch-size hardening. |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu` | Parse `ORT_MOE_GEMV_FP16_ACCUM` via `ParseEnvironmentVariableWithDefault<int>`; include `env_var_utils.h` after `dispatcher.h` (SHARED_PROVIDER guard ordering, documented inline). |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_kernels.cu` | Parse `ORT_DISABLE_MOE_GEMV` via the same helper; clarify fast-path comments (symmetric INT4/INT8, per-column or block-wise). |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/threadblock/dq_mma_base.h` | Comment explaining the worst-case sizing of `kFinegrainedScaleRowsPerStage` for the smallest fine-grained group size. |

### Documentation

| File | Change |
|------|--------|
| `docs/contrib_ops/cuda/moe_qmoe.md` | Document the `swiglu_fusion=0` + SwiGLU backward-compatibility remap (gpt-oss-20b interleaved layout) and the one-time warning. |
| `docs/contrib_ops/cuda/qmoe_gemv_experiments.md` | Note that recorded numbers are point-in-time baselines tied to the listed GPU/driver/CUDA/ORT build. |

## Testing

- `python -m pytest onnxruntime/test/python/transformers/test_qmoe_cuda.py -k "gemv or swiglu or block or PrePack or prepack"`: 84 passed, 6 skipped on H200 (sm90).
- New `TestSwigluQMoE::test_swiglu_qmoe_int_partial_ktile_rejected` builds an `inter_size=544` (= 17×32, partial 64 tile) INT8 SwiGLU QMoE and asserts the run raises `"inter_size to be a multiple of 64"`.
- New `TestSwigluQMoE::test_swiglu_qmoe_fusion0_remap_parity` exercises the `swiglu_fusion=0` → interleaved remap parity.
- `TestQMoEIntPrePackSmoke::test_int4_swiglu_interleaved_small` bumped from `inter_size=32` (a now-rejected partial K tile) to `64`.
- `ORT_ENABLE_FP4_GEMV=1 python -m pytest onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py`: no failures (the new guard is scoped to `quant_type == "int"`, so FP4/FP8 are unaffected).
- `lintrunner` clean on the changed C++ and Python files.

## Motivation and Context

The interleaved column-major weight layout (`ColumnMajorTileInterleave<64, …>`) requires the GEMM reduction dim K to be a whole multiple of `ThreadblockK` (64 for fp16/bf16 activations). The single-matrix `fpA_intB` GEMM already throws on this, but the grouped MoE GEMM and the decode GEMV had no equivalent guard and silently produced `NaN`/garbage. This PR closes that gap at the QMoE boundary (clear error) and in the GEMV dispatch gate (safe fallback). No supported, 64-aligned shape changes behavior.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (rejected shapes were already producing incorrect output)
- [ ] CI passes
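The K-tile rule described above can be illustrated with a small Python sketch. It is not the C++ code from the PR: the function names `is_k_tile_aligned` and `validate_int_qmoe_shapes` and the error wording are hypothetical, and only the tile size of 64 and the `hidden_size` / `inter_size` requirement come from the description.

```python
K_TILE = 64  # interleaved-weight K tile size stated in the PR description


def is_k_tile_aligned(k: int, tile: int = K_TILE) -> bool:
    """Return True when the reduction dim k is a whole multiple of the tile."""
    return k > 0 and k % tile == 0


def validate_int_qmoe_shapes(hidden_size: int, inter_size: int) -> None:
    """Hypothetical up-front guard: fail loudly instead of computing garbage."""
    for name, k in (("hidden_size", hidden_size), ("inter_size", inter_size)):
        if not is_k_tile_aligned(k):
            # Error text is illustrative, not the message emitted by ORT.
            raise ValueError(
                f"expected {name} to be a multiple of {K_TILE}, got {k}"
            )
```

With these definitions, `inter_size=544` leaves a remainder of 32 and is rejected, while a 64-aligned value such as 512 passes the check.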

…icrosoft#28985)

### Description

Adds a decode-optimized CUDA path for the `LinearAttention` contrib op (the gated-delta / linear-attention recurrence used by hybrid models such as Qwen3-Next / Qwen3.6). The existing recurrent kernels are tuned for prefill; at decode (`seq_len == 1`) they leave the GPU mostly idle. This PR adds two decode-specific kernels that saturate the GPU and access the recurrent state in a coalesced pattern, **without changing the op's `present_state` layout or numerics**.

### Motivation

`LinearAttentionRecurrentKernelFixedShape` launches one block per `(batch, kv_head)` and keeps the full `d_k x d_v` state in shared memory across the token loop, with block-wide `__syncthreads` at every step. That design amortizes state I/O during prefill, but at decode it:

- launches only `kv_num_heads` blocks (e.g. 32), a fraction of the SMs, and
- gets no amortization from the shared-memory state cache (one token per launch), so the op is latency-bound.

On an H200 profile of Qwen3.6-35B-A3B it was the single most expensive decode kernel after the dense/MoE GEMMs (~0.69 ms/token).

### Key Changes

All in `onnxruntime/contrib_ops/cuda/bert/linear_attention_impl.cu`:

| Addition | Description |
|---|---|
| `warp_reduce_sum` | `__shfl_xor_sync` full-warp sum helper. |
| `LinearAttentionDecodeKernel<T, DK>` | Warp-per-column decode kernel: grid `(kv_num_heads, batch, ceil(d_v/4))`, each warp owns one output column with the state column sharded into registers; reductions via warp shuffles, no shared memory, no block barriers. Handles any `d_v`. |
| `LinearAttentionDecodeColKernel<T, DK>` | Column-per-thread decode kernel (default): one thread owns a full state column in registers. For a fixed row `i`, consecutive threads read consecutive addresses `i*d_v + col`, so state load/store is fully **coalesced with no transpose**; the row-major `[d_k, d_v]` `present_state` layout is unchanged. Used when `d_v % 32 == 0`; otherwise falls back to the warp kernel. |
| Dispatch in `LaunchLinearAttentionKernel` | Routes `seq_len <= 16` and `d_k in {64,128,256}` to the decode kernels; all other shapes fall through to the existing recurrent kernels, so the **prefill path is unchanged**. |

Both kernels cover the full op semantics: `linear` / `gated` / `delta` / `gated_delta` update rules, scalar and per-key-dim decay, per-head and scalar beta, standard GQA and inverse GQA, and `n_k_heads` K-sharing.

### Performance

H200, Qwen3.6-35B-A3B (INT4), single-sequence decode, CUDA graph on. Kernel time measured with Nsight Systems (steady-state, warmup excluded):

| Kernel | Time / token |
|---|---|
| `LinearAttentionRecurrentKernelFixedShape` (existing) | 693 µs |
| `LinearAttentionDecodeKernel` (warp-per-column) | 346 µs (2.0x) |
| `LinearAttentionDecodeColKernel` (column-per-thread) | **202 µs (3.4x)** |

End-to-end decode throughput improved measurably (the kernel is ~half its prior cost), with no change to prefill.

### Testing

- All 26 `ContribOpLinearAttentionTest` parity tests pass (the decode kernels are exercised by the single-token, inverse-GQA, KGQA, and Qwen3.5-like cases):
  ```
  ./onnxruntime_provider_test --gtest_filter='*LinearAttention*'
  ```
- No public API or `present_state` layout change; the decode path is opt-in by shape and falls back to the existing kernels for unsupported `d_k` / `d_v`.

### Motivation and Context

Decode-throughput optimization for hybrid linear-attention + MoE models. No breaking changes; numerics and output layout are preserved.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
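The dispatch rule and the coalesced addressing described in the Key Changes table can be sketched in Python as follows. This is only an illustration of the stated shape conditions: the function names and the returned labels are hypothetical, and the real routing lives in the CUDA source `LaunchLinearAttentionKernel`.

```python
DECODE_MAX_SEQ_LEN = 16            # from the PR text: seq_len <= 16
DECODE_SUPPORTED_DK = {64, 128, 256}  # from the PR text: d_k in {64,128,256}


def select_linear_attention_kernel(seq_len: int, d_k: int, d_v: int) -> str:
    """Return a label for the kernel family the described dispatch would pick."""
    if seq_len <= DECODE_MAX_SEQ_LEN and d_k in DECODE_SUPPORTED_DK:
        # Column-per-thread kernel is the default when d_v is a multiple of 32.
        if d_v % 32 == 0:
            return "decode_col"
        # Otherwise the warp-per-column kernel handles any d_v.
        return "decode_warp"
    # Every other shape keeps using the existing recurrent (prefill) kernels.
    return "recurrent"


def state_offset(row: int, col: int, d_v: int) -> int:
    """Flat offset of state[row][col] in the row-major [d_k, d_v] layout."""
    return row * d_v + col
```

For a fixed row, `state_offset(row, col, d_v)` increases by one per column, which is why consecutive threads owning consecutive columns touch consecutive addresses without any transpose of the state.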

…(onnx#8068, microsoft#28904) (microsoft#28958)

## Summary

Aligns the opset-24 ONNX-domain `Attention` kernels (CPU + CUDA) with the ONNX errata onnx/onnx#8068 (tracking RFC onnx/onnx#8054) for the **external static KV-cache** path: keyed by `nonpad_kv_seqlen` (input #6), no `past_key`. **Addresses microsoft#28904.**

## What changed

1. **Bottom-right `is_causal` alignment.** Per batch, `offset[b] = nonpad_kv_seqlen[b] − q_sequence_length`; a query at in-block index `i` attends key `j` iff `j <= i + offset[b]`. Applied on CPU and on the CUDA Flash / Memory-Efficient (MEA) / unfused paths. The MEA causal-alignment selector is now offset-aware (no unconditional top-left when an external cache is present); Flash's native bottom-right + per-batch `seqlens_k` is used where eligible.
2. **Fully-masked-row → 0 guard (Bug-2).** A query row with no allowed key now outputs a **zero** row instead of mean-of-V (the finite-sentinel softmax result). Detected with an exact per-key structural predicate (`isneginf`-equivalent) and zeroed with **select (not multiply)** before `P @ V`, so `0 @ V = 0`. Added on CPU and the CUDA MEA path. The Flash `is_causal` + `seqlens_k` path (`offset >= 0`) cannot produce a fully-masked row and is intentionally left unguarded. Bool-mask conversion was already select-not-multiply on both EPs (Bug-1 satisfied; no change needed).
3. **Reject removal.** Removes the CUDA `NOT_IMPLEMENTED` reject for `is_causal` + `nonpad_kv_seqlen` with `S_q != total_kv` and no `past_key`: the spec now *defines* this result, so the op computes it rather than rejecting.

Full-prefill (`offset = 0`) and `past_key` decode paths remain **bit-identical**. Contrib `MultiHeadAttention` / `GroupQueryAttention` consume the shared FMHA kernels and are **unchanged**; only the ONNX-domain `Attention` dispatch is retargeted.

## Test coverage

- **C++ `AttentionTest` gtests: 73/73 pass**, including new bottom-right-offset, structural-empty causal row → 0 (CPU + CUDA), and fp16 fully-masked-row goldens.
- **Python `test_onnx_attention`: 277/0**, including the updated `test_tensorscatter_attention.py` (stale negative-reject → positive bottom-right acceptance).
- QA final gate: from-scratch Debug build green.

## Preemptive onnx#8068 node-test skips (de-skip TODO)

This branch adds the new onnx/onnx#8068 `Attention` backend node tests to both skip lists so they don't fail before the onnx dependency is bumped:

- `onnxruntime/test/testdata/onnx_backend_test_series_filters.jsonc`
- `onnxruntime/test/onnx/TestCase.cc` (C++ `GetBrokenTests`)

These are a **no-op on the current onnx pin (v1.21.0)**.

**TODO (de-skip):** remove **both** skip lists once `cmake/external/onnx` is bumped to a release containing onnx#8068.

## Deferred follow-up

`q_seq > 1` Python bottom-right **parity** coverage requires upgrading the `test_onnx_attention` suite's numpy/torch reference functions from **total-kv-relative** causal (`offset = kv_seq − q_seq`) to **nonpad-relative** bottom-right (`offset = nonpad_kv_seqlen − q_seq`); a naive `is_causal=1` flip on the current refs is a no-op or a false failure against the correct kernel. The `q_seq > 1` / `nonpad < q_seq` behavior (including structural-empty rows) is already locked by the C++ gtest goldens. Tracked as follow-up.

## References

- onnx/onnx#8068: spec + reference errata (bottom-right `is_causal` on the `nonpad_kv_seqlen`/no-`past_key` path + composed `is_causal` + `attn_mask` NaN robustness). Separately pushed, CI green, awaiting SIG review.
- onnx/onnx#8054: RFC: offset-aware causal masking for KV-cache decode / chunked prefill.

---------

Signed-off-by: Ti-Tai Wang <titaiwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
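The bottom-right rule and the fully-masked-row guard from the "What changed" list can be sketched for a single query row in plain Python. This is a reference-style illustration under simplifying assumptions (one batch entry, one head, no `attn_mask`); the helper names are hypothetical and this is not the kernel or the test-suite reference code.

```python
import math


def causal_allowed(i: int, j: int, nonpad_kv_seqlen: int, q_seq_len: int) -> bool:
    """Bottom-right rule: query i may attend key j iff j <= i + offset."""
    offset = nonpad_kv_seqlen - q_seq_len
    return j <= i + offset


def attention_row_weights(scores, i, nonpad_kv_seqlen, q_seq_len):
    """Softmax weights for query row i over all key positions.

    A row with no allowed key yields all-zero weights (selected, not
    multiplied), so the row of P @ V is exactly zero.
    """
    allowed = [
        causal_allowed(i, j, nonpad_kv_seqlen, q_seq_len)
        for j in range(len(scores))
    ]
    if not any(allowed):
        return [0.0] * len(scores)
    # Numerically stable softmax restricted to the allowed keys.
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the largest query index is `q_seq_len - 1`, the rule `j <= i + offset` never admits a key at or beyond `nonpad_kv_seqlen`, so padded cache slots are excluded without a separate check in this sketch.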

Bumps [ws](https://github.com/websockets/ws) and [ws](https://github.com/websockets/ws). These dependencies needed to be updated together.

Updates `ws` from 7.5.10 to 7.5.11

<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/websockets/ws/releases">ws's releases</a>.</em></p>
<blockquote>
<h2>7.5.11</h2>
<h1>Bug fixes</h1>
<ul>
<li>Backported 2b2abd45 to the 7.x release line (e14c4586).</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="https://github.com/websockets/ws/commit/fd36cd864fcdf62a08273a99e19a7d975401fee8"><code>fd36cd8</code></a> [dist] 7.5.11</li>
<li><a href="https://github.com/websockets/ws/commit/e14c45861deca0cef60dec0f9109b694abebdf52"><code>e14c458</code></a> [security] Limit retained message parts</li>
<li>See full diff in <a href="https://github.com/websockets/ws/compare/7.5.10...7.5.11">compare view</a></li>
</ul>
</details>
<br />

Updates `ws` from 6.2.3 to 6.2.4

<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/websockets/ws/releases">ws's releases</a>.</em></p>
<blockquote>
<h2>7.5.11</h2>
<h1>Bug fixes</h1>
<ul>
<li>Backported 2b2abd45 to the 7.x release line (e14c4586).</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="https://github.com/websockets/ws/commit/fd36cd864fcdf62a08273a99e19a7d975401fee8"><code>fd36cd8</code></a> [dist] 7.5.11</li>
<li><a href="https://github.com/websockets/ws/commit/e14c45861deca0cef60dec0f9109b694abebdf52"><code>e14c458</code></a> [security] Limit retained message parts</li>
<li>See full diff in <a href="https://github.com/websockets/ws/compare/7.5.10...7.5.11">compare view</a></li>
</ul>
</details>
<br />

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…osoft#29161)

## Description

When `GroupQueryAttention` runs the CUDA FlashDecode fast-decode path with a sliding/local attention window (`local_window_size > 0`), the split-K planning was sized using the full `total_sequence_length` even though only the last `local_window_size` KV positions can contribute to the output. This caused local-window decode layers to over-split and run an unnecessary split-K combine pass.

This PR clamps the sequence length used for split planning to the local window size, so local-window decode no longer pays for splits/combine work it does not need.

This is motivated by models that use local attention windows for GQA (e.g. gpt-oss-style decode with a small sliding window over a large KV cache).

## Summary of Changes

### Kernel dispatch

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | In the FlashDecode fast-decode path, clamp the sequence length passed to `get_num_splits_and_buffer_sizes` to `local_window_size` when `local_window_size > 0`, so split-K planning reflects only the windowed KV range. |

### Tests

| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/test_gqa.py` | Add `test_gqa_local_window_large_context_decode` regression test: decode step (q_len=1) with a large past context (4096) and a small local window (128), verifying parity of the narrowed split planning. Skips when Flash Attention is unavailable. |

### Profiling helpers

| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/profile_gqa.py` | New nsys profiling helper for the GQA decode path, with a `--local-window-size` option and NVTX range markers. |
| `onnxruntime/test/python/transformers/profile_gqa.sh` | New shell wrapper that runs `nsys` profiling per precision mode and parses results with the shared `parse_nsys.py`; checks `nsys`/`nvtx` availability instead of mutating the environment. |

## Testing

- Unit test:
  ```bash
  cd onnxruntime/test/python/transformers
  PIPELINE_MODE=1 python test_gqa.py -k test_gqa_local_window_large_context_decode -v
  ```
- Existing FlashDecode parity coverage:
  ```bash
  PIPELINE_MODE=1 python test_gqa.py -k test_gqa_past_flash_attention -v
  ```
- Profiling (optional, requires an NVIDIA GPU + Nsight Systems):
  ```bash
  cd onnxruntime/test/python/transformers
  ./profile_gqa.sh --fp16 --past-sequence-length 4096 --local-window-size 128
  ```
  Observed on H200 (SM90, fp16, batch=2, num_heads=64, kv_num_heads=8, head_size=64): the split-K combine pass is eliminated for the local-window case and the main decode kernel time drops significantly versus the unclamped (full-context) split planning.
- Backward compatibility: behavior is unchanged when `local_window_size <= 0`; the clamp only applies on the FlashDecode fast-decode path with a positive local window.

## Motivation and Context

Local-window GQA decode layers only attend to the most recent `local_window_size` KV positions, so splitting and combining across the entire KV cache wastes split-K combine work. Clamping the split planning sequence length to the window size keeps the fast path correct while removing the redundant combine pass for windowed decode layers.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes (behavior unchanged when `local_window_size <= 0`)
- [ ] Documentation updated (if applicable)
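The clamp described above amounts to a one-line rule, sketched here in Python. The helper name `split_planning_seq_len` is hypothetical, and using `min` (so that a window larger than the context leaves the length unchanged) is an assumption about how "clamp" is applied; the actual change is in the C++ file `group_query_attention.cc`.

```python
def split_planning_seq_len(total_sequence_length: int, local_window_size: int) -> int:
    """Sequence length to hand to split-K planning for a decode step.

    With a positive local window only the most recent `local_window_size`
    KV positions can contribute, so planning is sized for that range.
    A non-positive window means "no local window" and changes nothing.
    """
    if local_window_size > 0:
        return min(total_sequence_length, local_window_size)
    return total_sequence_length
```

For the regression-test shape in the table (past context 4096 plus one new token, window 128), `split_planning_seq_len(4097, 128)` returns 128, while `split_planning_seq_len(4097, -1)` keeps the full 4097.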

### Description

This PR removes the TensorRT fused **causal** attention kernels (the `fmha_v2_*_Causal_*` and `fmha_v2_flash_attention_*_Causal_*` cubins) and all of the code paths that selected them from the CUDA `Attention` operator.

These causal fused kernels were disabled by default (since [microsoft#14732](microsoft#14732)) and were only reachable via the opt-in `ORT_ENABLE_FUSED_CAUSAL_ATTENTION` environment variable / `TRT_CAUSAL_ATTENTION` backend bit. They used fp16 accumulation, which can cause accuracy drops, and have been superseded by flash attention, memory-efficient attention, and cuDNN SDPA. Removing them deletes ~1.27M lines of generated cubin source and simplifies the attention dispatch logic.

### Key Changes

- **Removed cubin sources**: Deleted all `causal/fmha_v2_fp16_Causal_*` and `flash_attention/fmha_v2_flash_attention_fp16_Causal_*` generated cubin files (70+ files).
- **Dispatch simplification** ([attention.cc](onnxruntime/contrib_ops/cuda/bert/attention.cc)): Removed the `is_unidirectional_` / causal fused-runner branch in `ComputeInternal`; the fused runner path now only handles the BERT (non-causal) case.
- **Kernel options** ([attention_kernel_options.cc](onnxruntime/contrib_ops/cuda/bert/attention_kernel_options.cc), [attention_kernel_options.h](onnxruntime/contrib_ops/cuda/bert/attention_kernel_options.h)): Removed `use_trt_causal_attention_`, `UseTrtCausalAttention()`, the `TRT_CAUSAL_ATTENTION` debug print, and the `causal` argument of `SetTrtFusedKernel`.
- **QKV format** ([attention_common.h](onnxruntime/contrib_ops/cpu/bert/attention_common.h), [attention_prepare_qkv.cu](onnxruntime/contrib_ops/cuda/bert/attention_prepare_qkv.cu)): Removed the `Q_K_V_BNSH_QKV_BS3NH` format and the fused-causal gemm-buffer-with-bias preparation path.
- **Runner API** ([mha_runner.cu](onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/mha_runner.cu), [mha_runner.h](onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/mha_runner.h), [fused_multihead_attention_v2.h](onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/fused_multihead_attention_v2.h)): Dropped the `causal` parameter from `FusedMHARunnerFP16v2::Create` / `IsSupported` and removed the causal kernel metadata.
- **Env var removed**: `ORT_ENABLE_FUSED_CAUSAL_ATTENTION` (`kEnableFusedCausalAttention`) is no longer recognized.
- **Callers updated**: [multihead_attention.cc](onnxruntime/contrib_ops/cuda/bert/multihead_attention.cc), [packed_attention.cc](onnxruntime/contrib_ops/cuda/bert/packed_attention.cc), [packed_multihead_attention.cc](onnxruntime/contrib_ops/cuda/bert/packed_multihead_attention.cc), [attention_impl.cu](onnxruntime/contrib_ops/cuda/bert/attention_impl.cu) updated to the new no-causal signatures.
- **Python helpers**: Removed stale `ORT_ENABLE_FUSED_CAUSAL_ATTENTION` references from the transformers benchmark helper and stable diffusion benchmark.
- **Tests updated**: [attention_op_test.cc](onnxruntime/test/contrib_ops/attention_op_test.cc) and [attention_kernel_options_test.cc](onnxruntime/test/providers/cuda/test_cases/attention_kernel_options_test.cc) no longer set/assert the causal-fused option.

### Motivation and Context

The fused causal kernels were off by default, carried potential fp16-accumulation accuracy risk, and added a large amount of generated cubin source to the repo. Causal attention is already well covered by flash attention, memory-efficient attention, and cuDNN SDPA, so these kernels can be safely removed to reduce binary size and simplify maintenance.

### Testing

- Build the CUDA EP and run the attention contrib op tests (`ContribOpAttentionTest.*`, including `Causal_EmptyPastState`).
- Run `AttentionKernelOptionsTest.*` to verify the kernel-option parsing no longer references the causal backend.

### Description

This PR fixes a convolution performance regression affecting some OCR models with large-kernel convolutions when the KleidiAI SME IGEMM convolution path is selected. The change has 2 parts:

1. updates to the KleidiAI IGEMM LHS packing to pack rows in bounded chunks instead of packing the full LHS buffer up front, which reduces memory usage and improves cache locality for large convolutions,
2. a new route selection function `ArmKleidiAI::SelectConvRoute` that decides between `Igemm`, `GemmFallback` and `None` based on convolution parameters and a workload size-based heuristic.

The function `CheckCapabilitiesSme` runs `SelectConvRoute` and only returns true if the selected route is `Igemm`. The patch also adds a standard GEMM fallback to the `ConvRoute` possibilities, and runs `MlasGemm` if said fallback is selected. If the function selects `None`, then the convolution falls back to `MlasSgemmOperation`.

### Motivation and Context

Fixes microsoft#27633.

---------

Signed-off-by: Qxiang Xu <Qixiang.Xu@arm.com>
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Co-authored-by: Damien Dooley <damien.dooley@arm.com>
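The relationship between the three routes, and the idea of packing in bounded chunks, can be pictured with the Python sketch below. It deliberately leaves out the selection heuristic, which the description does not spell out. The Python names (`ConvRoute` members, `check_capabilities_sme`, `backend_for`, `chunk_ranges`) and the chunk size used in the example are hypothetical; only the route names and the backends they map to come from the text.

```python
from enum import Enum


class ConvRoute(Enum):
    IGEMM = "Igemm"
    GEMM_FALLBACK = "GemmFallback"
    NONE = "None"


def check_capabilities_sme(route: ConvRoute) -> bool:
    """Mirrors the stated rule: true only when the selected route is Igemm."""
    return route is ConvRoute.IGEMM


def backend_for(route: ConvRoute) -> str:
    """Name of the path the description says each route ends up on."""
    return {
        ConvRoute.IGEMM: "KleidiAI SME IGEMM",
        ConvRoute.GEMM_FALLBACK: "MlasGemm",
        ConvRoute.NONE: "MlasSgemmOperation",
    }[route]


def chunk_ranges(total_rows: int, chunk_rows: int):
    """Yield [start, end) row ranges so rows are packed in bounded chunks."""
    for start in range(0, total_rows, chunk_rows):
        yield start, min(start + chunk_rows, total_rows)
```

For instance, `list(chunk_ranges(10, 4))` yields three ranges covering rows 0 to 10, so at most four rows' worth of packed data would need to be resident at a time in this simplified picture.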

### Description

The current tests in the `NhwcTransformerTests` suite, `ConvFloat_UsesNhwcOnlyWithKleidi` and `FusedConvFloat_UsesNhwcOnlyWithKleidi`, assumed that whenever KleidiAI NHWC float-conv support is available, the test graph must be rewritten to `com.microsoft.NhwcFusedConv`.

That assumption is not valid when ONNX Runtime is built with `--enable_arm_neon_nchwc`. In that configuration, the Level 3 NCHWc transformer is registered before the NHWC transformer, so the NCHWc rewrite can be selected instead. The optimiser priority is intentional, so these tests shouldn't require NHWC to be chosen over NCHWc.

This change keeps the existing optimiser ordering and instead updates the assertions in the 2 tests. If the NHWC path is selected, the tests still validate the expected `NhwcFusedConv` graph shape and verify that the path is only used when KleidiAI NHWC support is available. If another valid layout optimisation is selected first, the tests no longer fail just because the `NhwcFusedConv` op isn't generated.

### Motivation and Context

This change fixes the 2 mentioned unit tests, which fail when ONNX Runtime is built and tested with `--enable_arm_neon_nchwc`.

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

tianleiwu and others added 9 commits June 18, 2026 15:01

Merge remote-tracking branch 'origin/master' into sync_msft_20062026

7d591bd

ai-fw-intg requested review from Jaswanth51, ankitm3k, jatinwadhwa921 and vthaniel June 19, 2026 20:34

ai-fw-intg wants to merge 9 commits into ovep-develop from sync_msft_20062026

5 participants
