
Sync with Microsoft ONNX Runtime - 20062026 #1154

Open: ai-fw-intg wants to merge 9 commits into ovep-develop from sync_msft_20062026

Conversation

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

tianleiwu and others added 9 commits June 18, 2026 15:01
…9067)

## Description

The symmetric INT4/INT8 MoE decode GEMV could emit `NaN`/garbage when a GEMM reduction dimension was not a whole multiple of the 64-element interleaved-weight K tile (e.g. `intermediate_size` such as 544). The interleaved weight layout's CUTLASS K iterator reads K in whole tiles of 64; a partial final tile makes threads read past the valid activation range. This PR fixes the decode GEMV selection gate to reject such shapes, adds an explicit up-front validation in the QMoE op so the grouped GEMM path fails with a clear error instead of silently producing wrong results, and folds in several QMoE review-feedback cleanups (checked size arithmetic, env-var parsing, and documentation).

## Summary of Changes

### NaN fix and hardening (INT weight-only path)

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu` | `is_moe_gemv_supported` now rejects `k % kTileSizeK != 0` (64), so the decode GEMV is not selected for a partial final K tile and the path falls back to the grouped GEMM. |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` | Added an up-front guard for `quant_type == "int"`: `hidden_size` (fc1.K) and `inter_size` (fc2.K) must be multiples of 64 (the interleaved-weight K tile); otherwise return `INVALID_ARGUMENT` with a clear message instead of computing garbage. |
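
The alignment rule in the table above can be illustrated with a small Python sketch. It is not the ORT implementation (which is C++/CUDA and returns `INVALID_ARGUMENT`); the names `K_TILE`, `partial_tile_overrun`, and `validate_int_qmoe_shapes` are hypothetical and exist only for this illustration.

```python
K_TILE = 64  # interleaved-weight K tile size stated in the PR description


def partial_tile_overrun(k: int, tile: int = K_TILE) -> int:
    """Number of elements a whole-tile K iterator would read past the valid range."""
    remainder = k % tile
    return 0 if remainder == 0 else tile - remainder


def validate_int_qmoe_shapes(hidden_size: int, inter_size: int) -> None:
    """Hypothetical stand-in for the up-front guard: reject partial K tiles."""
    for name, k in (("hidden_size", hidden_size), ("inter_size", inter_size)):
        if k % K_TILE != 0:
            raise ValueError(f"expected {name} to be a multiple of {K_TILE}, got {k}")


# 544 = 8 * 64 + 32, so the final tile would extend 32 elements past the data.
print(partial_tile_overrun(544))
```

A shape such as `inter_size=544` therefore fails the sketch's validation, while any 64-aligned shape passes unchanged.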

### Review-feedback cleanups

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/moe/moe.cc` | Use `SafeInt<size_t>` for scratch byte-count arithmetic (expanded rows × element sizes) feeding a single allocation. |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` | Same `SafeInt<size_t>` scratch-size hardening. |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu` | Parse `ORT_MOE_GEMV_FP16_ACCUM` via `ParseEnvironmentVariableWithDefault<int>`; include `env_var_utils.h` after `dispatcher.h` (SHARED_PROVIDER guard ordering, documented inline). |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_kernels.cu` | Parse `ORT_DISABLE_MOE_GEMV` via the same helper; clarify fast-path comments (symmetric INT4/INT8, per-column or block-wise). |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/threadblock/dq_mma_base.h` | Comment explaining the worst-case sizing of `kFinegrainedScaleRowsPerStage` for the smallest fine-grained group size. |
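
To show what checked size arithmetic buys, here is a minimal Python sketch of the idea behind the `SafeInt<size_t>` hardening. Python integers do not overflow, so the sketch emulates a 64-bit `size_t` bound explicitly; `SIZE_MAX`, `checked_mul`, `scratch_bytes`, and the `row_width` parameter are hypothetical names, not ORT code.

```python
SIZE_MAX = 2**64 - 1  # assumption: a 64-bit size_t


def checked_mul(a: int, b: int, limit: int = SIZE_MAX) -> int:
    """Multiply two sizes, raising instead of silently wrapping past the limit."""
    if a < 0 or b < 0:
        raise OverflowError("sizes must be non-negative")
    product = a * b
    if product > limit:
        raise OverflowError(f"{a} * {b} exceeds the size limit")
    return product


def scratch_bytes(expanded_rows: int, row_width: int, element_size: int) -> int:
    """Byte count for one scratch buffer, with every multiply checked."""
    return checked_mul(checked_mul(expanded_rows, row_width), element_size)


print(scratch_bytes(4, 8, 2))
```

With unchecked arithmetic, an oversized product would wrap and yield a too-small allocation; the checked version turns that case into an explicit error.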

### Documentation

| File | Change |
|------|--------|
| `docs/contrib_ops/cuda/moe_qmoe.md` | Document the `swiglu_fusion=0` + SwiGLU backward-compatibility remap (gpt-oss-20b interleaved layout) and the one-time warning. |
| `docs/contrib_ops/cuda/qmoe_gemv_experiments.md` | Note that recorded numbers are point-in-time baselines tied to the listed GPU/driver/CUDA/ORT build. |

## Testing

- `python -m pytest
onnxruntime/test/python/transformers/test_qmoe_cuda.py -k "gemv or
swiglu or block or PrePack or prepack"` — 84 passed, 6 skipped on H200
(sm90).
- New `TestSwigluQMoE::test_swiglu_qmoe_int_partial_ktile_rejected`
builds an `inter_size=544` (= 17×32, partial 64 tile) INT8 SwiGLU QMoE
and asserts the run raises `"inter_size to be a multiple of 64"`.
- New `TestSwigluQMoE::test_swiglu_qmoe_fusion0_remap_parity` exercises
the `swiglu_fusion=0` → interleaved remap parity.
- `TestQMoEIntPrePackSmoke::test_int4_swiglu_interleaved_small` bumped
from `inter_size=32` (a now-rejected partial K tile) to `64`.
- `ORT_ENABLE_FP4_GEMV=1 python -m pytest
onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py` — no
failures (the new guard is scoped to `quant_type == "int"`, so FP4/FP8
are unaffected).
- `lintrunner` clean on the changed C++ and Python files.

## Motivation and Context

The interleaved column-major weight layout (`ColumnMajorTileInterleave<64, …>`) requires the GEMM reduction dim K to be a whole multiple of `ThreadblockK` (64 for fp16/bf16 activations). The single-matrix `fpA_intB` GEMM already throws on this, but the grouped MoE GEMM and the decode GEMV had no equivalent guard and silently produced `NaN`/garbage. This PR closes that gap at the QMoE boundary (clear error) and in the GEMV dispatch gate (safe fallback). No supported, 64-aligned shape changes behavior.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (rejected shapes were already producing
incorrect output)
- [ ] CI passes
…icrosoft#28985)

### Description

Adds a decode-optimized CUDA path for the `LinearAttention` contrib op (the gated-delta / linear-attention recurrence used by hybrid models such as Qwen3-Next / Qwen3.6). The existing recurrent kernels are tuned for prefill; at decode (`seq_len == 1`) they leave the GPU mostly idle. This PR adds two decode-specific kernels that saturate the GPU and access the recurrent state in a coalesced pattern, **without changing the op's `present_state` layout or numerics**.

### Motivation

`LinearAttentionRecurrentKernelFixedShape` launches one block per `(batch, kv_head)` and keeps the full `d_k x d_v` state in shared memory across the token loop, with block-wide `__syncthreads` at every step. That design amortizes state I/O during prefill, but at decode it:

- launches only `kv_num_heads` blocks (e.g. 32), a fraction of the SMs, and
- gets no amortization from the shared-memory state cache (one token per launch),

so the op is latency-bound. On an H200 profile of Qwen3.6-35B-A3B it was the single most expensive decode kernel after the dense/MoE GEMMs (~0.69 ms/token).

### Key Changes

All in `onnxruntime/contrib_ops/cuda/bert/linear_attention_impl.cu`:

| Addition | Description |
|---|---|
| `warp_reduce_sum` | `__shfl_xor_sync` full-warp sum helper. |
| `LinearAttentionDecodeKernel<T, DK>` | Warp-per-column decode kernel: grid `(kv_num_heads, batch, ceil(d_v/4))`, each warp owns one output column with the state column sharded into registers; reductions via warp shuffles, no shared memory, no block barriers. Handles any `d_v`. |
| `LinearAttentionDecodeColKernel<T, DK>` | Column-per-thread decode kernel (default): one thread owns a full state column in registers. For a fixed row `i`, consecutive threads read consecutive addresses `i*d_v + col`, so state load/store is fully **coalesced with no transpose**; the row-major `[d_k, d_v]` `present_state` layout is unchanged. Used when `d_v % 32 == 0`; otherwise falls back to the warp kernel. |
| Dispatch in `LaunchLinearAttentionKernel` | Routes `seq_len <= 16` and `d_k in {64,128,256}` to the decode kernels; all other shapes fall through to the existing recurrent kernels, so the **prefill path is unchanged**. |
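
The dispatch conditions and index arithmetic listed in the table can be written out as a short Python sketch. It mirrors only the conditions stated above; the function names and the `"existing recurrent kernels"` label are hypothetical, and the real dispatch is C++/CUDA inside `LaunchLinearAttentionKernel`.

```python
import math

DECODE_MAX_SEQ_LEN = 16          # threshold stated in the dispatch row
DECODE_D_K = {64, 128, 256}      # supported d_k values stated in the dispatch row
WARP_SIZE = 32


def select_kernel(seq_len: int, d_k: int, d_v: int) -> str:
    """Pick a kernel name using only the conditions listed in the table."""
    if seq_len <= DECODE_MAX_SEQ_LEN and d_k in DECODE_D_K:
        if d_v % WARP_SIZE == 0:
            return "LinearAttentionDecodeColKernel"
        return "LinearAttentionDecodeKernel"
    return "existing recurrent kernels"


def warp_kernel_grid(kv_num_heads: int, batch: int, d_v: int) -> tuple:
    """Grid shape of the warp-per-column kernel: (kv_num_heads, batch, ceil(d_v/4))."""
    return (kv_num_heads, batch, math.ceil(d_v / 4))


def state_address(row: int, col: int, d_v: int) -> int:
    """Row-major [d_k, d_v] offset; adjacent cols map to adjacent addresses."""
    return row * d_v + col


print(select_kernel(1, 128, 128), warp_kernel_grid(32, 1, 128))
```

For a fixed `row`, `state_address(row, col + 1, d_v) - state_address(row, col, d_v)` is always 1, which is the coalescing property the column-per-thread kernel relies on.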

Both kernels cover the full op semantics: `linear` / `gated` / `delta` / `gated_delta` update rules, scalar and per-key-dim decay, per-head and scalar beta, standard GQA and inverse GQA, and `n_k_heads` K-sharing.
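
For readers unfamiliar with shuffle-based reductions such as the `warp_reduce_sum` helper listed in Key Changes, the following Python simulation shows the common XOR "butterfly" pattern. The source only says the helper uses `__shfl_xor_sync`; the exact loop structure here is an assumption, and the list `lane_values` stands in for the 32 per-lane registers of a warp.

```python
WARP_SIZE = 32


def warp_reduce_sum(lane_values):
    """Simulate an XOR-shuffle all-reduce: every lane ends with the full sum."""
    if len(lane_values) != WARP_SIZE:
        raise ValueError("expected one value per lane")
    vals = list(lane_values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane adds the value currently held by lane (lane ^ offset).
        # The list comprehension reads the old values before any are replaced,
        # matching the simultaneous exchange of a hardware shuffle.
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(WARP_SIZE)]
        offset //= 2
    return vals


result = warp_reduce_sum(list(range(WARP_SIZE)))
print(result[0], result[31])
```

Five exchange steps (offsets 16, 8, 4, 2, 1) suffice for 32 lanes, and no shared memory or block barrier is involved.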

### Performance

H200, Qwen3.6-35B-A3B (INT4), single-sequence decode, CUDA graph on. Kernel time measured with Nsight Systems (steady-state, warmup excluded):

| Kernel | Time / token |
|---|---|
| `LinearAttentionRecurrentKernelFixedShape` (existing) | 693 µs |
| `LinearAttentionDecodeKernel` (warp-per-column) | 346 µs (2.0x) |
| `LinearAttentionDecodeColKernel` (column-per-thread) | **202 µs (3.4x)** |

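End-to-end decode throughput improved measurably (per the table, the kernel drops from 693 µs to 346 µs per token with the warp-per-column kernel and to 202 µs with the default column-per-thread kernel), with no change to prefill.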
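
The speedup factors in the table follow directly from the measured times. A small Python check, using only the numbers reported above (the helper name `speedup` is hypothetical):

```python
BASELINE_US = 693.0  # LinearAttentionRecurrentKernelFixedShape, from the table

DECODE_TIMINGS_US = {
    "LinearAttentionDecodeKernel": 346.0,
    "LinearAttentionDecodeColKernel": 202.0,
}


def speedup(baseline_us: float, new_us: float) -> float:
    """Ratio of old kernel time to new kernel time."""
    return baseline_us / new_us


for name, time_us in DECODE_TIMINGS_US.items():
    ratio = speedup(BASELINE_US, time_us)
    share = time_us / BASELINE_US
    print(f"{name}: {ratio:.1f}x faster, {share:.0%} of the baseline time")
```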

### Testing

- All 26 `ContribOpLinearAttentionTest` parity tests pass (the decode kernels are exercised by the single-token, inverse-GQA, KGQA, and Qwen3.5-like cases):

  ```
  ./onnxruntime_provider_test --gtest_filter='*LinearAttention*'
  ```

- No public API or `present_state` layout change; the decode path is opt-in by shape and falls back to the existing kernels for unsupported `d_k` / `d_v`.

### Motivation and Context

Decode-throughput optimization for hybrid linear-attention + MoE models. No breaking changes; numerics and output layout are preserved.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…(onnx#8068, microsoft#28904) (microsoft#28958)

## Summary

Aligns the opset-24 ONNX-domain `Attention` kernels (CPU + CUDA) with
the ONNX errata onnx/onnx#8068 (tracking RFC onnx/onnx#8054) for the
**external static KV-cache** path — keyed by `nonpad_kv_seqlen` (input
#6), no `past_key`.

**Addresses microsoft#28904.**

## What changed

1. **Bottom-right `is_causal` alignment.** Per batch, `offset[b] =
nonpad_kv_seqlen[b] − q_sequence_length`; a query at in-block index `i`
attends key `j` iff `j <= i + offset[b]`. Applied on CPU and on the CUDA
Flash / Memory-Efficient (MEA) / unfused paths. The MEA causal-alignment
selector is now offset-aware (no unconditional top-left when an external
cache is present); Flash's native bottom-right + per-batch `seqlens_k`
is used where eligible.
2. **Fully-masked-row → 0 guard (Bug-2).** A query row with no allowed
key now outputs a **zero** row instead of mean-of-V (the finite-sentinel
softmax result). Detected with an exact per-key structural predicate
(`isneginf`-equivalent) and zeroed with **select (not multiply)** before
`P @ V`, so `0 @ V = 0`. Added on CPU and the CUDA MEA path. The Flash
`is_causal` + `seqlens_k` path (`offset >= 0`) cannot produce a
fully-masked row and is intentionally left unguarded. Bool-mask
conversion was already select-not-multiply on both EPs (Bug-1 satisfied;
no change needed).
3. **Reject removal.** Removes the CUDA `NOT_IMPLEMENTED` reject for
`is_causal` + `nonpad_kv_seqlen` with `S_q != total_kv` and no
`past_key` — the spec now *defines* this result, so the op computes it
rather than rejecting.

Full-prefill (`offset = 0`) and `past_key` decode paths remain
**bit-identical**. Contrib `MultiHeadAttention` / `GroupQueryAttention`
consume the shared FMHA kernels and are **unchanged** — only the
ONNX-domain `Attention` dispatch is retargeted.
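
The bottom-right alignment rule and the fully-masked-row guard described under "What changed" can be sketched in plain Python for a single batch entry and head. This is a reference-style illustration under the stated rule `j <= i + offset`, not the CPU or CUDA kernel; `bottom_right_allowed` and `attend_row` are hypothetical names.

```python
import math


def bottom_right_allowed(i: int, j: int, nonpad_kv_seqlen: int, q_len: int) -> bool:
    """Query i may attend key j iff j <= i + (nonpad_kv_seqlen - q_len)."""
    offset = nonpad_kv_seqlen - q_len
    return j <= i + offset


def attend_row(scores, values, i: int, nonpad_kv_seqlen: int, q_len: int):
    """Softmax-weighted sum over allowed keys; a row with no allowed key yields zeros."""
    allowed = [j for j in range(len(scores))
               if bottom_right_allowed(i, j, nonpad_kv_seqlen, q_len)]
    dim = len(values[0])
    if not allowed:
        # Structural guard: select a zero row instead of averaging V.
        return [0.0] * dim
    peak = max(scores[j] for j in allowed)
    weights = {j: math.exp(scores[j] - peak) for j in allowed}
    total = sum(weights.values())
    return [sum(weights[j] / total * values[j][d] for j in allowed)
            for d in range(dim)]


scores = [0.0, 0.0, 0.0]
values = [[1.0], [3.0], [5.0]]
# q_len=3 with only 2 valid keys gives offset=-1, so query 0 has no allowed key.
print([attend_row(scores, values, i, 2, 3) for i in range(3)])
```

Because the largest query index is `q_len - 1`, the rule never admits a key at or beyond `nonpad_kv_seqlen`, so padding keys are excluded without a separate check.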

## Test coverage

- **C++ `AttentionTest` gtests: 73/73 pass**, including new
bottom-right-offset, structural-empty causal row → 0 (CPU + CUDA), and
fp16 fully-masked-row goldens.
- **Python `test_onnx_attention`: 277/0** — includes the updated
`test_tensorscatter_attention.py` (stale negative-reject → positive
bottom-right acceptance).
- QA final gate: from-scratch Debug build green.

## Preemptive onnx#8068 node-test skips (de-skip TODO)

This branch adds the new onnx/onnx#8068 `Attention` backend node tests
to both skip lists so they don't fail before the onnx dependency is
bumped:
- `onnxruntime/test/testdata/onnx_backend_test_series_filters.jsonc`
- `onnxruntime/test/onnx/TestCase.cc` (C++ `GetBrokenTests`)

These are a **no-op on the current onnx pin (v1.21.0)**. **TODO
(de-skip):** remove **both** skip lists once `cmake/external/onnx` is
bumped to a release containing onnx#8068.

## Deferred follow-up

`q_seq > 1` Python bottom-right **parity** coverage requires upgrading
the `test_onnx_attention` suite's numpy/torch reference functions from
**total-kv-relative** causal (`offset = kv_seq − q_seq`) to
**nonpad-relative** bottom-right (`offset = nonpad_kv_seqlen − q_seq`);
a naive `is_causal=1` flip on the current refs is a no-op or a false
failure against the correct kernel. The `q_seq > 1` / `nonpad < q_seq`
behavior (including structural-empty rows) is already locked by the C++
gtest goldens. Tracked as follow-up.

## References

- onnx/onnx#8068 — spec + reference errata (bottom-right `is_causal` on
the `nonpad_kv_seqlen`/no-`past_key` path + composed `is_causal` +
`attn_mask` NaN robustness). Separately pushed, CI green, awaiting SIG
review.
- onnx/onnx#8054 — RFC: offset-aware causal masking for KV-cache decode
/ chunked prefill.

---------

Signed-off-by: Ti-Tai Wang <titaiwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bumps two versions of [ws](https://github.com/websockets/ws). These dependencies
needed to be updated together.
Updates `ws` from 7.5.10 to 7.5.11
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/websockets/ws/releases">ws's
releases</a>.</em></p>
<blockquote>
<h2>7.5.11</h2>
<h1>Bug fixes</h1>
<ul>
<li>Backported 2b2abd45 to the 7.x release line (e14c4586).</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/websockets/ws/commit/fd36cd864fcdf62a08273a99e19a7d975401fee8"><code>fd36cd8</code></a>
[dist] 7.5.11</li>
<li><a
href="https://github.com/websockets/ws/commit/e14c45861deca0cef60dec0f9109b694abebdf52"><code>e14c458</code></a>
[security] Limit retained message parts</li>
<li>See full diff in <a
href="https://github.com/websockets/ws/compare/7.5.10...7.5.11">compare
view</a></li>
</ul>
</details>
<br />

Updates `ws` from 6.2.3 to 6.2.4



Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…osoft#29161)

## Description

When `GroupQueryAttention` runs the CUDA FlashDecode fast-decode path with a sliding/local attention window (`local_window_size > 0`), the split-K planning was sized using the full `total_sequence_length` even though only the last `local_window_size` KV positions can contribute to the output. This caused local-window decode layers to over-split and run an unnecessary split-K combine pass. This PR clamps the sequence length used for split planning to the local window size, so local-window decode no longer pays for splits/combine work it does not need. This is motivated by models that use local attention windows for GQA (e.g. gpt-oss-style decode with a small sliding window over a large KV cache).

## Summary of Changes

### Kernel dispatch

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | In the FlashDecode fast-decode path, clamp the sequence length passed to `get_num_splits_and_buffer_sizes` to `local_window_size` when `local_window_size > 0`, so split-K planning reflects only the windowed KV range. |
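
The clamp described in the table amounts to a single `min`. A Python sketch of that step alone, assuming the helper name `split_planning_seq_len` (hypothetical; the actual change is in the C++ caller of `get_num_splits_and_buffer_sizes`):

```python
def split_planning_seq_len(total_sequence_length: int, local_window_size: int) -> int:
    """Sequence length to use for split-K planning.

    With a positive local window, only the last `local_window_size` KV
    positions can contribute, so planning is clamped to that window.
    A non-positive window means no windowing and leaves the length unchanged.
    """
    if local_window_size > 0:
        return min(total_sequence_length, local_window_size)
    return total_sequence_length


# Large context with a small window, as in the new regression test's shapes.
print(split_planning_seq_len(4096, 128))
```

Using `min` instead of assigning the window size directly keeps short sequences (shorter than the window) unaffected.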

### Tests

| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/test_gqa.py` | Add `test_gqa_local_window_large_context_decode` regression test: decode step (q_len=1) with a large past context (4096) and a small local window (128), verifying parity of the narrowed split planning. Skips when Flash Attention is unavailable. |

### Profiling helpers

| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/profile_gqa.py` | New nsys profiling helper for the GQA decode path, with a `--local-window-size` option and NVTX range markers. |
| `onnxruntime/test/python/transformers/profile_gqa.sh` | New shell wrapper that runs `nsys` profiling per precision mode and parses results with the shared `parse_nsys.py`; checks `nsys`/`nvtx` availability instead of mutating the environment. |

## Testing

- Unit test:
  ```bash
  cd onnxruntime/test/python/transformers
  PIPELINE_MODE=1 python test_gqa.py -k test_gqa_local_window_large_context_decode -v
  ```
- Existing FlashDecode parity coverage:
  ```bash
  PIPELINE_MODE=1 python test_gqa.py -k test_gqa_past_flash_attention -v
  ```
- Profiling (optional, requires an NVIDIA GPU + Nsight Systems):
  ```bash
  cd onnxruntime/test/python/transformers
  ./profile_gqa.sh --fp16 --past-sequence-length 4096 --local-window-size 128
  ```
  Observed on H200 (SM90, fp16, batch=2, num_heads=64, kv_num_heads=8, head_size=64): the split-K combine pass is eliminated for the local-window case and the main decode kernel time drops significantly versus the unclamped (full-context) split planning.
- Backward compatibility: behavior is unchanged when `local_window_size <= 0`; the clamp only applies on the FlashDecode fast-decode path with a positive local window.

## Motivation and Context

Local-window GQA decode layers only attend to the most recent `local_window_size` KV positions, so splitting and combining across the entire KV cache wastes split-K combine work. Clamping the split planning sequence length to the window size keeps the fast path correct while removing the redundant combine pass for windowed decode layers.

## Checklist
- [x] Tests added/updated
- [x] No breaking changes (behavior unchanged when `local_window_size <=
0`)
- [ ] Documentation updated (if applicable)
### Description

This PR removes the TensorRT fused **causal** attention kernels (the
`fmha_v2_*_Causal_*` and `fmha_v2_flash_attention_*_Causal_*` cubins)
and all of the code paths that selected them from the CUDA `Attention`
operator.

These causal fused kernels were disabled by default (since
[microsoft#14732](microsoft#14732)) and were
only reachable via the opt-in `ORT_ENABLE_FUSED_CAUSAL_ATTENTION`
environment variable / `TRT_CAUSAL_ATTENTION` backend bit. They used
fp16 accumulation, which can cause accuracy drops, and have been
superseded by flash attention, memory-efficient attention, and cuDNN
SDPA. Removing them deletes ~1.27M lines of generated cubin source and
simplifies the attention dispatch logic.

### Key Changes

- **Removed cubin sources**: Deleted all `causal/fmha_v2_fp16_Causal_*`
and `flash_attention/fmha_v2_flash_attention_fp16_Causal_*` generated
cubin files (70+ files).
- **Dispatch simplification**
([attention.cc](onnxruntime/contrib_ops/cuda/bert/attention.cc)):
Removed the `is_unidirectional_` / causal fused-runner branch in
`ComputeInternal`; the fused runner path now only handles the BERT
(non-causal) case.
- **Kernel options**
([attention_kernel_options.cc](onnxruntime/contrib_ops/cuda/bert/attention_kernel_options.cc),
[attention_kernel_options.h](onnxruntime/contrib_ops/cuda/bert/attention_kernel_options.h)):
Removed `use_trt_causal_attention_`, `UseTrtCausalAttention()`, the
`TRT_CAUSAL_ATTENTION` debug print, and the `causal` argument of
`SetTrtFusedKernel`.
- **QKV format**
([attention_common.h](onnxruntime/contrib_ops/cpu/bert/attention_common.h),
[attention_prepare_qkv.cu](onnxruntime/contrib_ops/cuda/bert/attention_prepare_qkv.cu)):
Removed the `Q_K_V_BNSH_QKV_BS3NH` format and the fused-causal
gemm-buffer-with-bias preparation path.
- **Runner API**
([mha_runner.cu](onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/mha_runner.cu),
[mha_runner.h](onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/mha_runner.h),
[fused_multihead_attention_v2.h](onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/fused_multihead_attention_v2.h)):
Dropped the `causal` parameter from `FusedMHARunnerFP16v2::Create` /
`IsSupported` and removed the causal kernel metadata.
- **Env var removed**: `ORT_ENABLE_FUSED_CAUSAL_ATTENTION`
(`kEnableFusedCausalAttention`) is no longer recognized.
- **Callers updated**:
[multihead_attention.cc](onnxruntime/contrib_ops/cuda/bert/multihead_attention.cc),
[packed_attention.cc](onnxruntime/contrib_ops/cuda/bert/packed_attention.cc),
[packed_multihead_attention.cc](onnxruntime/contrib_ops/cuda/bert/packed_multihead_attention.cc),
[attention_impl.cu](onnxruntime/contrib_ops/cuda/bert/attention_impl.cu)
updated to the new no-causal signatures.
- **Python helpers**: Removed stale `ORT_ENABLE_FUSED_CAUSAL_ATTENTION`
references from the transformers benchmark helper and stable diffusion
benchmark.
- **Tests updated**:
[attention_op_test.cc](onnxruntime/test/contrib_ops/attention_op_test.cc)
and
[attention_kernel_options_test.cc](onnxruntime/test/providers/cuda/test_cases/attention_kernel_options_test.cc)
no longer set/assert the causal-fused option.

### Motivation and Context

The fused causal kernels were off by default, carried potential
fp16-accumulation accuracy risk, and added a large amount of generated
cubin source to the repo. Causal attention is already well covered by
flash attention, memory-efficient attention, and cuDNN SDPA, so these
kernels can be safely removed to reduce binary size and simplify
maintenance.

### Testing

- Build the CUDA EP and run the attention contrib op tests
(`ContribOpAttentionTest.*`, including `Causal_EmptyPastState`).
- Run `AttentionKernelOptionsTest.*` to verify the kernel-option parsing
no longer references the causal backend.
### Description

This PR fixes a convolution performance regression affecting some OCR
models with large-kernel convolutions when the KleidiAI SME IGEMM
convolution path is selected.

The change has 2 parts:
1. updates to the KleidiAI IGEMM LHS packing to pack rows in bounded
chunks instead of packing the full LHS buffer up front, which reduces
memory usage and improves cache locality for large convolutions,
2. a new route selection function `ArmKleidiAI::SelectConvRoute` that
decides between `Igemm`, `GemmFallback` and `None` based on convolution
parameters and a workload size-based heuristic.
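
The first part, packing the LHS in bounded chunks, can be pictured with a short Python sketch. It shows only the chunking pattern; `iter_row_chunks`, `pack_in_chunks`, and the callback parameters are hypothetical and do not correspond to KleidiAI or MLAS function names.

```python
def iter_row_chunks(total_rows: int, max_rows_per_chunk: int):
    """Yield (start_row, row_count) pairs covering all rows in bounded chunks."""
    if max_rows_per_chunk <= 0:
        raise ValueError("chunk size must be positive")
    for start in range(0, total_rows, max_rows_per_chunk):
        yield start, min(max_rows_per_chunk, total_rows - start)


def pack_in_chunks(rows, max_rows_per_chunk, pack_fn, consume_fn):
    """Pack and consume one chunk at a time so scratch size is bounded by the chunk."""
    for start, count in iter_row_chunks(len(rows), max_rows_per_chunk):
        packed = pack_fn(rows[start:start + count])  # scratch holds one chunk only
        consume_fn(start, packed)


print(list(iter_row_chunks(10, 4)))
```

Compared with packing the whole LHS up front, the scratch buffer here never holds more than `max_rows_per_chunk` rows, which is the memory and cache-locality benefit the description refers to.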

The function `CheckCapabilitiesSme` runs `SelectConvRoute` and only
returns true if the selected route is `Igemm`. The patch also adds a
standard GEMM fallback to the `ConvRoute` possibilities, and runs
`MlasGemm` if said fallback is selected. If the function selects `None`,
then the convolution falls back to `MlasSgemmOperation`.
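
The control flow in the paragraph above can be summarized in a Python sketch. The three route names and the fallback targets come from the description; the selection heuristic itself is not described in the source, so the `workload` threshold below is a placeholder assumption, and all Python names are hypothetical.

```python
from enum import Enum


class ConvRoute(Enum):
    IGEMM = "Igemm"
    GEMM_FALLBACK = "GemmFallback"
    NONE = "None"


def select_conv_route(params_supported: bool, workload: int, igemm_limit: int) -> ConvRoute:
    """Placeholder heuristic: the real rule is not given in the PR text."""
    if not params_supported:
        return ConvRoute.NONE
    if workload > igemm_limit:  # assumed direction of the size-based check
        return ConvRoute.GEMM_FALLBACK
    return ConvRoute.IGEMM


def check_capabilities_sme(params_supported: bool, workload: int, igemm_limit: int) -> bool:
    """True only when the selected route is Igemm, as the description states."""
    return select_conv_route(params_supported, workload, igemm_limit) is ConvRoute.IGEMM


def backend_for(route: ConvRoute) -> str:
    """Which implementation runs for each route, per the description."""
    return {
        ConvRoute.IGEMM: "KleidiAI SME IGEMM",
        ConvRoute.GEMM_FALLBACK: "MlasGemm",
        ConvRoute.NONE: "MlasSgemmOperation",
    }[route]


print(backend_for(select_conv_route(True, 10, 100)))
```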

### Motivation and Context

Fixes microsoft#27633.

---------

Signed-off-by: Qxiang Xu <Qixiang.Xu@arm.com>
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Co-authored-by: Damien Dooley <damien.dooley@arm.com>
### Description

The current tests in the `NhwcTransformerTests` suite
`ConvFloat_UsesNhwcOnlyWithKleidi` and
`FusedConvFloat_UsesNhwcOnlyWithKleidi` assumed that whenever KleidiAI
NHWC float-conv support is available, the test graph must be rewritten
to `com.microsoft.NhwcFusedConv`.

That assumption is not valid when ONNX Runtime is built with
`--enable_arm_neon_nchwc`. In that configuration, the Level 3 NCHWc
transformer is registered before the NHWC transformer, so the NCHWc
rewrite can be selected instead. The optimiser priority is intentional,
so these tests shouldn't require NHWC to be chosen over NCHWc.

This change keeps the existing optimiser ordering and instead updates
the assertions in the 2 tests. If the NHWC path is selected, the tests
still validate the expected `NhwcFusedConv` graph shape and verify that
the path is only used when KleidiAI NHWC support is available. If
another valid layout optimisation is selected first, the tests no longer
fail just because the `NhwcFusedConv` op isn't generated.
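
The relaxed assertion logic can be sketched in Python as a conditional check over the optimized graph's op counts. The actual tests are C++ gtests; `check_nhwc_rewrite`, the `op_counts` dictionary, and the "exactly one fused conv, no plain `Conv` left" shape check are assumptions made for illustration.

```python
NHWC_OP = "com.microsoft.NhwcFusedConv"


def check_nhwc_rewrite(op_counts: dict, kleidi_nhwc_available: bool) -> str:
    """Validate the NHWC path only when it was actually selected."""
    nhwc_count = op_counts.get(NHWC_OP, 0)
    if nhwc_count > 0:
        # The NHWC rewrite must only appear when KleidiAI NHWC support exists.
        if not kleidi_nhwc_available:
            raise AssertionError("NhwcFusedConv generated without KleidiAI NHWC support")
        # Assumed expected graph shape for this sketch.
        if nhwc_count != 1 or op_counts.get("Conv", 0) != 0:
            raise AssertionError("unexpected NhwcFusedConv graph shape")
        return "nhwc"
    # Another valid layout optimisation ran first: not a failure.
    return "other"


print(check_nhwc_rewrite({NHWC_OP: 1}, True))
```

The key design point is the early return in the final branch: absence of `NhwcFusedConv` is accepted, while its presence is still validated strictly.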

### Motivation and Context

This change fixes the 2 mentioned unit tests which fail when ONNX
Runtime is built and tested with `--enable_arm_neon_nchwc`.

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
5 participants