fix: prevent int16 overflow in NEON non-dotprod fallback path #459

Open
Mindev27 wants to merge 1 commit into microsoft:main from Mindev27:fix/neon-int16-overflow
@Mindev27 Mindev27 commented Mar 14, 2026

The non-dotprod NEON fallback accumulates vmlal_s8 results in an int16x8_t across 32 loop iterations (256 multiply-adds total) before widening. The accumulator overflows the int16 range, which produces garbage output on ARMv8.0 CPUs such as Cortex-A53/A73 that lack the dotprod extension.
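To make the overflow concrete, here is a minimal scalar sketch of a single int16 accumulator lane. It is not the kernel code: the helper names are hypothetical, the operand value 127 is an illustrative worst case rather than a value taken from this PR, and the wrap-around is modeled by hand on the assumption that vmlal_s8 does not saturate. Since 127 × 127 = 16129, a third product already pushes the lane past 32767.

```c
#include <stdint.h>

/* Reduce a value to the int16 range the way a non-saturating
 * 16-bit lane wraps around (modulo 2^16, two's complement). */
static int32_t wrap_to_int16(int32_t x)
{
    uint32_t u = (uint32_t)x & 0xFFFFu;
    return u >= 0x8000u ? (int32_t)u - 65536 : (int32_t)u;
}

/* Hypothetical model of one int16 lane: acc += a * b, n times,
 * wrapping after every step. */
int32_t lane_accumulate_int16(int8_t a, int8_t b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc = wrap_to_int16(acc + (int32_t)a * (int32_t)b);
    return acc;
}

/* Same accumulation held in int32, which has ample headroom here. */
int32_t lane_accumulate_int32(int8_t a, int8_t b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (int32_t)a * (int32_t)b;
    return acc;
}
```

With a = b = 127 and n = 32, the int32 lane holds 516128 while the modeled int16 lane has wrapped to -8160. Whether the real kernel reaches this depends on its operand ranges, which this description does not state.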

Fix by widening to int32 on every iteration instead of once every 32 iterations. The change is applied to all three dot product variants (1x1, 1xN, Nx1).
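One possible shape for per-iteration widening is sketched below. This is an assumption about the approach, not the code in this PR: the function name is hypothetical, the input length is assumed to be a multiple of 8, and the choice of vmull_s8 followed by vpadalq_s16 is just one way to move each int16 product vector into an int32 accumulator. A scalar fallback is included so the sketch also builds on non-ARM hosts.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: int8 dot product that widens to int32 on every
 * iteration. Assumes n is a multiple of 8. */
int32_t dot_i8_widen_each_iter(const int8_t *a, const int8_t *b, size_t n)
{
#if defined(__ARM_NEON) && defined(__aarch64__)
    int32x4_t acc32 = vdupq_n_s32(0);
    for (size_t i = 0; i < n; i += 8) {
        /* A single int8 x int8 product always fits in int16. */
        int16x8_t prod = vmull_s8(vld1_s8(a + i), vld1_s8(b + i));
        /* Pairwise add the int16 products into the int32 lanes, so no
         * running sum is ever kept in 16 bits. */
        acc32 = vpadalq_s16(acc32, prod);
    }
    return vaddvq_s32(acc32); /* horizontal sum of the four lanes */
#else
    /* Portable scalar fallback with the same result. */
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
#endif
}
```

The trade-off is one extra widening instruction per iteration in exchange for never holding a running sum in 16 bits.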

Partially addresses #411 (items 5 and 6).

Tested on an M4 Pro (dotprod path), where inference runs fine. I don't have ARMv8.0 hardware to test the non-dotprod path directly.

@Mindev27
Author

@microsoft-github-policy-service agree
