Numba IndexedElemwise: Faster indexed increments when indices do not alias #2250
Draft
ricardoV94 wants to merge 2 commits into
Port the scratch-slot store redirection and the scoped alias.scope/noalias metadata from 605789e (reduction_fusion_with_scope_markers) so this branch is self-contained regardless of PR merge order. The scratch slot moves each 0-d output's real store into make_loop_call (SROA collapses it back onto the inner 'o += t'), which gives a load/store instruction we control. That instruction is the hook the no-dup promise needs in order to attach llvm.access.group / parallel_accesses metadata. Adapted from 605789e: dropped the unused string_codegen import (this branch's store_core_outputs still uses compile_numba_function_src).
Fusing a value-compute into an `inc`-scatter normally scalarizes the whole loop:
the read-modify-write on out[idx[i]] is a possible cross-iteration dependency the
LoopVectorizer can't rule out. When the indices are statically known to be distinct
there is no such dependency, so we promise it to LLVM with an access group on every
loop memory op plus `llvm.loop.parallel_accesses` on the latch -- the value-compute
then vectorizes (LLVM scalarizes only the indexed stores, which AVX2 has no scatter for).
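To illustrate the dependency described above, here is a minimal NumPy sketch (not code from this PR). The helper names `inc_sequential` and `inc_batched` are hypothetical: the first models the scalar loop, the second models a gather/add/scatter that processes all lanes at once, which is roughly what a vectorized loop would do. The two agree only when the indices are distinct.

```python
import numpy as np


def inc_sequential(out, idx, vals):
    """Scalar read-modify-write loop: each iteration sees earlier updates."""
    out = out.copy()
    for i in range(len(idx)):
        out[idx[i]] += vals[i]
    return out


def inc_batched(out, idx, vals):
    """Gather all lanes, add, scatter back: lanes do not see each other."""
    out = out.copy()
    out[idx] += vals
    return out


base = np.zeros(4)
vals = np.array([1.0, 2.0, 3.0])
unique_idx = np.array([0, 2, 3])  # no two iterations touch the same slot
dup_idx = np.array([0, 2, 0])     # iterations 0 and 2 both touch slot 0

# With distinct indices the two strategies are interchangeable.
unique_agree = np.array_equal(
    inc_sequential(base, unique_idx, vals), inc_batched(base, unique_idx, vals)
)

# With a repeated index the batched version drops one of the increments.
dup_agree = np.array_equal(
    inc_sequential(base, dup_idx, vals), inc_batched(base, dup_idx, vals)
)
```

This is only an analogy for why the promise has to be gated on uniqueness; the actual change works at the LLVM metadata level, not in NumPy.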
- FuseIndexedElemwise gates the promise on `_has_unique_indices` (a constant with unique
entries, or a `unique_indices` assumption), only for `inc` (`set` already vectorizes),
recording it as a 4th `indexed_outputs` field that flows into the cache key.
- make_loop_call emits a `distinct !{}` access group (a small MDValue subclass, since
llvmlite can't emit distinct nodes and a uniqued !{} crashes the verifier) and tags
the indexed RMW load/store, the index loads, and the input loads -- every memory op
must join the group or LLVM's isAnnotatedParallel rejects the loop.
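As a rough sketch of the "constant with unique entries" gate, the check could look like the following. This is not the PR's `_has_unique_indices`; the function name `has_unique_constant_indices`, its `axis_len` parameter, and the negative-index normalization are all assumptions made for illustration.

```python
import numpy as np


def has_unique_constant_indices(idx, axis_len):
    """Hypothetical check: True if no two entries address the same slot.

    Assumes `idx` is a constant integer array indexing an axis of length
    `axis_len`. Negative entries are normalized first, since -1 and
    axis_len - 1 refer to the same element.
    """
    idx = np.asarray(idx)
    if idx.dtype.kind not in "iu":
        # Only integer index arrays are considered in this sketch.
        return False
    normalized = np.where(idx < 0, idx + axis_len, idx)
    return bool(np.unique(normalized).size == normalized.size)


distinct = has_unique_constant_indices([0, 2, 3], axis_len=4)
repeated = has_unique_constant_indices([0, 2, 0], axis_len=4)
aliased_by_sign = has_unique_constant_indices([0, -4], axis_len=4)
```

A conservative check like this errs toward returning False, which only costs the vectorization win; a false positive would make the promise to LLVM unsound.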
Builds on the scratch-slot store redirection (prior commit). Validated: the arithmetic
value-compute goes from 0 to 3 packed ops with the promise, and results are unchanged with or without it.
The transcendental win will compose once a vector math library is wired in (orthogonal veclib work).