Numba IndexedElemwise: Faster indexed increments when indices do not alias #2250

Draft
ricardoV94 wants to merge 2 commits into pymc-devs:main from ricardoV94:no_dups_promise

ricardoV94 (Member) commented Jun 20, 2026

import numpy as np
import pytensor
import pytensor.tensor as pt
from pytensor.assumptions import assume

def make(use_assumption, n=8192):
    x = pt.vector("x")
    v = pt.vector("v")
    idx = pt.vector("idx", dtype="int64")
    idx_used = assume(idx, unique_indices=True) if use_assumption else idx
    val = pt.sqrt(v * v + 1.0)                       # vectorizable value-compute
    out = x[idx_used].inc(val)
    return pytensor.function([x, idx, v], out, trust_input=True)

n = 8192
rng = np.random.default_rng(0)
out0 = rng.random(n)
idxs = rng.permutation(n).astype("int64")            # unique -> the assumption holds
vals = rng.random(n)

fn_before = make(use_assumption=False)               # scatter RMW poisons the loop -> scalar
fn_after  = make(use_assumption=True)                # unique-index promise -> vectorizes

assert np.allclose(
    fn_before(out0.copy(), idxs, vals),
    fn_after(out0.copy(), idxs, vals),
)

%timeit fn_before(out0.copy(), idxs, vals)   # ~27 µs
%timeit fn_after(out0.copy(), idxs, vals)    # ~20 µs  (~1.4x)
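
For context on what the promise relies on, here is a NumPy-only sketch of the semantics involved. It assumes `x[idx].inc(val)` accumulates over repeated indices the way `np.add.at` does, which is my reading of PyTensor's behavior rather than something stated in this PR; the array names are made up for illustration.

```python
import numpy as np

x = np.zeros(4)
vals = np.array([1.0, 2.0, 3.0])

# Duplicate indices: slot 1 is addressed twice.
dup_idx = np.array([1, 1, 3])

accumulated = x.copy()
np.add.at(accumulated, dup_idx, vals)  # unbuffered: both updates to slot 1 land

buffered = x.copy()
buffered[dup_idx] += vals              # buffered: one update to slot 1 is lost

# Unique indices: both strategies give the same result.
uniq_idx = np.array([2, 0, 3])

acc_u = x.copy()
np.add.at(acc_u, uniq_idx, vals)

buf_u = x.copy()
buf_u[uniq_idx] += vals

print(accumulated, buffered)
print(np.array_equal(acc_u, buf_u))
```

With a permutation such as `idxs` in the benchmark, no slot is addressed twice, so the order in which the updates are applied cannot change the result.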

Port the scratch-slot store redirection and the scoped alias.scope/noalias metadata
from 605789e (reduction_fusion_with_scope_markers) so that this branch is self-contained
regardless of PR merge order. The scratch slot moves each 0-d output's real
store into make_loop_call (SROA collapses it back onto the inner 'o += t'),
which gives us a load/store instruction we control: the hook the no-dup promise needs
in order to attach llvm.access.group / llvm.loop.parallel_accesses metadata.

Adapted from 605789e: dropped the unused string_codegen import (this branch's
store_core_outputs still uses compile_numba_function_src).

Fusing a value-compute into an `inc`-scatter normally scalarizes the whole loop:
the read-modify-write on out[idx[i]] is a possible cross-iteration dependency that the
LoopVectorizer can't rule out. When the indices are statically known to be distinct
there is no such dependency, so we promise that to LLVM by putting every memory op in
the loop into an access group and tagging the latch with `llvm.loop.parallel_accesses`.
The value-compute then vectorizes; LLVM scalarizes only the indexed stores, because
AVX2 has no scatter instruction.
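
The dependency can be seen in a toy model. The sketch below is plain Python, not the generated code: `inc_in_lanes` is a hypothetical stand-in for a vectorized loop body that performs all loads of a chunk before any store, and the chunk width of 4 is an arbitrary assumption.

```python
def inc_scalar(out, idx, vals):
    """Reference semantics: one read-modify-write per iteration, in order."""
    out = list(out)
    for i, v in zip(idx, vals):
        out[i] += v
    return out


def inc_in_lanes(out, idx, vals, width=4):
    """Toy model of a vectorized loop: per chunk, load all, add all, store all."""
    out = list(out)
    for start in range(0, len(idx), width):
        lanes = idx[start:start + width]
        chunk = vals[start:start + width]
        loaded = [out[i] for i in lanes]               # all loads happen first
        summed = [a + v for a, v in zip(loaded, chunk)]
        for i, s in zip(lanes, summed):                # all stores happen last
            out[i] = s
    return out


base = [0.0] * 4
vals = [1.0, 2.0, 3.0, 4.0]

unique = [3, 0, 2, 1]    # no two iterations touch the same slot
aliased = [1, 1, 2, 1]   # three iterations touch slot 1

print(inc_scalar(base, unique, vals) == inc_in_lanes(base, unique, vals))
print(inc_scalar(base, aliased, vals), inc_in_lanes(base, aliased, vals))
```

With distinct indices the two orderings agree. With aliased indices the lane model reads slot 1 before earlier lanes have written it, so updates are lost; that is the case the vectorizer has to assume unless it is told otherwise.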

- FuseIndexedElemwise gates the promise on `_has_unique_indices` (a constant with unique
  entries, or a `unique_indices` assumption), only for `inc` (`set` already vectorizes),
  recording it as a 4th `indexed_outputs` field that flows into the cache key.
- make_loop_call emits a `distinct !{}` access group (a small MDValue subclass, since
  llvmlite can't emit distinct nodes and a uniqued !{} crashes the verifier) and tags
  the indexed RMW load/store, the index loads, and the input loads -- every memory op
  must join the group or LLVM's isAnnotatedParallel rejects the loop.
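
As an illustration of the constant-index branch of that gate, a uniqueness check could look roughly like the following. `constant_indices_are_unique` is a hypothetical helper, not the PR's `_has_unique_indices`; in particular, normalizing negative indices against a known axis length is an assumption on my part about what "unique" has to mean for the promise to be safe.

```python
def constant_indices_are_unique(indices, axis_length=None):
    """Return True only when no two entries can address the same element.

    Hypothetical sketch: answers False whenever uniqueness cannot be proven.
    """
    indices = [int(i) for i in indices]
    if any(i < 0 for i in indices):
        if axis_length is None:
            # -1 and axis_length - 1 address the same slot; without the
            # length we cannot rule that out, so refuse the promise.
            return False
        # Map negative indices onto their non-negative equivalents.
        indices = [i + axis_length if i < 0 else i for i in indices]
    return len(set(indices)) == len(indices)


print(constant_indices_are_unique([0, 2, 1]))
print(constant_indices_are_unique([0, 2, 0]))
print(constant_indices_are_unique([0, -1], axis_length=1))
print(constant_indices_are_unique([0, -1], axis_length=3))
```

Answering False when in doubt keeps the default path correct: the loop merely stays scalar, whereas a wrong True would hand LLVM a promise that does not hold.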

Builds on the scratch-slot store redirection (prior commit). Validated: an arithmetic
value-compute goes from 0 to 3 packed ops with the promise, and results are unchanged
with or without it. The transcendental win composes once a vector math library is
wired in (orthogonal veclib work).