Numba IndexedElemwise: Faster indexed increments when indices do not alias by ricardoV94 · Pull Request #2250 · pymc-devs/pytensor

ricardoV94 · 2026-06-20T21:00:41Z

```python
import numpy as np
import pytensor
import pytensor.tensor as pt
from pytensor.assumptions import assume

def make(use_assumption, n=8192):
    x = pt.vector("x")
    v = pt.vector("v")
    idx = pt.vector("idx", dtype="int64")
    idx_used = assume(idx, unique_indices=True) if use_assumption else idx
    val = pt.sqrt(v * v + 1.0)                       # vectorizable value-compute
    out = x[idx_used].inc(val)
    return pytensor.function([x, idx, v], out, trust_input=True)

n = 8192
rng = np.random.default_rng(0)
out0 = rng.random(n)
idxs = rng.permutation(n).astype("int64")            # unique -> the assumption holds
vals = rng.random(n)

fn_before = make(use_assumption=False)               # scatter RMW poisons the loop -> scalar
fn_after  = make(use_assumption=True)                # unique-index promise -> vectorizes

assert np.allclose(
    fn_before(out0.copy(), idxs, vals),
    fn_after(out0.copy(), idxs, vals),
)

%timeit fn_before(out0.copy(), idxs, vals)   # ~27 µs
%timeit fn_after(out0.copy(), idxs, vals)    # ~20 µs  (~1.4x)
```

Port the scratch-slot store redirection and the scoped alias.scope/noalias metadata from 605789e (reduction_fusion_with_scope_markers) so that this branch is self-contained regardless of PR merge order. The scratch slot moves each 0-d output's real store into make_loop_call (SROA collapses it back onto the inner `o += t`). This gives us a load/store instruction we control, which is the hook the no-dup promise needs in order to attach llvm.access.group / parallel_accesses metadata. Adapted from 605789e: dropped the unused string_codegen import (this branch's store_core_outputs still uses compile_numba_function_src).

Fusing a value-compute into an `inc`-scatter normally scalarizes the whole loop: the read-modify-write on `out[idx[i]]` is a possible cross-iteration dependency that the LoopVectorizer can't rule out. When the indices are statically known to be distinct there is no such dependency, so we promise this to LLVM with an access group on every loop memory op plus `llvm.loop.parallel_accesses` on the latch. The value-compute then vectorizes (LLVM scalarizes only the indexed stores, for which AVX2 has no scatter instruction).

- FuseIndexedElemwise gates the promise on `_has_unique_indices` (a constant with unique entries, or a `unique_indices` assumption), only for `inc` (`set` already vectorizes), recording it as a 4th `indexed_outputs` field that flows into the cache key.
- make_loop_call emits a `distinct !{}` access group (a small MDValue subclass, since llvmlite can't emit distinct nodes and a uniqued `!{}` crashes the verifier) and tags the indexed RMW load/store, the index loads, and the input loads. Every memory op must join the group or LLVM's isAnnotatedParallel rejects the loop.

Builds on the scratch-slot store redirection (prior commit).

Validated: arithmetic value-compute goes 0 -> 3 packed ops with the promise; results unchanged with/without it. The transcendental win composes once a vector math library is wired (orthogonal veclib work).
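To make the cross-iteration dependency concrete, here is a minimal NumPy-only sketch. It is not PyTensor or Numba code, and the helper names `inc_sequential`, `inc_batched`, and `has_unique_entries` are hypothetical, introduced only for illustration. It compares a one-element-at-a-time increment with a batched gather/add/scatter, which is roughly the shape a vectorized loop takes. The two agree when the indices are distinct and diverge when an index repeats, which is why the promise has to be gated on uniqueness.

```python
import numpy as np


def inc_sequential(x, idx, val):
    """Reference semantics: one read-modify-write per element, in order."""
    out = x.copy()
    for i in range(idx.shape[0]):
        out[idx[i]] += val[i]
    return out


def inc_batched(x, idx, val):
    """Gather all targets, add, then scatter back.

    Every read happens before any write, so a repeated index loses
    the increments of all but one of its occurrences.
    """
    out = x.copy()
    out[idx] = out[idx] + val
    return out


def has_unique_entries(idx):
    """Hypothetical stand-in for a uniqueness check on constant indices."""
    return np.unique(idx).size == idx.size


x = np.zeros(4)
val = np.array([1.0, 2.0, 3.0, 4.0])
unique_idx = np.array([2, 0, 3, 1])   # a permutation: no aliasing
dup_idx = np.array([0, 0, 1, 1])      # repeated targets: aliasing

# Distinct indices: iteration order is irrelevant, both strategies agree.
same_when_unique = np.allclose(
    inc_sequential(x, unique_idx, val), inc_batched(x, unique_idx, val)
)

# Repeated indices: the batched version drops increments.
same_when_dup = np.allclose(
    inc_sequential(x, dup_idx, val), inc_batched(x, dup_idx, val)
)

# np.add.at is NumPy's unbuffered scatter-add and matches the sequential loop.
accumulated = x.copy()
np.add.at(accumulated, dup_idx, val)
```

In this sketch `same_when_unique` is true and `same_when_dup` is false, while `np.add.at` reproduces the sequential result even with duplicates. The benchmark in the PR description uses `rng.permutation(n)` for the same reason: a permutation is guaranteed to satisfy the uniqueness assumption.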

ricardoV94 added 2 commits June 20, 2026 19:27

ricardoV94 added numba performance labels Jun 20, 2026

ricardoV94 wants to merge 2 commits into pymc-devs:main from ricardoV94:no_dups_promise
