Numba: SIMD friendly log1p, expm1, and softplus by ricardoV94 · Pull Request #2246 · pymc-devs/pytensor

ricardoV94 · 2026-06-19T10:52:29Z

No description provided.

Add the numba__veclib config option naming the vector math library wired into LLVM, fold it into the Numba compile-cache key so a function built against one library is never reused under a different (or no) library — which would segfault on the unresolved symbol — and add scripts/check_numba_veclib.py to verify the wiring. No lowering uses it yet.

The SIMD/cache-friendly transcendental lowerings rewrite log1p as a corrected log(1+x) and expm1 as a polynomial + exp, both of which lower to the `log`/`exp` a vector math library can vectorize. A benchmark investigation showed these forms are a *scalar* data-dependent trade, not a uniform win, so they are now gated to where they actually help. Root cause: glibc's scalar `log` is slow for arguments near 1 while scalar `log1p` has a fast small-argument path (and float32 `log1pf` / `expm1f` are relatively slow). So with no vector library: - float64 log1p via corrected `log` REGRESSES in the small-argument regime log1p exists for: 0.77x at |x|~1e-4, worst 0.57x at |x|~0.1, 0.91x for |x|<0.5. It only wins (~1.9x) once inputs reach past ~0.5, which the earlier (-0.5, 5) microbenchmark happened to cover. - float32 log1p / expm1 win across the whole domain (1.2-2.8x). - float64 expm1's polynomial is slower than scalar `expm1` near 0. A vector library makes the forms branch-free and time-invariant (libmvec vector `log` is flat ~12ms across the domain; the corrected log1p flat ~16ms vs native scalar log1p's data-dependent 27-65ms), so under `numba__veclib` they are uniformly fastest. Accuracy then drops to the library's ~4 ulp (vs ~1-2 ulp scalar) -- an opt-in trade; expm1's polynomial branch stays ~1 ulp (it carries no transcendental). Changes: - Log1p: corrected form only for float32 or when `numba__veclib` is set; scalar `np.log1p` by default on float64. - Expm1: polynomial unconditional for float32 (a win even with no library); float64 stays gated on `numba__veclib`. - Log1mexp: revert the corrected-log reuse to scalar `log1p` (the Maechler form). Its data-dependent branch always feeds log1p small arguments, so the corrected form regressed it ~20-30% on float64 for no accuracy gain, and it never vectorizes there (branch-bound). - Softplus is unchanged: the branchless restructuring keeps it faster than the cascade in every regime, so its corrected log1p is not a regression (only a ~1 ulp accuracy difference). Cache keys carry a `veclib` flag so a scalar-built artifact is never reused for the corrected/poly lowering, or vice versa. Claude-Session: https://claude.ai/code/session_01HzV22gdeXoEAasiZR1MvYq

sigmoid lowered as `1 / (1 + exp(-x))` does not vectorize under a vector math library: the bare division stops LLVM's loop vectorizer from replacing `exp` with a libmvec call, so it stays a scalar `exp` loop. A division-guard select fixes this. `d = 1 + exp(-x); 0 if d == 0 else 1/d` vectorizes where `1/(1+exp(-x))` does not -- the `d == 0` guard is dead (1 + exp(-x) >= 1 always), but its presence lets the vectorizer pull the division through (the same mechanism that makes `_log1p_via_log`'s `um1 == 0` select vectorize). Measured under libmvec: 2.74x (float64) / 5.77x (float32). Gated like Log1p/Expm1, since the dead select costs ~5% on float64 with no library: the select form is used for float32 (a small win even scalar) or when `numba__veclib` is set; float64 with no library keeps the plain scalar form. Literals carry x's dtype via `type(x)(...)` so float32 stays float32 (a bare `1` would unify to float64 and halve the SIMD width). Accuracy is ~2 ulp scalar and ~3 ulp under libmvec (the vector `exp`) -- the same opt-in trade as the other lowerings. The uint-input path uses the same gated form (its float32/float64 upcast determines the gate). Cache key carries the `veclib` flag and is bumped so scalar- and select-built artifacts never cross-reuse. Claude-Session: https://claude.ai/code/session_01HzV22gdeXoEAasiZR1MvYq

Fix MLX variadic Elemwise crash on broadcast-shaped inputs (pymc-devs…

5187c05

…#2235)

ricardoV94 force-pushed the vectorizable_log1p_expm1 branch 2 times, most recently from a50bc7b to 833f56d Compare June 20, 2026 14:32

ricardoV94 added 3 commits June 20, 2026 16:35

ricardoV94 force-pushed the vectorizable_log1p_expm1 branch from 833f56d to b094073 Compare June 20, 2026 21:13

ricardoV94 changed the title ~~Numba: SIMD friendly log1p and expm1~~ Numba: SIMD friendly log1p, expm1, and softplus Jun 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Numba: SIMD friendly log1p, expm1, and softplus#2246

Numba: SIMD friendly log1p, expm1, and softplus#2246
ricardoV94 wants to merge 4 commits into
pymc-devs:mainfrom
ricardoV94:vectorizable_log1p_expm1

ricardoV94 commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ricardoV94 commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants