Skip to content

Numba: SIMD friendly log1p, expm1, and softplus#2246

Draft
ricardoV94 wants to merge 4 commits into
pymc-devs:mainfrom
ricardoV94:vectorizable_log1p_expm1
Draft

Numba: SIMD friendly log1p, expm1, and softplus#2246
ricardoV94 wants to merge 4 commits into
pymc-devs:mainfrom
ricardoV94:vectorizable_log1p_expm1

Conversation

@ricardoV94

Copy link
Copy Markdown
Member

No description provided.

@ricardoV94 ricardoV94 force-pushed the vectorizable_log1p_expm1 branch 2 times, most recently from a50bc7b to 833f56d Compare June 20, 2026 14:32
Add the numba__veclib config option naming the vector math library wired into
LLVM, fold it into the Numba compile-cache key so a function built against one
library is never reused under a different (or no) library — which would segfault
on the unresolved symbol — and add scripts/check_numba_veclib.py to verify the
wiring. No lowering uses it yet.
The SIMD/cache-friendly transcendental lowerings rewrite log1p as a
corrected log(1+x) and expm1 as a polynomial + exp, both of which lower
to the `log`/`exp` a vector math library can vectorize. A benchmark
investigation showed these forms are a *scalar* data-dependent trade,
not a uniform win, so they are now gated to where they actually help.

Root cause: glibc's scalar `log` is slow for arguments near 1 while
scalar `log1p` has a fast small-argument path (and float32 `log1pf` /
`expm1f` are relatively slow). So with no vector library:

  - float64 log1p via corrected `log` REGRESSES in the small-argument
    regime log1p exists for: 0.77x at |x|~1e-4, worst 0.57x at |x|~0.1,
    0.91x for |x|<0.5. It only wins (~1.9x) once inputs reach past ~0.5,
    which the earlier (-0.5, 5) microbenchmark happened to cover.
  - float32 log1p / expm1 win across the whole domain (1.2-2.8x).
  - float64 expm1's polynomial is slower than scalar `expm1` near 0.

A vector library makes the forms branch-free and time-invariant (libmvec
vector `log` is flat ~12ms across the domain; the corrected log1p flat
~16ms vs native scalar log1p's data-dependent 27-65ms), so under
`numba__veclib` they are uniformly fastest. Accuracy then drops to the
library's ~4 ulp (vs ~1-2 ulp scalar) -- an opt-in trade; expm1's
polynomial branch stays ~1 ulp (it carries no transcendental).

Changes:
  - Log1p: corrected form only for float32 or when `numba__veclib` is
    set; scalar `np.log1p` by default on float64.
  - Expm1: polynomial unconditional for float32 (a win even with no
    library); float64 stays gated on `numba__veclib`.
  - Log1mexp: revert the corrected-log reuse to scalar `log1p` (the
    Maechler form). Its data-dependent branch always feeds log1p small
    arguments, so the corrected form regressed it ~20-30% on float64 for
    no accuracy gain, and it never vectorizes there (branch-bound).
  - Softplus is unchanged: the branchless restructuring keeps it faster
    than the cascade in every regime, so its corrected log1p is not a
    regression (only a ~1 ulp accuracy difference).

Cache keys carry a `veclib` flag so a scalar-built artifact is never
reused for the corrected/poly lowering, or vice versa.

Claude-Session: https://claude.ai/code/session_01HzV22gdeXoEAasiZR1MvYq
sigmoid lowered as `1 / (1 + exp(-x))` does not vectorize under a vector
math library: the bare division stops LLVM's loop vectorizer from
replacing `exp` with a libmvec call, so it stays a scalar `exp` loop.

A division-guard select fixes this. `d = 1 + exp(-x); 0 if d == 0 else
1/d` vectorizes where `1/(1+exp(-x))` does not -- the `d == 0` guard is
dead (1 + exp(-x) >= 1 always), but its presence lets the vectorizer pull
the division through (the same mechanism that makes `_log1p_via_log`'s
`um1 == 0` select vectorize). Measured under libmvec: 2.74x (float64) /
5.77x (float32).

Gated like Log1p/Expm1, since the dead select costs ~5% on float64 with
no library: the select form is used for float32 (a small win even scalar)
or when `numba__veclib` is set; float64 with no library keeps the plain
scalar form. Literals carry x's dtype via `type(x)(...)` so float32 stays
float32 (a bare `1` would unify to float64 and halve the SIMD width).
Accuracy is ~2 ulp scalar and ~3 ulp under libmvec (the vector `exp`) --
the same opt-in trade as the other lowerings. The uint-input path uses
the same gated form (its float32/float64 upcast determines the gate).

Cache key carries the `veclib` flag and is bumped so scalar- and
select-built artifacts never cross-reuse.

Claude-Session: https://claude.ai/code/session_01HzV22gdeXoEAasiZR1MvYq
@ricardoV94 ricardoV94 force-pushed the vectorizable_log1p_expm1 branch from 833f56d to b094073 Compare June 20, 2026 21:13
@ricardoV94 ricardoV94 changed the title Numba: SIMD friendly log1p and expm1 Numba: SIMD friendly log1p, expm1, and softplus Jun 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants