Numba: SIMD friendly log1p, expm1, and softplus#2246
Draft
ricardoV94 wants to merge 4 commits into
Draft
Conversation
a50bc7b to
833f56d
Compare
Add the numba__veclib config option naming the vector math library wired into LLVM, fold it into the Numba compile-cache key so a function built against one library is never reused under a different (or no) library — which would segfault on the unresolved symbol — and add scripts/check_numba_veclib.py to verify the wiring. No lowering uses it yet.
The SIMD/cache-friendly transcendental lowerings rewrite log1p as a
corrected log(1+x) and expm1 as a polynomial + exp, both of which lower
to the `log`/`exp` a vector math library can vectorize. A benchmark
investigation showed these forms are a *scalar* data-dependent trade,
not a uniform win, so they are now gated to where they actually help.
Root cause: glibc's scalar `log` is slow for arguments near 1 while
scalar `log1p` has a fast small-argument path (and float32 `log1pf` /
`expm1f` are relatively slow). So with no vector library:
- float64 log1p via corrected `log` REGRESSES in the small-argument
regime log1p exists for: 0.77x at |x|~1e-4, worst 0.57x at |x|~0.1,
0.91x for |x|<0.5. It only wins (~1.9x) once inputs reach past ~0.5,
which the earlier (-0.5, 5) microbenchmark happened to cover.
- float32 log1p / expm1 win across the whole domain (1.2-2.8x).
- float64 expm1's polynomial is slower than scalar `expm1` near 0.
A vector library makes the forms branch-free and time-invariant (libmvec
vector `log` is flat ~12ms across the domain; the corrected log1p flat
~16ms vs native scalar log1p's data-dependent 27-65ms), so under
`numba__veclib` they are uniformly fastest. Accuracy then drops to the
library's ~4 ulp (vs ~1-2 ulp scalar) -- an opt-in trade; expm1's
polynomial branch stays ~1 ulp (it carries no transcendental).
Changes:
- Log1p: corrected form only for float32 or when `numba__veclib` is
set; scalar `np.log1p` by default on float64.
- Expm1: polynomial unconditional for float32 (a win even with no
library); float64 stays gated on `numba__veclib`.
- Log1mexp: revert the corrected-log reuse to scalar `log1p` (the
Maechler form). Its data-dependent branch always feeds log1p small
arguments, so the corrected form regressed it ~20-30% on float64 for
no accuracy gain, and it never vectorizes there (branch-bound).
- Softplus is unchanged: the branchless restructuring keeps it faster
than the cascade in every regime, so its corrected log1p is not a
regression (only a ~1 ulp accuracy difference).
Cache keys carry a `veclib` flag so a scalar-built artifact is never
reused for the corrected/poly lowering, or vice versa.
Claude-Session: https://claude.ai/code/session_01HzV22gdeXoEAasiZR1MvYq
sigmoid lowered as `1 / (1 + exp(-x))` does not vectorize under a vector math library: the bare division stops LLVM's loop vectorizer from replacing `exp` with a libmvec call, so it stays a scalar `exp` loop. A division-guard select fixes this. `d = 1 + exp(-x); 0 if d == 0 else 1/d` vectorizes where `1/(1+exp(-x))` does not -- the `d == 0` guard is dead (1 + exp(-x) >= 1 always), but its presence lets the vectorizer pull the division through (the same mechanism that makes `_log1p_via_log`'s `um1 == 0` select vectorize). Measured under libmvec: 2.74x (float64) / 5.77x (float32). Gated like Log1p/Expm1, since the dead select costs ~5% on float64 with no library: the select form is used for float32 (a small win even scalar) or when `numba__veclib` is set; float64 with no library keeps the plain scalar form. Literals carry x's dtype via `type(x)(...)` so float32 stays float32 (a bare `1` would unify to float64 and halve the SIMD width). Accuracy is ~2 ulp scalar and ~3 ulp under libmvec (the vector `exp`) -- the same opt-in trade as the other lowerings. The uint-input path uses the same gated form (its float32/float64 upcast determines the gate). Cache key carries the `veclib` flag and is bumped so scalar- and select-built artifacts never cross-reuse. Claude-Session: https://claude.ai/code/session_01HzV22gdeXoEAasiZR1MvYq
833f56d to
b094073
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.