Numba: SIMD friendly log1p, expm1, and softplus#2246
Draft
ricardoV94 wants to merge 3 commits into
Draft
Conversation
833f56d to
b094073
Compare
01e7b75 to
74ed33b
Compare
Add the numba__veclib config option naming the vector math library wired into LLVM, fold it into the Numba compile-cache key so a function built against one library is never reused under a different (or no) library — which would segfault on the unresolved symbol — and add scripts/check_numba_veclib.py to verify the wiring. No lowering uses it yet.
…clib
The SIMD/cache-friendly transcendental lowerings rewrite log1p as a
corrected log(1+x), expm1 as a polynomial + exp, and softplus as a
branchless max(x,0)+log1p(exp(-|x|)) -- all of which lower to the
`log`/`exp` a vector math library can vectorize. A benchmark
investigation showed these forms trade ~1 ulp of accuracy for speed, so
they are gated on the flags that opt into that trade rather than applied
unconditionally.
Root cause: glibc's scalar `log` is slow for arguments near 1 while
scalar `log1p` has a fast small-argument path (and float32 `log1pf` /
`expm1f` are relatively slow). So with no vector library:
- float64 log1p via corrected `log` REGRESSES in the small-argument
regime log1p exists for: 0.77x at |x|~1e-4, worst 0.57x at |x|~0.1,
0.91x for |x|<0.5. It only wins (~1.9x) once inputs reach past ~0.5,
which the earlier (-0.5, 5) microbenchmark happened to cover.
- float32 log1p / expm1 win across the whole domain (1.2-2.8x).
- float64 expm1's polynomial is slower than scalar `expm1` near 0.
A vector library makes the forms branch-free and time-invariant (libmvec
vector `log` is flat ~12ms across the domain; the corrected log1p flat
~16ms vs native scalar log1p's data-dependent 27-65ms), so under
`numba__veclib` they are uniformly fastest. Accuracy then drops to the
library's ~4 ulp (vs ~1-2 ulp scalar) -- an opt-in trade; expm1's
polynomial branch stays ~1 ulp (it carries no transcendental).
Gating. The scalar speedup is a precision-for-speed trade, so it follows
`numba__fastmath` (the existing such flag, default on -- so default
behavior is unchanged, but `numba__fastmath=False` now restores the
accurate forms). A vector library always enables the rewrite too, since
it needs the rewritten form to vectorize and already accepts ~4 ulp. The
float64 corrected/poly forms are the exception: they are SLOWER scalar,
so fastmath (which means "go faster") must not select them -- only a
vector library does.
- Log1p: corrected form under a vector library (both dtypes) or on
float32 under `numba__fastmath`; scalar `np.log1p` otherwise (e.g.
float64 with no library).
- Expm1: polynomial under a vector library (both dtypes) or on float32
under `numba__fastmath`; scalar `expm1` otherwise.
- Softplus: branchless form (its log1p via the corrected `log`) under a
vector library or `numba__fastmath` -- both dtypes, faster for the
common mixed-sign case and vectorizable; the accurate Maechler
cascade otherwise.
- Log1mexp: scalar `log1p` (the Maechler form). The corrected-log reuse
was reverted -- its data-dependent branch always feeds log1p small
arguments, so the corrected form regressed it ~20-30% on float64 for
no accuracy gain, and it never vectorizes there (branch-bound).
Cache keys carry a flag recording which source each Op emitted (corrected
/ poly / branchless vs the scalar fallback) so the two are never reused
for each other.
Claude-Session: https://claude.ai/code/session_01HzV22gdeXoEAasiZR1MvYq
sigmoid lowered as `1 / (1 + exp(-x))` does not vectorize under a vector math library: the bare division stops LLVM's loop vectorizer from replacing `exp` with a libmvec call, so it stays a scalar `exp` loop. A division-guard select fixes this. `d = 1 + exp(-x); 0 if d == 0 else 1/d` vectorizes where `1/(1+exp(-x))` does not -- the `d == 0` guard is dead (1 + exp(-x) >= 1 always), but its presence lets the vectorizer pull the division through (the same mechanism that makes `_log1p_via_log`'s `um1 == 0` select vectorize). Measured under libmvec: 2.74x (float64) / 5.77x (float32). Gated solely on `numba__veclib`. Unlike Log1p/Expm1/Softplus the select is not a precision-for-speed trade: the dead guard makes it bit-identical to `1/(1+exp(-x))` (~2 ulp scalar, ~3 ulp under libmvec's vector `exp`) and ~neutral scalar speed, so it buys nothing without a library -- and `numba__fastmath` is therefore not the right gate. With no vector library both dtypes keep the plain scalar form. Literals carry x's dtype via `type(x)(...)` so float32 stays float32 (a bare `1` would unify to float64 and halve the SIMD width). The uint-input path uses the same gated form. Cache key carries the `veclib` flag and is bumped so scalar- and select-built artifacts never cross-reuse. Claude-Session: https://claude.ai/code/session_01HzV22gdeXoEAasiZR1MvYq
74ed33b to
a4c7b13
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.