Numba: SIMD friendly log1p, expm1, and softplus by ricardoV94 · Pull Request #2246 · pymc-devs/pytensor

ricardoV94 · 2026-06-19T10:52:29Z

┌──────────┬───────┬────────────────────────────┬───────────────────────────┬──────────────┬────────────────────┬────────────────────┐
│    Op    │ dtype │       Scalar speedup       │ Scalar ULP (before→after) │ SIMD speedup │      SIMD ULP      │        Gate        │
├──────────┼───────┼────────────────────────────┼───────────────────────────┼──────────────┼────────────────────┼────────────────────┤
│ log1p    │ f64   │ no (regresses small-arg)   │ 0.73 → 2.0                │ 1.57x        │ 2.2                │ veclib             │
├──────────┼───────┼────────────────────────────┼───────────────────────────┼──────────────┼────────────────────┼────────────────────┤
│ log1p    │ f32   │ 2.8–3.1x                   │ 0.73 → 2.0                │ 3.82x        │ 3.7                │ fastmath or veclib │
├──────────┼───────┼────────────────────────────┼───────────────────────────┼──────────────┼────────────────────┼────────────────────┤
│ expm1    │ f64   │ no (slower near-0)         │ 0.81 → 0.81 (gated)       │ 1.2–1.7x     │ 1.1 poly / 4.3 exp │ veclib             │
├──────────┼───────┼────────────────────────────┼───────────────────────────┼──────────────┼────────────────────┼────────────────────┤
│ expm1    │ f32   │ 1.6–2.7x                   │ 0.74 → 1.2                │ 5.7–9.0x     │ 1.1 poly / 4.3 exp │ fastmath or veclib │
├──────────┼───────┼────────────────────────────┼───────────────────────────┼──────────────┼────────────────────┼────────────────────┤
│ softplus │ f64   │ 1.28x mixed / ~tie uniform │ 1.33 → 2.31               │ 1.94x        │ 4.0                │ fastmath or veclib │
├──────────┼───────┼────────────────────────────┼───────────────────────────┼──────────────┼────────────────────┼────────────────────┤
│ softplus │ f32   │ 1.13x mixed / ~tie uniform │ 1.27 → 2.23               │ 5.61x        │ 4.5                │ fastmath or veclib │
├──────────┼───────┼────────────────────────────┼───────────────────────────┼──────────────┼────────────────────┼────────────────────┤
│ sigmoid  │ f64   │ no (~neutral)              │ 1.8 → 1.8                 │ 2.74x        │ ~3                 │ veclib             │
├──────────┼───────┼────────────────────────────┼───────────────────────────┼──────────────┼────────────────────┼────────────────────┤
│ sigmoid  │ f32   │ no (~neutral)              │ 1.8 → 1.8                 │ 5.77x        │ ~3                 │ veclib             │
└──────────┴───────┴────────────────────────────┴───────────────────────────┴──────────────┴────────────────────┴────────────────────┘

Add the numba__veclib config option naming the vector math library wired into LLVM, fold it into the Numba compile-cache key so a function built against one library is never reused under a different (or no) library — which would segfault on the unresolved symbol — and add scripts/check_numba_veclib.py to verify the wiring. No lowering uses it yet.

…clib The SIMD/cache-friendly transcendental lowerings rewrite log1p as a corrected log(1+x), expm1 as a polynomial + exp, and softplus as a branchless max(x,0)+log1p(exp(-|x|)) -- all of which lower to the `log`/`exp` a vector math library can vectorize. A benchmark investigation showed these forms trade ~1 ulp of accuracy for speed, so they are gated on the flags that opt into that trade rather than applied unconditionally. Root cause: glibc's scalar `log` is slow for arguments near 1 while scalar `log1p` has a fast small-argument path (and float32 `log1pf` / `expm1f` are relatively slow). So with no vector library: - float64 log1p via corrected `log` REGRESSES in the small-argument regime log1p exists for: 0.77x at |x|~1e-4, worst 0.57x at |x|~0.1, 0.91x for |x|<0.5. It only wins (~1.9x) once inputs reach past ~0.5, which the earlier (-0.5, 5) microbenchmark happened to cover. - float32 log1p / expm1 win across the whole domain (1.2-2.8x). - float64 expm1's polynomial is slower than scalar `expm1` near 0. A vector library makes the forms branch-free and time-invariant (libmvec vector `log` is flat ~12ms across the domain; the corrected log1p flat ~16ms vs native scalar log1p's data-dependent 27-65ms), so under `numba__veclib` they are uniformly fastest. Accuracy then drops to the library's ~4 ulp (vs ~1-2 ulp scalar) -- an opt-in trade; expm1's polynomial branch stays ~1 ulp (it carries no transcendental). Gating. The scalar speedup is a precision-for-speed trade, so it follows `numba__fastmath` (the existing such flag, default on -- so default behavior is unchanged, but `numba__fastmath=False` now restores the accurate forms). A vector library always enables the rewrite too, since it needs the rewritten form to vectorize and already accepts ~4 ulp. The float64 corrected/poly forms are the exception: they are SLOWER scalar, so fastmath (which means "go faster") must not select them -- only a vector library does. - Log1p: corrected form under a vector library (both dtypes) or on float32 under `numba__fastmath`; scalar `np.log1p` otherwise (e.g. float64 with no library). - Expm1: polynomial under a vector library (both dtypes) or on float32 under `numba__fastmath`; scalar `expm1` otherwise. - Softplus: branchless form (its log1p via the corrected `log`) under a vector library or `numba__fastmath` -- both dtypes, faster for the common mixed-sign case and vectorizable; the accurate Maechler cascade otherwise. - Log1mexp: scalar `log1p` (the Maechler form). The corrected-log reuse was reverted -- its data-dependent branch always feeds log1p small arguments, so the corrected form regressed it ~20-30% on float64 for no accuracy gain, and it never vectorizes there (branch-bound). Cache keys carry a flag recording which source each Op emitted (corrected / poly / branchless vs the scalar fallback) so the two are never reused for each other. Claude-Session: https://claude.ai/code/session_01HzV22gdeXoEAasiZR1MvYq

sigmoid lowered as `1 / (1 + exp(-x))` does not vectorize under a vector math library: the bare division stops LLVM's loop vectorizer from replacing `exp` with a libmvec call, so it stays a scalar `exp` loop. A division-guard select fixes this. `d = 1 + exp(-x); 0 if d == 0 else 1/d` vectorizes where `1/(1+exp(-x))` does not -- the `d == 0` guard is dead (1 + exp(-x) >= 1 always), but its presence lets the vectorizer pull the division through (the same mechanism that makes `_log1p_via_log`'s `um1 == 0` select vectorize). Measured under libmvec: 2.74x (float64) / 5.77x (float32). Gated solely on `numba__veclib`. Unlike Log1p/Expm1/Softplus the select is not a precision-for-speed trade: the dead guard makes it bit-identical to `1/(1+exp(-x))` (~2 ulp scalar, ~3 ulp under libmvec's vector `exp`) and ~neutral scalar speed, so it buys nothing without a library -- and `numba__fastmath` is therefore not the right gate. With no vector library both dtypes keep the plain scalar form. Literals carry x's dtype via `type(x)(...)` so float32 stays float32 (a bare `1` would unify to float64 and halve the SIMD width). The uint-input path uses the same gated form. Cache key carries the `veclib` flag and is bumped so scalar- and select-built artifacts never cross-reuse. Claude-Session: https://claude.ai/code/session_01HzV22gdeXoEAasiZR1MvYq

ricardoV94 force-pushed the vectorizable_log1p_expm1 branch 3 times, most recently from 833f56d to b094073 Compare June 20, 2026 21:13

ricardoV94 changed the title ~~Numba: SIMD friendly log1p and expm1~~ Numba: SIMD friendly log1p, expm1, and softplus Jun 21, 2026

ricardoV94 force-pushed the vectorizable_log1p_expm1 branch 3 times, most recently from 01e7b75 to 74ed33b Compare June 26, 2026 16:41

ricardoV94 added 3 commits June 26, 2026 18:56

ricardoV94 force-pushed the vectorizable_log1p_expm1 branch from 74ed33b to a4c7b13 Compare June 26, 2026 17:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Numba: SIMD friendly log1p, expm1, and softplus#2246

Numba: SIMD friendly log1p, expm1, and softplus#2246
ricardoV94 wants to merge 3 commits into
pymc-devs:mainfrom
ricardoV94:vectorizable_log1p_expm1

ricardoV94 commented Jun 19, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

ricardoV94 commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ricardoV94 commented Jun 19, 2026 •

edited

Loading