Taylor series native `exponential` implementation by Jutho · Pull Request #243 · QuantumKitHub/MatrixAlgebraKit.jl

Jutho · 2026-06-03T10:23:44Z

This is something I started a while ago, but I need to continue working on it. Basically, for higher precision number types, the best approach for computing the exponential is just a Taylor expansion, combined with scaling and squaring (i.e. exp(A) = (exp(A/2^s))^(2^s)). But which s in combination with which order of m of the Taylor expansion is a nontrivial question, that has been studied in the literature, and also involves clever ways of computing higher order polynomials, i.e. polynomials of A of degree m, thus requiring all powers A^k, k in 0:m, without actually having to do m matrix multiplications.

github-actions · 2026-06-03T10:33:31Z

Your PR no longer requires formatting changes. Thank you for your contribution!

lkdvos · 2026-07-01T18:00:29Z

@Jutho I hope you don't mind but I took some time to rebase this PR now that @sanderdemeyer's PR is finished.
Doing some benchmarks on my machine, it seems like this is actually faster in most cases, so I even switched out the default to simply use this.
Balancing seems to not cost too much so I've opted to have that on by default to be cautious.

Benchmark: `MatrixFunctionViaTaylor` vs `LinearAlgebra.exp`

Timings via BenchmarkTools (@belapsed, best of 50 samples for BLAS floats, 5 for BigFloat) on random randn matrices. Tayl/LA and nobal/LA are the Taylor time (with / without balancing) relative to LinearAlgebra.exp; relerr is ‖got − exp(A)‖ / ‖exp(A)‖.

BLAS floats

Float64

n	LA (µs)	Taylor (µs)	Taylor no-balance (µs)	Tayl/LA	nobal/LA	relerr
8	3.9	5.9	5.6	1.51	1.43	1.3e-15
16	14.6	13.1	11.8	0.90	0.81	5.4e-15
32	57.8	30.1	26.7	0.52	0.46	1.2e-14
64	263.1	136.1	134.6	0.52	0.51	2.1e-14
128	2108.7	1219.7	1188.3	0.58	0.56	3.5e-14
256	6263.0	4161.5	4050.0	0.66	0.65	8.4e-14

ComplexF64

n	LA (µs)	Taylor (µs)	Taylor no-balance (µs)	Tayl/LA	nobal/LA	relerr
8	8.1	10.0	9.0	1.23	1.11	2.2e-15
16	31.1	36.8	30.0	1.19	0.97	1.9e-15
32	152.1	133.3	119.1	0.88	0.78	5.1e-15
64	943.1	801.2	719.9	0.85	0.76	2.5e-14
128	3456.6	2833.4	2591.5	0.82	0.75	3.0e-14
256	13064.8	11885.0	10696.6	0.91	0.82	7.4e-14

Taylor is faster than LinearAlgebra.exp for n ≥ 16 (Float64) and competitive-to-faster for ComplexF64, only paying a penalty at the smallest sizes. Balancing costs ~10–15% on these well-scaled random matrices; it is kept on by default for robustness on badly-scaled inputs.

Arbitrary precision (256-bit)

LinearAlgebra.exp has no method for Matrix{BigFloat}.
Reported below are Taylor timings and the self-consistency residual ‖exp(A)·exp(−A) − I‖₁.

BigFloat

n	Taylor (ms)	no-balance (ms)	‖E·E(−A) − I‖₁
8	1.46	1.38	1.9e-74
16	9.67	9.62	2.7e-73
32	75.63	76.57	3.0e-71
64	769.48	769.52	1.5e-68

Complex{BigFloat}

n	Taylor (ms)	no-balance (ms)	‖E·E(−A) − I‖₁
8	5.95	6.00	4.5e-74
16	44.53	43.26	2.9e-72
32	410.79	418.93	6.3e-71
64	3449.70	3432.66	2.0e-69

Jutho · 2026-07-02T00:10:15Z

I guess for the BigFloat case, the important timing is how it compares to the current approach in MAK, namely via the generic eigenvalue decomposition. I think that should be very favorable.

codecov · 2026-07-02T01:27:27Z

Codecov Report

❌ Patch coverage is 68.93204% with 32 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/common/balancing.jl	0.00%	30 Missing ⚠️
src/implementations/exponential.jl	98.61%	1 Missing ⚠️
src/interface/exponential.jl	0.00%	1 Missing ⚠️

Files with missing lines	Coverage Δ
src/MatrixAlgebraKit.jl	`100.00% <ø> (ø)`
src/interface/matrixfunctions.jl	`100.00% <ø> (ø)`
src/implementations/exponential.jl	`99.29% <98.61%> (-0.71%)`	⬇️
src/interface/exponential.jl	`0.00% <0.00%> (ø)`
src/common/balancing.jl	`0.00% <0.00%> (ø)`

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Register a pure-Julia scaling-and-squaring Taylor algorithm as `MatrixFunctionViaTaylor` and fit it to the exponential interface: the `(τ, A)` forms reuse the existing generic method, and it becomes the default algorithm for all dense matrices (`DiagonalAlgorithm` still handles `Diagonal`). Being LAPACK-free, it also covers arbitrary precision, where `LinearAlgebra.exp` has no method. The truncation order and number of squarings are chosen to minimize the Paterson-Stockmeyer matrix-multiplication count under the Taylor remainder bound, the polynomial is evaluated with incrementally built coefficients, and squaring reuses a ping-pong buffer. `balance!` is rewritten as a clean, type-generic Parlett-Reinsch scaling and wired in as an optional preprocessing step. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lkdvos · 2026-07-02T15:52:49Z

It seems like for BigFloat the decompositions are actually faster in some cases and slower in other, in particular for hermitian inputs. I'm not sure what should be the conclusion of what algorithm to choose by default, or what is causing this since that is slightly unexpected to me, but I guess the matmuls just aren't as efficient, so the more complex work needed for eigensolvers is less bad.

[Float64]  timing — general matrix   (min time per call)
      n      Taylor          LA         Eig
      8       4.5µs       3.3µs      13.3µs
     16       8.8µs      10.9µs      43.4µs
     32      24.6µs      36.4µs     217.3µs
     64     157.8µs     205.0µs     882.1µs
    128      2.24ms      2.34ms      9.20ms
    256      3.28ms      7.41ms     33.83ms

[Float64]  timing — hermitian matrix   (min time per call)
      n      Taylor          LA         Eig        Eigh
      8       4.0µs       7.9µs      12.6µs       9.5µs
     16       7.8µs      23.1µs      40.5µs      25.2µs
     32      18.1µs      74.0µs     218.4µs      76.5µs
     64     110.8µs     329.7µs     960.2µs     333.1µs
    128     836.6µs      3.27ms      9.32ms      3.22ms
    256      2.42ms     12.51ms     38.67ms     11.35ms

[ComplexF64]  timing — general matrix   (min time per call)
      n      Taylor          LA         Eig
      8       9.3µs       7.3µs      24.5µs
     16      29.5µs      27.6µs      91.9µs
     32     138.1µs     132.3µs     424.4µs
     64     944.5µs     899.1µs      2.11ms
    128      2.77ms      2.88ms     20.68ms
    256     17.37ms     17.80ms     80.29ms

[ComplexF64]  timing — hermitian matrix   (min time per call)
      n      Taylor          LA         Eig        Eigh
      8       7.4µs     114.4µs      21.7µs     115.2µs
     16      21.6µs     257.0µs      80.8µs     256.6µs
     32      95.6µs     587.6µs     382.1µs     594.0µs
     64     957.5µs     998.9µs      2.02ms      1.04ms
    128      2.06ms      7.69ms     21.79ms      7.75ms
    256      7.51ms     24.30ms     91.26ms     21.90ms

[BigFloat]  timing — general matrix   (min time per call)
      n      Taylor         Eig
      8      1.09ms      1.88ms
     16      8.32ms     11.80ms
     32     67.12ms     93.51ms
     64    695.57ms    731.99ms

[BigFloat]  timing — hermitian matrix   (min time per call)
      n      Taylor         Eig        Eigh
      8     844.7µs      1.28ms      1.70ms
     16      6.41ms      8.52ms      9.51ms
     32     51.65ms     70.95ms     50.23ms
     64    567.32ms    561.44ms    302.70ms

[Complex{BigFloat}]  timing — general matrix   (min time per call)
      n      Taylor         Eig
      8      5.15ms      4.60ms
     16     41.52ms     35.76ms
     32    341.52ms    274.70ms
     64      2.951s      2.114s

[Complex{BigFloat}]  timing — hermitian matrix   (min time per call)
      n      Taylor         Eig        Eigh
      8      3.87ms      3.56ms      2.36ms
     16     31.55ms     28.01ms     13.73ms
     32    247.06ms    197.29ms     98.47ms
     64      2.256s      1.650s    687.86ms

Jutho · 2026-07-02T21:16:42Z

That is not cool. Is that really due to the slow matrix multiplication. Would that be overhead due to the allocation of the different bigfloats.

Could you try the same with something like Double64 from https://github.com/JuliaMath/DoubleFloats.jl , which is a bitstype if I understand correctly.

lkdvos · 2026-07-03T01:36:28Z

After looking at it a bit more, one thing I noticed is that the Taylor series is truncated such that the error is eps(real(T)), which just gives quite a large amount of terms for BigFloat. This however does not depend on the size of the matrix, so it might just be that the crossover point is at larger array sizes, because the mul! is just less optimized?

To be completely fair, I would assume there is some equivalent that is true for the eigen decomposition, but I know too little of that implementation to know how that scales.

I reran the benchmarks with DoubleFloat, the story is slightly different but it looks like it is hard to beat the hermitian diagonalization...
Nevertheless, I think since we don't want to check for hermitian, nor want to load in GenericLinearAlgebra by default, I think keeping the Taylor implementation as default seems like a good across the board reasonable default.

[Float64]  timing — general matrix   (min time per call)
      n      Taylor          LA         Eig
      8       6.6µs       4.8µs      19.7µs
     16      13.1µs      15.8µs      67.5µs
     32      34.1µs      48.9µs     274.6µs
     64     152.6µs     230.8µs      1.38ms
    128      1.34ms      1.91ms      7.47ms
    256      9.76ms     10.99ms     22.46ms

[Float64]  timing — hermitian matrix   (min time per call)
      n      Taylor          LA         Eig        Eigh
      8       5.7µs      11.1µs      18.4µs      13.4µs
     16      11.5µs      33.3µs      63.2µs      35.2µs
     32      27.5µs     105.7µs     281.4µs     108.5µs
     64     127.2µs     418.6µs      1.50ms     421.2µs
    128     988.0µs      2.04ms      7.71ms      1.80ms
    256      7.45ms      8.64ms     27.11ms      7.54ms

[ComplexF64]  timing — general matrix   (min time per call)
      n      Taylor          LA         Eig
      8      13.6µs       9.5µs      36.5µs
     16      41.9µs      36.7µs     153.7µs
     32     184.6µs     154.4µs     694.6µs
     64     736.1µs     745.4µs      3.70ms
    128      4.91ms      4.80ms     19.56ms
    256     32.85ms     31.92ms     57.54ms

[ComplexF64]  timing — hermitian matrix   (min time per call)
      n      Taylor          LA         Eig        Eigh
      8      10.6µs      14.5µs      32.5µs      18.0µs
     16      31.3µs      45.1µs     134.2µs      50.7µs
     32     121.5µs     161.0µs     602.2µs     175.6µs
     64     597.7µs     714.5µs      3.51ms     758.9µs
    128      3.96ms      3.87ms     21.44ms      3.62ms
    256     25.04ms     20.30ms     72.62ms     19.00ms

[Double64]  timing — general matrix   (min time per call)
      n      Taylor         Eig
      8     104.5µs     248.0µs
     16     703.7µs      1.47ms
     32      5.29ms     10.20ms
     64     42.75ms     71.66ms
    128    357.09ms    519.60ms

[Double64]  timing — hermitian matrix   (min time per call)
      n      Taylor         Eig        Eigh
      8      75.7µs     208.5µs      58.0µs
     16     522.6µs      1.28ms     363.3µs
     32      3.96ms      9.00ms      2.35ms
     64     31.21ms     69.51ms     17.12ms
    128    263.20ms    524.07ms    128.97ms

[Complex{Double64}]  timing — general matrix   (min time per call)
      n      Taylor         Eig
      8     432.4µs     566.8µs
     16      3.23ms      3.50ms
     32     24.80ms     24.04ms
     64    194.93ms    174.09ms
    128      1.633s      1.300s

[Complex{Double64}]  timing — hermitian matrix   (min time per call)
      n      Taylor         Eig        Eigh
      8     304.4µs     475.8µs     120.7µs
     16      2.33ms      2.86ms     819.0µs
     32     17.96ms     19.31ms      6.15ms
     64    151.54ms    148.09ms     46.71ms
    128      1.206s      1.176s    362.33ms

[BigFloat]  timing — general matrix   (min time per call)
      n      Taylor         Eig
      8      1.71ms      2.59ms
     16     12.04ms     16.39ms
     32     96.30ms    135.57ms
     64      1.023s      1.050s

[BigFloat]  timing — hermitian matrix   (min time per call)
      n      Taylor         Eig        Eigh
      8      1.24ms      1.76ms      2.41ms
     16      9.30ms     11.84ms     13.32ms
     32     76.77ms     99.64ms     67.08ms
     64    824.75ms    897.12ms    446.64ms

[Complex{BigFloat}]  timing — general matrix   (min time per call)
      n      Taylor         Eig
      8      7.22ms      6.15ms
     16     60.19ms     48.48ms
     32    516.49ms    380.98ms
     64      4.051s      2.956s

[Complex{BigFloat}]  timing — hermitian matrix   (min time per call)
      n      Taylor         Eig        Eigh
      8      5.35ms      4.76ms      3.30ms
     16     42.54ms     35.55ms     19.24ms
     32    404.71ms    293.33ms    137.99ms
     64      3.208s      2.293s      1.012s

lkdvos · 2026-07-03T19:11:12Z

I investigated a little further, and the main thing that I can come up with is to try and get a tighter bound on the operator norm estimator. Currently, we are just using opnorm(A, 1), which often seems to overestimate the actual spectral norm by quite a bit, for the benchmark matrices I had:

┌──────────────────┬──────────┬─────────────────┬───────────────────────┬───────────┬───────────────┐
│       case       │ opnorm-1 │ spectral radius │ (order,sq) via 1-norm │   via ρ   │ matmuls saved │
├──────────────────┼──────────┼─────────────────┼───────────────────────┼───────────┼───────────────┤
│ ComplexF64 n=128 │ 88.97    │ 15.35           │ (15,7)=13             │ (11,6)=11 │ ~15%          │
├──────────────────┼──────────┼─────────────────┼───────────────────────┼───────────┼───────────────┤
│ ComplexF64 n=256 │ 178.6    │ 22.22           │ (15,8)=14             │ (15,5)=12 │ ~21%          │
├──────────────────┼──────────┼─────────────────┼───────────────────────┼───────────┼───────────────┤
│ BigFloat n=64    │ 44.33    │ 11.02           │ (34,9)=19             │ (34,7)=17 │ ~10%          │
└──────────────────┴──────────┴─────────────────┴───────────────────────┴───────────┴───────────────┘

There are some ways of improving on this, but things get complicated quite quickly, these methods are not free, and the actual payoff is not that large, so I think I might vote to just keep things the way they are right now. As this already is faster than LinearAlgebra, and we don't immediately have a need for hyper-efficient dense exponentials, I think this should be fine :).

Jutho · 2026-07-03T22:36:52Z

I don't recognize much of the original implementation anymore. I know that, since you anyway need to compute a minimum number of powers of A, it can be useful to first compute some powers of A, say up to A8 = A^8, and then take as estimate for the norm

opnorm(A8, 1)^(1/8).

This is based on Gelfand's theorem which states that $\lVert A^m\rVert^{1/m}$ converges to the spectral radius of $A$ for $m\to\infty$ for any submultiplicative norm. I think I was doing something like this in my original attempt.

lkdvos force-pushed the jh/taylorexponentiate branch from 84e07c2 to 08d3a71 Compare July 1, 2026 18:00

lkdvos changed the title ~~[WIP] Taylor Exponentiate~~ Taylor series native exponential implementation Jul 1, 2026

Jutho and others added 2 commits July 2, 2026 11:51

taylorexponentiate part 1

a3f1696

lkdvos force-pushed the jh/taylorexponentiate branch from a7f2e42 to 0d07fe9 Compare July 2, 2026 15:51

lkdvos added 2 commits July 2, 2026 12:10

add benchmark script

a4f26ad

update docs

b111776

lkdvos force-pushed the jh/taylorexponentiate branch from 357d1e6 to b111776 Compare July 2, 2026 16:10

lkdvos added 2 commits July 2, 2026 20:52

make GPU-compatible

b199218

add DoubleFloat to benchmark

c83fc9c

Jutho commented Jul 3, 2026

View reviewed changes

Comment thread src/implementations/exponential.jl Outdated

lkdvos added 2 commits July 3, 2026 07:58

use one!

5085cfa

micro-optimize

9994038

lkdvos approved these changes Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Taylor series native `exponential` implementation#243

Taylor series native `exponential` implementation#243
Jutho wants to merge 8 commits into
mainfrom
jh/taylorexponentiate

Jutho commented Jun 3, 2026

Uh oh!

github-actions Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

lkdvos commented Jul 1, 2026 •

edited

Loading

Uh oh!

Jutho commented Jul 2, 2026

Uh oh!

codecov Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

lkdvos commented Jul 2, 2026 •

edited

Loading

Uh oh!

Jutho commented Jul 2, 2026

Uh oh!

lkdvos commented Jul 3, 2026

Uh oh!

Uh oh!

lkdvos commented Jul 3, 2026

Uh oh!

Jutho commented Jul 3, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Jutho commented Jun 3, 2026

Uh oh!

github-actions Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lkdvos commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark: MatrixFunctionViaTaylor vs LinearAlgebra.exp

BLAS floats

Arbitrary precision (256-bit)

Uh oh!

Jutho commented Jul 2, 2026

Uh oh!

codecov Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

lkdvos commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Jutho commented Jul 2, 2026

Uh oh!

lkdvos commented Jul 3, 2026

Uh oh!

Uh oh!

lkdvos commented Jul 3, 2026

Uh oh!

Jutho commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Jun 3, 2026 •

edited

Loading

lkdvos commented Jul 1, 2026 •

edited

Loading

Benchmark: `MatrixFunctionViaTaylor` vs `LinearAlgebra.exp`

codecov Bot commented Jul 2, 2026 •

edited

Loading

lkdvos commented Jul 2, 2026 •

edited

Loading

Jutho commented Jul 3, 2026 •

edited

Loading