Speed up named-tensor broadcasting #199
Merged
## Summary

Reworks elementwise broadcasting on named tensors (`a + b`, `a .+ b`, `c .= a .+ b`) to skip Base's named-axis broadcast machinery and lower directly onto TensorAlgebra's linear-combination path over the backing arrays, aligning operands by dimension name.

A named broadcast previously rebuilt `NamedUnitRange` axes through `combine_axes`/`broadcast_shape` on every call and ran a runtime `promote_op` inference call for the result element type, because a named tensor's `eltype` is not inferrable. A 2x2 add allocated 52 times. Now `instantiate` is a no-op for the named style, the destination names come from `dimnames` of the operands rather than `axes(bc)`, and the unnamed work runs behind a function barrier where the result element type is inferrable. The same 2x2 add drops to 10 allocations (3.19 KiB to 464 B) and is several times faster, and larger dense adds improve as well.

`materialize!` is intercepted for the named style so the in-place path no longer reconstructs the broadcast over `axes(dest)` and re-enters the axis machinery. In-place is now cheaper than out-of-place, as it should be. Aligning an operand to the destination names also drops the previous `aligneddims` round trip.

With the axis machinery off every path, the `combine_axes`/`broadcast_shape`/`promote_shape`/`check_broadcast_shape` overloads and the broadcasted `similar` are removed. `axes` and `similar` on a raw lazy named `Broadcasted` are no longer supported, which nothing relies on: materialization goes through `dimnames`, and the non-linear fused fallback runs on the unnamed broadcast.
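The function-barrier idea in the summary can be illustrated with a minimal sketch. This is not the package's code: `LooselyTyped`, `add_kernel`, and `add_named` are hypothetical names, and the wrapper is only a stand-in for a named tensor whose element type the compiler cannot infer.

```julia
# Hypothetical stand-in for a named tensor: the backing array field is
# abstractly typed, so the element type is not inferrable from the wrapper.
struct LooselyTyped
    data::AbstractArray
    names::Tuple
end

# Inner kernel. When this is called, `x` and `y` have concrete runtime types,
# so Julia compiles a specialized method whose result eltype is inferrable.
add_kernel(x::AbstractArray, y::AbstractArray) = x .+ y

# Outer function. It is type-unstable in `a.data` and `b.data`, but it pays
# for a single dynamic dispatch and then hands off to the specialized kernel.
add_named(a::LooselyTyped, b::LooselyTyped) =
    LooselyTyped(add_kernel(a.data, b.data), a.names)

a = LooselyTyped([1.0 2.0; 3.0 4.0], (:i, :j))
b = LooselyTyped([10.0 20.0; 30.0 40.0], (:i, :j))
c = add_named(a, b)
```

The design point is that the instability is confined to one call boundary, so no separate runtime query for the result element type is needed inside the kernel.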
Codecov Report ❌ Patch coverage is

Additional details and impacted files

Coverage diff:

| Metric | main | #199 | +/- |
|----------|--------|--------|--------|
| Coverage | 73.44% | 74.00% | +0.55% |
| Files | 28 | 28 | |
| Lines | 1529 | 1500 | -29 |
| Hits | 1123 | 1110 | -13 |
| Misses | 406 | 390 | -16 |
Keep `unnamed(a, names)` a general permute-to-names method and handle the already-aligned shortcut in `broadcasted_unnamed`, its only caller. The shortcut is load-bearing: without it every aligned operand (including the always-aligned first one) is wrapped in an identity `PermutedDimsArray`, which breaks the clean lowering and makes a small add several times slower.
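A rough sketch of why the already-aligned shortcut matters, using only Base. The helper `align` is hypothetical and is not the package's `unnamed` or `broadcasted_unnamed`; it only shows the difference between returning the backing array untouched and wrapping it in a `PermutedDimsArray`.

```julia
# Hypothetical helper: return `data` ordered according to `dest_names`.
function align(data::AbstractArray, names::Tuple, dest_names::Tuple)
    # Fast path: the names already match, so return the array itself.
    names == dest_names && return data
    # General path: look up each destination name in `names` and wrap the
    # array in a lazy permuted view.
    perm = map(n -> findfirst(==(n), names), dest_names)
    return PermutedDimsArray(data, perm)
end

x = [1 2; 3 4]
same = align(x, (:i, :j), (:i, :j))     # no wrapper, same object as `x`
swapped = align(x, (:i, :j), (:j, :i))  # lazy view with dimensions swapped

# Without the shortcut, an aligned operand would still be wrapped:
identity_wrapped = PermutedDimsArray(x, (1, 2))
```

Here `same` is the original array, while `identity_wrapped` holds the same values behind a wrapper type, which is the case the commit message says defeats the clean lowering.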
Behind a function barrier that recovers the concrete backing array, `ndims` is a compile-time constant, so the permutation can be built as an `ntuple(..., Val(ndims))` (an `NTuple{N,Int}`) rather than `Tuple(getperm(...))` (a `Tuple{Vararg{Int}}` whose length is not inferrable). That lets `permuteddims` build a concretely-typed wrapper, roughly halving the cost of a permuted add (a 2x2 permuted add goes from about 1130 ns to about 900 ns). Aligned adds are unaffected, taking the alignment fast path in `broadcasted_unnamed`.
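The difference between the two ways of building the permutation can be sketched as below. `dynamic_perm` and `static_perm` are hypothetical names for illustration, and dimension names are assumed to be `Symbol`s; the package's `getperm` and `permuteddims` are not used here.

```julia
# Permutation collected into a Vector first: converting it to a tuple gives
# a `Tuple{Vararg{Int}}`, whose length inference cannot know.
dynamic_perm(names, dest_names) =
    Tuple(Int[findfirst(==(n), names) for n in dest_names])

# Permutation built with a static length: `N` is part of the argument types,
# so `ntuple(f, Val(N))` can produce an `NTuple{N,Int}`.
static_perm(names::NTuple{N,Symbol}, dest_names::NTuple{N,Symbol}) where {N} =
    ntuple(i -> findfirst(==(dest_names[i]), names)::Int, Val(N))

names = (:i, :j, :k)
dest_names = (:k, :i, :j)
p_dynamic = dynamic_perm(names, dest_names)
p_static = static_perm(names, dest_names)

# Either tuple works as a permutation at runtime; the static one lets the
# wrapper type be determined at compile time.
view_static = PermutedDimsArray(reshape(1:24, 2, 3, 4), p_static)
```

Both functions return the same values at runtime. Inspecting them with `@code_warntype` should show the difference in inferred return type, which is what allows a concretely typed wrapper in the static case.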
## Summary

Reworks elementwise broadcasting on named tensors (`a + b`, `a .+ b`, `c .= a .+ b`) to skip Base's named-axis broadcast machinery and lower directly onto TensorAlgebra's linear-combination path over the backing arrays, aligning operands by dimension name.

A named broadcast previously rebuilt `NamedUnitRange` axes through `combine_axes`/`broadcast_shape` on every call and ran a runtime `promote_op` inference call for the result element type, because a named tensor's `eltype` is not inferrable. A 2x2 add allocated 52 times. Now `instantiate` is a no-op for the named style, the destination names come from `dimnames` of the operands rather than `axes(bc)`, and the unnamed work runs behind a function barrier where the result element type is inferrable. The same 2x2 add drops to 10 allocations (3.19 KiB to 464 B) and is several times faster, and larger dense adds improve as well.

`materialize!` is intercepted for the named style so the in-place path no longer reconstructs the broadcast over `axes(dest)` and re-enters the axis machinery. In-place is now cheaper than out-of-place, as it should be. An operand already aligned to the destination names takes a fast path that returns its backing array untouched, and an operand whose dimension order differs is aligned behind a function barrier that builds the permutation with a static length, dropping the previous `aligneddims` round trip and keeping the permuted case fast.

With the axis machinery off every path, the `combine_axes`/`broadcast_shape`/`promote_shape`/`check_broadcast_shape` overloads and the broadcasted `similar` are removed. `axes` and `similar` on a raw lazy named `Broadcasted` are no longer supported, which nothing relies on: materialization goes through `dimnames`, and the non-linear fused fallback runs on the unnamed broadcast.
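As an illustration of the in-place path (`c .= a .+ b`) with operands aligned by dimension name, here is a minimal sketch on plain arrays. `add_into!` is a hypothetical function, names are passed as separate tuples rather than carried by a named-tensor type, and nothing here reflects how the package intercepts `materialize!`.

```julia
# Hypothetical sketch: write `a + b` into `dest`, ordering each operand's
# dimensions to match `dest_names` before the elementwise add.
function add_into!(dest, dest_names, a, a_names, b, b_names)
    # Aligned operands pass through untouched; others get a lazy permuted view.
    aligned(x, names) = names == dest_names ? x :
        PermutedDimsArray(x, map(n -> findfirst(==(n), names), dest_names))
    # The fused in-place broadcast writes into `dest` without allocating a
    # result array.
    dest .= aligned(a, a_names) .+ aligned(b, b_names)
    return dest
end

a = [1 2; 3 4]          # dimensions (:i, :j)
b = [10 20; 30 40]      # dimensions (:j, :i), so it must be permuted
dest = zeros(Int, 2, 2) # dimensions (:i, :j)
add_into!(dest, (:i, :j), a, (:i, :j), b, (:j, :i))
```

Because the destination already exists, the only remaining work is aligning operands and running the fused loop, which is consistent with in-place being the cheaper path.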