Speed up named-tensor broadcasting #199

Merged: mtfishman merged 3 commits into `main` from `mf/named-broadcast-instantiate` on Jul 1, 2026
mtfishman (Member) commented Jul 1, 2026

## Summary

Reworks elementwise broadcasting on named tensors (`a + b`, `a .+ b`, `c .= a .+ b`) to skip Base's named-axis broadcast machinery and lower directly onto TensorAlgebra's linear-combination path over the backing arrays, aligning operands by dimension name.

A named broadcast previously rebuilt `NamedUnitRange` axes through `combine_axes`/`broadcast_shape` on every call and ran a runtime `promote_op` inference call for the result element type, because a named tensor's `eltype` is not inferrable. A 2x2 add allocated 52 times. Now `instantiate` is a no-op for the named style, the destination names come from `dimnames` of the operands rather than `axes(bc)`, and the unnamed work runs behind a function barrier where the result element type is inferrable. The same 2x2 add drops to 10 allocations (3.19 KiB to 464 B) and is several times faster, and larger dense adds improve as well.
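The mechanism described above can be illustrated with a minimal sketch. It does not use the package's types: `Tagged`, `TaggedStyle`, and `kernel` are hypothetical names invented for illustration, and the sketch only handles a single, unfused broadcast over operands whose names already match. It assumes the standard Base broadcast extension points (`Base.BroadcastStyle`, `Base.broadcastable`, `Base.Broadcast.instantiate`, `Base.copy` on a `Broadcasted`).

```julia
using Base.Broadcast: Broadcasted, BroadcastStyle

# Hypothetical toy wrapper (not the package's type). The abstractly typed
# field means the element type cannot be inferred from the wrapper alone.
struct Tagged
    data::AbstractArray
    names::Tuple{Vararg{Symbol}}
end

struct TaggedStyle <: BroadcastStyle end
Base.BroadcastStyle(::Type{<:Tagged}) = TaggedStyle()
Base.broadcastable(t::Tagged) = t

# No-op `instantiate`: skip Base's axis combination and shape checks.
Base.Broadcast.instantiate(bc::Broadcasted{TaggedStyle}) = bc

# Function barrier: Julia specializes this call on the concrete array
# types, so the result element type is inferrable inside it.
kernel(f, xs...) = f.(xs...)

function Base.copy(bc::Broadcasted{TaggedStyle})
    # Destination names come from the operands, not from `axes(bc)`.
    names = first(bc.args).names
    all(t -> t.names == names, bc.args) ||
        throw(ArgumentError("this sketch only handles aligned operands"))
    return Tagged(kernel(bc.f, map(t -> t.data, bc.args)...), names)
end

a = Tagged([1 2; 3 4], (:i, :j))
b = Tagged([0.5 0.5; 0.5 0.5], (:i, :j))
c = a .+ b   # runs `copy` above without ever computing named axes
```

Here `a .+ b` never calls `axes` on a `Tagged`, which is why the toy type does not need to define it.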

`materialize!` is intercepted for the named style so the in-place path no longer reconstructs the broadcast over `axes(dest)` and re-enters the axis machinery. In-place is now cheaper than out-of-place, as it should be. An operand already aligned to the destination names takes a fast path that returns its backing array untouched. An operand whose dimension order differs is aligned behind a function barrier that builds the permutation with a static length, which drops the previous `aligneddims` round trip and keeps the permuted case fast.
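A rough sketch of name-based alignment with an already-aligned fast path follows. The helper `align_to` is hypothetical, and Base's `PermutedDimsArray` stands in for whatever lazy permutation wrapper the package actually uses.

```julia
# Hypothetical helper: view `data`, whose dimensions are named `names`,
# in the dimension order given by `dest`.
function align_to(data::AbstractArray, names::Tuple, dest::Tuple)
    # Fast path: already aligned, so return the backing array untouched.
    names == dest && return data
    # Otherwise look up where each destination name sits in `names`.
    perm = map(n -> something(findfirst(==(n), names)), dest)
    return PermutedDimsArray(data, perm)
end

x = [1 2; 3 4]        # dimensions named (:i, :j)
y = [10 20; 30 40]    # dimensions named (:j, :i)
dest = (:i, :j)

xa = align_to(x, (:i, :j), dest)   # same object as `x`, no wrapper
ya = align_to(y, (:j, :i), dest)   # lazy permuted view of `y`
z = xa + ya                        # elementwise add in (:i, :j) order
```

Returning `data` itself in the aligned case avoids wrapping every aligned operand in an identity permutation.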

With the axis machinery off every path, the `combine_axes`/`broadcast_shape`/`promote_shape`/`check_broadcast_shape` overloads and the broadcasted `similar` are removed. `axes` and `similar` on a raw lazy named `Broadcasted` are no longer supported, which nothing relies on: materialization goes through `dimnames`, and the non-linear fused fallback runs on the unnamed broadcast.


codecov Bot commented Jul 1, 2026


Codecov Report

❌ Patch coverage is 87.87879% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.00%. Comparing base (df02533) to head (2061c94).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/broadcast.jl | 86.20% | 4 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #199      +/-   ##
==========================================
+ Coverage   73.44%   74.00%   +0.55%     
==========================================
  Files          28       28              
  Lines        1529     1500      -29     
==========================================
- Hits         1123     1110      -13     
+ Misses        406      390      -16     

| Flag | Coverage Δ |
|---|---|
| docs | 24.55% <63.63%> (-0.18%) ⬇️ |


mtfishman added 2 commits July 1, 2026 14:13
Keep `unnamed(a, names)` a general permute-to-names method and handle the already-aligned shortcut in `broadcasted_unnamed`, its only caller. The shortcut is load-bearing: without it every aligned operand (including the always-aligned first one) is wrapped in an identity `PermutedDimsArray`, which breaks the clean lowering and makes a small add several times slower.
Behind a function barrier that recovers the concrete backing array, `ndims` is a compile-time constant, so the permutation can be built as an `ntuple(..., Val(ndims))` (an `NTuple{N,Int}`) rather than `Tuple(getperm(...))` (a `Tuple{Vararg{Int}}` whose length is not inferrable). That lets `permuteddims` build a concretely-typed wrapper, reducing the cost of a permuted add by about 20% (a 2x2 permuted add goes from about 1130 ns to about 900 ns). Aligned adds are unaffected, taking the alignment fast path in `broadcasted_unnamed`.
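The difference between the two ways of building the permutation can be sketched as follows. All function names here (`position_of`, `perm_dynamic`, `perm_static`, `align`) are hypothetical, and Base's `PermutedDimsArray` is used as an assumed stand-in for `permuteddims`.

```julia
# Hypothetical lookup: position of `name` within the tuple `names`.
position_of(name, names) = something(findfirst(==(name), names))

# Built from a Vector, so the tuple length is only known at run time.
perm_dynamic(names, dest) = Tuple([position_of(n, names) for n in dest])

# Built with `Val(N)`, so the length is part of the type: an NTuple{N,Int}.
perm_static(names, dest, ::Val{N}) where {N} =
    ntuple(d -> position_of(dest[d], names), Val(N))

# Function barrier: `N` is recovered from the concrete array type.
function align(data::AbstractArray{<:Any,N}, names, dest) where {N}
    return PermutedDimsArray(data, perm_static(names, dest, Val(N)))
end

y = [10 20; 30 40]                   # dimensions named (:j, :i)
yp = align(y, (:j, :i), (:i, :j))    # viewed in (:i, :j) order
```

Both variants produce the same permutation values. The static one gives the compiler a tuple of known length, which is what allows a concretely typed wrapper.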
@mtfishman mtfishman merged commit 94614ee into main Jul 1, 2026
18 checks passed
@mtfishman mtfishman deleted the mf/named-broadcast-instantiate branch July 1, 2026 18:53