Skip the zeroed-destination gather in matricized contraction #201
Merged
Codecov Report
❌ Patch coverage is

Additional details and impacted files

| Coverage Diff | main | #201 | +/- |
|---|---|---|---|
| Coverage | 83.74% | 83.76% | +0.01% |
| Files | 24 | 24 | |
| Lines | 732 | 739 | +7 |
| Hits | 613 | 619 | +6 |
| Misses | 119 | 120 | +1 |
Branch updated from 0b9ae6d to cf17b2e.
Summary
When `β` is a strong zero, `contractopadd!` for the `Matricize` algorithm no longer matricizes `a_dest` to form the `mul!` target. Matricizing the destination gathers all of its (about-to-be-overwritten) blocks into a matrix that `mul!` then immediately overwrites, which for a graded output is a full block-by-block copy. Instead, the matmul allocates its result directly as `a1_mat * a2_mat`, and that result is scattered into `a_dest`. Every coupled-sector block is materialized (the matmul zeros the ones it does not reach), so the scatter overwrites `a_dest` in full.

A new `matricizepermaliases` predicate keeps the existing zero-copy path for a dense output already in matmul-aligned order, where matricizing `a_dest` returns a `reshape`/`transpose` view that `mul!` writes straight through. That case, and any nonzero `β`, still take the in-place `mul!` path. The predicate is pure (it depends only on the fusion style and the `matricize` kind) and defaults to `false`, so a graded output always takes the new path and a dense aligned output never does.

This removes one redundant write pass over the output, with allocations unchanged. On order-4 permuting graded contractions it measures around a 10% speedup: more when the matmul is a small fraction of the work, less as the matmul dominates.
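The dispatch described in the summary can be illustrated with a toy Python model. This is a sketch under stated assumptions, not the package's code: a graded output is stood in for by a list of diagonal blocks, a strong zero is modeled by Python's `False` (mirroring Julia's `false`), and `alpha` is applied to the product, since the description does not say how the real code handles it. The names `GradedDest`, `DenseDest`, `ALIASES`, `matricize_perm_aliases`, and `contract_op_add` are hypothetical. The model counts the scalars copied out by the destination gather, which is the work the change removes.

```python
# Toy model of the beta-strong-zero dispatch in a matricized contraction.
# All names (GradedDest, DenseDest, ALIASES, matricize_perm_aliases,
# contract_op_add) are hypothetical stand-ins, not the package's API.


def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


class GradedDest:
    """Block-diagonal stand-in for a graded output."""
    ALIASES = False  # matricizing always gathers into a fresh matrix

    def __init__(self, sizes, fill=0.0):
        self.sizes = sizes
        self.blocks = [[[fill] * s for _ in range(s)] for s in sizes]
        self.gathered = 0  # scalars copied out of the blocks by matricize

    def _spans(self):
        off = 0
        for blk, s in zip(self.blocks, self.sizes):
            yield blk, off, s
            off += s

    def matricize(self):
        """Gather every block into a new dense matrix (a full copy)."""
        n = sum(self.sizes)
        mat = [[0.0] * n for _ in range(n)]
        for blk, off, s in self._spans():
            for i in range(s):
                for j in range(s):
                    mat[off + i][off + j] = blk[i][j]
                    self.gathered += 1
        return mat

    def scatter(self, mat):
        """Overwrite every block from the matching part of a dense matrix."""
        for blk, off, s in self._spans():
            for i in range(s):
                for j in range(s):
                    blk[i][j] = mat[off + i][off + j]


class DenseDest:
    """Dense output already in matmul-aligned order."""
    ALIASES = True  # matricize returns the storage itself (a view)

    def __init__(self, n, fill=0.0):
        self.data = [[fill] * n for _ in range(n)]
        self.gathered = 0

    def matricize(self):
        return self.data  # no copy: writes go straight through

    def scatter(self, mat):
        if mat is not self.data:
            for row, src in zip(self.data, mat):
                row[:] = src


def matricize_perm_aliases(dest):
    # Pure: depends only on the destination's type; defaults to False.
    return getattr(type(dest), "ALIASES", False)


def is_strong_zero(beta):
    # Assumption: like Julia's `false`, only `False` is a strong zero here.
    return beta is False


def contract_op_add(dest, a1_mat, a2_mat, alpha, beta):
    """dest <- alpha * (a1_mat @ a2_mat) + beta * dest (toy version)."""
    prod = matmul(a1_mat, a2_mat)
    if is_strong_zero(beta) and not matricize_perm_aliases(dest):
        # New path: never read or gather dest; scatter the product into it.
        dest.scatter([[alpha * x for x in row] for row in prod])
        return dest
    # In-place path: matricize dest (a copy or a view), update, write back.
    mat = dest.matricize()
    for i, row in enumerate(prod):
        for j, x in enumerate(row):
            old = 0.0 if is_strong_zero(beta) else beta * mat[i][j]
            mat[i][j] = alpha * x + old
    dest.scatter(mat)
    return dest


a1 = [[2.0, 0.0, 0.0], [0.0, 1.0, 2.0], [0.0, 3.0, 4.0]]
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
new = contract_op_add(GradedDest([1, 2], fill=float("nan")), a1, eye, 1.0, False)
print(new.blocks, new.gathered)  # [[[2.0]], [[1.0, 2.0], [3.0, 4.0]]] 0
old = contract_op_add(GradedDest([1, 2]), a1, eye, 1.0, 0.0)
print(old.blocks, old.gathered)  # [[[2.0]], [[1.0, 2.0], [3.0, 4.0]]] 5
```

In the toy, `beta=False` scatters the product without ever reading the NaN-filled destination and copies nothing out of it, while `beta=0.0` takes the in-place path and gathers all 5 destination scalars first. This also shows why only a strong zero may skip the read: under IEEE arithmetic `0.0 * nan` is `nan`, so a plain `0.0` still has to look at the old contents. A `DenseDest` keeps the in-place path even with `beta=False`, because its matricized form is the storage itself and costs no gather.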