[BWARE] Tune compressed matmul fast paths and Spark execution decisions #2483
Open
Baunsgaard wants to merge 2 commits into
Mixes two related performance changes: refined compressed multiply heuristics, and a Spark-vs-CP decision refresh on the Hop layer.

CLALib matmul changes:
- CLALibMMChain: for XtXv with few col groups and a wide-enough matrix, compute X' * X via leftMultByTransposeSelf and finish with a regular matrix multiply against v. Cheaper than chaining when the X' * X path can stay compressed.
- CLALibTSMM: refactor leftMultByTransposeSelf into a package-private helper so MMChain can call it; widen the ColGroupUncompressed handling.
- CLALibRightMultBy: stop forcing decompression for ASDC / ASDCZero inputs; they have working preAggregate paths that beat the dense fallback.
- CLALibCompAgg: fix blklen rounding so the last partition is not short by k rows on parallel aggregates.

Spark/CP exec-decision refresh (Hop, UnaryOp, BinaryOp):
- Hop: new helpers hasSparkOutput() and isScalarOrVectorBellowBlockSize() shared between unary and binary decision points.
- UnaryOp.optFindExecType: replace the inline chain of negations with isDisallowedSparkOps(), allow Frame outputs, and pull unary ops into Spark whenever the input already has a Spark output.
- BinaryOp.optFindExecType: same kind of restructuring; allow matrix-or-frame outputs to be pulled into Spark when exactly one operand is a scalar or small vector.

Instruction-side adjustments:
- VariableCPInstruction (CAST_AS_MATRIX from frame): use the parallel MatrixBlockFromFrame.convertToMatrixBlock(fin, k) path instead of the single-threaded DataConverter helper.
- ParameterizedBuiltinCPInstruction (transformdecode): call the parallel decoder.decode(data, out, k) overload using InfrastructureAnalyzer.getLocalParallelism().
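To make the CLALibMMChain reordering concrete, here is a minimal sketch on plain dense arrays. It is not the CLALib code: the class and method names (`XtXvSketch`, `chained`, `tsmmFirst`) are hypothetical, and nothing here is compressed. It only illustrates that t(X) %*% (X %*% v) and (t(X) %*% X) %*% v give the same result, which is what allows the X' * X product to be computed first.

```java
import java.util.Arrays;

class XtXvSketch {
    // Chained order: t(X) %*% (X %*% v), i.e. two matrix-vector passes over X.
    static double[] chained(double[][] x, double[] v) {
        int n = x.length, m = v.length;
        double[] xv = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                xv[i] += x[i][j] * v[j];
        double[] out = new double[m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                out[j] += x[i][j] * xv[i];
        return out;
    }

    // TSMM-first order: build the m x m product t(X) %*% X once,
    // then finish with a small regular matrix-vector multiply against v.
    static double[] tsmmFirst(double[][] x, double[] v) {
        int n = x.length, m = v.length;
        double[][] xtx = new double[m][m];
        for (int i = 0; i < n; i++)
            for (int a = 0; a < m; a++)
                for (int b = 0; b < m; b++)
                    xtx[a][b] += x[i][a] * x[i][b];
        double[] out = new double[m];
        for (int a = 0; a < m; a++)
            for (int b = 0; b < m; b++)
                out[a] += xtx[a][b] * v[b];
        return out;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 2}, {3, 4}, {5, 6}};
        double[] v = {1, -1};
        System.out.println(Arrays.toString(chained(x, v)));   // [-9.0, -12.0]
        System.out.println(Arrays.toString(tsmmFirst(x, v))); // [-9.0, -12.0]
    }
}
```

On dense data the second order does more arithmetic, so the sketch shows only the equivalence; the cost argument in the description depends on the X' * X path staying compressed.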
The multi-threaded DecoderComposite.decode submitted one task per decoder per row block, running all decoders concurrently. This broke the ordering dependency between decoders: recode-on-output reads the category indexes written by the dummycode decoder, so when the recode task raced ahead it read unwritten cells and produced null or the raw index instead of the original value.

Parallelize over row blocks instead, running all decoders in order within each block via the sequential block decode. Also short-circuit to the single-threaded path when k <= 1.

Fixes order-dependent failures in TransformFrameEncodeDecodeTest and TransformFrameEncodeColmapTest (dummycode single-node/hybrid) that surfaced once transformdecode started using the parallel decode path.
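The row-block scheme described in this commit can be sketched as follows. This is an illustration under assumptions, not the DecoderComposite implementation: `RowBlockDecodeSketch`, `Stage`, `DUMMYCODE`, `recode`, and `decodeLabels` are hypothetical names, and the data layout (a one-hot `double[][]` input, one output cell per row) is invented for the example. Each task owns a disjoint range of rows and runs every stage in order over that range, so the recode-like stage always sees the index written by the dummycode-like stage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class RowBlockDecodeSketch {
    /** Hypothetical decoder stage: rewrites rows [rl, ru) of out in place. */
    interface Stage {
        void decode(double[][] in, Object[][] out, int rl, int ru);
    }

    /** Dummycode-like stage: writes the 1-based index of the hot column. */
    static final Stage DUMMYCODE = (in, out, rl, ru) -> {
        for (int r = rl; r < ru; r++)
            for (int c = 0; c < in[r].length; c++)
                if (in[r][c] == 1.0)
                    out[r][0] = c + 1;
    };

    /** Recode-like stage: reads the index left by DUMMYCODE and maps it to a label. */
    static Stage recode(String[] labels) {
        return (in, out, rl, ru) -> {
            for (int r = rl; r < ru; r++)
                out[r][0] = labels[((Integer) out[r][0]) - 1];
        };
    }

    /** Sequential block decode: all stages, in order, over one row range. */
    static void decodeBlock(double[][] in, Object[][] out, List<Stage> stages, int rl, int ru) {
        for (Stage s : stages)
            s.decode(in, out, rl, ru);
    }

    static String[] decodeLabels(double[][] in, String[] labels, int k) {
        int n = in.length;
        Object[][] out = new Object[n][1];
        List<Stage> stages = Arrays.asList(DUMMYCODE, recode(labels));
        if (k <= 1 || n == 0) {
            decodeBlock(in, out, stages, 0, n); // single-threaded short-circuit
        } else {
            int blklen = (n + k - 1) / k; // ceiling division: at most k blocks
            ExecutorService pool = Executors.newFixedThreadPool(k);
            try {
                List<Future<?>> tasks = new ArrayList<>();
                for (int rl = 0; rl < n; rl += blklen) {
                    final int start = rl, end = Math.min(n, rl + blklen);
                    // One task per row block, not one task per stage.
                    tasks.add(pool.submit(() -> decodeBlock(in, out, stages, start, end)));
                }
                for (Future<?> t : tasks)
                    t.get(); // wait and surface any task failure
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                pool.shutdown();
            }
        }
        String[] res = new String[n];
        for (int r = 0; r < n; r++)
            res[r] = (String) out[r][0];
        return res;
    }

    public static void main(String[] args) {
        double[][] in = {{1, 0, 0}, {0, 0, 1}, {0, 1, 0}, {0, 0, 1}, {1, 0, 0}};
        String[] labels = {"a", "b", "c"};
        System.out.println(Arrays.toString(decodeLabels(in, labels, 3))); // [a, c, b, c, a]
    }
}
```

Submitting one task per stage instead would let the recode-like stage run before its rows had been written, which is the race the commit message describes.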
Codecov Report ❌

Coverage Diff

| Metric | main | #2483 | +/- |
|---|---|---|---|
| Coverage | 71.37% | 71.38% | |
| Complexity | 48749 | 48775 | +26 |
| Files | 1571 | 1571 | |
| Lines | 188912 | 188935 | +23 |
| Branches | 37067 | 37074 | +7 |
| Hits | 134845 | 134870 | +25 |
| Misses | 43601 | 43604 | +3 |
| Partials | 10466 | 10461 | -5 |