[BWARE] Tune compressed matmul fast paths and Spark execution decisions by Baunsgaard · Pull Request #2483 · apache/systemds

Baunsgaard · 2026-06-08T13:03:10Z

Mixes two related performance changes: refined compressed multiply heuristics, and a Spark-vs-CP decision refresh on the Hop layer.

CLALib matmul changes:

CLALibMMChain: for XtXv with few col groups and a wide-enough matrix, compute X' * X via leftMultByTransposeSelf and finish with a regular matrix multiply against v. Cheaper than chaining when the X' * X path can stay compressed
CLALibTSMM: refactor leftMultByTransposeSelf into a package-private helper so MMChain can call it; widen the ColGroupUncompressed handling
CLALibRightMultBy: stop forcing decompression for ASDC / ASDCZero inputs; they have working preAggregate paths that beat the dense fallback
CLALibCompAgg: fix blklen rounding so the last partition is not short by k rows on parallel aggregates

Spark/CP exec-decision refresh (Hop, UnaryOp, BinaryOp):

Hop: new helpers hasSparkOutput() and isScalarOrVectorBellowBlockSize() shared between unary and binary decision points
UnaryOp.optFindExecType: replace the inline chain of negations with isDisallowedSparkOps(), allow Frame outputs, and pull unary ops into Spark whenever the input already has a Spark output
BinaryOp.optFindExecType: same kind of restructuring; allow matrix-or-frame outputs to be pulled into Spark when exactly one operand is a scalar or small vector

Instruction-side adjustments:

VariableCPInstruction (CAST_AS_MATRIX from frame): use the parallel MatrixBlockFromFrame.convertToMatrixBlock(fin, k) path instead of the single-threaded DataConverter helper
ParameterizedBuiltinCPInstruction (transformdecode): call the parallel decoder.decode(data, out, k) overload using InfrastructureAnalyzer.getLocalParallelism()

Mixes two related performance changes: refined compressed multiply heuristics, and a Spark-vs-CP decision refresh on the Hop layer. CLALib matmul changes: - CLALibMMChain: for XtXv with few col groups and a wide-enough matrix, compute X' * X via leftMultByTransposeSelf and finish with a regular matrix multiply against v. Cheaper than chaining when the X' * X path can stay compressed - CLALibTSMM: refactor leftMultByTransposeSelf into a package-private helper so MMChain can call it; widen the ColGroupUncompressed handling - CLALibRightMultBy: stop forcing decompression for ASDC / ASDCZero inputs; they have working preAggregate paths that beat the dense fallback - CLALibCompAgg: fix blklen rounding so the last partition is not short by k rows on parallel aggregates Spark/CP exec-decision refresh (Hop, UnaryOp, BinaryOp): - Hop: new helpers hasSparkOutput() and isScalarOrVectorBellowBlockSize() shared between unary and binary decision points - UnaryOp.optFindExecType: replace the inline chain of negations with isDisallowedSparkOps(), allow Frame outputs, and pull unary ops into Spark whenever the input already has a Spark output - BinaryOp.optFindExecType: same kind of restructuring; allow matrix-or-frame outputs to be pulled into Spark when exactly one operand is a scalar or small vector Instruction-side adjustments: - VariableCPInstruction (CAST_AS_MATRIX from frame): use the parallel MatrixBlockFromFrame.convertToMatrixBlock(fin, k) path instead of the single-threaded DataConverter helper - ParameterizedBuiltinCPInstruction (transformdecode): call the parallel decoder.decode(data, out, k) overload using InfrastructureAnalyzer.getLocalParallelism()

The multi-threaded DecoderComposite.decode submitted one task per decoder per row block, running all decoders concurrently. This broke the ordering dependency between decoders: recode-on-output reads the category indexes written by the dummycode decoder, so when the recode task raced ahead it read unwritten cells and produced null or the raw index instead of the original value. Parallelize over row blocks instead, running all decoders in order within each block via the sequential block decode. Also short-circuit to the single-threaded path when k <= 1. Fixes order-dependent failures in TransformFrameEncodeDecodeTest and TransformFrameEncodeColmapTest (dummycode single-node/hybrid) that surfaced once transformdecode started using the parallel decode path.

codecov · 2026-06-08T16:29:40Z

Codecov Report

❌ Patch coverage is 80.64516% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.38%. Comparing base (88c26e2) to head (df4a5c6).
⚠️ Report is 7 commits behind head on main.

Files with missing lines	Patch %	Lines
.../apache/sysds/runtime/compress/lib/CLALibTSMM.java	80.00%	2 Missing and 2 partials ⚠️
src/main/java/org/apache/sysds/hops/UnaryOp.java	66.66%	0 Missing and 3 partials ⚠️
.../sysds/runtime/compress/lib/CLALibRightMultBy.java	0.00%	2 Missing ⚠️
...sds/runtime/transform/decode/DecoderComposite.java	66.66%	1 Missing and 1 partial ⚠️
src/main/java/org/apache/sysds/hops/BinaryOp.java	92.85%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##               main    #2483   +/-   ##
=========================================
  Coverage     71.37%   71.38%           
- Complexity    48749    48775   +26     
=========================================
  Files          1571     1571           
  Lines        188912   188935   +23     
  Branches      37067    37074    +7     
=========================================
+ Hits         134845   134870   +25     
- Misses        43601    43604    +3     
+ Partials      10466    10461    -5

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-project-automation Bot added this to SystemDS PR Queue Jun 8, 2026

github-project-automation Bot moved this to In Progress in SystemDS PR Queue Jun 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BWARE] Tune compressed matmul fast paths and Spark execution decisions#2483

[BWARE] Tune compressed matmul fast paths and Spark execution decisions#2483
Baunsgaard wants to merge 2 commits into
apache:mainfrom
Baunsgaard:split/mmChainTsmm

Baunsgaard commented Jun 8, 2026

Uh oh!

codecov Bot commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Baunsgaard commented Jun 8, 2026

Uh oh!

codecov Bot commented Jun 8, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant