[BWARE] Add getCategoricalMask DML builtin #2482

Open: Baunsgaard wants to merge 2 commits into apache:main from Baunsgaard:split/getCategoricalMask

@Baunsgaard
Contributor

Adds a new builtin that, given a transform-encode metadata frame and the encoding JSON spec, returns a 1xN matrix mask marking which output columns are categorical (1) versus continuous (0). Useful when callers need to know the category boundary in transformed output without re-deriving it from the spec.

  • Register GET_CATEGORICAL_MASK in Builtins, Opcodes, Types (OpOp2), Builtin (functionobject)
  • Validate it as a frame+scalar binary in BuiltinFunctionExpression (new checkFrameParam helper) and lower it to a BinaryOp in DMLTranslator
  • Force CP execution for the new op in BinaryOp.optFindExecType
  • Implement runtime in BinaryFrameScalarCPInstruction and route FRAME+SCALAR binary instructions to it in BinaryCPInstruction
  • Add writeTestScalar(String, String) overload to TestUtils
  • Cover recode, dummycode, hash, and hybrid specs in GetCategoricalMaskTest (note: hash variants depend on the decoder/encoder hash-column changes in a separate branch)
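To make the mask semantics concrete, here is a minimal Python sketch of the column expansion described above. It is not the SystemDS implementation: the function name `categorical_mask`, the `distinct_counts` argument (standing in for the per-column distinct counts carried by the metadata frame), and the 1-based column ID sets are all assumptions made for illustration. The sketch assumes a recoded column stays a single output column marked 1, a dummycoded column expands into one output column per distinct value (all marked 1), and any other column passes through as a single 0.

```python
def categorical_mask(n_cols, recode, dummycode, distinct_counts):
    """Return a flat 0/1 list with one entry per transformed output column.

    n_cols:          number of input columns
    recode:          1-based IDs of recoded columns
    dummycode:       1-based IDs of dummycoded (one-hot) columns
    distinct_counts: dict mapping column ID -> number of distinct values
    """
    mask = []
    for col in range(1, n_cols + 1):
        if col in dummycode:
            # One output column per distinct value, all categorical.
            mask.extend([1] * distinct_counts[col])
        elif col in recode:
            # Recoding keeps a single column of integer codes.
            mask.append(1)
        else:
            # Pass-through column: treated as continuous.
            mask.append(0)
    return mask


# Column 1 is continuous, column 2 is recoded, column 3 is dummycoded
# with three distinct values.
print(categorical_mask(3, recode={2}, dummycode={3}, distinct_counts={3: 3}))
# -> [0, 1, 1, 1, 1]
```

The result corresponds to the 1xN row the builtin is described as returning, with N equal to the number of output columns after encoding rather than the number of input columns.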

getCategoricalMask only accounted for recode and dummycode columns, so
specs using feature hashing produced a mask with the wrong number of
columns and the DML check failed. Parse the hash column list and bucket
count K from the spec: a hashed column is categorical, and a hashed
column that is also dummycoded expands to K columns rather than the
recode distinct count.

Also correct the ID-based spec guard (it never triggered) to actually
require an ID-based spec, and remove the now-unused java.io.FileWriter
import in TestUtils that broke Checkstyle.
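The hash handling and the ID-based guard described in this commit can be sketched as follows. This is an illustrative Python approximation, not the code in `BinaryFrameScalarCPInstruction`: the helper name `mask_from_spec`, the `distinct_counts` argument, and the JSON key names (`ids`, `recode`, `dummycode`, `hash`, `K`) are assumptions about the spec layout. The rules it encodes come from the commit message: a hashed column is categorical, and a column that is both hashed and dummycoded expands to K output columns instead of its recode distinct count.

```python
import json


def mask_from_spec(spec_json, n_cols, distinct_counts):
    """Build a 0/1 mask over output columns from an ID-based encoding spec."""
    spec = json.loads(spec_json)
    # Guard: this sketch only handles specs that address columns by ID.
    if not spec.get("ids", False):
        raise ValueError("only ID-based specs are supported in this sketch")

    recode = set(spec.get("recode", []))
    dummycode = set(spec.get("dummycode", []))
    hashed = set(spec.get("hash", []))
    k = spec.get("K", 0)  # number of hash buckets

    mask = []
    for col in range(1, n_cols + 1):
        if col in dummycode:
            # Hashed + dummycoded expands to K columns; otherwise the
            # width is the number of distinct recoded values.
            width = k if col in hashed else distinct_counts[col]
            mask.extend([1] * width)
        elif col in recode or col in hashed:
            # Recoded or hashed without dummycoding: one categorical column.
            mask.append(1)
        else:
            mask.append(0)
    return mask


spec = '{"ids": true, "recode": [1], "dummycode": [2, 3], "hash": [3], "K": 4}'
print(mask_from_spec(spec, n_cols=4, distinct_counts={2: 2}))
# -> [1, 1, 1, 1, 1, 1, 1, 0]
```

In the example, column 3 contributes four entries because K is 4, which is the case the earlier logic sized incorrectly when it only considered recode and dummycode columns.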
@codecov

codecov Bot commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 87.05882% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.39%. Comparing base (88c26e2) to head (c528fa2).
⚠️ Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...nstructions/cp/BinaryFrameScalarCPInstruction.java | 85.71% | 5 Missing and 4 partials ⚠️ |
| ...apache/sysds/parser/BuiltinFunctionExpression.java | 81.81% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2482      +/-   ##
============================================
+ Coverage     71.37%   71.39%   +0.01%     
- Complexity    48749    48771      +22     
============================================
  Files          1571     1572       +1     
  Lines        188912   188996      +84     
  Branches      37067    37084      +17     
============================================
+ Hits         134845   134925      +80     
- Misses        43601    43604       +3     
- Partials      10466    10467       +1     
