[BWARE] Add getCategoricalMask DML builtin #2482
Open
Baunsgaard wants to merge 2 commits into
Adds a new builtin that, given a transform-encode metadata frame and the encoding JSON spec, returns a 1xN matrix mask marking which output columns are categorical (1) versus continuous (0). Useful when callers need to know the category boundary in transformed output without re-deriving it from the spec.

- Register GET_CATEGORICAL_MASK in Builtins, Opcodes, Types (OpOp2), Builtin (functionobject)
- Validate it as a frame+scalar binary in BuiltinFunctionExpression (new checkFrameParam helper) and lower it to a BinaryOp in DMLTranslator
- Force CP execution for the new op in BinaryOp.optFindExecType
- Implement runtime in BinaryFrameScalarCPInstruction and route FRAME+SCALAR binary instructions to it in BinaryCPInstruction
- Add writeTestScalar(String, String) overload to TestUtils
- Cover recode, dummycode, hash, and hybrid specs in GetCategoricalMaskTest (note: hash variants depend on the decoder/encoder hash-column changes in a separate branch)
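To make the intended mask semantics concrete, here is a minimal Python sketch of the recode/dummycode case. It is not the Java runtime code from this PR: the function name `categorical_mask` and the `distinct_counts` dictionary (a stand-in for the distinct counts that the metadata frame would supply) are hypothetical, and the sketch assumes an ID-based spec with 1-based column IDs.

```python
import json


def categorical_mask(spec_json, n_cols, distinct_counts):
    """Return a flat 0/1 list with one entry per transformed output column.

    distinct_counts is a hypothetical stand-in for the metadata frame:
    it maps a 1-based input column ID to its number of distinct values.
    """
    spec = json.loads(spec_json)
    recode = set(spec.get("recode", []))
    dummycode = set(spec.get("dummycode", []))

    mask = []
    for col in range(1, n_cols + 1):  # spec column IDs are 1-based
        if col in dummycode:
            # A dummycoded column expands to one output column per distinct value.
            mask.extend([1] * distinct_counts[col])
        elif col in recode:
            # A recoded column stays a single categorical output column.
            mask.append(1)
        else:
            # Anything else is passed through as a continuous column.
            mask.append(0)
    return mask


spec = '{"ids": true, "recode": [1, 2], "dummycode": [2]}'
print(categorical_mask(spec, 3, {1: 4, 2: 3}))  # [1, 1, 1, 1, 0]
```

In this example, column 1 is recoded (one categorical column), column 2 is dummycoded over three distinct values (three categorical columns), and column 3 is continuous, so the mask has five entries.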
getCategoricalMask only accounted for recode and dummycode columns, so specs using feature hashing produced a mask with the wrong number of columns and the DML check failed. Parse the hash column list and bucket count K from the spec: a hashed column is categorical, and a hashed column that is also dummycoded expands to K columns rather than the recode distinct count. Also correct the ID-based spec guard (it never triggered) to actually require an ID-based spec, and remove the now-unused java.io.FileWriter import in TestUtils that broke Checkstyle.
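The hash handling described in this commit can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the code in the PR: the function name `categorical_mask_with_hash` and the `distinct_counts` dictionary (standing in for the metadata frame) are hypothetical, and the spec keys `ids`, `recode`, `dummycode`, `hash`, and `K` are assumed to follow the transform spec layout the commit message refers to.

```python
import json


def categorical_mask_with_hash(spec_json, n_cols, distinct_counts):
    """Return a flat 0/1 mask over output columns, including hashed columns.

    distinct_counts is a hypothetical stand-in for the metadata frame:
    it maps a 1-based input column ID to its number of distinct values.
    """
    spec = json.loads(spec_json)
    # Guard: this sketch only handles ID-based specs.
    if not spec.get("ids", False):
        raise ValueError("only ID-based specs are supported in this sketch")

    recode = set(spec.get("recode", []))
    dummycode = set(spec.get("dummycode", []))
    hashed = set(spec.get("hash", []))
    k = int(spec.get("K", 0))  # number of hash buckets

    mask = []
    for col in range(1, n_cols + 1):  # spec column IDs are 1-based
        if col in dummycode:
            # Hashed + dummycoded expands to K columns; otherwise use
            # the distinct count recorded for the recoded column.
            width = k if col in hashed else distinct_counts[col]
            mask.extend([1] * width)
        elif col in recode or col in hashed:
            # Recoded or hashed without dummycoding: one categorical column.
            mask.append(1)
        else:
            mask.append(0)
    return mask


spec = '{"ids": true, "recode": [1], "hash": [2, 3], "dummycode": [3], "K": 4}'
print(categorical_mask_with_hash(spec, 4, {1: 5}))  # [1, 1, 1, 1, 1, 1, 0]
```

Here column 3 is both hashed and dummycoded, so it contributes K = 4 mask entries; ignoring the hash list would instead look up a distinct count that does not exist for that column, which is the kind of width mismatch the commit describes.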
Codecov Report
❌ Patch coverage is …
@@ Coverage Diff @@
## main #2482 +/- ##
============================================
+ Coverage 71.37% 71.39% +0.01%
- Complexity 48749 48771 +22
============================================
Files 1571 1572 +1
Lines 188912 188996 +84
Branches 37067 37084 +17
============================================
+ Hits 134845 134925 +80
- Misses 43601 43604 +3
- Partials 10466 10467 +1