Skip to content

array expression audit follow-ups (from #4483) #4503

@andygrove

Description

@andygrove

Tracking issue for follow-up work surfaced by the array expression audit in #4483. Each item below is either a support-level / serde correctness fix that the audit deliberately deferred, or a coverage gap the audit documented but did not implement. Already-filed gaps (#4481 NaN/signed-zero canonicalization, #4482 array_max/array_min NaN ordering, #3178 array_join null handling, #2190 collation, #3338 empty-array coerce_types) are not duplicated here.

High priority

1. Lift convert-time withInfo fallbacks in arrays.scala into getSupportLevel

Several array serdes bail out from convert with withInfo(...) + return None for cases that are knowable from the expression alone:

  • CometArrayRemove (spark/src/main/scala/org/apache/comet/serde/arrays.scala:40-55) rejects unsupported child types in convert.
  • CometArrayExcept (arrays.scala:357-374) rejects BinaryType / StructType element types in convert rather than via getSupportLevel.
  • CometArrayPosition (arrays.scala:701-727) rejects all-foldable args and unsupported element types in convert.
  • CometElementAt (arrays.scala:565-598) rejects non-array input in convert; the MapType rejection in particular belongs in getSupportLevel.
  • CometFlatten (arrays.scala:603-617) rejects unsupported child element types in convert.
  • CometArrayReverse (arrays.scala:545-556) rejects unsupported child types in convert even though it already has a getSupportLevel.
  • CometSortArray (arrays.scala:172-182) rejects non-Literal ascendingOrder in convert. Spark 4.0+ widens ascendingOrder to any foldable boolean, so this convert-time rejection is the new fallback path; it should be declared as Unsupported(Some(...)) in getSupportLevel so EXPLAIN surfaces the reason.

The audit-comet-expression skill (rule 10) requires expression-shape restrictions to be declared as Unsupported(Some(reason)) / Incompatible(Some(reason)) branches in getSupportLevel so EXPLAIN surfaces the reason at planning time, the auto-generated compatibility doc picks them up, and the dispatcher can route around them. The audit doc for array_position, array_remove, element_at, flatten, and sort_array explicitly notes these need to be lifted.

2. Documented NaN / canonicalization divergences should flip support level to Incompatible

The audit doc flags Known limitation / Known divergence sub-bullets on several entries, but the corresponding Comet serdes still report Compatible (skill rule 12 requires this to flip the support level):

These should either gain a CometExpressionSerde wrapper with Incompatible(Some(reason)) keyed on float/double element types, or the CometScalarFunction constructor should accept an optional support-level callback.

3. Dead serde registrations after Spark 4.0 rewrites

Skill rule 13 says unreachable serde mappings should be relabeled or removed so the dispatcher map reflects reality:

  • CometArrayAppend (arrays.scala:58-107, registered at QueryPlanSerde.scala:52): unreachable on Spark 4.0+ because ArrayAppend is now RuntimeReplaceable and rewritten to ArrayInsert(arr, Literal(-1), elem). The audit doc confirms dispatch flows through CometArrayInsert on 4.0+. Either drop the 4.0+ registration through the shim or mark it as 3.x-only.
  • CometArrayCompact (arrays.scala:311-333, registered at QueryPlanSerde.scala:53): ArrayCompact is always RuntimeReplaceable (rewritten to ArrayFilter(arr, IsNotNull(...)) in all supported Spark versions), so the direct classOf[ArrayCompact] -> CometArrayCompact mapping is never hit at convert time. The actual dispatch path is CometArrayFilter.convert -> CometArrayCompact.convert (arrays.scala:625-637). The direct registration should be removed.

Medium priority

4. CometSortArray Spark 4.0+ foldable-boolean coverage

Spark 4.0 widens SortArray.ascendingOrder from a Literal(_: Boolean, BooleanType) to any foldable boolean. CometSortArray (arrays.scala:172-182) still pattern-matches only Literal, so sort_array(arr, NOT some_const) and similar shapes silently fall back on 4.0+. Constant-folding catches most cases but not all. Either evaluate the foldable expression at convert time (preferred) or surface the limitation in getSupportLevel.

5. array_union result ordering vs DataFusion is unverified

The audit doc for array_union notes "Result ordering versus DataFusion is also unverified; compare the array_intersect ordering caveat." CometArrayUnion (arrays.scala:454-466) is currently Compatible. Verify whether DataFusion's array_union preserves Spark's left-first-then-new-right-elements ordering. If it does not, raise the support level to Incompatible with a documented reason, matching the array_intersect treatment.

6. CometArrayExcept data-type guard should be in getSupportLevel

CometArrayExcept.isTypeSupported (arrays.scala:342-355) explicitly rejects BinaryType and StructType element types, but the rejection happens in convert (arrays.scala:361-367). Move the check into a getSupportLevel branch so EXPLAIN can report it and the compatibility doc reflects the gap. (Same pattern as item 1.)

7. CometArrayJoin Spark 4.0+ collation coverage

Audit doc notes Spark 4.0 widens ArrayJoin.inputTypes to AbstractArrayType(StringTypeWithCollation(supportsTrimCollation = true)) and that non-binary collations are not propagated. Already tracked under #2190 (umbrella) but worth wiring an explicit hasNonDefaultStringCollation guard in CometArrayJoin.getSupportLevel so the fallback is visible in EXPLAIN, matching the pattern CometArrayIntersect already follows (arrays.scala:207-213).


Surfaced by the audit-comet-expression skill run in #4483.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions