Skip to content

feat: fix array_compact for Spark 4.0 and correct return type metadata#3796

Open
andygrove wants to merge 5 commits intoapache:mainfrom
andygrove:fix-array-compact
Open

feat: fix array_compact for Spark 4.0 and correct return type metadata#3796
andygrove wants to merge 5 commits intoapache:mainfrom
andygrove:fix-array-compact

Conversation

@andygrove
Copy link
Copy Markdown
Member

Summary

Fixes several issues with the array_compact expression found during an audit:

  • Spark 4.0 support: On Spark 4.0, array_compact rewrites to KnownNotContainsNull(ArrayFilter(IsNotNull)). Because KnownNotContainsNull was not handled by Comet, the expression always fell back to Spark on 4.0. This PR adds a case in the Spark 4.0 shim that unwraps KnownNotContainsNull, serializes the inner ArrayFilter using DataFusion's array_remove_all, and propagates the containsNull=false return type correctly.

  • Return type bug: CometArrayCompact.convert was hardcoding ArrayType(elementType = elementType) as the return type passed to DataFusion, which always set containsNull=true. Changed to use expr.dataType so the correct nullability metadata is emitted — matching Spark's behavior where containsNull=false after compaction on Spark 4.0.

  • Incorrect Incompatible classification: CometArrayCompact.getSupportLevel was returning Incompatible(None) with no explanation. The implementation is semantically correct (DataFusion's array_remove_all(arr, null) behaves the same as Spark's ArrayFilter(IsNotNull)), so this is changed to Compatible().

  • SQL test improvements: Added string, double, and array<array<int>> (nested) type coverage; changed spark_answer_only to query to verify native execution.

  • Remove Scala test skip: The assume(\!isSpark40Plus) guard with a // TODO fix for Spark 4.0.0 comment is removed now that Spark 4.0 is supported.

  • Docs: Updated expressions.md to mark ArrayCompact as supported.

- Handle KnownNotContainsNull wrapper in Spark 4.0 shim so that
  array_compact runs natively on Spark 4.0 (previously always fell back)
- Fix return type passed to DataFusion: use expr.dataType instead of
  hardcoded ArrayType(elementType) so that containsNull=false is
  correctly propagated on Spark 4.0
- Mark CometArrayCompact as Compatible() instead of Incompatible(None)
- Expand SQL test coverage: add string, double, and nested array types;
  change spark_answer_only to query to verify native execution
- Remove assume(\!isSpark40Plus) skip from Scala test
- Update expressions.md: ArrayCompact is now supported
…DataFusion

DataFusion's array_remove_all function always returns a list with nullable
elements (nullable=true), but array_compact's Spark dataType has
containsNull=false. Passing the Spark type as the promised return type caused
a runtime type-mismatch assertion in DataFusion's ScalarFunctionExpr. Fix by
passing ArrayType(elementType, containsNull=true) in both the Spark 3.x
CometArrayCompact serde and the Spark 4.0 KnownNotContainsNull shim.
@andygrove andygrove marked this pull request as ready for review March 26, 2026 03:01
object CometArrayCompact extends CometExpressionSerde[Expression] {

override def getSupportLevel(expr: Expression): SupportLevel = Incompatible(None)
override def getSupportLevel(expr: Expression): SupportLevel = Compatible()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove getSupportLevel entirely

andygrove and others added 2 commits April 8, 2026 12:26
- arrays.scala: use spark_array_compact UDF (main's approach) instead of
  array_remove_all with containsNull=true workaround, since DF 53 changed
  array_remove_all to return NULL when element arg is NULL
- expressions.md: both ArrayAppend and ArrayCompact are Spark-compatible (Yes)
- array_compact.sql: keep multi-column table queries from HEAD, merge in
  additional literal tests (double, nested) from main

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants