diff --git a/docs/source/contributor-guide/spark_expressions_support.md b/docs/source/contributor-guide/spark_expressions_support.md index 8f82847bb6..78f0545701 100644 --- a/docs/source/contributor-guide/spark_expressions_support.md +++ b/docs/source/contributor-guide/spark_expressions_support.md @@ -364,78 +364,141 @@ ### math_funcs - [x] `%` + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Remainder(left, right, evalMode)` signature identical across versions. Native path uses Rust `spark_modulo` UDF; non-ANSI returns NULL on divide-by-zero, ANSI raises `DIVIDE_BY_ZERO` / `REMAINDER_BY_ZERO`. `CometRemainder` rejects `EvalMode.TRY`, so `try_mod` (Spark 4.0+) falls back to Spark (https://github.com/apache/datafusion-comet/issues/4484). - [x] `*` + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Multiply(left, right, evalMode)` signature identical. Decimal results exceeding `DECIMAL128_MAX_PRECISION` go through `WideDecimalBinaryExpr` (Decimal256 intermediate); smaller decimals and primitives use DataFusion `BinaryExpr`. ANSI integer overflow uses Rust `checked_mul`. Interval multiplication falls back. - [x] `+` + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Add(left, right, evalMode)` with the same Decimal / ANSI plumbing as `*`. `Date + Int8/16/32` dispatches to the Rust `date_add` UDF to work around DataFusion's Date32 + Interval-only kernel. - [x] `-` + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Subtract(left, right, evalMode)` mirrors `+`. `Date - Int8/16/32` uses the Rust `date_sub` UDF. - [x] `/` + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Divide(left, right, evalMode)`. Non-ANSI mode wraps the divisor in `If(EqualTo(right, 0), null, right)` so DataFusion never throws. Decimal output is wrapped in `CheckOverflow(failOnError = ANSI)`; ANSI surfaces `NUMERIC_VALUE_OUT_OF_RANGE`, non-ANSI returns NULL. - [x] abs + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Abs(child, failOnError)` over `NumericType` plus the two interval types. `failOnError` (ANSI) is propagated to the native `abs` UDF, which throws `ARITHMETIC_OVERFLOW` on `Int.MinValue` / `Long.MinValue` / Decimal MIN. `DayTimeIntervalType` and `YearMonthIntervalType` fall back to Spark. Spark 4.0 / 4.1 do the `NullIntolerant` -> `nullIntolerant: Boolean` refactor; behaviour unchanged. - [x] acos + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.acos, "ACOS")` unchanged across versions; wired as `CometScalarFunction("acos")` to DataFusion's `acos` UDF. NaN for `|x| > 1`. - [x] acosh + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): custom `StrictMath.log(x + sqrt(x*x - 1))` unchanged across versions. NaN for `x < 1`. Routes to DataFusion's `acosh`. - [x] asin + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.asin, "ASIN")` unchanged. NaN for `|x| > 1`. - [x] asinh + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): special-cases `Double.NegativeInfinity` to avoid `log(NaN)`, otherwise `StrictMath.log(x + sqrt(x*x + 1))`. Identical across versions. - [x] atan + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.atan, "ATAN")` unchanged. - [x] atan2 + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `BinaryMathExpression(math.atan2, "ATAN2")` with both inputs adjusted by `+0.0` to flip `-0.0` to `+0.0`. `CometAtan2` reproduces this by wrapping each child in `Add(child, Literal.default(child.dataType))` before dispatching to DataFusion's `atan2`. - [x] atanh + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): custom `0.5 * (log1p(x) - log1p(-x))` (SPARK-28519). NaN for `|x| > 1`, +/-Infinity for `x = +/-1`. - [x] bin + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Bin(child)` over `LongType -> StringType`. Spark 4.x gains `DefaultStringProducingExpression` and the `nullIntolerant: Boolean` refactor; no behaviour change. Routes to datafusion-spark `SparkBin`. - [ ] bround - [x] cbrt + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): passthrough to DataFusion `cbrt`. - [x] ceil + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): one-arg `ceil(expr)` supported (`LongType` / `DoubleType` / `DecimalType` with scale >= 0). Decimal with negative scale falls back at convert time. The two-arg `ceil(expr, scale)` form (`RoundCeil`) is not wired and falls back to Spark. - [x] ceiling + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): registry alias for `Ceil`. Same support as `ceil`. - [ ] conv - [x] cos + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.cos, "COS")` unchanged across versions. - [x] cosh + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.cosh, "COSH")` unchanged. - [x] cot + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): custom `1 / math.tan(x)`. DataFusion's `cot` is also `1.0 / tan(x)`, so the result matches. - [x] csc + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): custom `1 / math.sin(x)`. Routed to datafusion-spark's `SparkCsc` (registered in `jni_api.rs`). - [x] degrees + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.toDegrees, "DEGREES")` unchanged across versions. - [x] div + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `IntegralDivide(left, right, evalMode)`. Non-decimal operands are cast to `DecimalType(19, 0)`; result is recomputed per `IntegralDivide.resultDecimalType`, wrapped in `CheckOverflow`, then cast to `Long`. ANSI overflow for `Long.MinValue div -1` and decimal-overflow ANSI cases are covered by existing tests. - [ ] e - [x] exp + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(StrictMath.exp, "EXP")` unchanged. ULP-level differences vs DataFusion `exp` are possible but unflagged. - [x] expm1 + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(StrictMath.expm1, "EXPM1")` unchanged. - [x] factorial - 3.4.3 (audited 2026-05-15): identical to v3.5.8. - 3.5.8 (audited 2026-05-15): canonical reference; `extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant`. Returns NULL for NULL input or values outside `[0, 20]`. - 4.0.1 (audited 2026-05-15): `NullIntolerant` trait replaced by `nullIntolerant: Boolean` method override; behavior unchanged. + - 4.1.1 (audited 2026-05-27): identical to 4.0.1. - [x] floor + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): mirror of `ceil`. Two-arg `floor(expr, scale)` form (`RoundFloor`) falls back to Spark. - [x] greatest + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): NULL-skipping variadic. Wired as `CometScalarFunction("greatest")` to DataFusion's `GreatestFunc`. Comet does not gate input types, so interval inputs and other Spark-only orderings rely on the native UDF accepting them; no explicit fallback path. - [x] hex + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): accepts `LongType` / `BinaryType` / `StringType`. Spark 4.x widens `StringType` to `StringTypeWithCollation` and preserves collation in `dataType`; `CometHex` passes `expr.dataType` to native `SparkHex`, which always returns `Utf8` -- collation propagation may diverge on Spark 4.x. - [ ] hypot - [x] least + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): mirror of `greatest`; same caveats. Spark 4.1.1 adds `contextIndependentFoldable` (no Comet impact). - [x] ln + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): registry alias for `Log`. Comet wires through `CometLog` to DataFusion `ln` with a `nullIfNegative` rewrite to match Spark's NULL behaviour for `x <= 0`. - [x] log + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): one-arg `log(x)` -> `CometLog` (DataFusion `ln`); two-arg `log(base, x)` -> `CometLogarithm` (custom `spark_log` UDF, returns NULL when `base <= 0` or `x <= 0` to match `Logarithm.nullSafeEval`). - [x] log10 + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryLogExpression(StrictMath.log10, "LOG10")`; returns NULL for `x <= 0`. Possible ULP differences from `StrictMath`. - [ ] log1p - [x] log2 + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryLogExpression(StrictMath.log(x) / StrictMath.log(2), "LOG2")`; returns NULL for `x <= 0`. - [x] mod + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): registry alias for `Remainder`. Same support as `%`. - [x] negative + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMinus(child, failOnError)` -> Rust `NegativeExpr`. ANSI overflow is detected for `Int8/Int16/Int32/Int64` and `IntervalYearMonth/IntervalDayTime`. Float / Double / Decimal cannot overflow on negate. Spark 4.0 `NullIntolerant` -> `nullIntolerant: Boolean` refactor; no impact. - [x] pi + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `LeafMathExpression(math.Pi, "PI")`; foldable, so Spark `ConstantFolding` rewrites it to a `Literal` before Comet sees the plan. The `CometScalarFunction("pi")` registration is exercised only when `ConstantFolding` is excluded. - [ ] pmod - [x] positive + - Spark 3.4.3, 3.5.8 (audited 2026-05-27): `UnaryPositive(child)` is a regular expression. There is no Comet serde for `UnaryPositive`, so projections containing `+col` silently disable Comet for the projection on 3.4/3.5. + - Spark 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryPositive` is `RuntimeReplaceable` with `replacement = child`; the optimizer removes it before Comet sees the plan, so the gap is transparent on 4.x. - [x] pow + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Pow(left, right) extends BinaryMathExpression(StrictMath.pow, "POWER")`; routes to DataFusion `pow`. ULP-level differences possible. - [x] power + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): registry alias for `Pow`. Same support as `pow`. - [x] radians + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.toRadians, "RADIANS")` unchanged across versions. - [x] rand + - See `misc_funcs / rand`. - [x] randn + - See `misc_funcs / randn`. - [ ] random - [ ] randstr - [x] rint + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.rint, "ROUND")` with `funcName = "rint"`. Passthrough to DataFusion `rint` (round-half-to-even). - [x] round + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): HALF_UP rounding for integer / decimal. Float / Double child types always fall back to Spark because `BigDecimal`-via-`toString` rounding cannot be precisely matched (documented inline in `CometRound`). ANSI `failOnError` is propagated for integer overflow. `BRound` (HALF_EVEN) is not wired. - [x] sec + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): custom `1 / math.cos(x)`. Routed to datafusion-spark's `SparkSec`. - [x] shiftleft + - See `bitwise_funcs / <<` (audited in PR #4479). Same support as the operator alias added in 4.0. - [x] sign + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): registry alias for `Signum`. Same support as `signum`. - [x] signum + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Signum(child)` over `DoubleType`. Spark also restricts to the two interval types via `inputTypes`; Comet handles only the `Double` case via DataFusion `signum`. - [x] sin + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.sin, "SIN")` unchanged. - [x] sinh + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.sinh, "SINH")` unchanged. - [x] sqrt + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.sqrt, "SQRT")` unchanged. - [x] tan + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.tan, "TAN")` unchanged. - [x] tanh + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.tanh, "TANH")` unchanged. - [x] try_add + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `TryAdd` is `RuntimeReplaceable` and rewrites to `Add(.., EvalMode.TRY)` for numeric inputs (datetime / interval go through `TryEval(Add(.., ANSI))` and fall back). Numeric path uses the Rust `checked_add` UDF, returning NULL on overflow. Decimal goes through `WideDecimalBinaryExpr` with `EvalMode.Try`. - [x] try_divide + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `TryDivide` rewrites to `Divide(.., EvalMode.TRY)`. The `nullIfWhenPrimitive` wrapper swaps zero divisors to NULL; integer / float divide uses `checked_div`; decimal uses `decimal_div` + `CheckOverflow(failOnError = false)` returning NULL. - [ ] try_mod - [x] try_multiply + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): rewrites to `Multiply(.., EvalMode.TRY)`. Integer path uses `checked_mul`; decimal uses `WideDecimalBinaryExpr` with `EvalMode.Try`, returning NULL on overflow. - [x] try_subtract + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): rewrites to `Subtract(.., EvalMode.TRY)`. Integer path uses `checked_sub`; decimal uses `WideDecimalBinaryExpr` as needed. - [x] unhex + - Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Unhex(child, failOnError)`. Spark 4.x widens input to `StringTypeWithCollation` and wraps the inner call in try/catch; Comet `CometUnhex` forwards `failOnError` to native `spark_unhex` but does not gate on collation. - [ ] uniform - [x] width_bucket + - Spark 3.5.8 (audited 2026-05-27): introduced; not available in 3.4.3. + - Spark 4.0.1, 4.1.1 (audited 2026-05-27): same semantics; `NullIntolerant` -> `nullIntolerant: Boolean` refactor. + - Known limitation: wired via per-version `CometExprShim` rather than a `CometExpressionSerde`, so it bypasses the support-level framework and the auto-generated compatibility doc (https://github.com/apache/datafusion-comet/issues/4485). Native path uses datafusion-spark `SparkWidthBucket`; interval input types are not exercised by Comet tests. ### misc_funcs