diff --git a/.claude/skills/audit-comet-expression/SKILL.md b/.claude/skills/audit-comet-expression/SKILL.md index e2704bb47d..a1b48f00cf 100644 --- a/.claude/skills/audit-comet-expression/SKILL.md +++ b/.claude/skills/audit-comet-expression/SKILL.md @@ -119,19 +119,125 @@ grep -r "$ARGUMENTS" spark/src/main/scala/org/apache/comet/ --include="*.scala" Read the serde implementation and check: - Which Spark versions the serde handles -- Whether `getSupportLevel` is implemented and accurate -- Whether all input types are handled -- Whether any types are explicitly marked `Unsupported` -- Whether `getIncompatibleReasons()` and `getUnsupportedReasons()` are overridden. - `getSupportLevel` controls runtime fallback, but `GenerateDocs` reads these two - methods to build the Compatibility Guide. If `getSupportLevel` returns - `Incompatible(Some(...))` or `Unsupported(Some(...))` but the corresponding - `get*Reasons()` method is not overridden, the reason will not appear in the - published docs. Expect both to return a `Seq[String]` containing the same - reason text used in `getSupportLevel`. Follow the pattern in - `spark/src/main/scala/org/apache/comet/serde/structs.scala::CometStructsToJson` - or `spark/src/main/scala/org/apache/comet/serde/datetime.scala::CometHour`: - extract the reason as a `private val` and reference it from both places. +- Whether all input types Spark accepts are handled +- Whether `convert` validates expression shape (e.g. literal-only arguments) before serializing + +Then audit the support-level and reason methods as described below. + +#### Audit `getSupportLevel`, `getIncompatibleReasons`, `getUnsupportedReasons`, and `convert` + +These four methods must stay aligned. Each has a distinct purpose, and the +most common bugs in Comet serdes are misalignments between them. + +**Pick the right support level.** + +- `Unsupported(Some(reason))`: Comet cannot run this case at all. The + dispatcher in `QueryPlanSerde.exprToProtoInternal` falls back to Spark + unconditionally. Use this when an input type, option, or expression shape + is not implemented, or when running the native path would crash or error. +- `Incompatible(Some(reason))`: Comet can run this, but results may differ + from Spark. The dispatcher only allows it when + `spark.comet.expr.allowIncompatible=true` (or the per-expression + equivalent). Use this for known result differences such as locale + sensitivity, timezone handling, ordering ambiguity, or floating-point + precision. +- `Compatible(None)`: full Spark compatibility for this combination of + inputs and options. +- `Compatible(Some(note))`: fully compatible but with a docs-only caveat. + Note that any `Some(...)` on `Compatible` triggers a runtime + `logWarning`, so reserve it for genuinely useful caveats. + +Decision rule: if the user would be surprised by a _wrong answer_, it is +`Incompatible`. If the user would be surprised by a _runtime error or +unsupported-type message_, it is `Unsupported`. + +**Runtime vs docs split.** + +- The `notes` field on `Incompatible(Some(...))` and `Unsupported(Some(...))` + flows into EXPLAIN output via the dispatcher (see + `QueryPlanSerde.exprToProtoInternal`, around the `case Incompatible` / + `case Unsupported` branches). This is what users see when they ask why + Comet fell back. +- `getIncompatibleReasons()` and `getUnsupportedReasons()` are read only by + `GenerateDocs` when building + `docs/source/user-guide/compatibility.md`. They are static (no `expr` + argument) and should enumerate every distinct reason the expression + could ever return at runtime. +- `convert` may also call `withInfo(expr, "reason")` and return `None` for + cases that cannot be detected from the expression alone (for example, + non-literal arguments, or child conversion failures). Those reasons + belong in `getUnsupportedReasons()` too. + +**Consistency rules.** + +1. If `getSupportLevel` can return `Incompatible(...)` for any input, + override `getIncompatibleReasons()` and include a reason for every + incompatible branch. +2. If `getSupportLevel` can return `Unsupported(...)` for any input, + override `getUnsupportedReasons()` and include a reason for every + unsupported branch. +3. If `convert` has its own `withInfo(...); None` fallbacks (e.g. literal + checks), enumerate those reasons in `getUnsupportedReasons()` too. +4. Extract each reason into a `private val` and reference it from both + `getSupportLevel` and `get*Reasons()`. Do not duplicate the string + inline. Canonical example: + `spark/src/main/scala/org/apache/comet/serde/arrays.scala::CometArrayIntersect`. + It declares `private val incompatReason` and + `private val unsupportedCollationReason`, and each is referenced from + `getSupportLevel` and the matching reasons method. +5. Prefer `Incompatible(Some(reason))` over `Incompatible(None)`. The + `None` form drops the reason from EXPLAIN output, leaving users with a + generic "not fully compatible" message and forcing them to read the + docs to find out why. +6. Gate compatibility decisions in `getSupportLevel`, not inside `convert`. + Putting the check inside `convert` (e.g. reading a config flag and + calling `withInfo`) bypasses the dispatcher's `allowIncompatible` + handling and the EXPLAIN message becomes inconsistent with what the + doc generator produces. + +**Wording guidelines for reason strings.** + +Reasons appear verbatim in the Compatibility Guide (rendered as Markdown) +and in EXPLAIN output, so they are user-facing. + +- Lead with the user-observable effect, then the cause if helpful. + ✅ "Result array element order may differ from Spark when the right array + is longer than the left." + ❌ "DataFusion probes the longer side." +- Use sentence case and end with a period. +- Use backticks around config keys, type names, and SQL identifiers. +- Link to a tracking GitHub issue for known incompatibilities so users can + follow progress: `(https://github.com/apache/datafusion-comet/issues/NNNN)`. +- Keep it concise. Single sentence is best. +- Do not write "Incompatible reason: ..." or "Unsupported because ...". + The doc generator adds the framing. +- Phrase Incompatible reasons as _what differs from Spark_, not _what is + missing_. Phrase Unsupported reasons as _what does not run_. If you find + yourself writing an "Incompatible" reason that says "Comet only supports + X" or "Y is not supported", the support level is probably wrong: it + should be `Unsupported`. + +**Common antipatterns to flag during the audit.** + +- A `private val reason` constant declared near the top of the object, but + `getSupportLevel` hardcodes a different string inline. The doc and the + EXPLAIN message will drift. + Real example: `CometHour` declares `incompatReason` then hardcodes a + near-duplicate string in `getSupportLevel`. +- Reason text duplicated in two places without a shared constant. + Real examples: `CometMinute`, `CometSecond`. +- `Incompatible(None)` paired with a populated `getIncompatibleReasons()`. + The reason reaches the docs but not EXPLAIN output. + Real example: `CometInitCap`. +- `getIncompatibleReasons()` overridden but `getSupportLevel` never returns + `Incompatible(...)`. Either the reasons method is dead code, or + `getSupportLevel` is missing a branch. +- A reason string phrased as "X is not supported" attached to an + `Incompatible` branch (or vice versa). Re-read it and decide which + support level it really belongs to. +- `convert` bails out via `withInfo` for a case that is fully knowable + from the expression (e.g. an unsupported child data type). Move the + check into `getSupportLevel` so the dispatcher handles it uniformly. ### Shims @@ -237,7 +343,52 @@ Also review the Comet implementation (Step 3) against the Spark behavior (Step 1 - Are there behavioral differences that are NOT marked `Incompatible` but should be? - Are there behavioral differences between Spark versions that the Comet implementation does not account for (missing shim)? - Does the Rust implementation match the Spark behavior for all edge cases? -- If `getSupportLevel` returns `Incompatible` or `Unsupported` with a reason, are `getIncompatibleReasons()` / `getUnsupportedReasons()` also overridden so the reason is picked up by `GenerateDocs` and appears in the Compatibility Guide? + +### Support-level consistency audit + +Walk through this checklist against the serde. Each failed item is a +finding for Step 6. + +1. **Support level matches behavior.** For each branch of + `getSupportLevel`, decide whether the user-observable effect is a wrong + answer (`Incompatible`) or a fallback / error (`Unsupported`). Flag any + branch where the label does not match the behavior. +2. **Reasons cover every branch.** Every distinct reason that + `getSupportLevel` can return as `Incompatible(Some(r))` must appear in + `getIncompatibleReasons()`. Same for `Unsupported(Some(r))` and + `getUnsupportedReasons()`. Missing reasons silently drop from the + Compatibility Guide. +3. **Reasons are not dead code.** If `getIncompatibleReasons()` is + overridden but `getSupportLevel` never returns `Incompatible(...)`, + either the reason is stale or `getSupportLevel` is missing a branch. + Same for `getUnsupportedReasons()`. +4. **Reason strings are shared via `private val`.** If the same reason + appears as a string literal in two places, flag it: changes to one + will not propagate to the other. +5. **Inline reason matches the constant.** If a `private val reason` is + declared but `getSupportLevel` uses a different string literal, flag + it as a drift bug. +6. **`Incompatible(None)` has no docs-only reason.** If + `getSupportLevel` returns `Incompatible(None)` but + `getIncompatibleReasons()` is non-empty, flag it: the EXPLAIN message + will be generic while the docs show a specific reason. Switch to + `Incompatible(Some(reason))`. +7. **`convert` fallbacks are documented.** If `convert` calls + `withInfo(expr, "...")` and returns `None` for cases not covered by + `getSupportLevel` (e.g. non-literal arguments, unsupported expression + shapes), confirm the reason is also listed in + `getUnsupportedReasons()`. +8. **Compatibility decisions live in `getSupportLevel`.** If `convert` + reads a config flag and bails out, prefer moving the check into + `getSupportLevel` so the dispatcher handles `allowIncompatible` + uniformly. `CometCaseConversionBase` is an example of the in-`convert` + pattern that this skill recommends against. +9. **Reason wording.** Each reason should describe the user-observable + effect, use sentence case with a period, backtick config keys and + types, and link a tracking issue when one exists. Flag reasons that + read like internal implementation notes ("DataFusion probes the longer + side") or that mismatch their support level (an "Incompatible" reason + that says "X is not supported"). --- @@ -245,33 +396,98 @@ Also review the Comet implementation (Step 3) against the Spark behavior (Step 1 Summarize findings as a prioritized list. -### High priority +### High priority: correctness divergences -Issues where Comet may silently produce wrong results compared to Spark. +Cases where Comet produces a different observable result from Spark +(wrong value, missing exception, accepted-instead-of-rejected input, +etc.). Each one becomes a captured test in Step 7. -### Medium priority +### Medium priority: missing test coverage -Missing test coverage for edge cases that could expose bugs. +Edge cases Spark exercises but Comet does not test, where the behaviour +appears to match but is not verified. -### Low priority +### Low priority: cosmetic and consistency issues -Minor gaps, cosmetic improvements, or nice-to-have tests. +Reason-string drift, missing `get*Reasons()` overrides, dead branches, +etc. These come from the Step 5 consistency audit. --- -## Step 7: Offer to Implement Missing Tests +## Step 7: Capture Correctness Findings as Tests; Offer Consistency Fixes + +Correctness divergences (Step 6 high-priority items) must not be left as +prose. Every divergence the audit identified becomes a regression test +in the same PR as the audit, so future readers can run it. + +For each correctness finding, do the following in this order: + +1. **Search for an existing tracking issue.** + + ```bash + gh issue list --search " in:title,body" --state all --limit 5 + ``` + + Match on both the expression name and a distinguishing keyword + (ANSI, timezone, NTZ, overflow, etc.). + +2. **If no issue exists, file one.** Use the `correctness` label plus + the relevant area label (e.g. `temporal expressions`). Keep the + title in the form "[Bug] ". Include + the Spark version, a minimal repro, and the divergent result. + +3. **If the fix is trivial and you are confident, fix it inline.** + Then add the test in the default `query` mode so it locks in the + fix. "Trivial" means a few lines in one file with no native code + changes, no support-level reshuffling, and no semantics decisions + that the user should weigh in on. -After presenting the gap analysis, ask the user: +4. **Otherwise, add the test in `query ignore()` mode** with + a one-line SQL comment above the query explaining the symptom. The + test file lives next to the existing tests for the expression. Do + not skip writing the test because the bug is unfixed: the captured + reproduction is the whole point of this step. -> I found the following missing test cases. Would you like me to implement them? + ```sql + -- Comet returns NULL where Spark throws under spark.sql.ansi.enabled=true + query ignore(https://github.com/apache/datafusion-comet/issues/NNNN) + SELECT next_day(date('2024-01-01'), 'NOT_A_DAY') + ``` + + When the underlying bug is fixed, the `ignore(...)` is removed and + the test starts running. This is also the contract documented in + `docs/source/contributor-guide/sql-file-tests.md`. + +After capturing correctness findings, ask the user about the remaining +categories. Keep these asks distinct so the user can opt in to each. + +**Missing test coverage (non-bug):** + +> Spark exercises the following cases that have no Comet test. Would +> you like me to add them? > > - [list each missing test case] > > I can add them as Comet SQL Tests in `spark/src/test/resources/sql-tests/expressions//$ARGUMENTS.sql` > (or as Comet Scala Tests in `CometExpressionSuite` for cases that require programmatic setup). -If the user says yes, implement the missing tests following the Comet SQL Tests format described in -`docs/source/contributor-guide/sql-file-tests.md`. Prefer Comet SQL Tests over Comet Scala Tests. +**Support-level consistency:** + +> I also found the following alignment issues between `getSupportLevel`, +> `getIncompatibleReasons`, `getUnsupportedReasons`, and `convert`. Would +> you like me to fix them? +> +> - [list each finding from the Step 5 consistency audit, with the file and serde object name] +> +> Each fix is a small, mechanical edit (extract a `private val` for the +> reason, switch `Incompatible(None)` to `Incompatible(Some(reason))`, +> add a missing `get*Reasons()` override, or move a check from `convert` +> into `getSupportLevel`). + +If the user says yes to test coverage, implement them following the Comet +SQL Tests format described in +`docs/source/contributor-guide/sql-file-tests.md`. Prefer Comet SQL Tests +over Comet Scala Tests. ### Comet SQL Tests template diff --git a/docs/source/contributor-guide/spark_expressions_support.md b/docs/source/contributor-guide/spark_expressions_support.md index ccb816c668..c6337c0e41 100644 --- a/docs/source/contributor-guide/spark_expressions_support.md +++ b/docs/source/contributor-guide/spark_expressions_support.md @@ -215,86 +215,214 @@ ### datetime_funcs - [x] add_months + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `inputTypes = Seq(DateType, IntegerType)`; returns `DateType`; codegen delegates to `DateTimeUtils.dateAddMonths`. + - Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true` on `AddMonthsBase`; behaviour and codegen unchanged. - [x] convert_timezone + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. Ternary `(sourceTz, targetTz, sourceTs)`; `inputTypes = Seq(StringType, StringType, TimestampNTZType)`; delegates to `DateTimeUtils.convertTimestampNtzToAnotherTz`. + - Spark 4.0.1 (audited 2026-05-27): timezone `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; behaviour unchanged for ASCII timezone strings. + - Known divergence: Comet composes `to_utc_timestamp` then `from_utc_timestamp` and its native timezone parser only accepts IANA zone IDs and `+HH:MM` offsets, so legacy forms like `GMT+1`, `UTC+1`, or three-letter abbreviations throw a native parse error at execution (https://github.com/apache/datafusion-comet/issues/2013). - [ ] curdate - [ ] current_date - [ ] current_time - [ ] current_timestamp - [x] current_timezone - [x] date_add + - Spark 3.4.3 (audited 2026-05-27): baseline. `(DateType, IntegerType|ShortType|ByteType) -> DateType`; `nullSafeEval` returns `startDays + d.intValue()` with Java int wrap-around; no ANSI branch. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true`. - [x] date_diff + - Spark 3.4.3 (audited 2026-05-27): baseline. `(DateType, DateType) -> IntegerType`; `nullSafeEval` is `endDays - startDays` with Java int wrap-around. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true`. + - Known divergence: the native impl uses non-wrapping `i32 -`, which would panic in debug builds on extreme inputs (Spark wraps); practically unreachable for date inputs. - [x] date_format + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `(TimestampType, StringType) -> StringType`; format string is parsed via `TimestampFormatter` (`DateTimeFormatter` under `CORRECTED` policy, `SimpleDateFormat` under `LEGACY` policy). + - Spark 4.0.1 (audited 2026-05-27): trait set updated to use `DefaultStringProducingExpression`; `nullIntolerant` becomes a field; format `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`. Behaviour unchanged for ASCII format strings. + - Known divergence: only a curated allow-list of `SimpleDateFormat` patterns runs natively (via DataFusion `to_char`). Non-UTC session timezones with a whitelisted format require `spark.comet.expr.dateFormat.allowIncompatible=true`. Non-literal formats, non-whitelisted formats, and the default disabled-codegen path route through Spark's `DateFormatClass.doGenCode` only when `spark.comet.exec.scalaUDF.codegen.enabled=true`; otherwise the operator falls back to Spark. `spark.sql.legacy.timeParserPolicy=LEGACY` is honoured only on the codegen-dispatch / Spark-fallback paths; the native allow-list assumes corrected semantics. - [x] date_from_unix_date + - Spark 3.4.3 (audited 2026-05-27): baseline. `IntegerType -> DateType`; `nullSafeEval` is the identity on days-since-epoch. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true`. - [x] date_part - [x] date_sub + - Spark 3.4.3 (audited 2026-05-27): baseline. Mirror of `DateAdd` (`startDays - d.intValue()`); same input types and wrap-around behaviour; no ANSI branch. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true`. - [x] date_trunc + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `inputTypes = Seq(StringType, TimestampType)`; format parsed by `parseTruncLevel` (case-insensitive) and supports `YEAR`/`YYYY`/`YY`, `QUARTER`, `MONTH`/`MM`/`MON`, `WEEK`, `DAY`/`DD`, `HOUR`, `MINUTE`, `SECOND`, `MILLISECOND`, `MICROSECOND`. Unknown levels return NULL. Truncation is `TimeZoneAware` and uses `zoneId` for day-and-coarser units. + - Spark 4.0.1 (audited 2026-05-27): format `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; truncation semantics unchanged for ASCII format strings. + - Known divergence: Comet returns incorrect results in non-UTC session timezones for day-and-coarser units (https://github.com/apache/datafusion-comet/issues/2649); marked `Incompatible` when the resolved zone is not `UTC` / `Etc/UTC`. Non-literal and unsupported format strings raise a native execution error instead of returning NULL. - [x] dateadd + - Spark 3.4.3 (audited 2026-05-27): SQL alias for `date_add`; see that entry. + - Spark 3.5.8 (audited 2026-05-27): SQL alias for `date_add`; see that entry. + - Spark 4.0.1 (audited 2026-05-27): SQL alias for `date_add`; see that entry. - [x] datediff + - Spark 3.4.3 (audited 2026-05-27): SQL alias for `date_diff`; see that entry. + - Spark 3.5.8 (audited 2026-05-27): SQL alias for `date_diff`; see that entry. + - Spark 4.0.1 (audited 2026-05-27): SQL alias for `date_diff`; see that entry. - [x] datepart - [x] day + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. SQL alias for `DayOfMonth`; delegates to `DateTimeUtils.getDayOfMonth` via `LocalDate.getDayOfMonth` (1..31). + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [ ] dayname - [x] dayofmonth + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `DayOfMonth extends GetDateField`; delegates to `DateTimeUtils.getDayOfMonth` (1..31). + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [x] dayofweek + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `DayOfWeek extends GetDateField`; returns 1..7 with Sunday=1 via `LocalDate.getDayOfWeek.plus(1).getValue`. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [x] dayofyear + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `DayOfYear extends GetDateField`; returns 1..366 via `LocalDate.getDayOfYear`. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [x] extract - [x] from_unixtime + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `BinaryExpression` with `inputTypes = Seq(LongType, StringType)`; returns `StringType`; uses session `zoneId` to format the resulting timestamp. + - Spark 4.0.1 (audited 2026-05-27): now `DefaultStringProducingExpression`; format `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; `nullIntolerant` set via override instead of the `NullIntolerant` trait. + - Known divergence: Comet only honours the default format pattern `yyyy-MM-dd HH:mm:ss`; any other format falls back to Spark. Implemented by composing DataFusion's `from_unixtime` and `to_char`, so DataFusion's valid timestamp range differs from Spark (https://github.com/apache/datafusion/issues/16594) and Spark datetime patterns are not honoured even when supplied (https://github.com/apache/datafusion/issues/16577). - [x] from_utc_timestamp - Spark 3.4.3 (audited 2026-05-12): identical to 3.5.8. - Spark 3.5.8 (audited 2026-05-12): baseline. - Spark 4.0.1 (audited 2026-05-12): `inputTypes` widened to `StringTypeWithCollation`; behaviour unchanged for ASCII timezone strings. - Known divergence: Comet's native timezone parser does not accept Spark's legacy zone forms (`GMT+1`, `UTC+1`, three-letter abbreviations like `PST`). Such timezones throw a native parse error at execution. - [x] hour + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `case class Hour` extends `GetTimeField`. + - Spark 4.0.1 (audited 2026-05-27): `case class Hour` is unchanged; parent `GetTimeField` trait refactored to override `nullIntolerant: Boolean = true` instead of mixing in `NullIntolerant` (no behavioural change). + - Known divergence: for `TimestampNTZType` inputs Comet's native path applies session-timezone conversion (Spark treats `TIMESTAMP_NTZ` as wall-clock and ignores session timezone), so the returned hour can differ. Marked `Incompatible` and gated by `spark.comet.expr.allowIncompatible` (https://github.com/apache/datafusion-comet/issues/3180). - [x] last_day + - Spark 3.4.3 (audited 2026-05-27): baseline. `DateType -> DateType`; computes `DateTimeUtils.getLastDayOfMonth`; no ANSI branch. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true`. - [x] localtimestamp - [x] make_date + - Spark 3.4.3 (audited 2026-05-27): baseline. `(IntegerType, IntegerType, IntegerType) -> DateType`; under `spark.sql.ansi.enabled=true` invalid `(year, month, day)` throws `ansiDateTimeError`, else returns NULL. Documented valid year range is 1 to 9999. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): error helper renamed to `ansiDateTimeArgumentOutOfRange`; behaviour otherwise unchanged. + - Known divergence: `SparkMakeDate` in `native/spark-expr/src/datetime_funcs/make_date.rs` always returns NULL on invalid input and never raises, so Comet diverges from Spark when `spark.sql.ansi.enabled=true`. It also accepts year 0 and negative years (chrono's proleptic calendar) which Spark rejects. - [ ] make_dt_interval - [ ] make_interval - [ ] make_time - [x] make_timestamp + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. Septenary `(year, month, day, hour, min, sec[, timezone])` with `sec: DecimalType(16,6)`; honours `spark.sql.ansi.enabled` (throws on invalid input, else NULL); timezone input is `StringType`; result type follows `spark.sql.timestampType`. + - Spark 4.0.1 (audited 2026-05-27): timezone input widened to `StringTypeWithCollation(supportsTrimCollation = true)`; ANSI error helpers renamed (`ansiDateTimeArgumentOutOfRange`, `invalidFractionOfSecondError(value)`); `NullIntolerant` trait replaced by `nullIntolerant: Boolean = true`; new sibling `TryMakeTimestamp` added but routed through a separate expression. - [ ] make_timestamp_ltz - [ ] make_timestamp_ntz - [ ] make_ym_interval - [x] minute + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `case class Minute` extends `GetTimeField`. + - Spark 4.0.1 (audited 2026-05-27): `case class Minute` is unchanged; parent `GetTimeField` trait refactored to override `nullIntolerant: Boolean = true` instead of mixing in `NullIntolerant` (no behavioural change). + - Known divergence: for `TimestampNTZType` inputs Comet's native path applies session-timezone conversion (Spark treats `TIMESTAMP_NTZ` as wall-clock and ignores session timezone), so the returned minute can differ. Marked `Incompatible` and gated by `spark.comet.expr.allowIncompatible` (https://github.com/apache/datafusion-comet/issues/3180). - [x] month + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `Month extends GetDateField`; delegates to `DateTimeUtils.getMonth` via `LocalDate.getMonthValue` (1..12). + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [ ] monthname - [x] months_between + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. Ternary `(date1: Timestamp, date2: Timestamp, roundOff: Boolean)` returning `DoubleType`; `TimeZoneAwareExpression`; codegen delegates to `DateTimeUtils.monthsBetween`. + - Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` trait dropped in favour of `nullIntolerant: Boolean = true` override; signature and runtime behaviour unchanged. - [x] next_day + - Spark 3.4.3 (audited 2026-05-27): baseline. `(DateType, StringType) -> DateType`; under `spark.sql.ansi.enabled=true` an unrecognised `dayOfWeek` throws `ansiIllegalArgumentError`, else returns NULL. Allowed tokens come from `DateTimeUtils.getDayOfWeekFromString` (`SU/SUN/SUNDAY`, `MO/MON/MONDAY`, ...), case-insensitive via `Locale.ROOT`, no trimming. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): error type changed to `SparkIllegalArgumentException`; `inputTypes` now uses `StringTypeWithCollation(supportsTrimCollation = true)`. + - Known divergence: `datafusion-spark::SparkNextDay` returns NULL for malformed `dayOfWeek` regardless of `spark.sql.ansi.enabled`, so ANSI mode does not throw. It also `trim()`s the day-of-week argument before matching, so `' MO '` succeeds natively while Spark would treat it as invalid. - [ ] now - [x] quarter + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `Quarter extends GetDateField`; returns 1..4 via `IsoFields.QUARTER_OF_YEAR`. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [x] second + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `case class Second` extends `GetTimeField`. + - Spark 4.0.1 (audited 2026-05-27): `case class Second` is unchanged; parent `GetTimeField` trait refactored to override `nullIntolerant: Boolean = true` instead of mixing in `NullIntolerant` (no behavioural change). + - Known divergence: for `TimestampNTZType` inputs Comet's native path applies session-timezone conversion (Spark treats `TIMESTAMP_NTZ` as wall-clock and ignores session timezone), so the returned second can differ. Marked `Incompatible` and gated by `spark.comet.expr.allowIncompatible` (https://github.com/apache/datafusion-comet/issues/3180). - [ ] session_window - [ ] time_diff - [ ] time_trunc - [x] timestamp_micros + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `MicrosToTimestamp` extends `IntegralToTimestampBase` with `upScaleFactor = 1`; accepts `IntegralType`, returns `TimestampType`; codegen is identity. + - Spark 4.0.1 (audited 2026-05-27): `IntegralToTimestampBase` drops the `NullIntolerant` trait in favour of `nullIntolerant: Boolean = true`; behaviour unchanged. - [x] timestamp_millis + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `MillisToTimestamp` extends `IntegralToTimestampBase` with `upScaleFactor = MICROS_PER_MILLIS (1000)`; multiply overflow throws via `Math.multiplyExact`. + - Spark 4.0.1 (audited 2026-05-27): same as 3.5.8 modulo the `NullIntolerant` trait/method refactor in `IntegralToTimestampBase`. - [x] timestamp_seconds + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `inputTypes = Seq(NumericType)` accepting integral, decimal, float, and double; integral values use `Math.multiplyExact` (overflow throws); float and double return NULL on NaN or Infinity. + - Spark 4.0.1 (audited 2026-05-27): `nullIntolerant` set via override instead of the `NullIntolerant` trait; otherwise identical to 3.5.8. + - Known divergence: Comet's Rust impl supports only Int32, Int64, Float32, and Float64. `DecimalType`, `ByteType`, and `ShortType` fall back to Spark. Int64 overflow returns a `ComputeError` matching Spark's `ArithmeticException`. NaN and Infinity map to NULL on the float and double paths. - [ ] to_date - [ ] to_time - [ ] to_timestamp - [ ] to_timestamp_ltz - [ ] to_timestamp_ntz - [x] to_unix_timestamp + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `inputTypes = Seq(TypeCollection(StringType, DateType, TimestampType, TimestampNTZType), StringType)`; returns `LongType` seconds; honours `spark.sql.ansi.enabled` (throws on parse error, else NULL); `TimeZoneAwareExpression`. + - Spark 4.0.1 (audited 2026-05-27): both the value and the format argument become `StringTypeWithCollation(supportsTrimCollation = true)`; a new `suggestedFuncOnFail = "try_to_timestamp"` field is added on `ToTimestamp` (advisory). + - Known divergence: routed through the JVM codegen dispatcher rather than a native kernel, so behaviour is bit-identical to Spark only when `spark.comet.exec.scalaUDF.codegen.enabled=true`; when the flag is off the operator falls back to Spark. - [x] to_utc_timestamp - Spark 3.4.3 (audited 2026-05-12): identical to 3.5.8. - Spark 3.5.8 (audited 2026-05-12): baseline. - Spark 4.0.1 (audited 2026-05-12): `inputTypes` widened to `StringTypeWithCollation`; behaviour unchanged for ASCII timezone strings. - Known divergence: Comet's native timezone parser does not accept Spark's legacy zone forms (`GMT+1`, `UTC+1`, three-letter abbreviations like `PST`). Such timezones throw a native parse error at execution. - [x] trunc + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `inputTypes = Seq(DateType, StringType)`; `parseTruncLevel` is case-insensitive and accepts `YEAR`/`YYYY`/`YY`, `QUARTER`, `MONTH`/`MM`/`MON`, `WEEK`. Unknown or sub-week levels (`DAY`, `HOUR`, ...) return NULL because `MIN_LEVEL_OF_DATE_TRUNC` is `TRUNC_TO_WEEK`. + - Spark 4.0.1 (audited 2026-05-27): format `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; truncation semantics unchanged for ASCII format strings. + - Known divergence: Comet's native kernel raises an execution error for unknown format strings instead of returning NULL, so non-literal formats are flagged `Incompatible`. Sub-week formats such as `DAY`/`DD` are rejected with `Unsupported` (Spark would return NULL) and fall back to Spark. - [ ] try_make_interval - [ ] try_make_timestamp - [ ] try_to_date - [ ] try_to_time - [ ] try_to_timestamp - [x] unix_date + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `UnixDate(child)`: `inputTypes = Seq(DateType)`, `dataType = IntegerType`; `nullSafeEval` returns the underlying days-since-epoch int unchanged. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait is replaced by `nullIntolerant: Boolean = true`. - [x] unix_micros + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `UnixMicros` extends `TimestampToLongBase` with `scaleFactor = 1`; codegen reduces to identity on the underlying micros. + - Spark 4.0.1 (audited 2026-05-27): same as 3.5.8 modulo the `NullIntolerant` trait/method refactor. - [x] unix_millis + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `UnixMillis` extends `TimestampToLongBase` with `scaleFactor = MICROS_PER_MILLIS`; floor-divides timestamp micros by 1000. + - Spark 4.0.1 (audited 2026-05-27): same as 3.5.8 modulo the `NullIntolerant` trait/method refactor. - [x] unix_seconds + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `UnixSeconds` extends `TimestampToLongBase` with `scaleFactor = MICROS_PER_SECOND`; accepts `TimestampType` only; returns `LongType`; floor-divides micros by the scale factor. + - Spark 4.0.1 (audited 2026-05-27): `TimestampToLongBase` swaps the `NullIntolerant` trait for `nullIntolerant: Boolean = true`; numerics unchanged. - [x] unix_timestamp + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. Inherits from `UnixTime` / `ToTimestamp`; `inputTypes = Seq(TypeCollection(StringType, DateType, TimestampType, TimestampNTZType), StringType)`; result is `LongType`; honours `failOnError` for ANSI parse errors on the string path. + - Spark 4.0.1 (audited 2026-05-27): `ToTimestamp.inputTypes` widens the string slot to `StringTypeWithCollation(supportsTrimCollation = true)`; behaviour unchanged for non-collated strings. + - Known divergence: Comet's native path only accepts `TimestampType`, `DateType`, and `TimestampNTZType` (string inputs fall back to Spark). For `TimestampType` and `DateType` the session timezone is applied via `array_with_timezone`; for `TimestampNTZType` the microsecond value is divided directly without timezone adjustment. - [x] weekday + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `WeekDay extends GetDateField`; returns 0..6 with Monday=0 via `LocalDate.getDayOfWeek.ordinal()`. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [x] weekofyear + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `WeekOfYear extends GetDateField`; returns the ISO-8601 week-of-week-based-year via `IsoFields.WEEK_OF_WEEK_BASED_YEAR` (Monday start, week 1 has more than 3 days). Comet maps this to DataFusion's `datepart('week', ...)` which uses Arrow's `iso_week().week()`, matching Spark. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [ ] window - [ ] window_time - [x] year + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `Year extends GetDateField`; delegates to `DateTimeUtils.getYear` via `LocalDate.getYear`. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. ### generator_funcs diff --git a/spark/src/test/resources/sql-tests/expressions/datetime/make_date_ansi.sql b/spark/src/test/resources/sql-tests/expressions/datetime/make_date_ansi.sql new file mode 100644 index 0000000000..04a28e3e1c --- /dev/null +++ b/spark/src/test/resources/sql-tests/expressions/datetime/make_date_ansi.sql @@ -0,0 +1,36 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. + +-- ANSI mode: Spark's MakeDate wraps the java.time.DateTimeException from LocalDate.of in +-- ansiDateTimeArgumentOutOfRange (4.0) / ansiDateTimeError (3.4/3.5) when +-- spark.sql.ansi.enabled=true. Comet's native SparkMakeDate always returns NULL on +-- invalid input and never raises, so it does not throw under ANSI. The ignored queries +-- below capture the divergence; remove ignore(...) when +-- https://github.com/apache/datafusion-comet/issues/4451 is fixed. +-- Config: spark.sql.ansi.enabled=true + +-- February 30 is not a valid date. +query ignore(https://github.com/apache/datafusion-comet/issues/4451) +SELECT make_date(2024, 2, 30) + +-- Month 13 is out of range. +query ignore(https://github.com/apache/datafusion-comet/issues/4451) +SELECT make_date(2024, 13, 1) + +-- Day 0 is out of range. +query ignore(https://github.com/apache/datafusion-comet/issues/4451) +SELECT make_date(2024, 6, 0) diff --git a/spark/src/test/resources/sql-tests/expressions/datetime/next_day.sql b/spark/src/test/resources/sql-tests/expressions/datetime/next_day.sql index 057c9daea6..65cccb9356 100644 --- a/spark/src/test/resources/sql-tests/expressions/datetime/next_day.sql +++ b/spark/src/test/resources/sql-tests/expressions/datetime/next_day.sql @@ -72,3 +72,8 @@ SELECT next_day(date('2023-01-01'), 'Monday'), next_day(date('2023-01-01'), 'Sun -- null handling query SELECT next_day(NULL, 'Monday'), next_day(date('2023-01-01'), NULL) + +-- Comet's native impl trims whitespace before matching the day name; Spark does not, so +-- ' MO ' is invalid in Spark (NULL) but matches Monday in Comet. +query ignore(https://github.com/apache/datafusion-comet/issues/4450) +SELECT next_day(date('2024-01-01'), ' MO '), next_day(date('2024-01-01'), 'MO ') diff --git a/spark/src/test/resources/sql-tests/expressions/datetime/next_day_ansi.sql b/spark/src/test/resources/sql-tests/expressions/datetime/next_day_ansi.sql new file mode 100644 index 0000000000..9f8f8e435f --- /dev/null +++ b/spark/src/test/resources/sql-tests/expressions/datetime/next_day_ansi.sql @@ -0,0 +1,30 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. + +-- ANSI mode: Spark's NextDay throws SparkIllegalArgumentException on a malformed +-- dayOfWeek. Comet's native impl (datafusion-spark::SparkNextDay) always returns NULL, +-- so it does not throw under ANSI. The expect_error pattern below will be the assertion +-- once https://github.com/apache/datafusion-comet/issues/4449 is fixed; until then the +-- queries are ignored so the suite stays green. +-- Config: spark.sql.ansi.enabled=true + +-- Comet returns NULL where Spark throws. +query ignore(https://github.com/apache/datafusion-comet/issues/4449) +SELECT next_day(date('2024-01-01'), 'NOT_A_DAY') + +query ignore(https://github.com/apache/datafusion-comet/issues/4449) +SELECT next_day(date('2024-01-01'), '')