From e2b9356080cf080f743a51a972a82fa90f3f0a5c Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Wed, 27 May 2026 07:44:20 -0600 Subject: [PATCH 1/4] docs: clarify support-level and reason consistency in audit-comet-expression skill Expand the audit skill so it produces consistent, accurate findings about `getSupportLevel`, `getIncompatibleReasons`, `getUnsupportedReasons`, and `convert`. Step 3 gains a dedicated subsection covering: - A decision rule for `Unsupported` vs `Incompatible` (wrong answer vs fallback/error). - The runtime-vs-docs split: support-level `notes` flow into EXPLAIN via the dispatcher, while `get*Reasons()` is read only by `GenerateDocs`. - Six consistency rules covering every branch, dead-reason detection, shared `private val` for reason strings, `Incompatible(Some(reason))` preference, and gating decisions in `getSupportLevel` rather than `convert`. - Wording guidelines for reason strings (user-observable effect first, sentence case, backtick configs/types, tracking issue links). - Real antipattern callouts naming specific serdes (`CometHour`, `CometMinute`, `CometSecond`, `CometInitCap`). Step 5 gains a structured 9-item support-level consistency audit checklist that produces findings against the four methods. Step 7 splits into two asks: one for missing tests, one for consistency fixes, so the user can opt in to either independently. --- .../skills/audit-comet-expression/SKILL.md | 206 ++++++++++++++++-- 1 file changed, 188 insertions(+), 18 deletions(-) diff --git a/.claude/skills/audit-comet-expression/SKILL.md b/.claude/skills/audit-comet-expression/SKILL.md index e2704bb47d..bfabfeaa4e 100644 --- a/.claude/skills/audit-comet-expression/SKILL.md +++ b/.claude/skills/audit-comet-expression/SKILL.md @@ -119,19 +119,125 @@ grep -r "$ARGUMENTS" spark/src/main/scala/org/apache/comet/ --include="*.scala" Read the serde implementation and check: - Which Spark versions the serde handles -- Whether `getSupportLevel` is implemented and accurate -- Whether all input types are handled -- Whether any types are explicitly marked `Unsupported` -- Whether `getIncompatibleReasons()` and `getUnsupportedReasons()` are overridden. - `getSupportLevel` controls runtime fallback, but `GenerateDocs` reads these two - methods to build the Compatibility Guide. If `getSupportLevel` returns - `Incompatible(Some(...))` or `Unsupported(Some(...))` but the corresponding - `get*Reasons()` method is not overridden, the reason will not appear in the - published docs. Expect both to return a `Seq[String]` containing the same - reason text used in `getSupportLevel`. Follow the pattern in - `spark/src/main/scala/org/apache/comet/serde/structs.scala::CometStructsToJson` - or `spark/src/main/scala/org/apache/comet/serde/datetime.scala::CometHour`: - extract the reason as a `private val` and reference it from both places. +- Whether all input types Spark accepts are handled +- Whether `convert` validates expression shape (e.g. literal-only arguments) before serializing + +Then audit the support-level and reason methods as described below. + +#### Audit `getSupportLevel`, `getIncompatibleReasons`, `getUnsupportedReasons`, and `convert` + +These four methods must stay aligned. Each has a distinct purpose, and the +most common bugs in Comet serdes are misalignments between them. + +**Pick the right support level.** + +- `Unsupported(Some(reason))`: Comet cannot run this case at all. The + dispatcher in `QueryPlanSerde.exprToProtoInternal` falls back to Spark + unconditionally. Use this when an input type, option, or expression shape + is not implemented, or when running the native path would crash or error. +- `Incompatible(Some(reason))`: Comet can run this, but results may differ + from Spark. The dispatcher only allows it when + `spark.comet.expr.allowIncompatible=true` (or the per-expression + equivalent). Use this for known result differences such as locale + sensitivity, timezone handling, ordering ambiguity, or floating-point + precision. +- `Compatible(None)`: full Spark compatibility for this combination of + inputs and options. +- `Compatible(Some(note))`: fully compatible but with a docs-only caveat. + Note that any `Some(...)` on `Compatible` triggers a runtime + `logWarning`, so reserve it for genuinely useful caveats. + +Decision rule: if the user would be surprised by a *wrong answer*, it is +`Incompatible`. If the user would be surprised by a *runtime error or +unsupported-type message*, it is `Unsupported`. + +**Runtime vs docs split.** + +- The `notes` field on `Incompatible(Some(...))` and `Unsupported(Some(...))` + flows into EXPLAIN output via the dispatcher (see + `QueryPlanSerde.exprToProtoInternal`, around the `case Incompatible` / + `case Unsupported` branches). This is what users see when they ask why + Comet fell back. +- `getIncompatibleReasons()` and `getUnsupportedReasons()` are read only by + `GenerateDocs` when building + `docs/source/user-guide/compatibility.md`. They are static (no `expr` + argument) and should enumerate every distinct reason the expression + could ever return at runtime. +- `convert` may also call `withInfo(expr, "reason")` and return `None` for + cases that cannot be detected from the expression alone (for example, + non-literal arguments, or child conversion failures). Those reasons + belong in `getUnsupportedReasons()` too. + +**Consistency rules.** + +1. If `getSupportLevel` can return `Incompatible(...)` for any input, + override `getIncompatibleReasons()` and include a reason for every + incompatible branch. +2. If `getSupportLevel` can return `Unsupported(...)` for any input, + override `getUnsupportedReasons()` and include a reason for every + unsupported branch. +3. If `convert` has its own `withInfo(...); None` fallbacks (e.g. literal + checks), enumerate those reasons in `getUnsupportedReasons()` too. +4. Extract each reason into a `private val` and reference it from both + `getSupportLevel` and `get*Reasons()`. Do not duplicate the string + inline. Canonical example: + `spark/src/main/scala/org/apache/comet/serde/arrays.scala::CometArrayIntersect`. + It declares `private val incompatReason` and + `private val unsupportedCollationReason`, and each is referenced from + `getSupportLevel` and the matching reasons method. +5. Prefer `Incompatible(Some(reason))` over `Incompatible(None)`. The + `None` form drops the reason from EXPLAIN output, leaving users with a + generic "not fully compatible" message and forcing them to read the + docs to find out why. +6. Gate compatibility decisions in `getSupportLevel`, not inside `convert`. + Putting the check inside `convert` (e.g. reading a config flag and + calling `withInfo`) bypasses the dispatcher's `allowIncompatible` + handling and the EXPLAIN message becomes inconsistent with what the + doc generator produces. + +**Wording guidelines for reason strings.** + +Reasons appear verbatim in the Compatibility Guide (rendered as Markdown) +and in EXPLAIN output, so they are user-facing. + +- Lead with the user-observable effect, then the cause if helpful. + ✅ "Result array element order may differ from Spark when the right array + is longer than the left." + ❌ "DataFusion probes the longer side." +- Use sentence case and end with a period. +- Use backticks around config keys, type names, and SQL identifiers. +- Link to a tracking GitHub issue for known incompatibilities so users can + follow progress: `(https://github.com/apache/datafusion-comet/issues/NNNN)`. +- Keep it concise. Single sentence is best. +- Do not write "Incompatible reason: ..." or "Unsupported because ...". + The doc generator adds the framing. +- Phrase Incompatible reasons as *what differs from Spark*, not *what is + missing*. Phrase Unsupported reasons as *what does not run*. If you find + yourself writing an "Incompatible" reason that says "Comet only supports + X" or "Y is not supported", the support level is probably wrong: it + should be `Unsupported`. + +**Common antipatterns to flag during the audit.** + +- A `private val reason` constant declared near the top of the object, but + `getSupportLevel` hardcodes a different string inline. The doc and the + EXPLAIN message will drift. + Real example: `CometHour` declares `incompatReason` then hardcodes a + near-duplicate string in `getSupportLevel`. +- Reason text duplicated in two places without a shared constant. + Real examples: `CometMinute`, `CometSecond`. +- `Incompatible(None)` paired with a populated `getIncompatibleReasons()`. + The reason reaches the docs but not EXPLAIN output. + Real example: `CometInitCap`. +- `getIncompatibleReasons()` overridden but `getSupportLevel` never returns + `Incompatible(...)`. Either the reasons method is dead code, or + `getSupportLevel` is missing a branch. +- A reason string phrased as "X is not supported" attached to an + `Incompatible` branch (or vice versa). Re-read it and decide which + support level it really belongs to. +- `convert` bails out via `withInfo` for a case that is fully knowable + from the expression (e.g. an unsupported child data type). Move the + check into `getSupportLevel` so the dispatcher handles it uniformly. ### Shims @@ -237,7 +343,52 @@ Also review the Comet implementation (Step 3) against the Spark behavior (Step 1 - Are there behavioral differences that are NOT marked `Incompatible` but should be? - Are there behavioral differences between Spark versions that the Comet implementation does not account for (missing shim)? - Does the Rust implementation match the Spark behavior for all edge cases? -- If `getSupportLevel` returns `Incompatible` or `Unsupported` with a reason, are `getIncompatibleReasons()` / `getUnsupportedReasons()` also overridden so the reason is picked up by `GenerateDocs` and appears in the Compatibility Guide? + +### Support-level consistency audit + +Walk through this checklist against the serde. Each failed item is a +finding for Step 6. + +1. **Support level matches behavior.** For each branch of + `getSupportLevel`, decide whether the user-observable effect is a wrong + answer (`Incompatible`) or a fallback / error (`Unsupported`). Flag any + branch where the label does not match the behavior. +2. **Reasons cover every branch.** Every distinct reason that + `getSupportLevel` can return as `Incompatible(Some(r))` must appear in + `getIncompatibleReasons()`. Same for `Unsupported(Some(r))` and + `getUnsupportedReasons()`. Missing reasons silently drop from the + Compatibility Guide. +3. **Reasons are not dead code.** If `getIncompatibleReasons()` is + overridden but `getSupportLevel` never returns `Incompatible(...)`, + either the reason is stale or `getSupportLevel` is missing a branch. + Same for `getUnsupportedReasons()`. +4. **Reason strings are shared via `private val`.** If the same reason + appears as a string literal in two places, flag it: changes to one + will not propagate to the other. +5. **Inline reason matches the constant.** If a `private val reason` is + declared but `getSupportLevel` uses a different string literal, flag + it as a drift bug. +6. **`Incompatible(None)` has no docs-only reason.** If + `getSupportLevel` returns `Incompatible(None)` but + `getIncompatibleReasons()` is non-empty, flag it: the EXPLAIN message + will be generic while the docs show a specific reason. Switch to + `Incompatible(Some(reason))`. +7. **`convert` fallbacks are documented.** If `convert` calls + `withInfo(expr, "...")` and returns `None` for cases not covered by + `getSupportLevel` (e.g. non-literal arguments, unsupported expression + shapes), confirm the reason is also listed in + `getUnsupportedReasons()`. +8. **Compatibility decisions live in `getSupportLevel`.** If `convert` + reads a config flag and bails out, prefer moving the check into + `getSupportLevel` so the dispatcher handles `allowIncompatible` + uniformly. `CometCaseConversionBase` is an example of the in-`convert` + pattern that this skill recommends against. +9. **Reason wording.** Each reason should describe the user-observable + effect, use sentence case with a period, backtick config keys and + types, and link a tracking issue when one exists. Flag reasons that + read like internal implementation notes ("DataFusion probes the longer + side") or that mismatch their support level (an "Incompatible" reason + that says "X is not supported"). --- @@ -259,9 +410,13 @@ Minor gaps, cosmetic improvements, or nice-to-have tests. --- -## Step 7: Offer to Implement Missing Tests +## Step 7: Offer to Implement Missing Tests and Fix Consistency Findings + +After presenting the gap analysis, ask the user separately about each +category of finding. Keep the two asks distinct so the user can opt in to +one without the other. -After presenting the gap analysis, ask the user: +**Test coverage:** > I found the following missing test cases. Would you like me to implement them? > @@ -270,8 +425,23 @@ After presenting the gap analysis, ask the user: > I can add them as Comet SQL Tests in `spark/src/test/resources/sql-tests/expressions//$ARGUMENTS.sql` > (or as Comet Scala Tests in `CometExpressionSuite` for cases that require programmatic setup). -If the user says yes, implement the missing tests following the Comet SQL Tests format described in -`docs/source/contributor-guide/sql-file-tests.md`. Prefer Comet SQL Tests over Comet Scala Tests. +**Support-level consistency:** + +> I also found the following alignment issues between `getSupportLevel`, +> `getIncompatibleReasons`, `getUnsupportedReasons`, and `convert`. Would +> you like me to fix them? +> +> - [list each finding from the Step 5 consistency audit, with the file +> and serde object name] +> +> Each fix is a small, mechanical edit (extract a `private val` for the +> reason, switch `Incompatible(None)` to `Incompatible(Some(reason))`, +> add a missing `get*Reasons()` override, or move a check from `convert` +> into `getSupportLevel`). + +If the user says yes to tests, implement them following the Comet SQL Tests +format described in `docs/source/contributor-guide/sql-file-tests.md`. +Prefer Comet SQL Tests over Comet Scala Tests. ### Comet SQL Tests template From 402d314c642a1cffa4b96ef08904bc8a2e6cf4c4 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Wed, 27 May 2026 07:47:14 -0600 Subject: [PATCH 2/4] style: apply prettier --- .claude/skills/audit-comet-expression/SKILL.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/.claude/skills/audit-comet-expression/SKILL.md b/.claude/skills/audit-comet-expression/SKILL.md index bfabfeaa4e..7cac6c252d 100644 --- a/.claude/skills/audit-comet-expression/SKILL.md +++ b/.claude/skills/audit-comet-expression/SKILL.md @@ -147,9 +147,9 @@ most common bugs in Comet serdes are misalignments between them. Note that any `Some(...)` on `Compatible` triggers a runtime `logWarning`, so reserve it for genuinely useful caveats. -Decision rule: if the user would be surprised by a *wrong answer*, it is -`Incompatible`. If the user would be surprised by a *runtime error or -unsupported-type message*, it is `Unsupported`. +Decision rule: if the user would be surprised by a _wrong answer_, it is +`Incompatible`. If the user would be surprised by a _runtime error or +unsupported-type message_, it is `Unsupported`. **Runtime vs docs split.** @@ -211,8 +211,8 @@ and in EXPLAIN output, so they are user-facing. - Keep it concise. Single sentence is best. - Do not write "Incompatible reason: ..." or "Unsupported because ...". The doc generator adds the framing. -- Phrase Incompatible reasons as *what differs from Spark*, not *what is - missing*. Phrase Unsupported reasons as *what does not run*. If you find +- Phrase Incompatible reasons as _what differs from Spark_, not _what is + missing_. Phrase Unsupported reasons as _what does not run_. If you find yourself writing an "Incompatible" reason that says "Comet only supports X" or "Y is not supported", the support level is probably wrong: it should be `Unsupported`. @@ -431,8 +431,7 @@ one without the other. > `getIncompatibleReasons`, `getUnsupportedReasons`, and `convert`. Would > you like me to fix them? > -> - [list each finding from the Step 5 consistency audit, with the file -> and serde object name] +> - [list each finding from the Step 5 consistency audit, with the file and serde object name] > > Each fix is a small, mechanical edit (extract a `private val` for the > reason, switch `Incompatible(None)` to `Incompatible(Some(reason))`, From d911a436a644d422053c4f976e0d87af08e6db63 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Wed, 27 May 2026 08:03:09 -0600 Subject: [PATCH 3/4] docs: audit date/time expressions across Spark 3.4.3, 3.5.8, 4.0.1 Add per-version audit sub-bullets to every implemented date/time expression in spark_expressions_support.md using the audit-comet-expression skill. Covers 38 SQL function names across the 33 backing Comet serde objects (some serdes back multiple SQL names, e.g. `day`/`dayofmonth`, `date_add`/`dateadd`, `date_diff`/`datediff`). For each function, the sub-bullets record: - Whether the Spark class is identical across 3.4.3, 3.5.8, 4.0.1 - Spark 4.0 changes (universally the `NullIntolerant` trait / `nullIntolerant: Boolean` refactor, plus `StringTypeWithCollation` widening on string inputs and some error-helper renames) - Known divergences between Comet and Spark, with tracking-issue links The audit was driven by 8 parallel agents, each handling a related group of expressions (codegen-dispatched, date field extractors, Hour/Minute/Second, scalar function wrappers, timezone/unix, truncation, format, Iceberg transforms). Out of scope: `current_timezone`, `date_part`, `datepart`, `extract`, `localtimestamp` route through Spark optimizer rewrites or evaluate to constants and do not have dedicated Comet serdes; `days` and `hours` are V2 partition transforms with no SQL function name and so do not appear in this section. No code changes in this PR; consistency findings flagged during the audit will be addressed in follow-up PRs. --- .../spark_expressions_support.md | 128 ++++++++++++++++++ 1 file changed, 128 insertions(+) diff --git a/docs/source/contributor-guide/spark_expressions_support.md b/docs/source/contributor-guide/spark_expressions_support.md index ccb816c668..c6337c0e41 100644 --- a/docs/source/contributor-guide/spark_expressions_support.md +++ b/docs/source/contributor-guide/spark_expressions_support.md @@ -215,86 +215,214 @@ ### datetime_funcs - [x] add_months + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `inputTypes = Seq(DateType, IntegerType)`; returns `DateType`; codegen delegates to `DateTimeUtils.dateAddMonths`. + - Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true` on `AddMonthsBase`; behaviour and codegen unchanged. - [x] convert_timezone + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. Ternary `(sourceTz, targetTz, sourceTs)`; `inputTypes = Seq(StringType, StringType, TimestampNTZType)`; delegates to `DateTimeUtils.convertTimestampNtzToAnotherTz`. + - Spark 4.0.1 (audited 2026-05-27): timezone `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; behaviour unchanged for ASCII timezone strings. + - Known divergence: Comet composes `to_utc_timestamp` then `from_utc_timestamp` and its native timezone parser only accepts IANA zone IDs and `+HH:MM` offsets, so legacy forms like `GMT+1`, `UTC+1`, or three-letter abbreviations throw a native parse error at execution (https://github.com/apache/datafusion-comet/issues/2013). - [ ] curdate - [ ] current_date - [ ] current_time - [ ] current_timestamp - [x] current_timezone - [x] date_add + - Spark 3.4.3 (audited 2026-05-27): baseline. `(DateType, IntegerType|ShortType|ByteType) -> DateType`; `nullSafeEval` returns `startDays + d.intValue()` with Java int wrap-around; no ANSI branch. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true`. - [x] date_diff + - Spark 3.4.3 (audited 2026-05-27): baseline. `(DateType, DateType) -> IntegerType`; `nullSafeEval` is `endDays - startDays` with Java int wrap-around. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true`. + - Known divergence: the native impl uses non-wrapping `i32 -`, which would panic in debug builds on extreme inputs (Spark wraps); practically unreachable for date inputs. - [x] date_format + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `(TimestampType, StringType) -> StringType`; format string is parsed via `TimestampFormatter` (`DateTimeFormatter` under `CORRECTED` policy, `SimpleDateFormat` under `LEGACY` policy). + - Spark 4.0.1 (audited 2026-05-27): trait set updated to use `DefaultStringProducingExpression`; `nullIntolerant` becomes a field; format `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`. Behaviour unchanged for ASCII format strings. + - Known divergence: only a curated allow-list of `SimpleDateFormat` patterns runs natively (via DataFusion `to_char`). Non-UTC session timezones with a whitelisted format require `spark.comet.expr.dateFormat.allowIncompatible=true`. Non-literal formats, non-whitelisted formats, and the default disabled-codegen path route through Spark's `DateFormatClass.doGenCode` only when `spark.comet.exec.scalaUDF.codegen.enabled=true`; otherwise the operator falls back to Spark. `spark.sql.legacy.timeParserPolicy=LEGACY` is honoured only on the codegen-dispatch / Spark-fallback paths; the native allow-list assumes corrected semantics. - [x] date_from_unix_date + - Spark 3.4.3 (audited 2026-05-27): baseline. `IntegerType -> DateType`; `nullSafeEval` is the identity on days-since-epoch. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true`. - [x] date_part - [x] date_sub + - Spark 3.4.3 (audited 2026-05-27): baseline. Mirror of `DateAdd` (`startDays - d.intValue()`); same input types and wrap-around behaviour; no ANSI branch. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true`. - [x] date_trunc + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `inputTypes = Seq(StringType, TimestampType)`; format parsed by `parseTruncLevel` (case-insensitive) and supports `YEAR`/`YYYY`/`YY`, `QUARTER`, `MONTH`/`MM`/`MON`, `WEEK`, `DAY`/`DD`, `HOUR`, `MINUTE`, `SECOND`, `MILLISECOND`, `MICROSECOND`. Unknown levels return NULL. Truncation is `TimeZoneAware` and uses `zoneId` for day-and-coarser units. + - Spark 4.0.1 (audited 2026-05-27): format `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; truncation semantics unchanged for ASCII format strings. + - Known divergence: Comet returns incorrect results in non-UTC session timezones for day-and-coarser units (https://github.com/apache/datafusion-comet/issues/2649); marked `Incompatible` when the resolved zone is not `UTC` / `Etc/UTC`. Non-literal and unsupported format strings raise a native execution error instead of returning NULL. - [x] dateadd + - Spark 3.4.3 (audited 2026-05-27): SQL alias for `date_add`; see that entry. + - Spark 3.5.8 (audited 2026-05-27): SQL alias for `date_add`; see that entry. + - Spark 4.0.1 (audited 2026-05-27): SQL alias for `date_add`; see that entry. - [x] datediff + - Spark 3.4.3 (audited 2026-05-27): SQL alias for `date_diff`; see that entry. + - Spark 3.5.8 (audited 2026-05-27): SQL alias for `date_diff`; see that entry. + - Spark 4.0.1 (audited 2026-05-27): SQL alias for `date_diff`; see that entry. - [x] datepart - [x] day + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. SQL alias for `DayOfMonth`; delegates to `DateTimeUtils.getDayOfMonth` via `LocalDate.getDayOfMonth` (1..31). + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [ ] dayname - [x] dayofmonth + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `DayOfMonth extends GetDateField`; delegates to `DateTimeUtils.getDayOfMonth` (1..31). + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [x] dayofweek + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `DayOfWeek extends GetDateField`; returns 1..7 with Sunday=1 via `LocalDate.getDayOfWeek.plus(1).getValue`. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [x] dayofyear + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `DayOfYear extends GetDateField`; returns 1..366 via `LocalDate.getDayOfYear`. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [x] extract - [x] from_unixtime + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `BinaryExpression` with `inputTypes = Seq(LongType, StringType)`; returns `StringType`; uses session `zoneId` to format the resulting timestamp. + - Spark 4.0.1 (audited 2026-05-27): now `DefaultStringProducingExpression`; format `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; `nullIntolerant` set via override instead of the `NullIntolerant` trait. + - Known divergence: Comet only honours the default format pattern `yyyy-MM-dd HH:mm:ss`; any other format falls back to Spark. Implemented by composing DataFusion's `from_unixtime` and `to_char`, so DataFusion's valid timestamp range differs from Spark (https://github.com/apache/datafusion/issues/16594) and Spark datetime patterns are not honoured even when supplied (https://github.com/apache/datafusion/issues/16577). - [x] from_utc_timestamp - Spark 3.4.3 (audited 2026-05-12): identical to 3.5.8. - Spark 3.5.8 (audited 2026-05-12): baseline. - Spark 4.0.1 (audited 2026-05-12): `inputTypes` widened to `StringTypeWithCollation`; behaviour unchanged for ASCII timezone strings. - Known divergence: Comet's native timezone parser does not accept Spark's legacy zone forms (`GMT+1`, `UTC+1`, three-letter abbreviations like `PST`). Such timezones throw a native parse error at execution. - [x] hour + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `case class Hour` extends `GetTimeField`. + - Spark 4.0.1 (audited 2026-05-27): `case class Hour` is unchanged; parent `GetTimeField` trait refactored to override `nullIntolerant: Boolean = true` instead of mixing in `NullIntolerant` (no behavioural change). + - Known divergence: for `TimestampNTZType` inputs Comet's native path applies session-timezone conversion (Spark treats `TIMESTAMP_NTZ` as wall-clock and ignores session timezone), so the returned hour can differ. Marked `Incompatible` and gated by `spark.comet.expr.allowIncompatible` (https://github.com/apache/datafusion-comet/issues/3180). - [x] last_day + - Spark 3.4.3 (audited 2026-05-27): baseline. `DateType -> DateType`; computes `DateTimeUtils.getLastDayOfMonth`; no ANSI branch. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait replaced by `override def nullIntolerant: Boolean = true`. - [x] localtimestamp - [x] make_date + - Spark 3.4.3 (audited 2026-05-27): baseline. `(IntegerType, IntegerType, IntegerType) -> DateType`; under `spark.sql.ansi.enabled=true` invalid `(year, month, day)` throws `ansiDateTimeError`, else returns NULL. Documented valid year range is 1 to 9999. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): error helper renamed to `ansiDateTimeArgumentOutOfRange`; behaviour otherwise unchanged. + - Known divergence: `SparkMakeDate` in `native/spark-expr/src/datetime_funcs/make_date.rs` always returns NULL on invalid input and never raises, so Comet diverges from Spark when `spark.sql.ansi.enabled=true`. It also accepts year 0 and negative years (chrono's proleptic calendar) which Spark rejects. - [ ] make_dt_interval - [ ] make_interval - [ ] make_time - [x] make_timestamp + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. Septenary `(year, month, day, hour, min, sec[, timezone])` with `sec: DecimalType(16,6)`; honours `spark.sql.ansi.enabled` (throws on invalid input, else NULL); timezone input is `StringType`; result type follows `spark.sql.timestampType`. + - Spark 4.0.1 (audited 2026-05-27): timezone input widened to `StringTypeWithCollation(supportsTrimCollation = true)`; ANSI error helpers renamed (`ansiDateTimeArgumentOutOfRange`, `invalidFractionOfSecondError(value)`); `NullIntolerant` trait replaced by `nullIntolerant: Boolean = true`; new sibling `TryMakeTimestamp` added but routed through a separate expression. - [ ] make_timestamp_ltz - [ ] make_timestamp_ntz - [ ] make_ym_interval - [x] minute + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `case class Minute` extends `GetTimeField`. + - Spark 4.0.1 (audited 2026-05-27): `case class Minute` is unchanged; parent `GetTimeField` trait refactored to override `nullIntolerant: Boolean = true` instead of mixing in `NullIntolerant` (no behavioural change). + - Known divergence: for `TimestampNTZType` inputs Comet's native path applies session-timezone conversion (Spark treats `TIMESTAMP_NTZ` as wall-clock and ignores session timezone), so the returned minute can differ. Marked `Incompatible` and gated by `spark.comet.expr.allowIncompatible` (https://github.com/apache/datafusion-comet/issues/3180). - [x] month + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `Month extends GetDateField`; delegates to `DateTimeUtils.getMonth` via `LocalDate.getMonthValue` (1..12). + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [ ] monthname - [x] months_between + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. Ternary `(date1: Timestamp, date2: Timestamp, roundOff: Boolean)` returning `DoubleType`; `TimeZoneAwareExpression`; codegen delegates to `DateTimeUtils.monthsBetween`. + - Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` trait dropped in favour of `nullIntolerant: Boolean = true` override; signature and runtime behaviour unchanged. - [x] next_day + - Spark 3.4.3 (audited 2026-05-27): baseline. `(DateType, StringType) -> DateType`; under `spark.sql.ansi.enabled=true` an unrecognised `dayOfWeek` throws `ansiIllegalArgumentError`, else returns NULL. Allowed tokens come from `DateTimeUtils.getDayOfWeekFromString` (`SU/SUN/SUNDAY`, `MO/MON/MONDAY`, ...), case-insensitive via `Locale.ROOT`, no trimming. + - Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3. + - Spark 4.0.1 (audited 2026-05-27): error type changed to `SparkIllegalArgumentException`; `inputTypes` now uses `StringTypeWithCollation(supportsTrimCollation = true)`. + - Known divergence: `datafusion-spark::SparkNextDay` returns NULL for malformed `dayOfWeek` regardless of `spark.sql.ansi.enabled`, so ANSI mode does not throw. It also `trim()`s the day-of-week argument before matching, so `' MO '` succeeds natively while Spark would treat it as invalid. - [ ] now - [x] quarter + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `Quarter extends GetDateField`; returns 1..4 via `IsoFields.QUARTER_OF_YEAR`. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [x] second + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `case class Second` extends `GetTimeField`. + - Spark 4.0.1 (audited 2026-05-27): `case class Second` is unchanged; parent `GetTimeField` trait refactored to override `nullIntolerant: Boolean = true` instead of mixing in `NullIntolerant` (no behavioural change). + - Known divergence: for `TimestampNTZType` inputs Comet's native path applies session-timezone conversion (Spark treats `TIMESTAMP_NTZ` as wall-clock and ignores session timezone), so the returned second can differ. Marked `Incompatible` and gated by `spark.comet.expr.allowIncompatible` (https://github.com/apache/datafusion-comet/issues/3180). - [ ] session_window - [ ] time_diff - [ ] time_trunc - [x] timestamp_micros + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `MicrosToTimestamp` extends `IntegralToTimestampBase` with `upScaleFactor = 1`; accepts `IntegralType`, returns `TimestampType`; codegen is identity. + - Spark 4.0.1 (audited 2026-05-27): `IntegralToTimestampBase` drops the `NullIntolerant` trait in favour of `nullIntolerant: Boolean = true`; behaviour unchanged. - [x] timestamp_millis + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `MillisToTimestamp` extends `IntegralToTimestampBase` with `upScaleFactor = MICROS_PER_MILLIS (1000)`; multiply overflow throws via `Math.multiplyExact`. + - Spark 4.0.1 (audited 2026-05-27): same as 3.5.8 modulo the `NullIntolerant` trait/method refactor in `IntegralToTimestampBase`. - [x] timestamp_seconds + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `inputTypes = Seq(NumericType)` accepting integral, decimal, float, and double; integral values use `Math.multiplyExact` (overflow throws); float and double return NULL on NaN or Infinity. + - Spark 4.0.1 (audited 2026-05-27): `nullIntolerant` set via override instead of the `NullIntolerant` trait; otherwise identical to 3.5.8. + - Known divergence: Comet's Rust impl supports only Int32, Int64, Float32, and Float64. `DecimalType`, `ByteType`, and `ShortType` fall back to Spark. Int64 overflow returns a `ComputeError` matching Spark's `ArithmeticException`. NaN and Infinity map to NULL on the float and double paths. - [ ] to_date - [ ] to_time - [ ] to_timestamp - [ ] to_timestamp_ltz - [ ] to_timestamp_ntz - [x] to_unix_timestamp + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `inputTypes = Seq(TypeCollection(StringType, DateType, TimestampType, TimestampNTZType), StringType)`; returns `LongType` seconds; honours `spark.sql.ansi.enabled` (throws on parse error, else NULL); `TimeZoneAwareExpression`. + - Spark 4.0.1 (audited 2026-05-27): both the value and the format argument become `StringTypeWithCollation(supportsTrimCollation = true)`; a new `suggestedFuncOnFail = "try_to_timestamp"` field is added on `ToTimestamp` (advisory). + - Known divergence: routed through the JVM codegen dispatcher rather than a native kernel, so behaviour is bit-identical to Spark only when `spark.comet.exec.scalaUDF.codegen.enabled=true`; when the flag is off the operator falls back to Spark. - [x] to_utc_timestamp - Spark 3.4.3 (audited 2026-05-12): identical to 3.5.8. - Spark 3.5.8 (audited 2026-05-12): baseline. - Spark 4.0.1 (audited 2026-05-12): `inputTypes` widened to `StringTypeWithCollation`; behaviour unchanged for ASCII timezone strings. - Known divergence: Comet's native timezone parser does not accept Spark's legacy zone forms (`GMT+1`, `UTC+1`, three-letter abbreviations like `PST`). Such timezones throw a native parse error at execution. - [x] trunc + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `inputTypes = Seq(DateType, StringType)`; `parseTruncLevel` is case-insensitive and accepts `YEAR`/`YYYY`/`YY`, `QUARTER`, `MONTH`/`MM`/`MON`, `WEEK`. Unknown or sub-week levels (`DAY`, `HOUR`, ...) return NULL because `MIN_LEVEL_OF_DATE_TRUNC` is `TRUNC_TO_WEEK`. + - Spark 4.0.1 (audited 2026-05-27): format `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; truncation semantics unchanged for ASCII format strings. + - Known divergence: Comet's native kernel raises an execution error for unknown format strings instead of returning NULL, so non-literal formats are flagged `Incompatible`. Sub-week formats such as `DAY`/`DD` are rejected with `Unsupported` (Spark would return NULL) and fall back to Spark. - [ ] try_make_interval - [ ] try_make_timestamp - [ ] try_to_date - [ ] try_to_time - [ ] try_to_timestamp - [x] unix_date + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `UnixDate(child)`: `inputTypes = Seq(DateType)`, `dataType = IntegerType`; `nullSafeEval` returns the underlying days-since-epoch int unchanged. + - Spark 4.0.1 (audited 2026-05-27): semantics unchanged; `NullIntolerant` trait is replaced by `nullIntolerant: Boolean = true`. - [x] unix_micros + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `UnixMicros` extends `TimestampToLongBase` with `scaleFactor = 1`; codegen reduces to identity on the underlying micros. + - Spark 4.0.1 (audited 2026-05-27): same as 3.5.8 modulo the `NullIntolerant` trait/method refactor. - [x] unix_millis + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `UnixMillis` extends `TimestampToLongBase` with `scaleFactor = MICROS_PER_MILLIS`; floor-divides timestamp micros by 1000. + - Spark 4.0.1 (audited 2026-05-27): same as 3.5.8 modulo the `NullIntolerant` trait/method refactor. - [x] unix_seconds + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `UnixSeconds` extends `TimestampToLongBase` with `scaleFactor = MICROS_PER_SECOND`; accepts `TimestampType` only; returns `LongType`; floor-divides micros by the scale factor. + - Spark 4.0.1 (audited 2026-05-27): `TimestampToLongBase` swaps the `NullIntolerant` trait for `nullIntolerant: Boolean = true`; numerics unchanged. - [x] unix_timestamp + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. Inherits from `UnixTime` / `ToTimestamp`; `inputTypes = Seq(TypeCollection(StringType, DateType, TimestampType, TimestampNTZType), StringType)`; result is `LongType`; honours `failOnError` for ANSI parse errors on the string path. + - Spark 4.0.1 (audited 2026-05-27): `ToTimestamp.inputTypes` widens the string slot to `StringTypeWithCollation(supportsTrimCollation = true)`; behaviour unchanged for non-collated strings. + - Known divergence: Comet's native path only accepts `TimestampType`, `DateType`, and `TimestampNTZType` (string inputs fall back to Spark). For `TimestampType` and `DateType` the session timezone is applied via `array_with_timezone`; for `TimestampNTZType` the microsecond value is divided directly without timezone adjustment. - [x] weekday + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `WeekDay extends GetDateField`; returns 0..6 with Monday=0 via `LocalDate.getDayOfWeek.ordinal()`. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [x] weekofyear + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `WeekOfYear extends GetDateField`; returns the ISO-8601 week-of-week-based-year via `IsoFields.WEEK_OF_WEEK_BASED_YEAR` (Monday start, week 1 has more than 3 days). Comet maps this to DataFusion's `datepart('week', ...)` which uses Arrow's `iso_week().week()`, matching Spark. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. - [ ] window - [ ] window_time - [x] year + - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. + - Spark 3.5.8 (audited 2026-05-27): baseline. `Year extends GetDateField`; delegates to `DateTimeUtils.getYear` via `LocalDate.getYear`. + - Spark 4.0.1 (audited 2026-05-27): identical semantics; `GetDateField` drops the `NullIntolerant` mixin in favour of `nullIntolerant: Boolean = true`. ### generator_funcs From 481e97e084899ba8704eebcef2131359ac9a3823 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Wed, 27 May 2026 08:14:28 -0600 Subject: [PATCH 4/4] test: capture datetime audit findings as ignored SQL tests and update audit skill Skill update (audit-comet-expression): - Step 6 reorganises priorities: correctness divergences, missing coverage, and consistency issues are now three distinct buckets. - Step 7 now requires correctness findings to land as captured tests in the same PR as the audit. Walks through the search-or-file issue workflow and the trivial-fix-vs-ignore decision, with a `query ignore()` example. Captured tests (linked to follow-up issues filed against this PR): - next_day.sql gains a divergence test for whitespace trimming (Comet trims `' MO '`; Spark does not). ignore(#4450). - next_day_ansi.sql is new and asserts that `next_day` throws under `spark.sql.ansi.enabled=true` for malformed dayOfWeek. Comet currently returns NULL. ignore(#4449). - make_date_ansi.sql is new and asserts that `make_date` throws under `spark.sql.ansi.enabled=true` for invalid `(year, month, day)`. Comet currently returns NULL. ignore(#4451). The fourth audit finding (`make_date` year 0 / negative years) was verified against Spark's own implementation and turned out to be a non-divergence; the issue was closed and no test added. None of the three remaining bugs are trivial to fix here: both `SparkNextDay` and `SparkMakeDate` live upstream in the `datafusion-spark` crate, so the fixes need to flow through that project. The captured tests will switch from `ignore(...)` to their intended assertion mode when the upstream changes land. --- .../skills/audit-comet-expression/SKILL.md | 77 +++++++++++++++---- .../expressions/datetime/make_date_ansi.sql | 36 +++++++++ .../expressions/datetime/next_day.sql | 5 ++ .../expressions/datetime/next_day_ansi.sql | 30 ++++++++ 4 files changed, 133 insertions(+), 15 deletions(-) create mode 100644 spark/src/test/resources/sql-tests/expressions/datetime/make_date_ansi.sql create mode 100644 spark/src/test/resources/sql-tests/expressions/datetime/next_day_ansi.sql diff --git a/.claude/skills/audit-comet-expression/SKILL.md b/.claude/skills/audit-comet-expression/SKILL.md index 7cac6c252d..a1b48f00cf 100644 --- a/.claude/skills/audit-comet-expression/SKILL.md +++ b/.claude/skills/audit-comet-expression/SKILL.md @@ -396,29 +396,75 @@ finding for Step 6. Summarize findings as a prioritized list. -### High priority +### High priority: correctness divergences -Issues where Comet may silently produce wrong results compared to Spark. +Cases where Comet produces a different observable result from Spark +(wrong value, missing exception, accepted-instead-of-rejected input, +etc.). Each one becomes a captured test in Step 7. -### Medium priority +### Medium priority: missing test coverage -Missing test coverage for edge cases that could expose bugs. +Edge cases Spark exercises but Comet does not test, where the behaviour +appears to match but is not verified. -### Low priority +### Low priority: cosmetic and consistency issues -Minor gaps, cosmetic improvements, or nice-to-have tests. +Reason-string drift, missing `get*Reasons()` overrides, dead branches, +etc. These come from the Step 5 consistency audit. --- -## Step 7: Offer to Implement Missing Tests and Fix Consistency Findings +## Step 7: Capture Correctness Findings as Tests; Offer Consistency Fixes -After presenting the gap analysis, ask the user separately about each -category of finding. Keep the two asks distinct so the user can opt in to -one without the other. +Correctness divergences (Step 6 high-priority items) must not be left as +prose. Every divergence the audit identified becomes a regression test +in the same PR as the audit, so future readers can run it. -**Test coverage:** +For each correctness finding, do the following in this order: -> I found the following missing test cases. Would you like me to implement them? +1. **Search for an existing tracking issue.** + + ```bash + gh issue list --search " in:title,body" --state all --limit 5 + ``` + + Match on both the expression name and a distinguishing keyword + (ANSI, timezone, NTZ, overflow, etc.). + +2. **If no issue exists, file one.** Use the `correctness` label plus + the relevant area label (e.g. `temporal expressions`). Keep the + title in the form "[Bug] ". Include + the Spark version, a minimal repro, and the divergent result. + +3. **If the fix is trivial and you are confident, fix it inline.** + Then add the test in the default `query` mode so it locks in the + fix. "Trivial" means a few lines in one file with no native code + changes, no support-level reshuffling, and no semantics decisions + that the user should weigh in on. + +4. **Otherwise, add the test in `query ignore()` mode** with + a one-line SQL comment above the query explaining the symptom. The + test file lives next to the existing tests for the expression. Do + not skip writing the test because the bug is unfixed: the captured + reproduction is the whole point of this step. + + ```sql + -- Comet returns NULL where Spark throws under spark.sql.ansi.enabled=true + query ignore(https://github.com/apache/datafusion-comet/issues/NNNN) + SELECT next_day(date('2024-01-01'), 'NOT_A_DAY') + ``` + + When the underlying bug is fixed, the `ignore(...)` is removed and + the test starts running. This is also the contract documented in + `docs/source/contributor-guide/sql-file-tests.md`. + +After capturing correctness findings, ask the user about the remaining +categories. Keep these asks distinct so the user can opt in to each. + +**Missing test coverage (non-bug):** + +> Spark exercises the following cases that have no Comet test. Would +> you like me to add them? > > - [list each missing test case] > @@ -438,9 +484,10 @@ one without the other. > add a missing `get*Reasons()` override, or move a check from `convert` > into `getSupportLevel`). -If the user says yes to tests, implement them following the Comet SQL Tests -format described in `docs/source/contributor-guide/sql-file-tests.md`. -Prefer Comet SQL Tests over Comet Scala Tests. +If the user says yes to test coverage, implement them following the Comet +SQL Tests format described in +`docs/source/contributor-guide/sql-file-tests.md`. Prefer Comet SQL Tests +over Comet Scala Tests. ### Comet SQL Tests template diff --git a/spark/src/test/resources/sql-tests/expressions/datetime/make_date_ansi.sql b/spark/src/test/resources/sql-tests/expressions/datetime/make_date_ansi.sql new file mode 100644 index 0000000000..04a28e3e1c --- /dev/null +++ b/spark/src/test/resources/sql-tests/expressions/datetime/make_date_ansi.sql @@ -0,0 +1,36 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. + +-- ANSI mode: Spark's MakeDate wraps the java.time.DateTimeException from LocalDate.of in +-- ansiDateTimeArgumentOutOfRange (4.0) / ansiDateTimeError (3.4/3.5) when +-- spark.sql.ansi.enabled=true. Comet's native SparkMakeDate always returns NULL on +-- invalid input and never raises, so it does not throw under ANSI. The ignored queries +-- below capture the divergence; remove ignore(...) when +-- https://github.com/apache/datafusion-comet/issues/4451 is fixed. +-- Config: spark.sql.ansi.enabled=true + +-- February 30 is not a valid date. +query ignore(https://github.com/apache/datafusion-comet/issues/4451) +SELECT make_date(2024, 2, 30) + +-- Month 13 is out of range. +query ignore(https://github.com/apache/datafusion-comet/issues/4451) +SELECT make_date(2024, 13, 1) + +-- Day 0 is out of range. +query ignore(https://github.com/apache/datafusion-comet/issues/4451) +SELECT make_date(2024, 6, 0) diff --git a/spark/src/test/resources/sql-tests/expressions/datetime/next_day.sql b/spark/src/test/resources/sql-tests/expressions/datetime/next_day.sql index 057c9daea6..65cccb9356 100644 --- a/spark/src/test/resources/sql-tests/expressions/datetime/next_day.sql +++ b/spark/src/test/resources/sql-tests/expressions/datetime/next_day.sql @@ -72,3 +72,8 @@ SELECT next_day(date('2023-01-01'), 'Monday'), next_day(date('2023-01-01'), 'Sun -- null handling query SELECT next_day(NULL, 'Monday'), next_day(date('2023-01-01'), NULL) + +-- Comet's native impl trims whitespace before matching the day name; Spark does not, so +-- ' MO ' is invalid in Spark (NULL) but matches Monday in Comet. +query ignore(https://github.com/apache/datafusion-comet/issues/4450) +SELECT next_day(date('2024-01-01'), ' MO '), next_day(date('2024-01-01'), 'MO ') diff --git a/spark/src/test/resources/sql-tests/expressions/datetime/next_day_ansi.sql b/spark/src/test/resources/sql-tests/expressions/datetime/next_day_ansi.sql new file mode 100644 index 0000000000..9f8f8e435f --- /dev/null +++ b/spark/src/test/resources/sql-tests/expressions/datetime/next_day_ansi.sql @@ -0,0 +1,30 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. + +-- ANSI mode: Spark's NextDay throws SparkIllegalArgumentException on a malformed +-- dayOfWeek. Comet's native impl (datafusion-spark::SparkNextDay) always returns NULL, +-- so it does not throw under ANSI. The expect_error pattern below will be the assertion +-- once https://github.com/apache/datafusion-comet/issues/4449 is fixed; until then the +-- queries are ignored so the suite stays green. +-- Config: spark.sql.ansi.enabled=true + +-- Comet returns NULL where Spark throws. +query ignore(https://github.com/apache/datafusion-comet/issues/4449) +SELECT next_day(date('2024-01-01'), 'NOT_A_DAY') + +query ignore(https://github.com/apache/datafusion-comet/issues/4449) +SELECT next_day(date('2024-01-01'), '')