chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1 by andygrove · Pull Request #4461 · apache/datafusion-comet

andygrove · 2026-05-27T22:08:35Z

Which issue does this PR close?

Closes #.

Rationale for this change

Following the same pattern as #4436 (any), #4437 (bit_and), and the earlier date/time audit, this PR audits every supported string expression in Comet against Spark 3.4.3, 3.5.8, and 4.0.1, records the findings inline in the support doc, and applies the support-level consistency fixes the audit surfaced.

What changes are included in this PR?

Support-doc audit notes

Add per-version audit sub-bullets to all 37 supported string SQL function names in docs/source/contributor-guide/spark_expressions_support.md. The notes record:

whether the Spark class is identical across 3.4.3 / 3.5.8 / 4.0.1
Spark 4.0 changes (collation widening via StringTypeWithCollation; CollationSupport routing for Upper/Lower/InitCap and the startswith/endswith/contains/instr/replace/translate/trim family; NullIntolerant -> nullIntolerant: Boolean refactor; BinaryPad rewrite for lpad/rpad)
known divergences between Comet and Spark, each linked to the tracking issue listed below

Support-level consistency fixes (in `strings.scala`)

CometInitCap: Incompatible(None) -> Incompatible(Some(reason)) via a shared private val; drop the dead convert override.
CometLength: extract the BinaryType reason into a shared private val so the getSupportLevel branch and getUnsupportedReasons stay in sync.
CometRegExpReplace: pass the existing reasons into Incompatible(Some(...)) / Unsupported(Some(...)) via shared private vals so EXPLAIN matches the compatibility-guide text.
CometStringLPad / CometStringRPad: share the two reason strings via a small PadReasons companion so getSupportLevel and getUnsupportedReasons cannot drift; align wording with backticks.
CometSubstring / CometLeft / CometRight: lift the literal-argument fallback out of convert and into getSupportLevel so the dispatcher handles it uniformly; replace the convert-time withInfo with a declared Unsupported reason via a shared private val.

Tracking issues filed for follow-up

Higher-risk findings that need a semantics decision or larger work are filed as separate issues and cross-referenced from the support doc:

[Bug] repeat throws on negative count where Spark returns empty string #4462 repeat throws on negative count where Spark returns empty string
[Bug] translate uses graphemes vs Spark code points and ignores U+0000 deletion #4463 translate uses graphemes vs Spark code points; ignores U+0000 deletion
[Bug] bit_length and octet_length error natively for BinaryType input instead of falling back #4464 bit_length and octet_length error natively for BinaryType input instead of falling back
[Bug] decode ignores Spark 4.0 legacyCharsets and legacyErrorAction flags #4465 decode ignores Spark 4.0 legacyCharsets and legacyErrorAction flags
[Doc] decode does not appear in auto-generated compatibility docs #4466 decode does not appear in auto-generated compatibility docs
[Bug] CometCaseConversionBase gates compat inside convert() instead of getSupportLevel #4467 CometCaseConversionBase gates compat inside convert() instead of getSupportLevel

The umbrella Spark 4.0 collation issue is the existing #2190, which the lower/upper sub-bullets reference.

The audit skill update (require issues for prose-only findings, add Spark 4.1.1 to the version list) is split out into #4468.

Audit process

The audit was driven by 7 parallel agents using the audit-comet-expression skill, each handling a related group (case conversion + initcap; length family; trim; left/right/substring/substring_index; lpad/rpad; simple scalar funcs; complex serdes including concat_ws/regexp_replace/repeat/split/decode).

How are these changes tested?

./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/string/" -Dtest=none (42 tests pass)
./mvnw test -Dsuites="org.apache.comet.CometStringExpressionSuite" -Dtest=none (33 tests pass)
Updated the affected fallback-reason assertions in left.sql, string_lpad.sql, string_rpad.sql, and CometStringExpressionSuite.scala to match the new shared reason strings.

Add per-version audit sub-bullets to every supported string expression in spark_expressions_support.md using the audit-comet-expression skill. Covers 37 SQL function names backed by the serdes in spark/src/main/scala/org/apache/comet/serde/strings.scala plus the CometScalarFunction wirings in QueryPlanSerde.scala. For each function, the sub-bullets record: - Whether the Spark class is identical across 3.4.3, 3.5.8, 4.0.1 - Spark 4.0 changes (collation widening via StringTypeWithCollation, CollationSupport routing for Upper/Lower/InitCap and the startswith/endswith/contains/instr/replace/translate/trim family, NullIntolerant -> nullIntolerant field refactor, BinaryPad rewrite for lpad/rpad) - Known divergences between Comet and Spark (initcap hyphen handling, replace with empty search, repeat with negative count, decode charset/legacy-flag gaps, BinaryType wiring gaps in bit_length / octet_length, regexp_replace pos > 1 restriction) The audit was driven by 7 parallel agents, each handling a related group (case conversion + initcap; length family; trim; left/right/ substring/substring_index; lpad/rpad; simple scalar funcs; complex serdes including concat_ws/regexp_replace/repeat/split/decode). Apply support-level consistency fixes surfaced by the audit: - CometInitCap: Incompatible(None) -> Incompatible(Some(reason)) via a shared private val; drop the dead convert override. - CometLength: extract the BinaryType reason into a shared private val so the getSupportLevel branch and getUnsupportedReasons stay in sync. - CometRegExpReplace: pass the existing reasons into Incompatible(Some(...)) and Unsupported(Some(...)) via shared private vals so EXPLAIN matches the compatibility-guide text. - CometStringLPad / CometStringRPad: share the two reason strings via a small PadReasons companion so getSupportLevel and getUnsupportedReasons cannot drift; align wording with backticks. - CometSubstring / CometLeft / CometRight: lift the literal-argument fallback out of convert and into getSupportLevel so the dispatcher handles it uniformly, and replace the convert-time withInfo with a declared Unsupported reason via a shared private val. Update the affected tests for the new reason strings: - spark/src/test/resources/sql-tests/expressions/string/left.sql - spark/src/test/resources/sql-tests/expressions/string/string_lpad.sql - spark/src/test/resources/sql-tests/expressions/string/string_rpad.sql - spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala

… doc Link the high-priority findings from the string-expressions audit to the tracking issues filed alongside the PR: - repeat negative count divergence -> apache#4462 - translate grapheme vs codepoint and U+0000 deletion -> apache#4463 - bit_length/octet_length BinaryType native error -> apache#4464 - decode Spark 4.0 legacy flags ignored -> apache#4465 - decode missing from auto-generated compatibility docs -> apache#4466 - case-conversion gating antipattern -> apache#4467 - generic Spark 4.0 collation gap on lower/upper -> apache#2190 (existing) Also remove a stray NUL byte in the chr entry that came from a heredoc roundtrip and replace it with the literal escape \u0000.

The audit-comet-expression skill previously permitted leaving semantics-decision findings as prose recommendations in the PR description, on the assumption the reviewer would pick them up. In practice that note dies with the PR. Tighten Step 6 and Step 7 so that every high-priority finding either becomes an inline fix + test, or a filed GitHub issue + ignored regression test, before the audit PR is opened. Add a dedicated 'Findings that need follow-up' subsection spelling out the workflow (search, file, cross-reference from the support-doc sub-bullet and the PR description). Surfaced while re-running the string-expressions audit in apache#4461: several higher-risk findings (CometCaseConversionBase compat gating, StringRepeat negative-count divergence, translate grapheme semantics, bit_length/octet_length BinaryType, decode legacy flags) were left as prose only and had to be filed retroactively as apache#4462-apache#4467.

andygrove · 2026-05-28T16:52:49Z

 - [x] ascii
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `StringType -> IntegerType`; `nullSafeEval` returns `codePointAt(0)` of the first char, or `0` for the empty string. Wired via `CometScalarFunction("ascii")` and resolved to DataFusion `ascii` (`chars().next() as i32`); first-code-point semantics match for ASCII, BMP, and supplementary code points.
+  - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; behaviour unchanged for `UTF8_BINARY`. Comet does not propagate collation, so non-default collations may diverge silently.


Does this documenting, or need an issue?

andygrove · 2026-05-28T16:53:32Z

+  - Known limitation: wired as a raw `CometScalarFunction("bit_length")` with no `BinaryType` guard. DataFusion's `BitLengthFunc` signature only accepts string types, so `bit_length(<binary>)` execute-fails on the native side instead of falling back cleanly (https://github.com/apache/datafusion-comet/issues/4464).
 - [x] btrim
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `StringTrimBoth` is `RuntimeReplaceable` and rewritten to `StringTrim(srcStr, trimStr)` before serde runs, so the explicit `CometScalarFunction("btrim")` mapping is unreachable.


Should we remove the unreachable mapping?

andygrove · 2026-05-28T16:54:03Z

 - [x] btrim
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `StringTrimBoth` is `RuntimeReplaceable` and rewritten to `StringTrim(srcStr, trimStr)` before serde runs, so the explicit `CometScalarFunction("btrim")` mapping is unreachable.
+  - Spark 4.0.1 (audited 2026-05-27): `StringTrim` (the rewrite target) routes through `CollationSupport.StringTrim.exec` and uses `StringTypeNonCSAICollation(supportsTrimCollation = true)`; semantics unchanged for `UTF8_BINARY`. Non-default collations may diverge in Comet.


Do we need to document this collation difference?

I filed #4496 for doing a follow on audit focussing on collation issues once this PR is merged

andygrove · 2026-05-28T16:59:05Z

+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `RuntimeReplaceable` with `replacement = Substring(str, Literal(1), len)`; accepts `StringType` or `BinaryType` plus `IntegerType`. Comet serde rewrites to a `Substring` proto with `start=1, len=lenValue`, requires `len` to be a `Literal`, and falls back otherwise.
+  - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened with `StringTypeWithCollation`; behaviour unchanged for `UTF8_BINARY`.
+  - Known limitation: the literal-only `len` restriction is enforced inside `convert` via `withInfo` rather than declared in `getSupportLevel`, so EXPLAIN surfaces it only at conversion time.


We should fix this

andygrove · 2026-05-28T17:00:24Z

 - [ ] locate
 - [x] lower
+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. JVM default-locale `toLowerCase` on `UTF8String`. Comet routes to DataFusion `lower` (Rust Unicode default case mapping, no locale awareness) and is gated off by default via `spark.comet.caseConversion.enabled=false`. The gating lives in `convert(...)` rather than `getSupportLevel`, which bypasses the standard `allowIncompatible` machinery (https://github.com/apache/datafusion-comet/issues/4467).


Maybe we should change this to use the standard incompatible approach rather than use a separate config?

andygrove · 2026-05-28T17:06:36Z

+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `StringRepeat(str, times)` with `nullSafeEval(s, n) = s.repeat(n)`; `UTF8String.repeat` returns the empty string for `n <= 0`. Comet casts `times` to `LongType` and delegates to DataFusion `repeat`.
+  - Spark 4.0.1 (audited 2026-05-27): adds `nullIntolerant: Boolean` field; `dataType` becomes `str.dataType` (collation-tracking). Semantics unchanged for `UTF8_BINARY`.
+  - Known divergence: DataFusion `repeat` throws on negative counts instead of returning the empty string Spark produces. Currently surfaced via `getCompatibleNotes` only (https://github.com/apache/datafusion-comet/issues/4462).


This was a hallucination - this issue did exist once, but is already fixed. I closed the issue.

andygrove · 2026-05-28T17:07:03Z

+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `StringReplace(src, search, replace)`; when `search` is empty, Spark returns `src` unchanged (short-circuit on `search.numBytes == 0`).
+  - Spark 4.0.1 (audited 2026-05-27): routes through `CollationSupport.StringReplace.exec`; semantics unchanged for `UTF8_BINARY`.
+  - Known divergence: DataFusion `replace` inserts `to` between every character (and at both ends) when `search` is empty (https://github.com/apache/datafusion-comet/issues/3344). Currently the support level is `Compatible`.


We should mark this as incompatible

andygrove · 2026-05-28T17:07:24Z

+  - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
+  - Spark 3.5.8 (audited 2026-05-27): baseline. `RuntimeReplaceable` with `replacement = If(IsNull(str), null, If(len <= 0, "", Substring(str, -len, len)))`; accepts `StringType` plus `IntegerType`. Comet serde rewrites positive `len` to a `Substring` proto with `start=-len, len=len`; for `len <= 0` it builds an `If(IsNull(str), null, "")` proto chain to preserve NULL propagation.
+  - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened with collation; uses `UnaryMinus(len, failOnError = false)` to avoid integer-overflow exceptions on `len = Int.MinValue`. Semantics unchanged for `UTF8_BINARY`.
+  - Known limitation: the literal-only `len` restriction is enforced inside `convert` via `withInfo` rather than declared in `getSupportLevel`.


we should fix this

The string-expressions audit (PR apache#4461) revealed six recurring failure modes where the skill documented findings rather than acting on them. Strengthen the consistency checklist and auto-fix list to close these loopholes: - Add checklist item 10: expression-shape restrictions (literal-only argument, child data type, etc.) must be declared in `getSupportLevel`, not gated inside `convert` with `withInfo`. Cite `CometLeft` / `CometRight` / `CometSubstring` as the canonical example. - Add checklist item 11: Spark 4.0+ collation routing through `CollationSupport.X.exec` and `StringTypeWithCollation` means the expression is `Incompatible` for non-default collations. Link apache#4496 as the umbrella issue and reject "behaviour unchanged for `UTF8_BINARY`" as a justification for `Compatible`. - Add checklist item 12: a sub-bullet that says "Known divergence" or "Known limitation" on a `Compatible` branch is a smell. The skill must promote the support level rather than documenting the divergence in prose only. Cite the `replace` empty-search-string case. - Add checklist item 13: unreachable serde registrations (e.g. the `btrim` mapping for `StringTrimBoth`, which is rewritten by `RuntimeReplaceable` before serde runs) must be deleted, not catalogued. - Add an issue-verification step to the reason-wording guidance and the follow-up-issue workflow. Every cited issue must be opened with `gh issue view` to confirm it exists, is open, and matches the divergence before the URL ships in a reason string or support-doc sub-bullet. Add the matching auto-fix patterns to Step 7's "apply fixes automatically" list so future audits resolve these inline rather than filing them as prose follow-ups.

Apply the recurring patterns surfaced in PR review comments rather than documenting them as Known Limitation prose. - Remove the unreachable `CometScalarFunction("btrim")` serde mapping. `StringTrimBoth` is `RuntimeReplaceable` and rewritten to `StringTrim` before serde runs, so the entry was dead code. Updated the support-doc sub-bullet for `btrim` to point readers at the `trim` entry. - Rewrite `CometCaseConversionBase` to use the standard `Incompatible` opt-in instead of the bespoke `spark.comet.caseConversion.enabled` conf. Override `getSupportLevel` to return `Incompatible(Some(reason))` and drop the in-`convert` config check. Remove `COMET_CASE_CONVERSION_ENABLED` and update the two callers and the string-expression benchmark to use the per-expression `spark.comet.expression.{Upper,Lower,InitCap}.allowIncompatible` keys. Closes the gating asymmetry called out in apache#4467. - Mark `replace` as `Incompatible` when `search` is a literal empty string. DataFusion's `replace` inserts the replacement between every character whereas Spark short-circuits to return `src` unchanged. Refile under apache#4497 (apache#3344 was closed previously with the bug body describing the divergence in the wrong direction). - Drop the spurious `CometStringRepeat.getCompatibleNotes()` claim that DataFusion `repeat` throws on negative counts. Verified in `expressions/string/string_repeat.sql` that negative counts produce the empty string Spark expects. The audit's prose claim was wrong, so apache#4462 was closed as already-fixed. - Update the `replace` SQL fixture to assert the new fallback message rather than the prior `ignore(...)` shape. Update `lower.sql` / `upper.sql` `expect_fallback(...)` strings to match the new reason text. Switch `lower_enabled.sql` / `upper_enabled.sql` from `spark.comet.caseConversion.enabled` to the per-expression `allowIncompatible` keys. - Cross-reference the collation umbrella issue (apache#4496) from every support-doc sub-bullet that previously documented "non-default collations may diverge silently" without an issue link. Add the cross-reference to the entries that route through Spark 4.0 `CollationSupport.X.exec` but had no collation note at all (`bit_length`, `concat_ws`, `endswith`, `initcap`, `instr`, `lpad`, `octet_length`, `repeat`, `replace`, `right`, `rpad`, `rtrim`, `split`, `startswith`, `substring`, `substring_index`, `translate`). - Replace the stale "Known limitation: enforced inside `convert`" prose on `left`, `right`, `substring`, and `length`. The literal-only argument restriction has already been moved into `getSupportLevel`; the prose just hadn't been updated. Tests: - `./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/string/"` (42 pass) - `./mvnw test -Dsuites="org.apache.comet.CometStringExpressionSuite"` (33 pass)

`CometCaseConversionBase` no longer gates compat with the bespoke `spark.comet.caseConversion.enabled` conf inside `convert()`. The fallback now flows through the standard dispatcher path, so the `BroadcastHashJoin` COMET tag on `upper(ca_country)` switches from "Comet is not compatible with Spark for case conversion in locale-specific cases. Set spark.comet.caseConversion.enabled=true to enable it anyway." to the dispatcher's per-expression Incompatible message, including the `spark.comet.expression.Upper.allowIncompatible=true` opt-in and the compat-guide link. The plan shape is unchanged; only the two annotation lines in each affected golden differ.

andygrove mentioned this pull request May 27, 2026

docs: require audit skill to file issues and add Spark 4.1.1 to version list #4468

Open

andygrove force-pushed the worktree-audit-string-funcs branch from aa20e8f to d994671 Compare May 27, 2026 22:24

andygrove commented May 28, 2026

View reviewed changes

andygrove mentioned this pull request May 28, 2026

Do another audit sweep for string collation differences #4496

Open

andygrove commented May 28, 2026

View reviewed changes

andygrove mentioned this pull request May 28, 2026

[Bug] replace returns wrong result for empty-string search #4497

Open

andygrove added this to the 0.17.0 milestone May 28, 2026

andygrove added 2 commits May 28, 2026 12:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1#4461

chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1#4461
andygrove wants to merge 4 commits into
apache:mainfrom
andygrove:worktree-audit-string-funcs

andygrove commented May 27, 2026 •

edited

Loading

Uh oh!

andygrove May 28, 2026

Uh oh!

andygrove May 28, 2026

Uh oh!

andygrove May 28, 2026

Uh oh!

andygrove May 28, 2026

Uh oh!

andygrove May 28, 2026

Uh oh!

andygrove May 28, 2026

Uh oh!

andygrove May 28, 2026

Uh oh!

andygrove May 28, 2026

Uh oh!

andygrove May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andygrove commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Support-doc audit notes

Support-level consistency fixes (in strings.scala)

Tracking issues filed for follow-up

Audit process

How are these changes tested?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

andygrove commented May 27, 2026 •

edited

Loading

Support-level consistency fixes (in `strings.scala`)