chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1#4461
Open
andygrove wants to merge 4 commits into
Open
chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1#4461andygrove wants to merge 4 commits into
andygrove wants to merge 4 commits into
Conversation
Add per-version audit sub-bullets to every supported string expression in spark_expressions_support.md using the audit-comet-expression skill. Covers 37 SQL function names backed by the serdes in spark/src/main/scala/org/apache/comet/serde/strings.scala plus the CometScalarFunction wirings in QueryPlanSerde.scala. For each function, the sub-bullets record: - Whether the Spark class is identical across 3.4.3, 3.5.8, 4.0.1 - Spark 4.0 changes (collation widening via StringTypeWithCollation, CollationSupport routing for Upper/Lower/InitCap and the startswith/endswith/contains/instr/replace/translate/trim family, NullIntolerant -> nullIntolerant field refactor, BinaryPad rewrite for lpad/rpad) - Known divergences between Comet and Spark (initcap hyphen handling, replace with empty search, repeat with negative count, decode charset/legacy-flag gaps, BinaryType wiring gaps in bit_length / octet_length, regexp_replace pos > 1 restriction) The audit was driven by 7 parallel agents, each handling a related group (case conversion + initcap; length family; trim; left/right/ substring/substring_index; lpad/rpad; simple scalar funcs; complex serdes including concat_ws/regexp_replace/repeat/split/decode). Apply support-level consistency fixes surfaced by the audit: - CometInitCap: Incompatible(None) -> Incompatible(Some(reason)) via a shared private val; drop the dead convert override. - CometLength: extract the BinaryType reason into a shared private val so the getSupportLevel branch and getUnsupportedReasons stay in sync. - CometRegExpReplace: pass the existing reasons into Incompatible(Some(...)) and Unsupported(Some(...)) via shared private vals so EXPLAIN matches the compatibility-guide text. - CometStringLPad / CometStringRPad: share the two reason strings via a small PadReasons companion so getSupportLevel and getUnsupportedReasons cannot drift; align wording with backticks. - CometSubstring / CometLeft / CometRight: lift the literal-argument fallback out of convert and into getSupportLevel so the dispatcher handles it uniformly, and replace the convert-time withInfo with a declared Unsupported reason via a shared private val. Update the affected tests for the new reason strings: - spark/src/test/resources/sql-tests/expressions/string/left.sql - spark/src/test/resources/sql-tests/expressions/string/string_lpad.sql - spark/src/test/resources/sql-tests/expressions/string/string_rpad.sql - spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala
This was referenced May 27, 2026
Open
… doc Link the high-priority findings from the string-expressions audit to the tracking issues filed alongside the PR: - repeat negative count divergence -> apache#4462 - translate grapheme vs codepoint and U+0000 deletion -> apache#4463 - bit_length/octet_length BinaryType native error -> apache#4464 - decode Spark 4.0 legacy flags ignored -> apache#4465 - decode missing from auto-generated compatibility docs -> apache#4466 - case-conversion gating antipattern -> apache#4467 - generic Spark 4.0 collation gap on lower/upper -> apache#2190 (existing) Also remove a stray NUL byte in the chr entry that came from a heredoc roundtrip and replace it with the literal escape \u0000.
andygrove
added a commit
to andygrove/datafusion-comet
that referenced
this pull request
May 27, 2026
The audit-comet-expression skill previously permitted leaving semantics-decision findings as prose recommendations in the PR description, on the assumption the reviewer would pick them up. In practice that note dies with the PR. Tighten Step 6 and Step 7 so that every high-priority finding either becomes an inline fix + test, or a filed GitHub issue + ignored regression test, before the audit PR is opened. Add a dedicated 'Findings that need follow-up' subsection spelling out the workflow (search, file, cross-reference from the support-doc sub-bullet and the PR description). Surfaced while re-running the string-expressions audit in apache#4461: several higher-risk findings (CometCaseConversionBase compat gating, StringRepeat negative-count divergence, translate grapheme semantics, bit_length/octet_length BinaryType, decode legacy flags) were left as prose only and had to be filed retroactively as apache#4462-apache#4467.
aa20e8f to
d994671
Compare
This was referenced May 27, 2026
andygrove
commented
May 28, 2026
| - [x] ascii | ||
| - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. | ||
| - Spark 3.5.8 (audited 2026-05-27): baseline. `StringType -> IntegerType`; `nullSafeEval` returns `codePointAt(0)` of the first char, or `0` for the empty string. Wired via `CometScalarFunction("ascii")` and resolved to DataFusion `ascii` (`chars().next() as i32`); first-code-point semantics match for ASCII, BMP, and supplementary code points. | ||
| - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; behaviour unchanged for `UTF8_BINARY`. Comet does not propagate collation, so non-default collations may diverge silently. |
Member
Author
There was a problem hiding this comment.
Does this documenting, or need an issue?
andygrove
commented
May 28, 2026
| - Known limitation: wired as a raw `CometScalarFunction("bit_length")` with no `BinaryType` guard. DataFusion's `BitLengthFunc` signature only accepts string types, so `bit_length(<binary>)` execute-fails on the native side instead of falling back cleanly (https://github.com/apache/datafusion-comet/issues/4464). | ||
| - [x] btrim | ||
| - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. | ||
| - Spark 3.5.8 (audited 2026-05-27): baseline. `StringTrimBoth` is `RuntimeReplaceable` and rewritten to `StringTrim(srcStr, trimStr)` before serde runs, so the explicit `CometScalarFunction("btrim")` mapping is unreachable. |
Member
Author
There was a problem hiding this comment.
Should we remove the unreachable mapping?
andygrove
commented
May 28, 2026
| - [x] btrim | ||
| - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. | ||
| - Spark 3.5.8 (audited 2026-05-27): baseline. `StringTrimBoth` is `RuntimeReplaceable` and rewritten to `StringTrim(srcStr, trimStr)` before serde runs, so the explicit `CometScalarFunction("btrim")` mapping is unreachable. | ||
| - Spark 4.0.1 (audited 2026-05-27): `StringTrim` (the rewrite target) routes through `CollationSupport.StringTrim.exec` and uses `StringTypeNonCSAICollation(supportsTrimCollation = true)`; semantics unchanged for `UTF8_BINARY`. Non-default collations may diverge in Comet. |
Member
Author
There was a problem hiding this comment.
Do we need to document this collation difference?
Member
Author
There was a problem hiding this comment.
I filed #4496 for doing a follow on audit focussing on collation issues once this PR is merged
andygrove
commented
May 28, 2026
| - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. | ||
| - Spark 3.5.8 (audited 2026-05-27): baseline. `RuntimeReplaceable` with `replacement = Substring(str, Literal(1), len)`; accepts `StringType` or `BinaryType` plus `IntegerType`. Comet serde rewrites to a `Substring` proto with `start=1, len=lenValue`, requires `len` to be a `Literal`, and falls back otherwise. | ||
| - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened with `StringTypeWithCollation`; behaviour unchanged for `UTF8_BINARY`. | ||
| - Known limitation: the literal-only `len` restriction is enforced inside `convert` via `withInfo` rather than declared in `getSupportLevel`, so EXPLAIN surfaces it only at conversion time. |
andygrove
commented
May 28, 2026
| - [ ] locate | ||
| - [x] lower | ||
| - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. | ||
| - Spark 3.5.8 (audited 2026-05-27): baseline. JVM default-locale `toLowerCase` on `UTF8String`. Comet routes to DataFusion `lower` (Rust Unicode default case mapping, no locale awareness) and is gated off by default via `spark.comet.caseConversion.enabled=false`. The gating lives in `convert(...)` rather than `getSupportLevel`, which bypasses the standard `allowIncompatible` machinery (https://github.com/apache/datafusion-comet/issues/4467). |
Member
Author
There was a problem hiding this comment.
Maybe we should change this to use the standard incompatible approach rather than use a separate config?
andygrove
commented
May 28, 2026
| - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. | ||
| - Spark 3.5.8 (audited 2026-05-27): baseline. `StringRepeat(str, times)` with `nullSafeEval(s, n) = s.repeat(n)`; `UTF8String.repeat` returns the empty string for `n <= 0`. Comet casts `times` to `LongType` and delegates to DataFusion `repeat`. | ||
| - Spark 4.0.1 (audited 2026-05-27): adds `nullIntolerant: Boolean` field; `dataType` becomes `str.dataType` (collation-tracking). Semantics unchanged for `UTF8_BINARY`. | ||
| - Known divergence: DataFusion `repeat` throws on negative counts instead of returning the empty string Spark produces. Currently surfaced via `getCompatibleNotes` only (https://github.com/apache/datafusion-comet/issues/4462). |
Member
Author
There was a problem hiding this comment.
This was a hallucination - this issue did exist once, but is already fixed. I closed the issue.
andygrove
commented
May 28, 2026
| - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. | ||
| - Spark 3.5.8 (audited 2026-05-27): baseline. `StringReplace(src, search, replace)`; when `search` is empty, Spark returns `src` unchanged (short-circuit on `search.numBytes == 0`). | ||
| - Spark 4.0.1 (audited 2026-05-27): routes through `CollationSupport.StringReplace.exec`; semantics unchanged for `UTF8_BINARY`. | ||
| - Known divergence: DataFusion `replace` inserts `to` between every character (and at both ends) when `search` is empty (https://github.com/apache/datafusion-comet/issues/3344). Currently the support level is `Compatible`. |
Member
Author
There was a problem hiding this comment.
We should mark this as incompatible
andygrove
commented
May 28, 2026
| - Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8. | ||
| - Spark 3.5.8 (audited 2026-05-27): baseline. `RuntimeReplaceable` with `replacement = If(IsNull(str), null, If(len <= 0, "", Substring(str, -len, len)))`; accepts `StringType` plus `IntegerType`. Comet serde rewrites positive `len` to a `Substring` proto with `start=-len, len=len`; for `len <= 0` it builds an `If(IsNull(str), null, "")` proto chain to preserve NULL propagation. | ||
| - Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened with collation; uses `UnaryMinus(len, failOnError = false)` to avoid integer-overflow exceptions on `len = Int.MinValue`. Semantics unchanged for `UTF8_BINARY`. | ||
| - Known limitation: the literal-only `len` restriction is enforced inside `convert` via `withInfo` rather than declared in `getSupportLevel`. |
andygrove
added a commit
to andygrove/datafusion-comet
that referenced
this pull request
May 28, 2026
The string-expressions audit (PR apache#4461) revealed six recurring failure modes where the skill documented findings rather than acting on them. Strengthen the consistency checklist and auto-fix list to close these loopholes: - Add checklist item 10: expression-shape restrictions (literal-only argument, child data type, etc.) must be declared in `getSupportLevel`, not gated inside `convert` with `withInfo`. Cite `CometLeft` / `CometRight` / `CometSubstring` as the canonical example. - Add checklist item 11: Spark 4.0+ collation routing through `CollationSupport.X.exec` and `StringTypeWithCollation` means the expression is `Incompatible` for non-default collations. Link apache#4496 as the umbrella issue and reject "behaviour unchanged for `UTF8_BINARY`" as a justification for `Compatible`. - Add checklist item 12: a sub-bullet that says "Known divergence" or "Known limitation" on a `Compatible` branch is a smell. The skill must promote the support level rather than documenting the divergence in prose only. Cite the `replace` empty-search-string case. - Add checklist item 13: unreachable serde registrations (e.g. the `btrim` mapping for `StringTrimBoth`, which is rewritten by `RuntimeReplaceable` before serde runs) must be deleted, not catalogued. - Add an issue-verification step to the reason-wording guidance and the follow-up-issue workflow. Every cited issue must be opened with `gh issue view` to confirm it exists, is open, and matches the divergence before the URL ships in a reason string or support-doc sub-bullet. Add the matching auto-fix patterns to Step 7's "apply fixes automatically" list so future audits resolve these inline rather than filing them as prose follow-ups.
Apply the recurring patterns surfaced in PR review comments rather than
documenting them as Known Limitation prose.
- Remove the unreachable `CometScalarFunction("btrim")` serde mapping.
`StringTrimBoth` is `RuntimeReplaceable` and rewritten to `StringTrim`
before serde runs, so the entry was dead code. Updated the support-doc
sub-bullet for `btrim` to point readers at the `trim` entry.
- Rewrite `CometCaseConversionBase` to use the standard `Incompatible`
opt-in instead of the bespoke `spark.comet.caseConversion.enabled`
conf. Override `getSupportLevel` to return
`Incompatible(Some(reason))` and drop the in-`convert` config check.
Remove `COMET_CASE_CONVERSION_ENABLED` and update the two callers and
the string-expression benchmark to use the per-expression
`spark.comet.expression.{Upper,Lower,InitCap}.allowIncompatible`
keys. Closes the gating asymmetry called out in apache#4467.
- Mark `replace` as `Incompatible` when `search` is a literal empty
string. DataFusion's `replace` inserts the replacement between every
character whereas Spark short-circuits to return `src` unchanged.
Refile under apache#4497 (apache#3344 was closed previously with the bug body
describing the divergence in the wrong direction).
- Drop the spurious `CometStringRepeat.getCompatibleNotes()` claim that
DataFusion `repeat` throws on negative counts. Verified in
`expressions/string/string_repeat.sql` that negative counts produce
the empty string Spark expects. The audit's prose claim was wrong, so
apache#4462 was closed as already-fixed.
- Update the `replace` SQL fixture to assert the new fallback message
rather than the prior `ignore(...)` shape. Update `lower.sql` /
`upper.sql` `expect_fallback(...)` strings to match the new reason
text. Switch `lower_enabled.sql` / `upper_enabled.sql` from
`spark.comet.caseConversion.enabled` to the per-expression
`allowIncompatible` keys.
- Cross-reference the collation umbrella issue
(apache#4496) from every support-doc sub-bullet that previously documented
"non-default collations may diverge silently" without an issue link.
Add the cross-reference to the entries that route through Spark 4.0
`CollationSupport.X.exec` but had no collation note at all
(`bit_length`, `concat_ws`, `endswith`, `initcap`, `instr`, `lpad`,
`octet_length`, `repeat`, `replace`, `right`, `rpad`, `rtrim`,
`split`, `startswith`, `substring`, `substring_index`, `translate`).
- Replace the stale "Known limitation: enforced inside `convert`" prose
on `left`, `right`, `substring`, and `length`. The literal-only
argument restriction has already been moved into `getSupportLevel`;
the prose just hadn't been updated.
Tests:
- `./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/string/"` (42 pass)
- `./mvnw test -Dsuites="org.apache.comet.CometStringExpressionSuite"` (33 pass)
`CometCaseConversionBase` no longer gates compat with the bespoke `spark.comet.caseConversion.enabled` conf inside `convert()`. The fallback now flows through the standard dispatcher path, so the `BroadcastHashJoin` COMET tag on `upper(ca_country)` switches from "Comet is not compatible with Spark for case conversion in locale-specific cases. Set spark.comet.caseConversion.enabled=true to enable it anyway." to the dispatcher's per-expression Incompatible message, including the `spark.comet.expression.Upper.allowIncompatible=true` opt-in and the compat-guide link. The plan shape is unchanged; only the two annotation lines in each affected golden differ.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #.
Rationale for this change
Following the same pattern as #4436 (
any), #4437 (bit_and), and the earlier date/time audit, this PR audits every supported string expression in Comet against Spark 3.4.3, 3.5.8, and 4.0.1, records the findings inline in the support doc, and applies the support-level consistency fixes the audit surfaced.What changes are included in this PR?
Support-doc audit notes
Add per-version audit sub-bullets to all 37 supported string SQL function names in
docs/source/contributor-guide/spark_expressions_support.md. The notes record:StringTypeWithCollation;CollationSupportrouting forUpper/Lower/InitCapand thestartswith/endswith/contains/instr/replace/translate/trimfamily;NullIntolerant->nullIntolerant: Booleanrefactor;BinaryPadrewrite forlpad/rpad)Support-level consistency fixes (in
strings.scala)CometInitCap:Incompatible(None)->Incompatible(Some(reason))via a sharedprivate val; drop the deadconvertoverride.CometLength: extract theBinaryTypereason into a sharedprivate valso thegetSupportLevelbranch andgetUnsupportedReasonsstay in sync.CometRegExpReplace: pass the existing reasons intoIncompatible(Some(...))/Unsupported(Some(...))via sharedprivate vals so EXPLAIN matches the compatibility-guide text.CometStringLPad/CometStringRPad: share the two reason strings via a smallPadReasonscompanion sogetSupportLevelandgetUnsupportedReasonscannot drift; align wording with backticks.CometSubstring/CometLeft/CometRight: lift the literal-argument fallback out ofconvertand intogetSupportLevelso the dispatcher handles it uniformly; replace the convert-timewithInfowith a declaredUnsupportedreason via a sharedprivate val.Tracking issues filed for follow-up
Higher-risk findings that need a semantics decision or larger work are filed as separate issues and cross-referenced from the support doc:
repeatthrows on negative count where Spark returns empty stringtranslateuses graphemes vs Spark code points; ignores U+0000 deletionbit_lengthandoctet_lengtherror natively forBinaryTypeinput instead of falling backdecodeignores Spark 4.0legacyCharsetsandlegacyErrorActionflagsdecodedoes not appear in auto-generated compatibility docsCometCaseConversionBasegates compat insideconvert()instead ofgetSupportLevelThe umbrella Spark 4.0 collation issue is the existing #2190, which the
lower/uppersub-bullets reference.The audit skill update (require issues for prose-only findings, add Spark 4.1.1 to the version list) is split out into #4468.
Audit process
The audit was driven by 7 parallel agents using the
audit-comet-expressionskill, each handling a related group (case conversion + initcap; length family; trim; left/right/substring/substring_index; lpad/rpad; simple scalar funcs; complex serdes including concat_ws/regexp_replace/repeat/split/decode).How are these changes tested?
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/string/" -Dtest=none(42 tests pass)./mvnw test -Dsuites="org.apache.comet.CometStringExpressionSuite" -Dtest=none(33 tests pass)left.sql,string_lpad.sql,string_rpad.sql, andCometStringExpressionSuite.scalato match the new shared reason strings.