Skip to content

chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1#4461

Open
andygrove wants to merge 4 commits into
apache:mainfrom
andygrove:worktree-audit-string-funcs
Open

chore(audit): audit string expressions across Spark 3.4.3, 3.5.8, 4.0.1#4461
andygrove wants to merge 4 commits into
apache:mainfrom
andygrove:worktree-audit-string-funcs

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented May 27, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

Following the same pattern as #4436 (any), #4437 (bit_and), and the earlier date/time audit, this PR audits every supported string expression in Comet against Spark 3.4.3, 3.5.8, and 4.0.1, records the findings inline in the support doc, and applies the support-level consistency fixes the audit surfaced.

What changes are included in this PR?

Support-doc audit notes

Add per-version audit sub-bullets to all 37 supported string SQL function names in docs/source/contributor-guide/spark_expressions_support.md. The notes record:

  • whether the Spark class is identical across 3.4.3 / 3.5.8 / 4.0.1
  • Spark 4.0 changes (collation widening via StringTypeWithCollation; CollationSupport routing for Upper/Lower/InitCap and the startswith/endswith/contains/instr/replace/translate/trim family; NullIntolerant -> nullIntolerant: Boolean refactor; BinaryPad rewrite for lpad/rpad)
  • known divergences between Comet and Spark, each linked to the tracking issue listed below

Support-level consistency fixes (in strings.scala)

  • CometInitCap: Incompatible(None) -> Incompatible(Some(reason)) via a shared private val; drop the dead convert override.
  • CometLength: extract the BinaryType reason into a shared private val so the getSupportLevel branch and getUnsupportedReasons stay in sync.
  • CometRegExpReplace: pass the existing reasons into Incompatible(Some(...)) / Unsupported(Some(...)) via shared private vals so EXPLAIN matches the compatibility-guide text.
  • CometStringLPad / CometStringRPad: share the two reason strings via a small PadReasons companion so getSupportLevel and getUnsupportedReasons cannot drift; align wording with backticks.
  • CometSubstring / CometLeft / CometRight: lift the literal-argument fallback out of convert and into getSupportLevel so the dispatcher handles it uniformly; replace the convert-time withInfo with a declared Unsupported reason via a shared private val.

Tracking issues filed for follow-up

Higher-risk findings that need a semantics decision or larger work are filed as separate issues and cross-referenced from the support doc:

The umbrella Spark 4.0 collation issue is the existing #2190, which the lower/upper sub-bullets reference.

The audit skill update (require issues for prose-only findings, add Spark 4.1.1 to the version list) is split out into #4468.

Audit process

The audit was driven by 7 parallel agents using the audit-comet-expression skill, each handling a related group (case conversion + initcap; length family; trim; left/right/substring/substring_index; lpad/rpad; simple scalar funcs; complex serdes including concat_ws/regexp_replace/repeat/split/decode).

How are these changes tested?

  • ./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/string/" -Dtest=none (42 tests pass)
  • ./mvnw test -Dsuites="org.apache.comet.CometStringExpressionSuite" -Dtest=none (33 tests pass)
  • Updated the affected fallback-reason assertions in left.sql, string_lpad.sql, string_rpad.sql, and CometStringExpressionSuite.scala to match the new shared reason strings.

Add per-version audit sub-bullets to every supported string expression in
spark_expressions_support.md using the audit-comet-expression skill.
Covers 37 SQL function names backed by the serdes in
spark/src/main/scala/org/apache/comet/serde/strings.scala plus the
CometScalarFunction wirings in QueryPlanSerde.scala.

For each function, the sub-bullets record:
- Whether the Spark class is identical across 3.4.3, 3.5.8, 4.0.1
- Spark 4.0 changes (collation widening via StringTypeWithCollation,
  CollationSupport routing for Upper/Lower/InitCap and the
  startswith/endswith/contains/instr/replace/translate/trim family,
  NullIntolerant -> nullIntolerant field refactor, BinaryPad rewrite
  for lpad/rpad)
- Known divergences between Comet and Spark (initcap hyphen handling,
  replace with empty search, repeat with negative count, decode
  charset/legacy-flag gaps, BinaryType wiring gaps in bit_length /
  octet_length, regexp_replace pos > 1 restriction)

The audit was driven by 7 parallel agents, each handling a related
group (case conversion + initcap; length family; trim; left/right/
substring/substring_index; lpad/rpad; simple scalar funcs; complex
serdes including concat_ws/regexp_replace/repeat/split/decode).

Apply support-level consistency fixes surfaced by the audit:
- CometInitCap: Incompatible(None) -> Incompatible(Some(reason))
  via a shared private val; drop the dead convert override.
- CometLength: extract the BinaryType reason into a shared private val
  so the getSupportLevel branch and getUnsupportedReasons stay in sync.
- CometRegExpReplace: pass the existing reasons into
  Incompatible(Some(...)) and Unsupported(Some(...)) via shared
  private vals so EXPLAIN matches the compatibility-guide text.
- CometStringLPad / CometStringRPad: share the two reason strings via
  a small PadReasons companion so getSupportLevel and
  getUnsupportedReasons cannot drift; align wording with backticks.
- CometSubstring / CometLeft / CometRight: lift the literal-argument
  fallback out of convert and into getSupportLevel so the dispatcher
  handles it uniformly, and replace the convert-time withInfo with a
  declared Unsupported reason via a shared private val.

Update the affected tests for the new reason strings:
- spark/src/test/resources/sql-tests/expressions/string/left.sql
- spark/src/test/resources/sql-tests/expressions/string/string_lpad.sql
- spark/src/test/resources/sql-tests/expressions/string/string_rpad.sql
- spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala
… doc

Link the high-priority findings from the string-expressions audit to
the tracking issues filed alongside the PR:

- repeat negative count divergence -> apache#4462
- translate grapheme vs codepoint and U+0000 deletion -> apache#4463
- bit_length/octet_length BinaryType native error -> apache#4464
- decode Spark 4.0 legacy flags ignored -> apache#4465
- decode missing from auto-generated compatibility docs -> apache#4466
- case-conversion gating antipattern -> apache#4467
- generic Spark 4.0 collation gap on lower/upper -> apache#2190 (existing)

Also remove a stray NUL byte in the chr entry that came from a heredoc
roundtrip and replace it with the literal escape \u0000.
andygrove added a commit to andygrove/datafusion-comet that referenced this pull request May 27, 2026
The audit-comet-expression skill previously permitted leaving
semantics-decision findings as prose recommendations in the PR
description, on the assumption the reviewer would pick them up.
In practice that note dies with the PR.

Tighten Step 6 and Step 7 so that every high-priority finding either
becomes an inline fix + test, or a filed GitHub issue + ignored
regression test, before the audit PR is opened. Add a dedicated
'Findings that need follow-up' subsection spelling out the workflow
(search, file, cross-reference from the support-doc sub-bullet and
the PR description).

Surfaced while re-running the string-expressions audit in apache#4461:
several higher-risk findings (CometCaseConversionBase compat gating,
StringRepeat negative-count divergence, translate grapheme semantics,
bit_length/octet_length BinaryType, decode legacy flags) were left as
prose only and had to be filed retroactively as apache#4462-apache#4467.
@andygrove andygrove force-pushed the worktree-audit-string-funcs branch from aa20e8f to d994671 Compare May 27, 2026 22:24
- [x] ascii
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `StringType -> IntegerType`; `nullSafeEval` returns `codePointAt(0)` of the first char, or `0` for the empty string. Wired via `CometScalarFunction("ascii")` and resolved to DataFusion `ascii` (`chars().next() as i32`); first-code-point semantics match for ASCII, BMP, and supplementary code points.
- Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to `StringTypeWithCollation(supportsTrimCollation = true)`; behaviour unchanged for `UTF8_BINARY`. Comet does not propagate collation, so non-default collations may diverge silently.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this documenting, or need an issue?

- Known limitation: wired as a raw `CometScalarFunction("bit_length")` with no `BinaryType` guard. DataFusion's `BitLengthFunc` signature only accepts string types, so `bit_length(<binary>)` execute-fails on the native side instead of falling back cleanly (https://github.com/apache/datafusion-comet/issues/4464).
- [x] btrim
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `StringTrimBoth` is `RuntimeReplaceable` and rewritten to `StringTrim(srcStr, trimStr)` before serde runs, so the explicit `CometScalarFunction("btrim")` mapping is unreachable.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the unreachable mapping?

- [x] btrim
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `StringTrimBoth` is `RuntimeReplaceable` and rewritten to `StringTrim(srcStr, trimStr)` before serde runs, so the explicit `CometScalarFunction("btrim")` mapping is unreachable.
- Spark 4.0.1 (audited 2026-05-27): `StringTrim` (the rewrite target) routes through `CollationSupport.StringTrim.exec` and uses `StringTypeNonCSAICollation(supportsTrimCollation = true)`; semantics unchanged for `UTF8_BINARY`. Non-default collations may diverge in Comet.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to document this collation difference?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed #4496 for doing a follow on audit focussing on collation issues once this PR is merged

- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `RuntimeReplaceable` with `replacement = Substring(str, Literal(1), len)`; accepts `StringType` or `BinaryType` plus `IntegerType`. Comet serde rewrites to a `Substring` proto with `start=1, len=lenValue`, requires `len` to be a `Literal`, and falls back otherwise.
- Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened with `StringTypeWithCollation`; behaviour unchanged for `UTF8_BINARY`.
- Known limitation: the literal-only `len` restriction is enforced inside `convert` via `withInfo` rather than declared in `getSupportLevel`, so EXPLAIN surfaces it only at conversion time.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should fix this

- [ ] locate
- [x] lower
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. JVM default-locale `toLowerCase` on `UTF8String`. Comet routes to DataFusion `lower` (Rust Unicode default case mapping, no locale awareness) and is gated off by default via `spark.comet.caseConversion.enabled=false`. The gating lives in `convert(...)` rather than `getSupportLevel`, which bypasses the standard `allowIncompatible` machinery (https://github.com/apache/datafusion-comet/issues/4467).
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should change this to use the standard incompatible approach rather than use a separate config?

- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `StringRepeat(str, times)` with `nullSafeEval(s, n) = s.repeat(n)`; `UTF8String.repeat` returns the empty string for `n <= 0`. Comet casts `times` to `LongType` and delegates to DataFusion `repeat`.
- Spark 4.0.1 (audited 2026-05-27): adds `nullIntolerant: Boolean` field; `dataType` becomes `str.dataType` (collation-tracking). Semantics unchanged for `UTF8_BINARY`.
- Known divergence: DataFusion `repeat` throws on negative counts instead of returning the empty string Spark produces. Currently surfaced via `getCompatibleNotes` only (https://github.com/apache/datafusion-comet/issues/4462).
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a hallucination - this issue did exist once, but is already fixed. I closed the issue.

- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `StringReplace(src, search, replace)`; when `search` is empty, Spark returns `src` unchanged (short-circuit on `search.numBytes == 0`).
- Spark 4.0.1 (audited 2026-05-27): routes through `CollationSupport.StringReplace.exec`; semantics unchanged for `UTF8_BINARY`.
- Known divergence: DataFusion `replace` inserts `to` between every character (and at both ends) when `search` is empty (https://github.com/apache/datafusion-comet/issues/3344). Currently the support level is `Compatible`.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should mark this as incompatible

- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `RuntimeReplaceable` with `replacement = If(IsNull(str), null, If(len <= 0, "", Substring(str, -len, len)))`; accepts `StringType` plus `IntegerType`. Comet serde rewrites positive `len` to a `Substring` proto with `start=-len, len=len`; for `len <= 0` it builds an `If(IsNull(str), null, "")` proto chain to preserve NULL propagation.
- Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened with collation; uses `UnaryMinus(len, failOnError = false)` to avoid integer-overflow exceptions on `len = Int.MinValue`. Semantics unchanged for `UTF8_BINARY`.
- Known limitation: the literal-only `len` restriction is enforced inside `convert` via `withInfo` rather than declared in `getSupportLevel`.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should fix this

andygrove added a commit to andygrove/datafusion-comet that referenced this pull request May 28, 2026
The string-expressions audit (PR apache#4461) revealed six recurring
failure modes where the skill documented findings rather than acting
on them. Strengthen the consistency checklist and auto-fix list to
close these loopholes:

- Add checklist item 10: expression-shape restrictions (literal-only
  argument, child data type, etc.) must be declared in
  `getSupportLevel`, not gated inside `convert` with `withInfo`. Cite
  `CometLeft` / `CometRight` / `CometSubstring` as the canonical
  example.
- Add checklist item 11: Spark 4.0+ collation routing through
  `CollationSupport.X.exec` and `StringTypeWithCollation` means the
  expression is `Incompatible` for non-default collations. Link
  apache#4496 as the umbrella issue and reject "behaviour unchanged for
  `UTF8_BINARY`" as a justification for `Compatible`.
- Add checklist item 12: a sub-bullet that says "Known divergence" or
  "Known limitation" on a `Compatible` branch is a smell. The skill
  must promote the support level rather than documenting the
  divergence in prose only. Cite the `replace` empty-search-string
  case.
- Add checklist item 13: unreachable serde registrations (e.g. the
  `btrim` mapping for `StringTrimBoth`, which is rewritten by
  `RuntimeReplaceable` before serde runs) must be deleted, not
  catalogued.
- Add an issue-verification step to the reason-wording guidance and
  the follow-up-issue workflow. Every cited issue must be opened
  with `gh issue view` to confirm it exists, is open, and matches
  the divergence before the URL ships in a reason string or
  support-doc sub-bullet.

Add the matching auto-fix patterns to Step 7's "apply fixes
automatically" list so future audits resolve these inline rather than
filing them as prose follow-ups.
@andygrove andygrove added this to the 0.17.0 milestone May 28, 2026
andygrove added 2 commits May 28, 2026 12:10
Apply the recurring patterns surfaced in PR review comments rather than
documenting them as Known Limitation prose.

- Remove the unreachable `CometScalarFunction("btrim")` serde mapping.
  `StringTrimBoth` is `RuntimeReplaceable` and rewritten to `StringTrim`
  before serde runs, so the entry was dead code. Updated the support-doc
  sub-bullet for `btrim` to point readers at the `trim` entry.
- Rewrite `CometCaseConversionBase` to use the standard `Incompatible`
  opt-in instead of the bespoke `spark.comet.caseConversion.enabled`
  conf. Override `getSupportLevel` to return
  `Incompatible(Some(reason))` and drop the in-`convert` config check.
  Remove `COMET_CASE_CONVERSION_ENABLED` and update the two callers and
  the string-expression benchmark to use the per-expression
  `spark.comet.expression.{Upper,Lower,InitCap}.allowIncompatible`
  keys. Closes the gating asymmetry called out in apache#4467.
- Mark `replace` as `Incompatible` when `search` is a literal empty
  string. DataFusion's `replace` inserts the replacement between every
  character whereas Spark short-circuits to return `src` unchanged.
  Refile under apache#4497 (apache#3344 was closed previously with the bug body
  describing the divergence in the wrong direction).
- Drop the spurious `CometStringRepeat.getCompatibleNotes()` claim that
  DataFusion `repeat` throws on negative counts. Verified in
  `expressions/string/string_repeat.sql` that negative counts produce
  the empty string Spark expects. The audit's prose claim was wrong, so
  apache#4462 was closed as already-fixed.
- Update the `replace` SQL fixture to assert the new fallback message
  rather than the prior `ignore(...)` shape. Update `lower.sql` /
  `upper.sql` `expect_fallback(...)` strings to match the new reason
  text. Switch `lower_enabled.sql` / `upper_enabled.sql` from
  `spark.comet.caseConversion.enabled` to the per-expression
  `allowIncompatible` keys.
- Cross-reference the collation umbrella issue
  (apache#4496) from every support-doc sub-bullet that previously documented
  "non-default collations may diverge silently" without an issue link.
  Add the cross-reference to the entries that route through Spark 4.0
  `CollationSupport.X.exec` but had no collation note at all
  (`bit_length`, `concat_ws`, `endswith`, `initcap`, `instr`, `lpad`,
  `octet_length`, `repeat`, `replace`, `right`, `rpad`, `rtrim`,
  `split`, `startswith`, `substring`, `substring_index`, `translate`).
- Replace the stale "Known limitation: enforced inside `convert`" prose
  on `left`, `right`, `substring`, and `length`. The literal-only
  argument restriction has already been moved into `getSupportLevel`;
  the prose just hadn't been updated.

Tests:
- `./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/string/"` (42 pass)
- `./mvnw test -Dsuites="org.apache.comet.CometStringExpressionSuite"` (33 pass)
`CometCaseConversionBase` no longer gates compat with the bespoke
`spark.comet.caseConversion.enabled` conf inside `convert()`. The
fallback now flows through the standard dispatcher path, so the
`BroadcastHashJoin` COMET tag on `upper(ca_country)` switches from

  "Comet is not compatible with Spark for case conversion in
  locale-specific cases. Set spark.comet.caseConversion.enabled=true
  to enable it anyway."

to the dispatcher's per-expression Incompatible message, including
the `spark.comet.expression.Upper.allowIncompatible=true` opt-in and
the compat-guide link. The plan shape is unchanged; only the two
annotation lines in each affected golden differ.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant