Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
148 changes: 135 additions & 13 deletions .claude/skills/audit-comet-expression/SKILL.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
name: audit-comet-expression
description: Audit an existing Comet expression for correctness and test coverage. Studies the Spark implementation across versions 3.4.3, 3.5.8, and 4.0.1, reviews the Comet and DataFusion implementations, identifies missing test coverage, and offers to implement additional tests.
description: Audit an existing Comet expression for correctness and test coverage. Studies the Spark implementation across versions 3.4.3, 3.5.8, 4.0.1, and 4.1.1, reviews the Comet and DataFusion implementations, identifies missing test coverage, and offers to implement additional tests.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is 4.0.2 now?

argument-hint: <expression-name>
---

Expand All @@ -10,7 +10,7 @@ Audit the Comet implementation of the `$ARGUMENTS` expression for correctness an

This audit covers:

1. Spark implementation across versions 3.4.3, 3.5.8, and 4.0.1
1. Spark implementation across versions 3.4.3, 3.5.8, 4.0.1, and 4.1.1
2. Comet Scala serde implementation
3. Comet Rust / DataFusion implementation
4. Existing test coverage (Comet SQL Tests and Comet Scala Tests)
Expand All @@ -24,7 +24,7 @@ Clone specific Spark version tags (use shallow clones to avoid polluting the wor

```bash
set -eu -o pipefail
for tag in v3.4.3 v3.5.8 v4.0.1; do
for tag in v3.4.3 v3.5.8 v4.0.1 v4.1.1; do
dir="/tmp/spark-${tag}"
if [ ! -d "$dir" ]; then
git clone --depth 1 --branch "$tag" https://github.com/apache/spark.git "$dir"
Expand All @@ -37,7 +37,7 @@ done
Search the Catalyst SQL expressions source:

```bash
for tag in v3.4.3 v3.5.8 v4.0.1; do
for tag in v3.4.3 v3.5.8 v4.0.1 v4.1.1; do
dir="/tmp/spark-${tag}"
echo "=== $tag ==="
find "$dir/sql/catalyst/src/main/scala" -name "*.scala" | \
Expand All @@ -48,7 +48,7 @@ done
If the expression is not found in catalyst, also check core:

```bash
for tag in v3.4.3 v3.5.8 v4.0.1; do
for tag in v3.4.3 v3.5.8 v4.0.1 v4.1.1; do
dir="/tmp/spark-${tag}"
echo "=== $tag ==="
find "$dir/sql" -name "*.scala" | \
Expand All @@ -73,6 +73,7 @@ Produce a concise diff summary of what changed between:

- 3.4.3 → 3.5.8
- 3.5.8 → 4.0.1
- 4.0.1 → 4.1.1

Pay attention to:

Expand All @@ -87,7 +88,7 @@ Pay attention to:
## Step 2: Locate the Spark Tests

```bash
for tag in v3.4.3 v3.5.8 v4.0.1; do
for tag in v3.4.3 v3.5.8 v4.0.1 v4.1.1; do
dir="/tmp/spark-${tag}"
echo "=== $tag ==="
find "$dir/sql" -name "*.scala" -path "*/test/*" | \
Expand Down Expand Up @@ -208,6 +209,11 @@ and in EXPLAIN output, so they are user-facing.
- Use backticks around config keys, type names, and SQL identifiers.
- Link to a tracking GitHub issue for known incompatibilities so users can
follow progress: `(https://github.com/apache/datafusion-comet/issues/NNNN)`.
**Verify the issue exists and is open** before citing it
(`gh issue view <N> --repo apache/datafusion-comet`). Issue numbers
invented from context or recalled from memory are a recurring failure
mode: a stale link is worse than no link because the reader follows it
and finds nothing.
- Keep it concise. Single sentence is best.
- Do not write "Incompatible reason: ..." or "Unsupported because ...".
The doc generator adds the framing.
Expand Down Expand Up @@ -389,6 +395,49 @@ finding for Step 6.
read like internal implementation notes ("DataFusion probes the longer
side") or that mismatch their support level (an "Incompatible" reason
that says "X is not supported").
10. **Expression-shape restrictions live in `getSupportLevel`.** Any
restriction that is knowable from the expression alone (literal-only
arguments, unsupported child data type, foldable-only options, a
specific operator shape) must be declared as an
`Unsupported(Some(reason))` branch in `getSupportLevel`, not gated
inside `convert` with a `withInfo` + `return None`. Putting the
check in `convert` means EXPLAIN surfaces the reason only at
conversion time, the doc generator never sees it, and the
dispatcher cannot route around it. The literal-only `len`
restrictions on `CometLeft`, `CometRight`, and `CometSubstring`
are the canonical example of the in-`convert` pattern that this
skill forbids: lift them into `getSupportLevel`.
11. **Spark 4.0 collation divergences are flagged, not glossed over.**
If the Spark 4.0/4.1 implementation routes through
`CollationSupport.X.exec(..., collationId)` (or uses
`StringTypeWithCollation` / `StringTypeNonCSAICollation` for input
types) and the Comet path does not propagate collation, the
expression is `Incompatible` for non-default collations. Mark the
branch `Incompatible(Some(reason))` linking to the collation
umbrella issue
(https://github.com/apache/datafusion-comet/issues/4496) so the
follow-up sweep can find every site. "Behaviour unchanged for
`UTF8_BINARY`" alone is not a justification for leaving the
support level at `Compatible`: users running with non-default
collations get silently wrong answers.
12. **Known divergences flip the support level.** If you find yourself
writing the words "Known divergence" or "Known limitation" in the
support-doc sub-bullet while leaving `getSupportLevel` returning
`Compatible`, the audit has skipped its job. A documented
divergence is by definition not `Compatible`. Promote the branch
to `Incompatible(Some(reason))` (or `Unsupported` if the native
path errors rather than producing a wrong answer) and link the
tracking issue. The `replace` empty-search-string divergence with
DataFusion is the canonical example of this anti-pattern.
13. **Unreachable serde mappings are removed.** Expressions registered
as `RuntimeReplaceable` (or otherwise rewritten by an analyzer
rule before serde) never reach `QueryPlanSerde.exprToProtoInternal`
with their original class. If the audit finds that a registered
`CometScalarFunction("name")` or `CometExpressionSerde` entry can
never be hit (e.g. the `btrim` mapping for `StringTrimBoth`, which
is rewritten to `StringTrim` before serde runs), delete the
registration in the same audit PR. Documenting the dead code in
the support doc is not enough.

---

Expand All @@ -411,6 +460,11 @@ For an untested case, run it manually first to determine the current
behaviour, then commit either a regression test (passes) or a
`query ignore(<issue-url>)` test (fails).

Every item in this bucket either becomes an inline fix + test, or a
filed GitHub issue + ignored regression test, in the audit PR. Step 7
spells out the workflow: never leave a high-priority finding as PR-body
prose only.

### Medium priority: missing test coverage

Low-risk coverage gaps: additional input permutations on already-tested
Expand All @@ -429,9 +483,13 @@ etc. These come from the Step 5 consistency audit.

High-priority findings (correctness divergences and high-risk coverage
gaps) and consistency issues from Step 5 / Step 6 must not be left as
prose. Apply them in the same PR as the audit. Only low-risk missing
coverage requires the user's go-ahead, because adding tests for cases
that already work on well-exercised paths is incremental polish.
prose. Apply them in the same PR as the audit. Anything you cannot fix
inline (because it needs a semantics decision, native code change, or
larger design work) must still be captured as a GitHub issue per the
"Findings that need follow-up" section below: prose recommendations in
the PR body alone are insufficient. Only low-risk missing coverage
requires the user's go-ahead, because adding tests for cases that
already work on well-exercised paths is incremental polish.

### High-priority findings: capture as tests

Expand Down Expand Up @@ -499,16 +557,80 @@ need user approval. The classes of fix are:
- Hoist a reason shared by multiple serdes (e.g. a recurring
TimestampNTZ caveat) into a small `private object` companion in the
same file, mirroring `UTCTimestampSerde`.
- Lift expression-shape restrictions (literal-only argument, foldable
child, unsupported child data type) out of `convert`'s `withInfo` +
`return None` and into an `Unsupported(Some(reason))` branch in
`getSupportLevel`. The `convert` body should then assume the
precondition holds and stop calling `withInfo` for that case.
- Promote a documented "Known divergence" or "Known limitation" sub-
bullet from a `Compatible` branch to `Incompatible(Some(reason))`
(or `Unsupported` if the native path errors) and link the tracking
issue. The sub-bullet stays as user-facing documentation. The
support level catches up to match.
- Mark expressions whose Spark 4.0+ path routes through
`CollationSupport.X.exec` (or accepts `StringTypeWithCollation` /
`StringTypeNonCSAICollation`) as `Incompatible(Some(reason))` for
non-default collations, linking
https://github.com/apache/datafusion-comet/issues/4496.
- Delete unreachable serde registrations (`RuntimeReplaceable` rewrites
the expression before serde runs, an analyzer rule strips the case,
etc.) rather than documenting them as a curiosity.

Each fix is one of these patterns. If a finding requires a semantics
decision (e.g. is a specific branch really `Unsupported`, or is it
`Incompatible`?), do not guess: leave it as a prose recommendation in
the PR description and call it out for the reviewer.
`Incompatible`?), do not guess: **file a GitHub issue per the
"Findings that need follow-up" section below** and link it from the PR
description. Do not leave the recommendation as prose only: prose in a
PR description gets buried as soon as the PR merges.

After every fix, build the affected module to make sure the edit
compiles. Do not run the full suite; targeted tests suffice if the
fix could plausibly affect behaviour.

### Findings that need follow-up: always file a tracking issue

Any high-priority finding (correctness divergence, robustness gap,
behavioural difference from Spark, missing-collation guard, etc.) that
this PR does not fix inline **must** be filed as a GitHub issue before
the PR is opened. This includes:

- Semantics decisions the audit surfaces but should not unilaterally
resolve (e.g. promote `Compatible` to `Incompatible`, change a
default).
- Architectural concerns that span multiple expressions (e.g. Spark 4.0
collation propagation across an entire family).
- Bugs that are fixable in principle but need more design or native
changes than fit in the audit PR.
- Documentation gaps surfaced by the audit (e.g. an expression that
doesn't appear in the auto-generated compatibility doc).

For each follow-up:

1. Search for an existing issue first
(`gh issue list --search "<expression> <symptom> in:title,body" --state all --limit 5`).
If a candidate match comes back, **open it
(`gh issue view <N> --repo apache/datafusion-comet`) and confirm the
title and body actually describe the divergence you found, and that
the issue is still `OPEN`**. A closed-but-fixed issue cited as
"known divergence" is worse than no citation, because the reader
follows the link and finds a fix that was already shipped. If it
matches, link it from the PR description and the support-doc
sub-bullet, and stop.
2. If no issue exists, file one with `gh issue create` using the
`correctness` label (or `documentation` for doc-only gaps) plus any
relevant area labels (e.g. `spark 4.0`). Title format:
`[Bug] <expression> <one-line symptom>`. Body includes: Spark version
range affected, a minimal repro, the divergent result, the relevant
Comet file/line, and a one-line note that the issue was surfaced by
this audit PR.
3. Reference the issue number from both the support-doc sub-bullet and
the PR description "What changes are included" section so reviewers
can see what work the audit intentionally deferred.

Prose-only "future work" notes in the PR description are not enough.
The whole point of the audit is to leave behind durable artefacts; an
unfiled finding evaporates after the PR merges.

### Low-risk missing test coverage: ask the user

This is the only Step 7 category that pauses for user input. It only
Expand Down Expand Up @@ -586,7 +708,7 @@ entry in `docs/source/contributor-guide/spark_expressions_support.md`.

Add one sub-bullet per Spark version checked, each including:

- Spark version (e.g. 3.4.3, 3.5.8, 4.0.1)
- Spark version (e.g. 3.4.3, 3.5.8, 4.0.1, 4.1.1)
- Today's date
- A brief note for any version-specific finding (behavioral difference, known incompatibility); omit if nothing notable

Expand All @@ -597,7 +719,7 @@ Add one sub-bullet per Spark version checked, each including:
Present the audit as:

1. **Expression Summary** - Brief description of what `$ARGUMENTS` does, its input/output types, and null behavior
2. **Spark Version Differences** - Summary of any behavioral or API differences across Spark 3.4.3, 3.5.8, and 4.0.1
2. **Spark Version Differences** - Summary of any behavioral or API differences across Spark 3.4.3, 3.5.8, 4.0.1, and 4.1.1
3. **Comet Implementation Notes** - Summary of how Comet implements this expression and any concerns
4. **Coverage Gap Analysis** - The gap table from Step 5, plus implementation gaps
5. **Recommendations** - Prioritized list from Step 6
Expand Down