From 19c35e1ad27d8f0c2797ff095ef5663ab19ba792 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Tue, 2 Jun 2026 07:17:41 -0600 Subject: [PATCH 1/3] docs: expand operators page into a complete Spark operator support reference [skip ci] --- docs/source/user-guide/latest/operators.md | 143 ++++++++++++++++----- 1 file changed, 114 insertions(+), 29 deletions(-) diff --git a/docs/source/user-guide/latest/operators.md b/docs/source/user-guide/latest/operators.md index 495141f9bb..fa2600880a 100644 --- a/docs/source/user-guide/latest/operators.md +++ b/docs/source/user-guide/latest/operators.md @@ -17,32 +17,117 @@ under the License. --> -# Supported Spark Operators - -The following Spark operators are currently replaced with native versions. Query stages that contain any operators -not supported by Comet will fall back to regular Spark execution. - -| Operator | Spark-Compatible? | Compatibility Notes | -| --------------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------ | -| BatchScanExec | Yes | Supports Parquet files and Apache Iceberg Parquet scans. See the [Comet Compatibility Guide] for more information. | -| BroadcastExchangeExec | Yes | | -| BroadcastHashJoinExec | Yes | | -| ExpandExec | Yes | | -| FileSourceScanExec | Yes | Supports Parquet files. See the [Comet Compatibility Guide] for more information. | -| FilterExec | Yes | | -| GenerateExec | Yes | Supports `explode` and `posexplode` generators (arrays only, `_outer` variants are incompatible). | -| GlobalLimitExec | Yes | | -| HashAggregateExec | Yes | | -| InsertIntoHadoopFsRelationCommand | No | Experimental support for native Parquet writes. Disabled by default. | -| LocalLimitExec | Yes | | -| LocalTableScanExec | No | Experimental and disabled by default. | -| ObjectHashAggregateExec | Yes | Supports a limited number of aggregates, such as `bloom_filter_agg`. | -| ProjectExec | Yes | | -| ShuffleExchangeExec | Yes | | -| ShuffledHashJoinExec | Yes | | -| SortExec | Yes | | -| SortMergeJoinExec | Yes | | -| UnionExec | Yes | | -| WindowExec | No | Disabled by default due to known correctness issues. | - -[Comet Compatibility Guide]: compatibility/operators.md +# Spark Operator Support + +This page is the complete reference for how Apache Comet handles each Spark physical operator. +Comet replaces supported operators with native equivalents. Comet runs whole subtrees of native +operators together, so if a query stage contains an operator Comet does not support, that stage +falls back to regular Spark execution. Results are unaffected. + +Operators marked ✅ Supported are enabled by default. Each can be turned off individually with +`spark.comet.exec.OPERATOR.enabled=false` (for example `spark.comet.exec.sort.enabled=false`), and +all native execution can be turned off with `spark.comet.exec.enabled=false`. See the +[Comet Configuration Guide](configs.md) for the full list. + +## Status legend + +| Status | Meaning | +| ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------- | +| ✅ Supported | Native implementation; enabled by default. | +| ⚠️ Supported (caveats) | Works, but with limits: restricted to certain inputs, experimental, or disabled by default. See the [Compatibility Guide](compatibility/index.md). | +| 🔜 Planned | Intended; tracked by an open issue or pull request. | +| 💤 Not currently planned | Not on the current roadmap; falls back to Spark and may be reconsidered later. | + +## Not currently planned + +The following operator families fall back to Spark and are not on the current roadmap: + +- **Structured Streaming operators** (`StateStoreSaveExec`, `StateStoreRestoreExec`, `StreamingSymmetricHashJoinExec`, and similar): Comet targets batch execution. +- **Cartesian / cross joins** (`CartesianProductExec`): rare and expensive, with little acceleration benefit. +- **Sampling and range generation** (`SampleExec`, `RangeExec`): niche leaf operators. + +## Scans + +| Operator | Status | Notes | +| ----------------------- | ------ | ------------------------------------------------------------------------------------------------------------- | +| `FileSourceScanExec` | ✅ | Parquet only. Some types and configurations fall back. See the [Compatibility Guide](compatibility/index.md). | +| `BatchScanExec` | ✅ | Parquet, Apache Iceberg Parquet, and CSV (native) scans. | +| `LocalTableScanExec` | ⚠️ | Experimental, disabled by default (#4393). | +| `InMemoryTableScanExec` | 🔜 | Cached / in-memory table scans fall back today. | +| `RangeExec` | 💤 | See [Not currently planned](#not-currently-planned). | + +## Projection and filtering + +| Operator | Status | Notes | +| ------------- | ------ | ----- | +| `ProjectExec` | ✅ | | +| `FilterExec` | ✅ | | + +## Sorting and limiting + +| Operator | Status | Notes | +| --------------------------- | ------ | ----- | +| `SortExec` | ✅ | | +| `GlobalLimitExec` | ✅ | | +| `LocalLimitExec` | ✅ | | +| `CollectLimitExec` | ✅ | | +| `TakeOrderedAndProjectExec` | ✅ | | + +## Aggregation + +| Operator | Status | Notes | +| ------------------------- | ------ | ----------------------------------------------------------------- | +| `HashAggregateExec` | ✅ | | +| `ObjectHashAggregateExec` | ✅ | Supports a limited set of aggregates, such as `bloom_filter_agg`. | +| `SortAggregateExec` | 🔜 | Falls back today; Comet currently accelerates hash aggregates. | + +## Joins + +| Operator | Status | Notes | +| ----------------------------- | ------ | ---------------------------------------------------- | +| `BroadcastHashJoinExec` | ✅ | | +| `ShuffledHashJoinExec` | ✅ | | +| `SortMergeJoinExec` | ✅ | | +| `BroadcastNestedLoopJoinExec` | 🔜 | In progress (#4429). | +| `CartesianProductExec` | 💤 | See [Not currently planned](#not-currently-planned). | + +## Exchanges + +| Operator | Status | Notes | +| ----------------------- | ------ | ----- | +| `ShuffleExchangeExec` | ✅ | | +| `BroadcastExchangeExec` | ✅ | | + +## Window + +| Operator | Status | Notes | +| ---------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------ | +| `WindowExec` | ⚠️ | Runs natively, but only a subset of window functions is accelerated. The rest fall back. See the [expression reference](expressions.md) (#2721). | +| `WindowGroupLimitExec` | 🔜 | Window-based limit pushdown falls back today. | + +## Generators and set operations + +| Operator | Status | Notes | +| -------------- | ------ | -------------------------------------------------------------------------------------------------------------------------- | +| `GenerateExec` | ⚠️ | Supports `explode` and `posexplode` over arrays. The `_outer` variants are incompatible, and `inline` / `stack` fall back. | +| `ExpandExec` | ✅ | | +| `UnionExec` | ✅ | | +| `CoalesceExec` | ✅ | | + +## Writes + +| Operator | Status | Notes | +| ------------------------ | ------ | ----------------------------------------------------------------- | +| `DataWritingCommandExec` | ⚠️ | Experimental native Parquet writes, disabled by default (opt-in). | + +## Python and UDF + +| Operator | Status | Notes | +| --------------------------------------------------------------------------------------- | ------ | -------------------------------------------------------------------- | +| `ArrowEvalPythonExec`, `MapInArrowExec`, `MapInPandasExec`, `FlatMapGroupsInPandasExec` | 🔜 | Experimental accelerated PyArrow UDF support is in progress (#4234). | +| `BatchEvalPythonExec` | 💤 | Pickled (non-Arrow) Python UDFs. | + +## See also + +- [Comet Compatibility Guide](compatibility/index.md) - known incompatibilities and edge cases. +- [Supported Spark Expressions](expressions.md) - the equivalent reference for expressions. From 3723209185d6b97b7f14e1c4ed95006628c75e2b Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Tue, 2 Jun 2026 13:13:35 -0600 Subject: [PATCH 2/3] docs: rewrite data types page and link operator scans to scan compat doc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Rewrite datatypes.md to match the status-aware reference style introduced for operators and expressions, listing every Spark data type by status (✅ / ⚠️ / 🔜 / 💤) with caveats and tracking issues. - In operators.md, link each scan row to the relevant compatibility doc (Parquet Scan Compatibility, Iceberg Guide). - Note that LocalTableScanExec is disabled by default because there is no acceleration advantage and it is typically only used in test code. [skip ci] --- docs/source/user-guide/latest/datatypes.md | 121 ++++++++++++++++----- docs/source/user-guide/latest/operators.md | 14 +-- 2 files changed, 100 insertions(+), 35 deletions(-) diff --git a/docs/source/user-guide/latest/datatypes.md b/docs/source/user-guide/latest/datatypes.md index 7bc4d66168..5726800de6 100644 --- a/docs/source/user-guide/latest/datatypes.md +++ b/docs/source/user-guide/latest/datatypes.md @@ -17,31 +17,96 @@ under the License. --> -# Supported Spark Data Types - -Comet supports the following Spark data types. Refer to the [Comet Compatibility Guide] for information about data -type support in scans and other operators. - -[Comet Compatibility Guide]: compatibility/index.md - - - -| Data Type | -| ------------ | -| Null | -| Boolean | -| Byte | -| Short | -| Integer | -| Long | -| Float | -| Double | -| Decimal | -| String | -| Binary | -| Date | -| Timestamp | -| TimestampNTZ | -| Struct | -| Array | -| Map | +# Spark Data Type Support + +This page is the complete reference for how Apache Comet handles each Spark data type. Comet's +native execution path is built on Apache Arrow, so the set of types Comet can express natively +is constrained by Arrow's type system. When a query references a type Comet does not support, +the relevant operator falls back to Spark; results are unaffected. + +For per-scan and per-operator type caveats (for example, Parquet read-time conversions or +hash-aggregate group-key restrictions), see the [Compatibility Guide](compatibility/index.md). + +## Status legend + +| Status | Meaning | +| ------------------------ | ----------------------------------------------------------------------------------------------------- | +| ✅ Supported | Native support; enabled by default. | +| ⚠️ Supported (caveats) | Works, but with limits: certain values, contexts, or configurations fall back to Spark. | +| 🔜 Planned | Intended; tracked by an open issue or pull request. | +| 💤 Not currently planned | Not on the current roadmap; queries referencing this type fall back to Spark and may be reconsidered. | + +## Numeric + +| Type | Status | Notes | +| ------------- | ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `ByteType` | ✅ | | +| `ShortType` | ⚠️ | Parquet scans fall back by default to disambiguate signed `INT16` from unsigned `UINT_8`. See [Parquet Scan Compatibility](compatibility/scans.md). | +| `IntegerType` | ✅ | | +| `LongType` | ✅ | | +| `FloatType` | ⚠️ | NaN and signed-zero handling can diverge from Spark in comparisons and aggregations. See [Floating-point Compatibility](compatibility/floating-point.md). | +| `DoubleType` | ⚠️ | NaN and signed-zero handling can diverge from Spark in comparisons and aggregations. See [Floating-point Compatibility](compatibility/floating-point.md). | +| `DecimalType` | ⚠️ | Decimals encoded in Parquet binary format fall back at scan time. See [Parquet Scan Compatibility](compatibility/scans.md). | + +## String and binary + +| Type | Status | Notes | +| ------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `StringType` | ⚠️ | Default UTF-8 binary collation is supported. Non-default collations (Spark 4.0+) fall back ([#2190](https://github.com/apache/datafusion-comet/issues/2190)). Invalid UTF-8 bytes in Parquet `STRING` columns raise an error rather than falling back. | +| `BinaryType` | ✅ | | +| `CharType` | ⚠️ | Spark normalizes `CHAR(n)` to `StringType` for evaluation; same caveats apply. | +| `VarcharType` | ⚠️ | Spark normalizes `VARCHAR(n)` to `StringType` for evaluation; same caveats apply. | + +## Boolean + +| Type | Status | Notes | +| ------------- | ------ | ----- | +| `BooleanType` | ✅ | | + +## Datetime + +| Type | Status | Notes | +| ------------------ | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `DateType` | ⚠️ | Datetime rebasing for dates written before October 15, 1582 is not applied at scan time. See [Parquet Scan Compatibility](compatibility/scans.md). | +| `TimestampType` | ⚠️ | Datetime rebasing for timestamps written before Spark 3.0 is not applied at scan time. See [Parquet Scan Compatibility](compatibility/scans.md). | +| `TimestampNTZType` | ⚠️ | On Spark 3.x, reading Parquet `TimestampLTZ` as `TimestampNTZ` returns the raw UTC instant instead of raising an error. Spark 4.0+ matches Spark behavior. See [Parquet Scan Compatibility](compatibility/scans.md). | +| `TimeType` | ⚠️ | Spark 4.1+. Native serialization is in place; some operators (sort, shuffle, min/max) are still being wired up ([#4288](https://github.com/apache/datafusion-comet/issues/4288)). | + +## Interval + +Interval types fall back to Spark today. Native acceleration is tracked by +[#4540](https://github.com/apache/datafusion-comet/issues/4540). + +| Type | Status | Notes | +| ----------------------- | ------ | ----------------- | +| `YearMonthIntervalType` | 🔜 | Tracked by #4540. | +| `DayTimeIntervalType` | 🔜 | Tracked by #4540. | +| `CalendarIntervalType` | 🔜 | Tracked by #4540. | + +## Complex + +| Type | Status | Notes | +| ------------ | ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `StructType` | ⚠️ | Empty structs (no fields) fall back. Default values that are nested types fall back at scan time. | +| `ArrayType` | ✅ | | +| `MapType` | ⚠️ | Hash aggregate group keys cannot contain a `MapType` (transitively): Arrow's row format used by DataFusion's grouped hash aggregate does not support `Map`, so such groupings fall back. | + +## Variant + +| Type | Status | Notes | +| ------------- | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `VariantType` | 🔜 | Spark 4.0+. Native scan support is tracked by [#4295](https://github.com/apache/datafusion-comet/issues/4295); shredded Parquet read/write by [#3983](https://github.com/apache/datafusion-comet/issues/3983). | + +## Other + +| Type | Status | Notes | +| ----------------- | ------ | -------------------------------------------------------------------------------------------------------------------------------------- | +| `NullType` | ✅ | | +| `UserDefinedType` | 💤 | User-defined types are application-specific and outside the scope of native acceleration; queries referencing UDTs fall back to Spark. | + +## See also + +- [Comet Compatibility Guide](compatibility/index.md) - known incompatibilities and edge cases. +- [Parquet Scan Compatibility](compatibility/scans.md) - per-type behavior at scan time. +- [Supported Spark Operators](operators.md) - the equivalent reference for operators. +- [Supported Spark Expressions](expressions.md) - the equivalent reference for expressions. diff --git a/docs/source/user-guide/latest/operators.md b/docs/source/user-guide/latest/operators.md index fa2600880a..66f2bdfc87 100644 --- a/docs/source/user-guide/latest/operators.md +++ b/docs/source/user-guide/latest/operators.md @@ -48,13 +48,13 @@ The following operator families fall back to Spark and are not on the current ro ## Scans -| Operator | Status | Notes | -| ----------------------- | ------ | ------------------------------------------------------------------------------------------------------------- | -| `FileSourceScanExec` | ✅ | Parquet only. Some types and configurations fall back. See the [Compatibility Guide](compatibility/index.md). | -| `BatchScanExec` | ✅ | Parquet, Apache Iceberg Parquet, and CSV (native) scans. | -| `LocalTableScanExec` | ⚠️ | Experimental, disabled by default (#4393). | -| `InMemoryTableScanExec` | 🔜 | Cached / in-memory table scans fall back today. | -| `RangeExec` | 💤 | See [Not currently planned](#not-currently-planned). | +| Operator | Status | Notes | +| ----------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `FileSourceScanExec` | ✅ | Parquet only. Some types and configurations fall back. See [Parquet Scan Compatibility](compatibility/scans.md). | +| `BatchScanExec` | ✅ | Parquet, Apache Iceberg Parquet, and CSV (native) scans. See [Parquet Scan Compatibility](compatibility/scans.md) and the [Iceberg Guide](iceberg.md). | +| `LocalTableScanExec` | ⚠️ | Disabled by default; there is no acceleration advantage and this operator is typically only used in test code. Can be opted into via config (#4393). | +| `InMemoryTableScanExec` | 🔜 | Cached / in-memory table scans fall back today. | +| `RangeExec` | 💤 | See [Not currently planned](#not-currently-planned). | ## Projection and filtering From 024b4de36f2414d51e9ce0d0122d53da30be149e Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Tue, 2 Jun 2026 17:26:19 -0600 Subject: [PATCH 3/3] docs: mark supported data types with green checkmark and drop Parquet-scan caveats Parquet-scan-specific caveats (ShortType INT16/UINT_8 disambiguation, Decimal binary encoding, datetime rebasing, TimestampNTZ read conversion) are documented in the Parquet scan compatibility guide, so they no longer downgrade the type status here. Supported types now show a green checkmark. [skip ci] --- docs/source/user-guide/latest/datatypes.md | 36 +++++++++++----------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/docs/source/user-guide/latest/datatypes.md b/docs/source/user-guide/latest/datatypes.md index 5726800de6..c0d1ef7087 100644 --- a/docs/source/user-guide/latest/datatypes.md +++ b/docs/source/user-guide/latest/datatypes.md @@ -41,21 +41,21 @@ hash-aggregate group-key restrictions), see the [Compatibility Guide](compatibil | Type | Status | Notes | | ------------- | ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------- | | `ByteType` | ✅ | | -| `ShortType` | ⚠️ | Parquet scans fall back by default to disambiguate signed `INT16` from unsigned `UINT_8`. See [Parquet Scan Compatibility](compatibility/scans.md). | +| `ShortType` | ✅ | | | `IntegerType` | ✅ | | | `LongType` | ✅ | | -| `FloatType` | ⚠️ | NaN and signed-zero handling can diverge from Spark in comparisons and aggregations. See [Floating-point Compatibility](compatibility/floating-point.md). | -| `DoubleType` | ⚠️ | NaN and signed-zero handling can diverge from Spark in comparisons and aggregations. See [Floating-point Compatibility](compatibility/floating-point.md). | -| `DecimalType` | ⚠️ | Decimals encoded in Parquet binary format fall back at scan time. See [Parquet Scan Compatibility](compatibility/scans.md). | +| `FloatType` | ✅ | NaN and signed-zero handling can diverge from Spark in comparisons and aggregations. See [Floating-point Compatibility](compatibility/floating-point.md). | +| `DoubleType` | ✅ | NaN and signed-zero handling can diverge from Spark in comparisons and aggregations. See [Floating-point Compatibility](compatibility/floating-point.md). | +| `DecimalType` | ✅ | | ## String and binary -| Type | Status | Notes | -| ------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `StringType` | ⚠️ | Default UTF-8 binary collation is supported. Non-default collations (Spark 4.0+) fall back ([#2190](https://github.com/apache/datafusion-comet/issues/2190)). Invalid UTF-8 bytes in Parquet `STRING` columns raise an error rather than falling back. | -| `BinaryType` | ✅ | | -| `CharType` | ⚠️ | Spark normalizes `CHAR(n)` to `StringType` for evaluation; same caveats apply. | -| `VarcharType` | ⚠️ | Spark normalizes `VARCHAR(n)` to `StringType` for evaluation; same caveats apply. | +| Type | Status | Notes | +| ------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `StringType` | ✅ | Default UTF-8 binary collation is supported. Non-default collations (Spark 4.0+) fall back ([#2190](https://github.com/apache/datafusion-comet/issues/2190)). | +| `BinaryType` | ✅ | | +| `CharType` | ✅ | Spark normalizes `CHAR(n)` to `StringType` for evaluation; same caveats apply. | +| `VarcharType` | ✅ | Spark normalizes `VARCHAR(n)` to `StringType` for evaluation; same caveats apply. | ## Boolean @@ -65,12 +65,12 @@ hash-aggregate group-key restrictions), see the [Compatibility Guide](compatibil ## Datetime -| Type | Status | Notes | -| ------------------ | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `DateType` | ⚠️ | Datetime rebasing for dates written before October 15, 1582 is not applied at scan time. See [Parquet Scan Compatibility](compatibility/scans.md). | -| `TimestampType` | ⚠️ | Datetime rebasing for timestamps written before Spark 3.0 is not applied at scan time. See [Parquet Scan Compatibility](compatibility/scans.md). | -| `TimestampNTZType` | ⚠️ | On Spark 3.x, reading Parquet `TimestampLTZ` as `TimestampNTZ` returns the raw UTC instant instead of raising an error. Spark 4.0+ matches Spark behavior. See [Parquet Scan Compatibility](compatibility/scans.md). | -| `TimeType` | ⚠️ | Spark 4.1+. Native serialization is in place; some operators (sort, shuffle, min/max) are still being wired up ([#4288](https://github.com/apache/datafusion-comet/issues/4288)). | +| Type | Status | Notes | +| ------------------ | ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `DateType` | ✅ | | +| `TimestampType` | ✅ | | +| `TimestampNTZType` | ✅ | | +| `TimeType` | ⚠️ | Spark 4.1+. Native serialization is in place; some operators (sort, shuffle, min/max) are still being wired up ([#4288](https://github.com/apache/datafusion-comet/issues/4288)). | ## Interval @@ -87,9 +87,9 @@ Interval types fall back to Spark today. Native acceleration is tracked by | Type | Status | Notes | | ------------ | ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `StructType` | ⚠️ | Empty structs (no fields) fall back. Default values that are nested types fall back at scan time. | +| `StructType` | ✅ | Empty structs (no fields) fall back. | | `ArrayType` | ✅ | | -| `MapType` | ⚠️ | Hash aggregate group keys cannot contain a `MapType` (transitively): Arrow's row format used by DataFusion's grouped hash aggregate does not support `Map`, so such groupings fall back. | +| `MapType` | ✅ | Hash aggregate group keys cannot contain a `MapType` (transitively): Arrow's row format used by DataFusion's grouped hash aggregate does not support `Map`, so such groupings fall back. | ## Variant