From 91420dbb0519399564729f590f314c4111ae9930 Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Thu, 28 May 2026 08:12:31 -0600
Subject: [PATCH 1/3] docs: lead README with the Arrow-native framing

Rewrite the top two paragraphs of README.md so the value prop leads with
the Arrow-native pipeline (operators, expressions, shuffle, and broadcast
all in Apache Arrow columnar format) rather than 'native Rust
implementations'. The accelerator list grows by one entry to mention the
experimental Scala/Java UDF support; shuffle and 'What Comet Accelerates'
wording is tightened to match.

No other docs are touched in this PR. Contributor-guide and user-guide
prose updates for the same vocabulary clean-up (#4419) will follow
separately.
---
 README.md | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index cba865d96a..4827879671 100644
--- a/README.md
+++ b/README.md
@@ -35,10 +35,12 @@ under the License.
 
 logo
 
-Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
-[Apache DataFusion] query engine. Comet is designed to significantly enhance the
-performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
-Spark ecosystem without requiring any code changes.
+Apache DataFusion Comet is a high-performance accelerator for Apache Spark. Comet keeps Spark queries
+**Arrow-native end-to-end**: operators, expressions, shuffle, and broadcast all stay in Apache Arrow
+columnar format, avoiding the per-row overhead of Spark's row-based engine. Within the Arrow-native
+pipeline, operators and expressions execute as Rust code (via the [Apache DataFusion] query engine)
+or as JVM code that operates directly on Arrow batches. Comet integrates with the Spark ecosystem
+without requiring any code changes.
 **Comet provides a ~2x speedup for TPC-DS @ SF 1000 (1TB), resulting in ~50% cost savings.**
 
@@ -58,17 +60,22 @@ See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contribut
 
 ## What Comet Accelerates
 
-Comet replaces Spark operators and expressions with native Rust implementations that run on Apache DataFusion.
-It uses Apache Arrow for zero-copy data transfer between the JVM and native code.
+Comet replaces Spark operators and expressions with implementations that consume and produce Apache Arrow
+batches. Most run as native Rust code on top of Apache DataFusion; some run as JVM code over Arrow batches.
+Either way the work stays in the Comet pipeline without falling back to Spark's row-based engine.
 
 - **Parquet scans**: native Parquet reader integrated with Spark's query planner
 - **Apache Iceberg**: accelerated Parquet scans when reading Iceberg tables from Spark
   (see the [Iceberg guide](https://datafusion.apache.org/comet/user-guide/iceberg.html))
-- **Shuffle**: native columnar shuffle with support for hash and range partitioning
+- **Shuffle**: Arrow-IPC columnar shuffle with support for hash and range partitioning, in a native Rust
+  implementation paired with a JVM fallback for unsupported partition key types
 - **Expressions**: hundreds of supported Spark expressions across math, string, datetime, array,
   map, JSON, hash, and predicate categories
 - **Aggregations**: hash aggregate with support for `FILTER (WHERE ...)` clauses
 - **Joins**: hash join, sort-merge join, and broadcast join
+- **Scala/Java UDFs**: experimental support for keeping Scala/Java scalar UDFs in the Comet pipeline
+  via Spark's whole-stage codegen (see the
+  [Scala UDF guide](https://datafusion.apache.org/comet/user-guide/scala_java_udfs.html))
 
 For the authoritative lists, see the [supported expressions](https://datafusion.apache.org/comet/user-guide/expressions.html)
 and [supported operators](https://datafusion.apache.org/comet/user-guide/operators.html) pages.
From 235f4a53d002c802c12fdb88c4520c2c12841148 Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Mon, 1 Jun 2026 08:05:47 -0600
Subject: [PATCH 2/3] Update README.md

Co-authored-by: Matt Butrovich
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4827879671..a61e1332b7 100644
--- a/README.md
+++ b/README.md
@@ -62,7 +62,7 @@ See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contribut
 
 Comet replaces Spark operators and expressions with implementations that consume and produce Apache Arrow
 batches. Most run as native Rust code on top of Apache DataFusion; some run as JVM code over Arrow batches.
-Either way the work stays in the Comet pipeline without falling back to Spark's row-based engine.
+Either way, query execution stays in the Comet pipeline without falling back to Spark's row-based engine.
 
 - **Parquet scans**: native Parquet reader integrated with Spark's query planner
 - **Apache Iceberg**: accelerated Parquet scans when reading Iceberg tables from Spark

From 7f03711c9d6c3223e4f3ba50272b943a83542d56 Mon Sep 17 00:00:00 2001
From: Andy Grove
Date: Mon, 1 Jun 2026 10:33:45 -0600
Subject: [PATCH 3/3] docs: drop 'experimental' from Scala/Java UDF bullet

#4514 (JVM Scala UDF codegen dispatch enabled by default) has merged, so
the support is no longer experimental.
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a61e1332b7..c6b5bef01b 100644
--- a/README.md
+++ b/README.md
@@ -73,7 +73,7 @@ Either way, query execution stays in the Comet pipeline without falling back to
   map, JSON, hash, and predicate categories
 - **Aggregations**: hash aggregate with support for `FILTER (WHERE ...)` clauses
 - **Joins**: hash join, sort-merge join, and broadcast join
-- **Scala/Java UDFs**: experimental support for keeping Scala/Java scalar UDFs in the Comet pipeline
+- **Scala/Java UDFs**: support for keeping Scala/Java scalar UDFs in the Comet pipeline
   via Spark's whole-stage codegen (see the
   [Scala UDF guide](https://datafusion.apache.org/comet/user-guide/scala_java_udfs.html))