diff --git a/README.md b/README.md
index cba865d96a..4827879671 100644
--- a/README.md
+++ b/README.md
@@ -35,10 +35,12 @@ under the License.
 logo

-Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
-[Apache DataFusion] query engine. Comet is designed to significantly enhance the
-performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
-Spark ecosystem without requiring any code changes.
+Apache DataFusion Comet is a high-performance accelerator for Apache Spark. Comet keeps Spark queries
+**Arrow-native end-to-end**: operators, expressions, shuffle, and broadcast all stay in Apache Arrow
+columnar format, avoiding the per-row overhead of Spark's row-based engine. Within the Arrow-native
+pipeline, operators and expressions execute as Rust code (via the [Apache DataFusion] query engine)
+or as JVM code that operates directly on Arrow batches. Comet integrates with the Spark ecosystem
+without requiring any code changes.

 **Comet provides a ~2x speedup for TPC-DS @ SF 1000 (1TB), resulting in ~50% cost savings.**

@@ -58,17 +60,22 @@ See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contribut

 ## What Comet Accelerates

-Comet replaces Spark operators and expressions with native Rust implementations that run on Apache DataFusion.
-It uses Apache Arrow for zero-copy data transfer between the JVM and native code.
+Comet replaces Spark operators and expressions with implementations that consume and produce Apache Arrow
+batches. Most run as native Rust code on top of Apache DataFusion; some run as JVM code over Arrow batches.
+Either way the work stays in the Comet pipeline without falling back to Spark's row-based engine.
 - **Parquet scans**: native Parquet reader integrated with Spark's query planner
 - **Apache Iceberg**: accelerated Parquet scans when reading Iceberg tables from Spark (see the [Iceberg guide](https://datafusion.apache.org/comet/user-guide/iceberg.html))
-- **Shuffle**: native columnar shuffle with support for hash and range partitioning
+- **Shuffle**: Arrow-IPC columnar shuffle with support for hash and range partitioning, in a native Rust
+  implementation paired with a JVM fallback for unsupported partition key types
 - **Expressions**: hundreds of supported Spark expressions across math, string, datetime, array, map, JSON, hash, and predicate categories
 - **Aggregations**: hash aggregate with support for `FILTER (WHERE ...)` clauses
 - **Joins**: hash join, sort-merge join, and broadcast join
+- **Scala/Java UDFs**: experimental support for keeping Scala/Java scalar UDFs in the Comet pipeline
+  via Spark's whole-stage codegen (see the
+  [Scala UDF guide](https://datafusion.apache.org/comet/user-guide/scala_java_udfs.html))

 For the authoritative lists, see the [supported expressions](https://datafusion.apache.org/comet/user-guide/expressions.html) and [supported operators](https://datafusion.apache.org/comet/user-guide/operators.html) pages.