docs: lead README with the Arrow-native framing #4428
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -35,10 +35,12 @@ under the License. | |
|
|
||
| <img src="docs/source/_static/images/DataFusionComet-Logo-Light.png" width="512" alt="logo"/> | ||
|
|
||
| - Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful | ||
| - [Apache DataFusion] query engine. Comet is designed to significantly enhance the | ||
| - performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the | ||
| - Spark ecosystem without requiring any code changes. | ||
| + Apache DataFusion Comet is a high-performance accelerator for Apache Spark. Comet keeps Spark queries | ||
| + **Arrow-native end-to-end**: operators, expressions, shuffle, and broadcast all stay in Apache Arrow | ||
| + columnar format, avoiding the per-row overhead of Spark's row-based engine. Within the Arrow-native | ||
| + pipeline, operators and expressions execute as Rust code (via the [Apache DataFusion] query engine) | ||
| + or as JVM code that operates directly on Arrow batches. Comet integrates with the Spark ecosystem | ||
| + without requiring any code changes. | ||
|
|
||
| **Comet provides a ~2x speedup for TPC-DS @ SF 1000 (1TB), resulting in ~50% cost savings.** | ||
|
|
||
|
|
@@ -58,17 +60,22 @@ See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contribut | |
|
|
||
| ## What Comet Accelerates | ||
|
|
||
| - Comet replaces Spark operators and expressions with native Rust implementations that run on Apache DataFusion. | ||
| - It uses Apache Arrow for zero-copy data transfer between the JVM and native code. | ||
| + Comet replaces Spark operators and expressions with implementations that consume and produce Apache Arrow | ||
| + batches. Most run as native Rust code on top of Apache DataFusion; some run as JVM code over Arrow batches. | ||
| + Either way the work stays in the Comet pipeline without falling back to Spark's row-based engine. | ||
|
|
||
| - **Parquet scans**: native Parquet reader integrated with Spark's query planner | ||
| - **Apache Iceberg**: accelerated Parquet scans when reading Iceberg tables from Spark | ||
| (see the [Iceberg guide](https://datafusion.apache.org/comet/user-guide/iceberg.html)) | ||
| - - **Shuffle**: native columnar shuffle with support for hash and range partitioning | ||
| + - **Shuffle**: Arrow-IPC columnar shuffle with support for hash and range partitioning, in a native Rust | ||
|
Contributor
Didn't we add a (not 100% Spark-compatible) round-robin partitioning solution? Should we skip that if it's opt-in?
||
| + implementation paired with a JVM fallback for unsupported partition key types | ||
| - **Expressions**: hundreds of supported Spark expressions across math, string, datetime, array, | ||
| map, JSON, hash, and predicate categories | ||
| - **Aggregations**: hash aggregate with support for `FILTER (WHERE ...)` clauses | ||
| - **Joins**: hash join, sort-merge join, and broadcast join | ||
| + - **Scala/Java UDFs**: experimental support for keeping Scala/Java scalar UDFs in the Comet pipeline | ||
|
Contributor
We can drop "experimental" if #4514 merges first.
||
| + via Spark's whole-stage codegen (see the | ||
| + [Scala UDF guide](https://datafusion.apache.org/comet/user-guide/scala_java_udfs.html)) | ||
|
|
||
| For the authoritative lists, see the [supported expressions](https://datafusion.apache.org/comet/user-guide/expressions.html) | ||
| and [supported operators](https://datafusion.apache.org/comet/user-guide/operators.html) pages. | ||
|
|
||
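For readers less familiar with the two partitioning schemes named in the Shuffle bullet, the following is a minimal conceptual sketch in plain Python. It is not Comet's implementation and does not use Arrow; the function names (`hash_partition`, `range_partition`, `split_batch`), the CRC32 hash, and the dict-of-lists "batch" are all assumptions chosen for illustration.

```python
from bisect import bisect_left
from zlib import crc32


def hash_partition(keys, num_partitions):
    """Assign each key to a partition by hashing it.

    CRC32 is used only because it is stable across runs; it is an
    illustrative choice, not the hash any real engine is known to use here.
    """
    return [crc32(repr(k).encode("utf-8")) % num_partitions for k in keys]


def range_partition(keys, upper_bounds):
    """Assign each key to a partition using sorted upper bounds.

    Keys <= upper_bounds[0] go to partition 0, keys <= upper_bounds[1] go to
    partition 1, and keys above the last bound go to the final partition.
    """
    return [bisect_left(upper_bounds, k) for k in keys]


def split_batch(batch, assignments, num_partitions):
    """Split a columnar batch (dict of equal-length lists) column by column."""
    parts = [{name: [] for name in batch} for _ in range(num_partitions)]
    for name, column in batch.items():
        for value, p in zip(column, assignments):
            parts[p][name].append(value)
    return parts


# Hypothetical sample data, for illustration only.
batch = {"id": [5, 15, 25, 10], "name": ["a", "b", "c", "d"]}
by_range = split_batch(batch, range_partition(batch["id"], [10, 20]), 3)
by_hash = split_batch(batch, hash_partition(batch["id"], 2), 2)
```

Hash partitioning sends equal keys to the same partition without preserving order between partitions, while range partitioning keeps each partition's keys within a contiguous interval, which is why a sort can be built on top of it.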