docs: lead README with the Arrow-native framing by andygrove · Pull Request #4428 · apache/datafusion-comet

andygrove · 2026-05-25T17:11:33Z

Which issue does this PR close?

Part of #4419 (the documentation-only phase, first slice).

Rationale for this change

Comet's documentation conflates several distinct ideas under the word "native": implementation language (Rust vs JVM), pipeline membership (handled by Comet vs falls back to Spark), and data format (Arrow columnar vs Spark rows). Issue #4419 spells out a clearer vocabulary so the docs stop overloading "native" and can scale to a roadmap where some JVM code paths (today Scala UDF codegen; soon Arrow UDFs and hybrid impls) also live inside the Comet pipeline.
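To make the three meanings concrete, here is a minimal sketch that treats them as independent attributes of a code path. The class, the field names, and the example entries are hypothetical and do not correspond to real Comet types. The `is_arrow_native` predicate is one reading of the vocabulary in #4419, not a definition taken from it.

```python
from dataclasses import dataclass
from enum import Enum


class Language(Enum):
    RUST = "rust"
    JVM = "jvm"


class DataFormat(Enum):
    ARROW_COLUMNAR = "arrow"
    SPARK_ROWS = "rows"


@dataclass(frozen=True)
class CodePath:
    """Hypothetical descriptor for one operator or expression code path."""
    name: str
    language: Language          # implementation language
    in_comet_pipeline: bool     # handled by Comet vs falls back to Spark
    data_format: DataFormat     # Arrow columnar vs Spark rows


def is_arrow_native(path: CodePath) -> bool:
    # Assumed reading: stays in the Comet pipeline and works on Arrow data,
    # regardless of the implementation language.
    return path.in_comet_pipeline and path.data_format is DataFormat.ARROW_COLUMNAR


# Illustrative entries only; the names are not real Comet operator classes.
paths = [
    CodePath("rust_hash_join", Language.RUST, True, DataFormat.ARROW_COLUMNAR),
    CodePath("scala_udf_codegen", Language.JVM, True, DataFormat.ARROW_COLUMNAR),
    CodePath("spark_fallback", Language.JVM, False, DataFormat.SPARK_ROWS),
]

arrow_native = [p.name for p in paths if is_arrow_native(p)]
rust_only = [p.name for p in paths if p.language is Language.RUST]
```

Under these assumptions the JVM-based UDF path counts as Arrow-native even though it is not Rust, which is the distinction the single word "native" hides.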

The original version of this PR rolled the new vocabulary into ~20 files at once. This revision narrows the scope to a single file — the top-level README.md — so reviewers can sign off on the framing first. The user-guide and contributor-guide sweeps will follow as separate PRs.

What changes are included in this PR?

README.md only. The two top paragraphs and the "What Comet Accelerates" list are rewritten:

- The opening paragraph leads with the Arrow-native pipeline (operators, expressions, shuffle, and broadcast all stay in Apache Arrow columnar format) rather than "native Rust implementations on Apache DataFusion."
- The intro to "What Comet Accelerates" makes the Rust-vs-JVM split explicit: most operators and expressions run as Rust on DataFusion; some run as JVM code over Arrow batches. Either way the work stays in the Comet pipeline.
- The shuffle bullet upgrades from "native columnar shuffle with support for hash and range partitioning" to "Arrow-IPC columnar shuffle ... in a native Rust implementation paired with a JVM fallback for unsupported partition key types."
- A new bullet introduces the experimental Scala/Java UDF support and links to the existing user-guide page.
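The "either way the work stays in the Comet pipeline" point can be illustrated with a small sketch that counts data-format changes between adjacent plan steps. The step labels and the tuple layout are assumptions made for illustration, not Comet's plan representation.

```python
from typing import List, Tuple

# Each step is (label, implementation language, data format).
# All labels are hypothetical.
Step = Tuple[str, str, str]


def count_format_transitions(plan: List[Step]) -> int:
    """Count adjacent step pairs whose data format differs."""
    return sum(1 for a, b in zip(plan, plan[1:]) if a[2] != b[2])


# Rust and JVM steps mixed, but every step consumes and produces Arrow.
mixed_language_plan = [
    ("scan", "rust", "arrow"),
    ("scalar_udf", "jvm", "arrow"),
    ("hash_aggregate", "rust", "arrow"),
]

# One step falls back to row-based execution in the middle of the plan.
plan_with_fallback = [
    ("scan", "rust", "arrow"),
    ("unsupported_expr", "jvm", "rows"),
    ("hash_aggregate", "rust", "arrow"),
]

mixed_transitions = count_format_transitions(mixed_language_plan)
fallback_transitions = count_format_transitions(plan_with_fallback)
```

In this toy model the language switch in the first plan costs no format conversion, while the fallback in the second plan introduces two.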

No other docs files are touched. Operator renames (CometExchange → CometNativeShuffleExchange, etc.) and the user-guide / contributor-guide vocabulary sweep are explicitly out of scope here and will land in follow-on PRs.

How are these changes tested?

Documentation only. Verified that the README renders correctly on GitHub and that the new wording matches the vocabulary rules in #4419.

Rewrite the top two paragraphs of README.md so the value prop leads with the Arrow-native pipeline (operators, expressions, shuffle, and broadcast all in Apache Arrow columnar format) rather than 'native Rust implementations'. The accelerator list grows by one entry to mention the experimental Scala/Java UDF support; shuffle and 'What Comet Accelerates' wording is tightened to match. No other docs are touched in this PR. Contributor-guide and user-guide prose updates for the same vocabulary clean-up (apache#4419) will follow separately.

mbutrovich · 2026-05-29T19:46:25Z

-It uses Apache Arrow for zero-copy data transfer between the JVM and native code.
+Comet replaces Spark operators and expressions with implementations that consume and produce Apache Arrow
+batches. Most run as native Rust code on top of Apache DataFusion; some run as JVM code over Arrow batches.
+Either way the work stays in the Comet pipeline without falling back to Spark's row-based engine.


Suggested change

-Either way the work stays in the Comet pipeline without falling back to Spark's row-based engine.
+Either way, query execution stays in the Comet pipeline without falling back to Spark's row-based engine.

mbutrovich · 2026-05-29T19:47:29Z

 - **Apache Iceberg**: accelerated Parquet scans when reading Iceberg tables from Spark
  (see the [Iceberg guide](https://datafusion.apache.org/comet/user-guide/iceberg.html))
-- **Shuffle**: native columnar shuffle with support for hash and range partitioning
+- **Shuffle**: Arrow-IPC columnar shuffle with support for hash and range partitioning, in a native Rust


Didn't we add a (not 100% Spark-compatible) round-robin partitioning solution? Should we skip that if it's opt-in?

mbutrovich · 2026-05-29T19:48:06Z

  map, JSON, hash, and predicate categories
 - **Aggregations**: hash aggregate with support for `FILTER (WHERE ...)` clauses
 - **Joins**: hash join, sort-merge join, and broadcast join
+- **Scala/Java UDFs**: experimental support for keeping Scala/Java scalar UDFs in the Comet pipeline


We can drop "experimental" if #4514 merges first.

andygrove marked this pull request as draft May 25, 2026 17:29

andygrove force-pushed the docs-nomenclature-4419 branch from f84408c to 9fbe7ce Compare May 28, 2026 14:10

andygrove force-pushed the docs-nomenclature-4419 branch from 9fbe7ce to 91420db Compare May 28, 2026 14:12

andygrove changed the title from ~~docs: adopt Arrow-native nomenclature across user and contributor guides~~ to docs: lead README with the Arrow-native framing May 28, 2026

andygrove marked this pull request as ready for review May 28, 2026 14:13

mbutrovich self-requested a review May 29, 2026 19:43

mbutrovich reviewed May 29, 2026

andygrove wants to merge 1 commit into apache:main from andygrove:docs-nomenclature-4419

2 participants
