docs: lead README with the Arrow-native framing#4428

Open
andygrove wants to merge 1 commit into apache:main from andygrove:docs-nomenclature-4419

@andygrove andygrove commented May 25, 2026

Which issue does this PR close?

Part of #4419 (the documentation-only phase, first slice).

Rationale for this change

Comet's documentation conflates several distinct ideas under the word "native": implementation language (Rust vs JVM), pipeline membership (handled by Comet vs falls back to Spark), and data format (Arrow columnar vs Spark rows). Issue #4419 spells out a clearer vocabulary so the docs stop overloading "native" and can scale to a roadmap where some JVM code paths (today Scala UDF codegen; soon Arrow UDFs and hybrid impls) also live inside the Comet pipeline.
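As a rough illustration of why one word cannot cover all three ideas, the axes can be modelled as independent fields. This is only a sketch: the class names, the `is_arrow_native` helper, and the example operators are hypothetical and do not come from Comet's code base or from #4419.

```python
from dataclasses import dataclass
from enum import Enum


class Language(Enum):
    """Implementation language of the operator or expression."""
    RUST = "rust"
    JVM = "jvm"


class Pipeline(Enum):
    """Whether Comet handles the work or it falls back to Spark."""
    COMET = "comet"
    SPARK_FALLBACK = "spark-fallback"


class DataFormat(Enum):
    """Format of the data the operator consumes and produces."""
    ARROW_COLUMNAR = "arrow-columnar"
    SPARK_ROWS = "spark-rows"


@dataclass(frozen=True)
class OperatorProfile:
    # Hypothetical record type: one value per axis, chosen independently.
    name: str
    language: Language
    pipeline: Pipeline
    data_format: DataFormat

    def is_arrow_native(self) -> bool:
        # Assumed reading of "Arrow-native": in the Comet pipeline and on
        # Arrow batches, regardless of implementation language.
        return (
            self.pipeline is Pipeline.COMET
            and self.data_format is DataFormat.ARROW_COLUMNAR
        )


# Illustrative entries only; the assignments are assumptions for the sketch.
rust_hash_join = OperatorProfile(
    "hash join", Language.RUST, Pipeline.COMET, DataFormat.ARROW_COLUMNAR
)
jvm_scala_udf = OperatorProfile(
    "Scala UDF codegen", Language.JVM, Pipeline.COMET, DataFormat.ARROW_COLUMNAR
)
spark_fallback = OperatorProfile(
    "unsupported operator", Language.JVM, Pipeline.SPARK_FALLBACK, DataFormat.SPARK_ROWS
)
```

Under this reading, `jvm_scala_udf.is_arrow_native()` holds even though the code runs on the JVM, which is the case the overloaded word "native" fails to describe.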

The original version of this PR rolled the new vocabulary into ~20 files at once. This revision narrows the scope to a single file — the top-level README.md — so reviewers can sign off on the framing first. The user-guide and contributor-guide sweeps will follow as separate PRs.

What changes are included in this PR?

README.md only. The two top paragraphs and the "What Comet Accelerates" list are rewritten:

  • The opening paragraph leads with the Arrow-native pipeline (operators, expressions, shuffle, and broadcast all stay in Apache Arrow columnar format) rather than "native Rust implementations on Apache DataFusion."
  • The intro to "What Comet Accelerates" makes the Rust-vs-JVM split explicit: most operators and expressions run as Rust on DataFusion; some run as JVM code over Arrow batches. Either way the work stays in the Comet pipeline.
  • The shuffle bullet upgrades from "native columnar shuffle with support for hash and range partitioning" to "Arrow-IPC columnar shuffle ... in a native Rust implementation paired with a JVM fallback for unsupported partition key types."
  • A new bullet introduces the experimental Scala/Java UDF support and links to the existing user-guide page.

No other docs files are touched. Operator renames (CometExchange → CometNativeShuffleExchange, etc.) and the user-guide / contributor-guide vocabulary sweep are explicitly out of scope here and will land in follow-on PRs.

How are these changes tested?

Documentation only. Verified that the README renders correctly on GitHub and that the new wording matches the vocabulary rules in #4419.

@andygrove andygrove marked this pull request as draft May 25, 2026 17:29
@andygrove andygrove force-pushed the docs-nomenclature-4419 branch from f84408c to 9fbe7ce Compare May 28, 2026 14:10
Rewrite the top two paragraphs of README.md so the value prop leads
with the Arrow-native pipeline (operators, expressions, shuffle, and
broadcast all in Apache Arrow columnar format) rather than 'native
Rust implementations'. The accelerator list grows by one entry to
mention the experimental Scala/Java UDF support; shuffle and 'What
Comet Accelerates' wording is tightened to match.

No other docs are touched in this PR. Contributor-guide and user-guide
prose updates for the same vocabulary clean-up (apache#4419) will follow
separately.
@andygrove andygrove force-pushed the docs-nomenclature-4419 branch from 9fbe7ce to 91420db Compare May 28, 2026 14:12
@andygrove andygrove changed the title docs: adopt Arrow-native nomenclature across user and contributor guides docs: lead README with the Arrow-native framing May 28, 2026
@andygrove andygrove marked this pull request as ready for review May 28, 2026 14:13
@mbutrovich mbutrovich self-requested a review May 29, 2026 19:43
Comment thread README.md
It uses Apache Arrow for zero-copy data transfer between the JVM and native code.
Comet replaces Spark operators and expressions with implementations that consume and produce Apache Arrow
batches. Most run as native Rust code on top of Apache DataFusion; some run as JVM code over Arrow batches.
Either way the work stays in the Comet pipeline without falling back to Spark's row-based engine.

Suggested change
Either way the work stays in the Comet pipeline without falling back to Spark's row-based engine.
Either way, query execution stays in the Comet pipeline without falling back to Spark's row-based engine.

Comment thread README.md
- **Apache Iceberg**: accelerated Parquet scans when reading Iceberg tables from Spark
(see the [Iceberg guide](https://datafusion.apache.org/comet/user-guide/iceberg.html))
- **Shuffle**: native columnar shuffle with support for hash and range partitioning
- **Shuffle**: Arrow-IPC columnar shuffle with support for hash and range partitioning, in a native Rust

Didn't we add a (not 100% Spark-compatible) round-robin partitioning solution? Should we skip that if it's opt-in?

Comment thread README.md
map, JSON, hash, and predicate categories
- **Aggregations**: hash aggregate with support for `FILTER (WHERE ...)` clauses
- **Joins**: hash join, sort-merge join, and broadcast join
- **Scala/Java UDFs**: experimental support for keeping Scala/Java scalar UDFs in the Comet pipeline

We can drop "experimental" if #4514 merges first.
