feat(iceberg): honor write.parquet.* via ParquetWriterBuilder::from_table_properties #2561

Draft
kszucs wants to merge 1 commit into apache:main from kszucs:feat-parquet-writer-cdc

@kszucs (Member) commented Jun 1, 2026

Which issue does this PR close?

  • Closes #.

What changes are included in this PR?

write.parquet.* table properties were honored only on the DataFusion INSERT INTO path, via an inline content-defined-chunking (CDC) translation. Any code writing through the writer stack directly (DataFileWriter / ParquetWriterBuilder) silently used parquet-rs defaults.

  • Add ParquetWriterBuilder::from_table_properties(&TableProperties, schema), which translates write.parquet.* settings into WriterProperties. It currently translates the content-defined-chunking keys (write.parquet.content-defined-chunking.*); other keys fall back to parquet-rs defaults and can be added to this single translation point later.
  • Add a chainable with_match_mode setter so the field match mode can be overridden (DataFusion needs name-based matching, since its Arrow batches carry no field-id metadata).
  • Refactor the DataFusion insert_into writer to build via from_table_properties, reusing the TableProperties it has already parsed instead of translating CDC options inline.

Additive only: new and new_with_match_mode are unchanged; no breaking changes.
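
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a single translation point combined with a chainable setter. It is not the code in this PR: `WriterSketch`, `CdcOptions`, and `MatchMode` are hypothetical stand-ins for the real builder, the parquet `WriterProperties`, and the field match mode. The key suffixes (`enabled`, `min-chunk-size`, `max-chunk-size`) and the default sizes are assumptions chosen for illustration. Only the `write.parquet.content-defined-chunking.` prefix comes from the description above.

```rust
use std::collections::HashMap;
use std::str::FromStr;

/// Prefix taken from the PR description; the suffixes used below are assumptions.
const CDC_PREFIX: &str = "write.parquet.content-defined-chunking.";

/// Hypothetical stand-in for the CDC portion of the parquet writer properties.
#[derive(Debug, Clone, PartialEq)]
pub struct CdcOptions {
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
}

/// Hypothetical stand-in for the field match mode.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MatchMode {
    Id,
    Name,
}

/// Hypothetical stand-in for the writer builder.
#[derive(Debug, Clone, PartialEq)]
pub struct WriterSketch {
    pub cdc: Option<CdcOptions>,
    pub match_mode: MatchMode,
}

/// Look up a CDC key by suffix in the raw property map.
fn lookup<'a>(props: &'a HashMap<String, String>, suffix: &str) -> Option<&'a str> {
    props
        .get(&format!("{}{}", CDC_PREFIX, suffix))
        .map(String::as_str)
}

/// Parse a CDC key, falling back to `default` when the key is absent.
fn parse_or<T: FromStr>(
    props: &HashMap<String, String>,
    suffix: &str,
    default: T,
) -> Result<T, String> {
    match lookup(props, suffix) {
        Some(raw) => raw
            .parse::<T>()
            .map_err(|_| format!("invalid value {:?} for {}{}", raw, CDC_PREFIX, suffix)),
        None => Ok(default),
    }
}

impl WriterSketch {
    /// Single translation point: CDC stays off unless explicitly enabled,
    /// and keys this function does not know about are ignored.
    pub fn from_table_properties(props: &HashMap<String, String>) -> Result<Self, String> {
        let cdc = if parse_or(props, "enabled", false)? {
            Some(CdcOptions {
                // Placeholder defaults, not the values used by parquet-rs.
                min_chunk_size: parse_or(props, "min-chunk-size", 256 * 1024)?,
                max_chunk_size: parse_or(props, "max-chunk-size", 1024 * 1024)?,
            })
        } else {
            None
        };
        Ok(Self { cdc, match_mode: MatchMode::Id })
    }

    /// Chainable override, mirroring the role of `with_match_mode`.
    pub fn with_match_mode(mut self, mode: MatchMode) -> Self {
        self.match_mode = mode;
        self
    }
}
```

In this sketch a DataFusion-style caller would write `WriterSketch::from_table_properties(&props)?.with_match_mode(MatchMode::Name)`, while a direct caller would keep the default mode. Both paths share the same property translation.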

Are these changes tested?

  • Unit tests in parquet_writer.rs: CDC off by default, CDC options translated from properties, and an end-to-end test that writes through the writer to local FS and asserts that the payload column is split into multiple variable-sized data pages with CDC enabled and written as a single page without it.
  • Existing test_insert_into* DataFusion integration tests cover the refactored path (behaviorally unchanged).
  • A new HF-gated integration test (hf_cdc_write_test) writes a CDC parquet file to a HuggingFace bucket and verifies content-chunking on read-back, wired into the existing ci_hf_cdc.yml workflow. Runs only when HF_TOKEN/HF_BUCKET are set.
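
The environment gating described in the last bullet can be sketched roughly as follows. This is an illustration of the pattern only, not the contents of hf_cdc_write_test: `HfConfig`, `hf_config_from`, `hf_config_from_env`, and the test name are hypothetical. Treating an empty variable the same as an unset one is an assumption. Only the HF_TOKEN and HF_BUCKET variable names come from the description.

```rust
use std::env;

/// Hypothetical holder for the settings the gated test needs.
#[derive(Debug, PartialEq)]
pub struct HfConfig {
    pub token: String,
    pub bucket: String,
}

/// Build the config from any key lookup, so the gating logic can be
/// exercised without touching the real process environment.
/// Returns `None` when either variable is missing or empty (assumption).
pub fn hf_config_from(lookup: impl Fn(&str) -> Option<String>) -> Option<HfConfig> {
    let token = lookup("HF_TOKEN").filter(|v| !v.is_empty())?;
    let bucket = lookup("HF_BUCKET").filter(|v| !v.is_empty())?;
    Some(HfConfig { token, bucket })
}

/// Read the config from the real process environment.
pub fn hf_config_from_env() -> Option<HfConfig> {
    hf_config_from(|key| env::var(key).ok())
}

#[test]
fn hf_gated_sketch() {
    // Skip quietly instead of failing when credentials are not configured.
    let Some(cfg) = hf_config_from_env() else {
        eprintln!("HF_TOKEN/HF_BUCKET not set, skipping");
        return;
    };
    // A real test would write the file and verify it on read-back here.
    assert!(!cfg.bucket.is_empty());
}
```

Returning early keeps a plain `cargo test` run green on machines without credentials, at the cost of the test reporting as passed rather than skipped.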

…able_properties

Add ParquetWriterBuilder::from_table_properties(&TableProperties, schema),
which translates write.parquet.* settings into parquet WriterProperties
instead of using parquet-rs defaults. Currently translates the
content-defined-chunking keys; other keys fall back to parquet-rs
defaults and can be added to this one translation point later. Add a
chainable with_match_mode setter to override field matching.

The DataFusion insert_into path now builds its writer from the table
properties it already parsed, instead of translating CDC options inline,
so the direct (non-DataFusion) writer path picks up the same behavior.

Add an HF-gated Rust integration test that writes a CDC parquet file to
HuggingFace Hub and verifies the payload column is content-chunked, and
wire it into the existing HF CI workflow.
@kszucs force-pushed the feat-parquet-writer-cdc branch from 8cb9071 to 2c8caf5 on June 1, 2026 15:15