[branch-54] refactor: give parquet CDC options an explicit enabled flag (backport #22632)#22648
Open
kszucs wants to merge 1 commit into
Open
[branch-54] refactor: give parquet CDC options an explicit enabled flag (backport #22632)#22648kszucs wants to merge 1 commit into
enabled flag (backport #22632)#22648kszucs wants to merge 1 commit into
Conversation
Content-defined chunking (CDC) write options were added in apache#21110 and have not been released yet (current workspace is 53.x; CDC is slated for 54.0.0), so the config and proto surfaces can still be changed freely. This reworks it before it ships. What changes: * Rename the `ParquetOptions` field `use_content_defined_chunking` -> `content_defined_chunking`. * `CdcOptions` becomes a plain `config_namespace!` with an explicit `enabled: bool` field alongside the chunking parameters, and the field is a bare `CdcOptions` (no longer `Option<CdcOptions>`). CDC is on iff `content_defined_chunking.enabled` is true. Add `CdcOptions::enabled()` / `CdcOptions::disabled()` shorthand constructors. * Drop the bespoke `impl ConfigField for CdcOptions` / `impl ConfigField for Option<CdcOptions>` and the `#[expect(clippy::should_implement_trait)]` workaround that backed the old bare-boolean form. Everything is now generated by the macro. * Add an `enabled` field to the proto `CdcOptions` message so the proto <-> config mapping is a direct field copy, dropping the previous presence-encoding and the zero-sentinel fallback for the chunk sizes. Why this is better: * Naming matches parquet-rs. parquet's `WriterProperties` exposes `content_defined_chunking()` / `set_content_defined_chunking(...)` with no `use_` prefix; the field name now lines up across the boundary. * Explicit, not magic. CDC is toggled with a real `content_defined_chunking.enabled = true|false` key instead of a special bare-boolean parse, and setting a chunking parameter no longer silently turns CDC on. * No order-dependence on the SQL side. Format options in `COPY ... OPTIONS` and `CREATE EXTERNAL TABLE ... OPTIONS` are applied from a `HashMap`, i.e. in non-deterministic order. With a separate `enabled` flag, the flag and the parameters are set independently, so the resolved config never depends on the order in which the keys happen to be applied. * Simpler. No hand-written `ConfigField` impls, no clippy hack, and the proto serialization is a plain field copy in both directions. Tests, generated config docs, and the information_schema snapshot are updated accordingly; a new `parquet_cdc_config.slt` documents the resolution behavior (enable toggle, parameter-does-not-enable, order independence).
enabled flag (backport #22632)enabled flag (backport #22632)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
enabledflag #22632 tobranch-54.Rationale for this change
Content-defined chunking (CDC) write options were added in #21110 and are slated for the 54.0.0 release. This backports the refactor in #22632 so the config/proto surface ships in its final form, before the release goes out.
The CDC options previously worked as
use_content_defined_chunking: Option<CdcOptions>with aConfigFieldimpl that accepted a bareuse_content_defined_chunking = true|falseand otherwise enabled CDC implicitly when any sub-field was set. This has a few problems:WriterPropertiesexposescontent_defined_chunking()/set_content_defined_chunking(Option<CdcOptions>)with nouse_prefix.COPY ... OPTIONS/CREATE EXTERNAL TABLE ... OPTIONSare applied from aHashMap(non-deterministic order). With the old bare-boolean form, mixing... = falsewith a sub-field could resolve to enabled or disabled depending on iteration order.ConfigFieldimpls and a#[expect(clippy::should_implement_trait)]workaround, plus a zero-sentinel fallback in the proto mapping.Since CDC is unreleased, the config/proto surface can still be changed freely.
What changes are included in this PR?
ParquetOptionsfielduse_content_defined_chunking->content_defined_chunking(matches parquet-rs).CdcOptionsa plainconfig_namespace!with an explicitenabled: boolfield alongside the chunking parameters; the field is a bareCdcOptions(no longerOption<CdcOptions>). CDC is on iffcontent_defined_chunking.enabledis true. Setting a parameter no longer implicitly enables CDC, and the result is independent of key order.CdcOptions::enabled()/CdcOptions::disabled()shorthand constructors.ConfigFieldimpls and theshould_implement_traitworkaround — all generated by the macro now.enabledfield to the protoCdcOptionsmessage so the proto <-> config mapping is a plain field copy in both directions.information_schemasnapshot, and addparquet_cdc_config.sltdocumenting the resolution behavior.Are these changes tested?
Yes —
datafusion-commonconfig + writer unit tests,datafusion-proto-commonproto round-trip tests,datafusion/coreparquet integration tests, and sqllogictest (parquet_cdc.slt+ newparquet_cdc_config.slt). Cherry-pick applied cleanly ontobranch-54; affected crates build and the CDC unit tests pass.Are there any user-facing changes?
Yes, but only to the unreleased CDC options:
datafusion.execution.parquet.use_content_defined_chunking->datafusion.execution.parquet.content_defined_chunking.enabled(plus.min_chunk_size/.max_chunk_size/.norm_level).content_defined_chunking.enabled = true|false.No released API is affected.
🤖 Generated with Claude Code