
[branch-54] refactor: give parquet CDC options an explicit enabled flag (backport #22632) #22648

Open
kszucs wants to merge 1 commit into apache:branch-54 from kszucs:backport-22632-branch-54

@kszucs (Member) commented May 30, 2026

Which issue does this PR close?

Rationale for this change

Content-defined chunking (CDC) write options were added in #21110 and are slated for the 54.0.0 release. This backports the refactor in #22632 so the config/proto surface ships in its final form, before the release goes out.

The CDC options were previously modeled as use_content_defined_chunking: Option<CdcOptions>, with a ConfigField impl that accepted a bare use_content_defined_chunking = true|false and otherwise enabled CDC implicitly whenever any sub-field was set. This has a few problems:

  • Naming diverges from parquet-rs. WriterProperties exposes content_defined_chunking() / set_content_defined_chunking(Option<CdcOptions>) with no use_ prefix.
  • Implicit / order-dependent on the SQL side. Format options in COPY ... OPTIONS / CREATE EXTERNAL TABLE ... OPTIONS are applied from a HashMap (non-deterministic order). With the old bare-boolean form, mixing ... = false with a sub-field could resolve to enabled or disabled depending on iteration order.
  • Extra machinery. Supporting the bare boolean required hand-written ConfigField impls and a #[expect(clippy::should_implement_trait)] workaround, plus a zero-sentinel fallback in the proto mapping.
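The order dependence described above can be reproduced with a small standalone model. This is only a sketch of the behavior as the description states it, not the removed DataFusion code: `OldParquetOptions`, `OldCdcParams`, and `apply_old` are hypothetical names, and only one sub-field is modeled.

```rust
/// Hypothetical stand-in for the old chunking parameters (not the DataFusion type).
#[derive(Debug, Clone, PartialEq, Default)]
struct OldCdcParams {
    min_chunk_size: usize,
}

/// Model of the old field: `None` means CDC is off, `Some(_)` means CDC is on.
#[derive(Debug, Clone, PartialEq, Default)]
struct OldParquetOptions {
    use_content_defined_chunking: Option<OldCdcParams>,
}

impl OldParquetOptions {
    /// Applies one `key = value` pair the way the old form is described:
    /// a bare boolean toggles CDC, and setting any sub-field enables it.
    fn set(&mut self, key: &str, value: &str) {
        match key {
            "use_content_defined_chunking" => {
                if value == "true" {
                    // Keep existing parameters if CDC is already on.
                    self.use_content_defined_chunking
                        .get_or_insert_with(OldCdcParams::default);
                } else {
                    self.use_content_defined_chunking = None;
                }
            }
            "use_content_defined_chunking.min_chunk_size" => {
                // Implicit enable: the sub-field creates the options if absent.
                let params = self
                    .use_content_defined_chunking
                    .get_or_insert_with(OldCdcParams::default);
                params.min_chunk_size = value.parse().expect("integer chunk size");
            }
            _ => panic!("unknown key: {}", key),
        }
    }
}

/// Applies the pairs in the given order, mimicking one HashMap iteration order.
fn apply_old(pairs: &[(&str, &str)]) -> OldParquetOptions {
    let mut opts = OldParquetOptions::default();
    for &(key, value) in pairs {
        opts.set(key, value);
    }
    opts
}
```

In this model, applying `= false` first and the sub-field second leaves CDC enabled, while the reverse order leaves it disabled, which is the ambiguity the refactor removes.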

Since CDC is unreleased, the config/proto surface can still be changed freely.

What changes are included in this PR?

  • Rename the ParquetOptions field use_content_defined_chunking -> content_defined_chunking (matches parquet-rs).
  • Make CdcOptions a plain config_namespace! with an explicit enabled: bool field alongside the chunking parameters; the field is a bare CdcOptions (no longer Option<CdcOptions>). CDC is on iff content_defined_chunking.enabled is true. Setting a parameter no longer implicitly enables CDC, and the result is independent of key order.
  • Add CdcOptions::enabled() / CdcOptions::disabled() shorthand constructors.
  • Drop the ConfigField impls and the should_implement_trait workaround — all generated by the macro now.
  • Add an enabled field to the proto CdcOptions message so the proto <-> config mapping is a plain field copy in both directions.
  • Update unit tests, regenerate config docs + the information_schema snapshot, and add parquet_cdc_config.slt documenting the resolution behavior.
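As a rough illustration of the new shape, the sketch below models a `CdcOptions` struct with an explicit `enabled` flag and the two shorthand constructors. It is a hand-written stand-in, not the `config_namespace!` output: the field types, the default values, and the `set` / `apply` helpers are assumptions made for the example.

```rust
/// Hypothetical model of the new options. Field names follow the PR text;
/// the types and default values are placeholders, not DataFusion's.
#[derive(Debug, Clone, PartialEq)]
struct CdcOptions {
    enabled: bool,
    min_chunk_size: usize,
    max_chunk_size: usize,
    norm_level: i32,
}

impl Default for CdcOptions {
    fn default() -> Self {
        CdcOptions {
            enabled: false,
            min_chunk_size: 256 * 1024,  // placeholder default
            max_chunk_size: 1024 * 1024, // placeholder default
            norm_level: 0,               // placeholder default
        }
    }
}

impl CdcOptions {
    /// Shorthand: default parameters with CDC switched on.
    fn enabled() -> Self {
        CdcOptions { enabled: true, ..Default::default() }
    }

    /// Shorthand: default parameters with CDC switched off.
    fn disabled() -> Self {
        Default::default()
    }

    /// Each key writes exactly one field, so no key can change `enabled`
    /// except `enabled` itself.
    fn set(&mut self, key: &str, value: &str) -> Result<(), String> {
        let bad = |e: String| format!("{}: {}", key, e);
        match key {
            "enabled" => {
                self.enabled = value.parse::<bool>().map_err(|e| bad(e.to_string()))?
            }
            "min_chunk_size" => {
                self.min_chunk_size = value.parse::<usize>().map_err(|e| bad(e.to_string()))?
            }
            "max_chunk_size" => {
                self.max_chunk_size = value.parse::<usize>().map_err(|e| bad(e.to_string()))?
            }
            "norm_level" => {
                self.norm_level = value.parse::<i32>().map_err(|e| bad(e.to_string()))?
            }
            _ => return Err(format!("unknown key: {}", key)),
        }
        Ok(())
    }
}

/// Applies the pairs in the given order, starting from the defaults.
fn apply(pairs: &[(&str, &str)]) -> Result<CdcOptions, String> {
    let mut opts = CdcOptions::default();
    for &(key, value) in pairs {
        opts.set(key, value)?;
    }
    Ok(opts)
}
```

Because every key maps to a single field, applying `enabled` and `min_chunk_size` in either order yields the same struct, and setting only a parameter leaves `enabled` at its default of `false`.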

Are these changes tested?

Yes — datafusion-common config + writer unit tests, datafusion-proto-common proto round-trip tests, datafusion/core parquet integration tests, and sqllogictest (parquet_cdc.slt + new parquet_cdc_config.slt). Cherry-pick applied cleanly onto branch-54; affected crates build and the CDC unit tests pass.

Are there any user-facing changes?

Yes, but only to the unreleased CDC options:

  • Config key datafusion.execution.parquet.use_content_defined_chunking -> datafusion.execution.parquet.content_defined_chunking.enabled (plus .min_chunk_size / .max_chunk_size / .norm_level).
  • The bare-boolean form is removed; enable/disable via content_defined_chunking.enabled = true|false.

No released API is affected.

🤖 Generated with Claude Code

Content-defined chunking (CDC) write options were added in apache#21110 and have
not been released yet (current workspace is 53.x; CDC is slated for 54.0.0),
so the config and proto surfaces can still be changed freely. This reworks them
before they ship.

What changes:

* Rename the `ParquetOptions` field `use_content_defined_chunking` ->
  `content_defined_chunking`.
* `CdcOptions` becomes a plain `config_namespace!` with an explicit
  `enabled: bool` field alongside the chunking parameters, and the field is a
  bare `CdcOptions` (no longer `Option<CdcOptions>`). CDC is on iff
  `content_defined_chunking.enabled` is true. Add `CdcOptions::enabled()` /
  `CdcOptions::disabled()` shorthand constructors.
* Drop the bespoke `impl ConfigField for CdcOptions` /
  `impl ConfigField for Option<CdcOptions>` and the
  `#[expect(clippy::should_implement_trait)]` workaround that backed the old
  bare-boolean form. Everything is now generated by the macro.
* Add an `enabled` field to the proto `CdcOptions` message so the proto <->
  config mapping is a direct field copy, dropping the previous
  presence-encoding and the zero-sentinel fallback for the chunk sizes.
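To make the "direct field copy" point concrete, here is a minimal sketch of a symmetric mapping once both sides carry an `enabled` field. `ConfigCdcOptions` and `ProtoCdcOptions` are hypothetical stand-ins for the config struct and the generated proto message; the integer widths are assumptions.

```rust
/// Hypothetical config-side struct (stand-in for the real `CdcOptions`).
#[derive(Debug, Clone, PartialEq)]
struct ConfigCdcOptions {
    enabled: bool,
    min_chunk_size: usize,
    max_chunk_size: usize,
    norm_level: i32,
}

/// Hypothetical proto-side message; the field widths are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct ProtoCdcOptions {
    enabled: bool,
    min_chunk_size: u64,
    max_chunk_size: u64,
    norm_level: i32,
}

impl From<ConfigCdcOptions> for ProtoCdcOptions {
    /// Config -> proto: every field is copied, nothing is encoded by presence.
    fn from(c: ConfigCdcOptions) -> Self {
        ProtoCdcOptions {
            enabled: c.enabled,
            min_chunk_size: c.min_chunk_size as u64,
            max_chunk_size: c.max_chunk_size as u64,
            norm_level: c.norm_level,
        }
    }
}

impl From<ProtoCdcOptions> for ConfigCdcOptions {
    /// Proto -> config: the same copy in reverse, with no zero-sentinel checks.
    fn from(p: ProtoCdcOptions) -> Self {
        ConfigCdcOptions {
            enabled: p.enabled,
            min_chunk_size: p.min_chunk_size as usize,
            max_chunk_size: p.max_chunk_size as usize,
            norm_level: p.norm_level,
        }
    }
}
```

With this shape, a disabled configuration that still carries custom chunk sizes survives a round trip unchanged, since the sizes no longer double as an on/off signal.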

Why this is better:

* Naming matches parquet-rs. parquet's `WriterProperties` exposes
  `content_defined_chunking()` / `set_content_defined_chunking(...)` with no
  `use_` prefix; the field name now lines up across the boundary.

* Explicit, not magic. CDC is toggled with a real
  `content_defined_chunking.enabled = true|false` key instead of a special
  bare-boolean parse, and setting a chunking parameter no longer silently turns
  CDC on.

* No order-dependence on the SQL side. Format options in `COPY ... OPTIONS`
  and `CREATE EXTERNAL TABLE ... OPTIONS` are applied from a `HashMap`, i.e. in
  non-deterministic order. With a separate `enabled` flag, the flag and the
  parameters are set independently, so the resolved config never depends on the
  order in which the keys happen to be applied.

* Simpler. No hand-written `ConfigField` impls, no clippy hack, and the proto
  serialization is a plain field copy in both directions.

Tests, generated config docs, and the information_schema snapshot are updated
accordingly; a new `parquet_cdc_config.slt` documents the resolution behavior
(enable toggle, parameter-does-not-enable, order independence).
@kszucs changed the title from "refactor: give parquet CDC options an explicit enabled flag (backport #22632)" to "[branch-54] refactor: give parquet CDC options an explicit enabled flag (backport #22632)" on May 30, 2026