Support byte-based row group size limit (max_row_group_bytes) in ParquetOptions #22650

@Satyr09

Is your feature request related to a problem or challenge?

DataFusion's Parquet writer only exposes a row-count limit for row group sizing, via ParquetOptions.max_row_group_size (datafusion.execution.parquet.max_row_group_size, default 1M rows). There is no way to bound a row group by bytes.

Row count can be a poor proxy for row group size, because bytes per row vary widely with schema width. The same max_row_group_size = 1M can yield a small row group for a narrow schema and a multi-hundred-MB row group for a wide one.
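To make the spread concrete, here is a back-of-the-envelope sketch. The per-row widths (16 bytes and 400 bytes) are hypothetical values chosen for illustration, and the estimate ignores encoding and compression, so treat the numbers as rough orders of magnitude rather than measurements.

```rust
/// Rough uncompressed size of a row group: rows times average bytes per row.
/// Ignores encoding, compression, and metadata overhead.
fn estimated_row_group_bytes(rows: u64, avg_bytes_per_row: u64) -> u64 {
    rows * avg_bytes_per_row
}

fn main() {
    // Approximates the 1M-row default of max_row_group_size.
    let max_row_group_size: u64 = 1_000_000;

    // Hypothetical schemas: a narrow one (~16 bytes/row) and a wide one (~400 bytes/row).
    let narrow = estimated_row_group_bytes(max_row_group_size, 16);
    let wide = estimated_row_group_bytes(max_row_group_size, 400);

    println!("narrow schema: ~{} MB per row group", narrow / 1_000_000);
    println!("wide schema:   ~{} MB per row group", wide / 1_000_000);
}
```

Under these assumptions the same row-count limit produces row groups that differ in size by a factor of 25.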

Describe the solution you'd like

Add an optional max_row_group_bytes to ParquetOptions, wired to WriterPropertiesBuilder::set_max_row_group_bytes.
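As a rough illustration of the intended semantics, the sketch below models a writer that closes a row group as soon as either the row limit or the optional byte limit is reached. `RowGroupLimits` and `split_into_row_groups` are hypothetical names invented for this example, not DataFusion or parquet APIs, and the real writer works on batches with its own size estimates, so its exact flush points may differ.

```rust
/// Hypothetical limits modelled on the proposal; not DataFusion's actual types.
struct RowGroupLimits {
    /// Existing row-count limit (max_row_group_size).
    max_rows: usize,
    /// Proposed optional byte limit (max_row_group_bytes); None means unbounded.
    max_bytes: Option<usize>,
}

/// Given the size in bytes of each row, returns the number of rows placed in
/// each row group. A group is closed once either limit is reached.
fn split_into_row_groups(row_sizes: &[usize], limits: &RowGroupLimits) -> Vec<usize> {
    let mut groups = Vec::new();
    let (mut rows, mut bytes) = (0usize, 0usize);
    for &size in row_sizes {
        rows += 1;
        bytes += size;
        let rows_full = rows >= limits.max_rows;
        let bytes_full = limits.max_bytes.map_or(false, |max| bytes >= max);
        if rows_full || bytes_full {
            groups.push(rows);
            rows = 0;
            bytes = 0;
        }
    }
    // Flush any trailing partial group.
    if rows > 0 {
        groups.push(rows);
    }
    groups
}

fn main() {
    let row_sizes = vec![100; 10]; // ten rows of 100 bytes each

    let rows_only = RowGroupLimits { max_rows: 4, max_bytes: None };
    println!("{:?}", split_into_row_groups(&row_sizes, &rows_only));

    let with_bytes = RowGroupLimits { max_rows: 4, max_bytes: Some(250) };
    println!("{:?}", split_into_row_groups(&row_sizes, &with_bytes));
}
```

Leaving `max_bytes` as `None` reproduces today's row-count-only behavior, which is why an optional field keeps the change backward compatible.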

Describe alternatives you've considered

No response

Additional context

The underlying setter, WriterPropertiesBuilder::set_max_row_group_bytes, is already available on DataFusion main, so no dependency bump is required. I have an implementation ready (config field, WriterPropertiesBuilder wiring, round-trip tests, and docs) and can open a PR against this issue.

Labels: enhancement (New feature or request)