Describe the enhancement requested
ParquetFileReader already computes, per row group, the RowRanges that may pass the configured filter — but only as a private step on the internal read path (getRowRanges(int) feeding readFilteredRowGroup). There is also no supported way to ask how many compressed bytes a given set of row ranges corresponds to without actually reading the column data.
This makes it hard for an external reader to plan I/O up front. A concrete motivating case is a materialization path (e.g. a Spark-side scanner) that wants to (a) obtain the column-index-derived row ranges that may pass the filter for a row group, and (b) get a metadata-only estimate of the on-disk compressed bytes those ranges correspond to for the currently requested columns — so it can size buffers and schedule reads without first touching column data.
Proposed change
Two additive APIs on ParquetFileReader, both metadata-only (they consult the column/offset indexes from the footer; no column data is read):
- public RowRanges getRowRanges(int blockIndex): promote the existing private method to public. It returns the row ranges within the row group that may pass the configured filter. When no filter is configured, it short-circuits to a RowRanges covering all rows in the row group (the previous private version asserted that a filter was present, since it was only ever reached on the filtering path).
- public long getCompressedBytesForRowRanges(int blockIndex, RowRanges rowRanges): returns the sum of compressed page sizes (OffsetIndex.getCompressedPageSize, which includes the page header) across the reader's currently requested columns, for pages whose row range intersects rowRanges. Returns 0 for empty ranges or when no columns are requested. Throws MissingOffsetIndexException if a requested column lacks an offset index.
Note: dictionary pages are not represented in OffsetIndex, so they are excluded from the sum. The result is therefore a lower bound on the actual on-disk bytes for those columns/rows, and it falls short by exactly the dictionary-page contribution.
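To make the intended semantics of the byte estimate concrete, here is a minimal, self-contained sketch for a single column. It is not the proposed implementation and does not use parquet-java classes: the class name PageByteEstimate, the method compressedBytes, and the array-based inputs are hypothetical stand-ins. The two page arrays loosely model the per-page first row index and compressed page size that an offset index carries, and row ranges are modeled as closed [from, to] pairs.

```java
public class PageByteEstimate {

    /**
     * Hypothetical helper: sums the compressed sizes of all pages of one column
     * whose row span intersects at least one of the given row ranges.
     *
     * firstRowIndex[p]      first row of page p (ascending), assumed input
     * compressedPageSize[p] compressed size of page p, header included, assumed input
     * rowCount              number of rows in the row group
     * ranges                closed ranges {from, to}, in row-group-relative row indexes
     */
    public static long compressedBytes(
            long[] firstRowIndex, long[] compressedPageSize, long rowCount, long[][] ranges) {
        long total = 0;
        for (int p = 0; p < firstRowIndex.length; p++) {
            long first = firstRowIndex[p];
            // A page ends one row before the next page starts; the last page ends at rowCount - 1.
            long last = (p + 1 < firstRowIndex.length ? firstRowIndex[p + 1] : rowCount) - 1;
            for (long[] r : ranges) {
                if (r[0] <= last && r[1] >= first) {
                    total += compressedPageSize[p];
                    break; // count each page at most once, even if several ranges hit it
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long[] firstRows = {0, 10, 20};   // pages cover rows 0-9, 10-19, 20-29
        long[] sizes = {100, 200, 300};
        long[][] ranges = {{5, 12}};      // touches the first two pages only
        System.out.println(compressedBytes(firstRows, sizes, 30, ranges)); // prints 300
    }
}
```

A whole page is counted as soon as any of its rows is selected, because pages are the smallest unit that can be fetched and decompressed. Summing this per-column result over the requested columns gives the kind of total the proposed method would return, still excluding dictionary pages as noted above.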
Scope
- Additive, Core only. The only behavioral change is that getRowRanges now handles the no-filter case (returning all rows) instead of asserting; all existing callers already guard with a filter check, so they are unaffected.
- No user-facing API removal.
This is the second of two related enhancements opening up RowRanges/reader APIs needed by the materialization feature described above. The first (#3596) added RowRanges.Builder for incremental construction from selected row indices.
Component(s):
Core