bench(parquet): add row filter strategy baseline cases#10135

Open
hhhizzz wants to merge 18 commits into
apache:mainfrom
hhhizzz:codex/parquet-reader-bench-baseline

hhhizzz commented Jun 12, 2026

Contributor

Which issue does this PR close?

Rationale for this change

This PR is the first smaller PR split out from #9956 ("Optimize parquet row filter auto strategy with adaptive fallback").

The goal is to land the benchmark coverage first, before changing row-filter planning or execution behavior. This gives follow-up PRs a stable benchmark baseline already on main, making it easier to compare each later behavior change against the same benchmark cases.

Planned split from #9956:

  1. Add benchmark baseline cases. This PR.
  2. Split row-selection strategy / sparse mask correctness changes.
  3. Add post-filter execution primitives.
  4. Add Auto policy / adaptive materialization core.
  5. Add policy refinements for projected predicates, fixed-prefix guards, and cacheable predicate cases.

What changes are included in this PR?

This PR adds benchmark coverage only. The diff is limited to benchmark targets under parquet/benches, with no changes to production reader code or public APIs.

It extends arrow_reader_row_filter with:

  • strategy comparison cases for:
    • manual full-scan post-filtering;
    • current RowSelectionPolicy::Auto;
    • explicit Selectors;
    • explicit Mask;
  • focused row-filter shapes inspired by ClickBench and TPC-DS workloads;
  • projected-predicate cases;
  • count-only / filter-only / fixed-width / variable-width projection cases;
  • nested whole-root output benchmark coverage;
  • projected scan focus cases that do not construct a RowFilter.
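As a rough illustration of what the strategy comparison measures, the sketch below contrasts full-scan post-filtering with predicate pushdown on an in-memory stand-in. `Columns`, `post_filter`, and `pushdown` are hypothetical names made up for this sketch; they are not part of the parquet crate or of the benchmark code in this PR, and the "decoded" count is only a proxy for decode work.

```rust
/// Hypothetical two-column "file": a predicate column and a payload column.
struct Columns {
    keys: Vec<i64>,
    payload: Vec<String>,
}

/// Full-scan post-filtering: decode every payload row, then drop the rows
/// that fail the predicate. Returns (selected rows, payload values decoded).
fn post_filter(cols: &Columns, pred: impl Fn(i64) -> bool) -> (Vec<String>, usize) {
    // Stand-in for decoding the whole payload column.
    let decoded: Vec<String> = cols.payload.iter().cloned().collect();
    let decoded_count = decoded.len();
    let selected = decoded
        .into_iter()
        .zip(cols.keys.iter())
        .filter(|(_, k)| pred(**k))
        .map(|(v, _)| v)
        .collect();
    (selected, decoded_count)
}

/// Pushdown: evaluate the predicate on the key column first, then decode
/// only the payload rows that were selected.
fn pushdown(cols: &Columns, pred: impl Fn(i64) -> bool) -> (Vec<String>, usize) {
    let selected_rows: Vec<usize> = cols
        .keys
        .iter()
        .enumerate()
        .filter(|(_, k)| pred(**k))
        .map(|(i, _)| i)
        .collect();
    let selected: Vec<String> = selected_rows
        .iter()
        .map(|&i| cols.payload[i].clone())
        .collect();
    let decoded_count = selected.len();
    (selected, decoded_count)
}
```

Both functions return the same rows; only the amount of payload work differs, which is the kind of gap the strategy cases are meant to expose for different filter shapes.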

It also extends row_selection_cursor with shape-focused selector/mask cases that vary:

  • selected-run length;
  • selectivity;
  • primitive vs variable-width payloads.
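To make the "selected-run length" and "selectivity" axes concrete, here is a minimal sketch that converts a boolean mask into run-length selectors and measures both properties. `Run`, `mask_to_runs`, `selectivity`, and `clustered_mask` are hypothetical helpers for illustration only; they are not the parquet crate's `RowSelection` types and not the fixture code added by this PR.

```rust
/// Hypothetical stand-in for a run-length selector: select or skip `len` rows.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Run {
    select: bool,
    len: usize,
}

/// Collapse a per-row boolean mask into alternating select/skip runs.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &bit in mask {
        if let Some(last) = runs.last_mut() {
            if last.select == bit {
                // Same kind as the previous row: extend the current run.
                last.len += 1;
                continue;
            }
        }
        runs.push(Run { select: bit, len: 1 });
    }
    runs
}

/// Fraction of rows selected by the mask (0.0 for an empty mask).
fn selectivity(mask: &[bool]) -> f64 {
    if mask.is_empty() {
        return 0.0;
    }
    mask.iter().filter(|&&b| b).count() as f64 / mask.len() as f64
}

/// Build a clustered mask: `runs` selected runs of `run_len` rows each,
/// spaced evenly over `total` rows. Assumes `runs > 0`.
fn clustered_mask(total: usize, runs: usize, run_len: usize) -> Vec<bool> {
    assert!(runs > 0, "need at least one run");
    let stride = total / runs;
    (0..total).map(|i| i % stride < run_len).collect()
}
```

For example, `clustered_mask(500_000, 50, 1_000)` selects 10% of rows and collapses to 100 runs, while a strided mask that selects every 100th row of 500,000 selects only 1% but needs 10,000 runs. That difference in run count at similar or lower selectivity is what a selector-versus-mask comparison is sensitive to.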

This PR intentionally does not change production reader behavior.

Are these changes tested?

Yes. This PR was validated with:

cargo fmt -- parquet/benches/arrow_reader_row_filter.rs parquet/benches/row_selection_cursor.rs
cargo check -p parquet --bench row_selection_cursor --features arrow
cargo check -p parquet --bench arrow_reader_row_filter --features arrow,async
git diff --check

No benchmark result is claimed in this PR. The purpose is to add baseline benchmark coverage so later PRs can report comparable performance evidence.

Are there any user-facing changes?

No. This only changes benchmark code.

github-actions Bot added the parquet (Changes to the parquet crate) label Jun 12, 2026

hhhizzz commented Jun 21, 2026

Contributor Author

Hi @alamb, could you take a look at this PR when you have a chance? It is a relatively small benchmark-only PR for Parquet row-filter/materialization policy coverage, with no production reader behavior changes. The current checks are green.

alamb commented Jun 22, 2026

Contributor

> Hi @alamb, could you take a look at this PR when you have a chance? It is a relatively small benchmark-only PR for Parquet row-filter/materialization policy coverage, with no production reader behavior changes. The current checks are green.

Yes, sorry @hhhizzz -- I will do so

The notion of what is a 'relatively small' PR has grown massively since even 9 months ago 😆 😭

[Image: Screenshot 2026-06-22 at 10:25:06 AM]

alamb left a comment

Contributor

Thank you @hhhizzz --- this is quite a comprehensive set of benchmarks -- however, I wonder if we really need all this coverage?

For example, there are 27 variants of FilterType -- I see there are some comments about some of the filter shapes matching particular TPCH / TPCDS predicate shapes, but I am trying to figure out why a benchmark that compares all the different variants is going to be helpful in the long run to avoid regressions -- I worry that this benchmark will generate so much data that we will find it hard to run / reason about.

It seems like this benchmark may be most useful as a development/tuning benchmark, as it helps establish baseline timings for row-filter materialization policy choices. That is useful when designing future heuristics, but the results are hard to grok.

Are there any important filter cases we should add to arrow_row_filter?

/// FilterType encapsulates the different filter comparisons.
/// The variants correspond to the different filter patterns.
#[derive(Clone, Copy, Debug)]
pub(crate) enum FilterType {

Contributor

This is similar to arrow_reader_row_filter:

pub(crate) enum FilterType {
    /// point lookup: selects a single row in 500K.
    /// ```text
    /// ┌───────────────┐ ┌───────────────┐
    /// │               │ │               │
    /// │               │ │      ...      │
    /// │               │ │               │
    /// │               │ │               │
    /// │      ...      │ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │               │ │               │
    /// │               │ │      ...      │
    /// │               │ │               │
    /// │               │ │               │
    /// └───────────────┘ └───────────────┘
    /// ```
    PointLookup,
    /// selective (1%) unclustered filter: approx 5K selected rows in 500K.
    /// ```text
    /// ┌───────────────┐ ┌───────────────┐
    /// │      ...      │ │               │
    /// │               │ │               │
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │               │
    /// │               │ │      ...      │
    /// │               │ │               │
    /// │               │ │               │
    /// │      ...      │ │               │
    /// │               │ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │               │ │               │
    /// └───────────────┘ └───────────────┘
    /// ```
    SelectiveUnclustered,
    /// moderately selective (10%) clustered filter: 50 selected runs of 1K
    /// rows each in 500K.
    /// ```text
    /// ┌───────────────┐ ┌───────────────┐
    /// │               │ │               │
    /// │               │ │               │
    /// │      ...      │ │      ...      │
    /// │               │ │               │
    /// │               │ │               │
    /// │               │ │               │
    /// │               │ │               │
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// └───────────────┘ └───────────────┘
    /// ```
    ModeratelySelectiveClustered,
    /// moderately selective (~9%) unclustered filter: approx 45K selected
    /// rows in 500K.
    /// ```text
    /// ┌───────────────┐ ┌───────────────┐
    /// │      ...      │ │               │
    /// │               │ │               │
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │               │ │               │
    /// │               │ │               │
    /// │               │ │      ...      │
    /// │      ...      │ │               │
    /// │               │ │               │
    /// │               │ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// └───────────────┘ └───────────────┘
    /// ```
    ModeratelySelectiveUnclustered,
    /// unselective (99%) unclustered filter: approx 495K selected rows in
    /// 500K.
    /// ```text
    /// ┌───────────────┐ ┌───────────────┐
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │               │
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// └───────────────┘ └───────────────┘
    /// ```
    UnselectiveUnclustered,
    /// unselective (90%) clustered filter: 50 selected runs of 9K rows each
    /// in 500K.
    /// ```text
    /// ┌───────────────┐ ┌───────────────┐
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │      ...      │
    /// │               │ │               │
    /// └───────────────┘ └───────────────┘
    /// ```
    UnselectiveClustered,
    /// composite sparse filter: `SelectiveUnclustered` AND
    /// `ModeratelySelectiveClustered`, approx 0.1% selected rows in 500K.
    /// ```text
    /// ┌───────────────┐ ┌───────────────┐
    /// │               │ │               │
    /// │               │ │      ...      │
    /// │               │ │               │
    /// │               │ │               │
    /// │      ...      │ │               │
    /// │               │ │               │
    /// │               │ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │               │ │               │
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │               │
    /// └───────────────┘ └───────────────┘
    /// ```
    Composite,
    /// `utf8View <> ''` modeling [ClickBench] [Q21-Q27] with fragmented
    /// short string runs and sentinel values every 1K rows.
    /// ```text
    /// ┌───────────────┐ ┌───────────────┐
    /// │               │ │               │
    /// │      ...      │ │      ...      │
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │               │
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │               │
    /// │               │ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │               │ │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│
    /// │      ...      │ │      ...      │
    /// │               │ │               │
    /// │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒│ │               │
    /// └───────────────┘ └───────────────┘
    /// ```
    ///
    /// [ClickBench]: https://github.com/ClickHouse/ClickBench
    /// [Q21-Q27]: https://github.com/apache/datafusion/blob/b7177234e65cbbb2dcc04c252f6acd80bb026362/benchmarks/queries/clickbench/queries.sql#L22-L28
    Utf8ViewNonEmpty,
}

Can we consolidate into a shared module like parquet/benches/row_filter_fixture.rs?

Contributor Author

Done. I extracted the shared synthetic reader fixture into parquet/benches/arrow_reader_common/mod.rs and wired both arrow_reader_row_filter.rs and arrow_reader_materialization_policy.rs through it.

I used a bench-local module directory instead of a flat row_filter_fixture.rs so the shared helpers stay scoped to these reader benches. I can rename it if you prefer the flatter file name.

)
.await
}
AsyncStrategy::PushdownMask => {

Contributor

The only thing different in each of these branches is the selection policy, right? We could probably collapse them significantly.

Contributor Author

Done. The async strategy handling is now collapsed through AsyncStrategy::row_selection_policy(), so the pushdown path is shared and only the full post-filter path remains separate.
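A minimal sketch of what such a collapse could look like, assuming the helper returns an optional policy. Only `AsyncStrategy::PushdownMask` and the method name `row_selection_policy` come from this thread; the other variant names, the `Policy` stand-in enum, the `Option` return type, and `describe` are assumptions for illustration and may not match the actual benchmark code.

```rust
/// Local stand-in for the reader's selection policy (not the parquet crate's type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Policy {
    Auto,
    Selectors,
    Mask,
}

/// Hypothetical variant set; only `PushdownMask` is visible in the diff context.
#[derive(Debug, Clone, Copy)]
enum AsyncStrategy {
    FullPostFilter,
    PushdownAuto,
    PushdownSelectors,
    PushdownMask,
}

impl AsyncStrategy {
    /// `None` means the filter is not pushed into the reader at all.
    fn row_selection_policy(self) -> Option<Policy> {
        match self {
            AsyncStrategy::FullPostFilter => None,
            AsyncStrategy::PushdownAuto => Some(Policy::Auto),
            AsyncStrategy::PushdownSelectors => Some(Policy::Selectors),
            AsyncStrategy::PushdownMask => Some(Policy::Mask),
        }
    }
}

/// One shared pushdown path parameterized by policy, one separate post-filter path.
fn describe(strategy: AsyncStrategy) -> String {
    match strategy.row_selection_policy() {
        Some(policy) => format!("pushdown with {:?}", policy),
        None => "full scan then post-filter".to_string(),
    }
}
```

With this shape, adding another pushdown strategy only touches the mapping in `row_selection_policy`, not the code that drives the reader.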

hhhizzz commented Jun 24, 2026

Contributor Author

> Thank you @hhhizzz --- this is quite a comprehensive set of benchmarks -- however, I wonder if we really need all this coverage?
>
> For example, there are 27 variants of FilterType -- I see there are some comments about some of the filter shapes matching particular TPCH / TPCDS predicate shapes, but I am trying to figure out why a benchmark that compares all the different variants is going to be helpful in the long run to avoid regressions -- I worry that this benchmark will generate so much data that we will find it hard to run / reason about.
>
> It seems like this benchmark may be most useful as a development/tuning benchmark, as it helps establish baseline timings for row-filter materialization policy choices. That is useful when designing future heuristics, but the results are hard to grok.
>
> Are there any important filter cases we should add to arrow_row_filter?

I took another pass at narrowing the benchmark scope.

The materialization-policy target is now explicitly focused on policy-boundary/tuning cases rather than being a broad row-filter regression matrix:

  • reduced the materialization-policy benchmark to one group: arrow_reader_materialization_policy_async_focus
  • pruned duplicate/weak cases, including the small scalar-prefix case
  • moved chained predicate-order coverage out of materialization and into arrow_reader_row_filter_async_predicate_order_focus
  • kept the materialization cases as 21 named reader-level shapes rather than a full Cartesian product
  • extracted the shared synthetic reader fixture so the row-filter and materialization benches use the same setup

For arrow_reader_row_filter, the important missing case I added is chained RowFilter predicate ordering: fixed-width predicate before var-width predicate, and the reverse order. The existing Composite case evaluated both predicates inside one ArrowPredicateFn, so it did not exercise pruning between sequential predicates.
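The difference between a composite predicate and chained predicates can be sketched without the parquet crate. In the hypothetical model below, `composite` evaluates both sub-predicates over every row (as a single vectorized predicate would), while `chained` evaluates the second predicate only on rows that survived the first. The function names and the evaluation counters are assumptions made for illustration, not the reader's implementation.

```rust
/// Both sub-predicates run over all `n` rows, then the results are ANDed.
/// Returns (selected row indices, evaluations of `a`, evaluations of `b`).
fn composite<A, B>(n: usize, a: A, b: B) -> (Vec<usize>, usize, usize)
where
    A: Fn(usize) -> bool,
    B: Fn(usize) -> bool,
{
    let mask_a: Vec<bool> = (0..n).map(|i| a(i)).collect();
    let mask_b: Vec<bool> = (0..n).map(|i| b(i)).collect();
    let selected = (0..n).filter(|&i| mask_a[i] && mask_b[i]).collect();
    (selected, n, n)
}

/// `first` runs over all `n` rows; `second` runs only on the survivors.
/// Returns (selected row indices, evaluations of `first`, evaluations of `second`).
fn chained<A, B>(n: usize, first: A, second: B) -> (Vec<usize>, usize, usize)
where
    A: Fn(usize) -> bool,
    B: Fn(usize) -> bool,
{
    let survivors: Vec<usize> = (0..n).filter(|&i| first(i)).collect();
    let second_evals = survivors.len();
    let selected = survivors.into_iter().filter(|&i| second(i)).collect();
    (selected, n, second_evals)
}
```

With a fixed-width predicate that keeps 2 of 8 rows and a var-width predicate that keeps 4 of 8, running the fixed-width predicate first means the var-width predicate sees 2 rows; in the reverse order the fixed-width predicate sees 4 rows. The selected rows are identical either way, so only the per-predicate work changes with ordering.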

If this still feels too broad, I can trim the materialization-policy target further, but the current split is intended to keep generic row-filter behavior in arrow_reader_row_filter and leave the separate materialization target for future Auto policy tuning.
