bench(parquet): add row filter strategy baseline cases#10135
Conversation
This reverts commit a717d7f.
|
Hi @alamb , could you take a look at this PR when you have a chance? It is a relatively small benchmark-only PR for Parquet row-filter/materialization policy coverage, with no production reader behavior changes. The current checks are green. |
Yes, sorry @hhhizzz -- I will do so The notion of what is a 'relatively small' PR has grown massively since even 9 months ago 😆 😭
|
alamb
left a comment
There was a problem hiding this comment.
Thank you @hhhizzz --- this is quite a comprehensive set of benchmarks -- however, I wonder if we really need all this coverage?
For example, there are 27 variants of FilterType -- I see there are some comments about some of the filter shapes matching particular TPCH / TPCDS predicate shapes, but I am trying ti figure out why a benchmark that compares all the different is going to be helpful in the long run to avoid regressions -- I worry taht this benchmark will generate so much data that we will find it hard to run / reason about
It seems like this benchmark may be most useful a development/tuning benchmark as it helps establish baseline timings for row-filter materialization policy choices: That is useful when designing future heuristics but it is hard to grok the results
Are there any important filter cases we should add to arrow_row_filter?
| /// FilterType encapsulates the different filter comparisons. | ||
| /// The variants correspond to the different filter patterns. | ||
| #[derive(Clone, Copy, Debug)] | ||
| pub(crate) enum FilterType { |
There was a problem hiding this comment.
this is similar to arrow_reader_row_filter -
arrow-rs/parquet/benches/arrow_reader_row_filter.rs
Lines 322 to 452 in cf450b5
Can we consolidate into a shared module like parquet/benches/row_filter_fixture.rs ?
There was a problem hiding this comment.
Done. I extracted the shared synthetic reader fixture into parquet/benches/arrow_reader_common/mod.rs and wired both arrow_reader_row_filter.rs and arrow_reader_materialization_policy.rs through it.
I used a bench-local module directory instead of a flat row_filter_fixture.rs so the shared helpers stay scoped to these reader benches. I can rename it if you prefer the flatter file name.
| ) | ||
| .await | ||
| } | ||
| AsyncStrategy::PushdownMask => { |
There was a problem hiding this comment.
the only thing different in each of these branches is the selection policy, right? We could probably collapse them significantly
There was a problem hiding this comment.
Done. The async strategy handling is now collapsed through AsyncStrategy::row_selection_policy(), so the pushdown path is shared and only the full post-filter path remains separate.
I took another pass at narrowing the benchmark scope. The materialization-policy target is now explicitly focused on policy-boundary/tuning cases rather than being a broad row-filter regression matrix:
For If this still feels too broad, I can trim the materialization-policy target further, but the current split is intended to keep generic row-filter behavior in |

Which issue does this PR close?
Rationale for this change
This PR is the first smaller PR split out from #9956 ("Optimize parquet row filter auto strategy with adaptive fallback").
The goal is to land the benchmark coverage first, before changing row-filter planning or execution behavior. This gives follow-up PRs a stable benchmark baseline already on
main, making it easier to compare each later behavior change against the same benchmark cases.Planned split from #9956:
What changes are included in this PR?
This PR adds benchmark coverage only. The diff is limited to benchmark targets under
parquet/benches, with no changes to production reader code or public APIs.It extends
arrow_reader_row_filterwith:RowSelectionPolicy::Auto;Selectors;Mask;RowFilter.It also extends
row_selection_cursorwith shape-focused selector/mask cases that vary:This PR intentionally does not change production reader behavior.
Are these changes tested?
Yes. This PR was validated with:
No benchmark result is claimed in this PR. The purpose is to add baseline benchmark coverage so later PRs can report comparable performance evidence.
Are there any user-facing changes?
No. This only changes benchmark code.