Add StatisticsContext parameter to partition_statistics #21815
Open
asolimando wants to merge 2 commits into apache:main
Introduce StatisticsContext that carries pre-computed child statistics and external context for statistics computation. Change the ExecutionPlan::partition_statistics signature to accept it, and add compute_statistics() utility for bottom-up computation with automatic child stats threading. Update all ~35 in-tree ExecutionPlan implementations and ~40 call sites. Passthrough operators return ctx.child_stats() directly, transform operators use it instead of re-fetching from children, and operators that always need overall child stats (RepartitionExec, CoalescePartitionsExec, SortPreservingMergeExec, SortExec non-preserving, HashJoinExec CollectLeft/Auto, CrossJoinExec, NestedLoopJoinExec) call compute_statistics with None internally.
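The shape of this change can be sketched with a toy model. This is not DataFusion code: `Statistics` is reduced to a row count, the trait is stripped to two methods and returns no `Result`, and `Scan`, `Passthrough`, `Filter`, `selectivity_pct` and `example_plan` are hypothetical names used only for illustration. The real `compute_statistics` may also treat the `partition` argument differently per operator; here it is simply forwarded to every child.

```rust
use std::sync::Arc;

/// Toy stand-in for DataFusion's `Statistics`: only a row-count estimate.
#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    num_rows: usize,
}

/// Carries one pre-computed `Arc<Statistics>` per child, as described above.
struct StatisticsContext {
    child_stats: Vec<Arc<Statistics>>,
}

impl StatisticsContext {
    fn new(child_stats: Vec<Arc<Statistics>>) -> Self {
        Self { child_stats }
    }

    fn child_stats(&self) -> &[Arc<Statistics>] {
        &self.child_stats
    }
}

/// Simplified plan trait (assumption: the real trait has many more methods).
trait ExecutionPlan {
    fn children(&self) -> Vec<&dyn ExecutionPlan>;
    fn partition_statistics(
        &self,
        partition: Option<usize>,
        ctx: &StatisticsContext,
    ) -> Arc<Statistics>;
}

/// Bottom-up walk: compute every child's statistics first, then hand them
/// to the parent through the context.
fn compute_statistics(plan: &dyn ExecutionPlan, partition: Option<usize>) -> Arc<Statistics> {
    let child_stats: Vec<Arc<Statistics>> = plan
        .children()
        .into_iter()
        .map(|child| compute_statistics(child, partition))
        .collect();
    plan.partition_statistics(partition, &StatisticsContext::new(child_stats))
}

/// Leaf node: has no children, so it ignores the context.
struct Scan {
    rows: usize,
}

/// Passthrough operator: returns the child's statistics unchanged.
struct Passthrough {
    input: Box<dyn ExecutionPlan>,
}

/// Transform operator: scales the child's row count by a selectivity.
struct Filter {
    input: Box<dyn ExecutionPlan>,
    selectivity_pct: usize,
}

impl ExecutionPlan for Scan {
    fn children(&self) -> Vec<&dyn ExecutionPlan> {
        Vec::new()
    }
    fn partition_statistics(&self, _p: Option<usize>, _ctx: &StatisticsContext) -> Arc<Statistics> {
        Arc::new(Statistics { num_rows: self.rows })
    }
}

impl ExecutionPlan for Passthrough {
    fn children(&self) -> Vec<&dyn ExecutionPlan> {
        vec![&*self.input]
    }
    fn partition_statistics(&self, _p: Option<usize>, ctx: &StatisticsContext) -> Arc<Statistics> {
        // No re-fetch from the child: reuse what the context already holds.
        Arc::clone(&ctx.child_stats()[0])
    }
}

impl ExecutionPlan for Filter {
    fn children(&self) -> Vec<&dyn ExecutionPlan> {
        vec![&*self.input]
    }
    fn partition_statistics(&self, _p: Option<usize>, ctx: &StatisticsContext) -> Arc<Statistics> {
        let input_rows = ctx.child_stats()[0].num_rows;
        Arc::new(Statistics { num_rows: input_rows * self.selectivity_pct / 100 })
    }
}

/// Scan(1000 rows) -> Filter(25%) -> Passthrough.
fn example_plan() -> Box<dyn ExecutionPlan> {
    let scan = Box::new(Scan { rows: 1_000 });
    let filter = Box::new(Filter { input: scan, selectivity_pct: 25 });
    Box::new(Passthrough { input: filter })
}
```

Calling `compute_statistics(example_plan().as_ref(), None)` on this toy plan yields an estimate of 250 rows: 1000 rows scanned, reduced to 25% by the filter, then passed through unchanged.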
Hi @xudong963, I have opened this PR as a prerequisite for #21122, as discussed. Since this is a breaking change, I added a section under .../library-user-guide/upgrading/54.0.0.md. I checked what usually goes there, but I'd appreciate it if you could take a closer look and confirm that I captured what's expected for the upgrade guide. Looking forward to your feedback!
Which issue does this PR close?
Closes #20184
Rationale for this change
ExecutionPlan::partition_statistics forces each operator to re-fetch child statistics internally, causing exponential recomputation in deep plans and making it impossible to inject enriched statistics from external sources (e.g., expression-level analyzers, custom statistics providers).
What changes are included in this PR?
Breaking change: the ExecutionPlan::partition_statistics signature changes from (&self, partition: Option<usize>) to (&self, partition: Option<usize>, ctx: &StatisticsContext). Migration guide added to docs/source/library-user-guide/upgrading/54.0.0.md.
Add a StatisticsContext parameter to partition_statistics that carries pre-computed child statistics, and a compute_statistics() utility that walks the plan tree bottom-up, threading child statistics through the context automatically. StatisticsContext carries one Arc<Statistics> per child node and is designed to be extended with additional context (e.g., expression-level analyzers, custom statistics providers) without further signature changes.
Operator categories
- [...] DataSource trait, which has a separate partition_statistics that was not changed.
- Passthrough operators: return ctx.child_stats()[0] directly.
- Transform operators: use ctx.child_stats()[0] as input, then apply their transformation (selectivity, column projection, grouping cardinality, fetch limit, etc.).
- Partition-merging operators (CoalescePartitionsExec, SortPreservingMergeExec, SortExec with !preserve_partitioning, RepartitionExec): always need overall child stats regardless of which output partition is requested, since they merge/redistribute input partitions. These call compute_statistics(child, None) internally instead of using the context.
- [...] ctx.child_stats() is correct for both None and Some(i) cases.
- Asymmetric joins (HashJoinExec CollectLeft, CrossJoinExec, NestedLoopJoinExec): call compute_statistics(left, None) for the Some case. The right side is partitioned and uses ctx.child_stats()[1] directly. HashJoinExec Partitioned mode is symmetric (both sides use the context). HashJoinExec Auto mode needs overall stats from both sides.
- UnionExec and InterleaveExec: UnionExec uses ctx.child_stats() for the None case (reduces with stats_union). For Some(partition), Union remaps partition indices across children and calls compute_statistics on the specific child with the remapped index. InterleaveExec uses ctx.child_stats() directly (symmetric across all inputs).
Callers
All direct plan.partition_statistics(None) calls in optimizer rules (JoinSelection, AggregateStatistics, EnforceDistribution), display code, StatisticsRegistry, and tests are replaced with compute_statistics(plan, None).
Tests
No new tests added. This is a no-op refactoring confirmed by all existing tests passing unchanged across all affected crates (datafusion-physical-plan, datafusion-physical-optimizer, datafusion, datafusion-datasource).
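As an illustration of why a refactoring like this can leave results unchanged while removing repeated work, here is a toy sketch. It is not DataFusion code: `Plan`, `stats_old`, `stats_with_ctx`, `leaf_visits` and `chain` are hypothetical names, statistics are reduced to a row count, and the `Limit` operator that consults its child twice in the old style is an assumption chosen only to make the repeated work visible.

```rust
use std::cell::Cell;

/// Toy plan: a leaf with a row count, or a limit over a single input.
enum Plan {
    Leaf { rows: usize, visits: Cell<usize> },
    Limit { input: Box<Plan>, fetch: usize },
}

impl Plan {
    /// Old style: each operator re-fetches its child's statistics itself.
    fn stats_old(&self) -> usize {
        match self {
            Plan::Leaf { rows, visits } => {
                visits.set(visits.get() + 1);
                *rows
            }
            Plan::Limit { input, fetch } => {
                let rows = input.stats_old();
                // Hypothetical second fetch, e.g. for a different statistic.
                let rows_again = input.stats_old();
                rows.min(rows_again).min(*fetch)
            }
        }
    }

    /// New style: child statistics arrive pre-computed from the caller.
    fn stats_with_ctx(&self, child_rows: &[usize]) -> usize {
        match self {
            Plan::Leaf { rows, visits } => {
                visits.set(visits.get() + 1);
                *rows
            }
            Plan::Limit { fetch, .. } => child_rows[0].min(*fetch),
        }
    }

    /// How many times the leaf at the bottom of the chain was asked for stats.
    fn leaf_visits(&self) -> usize {
        match self {
            Plan::Leaf { visits, .. } => visits.get(),
            Plan::Limit { input, .. } => input.leaf_visits(),
        }
    }
}

/// Bottom-up walk that visits every node exactly once.
fn compute_statistics(plan: &Plan) -> usize {
    let child_rows: Vec<usize> = match plan {
        Plan::Leaf { .. } => Vec::new(),
        Plan::Limit { input, .. } => vec![compute_statistics(input)],
    };
    plan.stats_with_ctx(&child_rows)
}

/// Builds a leaf of 1000 rows wrapped in `depth` limits (fetch 500, 499, ...).
fn chain(depth: usize) -> Plan {
    let mut plan = Plan::Leaf { rows: 1_000, visits: Cell::new(0) };
    for level in 0..depth {
        plan = Plan::Limit { input: Box::new(plan), fetch: 500 - level };
    }
    plan
}
```

On `chain(3)`, both styles estimate 498 rows, but the old style asks the leaf for statistics 8 times (2 fetches at each of 3 levels) while the bottom-up walk asks once.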
What remains for follow-up
- [...] StatisticsContext (eliminates the separate StatisticsRegistry tree walk and the ExpressionAnalyzer injection machinery from #21122, "Add ExpressionAnalyzer for pluggable expression-level statistics estimation")
- [...] DataSource::partition_statistics with context if needed
- [...] compute_statistics(child, None) calls: partition-merging operators (CoalescePartitions, SortPreservingMerge, etc.) and asymmetric joins (HashJoin CollectLeft, CrossJoin, NestedLoopJoin) currently call compute_statistics(child, None) internally when the requested partition is Some, triggering a separate bottom-up walk. A cache on StatisticsContext keyed by (plan node, partition) would let these reuse already-computed results.
Test plan
- cargo fmt --all
- cargo clippy --all-targets --all-features -- -D warnings (affected crates)
- cargo test --profile ci on datafusion-physical-plan, datafusion-physical-optimizer, datafusion, datafusion-datasource
Disclaimer: I used AI to assist in the code generation; I have manually reviewed the output, and it matches my intention and understanding.