Count aware aggregates#36776
Draft
frankmcsherry wants to merge 2 commits into
Draft
Conversation
Change the aggregate evaluation surface from `Item = Datum` to `Item = (Datum, Diff)` so aggregates can consume input multiplicity directly rather than requiring callers to expand each value into `diff` copies. `count` sums the diffs, multiplicity-insensitive aggregates (min/max/any/all) drop them, and the rest expand as before. This avoids materializing a number of rows proportional to the diff in `fold_reduce_constant`, where a compact `(row, N)` previously expanded to N rows before aggregation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…egative When a `generate_series(start, stop, step)` FlatMap's output column is not demanded, projection_pushdown now rewrites it into a `RepeatRowNonNegative` emitting the same row count. This is the count-aware-aggregate synergy: the cardinality is encoded as a single `(empty_row, diff=N)` rather than N physical rows, so downstream count-aware reduces stay O(1). Uses a sign-guarded emptiness `CASE` so truncating `DivInt64` coincides with floor division and empty series collapse to zero. Only fires for non-zero literal steps of integer series. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two related optimizer/eval changes that together let a row-wise
count(*)over an integer
generate_seriesevaluate in O(1) instead of materializingone row per generated value.
Count-aware aggregate eval.
AggregateFunc::evalnow consumesItem = (Datum, Diff)instead of bareDatum.countsums the diffs,multiplicity-insensitive aggregates (min/max/any/all) drop them, and the
rest expand by count as before. Avoids
fold_reduce_constantexpanding acompact
(row, N)into N rows before aggregation.Collapse unused integer
generate_series. When agenerate_series(start, stop, step)FlatMap's output column is notdemanded, projection_pushdown rewrites it into a
RepeatRowNonNegativeemitting the same row count, encoded as a single
(empty_row, diff=N).Sign-guarded emptiness
CASEso truncatingDivInt64coincides with floordivision and empty series collapse to zero. Fires only for non-zero literal
steps of integer series.
This reproduces the work of #22753, but with the assistance of 🤖 to grind through the aggregate eval work.
cc: @def- : this has the potential to make some of your generate-series based stress tests less stressful.