Skip to content

Unify AVG group state conversion and filter handling across Spark and built-in accumulators#22639

Open
kosiew wants to merge 4 commits into
apache:mainfrom
kosiew:unify-avg-22638
Open

Unify AVG group state conversion and filter handling across Spark and built-in accumulators#22639
kosiew wants to merge 4 commits into
apache:mainfrom
kosiew:unify-avg-22638

Conversation

@kosiew
Copy link
Copy Markdown
Contributor

@kosiew kosiew commented May 30, 2026

Which issue does this PR close?

Rationale for this change

Spark AVG and the built-in AVG implementations maintained separate logic for converting input values into group-accumulator state, including null and filter handling. This duplication increased the risk of behavioral divergence and made maintenance more difficult.

This change introduces a shared helper for AVG state conversion and aligns null/filter semantics across both implementations. It also ensures that Spark AVG applies filters consistently during state merges, reducing the chance of discrepancies between the two implementations.

What changes are included in this PR?

  • Added a shared AVG state conversion helper in functions-aggregate-common:

    • groups_accumulator::avg::convert_to_avg_state
  • Refactored the built-in AVG implementation to use the shared helper instead of maintaining local null-mask and state-conversion logic.

  • Refactored the Spark AVG implementation to use the same shared helper for convert_to_state.

  • Updated Spark AVG merge_batch to skip rows that are:

    • null in the partial count state,
    • null in the partial sum state,
    • filtered out by a merge filter (including NULL filter values).
  • Preserved implementation-specific state ordering:

    • Built-in AVG: [count, sum]
    • Spark AVG: [sum, count]
  • Added regression tests covering:

    • propagation of input nulls into AVG state,
    • propagation of filter nulls into AVG state,
    • preservation of sum data types,
    • state ordering expectations,
    • Spark AVG merge behavior with filters.

Are these changes tested?

Yes.

New unit tests were added for the shared helper and both AVG implementations:

functions-aggregate-common

  • convert_to_avg_state_applies_input_nulls_to_sum_and_count
  • convert_to_avg_state_applies_filter_nulls_to_sum_and_count
  • convert_to_avg_state_preserves_sum_data_type

Built-in AVG (functions-aggregate)

  • float64_convert_to_state_uses_count_sum_order_and_null_filter
  • decimal_convert_to_state_preserves_sum_type_and_nulls
  • duration_convert_to_state_preserves_sum_type_and_applies_filter

Spark AVG

  • convert_to_state_with_null_filter
  • merge_batch_applies_filter

Existing AVG tests continue to validate state round-tripping and related behavior.

Are there any user-facing changes?

No user-facing changes are intended. This PR refactors and aligns internal AVG group-accumulator state handling and adds regression coverage for null and filter semantics.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 4 commits May 30, 2026 14:31
…rk integration

- Added shared helper for average calculations in `avg.rs` with conversion to average state.
- Exported the aggregate module in `groups_accumulator.rs`.
- Updated Spark's average function to maintain state order and count type.
- Added tests for common helper null/filter semantics and Spark null filter cases.
…ve data type integrity

- Updated built-in Avg to utilize shared `convert_to_avg_state`.
- Ensured the order of state is preserved as [count, sum].
- Maintained count type as UInt64.
- Ensured sum data type consistency for Decimal and Duration.

Added tests for:
- Float64: validating count/sum order and null filter semantics.
- Decimal128: checking sum type and input null semantics.
- DurationNanosecond: verifying sum type and filter semantics.
- Added assert_validity test helpers for improved validation in tests
- Reduced repeated null assertions to streamline code
- Shortened common helper tests using local imports and type aliasing
- Introduced built-in Avg avg_state test helper
- Added comment for decimal test closure to clarify the unused avg_fn by convert_to_state
…lter

- Implemented logic to skip false and NULL values in merge_batch.
- Maintained skipping of null converted state rows.
- Added regression test: merge_batch_applies_filter.
- Introduced spark_avg_state test helper for better testing.
- Refactored code to eliminate repeated state[0]/state[1] decode boilerplate.
@github-actions github-actions Bot added functions Changes to functions implementation spark labels May 30, 2026
@kosiew kosiew marked this pull request as ready for review May 30, 2026 07:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation spark

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Refactor] Unify Spark Avg and Built-in Avg Group-Accumulator State Handling

1 participant