Skip to content

fix: recompute Union schema when field names or types change#22640

Open
Brijesh-Thakkar wants to merge 6 commits into
apache:mainfrom
Brijesh-Thakkar:fix/22447-recompute-schema-union
Open

fix: recompute Union schema when field names or types change#22640
Brijesh-Thakkar wants to merge 6 commits into
apache:mainfrom
Brijesh-Thakkar:fix/22447-recompute-schema-union

Conversation

@Brijesh-Thakkar
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

LogicalPlan::recompute_schema() contains special handling for LogicalPlan::Union to avoid unnecessarily rebuilding the schema when inputs have not changed.

However, the current implementation only checks whether the cached schema and the first input schema have the same number of fields. This can leave the cached schema stale when optimizer rewrites modify field types, names, or qualifiers without changing the schema width.

For example, after a type coercion rewrite, the input schemas may change from Int32 to Int64 while preserving the same column count. In this case, recompute_schema() incorrectly considers the schema unchanged and returns the stale cached schema.

What changes are included in this PR?

This PR updates the LogicalPlan::Union branch in recompute_schema() to compare schema structure rather than only schema width.

The comparison now verifies:

  • Field count
  • Field data types
  • Field names
  • Field qualifiers

If any of these differ from the current input schema, the Union schema is recomputed using Union::try_new().

Additionally, two regression tests were added:

  • test_recompute_schema_union_type_mismatch
  • test_recompute_schema_union_name_mismatch

These tests verify that schema recomputation occurs when input field types or names change while the schema width remains unchanged.

Are these changes tested?

Yes.

Added regression tests covering:

  1. Input type changes (Int32Int64) with identical schema width.
  2. Input column name changes with identical schema width.

The new tests fail with the previous width-only validation logic and pass with this change.

The following test suites were also executed successfully:

  • cargo test -p datafusion-expr
  • cargo test -p datafusion-optimizer

Are there any user-facing changes?

No user-facing API changes.

This change fixes internal schema propagation for LogicalPlan::Union after optimizer rewrites and ensures cached schemas remain consistent with rewritten inputs.

…nion

Previously the Union arm only validated field count (width), causing
stale cached schemas when inputs were rewritten with different types
or column names (e.g. after type-coercion). Fix uses qualified_field()
to deep-compare qualifier + data_type + name.

Fixes apache#22447
Copilot AI review requested due to automatic review settings May 30, 2026 07:19
@github-actions github-actions Bot added the logical-expr Logical plan and expressions label May 30, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates LogicalPlan::recompute_schema for Union plans to detect stale cached schemas beyond just column count, and adds regression tests for type/name mismatches.

Changes:

  • Strengthen Union schema equivalence check to include qualifiers, field names, and data types.
  • Add tests that reproduce stale Union schema behavior when input types or names change after rewrites.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +710 to +722
LogicalPlan::Union(Union { inputs, schema }) => {
let first_input_schema = inputs[0].schema();
if schema.fields().len() == first_input_schema.fields().len() {
// If inputs are not pruned do not change schema
// Check field count AND field types AND field names/qualifiers.
// A width-only check misses cases where inputs were rewritten with
// different types or aliases (e.g. after type-coercion rewrites).
let schemas_match = schema.fields().len() == first_input_schema.fields().len()
&& (0..schema.fields().len()).all(|i| {
let (q1, f1) = schema.qualified_field(i);
let (q2, f2) = first_input_schema.qualified_field(i);
q1 == q2
&& f1.data_type() == f2.data_type()
&& f1.name() == f2.name()
});
Comment on lines +719 to +721
q1 == q2
&& f1.data_type() == f2.data_type()
&& f1.name() == f2.name()
Comment on lines +715 to +716
let schemas_match = schema.fields().len() == first_input_schema.fields().len()
&& (0..schema.fields().len()).all(|i| {
Previously the Union arm only checked field count (width), causing
stale cached schemas when inputs were rewritten with different types,
names, qualifiers, or nullability but the same column count.

Fix: call Union::try_new(inputs) to get the authoritative recomputed
schema, then compare against the cached schema via DFSchema PartialEq.
This handles all field properties in one place and is future-proof.

Added three unit tests covering type, name, and nullability mismatches.

Fixes apache#22447
@Brijesh-Thakkar
Copy link
Copy Markdown
Contributor Author

@cetra3 @nathanb9 heyy all ci tests are passing and the issue is adressed properly
Please review it and if everything seems good then please merge it

@Brijesh-Thakkar
Copy link
Copy Markdown
Contributor Author

@comphead Heyy could you please review this PR, the issue is addressed properly and all ci tests are also passing
Thank you

// `Union` was created `BY NAME`, and can safely rely on the
// `try_new` initializer to derive the new schema based on
// column positions.
let recomputed = Union::try_new(inputs)?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we always pay the allocation cost it. I don't think we need to do that. Instead we can do structural comparison first which can short circuit sooner. Like:

let schemas_match = inputs.iter().all(|input| {
    let input_schema = input.schema();
    schema.fields().len() == input_schema.fields().len()
        && schema
            .iter()
            .zip(input_schema.iter())
            .all(|((q1, f1), (q2, f2))| {
                q1 == q2
                    && f1.name() == f2.name()
                    && f1.data_type() == f2.data_type()
                    && f1.is_nullable() == f2.is_nullable()
            })
});

This way we avoid allocation if no type-coercion.

// `Union` was created `BY NAME`, and can safely rely on the
// `try_new` initializer to derive the new schema based on
// column positions.
let recomputed = Union::try_new(inputs)?;
Copy link
Copy Markdown
Contributor

@nathanb9 nathanb9 May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, after looking at Union::try_new more closely, I see that it always loses metadata.
Metadata may be useful since I see that it is preserved in coerce_union_schema_with_schema (https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/analyzer/type_coercion.rs)

/// **Schema-level metadata merging**: Later schemas take precedence for duplicate keys.
///
/// **Field-level metadata merging**: Later fields take precedence for duplicate metadata keys.

So we might need an API alternative to try_new or a change to it so that we can preserve schema-level and field-level metadata in the case of type coercion

Previously the Union arm only checked field count (width), causing
stale cached schemas when inputs were rewritten with different types,
names, qualifiers, or nullability but the same column count.

Fix:
- Fast path: structural comparison across ALL inputs (field count,
  type, name, qualifier, nullability) with zero allocation on the
  common no-change case, as suggested by the reviewer.
- Slow path: Union::try_new() for structure, then HashMap::extend()
  semantics for schema-level and field-level metadata preservation,
  matching the behavior of coerce_union_schema_with_schema in
  type_coercion.rs.

Added four unit tests covering type mismatch, name mismatch,
nullability mismatch, and metadata preservation.

Fixes apache#22447
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

recompute_schema() is broken with LogicalPlan::Union

3 participants