
Snowflake Unparser dialect and UNNEST support#21593

Open
yonatan-sevenai wants to merge 19 commits into apache:main from yonatan-sevenai:feature/snowflake_unparser

Conversation

@yonatan-sevenai
Contributor

Which issue does this PR close?

Rationale for this change

The SQL unparser needs a Snowflake dialect. Basic dialect settings (identifier quoting, NULLS FIRST/NULLS LAST, timestamp types) are straightforward, but UNNEST support required more than configuration.

Snowflake has no UNNEST keyword. Its equivalent, LATERAL FLATTEN(INPUT => expr), is a table function in the FROM clause with output accessed via alias."VALUE". This differs structurally from standard SQL: the unparser must emit a FROM-clause table factor with a CROSS JOIN instead of a SELECT-clause expression. It also must rewrite column references to point at the FLATTEN output, and handle several optimizer-produced plan shapes (intermediate Limit/Sort nodes, SubqueryAlias wrappers, composed expressions wrapping the unnest output, multi-expression projections). None of this can be expressed through CustomDialectBuilder.
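To make the structural difference concrete, here is a minimal string-level sketch (a hypothetical helper for illustration, not code from this PR) of the rewrite: the SELECT-clause UNNEST expression becomes an `alias."VALUE"` reference, and the FROM clause gains a CROSS JOIN to a LATERAL FLATTEN table factor.

```rust
/// Illustrative only: shows the shape of the Standard-SQL-to-Snowflake
/// rewrite for `SELECT UNNEST(array_expr) FROM table`. The real unparser
/// builds AST nodes, not strings.
fn unnest_to_flatten(table: &str, array_expr: &str, alias: &str) -> String {
    format!(
        "SELECT {alias}.\"VALUE\" FROM {table} CROSS JOIN \
         LATERAL FLATTEN(INPUT => {array_expr}) AS {alias}"
    )
}
```

For example, `SELECT UNNEST(t.arr) FROM t` maps to `SELECT _unnest."VALUE" FROM t CROSS JOIN LATERAL FLATTEN(INPUT => t.arr) AS _unnest`.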

What changes are included in this PR?

dialect.rs - New SnowflakeDialect with double-quote identifiers, NULLS FIRST/NULLS LAST, no empty select lists, no column aliases in table aliases, Snowflake timestamp types, and unnest_as_lateral_flatten(). Also wired into CustomDialect/CustomDialectBuilder.

ast.rs - New FlattenRelationBuilder that produces LATERAL FLATTEN(INPUT => expr, OUTER => bool) table factors, parallel to the existing UnnestRelationBuilder.

utils.rs - New unproject_unnest_expr_as_flatten_value transform that rewrites unnest placeholder columns to _unnest.VALUE references.
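A simplified sketch of the idea (the real transform walks DataFusion's `Expr` tree; this toy version matches on the column name only, and the `_unnest` alias is the assumed default):

```rust
// Assumed default alias for the FLATTEN output (illustrative, mirrors the
// FLATTEN_DEFAULT_ALIAS constant this PR adds).
const FLATTEN_ALIAS: &str = "_unnest";

/// Toy version of the unproject step: DataFusion's planner names unnest
/// outputs with an internal placeholder (e.g. "__unnest_placeholder(t.arr)");
/// those references must be rewritten to point at the FLATTEN output.
fn unproject_placeholder(column_name: &str) -> String {
    if column_name.starts_with("__unnest_placeholder") {
        format!("{FLATTEN_ALIAS}.\"VALUE\"")
    } else {
        column_name.to_string()
    }
}
```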

plan.rs - Changes to select_to_sql_recursively:

  • The Projection handler scans all expressions for unnest placeholders (not just single-expression projections), then branches into the FLATTEN path or the existing table-factor path.
  • peel_to_unnest_with_modifiers walks through Limit/Sort nodes between Projection and Unnest, applying their SQL modifiers to the query builder. This handles an optimizer behavior where these nodes are inserted between the two.
  • peel_to_inner_projection walks through SubqueryAlias to find the inner Projection that feeds an Unnest.
  • reconstruct_select_statement gained FLATTEN-aware expression rewriting and a has_internal_unnest_alias predicate to strip internal UNNEST(...) display names.
  • The Unnest handler rejects struct columns for the FLATTEN dialect with a clear error.
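The peel-through traversal can be sketched on a toy plan enum (assumed shapes, not DataFusion's real `LogicalPlan`): walk through the Limit/Sort wrappers the optimizer inserts, collect their modifiers for the query builder, and stop at the Unnest.

```rust
// Toy plan nodes for illustration only.
enum Node {
    Unnest,
    TableScan,
    Limit(usize, Box<Node>),
    Sort(String, Box<Node>),
}

struct Modifiers {
    limit: Option<usize>,
    sort_key: Option<String>,
}

/// Walk through intermediate Limit/Sort nodes to find the Unnest,
/// collecting the outermost limit and sort key along the way.
/// Returns None if no Unnest is found.
fn peel_to_unnest(mut node: &Node) -> Option<Modifiers> {
    let mut m = Modifiers { limit: None, sort_key: None };
    loop {
        match node {
            Node::Unnest => return Some(m),
            Node::Limit(n, inner) => {
                m.limit.get_or_insert(*n);
                node = inner.as_ref();
            }
            Node::Sort(key, inner) => {
                m.sort_key.get_or_insert_with(|| key.clone());
                node = inner.as_ref();
            }
            Node::TableScan => return None,
        }
    }
}
```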

Are these changes tested?

Yes. 18 new tests covering:

  • Simple inline arrays, string arrays, cross joins
  • Implicit FROM (UNNEST in SELECT clause)
  • User aliases, table aliases, literal + unnest
  • Subselect source with filters and limit
  • UDF result as FLATTEN input
  • Limit between Projection and Unnest
  • Sort between Projection and Unnest
  • Limit + SubqueryAlias combined
  • Composed expressions wrapping unnest output (e.g. CAST)
  • Composed expressions with Limit
  • Multi-expression projections
  • Multi-expression projections with Limit
  • SubqueryAlias between Unnest and inner Projection

Are there any user-facing changes?

Yes. New public API surface:

  • SnowflakeDialect struct and its constructor
  • Dialect::unnest_as_lateral_flatten() method (default false)
  • CustomDialectBuilder::with_unnest_as_lateral_flatten()
  • FlattenRelationBuilder and FLATTEN_DEFAULT_ALIAS in the AST module

None of these are breaking changes, and all existing APIs continue to work.
The new trait method has a default implementation to ease migration.

yonatan-sevenai and others added 19 commits March 22, 2026 00:06
…gregate

When the SQL unparser encountered a SubqueryAlias node whose direct
child was an Aggregate (or other clause-building plan like Window, Sort,
Limit, Union), it would flatten the subquery into a simple table alias,
losing the aggregate entirely.

For example, a plan representing:
  SELECT j1.col FROM j1 JOIN (SELECT max(id) AS m FROM j2) AS b ON j1.id = b.m

would unparse to:
  SELECT j1.col FROM j1 INNER JOIN j2 AS b ON j1.id = b.m

dropping the MAX aggregate and the subquery.

Root cause: the SubqueryAlias handler in select_to_sql_recursively would
call subquery_alias_inner_query_and_columns (which only unwraps
Projection children) and unparse_table_scan_pushdown (which only handles
TableScan/SubqueryAlias/Projection). When both returned nothing useful
for an Aggregate child, the code recursed directly into the Aggregate,
merging its GROUP BY into the outer SELECT instead of wrapping it in a
derived subquery.

The fix adds an early check: if the SubqueryAlias's direct child is a
plan type that builds its own SELECT clauses (Aggregate, Window, Sort,
Limit, Union), emit it as a derived subquery via self.derive() with the
alias always attached, rather than falling through to the recursive
path that would flatten it.
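The early check described above amounts to a predicate over the child's plan kind (toy enum for illustration, not DataFusion's `LogicalPlan`):

```rust
// Plan kinds that build their own SELECT clauses when unparsed.
enum PlanKind { TableScan, Projection, Aggregate, Window, Sort, Limit, Union }

/// If a SubqueryAlias's direct child builds its own SELECT clauses, it must
/// be emitted as a derived subquery rather than flattened into the outer query.
fn must_derive(child: &PlanKind) -> bool {
    matches!(
        child,
        PlanKind::Aggregate
            | PlanKind::Window
            | PlanKind::Sort
            | PlanKind::Limit
            | PlanKind::Union
    )
}
```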

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing tests pass with broken SQL output — the SELECT list
still uses DataFusion internal names (__unnest_placeholder) instead
of Snowflake's alias.VALUE convention. Update expectations to the
correct Snowflake SQL so these tests will drive the implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nd Projection

When a table is accessed through a passthrough/virtual table mapping,
DataFusion inserts a SubqueryAlias node between Unnest and its inner
Projection. The FLATTEN rendering code assumed a direct Projection child
and failed with "Unnest input is not a Projection: SubqueryAlias(...)".

Peel through SubqueryAlias in three code paths that inspect unnest.input:
try_unnest_to_lateral_flatten_sql, the inline-vs-table source check, and
the general unnest recursion. Also fix a pre-existing collapsible_if
clippy warning in check_unnest_placeholder_with_outer_ref.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the sql SQL Planner label Apr 13, 2026
@nuno-faria
Contributor

Thanks @yonatan-sevenai. I think there is another PR that adds support for the Snowflake dialect from @goldmedal (#20648). Maybe you could collaborate together on one of the PRs to avoid duplicate work.

@yonatan-sevenai
Contributor Author

Thanks @yonatan-sevenai. I think there is another PR that adds support for the Snowflake dialect from @goldmedal (#20648). Maybe you could collaborate together on one of the PRs to avoid duplicate work.

Thanks!
Quite the find :)

I believe the implementation I added covers many more use cases, but let's see if we can collaborate on a single implementation.
Specifically, there's a lot of complexity when the array to unnest is the result of a UDF, a subquery, and similar sources.
I also saw many edge cases where the optimizer inserts Limit and Sort nodes between the Unnest and the TableScan/SubqueryAlias, and some complexity when you need to cross join the original table.

Hope we can figure out a single stable implementation!

@goldmedal
Contributor

Thanks @yonatan-sevenai, I'll take a look at this PR

@goldmedal goldmedal self-requested a review April 14, 2026 03:48
Contributor

@goldmedal goldmedal left a comment


@yonatan-sevenai Thanks for picking this up — I haven't had bandwidth to finish #20648 recently, but I'd like to help get this landed.

Before a detailed review, I want to discuss the design. LATERAL FLATTEN(INPUT => expr) is Snowflake-specific syntax, and this PR embeds that logic directly in the core unparser. I'd prefer delegating UNNEST-to-table-factor conversion to the Dialect trait — see my detailed comment below.

What do you think?

Comment on lines +209 to +216
/// Unparse the unnest plan as `LATERAL FLATTEN(INPUT => expr, ...)`.
///
/// Snowflake uses FLATTEN as a table function instead of the SQL-standard UNNEST.
/// When this returns `true`, the unparser emits
/// `LATERAL FLATTEN(INPUT => <col>, OUTER => <bool>)` in the FROM clause.
fn unnest_as_lateral_flatten(&self) -> bool {
false
}
Contributor


I'd prefer a trait method that returns an Option rather than a boolean flag. Something like (the design used in #20648):

fn unparse_unnest_table_factor(
    &self,
    _unnest: &Unnest,
    _columns: &[Ident],
    _unparser: &Unparser,
) -> Result<Option<TableFactorBuilder>> {
    Ok(None)
}

My concern is that LATERAL FLATTEN(INPUT => <col>, OUTER => <bool>) is Snowflake-specific syntax, and this PR embeds a significant amount of that dialect-specific behavior directly in the core unparser (plan.rs). The unparser should ideally stay focused on generic SQL generation, with database-specific behavior delegated to the Dialect trait.

A trait-based approach has a few advantages:

  1. Isolation — If Snowflake changes the FLATTEN arguments or syntax, only SnowflakeDialect needs updating, not the core unparser.
  2. Extensibility — Other databases that handle UNNEST differently (e.g., Trino's CROSS JOIN UNNEST, BigQuery's table-factor UNNEST) can implement the same trait method with their own output, without adding more boolean flags or conditional branches to plan.rs.
  3. Consistency — This follows the existing pattern in the codebase where dialects override behavior through trait methods that return AST nodes (e.g., scalar_function_to_sql_overrides), rather than flags that gate hardcoded logic.

Contributor Author


Great question.
I generally followed the existing design, where the dialect implementations are policies and don't implement the changes directly. Shifting that design to place business logic inside the dialect is possible, but I'm not sure it's a better strategy.

All in all, there are multiple touch points and multiple areas involved in going from DataFusion's UNNEST forms to Snowflake's LATERAL FLATTEN.

Specifically, in the test cases I've added, I've built unnest from the result of a UDF (consider select UNNEST(ExtractArrayFromCol(col)) from table) and other cases where the DF optimizer puts Limit, Sort, multiple layers of SubqueryAlias, and other nodes inside the logical plan. Handling all of these felt too big and invasive to add to the dialect traits.

But as I said, the real motivation was to keep the dialect as slim as possible (like today), and not introduce very partial overridable behaviors into the Unparser as dialect hooks that structurally alter the unparsed output.

WDYT @goldmedal ?

Contributor


Specifically, in the test cases I've added, I've built unnest from the result of a UDF (consider select UNNEST(ExtractArrayFromCol(col)) from table) and other cases where the DF optimizer puts Limit, Sort, multiple layers of SubqueryAlias, and other nodes inside the logical plan. Handling all of these felt too big and invasive to add to the dialect traits.

Agreed. The plan-tree traversal logic to find the UNNEST belongs in the core unparser — that's generic behavior, not dialect-specific, and it would be too invasive to push into the trait.

But as I said, the real motivation was to keep the dialect as slim as possible (like today)

I'm not sure if it's the unparser dialect's goal now 🤔 (maybe you have discussed this with other people?). In Dialect, there are many methods that produce a result (e.g. scalar_function_to_sql_overrides, col_alias_overrides, timestamp_with_tz_to_string,.. ) instead of a bool flag.

My suggestion is more about separating the two concerns:

  • Plan traversal stays in the unparser, as you've done
  • SQL rendering (the specific LATERAL FLATTEN(INPUT => expr, OUTER => bool) syntax) → delegated to the dialect via a trait method

The core unparser would still do all the heavy lifting — the dialect only receives the prepared context and decides how to render the table factor. This keeps the dialect slim (just the rendering logic) while avoiding Snowflake-specific AST construction in plan.rs.
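A toy sketch of this split (names and signatures assumed for illustration; see #20648 for the actual proposal): the dialect receives the prepared context and either takes over rendering or returns None to fall back to the generic path.

```rust
/// Toy dialect trait, not DataFusion's real one: the core unparser has
/// already done the plan traversal and hands the dialect only the rendered
/// input expression and the OUTER flag.
trait Dialect {
    /// Return Some(table factor SQL) to take over UNNEST rendering;
    /// None falls back to the generic UNNEST path.
    fn unnest_table_factor(&self, input: &str, outer: bool) -> Option<String> {
        let _ = (input, outer);
        None
    }
}

struct Generic;
impl Dialect for Generic {} // keeps the default (generic UNNEST) behavior

struct Snowflake;
impl Dialect for Snowflake {
    fn unnest_table_factor(&self, input: &str, outer: bool) -> Option<String> {
        Some(format!("LATERAL FLATTEN(INPUT => {input}, OUTER => {outer})"))
    }
}
```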

WDYT?

Labels

sql SQL Planner


Development

Successfully merging this pull request may close these issues.

Snowflake dialect support for Unparser

3 participants