
Snowflake Unparser dialect and UNNEST support#21593

Open
yonatan-sevenai wants to merge 19 commits into apache:main from yonatan-sevenai:feature/snowflake_unparser

Conversation

@yonatan-sevenai
Contributor

Which issue does this PR close?

Rationale for this change

The SQL unparser needs a Snowflake dialect. Basic dialect settings (identifier quoting, NULLS FIRST/NULLS LAST, timestamp types) are straightforward, but UNNEST support required more than configuration.

Snowflake has no UNNEST keyword. Its equivalent, LATERAL FLATTEN(INPUT => expr), is a table function in the FROM clause with output accessed via alias."VALUE". This differs structurally from standard SQL: the unparser must emit a FROM-clause table factor with a CROSS JOIN instead of a SELECT-clause expression. It also must rewrite column references to point at the FLATTEN output, and handle several optimizer-produced plan shapes (intermediate Limit/Sort nodes, SubqueryAlias wrappers, composed expressions wrapping the unnest output, multi-expression projections). None of this can be expressed through CustomDialectBuilder.
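To make the structural difference concrete, here is a minimal string-level sketch (a hypothetical helper for illustration, not code from this PR) of the rewrite: the SELECT-clause UNNEST expression becomes an `alias."VALUE"` reference, and the FROM clause gains a CROSS JOIN to a LATERAL FLATTEN table factor.

```rust
/// Illustrative only: shows the shape of the Standard-SQL-to-Snowflake
/// rewrite for `SELECT UNNEST(array_expr) FROM table`. The real unparser
/// builds AST nodes, not strings.
fn unnest_to_flatten(table: &str, array_expr: &str, alias: &str) -> String {
    format!(
        "SELECT {alias}.\"VALUE\" FROM {table} CROSS JOIN \
         LATERAL FLATTEN(INPUT => {array_expr}) AS {alias}"
    )
}
```

For example, `SELECT UNNEST(t.arr) FROM t` maps to `SELECT _unnest."VALUE" FROM t CROSS JOIN LATERAL FLATTEN(INPUT => t.arr) AS _unnest`.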

What changes are included in this PR?

dialect.rs - New SnowflakeDialect with double-quote identifiers, NULLS FIRST/NULLS LAST, no empty select lists, no column aliases in table aliases, Snowflake timestamp types, and unnest_as_lateral_flatten(). Also wired into CustomDialect/CustomDialectBuilder.

ast.rs - New FlattenRelationBuilder that produces LATERAL FLATTEN(INPUT => expr, OUTER => bool) table factors, parallel to the existing UnnestRelationBuilder.

utils.rs - New unproject_unnest_expr_as_flatten_value transform that rewrites unnest placeholder columns to _unnest.VALUE references.
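A simplified sketch of the idea (the real transform walks DataFusion's `Expr` tree; this toy version matches on the column name only, and the `_unnest` alias is the assumed default):

```rust
// Assumed default alias for the FLATTEN output (illustrative, mirrors the
// FLATTEN_DEFAULT_ALIAS constant this PR adds).
const FLATTEN_ALIAS: &str = "_unnest";

/// Toy version of the unproject step: DataFusion's planner names unnest
/// outputs with an internal placeholder (e.g. "__unnest_placeholder(t.arr)");
/// those references must be rewritten to point at the FLATTEN output.
fn unproject_placeholder(column_name: &str) -> String {
    if column_name.starts_with("__unnest_placeholder") {
        format!("{FLATTEN_ALIAS}.\"VALUE\"")
    } else {
        column_name.to_string()
    }
}
```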

plan.rs - Changes to select_to_sql_recursively:

  • The Projection handler scans all expressions for unnest placeholders (not just single-expression projections), then branches into the FLATTEN path or the existing table-factor path.
  • peel_to_unnest_with_modifiers walks through Limit/Sort nodes between Projection and Unnest, applying their SQL modifiers to the query builder. This handles an optimizer behavior where these nodes are inserted between the two.
  • peel_to_inner_projection walks through SubqueryAlias to find the inner Projection that feeds an Unnest.
  • reconstruct_select_statement gained FLATTEN-aware expression rewriting and a has_internal_unnest_alias predicate to strip internal UNNEST(...) display names.
  • The Unnest handler rejects struct columns for the FLATTEN dialect with a clear error.
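The peel-through traversal can be sketched on a toy plan enum (assumed shapes, not DataFusion's real `LogicalPlan`): walk through the Limit/Sort wrappers the optimizer inserts, collect their modifiers for the query builder, and stop at the Unnest.

```rust
// Toy plan nodes for illustration only.
enum Node {
    Unnest,
    TableScan,
    Limit(usize, Box<Node>),
    Sort(String, Box<Node>),
}

struct Modifiers {
    limit: Option<usize>,
    sort_key: Option<String>,
}

/// Walk through intermediate Limit/Sort nodes to find the Unnest,
/// collecting the outermost limit and sort key along the way.
/// Returns None if no Unnest is found.
fn peel_to_unnest(mut node: &Node) -> Option<Modifiers> {
    let mut m = Modifiers { limit: None, sort_key: None };
    loop {
        match node {
            Node::Unnest => return Some(m),
            Node::Limit(n, inner) => {
                m.limit.get_or_insert(*n);
                node = inner.as_ref();
            }
            Node::Sort(key, inner) => {
                m.sort_key.get_or_insert_with(|| key.clone());
                node = inner.as_ref();
            }
            Node::TableScan => return None,
        }
    }
}
```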

Are these changes tested?

Yes. 18 new tests covering:

  • Simple inline arrays, string arrays, cross joins
  • Implicit FROM (UNNEST in SELECT clause)
  • User aliases, table aliases, literal + unnest
  • Subselect source with filters and limit
  • UDF result as FLATTEN input
  • Limit between Projection and Unnest
  • Sort between Projection and Unnest
  • Limit + SubqueryAlias combined
  • Composed expressions wrapping unnest output (e.g. CAST)
  • Composed expressions with Limit
  • Multi-expression projections
  • Multi-expression projections with Limit
  • SubqueryAlias between Unnest and inner Projection

Are there any user-facing changes?

Yes. New public API surface:

  • SnowflakeDialect struct and its constructor
  • Dialect::unnest_as_lateral_flatten() method (default false)
  • CustomDialectBuilder::with_unnest_as_lateral_flatten()
  • FlattenRelationBuilder and FLATTEN_DEFAULT_ALIAS in the AST module

None of these are breaking changes, and all existing APIs continue to work.
The new trait method has a default implementation to ease migration.

yonatan-sevenai and others added 19 commits March 22, 2026 00:06
…gregate

When the SQL unparser encountered a SubqueryAlias node whose direct
child was an Aggregate (or other clause-building plan like Window, Sort,
Limit, Union), it would flatten the subquery into a simple table alias,
losing the aggregate entirely.

For example, a plan representing:
  SELECT j1.col FROM j1 JOIN (SELECT max(id) AS m FROM j2) AS b ON j1.id = b.m

would unparse to:
  SELECT j1.col FROM j1 INNER JOIN j2 AS b ON j1.id = b.m

dropping the MAX aggregate and the subquery.

Root cause: the SubqueryAlias handler in select_to_sql_recursively would
call subquery_alias_inner_query_and_columns (which only unwraps
Projection children) and unparse_table_scan_pushdown (which only handles
TableScan/SubqueryAlias/Projection). When both returned nothing useful
for an Aggregate child, the code recursed directly into the Aggregate,
merging its GROUP BY into the outer SELECT instead of wrapping it in a
derived subquery.

The fix adds an early check: if the SubqueryAlias's direct child is a
plan type that builds its own SELECT clauses (Aggregate, Window, Sort,
Limit, Union), emit it as a derived subquery via self.derive() with the
alias always attached, rather than falling through to the recursive
path that would flatten it.
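The early check described above amounts to a predicate over the child's plan kind (toy enum for illustration, not DataFusion's `LogicalPlan`):

```rust
// Plan kinds that build their own SELECT clauses when unparsed.
enum PlanKind { TableScan, Projection, Aggregate, Window, Sort, Limit, Union }

/// If a SubqueryAlias's direct child builds its own SELECT clauses, it must
/// be emitted as a derived subquery rather than flattened into the outer query.
fn must_derive(child: &PlanKind) -> bool {
    matches!(
        child,
        PlanKind::Aggregate
            | PlanKind::Window
            | PlanKind::Sort
            | PlanKind::Limit
            | PlanKind::Union
    )
}
```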

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing tests pass with broken SQL output — the SELECT list
still uses DataFusion internal names (__unnest_placeholder) instead
of Snowflake's alias.VALUE convention. Update expectations to the
correct Snowflake SQL so these tests will drive the implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nd Projection

When a table is accessed through a passthrough/virtual table mapping,
DataFusion inserts a SubqueryAlias node between Unnest and its inner
Projection. The FLATTEN rendering code assumed a direct Projection child
and failed with "Unnest input is not a Projection: SubqueryAlias(...)".

Peel through SubqueryAlias in three code paths that inspect unnest.input:
try_unnest_to_lateral_flatten_sql, the inline-vs-table source check, and
the general unnest recursion. Also fix a pre-existing collapsible_if
clippy warning in check_unnest_placeholder_with_outer_ref.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the sql SQL Planner label Apr 13, 2026
@nuno-faria
Contributor

Thanks @yonatan-sevenai. I think there is another PR that adds support for the Snowflake dialect from @goldmedal (#20648). Maybe you could collaborate together on one of the PRs to avoid duplicate work.

@yonatan-sevenai
Contributor Author

Thanks @yonatan-sevenai. I think there is another PR that adds support for the Snowflake dialect from @goldmedal (#20648). Maybe you could collaborate together on one of the PRs to avoid duplicate work.

Thanks!
Quite the find :)

I believe the implementation I added covers many more use cases, but let's see if we can collaborate on a single implementation.
Specifically, there's a lot of complexity when the array to unnest is the result of a UDF, a subquery, and similar sources.
I also saw many edge cases where the optimizer inserts Limit and Sort nodes between the Unnest and the TableScan/SubqueryAlias, and some complexity when you need to cross join the original table.

Hope we can figure out a single stable implementation!

@goldmedal
Contributor

Thanks @yonatan-sevenai, I'll take a look at this PR

@goldmedal goldmedal self-requested a review April 14, 2026 03:48
Contributor

@goldmedal goldmedal left a comment


@yonatan-sevenai Thanks for picking this up — I haven't had bandwidth to finish #20648 recently, but I'd like to help get this landed.

Before a detailed review, I want to discuss the design. LATERAL FLATTEN(INPUT => expr) is Snowflake-specific syntax, and this PR embeds that logic directly in the core unparser. I'd prefer delegating UNNEST-to-table-factor conversion to the Dialect trait — see my detailed comment below.

What do you think?

Comment on lines +209 to +216
/// Unparse the unnest plan as `LATERAL FLATTEN(INPUT => expr, ...)`.
///
/// Snowflake uses FLATTEN as a table function instead of the SQL-standard UNNEST.
/// When this returns `true`, the unparser emits
/// `LATERAL FLATTEN(INPUT => <col>, OUTER => <bool>)` in the FROM clause.
fn unnest_as_lateral_flatten(&self) -> bool {
false
}
Contributor


I'd prefer a trait method that returns an Option rather than a boolean flag. Something like (the design used in #20648):

fn unparse_unnest_table_factor(
    &self,
    _unnest: &Unnest,
    _columns: &[Ident],
    _unparser: &Unparser,
) -> Result<Option<TableFactorBuilder>> {
    Ok(None)
}

My concern is that LATERAL FLATTEN(INPUT => <col>, OUTER => <bool>) is Snowflake-specific syntax, and this PR embeds a significant amount of that dialect-specific behavior directly in the core unparser (plan.rs). The unparser should ideally stay focused on generic SQL generation, with database-specific behavior delegated to the Dialect trait.

A trait-based approach has a few advantages:

  1. Isolation — If Snowflake changes the FLATTEN arguments or syntax, only SnowflakeDialect needs updating, not the core unparser.
  2. Extensibility — Other databases that handle UNNEST differently (e.g., Trino's CROSS JOIN UNNEST, BigQuery's table-factor UNNEST) can implement the same trait method with their own output, without adding more boolean flags or conditional branches to plan.rs.
  3. Consistency — This follows the existing pattern in the codebase where dialects override behavior through trait methods that return AST nodes (e.g., scalar_function_to_sql_overrides), rather than flags that gate hardcoded logic.

Contributor Author


Great question.
I generally followed the existing design, where the dialect implementations are policies and don't implement the changes directly. Shifting that design to place business logic inside the dialect is possible, but I'm not sure it's a better strategy.

All in all, there are multiple touch points and multiple areas involved in going from DataFusion's UNNEST forms to Snowflake's LATERAL FLATTEN.

Specifically, in the test cases I've added, I've built unnest from the result of a UDF (consider select UNNEST(ExtractArrayFromCol(col)) from table) and other cases where the DF optimizer puts Limit, Sort, multiple layers of SubqueryAlias, and other nodes inside the logical plan. Handling all of these felt too big and invasive to add to the dialect traits.

But as I said, the real motivation was to keep the dialect as slim as possible (like today), and not introduce very partial overridable behaviors into the Unparser as dialect hooks that structurally alter the unparsed output.

WDYT @goldmedal ?

Contributor


Specifically, in the test cases I've added, I've built unnest from the result of a UDF (consider select UNNEST(ExtractArrayFromCol(col)) from table) and other cases where the DF optimizer puts Limit, Sort, multiple layers of SubqueryAlias, and other nodes inside the logical plan. Handling all of these felt too big and invasive to add to the dialect traits.

Agreed. The plan-tree traversal logic to find the UNNEST belongs in the core unparser — that's generic behavior, not dialect-specific, and it would be too invasive to push into the trait.

But as I said, the real motivation was to keep the dialect as slim as possible (like today)

I'm not sure if it's the unparser dialect's goal now 🤔 (maybe you have discussed this with other people?). In Dialect, there are many methods that produce a result (e.g. scalar_function_to_sql_overrides, col_alias_overrides, timestamp_with_tz_to_string,.. ) instead of a bool flag.

My suggestion is more about separating the two concerns:

  • Plan traversal stays in the unparser, as you've done
  • SQL rendering (the specific LATERAL FLATTEN(INPUT => expr, OUTER => bool) syntax) → delegated to the dialect via a trait method

The core unparser would still do all the heavy lifting — the dialect only receives the prepared context and decides how to render the table factor. This keeps the dialect slim (just the rendering logic) while avoiding Snowflake-specific AST construction in plan.rs.
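A toy sketch of this split (names and signatures assumed for illustration; see #20648 for the actual proposal): the dialect receives the prepared context and either takes over rendering or returns None to fall back to the generic path.

```rust
/// Toy dialect trait, not DataFusion's real one: the core unparser has
/// already done the plan traversal and hands the dialect only the rendered
/// input expression and the OUTER flag.
trait Dialect {
    /// Return Some(table factor SQL) to take over UNNEST rendering;
    /// None falls back to the generic UNNEST path.
    fn unnest_table_factor(&self, input: &str, outer: bool) -> Option<String> {
        let _ = (input, outer);
        None
    }
}

struct Generic;
impl Dialect for Generic {} // keeps the default (generic UNNEST) behavior

struct Snowflake;
impl Dialect for Snowflake {
    fn unnest_table_factor(&self, input: &str, outer: bool) -> Option<String> {
        Some(format!("LATERAL FLATTEN(INPUT => {input}, OUTER => {outer})"))
    }
}
```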

WDYT?

Labels

sql SQL Planner


Development

Successfully merging this pull request may close these issues.

Snowflake dialect support for Unparser

3 participants