Snowflake Unparser dialect and UNNEST support #21593
yonatan-sevenai wants to merge 19 commits into apache:main from
…gregate

When the SQL unparser encountered a SubqueryAlias node whose direct child was an Aggregate (or other clause-building plan like Window, Sort, Limit, Union), it would flatten the subquery into a simple table alias, losing the aggregate entirely. For example, a plan representing:

    SELECT j1.col FROM j1 JOIN (SELECT max(id) AS m FROM j2) AS b ON j1.id = b.m

would unparse to:

    SELECT j1.col FROM j1 INNER JOIN j2 AS b ON j1.id = b.m

dropping the MAX aggregate and the subquery.

Root cause: the SubqueryAlias handler in select_to_sql_recursively would call subquery_alias_inner_query_and_columns (which only unwraps Projection children) and unparse_table_scan_pushdown (which only handles TableScan/SubqueryAlias/Projection). When both returned nothing useful for an Aggregate child, the code recursed directly into the Aggregate, merging its GROUP BY into the outer SELECT instead of wrapping it in a derived subquery.

The fix adds an early check: if the SubqueryAlias's direct child is a plan type that builds its own SELECT clauses (Aggregate, Window, Sort, Limit, Union), emit it as a derived subquery via self.derive() with the alias always attached, rather than falling through to the recursive path that would flatten it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
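The early check described in this commit can be illustrated with a minimal sketch. The `Plan` enum and both function names below are hypothetical stand-ins invented for illustration; they are not DataFusion's `LogicalPlan` or the PR's actual code, and the returned strings merely label which path would be taken.

```rust
/// Simplified, hypothetical stand-in for a logical plan; only the
/// variants mentioned in the commit message are modeled.
#[allow(dead_code)]
#[derive(Debug)]
enum Plan {
    TableScan(String),
    Projection(Box<Plan>),
    Aggregate(Box<Plan>),
    Window(Box<Plan>),
    Sort(Box<Plan>),
    Limit(Box<Plan>),
    Union(Vec<Plan>),
    SubqueryAlias { alias: String, input: Box<Plan> },
}

/// True when the plan builds its own SELECT clauses, so a SubqueryAlias
/// wrapping it must not be flattened into a plain table alias.
fn builds_own_select_clauses(plan: &Plan) -> bool {
    matches!(
        plan,
        Plan::Aggregate(_) | Plan::Window(_) | Plan::Sort(_) | Plan::Limit(_) | Plan::Union(_)
    )
}

/// The early check: decide how a SubqueryAlias node should be rendered.
fn subquery_alias_strategy(plan: &Plan) -> &'static str {
    match plan {
        // Direct child is clause-building: wrap it as `(SELECT ...) AS alias`.
        Plan::SubqueryAlias { input, .. } if builds_own_select_clauses(input) => {
            "derived subquery"
        }
        // Otherwise fall through to the pre-existing handling.
        Plan::SubqueryAlias { .. } => "existing path",
        _ => "not a subquery alias",
    }
}
```

With this shape, an alias over an aggregate is routed to the derived-subquery path, while an alias over a projection keeps the pre-existing behavior.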
The existing tests pass with broken SQL output — the SELECT list still uses DataFusion internal names (__unnest_placeholder) instead of Snowflake's alias.VALUE convention. Update expectations to the correct Snowflake SQL so these tests will drive the implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nd Projection When a table is accessed through a passthrough/virtual table mapping, DataFusion inserts a SubqueryAlias node between Unnest and its inner Projection. The FLATTEN rendering code assumed a direct Projection child and failed with "Unnest input is not a Projection: SubqueryAlias(...)". Peel through SubqueryAlias in three code paths that inspect unnest.input: try_unnest_to_lateral_flatten_sql, the inline-vs-table source check, and the general unnest recursion. Also fix a pre-existing collapsible_if clippy warning in check_unnest_placeholder_with_outer_ref. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
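As a rough illustration of the "peel through SubqueryAlias" step, here is a minimal sketch. The `Plan` enum and the functions `peel_subquery_aliases` and `inner_projection` are hypothetical names for this example only; the PR's real helpers operate on DataFusion's `LogicalPlan`.

```rust
/// Hypothetical, simplified plan type for illustration only.
#[allow(dead_code)]
#[derive(Debug)]
enum Plan {
    TableScan(String),
    Projection { exprs: Vec<String>, input: Box<Plan> },
    SubqueryAlias { alias: String, input: Box<Plan> },
}

/// Skip any number of SubqueryAlias wrappers and return the first
/// node found beneath them.
fn peel_subquery_aliases(mut plan: &Plan) -> &Plan {
    while let Plan::SubqueryAlias { input, .. } = plan {
        plan = &**input;
    }
    plan
}

/// Find the Projection feeding an Unnest, tolerating alias wrappers.
/// Without the peeling step, an aliased input would hit the error branch.
fn inner_projection(unnest_input: &Plan) -> Result<&Plan, String> {
    let inner = peel_subquery_aliases(unnest_input);
    if matches!(inner, Plan::Projection { .. }) {
        Ok(inner)
    } else {
        Err(format!("Unnest input is not a Projection: {:?}", inner))
    }
}
```

Looping rather than peeling a single level is a choice made for this sketch; it also covers the case where several alias layers are stacked.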
Thanks @yonatan-sevenai. I think there is another PR from @goldmedal that adds support for the Snowflake dialect (#20648). Maybe you could collaborate on one of the PRs to avoid duplicate work.
Thanks! I believe the implementation I added covers many more use cases, but we'll see if we can collaborate on a single implementation. Hope we can figure out a single stable implementation!
Thanks @yonatan-sevenai, I'll take a look at this PR.
goldmedal left a comment
@yonatan-sevenai Thanks for picking this up — I haven't had bandwidth to finish #20648 recently, but I'd like to help get this landed.
Before a detailed review, I want to discuss the design. LATERAL FLATTEN(INPUT => expr) is Snowflake-specific syntax, and this PR embeds that logic directly in the core unparser. I'd prefer delegating UNNEST-to-table-factor conversion to the Dialect trait — see my detailed comment below.
What do you think?
    /// Unparse the unnest plan as `LATERAL FLATTEN(INPUT => expr, ...)`.
    ///
    /// Snowflake uses FLATTEN as a table function instead of the SQL-standard UNNEST.
    /// When this returns `true`, the unparser emits
    /// `LATERAL FLATTEN(INPUT => <col>, OUTER => <bool>)` in the FROM clause.
    fn unnest_as_lateral_flatten(&self) -> bool {
        false
    }
I'd prefer a trait method that returns an Option rather than a boolean flag. Something like this (the design of #20648):
    fn unparse_unnest_table_factor(
        &self,
        _unnest: &Unnest,
        _columns: &[Ident],
        _unparser: &Unparser,
    ) -> Result<Option<TableFactorBuilder>> {
        Ok(None)
    }

My concern is that LATERAL FLATTEN(INPUT => <col>, OUTER => <bool>) is Snowflake-specific syntax, and this PR embeds a significant amount of that dialect-specific behavior directly in the core unparser (plan.rs). The unparser should ideally stay focused on generic SQL generation, with database-specific behavior delegated to the Dialect trait.
A trait-based approach has a few advantages:
- Isolation: If Snowflake changes the FLATTEN arguments or syntax, only SnowflakeDialect needs updating, not the core unparser.
- Extensibility: Other databases that handle UNNEST differently (e.g., Trino's `CROSS JOIN UNNEST`, BigQuery's table-factor `UNNEST`) can implement the same trait method with their own output, without adding more boolean flags or conditional branches to plan.rs.
- Consistency: This follows the existing pattern in the codebase where dialects override behavior through trait methods that return AST nodes (e.g., `scalar_function_to_sql_overrides`), rather than flags that gate hardcoded logic.
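To make the contrast with a boolean flag concrete, here is a minimal, string-based sketch of an Option-returning hook. The trait, its simplified signature (`&str` and `bool` instead of plan and AST types), and the helper `unnest_table_factor` are assumptions for illustration; they are not the actual DataFusion `Dialect` API.

```rust
/// Hypothetical, string-based stand-in for the unparser's dialect trait.
trait Dialect {
    /// Return `Some(sql)` to override how an UNNEST is rendered as a
    /// FROM-clause table factor, or `None` to keep the generic rendering.
    fn unparse_unnest_table_factor(&self, _expr: &str, _outer: bool) -> Option<String> {
        None
    }
}

/// A dialect that relies entirely on the default (no override).
struct DefaultDialect;
impl Dialect for DefaultDialect {}

/// A dialect that supplies its own table-factor syntax.
struct SnowflakeDialect;
impl Dialect for SnowflakeDialect {
    fn unparse_unnest_table_factor(&self, expr: &str, outer: bool) -> Option<String> {
        Some(format!("LATERAL FLATTEN(INPUT => {}, OUTER => {})", expr, outer))
    }
}

/// Core unparser side: ask the dialect first, then fall back to the
/// generic UNNEST rendering. No dialect-specific branch is needed here.
fn unnest_table_factor(dialect: &dyn Dialect, expr: &str, outer: bool) -> String {
    dialect
        .unparse_unnest_table_factor(expr, outer)
        .unwrap_or_else(|| format!("UNNEST({})", expr))
}
```

Under this shape, supporting another database's syntax would mean adding one more `impl Dialect` block rather than another flag in the core function.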
Great question.
I generally followed the existing design, where the dialect implementations act as policies and do not implement the changes directly. Shifting that design to place business logic inside the dialect is possible, but I'm not sure it's a better strategy.
All in all, there are multiple touch points involved in going from DataFusion's UNNEST forms to Snowflake's LATERAL FLATTEN forms.
Specifically, in the test cases I've added, I've built unnest from the result of a UDF (consider `select UNNEST(ExtractArrayFromCol(col)) from table`) and other cases where the DF optimizer puts Limit, Sort, multiple layers of SubqueryAlias, and other nodes inside the DF logical plan. Handling all of these felt too big and invasive to add to the dialect traits.
But as I said, the real motivation was to keep the dialect as slim as possible (like today), and not to introduce many narrowly scoped, overridable hooks into the unparser that let a dialect structurally alter the unparsed output.
WDYT @goldmedal ?
> Specifically, in the test cases I've added, I've built unnest from the result of a UDF (consider `select UNNEST(ExtractArrayFromCol(col)) from table`) and other cases where the DF optimizer puts Limit, Sort, multiple layers of SubqueryAlias, and other nodes inside the DF logical plan. Handling all of these felt too big and invasive to add to the dialect traits.
Agreed. The plan-tree traversal logic to find the UNNEST belongs in the core unparser — that's generic behavior, not dialect-specific, and it would be too invasive to push into the trait.
> But as I said, the real motivation was to keep the dialect as slim as possible (like today)

I'm not sure that's the unparser dialect's goal now 🤔 (maybe you have discussed this with other people?). In Dialect, there are many methods that produce a result (e.g. scalar_function_to_sql_overrides, col_alias_overrides, timestamp_with_tz_to_string, ...) instead of a bool flag.
My suggestion is more about separating the two concerns:
- Plan traversal stays in the unparser, as you've done
- SQL rendering (the specific `LATERAL FLATTEN(INPUT => expr, OUTER => bool)` syntax) → delegated to the dialect via a trait method
The core unparser would still do all the heavy lifting — the dialect only receives the prepared context and decides how to render the table factor. This keeps the dialect slim (just the rendering logic) while avoiding Snowflake-specific AST construction in plan.rs.
WDYT?
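The proposed split (generic traversal in the core, rendering in the dialect) might look roughly like the sketch below. Everything here is hypothetical: the `Plan` enum, the `UnnestContext` struct standing in for the "prepared context", and the function names are invented for illustration and do not mirror the PR's code.

```rust
/// Hypothetical, simplified plan type for illustration only.
#[allow(dead_code)]
#[derive(Debug)]
enum Plan {
    TableScan(String),
    Unnest { column: String },
    Limit { fetch: usize, input: Box<Plan> },
    Sort { order_by: String, input: Box<Plan> },
    SubqueryAlias { alias: String, input: Box<Plan> },
}

/// The "prepared context" the core unparser would hand to the dialect.
#[derive(Debug, Default, PartialEq)]
struct UnnestContext {
    column: String,
    limit: Option<usize>,
    order_by: Option<String>,
}

/// Core unparser's job: walk through Limit, Sort and SubqueryAlias nodes
/// until an Unnest is found, recording the modifiers seen on the way.
fn find_unnest(mut plan: &Plan) -> Option<UnnestContext> {
    let mut ctx = UnnestContext::default();
    loop {
        match plan {
            Plan::Limit { fetch, input } => {
                ctx.limit = Some(*fetch);
                plan = &**input;
            }
            Plan::Sort { order_by, input } => {
                ctx.order_by = Some(order_by.clone());
                plan = &**input;
            }
            Plan::SubqueryAlias { input, .. } => plan = &**input,
            Plan::Unnest { column } => {
                ctx.column = column.clone();
                return Some(ctx);
            }
            // Any other node: there is no Unnest on this path.
            Plan::TableScan(_) => return None,
        }
    }
}

/// Dialect's job: turn the prepared context into its own table factor.
fn render_flatten(ctx: &UnnestContext) -> String {
    format!("LATERAL FLATTEN(INPUT => {})", ctx.column)
}
```

The rendering function never sees the plan tree, which is the property the separation is meant to preserve.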
Which issue does this PR close?
Rationale for this change
The SQL unparser needs a Snowflake dialect. Basic dialect settings (identifier quoting, `NULLS FIRST`/`NULLS LAST`, timestamp types) are straightforward, but `UNNEST` support required more than configuration.

Snowflake has no `UNNEST` keyword. Its equivalent, `LATERAL FLATTEN(INPUT => expr)`, is a table function in the `FROM` clause with output accessed via `alias."VALUE"`. This differs structurally from standard SQL: the unparser must emit a `FROM`-clause table factor with a `CROSS JOIN` instead of a `SELECT`-clause expression. It also must rewrite column references to point at the FLATTEN output, and handle several optimizer-produced plan shapes (intermediate `Limit`/`Sort` nodes, `SubqueryAlias` wrappers, composed expressions wrapping the unnest output, multi-expression projections). None of this can be expressed through `CustomDialectBuilder`.

What changes are included in this PR?

- `dialect.rs` - New `SnowflakeDialect` with double-quote identifiers, `NULLS FIRST`/`NULLS LAST`, no empty select lists, no column aliases in table aliases, Snowflake timestamp types, and `unnest_as_lateral_flatten()`. Also wired into `CustomDialect`/`CustomDialectBuilder`.
- `ast.rs` - New `FlattenRelationBuilder` that produces `LATERAL FLATTEN(INPUT => expr, OUTER => bool)` table factors, parallel to the existing `UnnestRelationBuilder`.
- `utils.rs` - New `unproject_unnest_expr_as_flatten_value` transform that rewrites unnest placeholder columns to `_unnest.VALUE` references.
- `plan.rs` - Changes to `select_to_sql_recursively`:
  - `Projection` handler scans all expressions for unnest placeholders (not just single-expression projections), then branches into the FLATTEN path or the existing table-factor path.
  - `peel_to_unnest_with_modifiers` walks through `Limit`/`Sort` nodes between `Projection` and `Unnest`, applying their SQL modifiers to the query builder. This handles an optimizer behavior where these nodes are inserted between the two.
  - `peel_to_inner_projection` walks through `SubqueryAlias` to find the inner `Projection` that feeds an `Unnest`.
  - `reconstruct_select_statement` gained FLATTEN-aware expression rewriting and a `has_internal_unnest_alias` predicate to strip internal `UNNEST(...)` display names.
  - `Unnest` handler rejects struct columns for the FLATTEN dialect with a clear error.

Are these changes tested?
Yes. 18 new tests covering:

- … `FROM` (UNNEST in SELECT clause)
- `Limit` between `Projection` and `Unnest`
- `Sort` between `Projection` and `Unnest`
- `Limit` + `SubqueryAlias` combined
- … `CAST`)
- … `Limit`
- … `Limit`
- `SubqueryAlias` between `Unnest` and inner `Projection`

Are there any user-facing changes?
Yes. New public API surface:

- `SnowflakeDialect` struct and its constructor
- `Dialect::unnest_as_lateral_flatten()` method (default `false`)
- `CustomDialectBuilder::with_unnest_as_lateral_flatten()`
- `FlattenRelationBuilder` and `FLATTEN_DEFAULT_ALIAS` in the AST module

None of these are breaking changes, and all previous APIs should work.
The new trait method has a default implementation to ease migration.
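For readers unfamiliar with the output shape, the following is a minimal, string-based sketch of a builder producing the `LATERAL FLATTEN(INPUT => expr, OUTER => bool)` table factor and the matching `alias."VALUE"` column reference. `FlattenBuilder` is a hypothetical name, the `_unnest` default alias is an assumption taken from the `_unnest.VALUE` wording above, and the real `FlattenRelationBuilder` builds AST nodes rather than strings.

```rust
/// Hypothetical string-based builder, loosely modeled on the
/// `FlattenRelationBuilder` described in this PR.
#[derive(Debug, Clone)]
struct FlattenBuilder {
    input: Option<String>,
    outer: bool,
    alias: String,
}

impl FlattenBuilder {
    /// Start with no input, OUTER => false, and an assumed default alias.
    fn new() -> Self {
        Self { input: None, outer: false, alias: "_unnest".to_string() }
    }

    /// Set the expression passed as the INPUT argument.
    fn input(mut self, expr: &str) -> Self {
        self.input = Some(expr.to_string());
        self
    }

    /// Set the OUTER argument.
    fn outer(mut self, outer: bool) -> Self {
        self.outer = outer;
        self
    }

    /// Render the table factor; fails if no INPUT expression was given.
    fn build(&self) -> Result<String, String> {
        let input = self
            .input
            .as_ref()
            .ok_or_else(|| "FLATTEN requires an INPUT expression".to_string())?;
        Ok(format!(
            "LATERAL FLATTEN(INPUT => {}, OUTER => {}) AS {}",
            input, self.outer, self.alias
        ))
    }

    /// The column reference that replaces an unnest placeholder in SELECT.
    fn value_column(&self) -> String {
        format!("{}.\"VALUE\"", self.alias)
    }
}
```

A call such as `FlattenBuilder::new().input("t.arr").build()` would yield the FROM-clause fragment, and `value_column()` gives the reference that the SELECT list would use in place of the internal placeholder name.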