[CALCITE-6104] Aggregate function that references outer column should be evaluated in outer query#5052
[CALCITE-6104] Aggregate function that references outer column should be evaluated in outer query#5052xuzifu666 wants to merge 10 commits into
Conversation
… be evaluated in outer query
|
|
||
| !ok | ||
|
|
||
| # [CALCITE-6104] Aggregate function that references outer column should be evaluated in outer query |
There was a problem hiding this comment.
Please add a comment that these are validated using an independent database.
There was a problem hiding this comment.
Can we also have a test with 2 outer aggregates?
Showing the plan could also help.
There was a problem hiding this comment.
Thanks for the reminder, I had added comment about it and the test case had been added.
But agg.iq to verify the correctness of the numerical results calculated after the Calcite fix within PostgreSQL, the original SQL must be rewritten into standard SQL supported by PostgreSQL.
Validated these sql in https://onecompiler.com/postgresql/44tgmwq6p and result is as expected.
|
|
||
| !ok | ||
|
|
||
| SELECT (SELECT sum(sal) FROM dept) AS sum_sal |
There was a problem hiding this comment.
The blog post quoted by @julianhyde has more examples, please add all of the relevant ones
There was a problem hiding this comment.
Yes, I review the blog overall and add new case for it. The correspondence is as follows:
SELECT (SELECT sum(1) FROM xx LIMIT 1) FROM aa
Aggregate with no column references aggregates at the innermost level.
SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa
Aggregate function that references outer column should be evaluated in outer query.
SELECT (SELECT sum(x) FROM xx LIMIT 1) FROM aa
Aggregate referencing only inner column aggregates at the inner level.
| * <p>For example, | ||
| * <blockquote><pre>SELECT (SELECT sum(a) FROM t LIMIT 1) FROM aa</pre></blockquote> | ||
| * is rewritten to | ||
| * <blockquote><pre>SELECT sum(a) * (SELECT 1 FROM t LIMIT 1) FROM aa</pre></blockquote> |
There was a problem hiding this comment.
From this example it's not obvious to me at all what the general rewrite rule is.
Can you describe the algorithm?
There was a problem hiding this comment.
Okay, I’ve added the logic for the overall process; I hope that clarifies things.
| * </ol> | ||
| * | ||
| * <p>For example, | ||
| * <blockquote><pre>SELECT (SELECT sum(a) FROM t LIMIT 1) FROM aa</pre></blockquote> |
There was a problem hiding this comment.
We cannot read this example because we don't know where column a is.
Can you also add tests where one or both tables are empty or contain only nulls in the aggregated column?
There was a problem hiding this comment.
OK, The Javadoc examples for rewriteOuterAggregatesInSelectList have been improved, explicitly stating that a comes from aa and x comes from t.
Boundary tests have been added to agg.iq: Inner table empty → 1 row NULL
Outer table empty → 1 row NULL (because the outer aggregation has no input rows)
and postgresql test also updated (result is also keep the same with agg.iq, empty results for VALUE are not currently supported, so I am using the SQL statement SELECT 1 FROM emp WHERE 1 = 0 as a substitute.) https://onecompiler.com/postgresql/44tgmwq6p
| !ok | ||
|
|
||
| # [CALCITE-6104] Aggregate function that references outer column should be evaluated in outer query | ||
| # The expected results below have been verified in PostgreSQL by running |
There was a problem hiding this comment.
I am confused, I thought these queries are standard SQL, but, indeed, I cannot run them on postgres or other databases I tried. Isn't this PR supposed to implement a standard SQL feature? Why can't these queries use standard SQL then?
There was a problem hiding this comment.
Yes, this is indeed a confusing point. These correlated-aggregate queries are not standard SQL, and CALCITE-6104 should not implement a SQL standard feature. Instead, it makes Calcite support a non-standard extension that some databases already implement.
The SQL standard does not allow aggregate functions to reference outer columns
In standard SQL, an aggregate function like sum(a) can only reference columns from the SELECT level where it appears. So a query like:
SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa;
will fail on PostgreSQL, Oracle, and MySQL/MariaDB because a belongs to the outer table aa, not the sub-query xx.
But PostgreSQL and others support this extension.
These databases allow an aggregate to "climb up" to the nearest SELECT that contains its free variables. Calcite previously handled this pattern incorrectly (it aggregated inside the sub-query, grouped by the outer rows). The fix makes Calcite behave consistently with those databases.
This test is provided in Jira (julianhyde@3f4cab5) is supported—at least with the current PR applied; these are not standard SQL either.
Additionally, I tested converting these SQL statements to standard SQL; they passed even without this PR, which is why I hold the view mentioned above.
There was a problem hiding this comment.
Thank you for the clarification. Maybe this SQL should not be accepted by default? There are these conformance modes, I think only modes like LENIENT or BABEL should accept this construct.
There was a problem hiding this comment.
You say that postgres accepts these queries, then why do you have to modify the queries to run them on postgres? I couldn't run them as written in Postgres.
There was a problem hiding this comment.
Yes, thanks for pointing this out. The current comment is incorrect, should state "These SQL queries are non-standard and their successful execution is only guaranteed within Calcite for now" makes perfect sense.
Regarding the limitation where correlated aggregate rewriting is only available under modes like LENIENT or BABEL, could log a separate jira to resolve this later(If ok I would create a new jira)? This way, we can track these changes independently and easily gather any additional feedback. From my perspective, this limitation bears little relevance to the current Jira ticket. I’d like to hear if you agree with my view on this.
There was a problem hiding this comment.
I think this will require 3 lines of code and should be done as part of this PR.
There was a problem hiding this comment.
Okay, I'll implement this restriction directly in this PR.
| // Rewrite scalar sub-queries whose single select item is an aggregate | ||
| // over outer columns. The aggregate belongs to the outer query per SQL | ||
| // standard. | ||
| rewriteOuterAggregatesInSelectList(select); |
There was a problem hiding this comment.
call this handleOuterAggregate, and make the function reject it if not in a suitable conformance mode.
There was a problem hiding this comment.
Yes, the latest commit implements controls in this area.
| !ok | ||
|
|
||
| # [CALCITE-6104] Aggregate function that references outer column should be evaluated in outer query | ||
| # The expected results below have been verified in PostgreSQL by running |
There was a problem hiding this comment.
You say that postgres accepts these queries, then why do you have to modify the queries to run them on postgres? I couldn't run them as written in Postgres.
| * <blockquote><pre> | ||
| * WITH aa(a) AS (VALUES 1, 2, 3), | ||
| * t(x) AS (VALUES 10, 20, 30) | ||
| * SELECT (SELECT sum(a) FROM t LIMIT 1) FROM aa |
There was a problem hiding this comment.
Is the LIMIT 1 needed here?
There was a problem hiding this comment.
Not need, I had updated it.
| * <blockquote><pre> | ||
| * WITH aa(a) AS (VALUES 1, 2, 3), | ||
| * t(x) AS (VALUES 10, 20, 30) | ||
| * SELECT sum(a) * (SELECT 1 FROM t LIMIT 1) FROM aa |
There was a problem hiding this comment.
This rewrite will only work for numeric aggregates on types which have a multiplication operation
There was a problem hiding this comment.
I've added this message to the comments first.
|



jira: https://issues.apache.org/jira/browse/CALCITE-6104