Skip to content

Fix expression dimensions causing GROUP BY errors#32

Merged
nicosuave merged 6 commits intomainfrom
nicosuave/dim-expr-fix
Apr 11, 2026
Merged

Fix expression dimensions causing GROUP BY errors#32
nicosuave merged 6 commits intomainfrom
nicosuave/dim-expr-fix

Conversation

@nicosuave
Copy link
Copy Markdown
Member

Closes #29

Summary

  • Expression dimensions like region || 'foo' in SEMANTIC queries now work correctly
  • The correlation condition generator properly qualifies column references within expressions and wraps outer references in ANY_VALUE() so DuckDB accepts them in grouped context
  • Also handles arithmetic expressions (year + 1), CASE expressions, and function call dimensions

What changed

  • Added is_expression_dim() helper to detect dimensions that are expressions (containing operators like ||, +, -, *, /, or function calls / CASE)
  • Updated correlation_exprs_for_dim to use qualify_where_for_inner/qualify_where_for_outer for expression dimensions (previously only handled ( containing expressions, missing || and arithmetic operators)
  • Updated has_expression_dimensions check to use the new detection, ensuring _outer alias is added when needed
  • Added integration tests for string concatenation and arithmetic expression dimensions

When a query uses an expression as a dimension (e.g., `region || 'foo'`),
the correlation condition generator was naively prepending table qualifiers
to the whole expression, producing invalid SQL like `sales_v.region || 'foo'`
where `region` wasn't recognized as being in the GROUP BY.

The fix:
- Add `is_expression_dim()` to detect non-simple dimensions (operators,
  function calls, CASE)
- For expression dimensions, qualify individual column references via
  `qualify_where_for_inner`/`qualify_where_for_outer` and wrap the outer
  side in `ANY_VALUE()` so DuckDB accepts it in grouped context
- Extend `has_expression_dimensions` to detect operator-based expressions
  (not just parenthesized ones), ensuring an `_outer` alias is added
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 336d6e8b32

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 5279 to 5281
let has_expression_dimensions = original_dim_cols.iter().any(|col| is_expression_dim(col));
// Check if query needs an explicit outer alias for correlation handling.
let needs_outer_alias = has_expression_dimensions
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve qualified dim refs when forcing _outer alias

Broadening has_expression_dimensions to use is_expression_dim makes qualified expression dimensions (for example sales_v.region || 'foo') trigger needs_outer_alias, so the outer query is rewritten to FROM sales_v _outer. In that path, expression correlation still keeps already-qualified identifiers as sales_v.* (the qualifier helper does not rewrite existing qualifiers), so the generated correlated predicate can reference sales_v after it has been aliased away, producing a binder error for otherwise valid SQL. This is a regression for qualified arithmetic/concatenation dimensions without an explicit table alias.

Useful? React with 👍 / 👎.

When an expression dimension uses a qualified column (e.g., sales_v.region || 'foo'),
inserting an _outer alias would invalidate the existing qualifier. Fix:

- Revert has_expression_dimensions to only check for '(' (original behavior) so
  operator-based expressions don't trigger unnecessary _outer alias insertion
- Add strip_column_qualifiers() to remove table prefixes before re-qualifying,
  so both qualified (sales_v.region || 'foo') and unqualified (region || 'foo')
  expressions work correctly
- ANY_VALUE() wrapping already handles the GROUP BY constraint without needing
  the _outer alias
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ae1007319

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +3977 to +3979
if chars.peek() == Some(&'.') {
chars.next(); // consume the dot, drop the qualifier
} else {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid stripping non-table qualifiers from expressions

strip_column_qualifiers currently drops every identifier before a dot, so expression dimensions like my_schema.bucket(ts) or payload.city are rewritten to bucket(ts)/city before correlation qualification. That changes semantics (or causes binder errors) for schema-qualified functions and struct/field dereferences, because the generated correlated predicate no longer references the original expression. The previous code path preserved these dotted expressions, so this is a regression introduced by the new qualifier-stripping step.

Useful? React with 👍 / 👎.

strip_column_qualifiers was dropping every identifier.dot prefix, which
would break schema-qualified functions (my_schema.bucket(ts)) and struct
field dereferences (payload.city). Renamed to strip_table_qualifier and
added a table_name parameter so only the known source table prefix is
removed. The outer_alias (which is the source table name when no _outer
alias is injected) is passed as the table to strip.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8d1476ad3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +3980 to +3982
if chars.peek() == Some(&'.') && ident.eq_ignore_ascii_case(table_name) {
chars.next(); // consume the dot, drop this specific qualifier
} else {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve schema-qualified calls when stripping table qualifiers

strip_table_qualifier removes any <ident>. token that matches outer_alias, but that also strips schema-qualified function names when the schema and table alias are the same (for example FROM sales_v s with dimension s.bucket(ts)). In that case the code rewrites the expression to bucket(ts) before qualification, changing function resolution (or causing missing-function errors). This is a regression from the previous path, which preserved function qualifiers; the fresh evidence is that the drop happens unconditionally at the ident.eq_ignore_ascii_case(table_name) check.

Useful? React with 👍 / 👎.

strip_table_qualifier now peeks past the dot: if the next token is an
identifier followed by '(', it is a function call (e.g., s.bucket(ts))
and the qualifier is preserved. Only plain column references
(s.region, s.year) have the table prefix removed.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5285137577

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +3970 to +3974
} else if c.is_alphabetic() || c == '_' {
let mut ident = String::from(c);
while let Some(&next) = chars.peek() {
if next.is_alphanumeric() || next == '_' {
ident.push(chars.next().unwrap());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Strip quoted table qualifiers in expression dimensions

strip_table_qualifier only recognizes qualifiers that begin with alphabetic/underscore characters, so quoted aliases (for example "s".region || 'foo') are never stripped. In the new expression-dimension path, that unstripped qualifier is passed to qualify_where_for_inner, which preserves it and generates an inner correlation predicate that references the outer alias instead of _inner, leading to binder errors when users alias tables with quoted identifiers. This is a regression for ||/arithmetic expression dimensions because those now route through this stripping logic.

Useful? React with 👍 / 👎.

The identifier collector now recognizes double-quoted identifiers
(e.g., "s".region) in addition to bare identifiers, so quoted table
aliases are properly stripped before re-qualification. The unquoted
name is extracted for case-insensitive comparison against table_name,
and quoted column names after the dot are also preserved correctly.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9d1daa044

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}
}
}
let is_function_call = chars.peek() == Some(&'(');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve schema-qualified calls with spaced parentheses

strip_table_qualifier only recognizes a schema-qualified function when the next character after the function name is exactly (. For valid SQL formatted as s.bucket (ts), chars.peek() sees a space, so the code misclassifies s.bucket as table.column and strips s., rewriting the expression to bucket (ts) before re-qualification. This can change function resolution or fail binding when schema and table alias names overlap, so the qualifier check should ignore intervening whitespace.

Useful? React with 👍 / 👎.

strip_table_qualifier now skips spaces/tabs between the function name
and '(' so that `s.bucket (ts)` is correctly recognized as a
schema-qualified function call and not misclassified as table.column.
@nicosuave nicosuave merged commit 5105d15 into main Apr 11, 2026
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expressions in dimensions cause error

1 participant