Fix expression dimensions causing GROUP BY errors#32
Conversation
When a query uses an expression as a dimension (e.g., `region || 'foo'`), the correlation condition generator was naively prepending table qualifiers to the whole expression, producing invalid SQL like `sales_v.region || 'foo'` where `region` wasn't recognized as being in the GROUP BY. The fix: - Add `is_expression_dim()` to detect non-simple dimensions (operators, function calls, CASE) - For expression dimensions, qualify individual column references via `qualify_where_for_inner`/`qualify_where_for_outer` and wrap the outer side in `ANY_VALUE()` so DuckDB accepts it in grouped context - Extend `has_expression_dimensions` to detect operator-based expressions (not just parenthesized ones), ensuring an `_outer` alias is added
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 336d6e8b32
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
yardstick-rs/src/sql/measures.rs
Outdated
| let has_expression_dimensions = original_dim_cols.iter().any(|col| is_expression_dim(col)); | ||
| // Check if query needs an explicit outer alias for correlation handling. | ||
| let needs_outer_alias = has_expression_dimensions |
There was a problem hiding this comment.
Preserve qualified dim refs when forcing _outer alias
Broadening has_expression_dimensions to use is_expression_dim makes qualified expression dimensions (for example sales_v.region || 'foo') trigger needs_outer_alias, so the outer query is rewritten to FROM sales_v _outer. In that path, expression correlation still keeps already-qualified identifiers as sales_v.* (the qualifier helper does not rewrite existing qualifiers), so the generated correlated predicate can reference sales_v after it has been aliased away, producing a binder error for otherwise valid SQL. This is a regression for qualified arithmetic/concatenation dimensions without an explicit table alias.
Useful? React with 👍 / 👎.
When an expression dimension uses a qualified column (e.g., sales_v.region || 'foo'),
inserting an _outer alias would invalidate the existing qualifier. Fix:
- Revert has_expression_dimensions to only check for '(' (original behavior) so
operator-based expressions don't trigger unnecessary _outer alias insertion
- Add strip_column_qualifiers() to remove table prefixes before re-qualifying,
so both qualified (sales_v.region || 'foo') and unqualified (region || 'foo')
expressions work correctly
- ANY_VALUE() wrapping already handles the GROUP BY constraint without needing
the _outer alias
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ae1007319
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
yardstick-rs/src/sql/measures.rs
Outdated
| if chars.peek() == Some(&'.') { | ||
| chars.next(); // consume the dot, drop the qualifier | ||
| } else { |
There was a problem hiding this comment.
Avoid stripping non-table qualifiers from expressions
strip_column_qualifiers currently drops every identifier before a dot, so expression dimensions like my_schema.bucket(ts) or payload.city are rewritten to bucket(ts)/city before correlation qualification. That changes semantics (or causes binder errors) for schema-qualified functions and struct/field dereferences, because the generated correlated predicate no longer references the original expression. The previous code path preserved these dotted expressions, so this is a regression introduced by the new qualifier-stripping step.
Useful? React with 👍 / 👎.
strip_column_qualifiers was dropping every identifier.dot prefix, which would break schema-qualified functions (my_schema.bucket(ts)) and struct field dereferences (payload.city). Renamed to strip_table_qualifier and added a table_name parameter so only the known source table prefix is removed. The outer_alias (which is the source table name when no _outer alias is injected) is passed as the table to strip.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f8d1476ad3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
yardstick-rs/src/sql/measures.rs
Outdated
| if chars.peek() == Some(&'.') && ident.eq_ignore_ascii_case(table_name) { | ||
| chars.next(); // consume the dot, drop this specific qualifier | ||
| } else { |
There was a problem hiding this comment.
Preserve schema-qualified calls when stripping table qualifiers
strip_table_qualifier removes any <ident>. token that matches outer_alias, but that also strips schema-qualified function names when the schema and table alias are the same (for example FROM sales_v s with dimension s.bucket(ts)). In that case the code rewrites the expression to bucket(ts) before qualification, changing function resolution (or causing missing-function errors). This is a regression from the previous path, which preserved function qualifiers; the fresh evidence is that the drop happens unconditionally at the ident.eq_ignore_ascii_case(table_name) check.
Useful? React with 👍 / 👎.
strip_table_qualifier now peeks past the dot: if the next token is an
identifier followed by '(', it is a function call (e.g., s.bucket(ts))
and the qualifier is preserved. Only plain column references
(s.region, s.year) have the table prefix removed.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5285137577
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
yardstick-rs/src/sql/measures.rs
Outdated
| } else if c.is_alphabetic() || c == '_' { | ||
| let mut ident = String::from(c); | ||
| while let Some(&next) = chars.peek() { | ||
| if next.is_alphanumeric() || next == '_' { | ||
| ident.push(chars.next().unwrap()); |
There was a problem hiding this comment.
Strip quoted table qualifiers in expression dimensions
strip_table_qualifier only recognizes qualifiers that begin with alphabetic/underscore characters, so quoted aliases (for example "s".region || 'foo') are never stripped. In the new expression-dimension path, that unstripped qualifier is passed to qualify_where_for_inner, which preserves it and generates an inner correlation predicate that references the outer alias instead of _inner, leading to binder errors when users alias tables with quoted identifiers. This is a regression for ||/arithmetic expression dimensions because those now route through this stripping logic.
Useful? React with 👍 / 👎.
The identifier collector now recognizes double-quoted identifiers (e.g., "s".region) in addition to bare identifiers, so quoted table aliases are properly stripped before re-qualification. The unquoted name is extracted for case-insensitive comparison against table_name, and quoted column names after the dot are also preserved correctly.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a9d1daa044
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| } | ||
| } | ||
| let is_function_call = chars.peek() == Some(&'('); |
There was a problem hiding this comment.
Preserve schema-qualified calls with spaced parentheses
strip_table_qualifier only recognizes a schema-qualified function when the next character after the function name is exactly (. For valid SQL formatted as s.bucket (ts), chars.peek() sees a space, so the code misclassifies s.bucket as table.column and strips s., rewriting the expression to bucket (ts) before re-qualification. This can change function resolution or fail binding when schema and table alias names overlap, so the qualifier check should ignore intervening whitespace.
Useful? React with 👍 / 👎.
strip_table_qualifier now skips spaces/tabs between the function name
and '(' so that `s.bucket (ts)` is correctly recognized as a
schema-qualified function call and not misclassified as table.column.
Closes #29
Summary
region || 'foo'in SEMANTIC queries now work correctlyANY_VALUE()so DuckDB accepts them in grouped contextyear + 1), CASE expressions, and function call dimensionsWhat changed
is_expression_dim()helper to detect dimensions that are expressions (containing operators like||,+,-,*,/, or function calls / CASE)correlation_exprs_for_dimto usequalify_where_for_inner/qualify_where_for_outerfor expression dimensions (previously only handled(containing expressions, missing||and arithmetic operators)has_expression_dimensionscheck to use the new detection, ensuring_outeralias is added when needed