Skip to content

Add parser_override to eliminate SEMANTIC prefix requirement#30

Merged
nicosuave merged 4 commits intomainfrom
nicosuave/parser-replacement
Apr 11, 2026
Merged

Add parser_override to eliminate SEMANTIC prefix requirement#30
nicosuave merged 4 commits intomainfrom
nicosuave/parser-replacement

Conversation

@nicosuave
Copy link
Copy Markdown
Member

Summary

Uses DuckDB's parser_override API (v1.5+) to intercept all queries before the native parser runs, eliminating the need for the SEMANTIC prefix on AGGREGATE() calls. This enables AGGREGATE() in statement types the bind-time fallback couldn't handle (CTAS, INSERT...SELECT). The existing parse_function fallback and yardstick_bind operator extension are preserved for backwards compatibility.

  • parser_override set to fallback mode: yardstick gets first crack at every query, non-yardstick SQL falls through to the native parser
  • Filters AGGREGATE() argument extraction to simple identifiers so DuckDB's built-in aggregate([list], 'fn') list function is not intercepted
  • Validates expanded SQL parses before committing to it, falling through on failure
  • Full regression suite: 1085 original assertions re-run without any SEMANTIC prefix, plus CTAS/INSERT/list-aggregate compatibility tests (2204 total assertions)

Use DuckDB's parser_override API (v1.5+) to intercept queries before
the native parser, allowing AGGREGATE() to work in all statement types
(SELECT, CTAS, INSERT...SELECT) without the SEMANTIC prefix. The
existing parse_function fallback and yardstick_bind operator extension
are kept for backwards compatibility. Also filters AGGREGATE() argument
extraction to simple identifiers so DuckDB's built-in list aggregate()
function is not intercepted.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dbb7c0afb0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +450 to +455
if (result.error) {
// Expansion failed: this might not be a yardstick AGGREGATE() call
// (e.g. DuckDB's built-in list aggregate function). Fall through to
// the native parser in case it can handle the query.
yardstick_free_aggregate_result(result);
return ParserOverrideResult();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep AGGREGATE rewrite path for CTAS/INSERT errors

When yardstick_expand_aggregate returns an error, this branch always falls back to DuckDB's native parser, which leaves the original SQL unchanged. That breaks the new non-SEMANTIC path for statement types like CREATE TABLE ... AS SELECT and INSERT ... SELECT: expand_aggregate still relies on parse_select, and yardstick_parse_select explicitly rejects non-SELECT top-level statements (src/yardstick_parser_ffi.cpp around lines 1098-1100), so these queries are never rewritten and end up failing on DuckDB's built-in aggregate function instead of executing Yardstick semantics.

Useful? React with 👍 / 👎.

// Only include calls WITHOUT AT modifier, and only when the argument
// is a simple identifier (measure name). This avoids intercepting
// DuckDB's built-in aggregate([list], 'fn') list function.
if modifiers.is_empty() && parse_simple_measure_ref(measure).is_some() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Accept quoted measure names in aggregate extraction

The new simple-reference gate causes extract_all_aggregate_calls to skip valid SQL identifiers that require quoting (for example AGGREGATE("total revenue")), because parse_simple_measure_ref rejects spaces and other quoted-identifier characters. Before this change, plain AGGREGATE(...) calls were still extracted; now these quoted measure names are not rewritten and will fall through to DuckDB's unrelated aggregate function path.

Useful? React with 👍 / 👎.

Fix parse_simple_measure_ref to allow any characters inside matched
quotes (double-quotes, backticks, brackets) so AGGREGATE("col_name")
is correctly recognized as a measure reference. Add test for quoted
measure names.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cfd15a3a9f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +482 to +484
// Wrap in table function call and parse with DuckDB's native parser
string wrapper_sql = "SELECT * FROM yardstick('" + escaped_sql + "')";

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve statement context when rewriting non-SELECT queries

This override always rewrites successful AGGREGATE() expansions into SELECT * FROM yardstick('...'), even when the original statement was CREATE TABLE ... AS SELECT or INSERT ... SELECT. That routes execution through YardstickQueryBind, which runs con.Query(...) on a new Connection during bind time, so side effects occur outside the caller's transaction/temp-schema context (e.g., rollback does not reliably undo the write). For non-SELECT statements, return parsed expanded_sql directly instead of wrapping it in the table function.

Useful? React with 👍 / 👎.

// Only include calls WITHOUT AT modifier, and only when the argument
// is a simple identifier (measure name). This avoids intercepting
// DuckDB's built-in aggregate([list], 'fn') list function.
if modifiers.is_empty() && parse_simple_measure_ref(measure).is_some() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow 3+ part qualified measure refs in AGGREGATE extraction

extract_all_aggregate_calls now drops plain AGGREGATE(...) calls unless parse_simple_measure_ref succeeds, but that parser only accepts one- or two-part identifiers. Valid SQL references like AGGREGATE(schema.view.measure) (or catalog-qualified variants) are therefore skipped, so the rewrite leaves AGGREGATE intact and the query falls through to DuckDB's native aggregate path instead of Yardstick semantics.

Useful? React with 👍 / 👎.

…ion context

For CTAS and INSERT...SELECT, return the expanded SQL as parsed
statements instead of wrapping in yardstick() table function. The
table function executes via con.Query() on a fresh Connection, which
runs outside the caller's transaction context. SELECT statements
still use the table function wrapper for the second expansion pass.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bfafe8d1dc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// Only include calls WITHOUT AT modifier, and only when the argument
// is a simple identifier (measure name). This avoids intercepting
// DuckDB's built-in aggregate([list], 'fn') list function.
if modifiers.is_empty() && parse_simple_measure_ref(measure).is_some() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Accept spaced qualified refs when extracting AGGREGATE calls

The new simple-reference filter now rejects any whitespace outside quoted identifiers, so valid SQL like AGGREGATE(o . sumRevenue) (or AGGREGATE(o. sumRevenue)) is skipped by extract_all_aggregate_calls. When this happens, Yardstick leaves the call unreplaced and execution falls through to DuckDB's built-in aggregate function (which expects a list + function name), causing these previously-rewritable measure queries to fail.

Useful? React with 👍 / 👎.

parse_simple_measure_ref now accepts spaces around the dot in
qualified identifiers (e.g. "s . revenue", "s. revenue") and trims
parts before normalization. Add tests for spaced qualified refs.
@nicosuave nicosuave merged commit be64a96 into main Apr 11, 2026
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant