Skip to content

[Enhancement] Push eventstats down by rewriting RexOver to Join + Aggregate (#5483)#5495

Draft
RyanL1997 wants to merge 8 commits into
opensearch-project:mainfrom
RyanL1997:worktree-eventstats-pushdown-5483
Draft

[Enhancement] Push eventstats down by rewriting RexOver to Join + Aggregate (#5483)#5495
RyanL1997 wants to merge 8 commits into
opensearch-project:mainfrom
RyanL1997:worktree-eventstats-pushdown-5483

Conversation

@RyanL1997
Copy link
Copy Markdown
Collaborator

Description

PPL eventstats lowers to LogicalProject(RexOver(...)) directly above the scan. No rule in OpenSearchIndexRules.OPEN_SEARCH_PUSHDOWN_RULES matches that shape — every AggregateIndexScanRule config requires LogicalAggregate at the operand root, and RareTopPushdownRule requires a ROW_NUMBER window with a LESS_THAN_OR_EQUAL filter above. The plan therefore reaches Volcano with RexOver intact, becomes EnumerableWindow, and the scan beneath it stays in _source + requestedTotalSize=MAX_INT mode. On 47B-doc indices the coordinator times out streaming every matching document just to count it. Same pathological behavior with BY (the production query in the issue).

This PR rewrites Window AST nodes in CalciteRelNodeVisitor.visitWindow into Project → Join → (input, Aggregate(input)). The right-side Aggregate sits directly over a re-pushed copy of the input, matching AggregateIndexScanRule.AGGREGATE_SCAN (no-BY) or DEFAULT / BUCKET_NON_NULL_AGG (BY). OpenSearch sees the same shape as stats count()size:0 + track_total_hits or a terms aggregation — instead of an unsized row fetch. The left side returns rows as before; the join broadcasts the aggregate value(s) onto each row, preserving the [original cols, agg cols] row type so downstream consumers (limit, head, fields) see no change.

Design

Follows the precedent in buildStreamWindowSelfJoinPlan (CalciteRelNodeVisitor.java:2348-2362):

  • Join, not LogicalCorrelate, because LogicalCorrelate causes NPE in RelDecorrelator per the existing comment.
  • NULL bucket handling mirrors lines 2442-2449 of the streamstats self-join: bucketNullable=trueINNER join with IS NOT DISTINCT FROM ((left.k = right.k) OR (both NULL)); bucketNullable=falseLEFT join with simple equality, IS NOT NULL filter pushed below the right aggregate to match BUCKET_NON_NULL_AGG. NULL-keyed left rows are preserved with NULL aggregate values, matching the previous CASE-wrapped behavior.
  • aggregateWithTrimming is reused for the right-side aggregate construction, so agg-resolution semantics are identical to stats/streamstats.

Rewriteability predicate (canRewriteWindowAsAggregateJoin) rejects: non-aggregate window functions (ROW_NUMBER / LAG / etc.), non-empty sortList, non-default frame, non-bare-field partition keys. Anything outside the eventstats shape falls through to visitWindowAsRexOver, preserving existing behavior for any future Window producer.

Coverage against the issue's 7-requirement comment

Comment:

  1. No-BY + BY both addressed in a single rewrite; hasGroup toggles cross-join vs equi-join. ✓
  2. All aggregate functions — uses aggregateWithTrimming (same path stats uses), supports COUNT/SUM/AVG/MIN/MAX/STDDEV/VAR/etc. uniformly. ✓
  3. Multiple aggregates per call packed into one right-side Aggregate with multiple agg calls, one broadcast join. ✓
  4. NULL semantics on BY join-backIS NOT DISTINCT FROM for bucketNullable=true; LEFT join + right-side IS NOT NULL filter for bucketNullable=false. Equivalent to the existing Window+CASE behavior. ✓
  5. Tight predicate rejects ordered/framed/non-aggregate/non-bare-field cases. ✓
  6. No decorrelation dependency — rewrite directly produces the operand chain AggregateIndexScanRule expects. ✓
  7. No symptom-masking shortcutsPredicate.not(LogicalProject::containsOver) guards in AggregateIndexScanRule are untouched. Real Window→Aggregate rewrite at the AST level. ✓

Verification — status

Draft because the following needs CI to finalize, blocked locally by a pre-existing core:compileJava failure on AnalyticsExecutionEngine.java:92 (unrelated analytics-api:3.7.0-SNAPSHOT symbol issue that also reproduces on main with no edits):

  • CalcitePPLEventstatsTest verifyLogical expectations have been updated to my pattern-derived prediction of the new lowered shape. May need 1-2 small adjustments (trim-project corner cases) based on actual output. verifyPPLToSparkSQL calls are temporarily removed pending observation of the SparkSqlDialect output for the join+aggregate form.
  • Existing eventstats EXPLAIN expected-output files (explain_eventstats_avg.json, _dc.json, _distinct_count.json, _earliest_latest.json, _earliest_latest_custom_time.json, _earliest_latest_no_group.json, _null_bucket.yaml) will diff against the new plan shape. Regeneration from CI logs incoming as a follow-up commit.
  • A new positive-pushdown EXPLAIN assertion (testEventstatsPushdownExplain mirroring stats count() — asserting AGGREGATION->...COUNT() in PushDownContext and track_total_hits in sourceBuilder) incoming.
  • NULL-bucket BY integration tests in CalcitePPLEventstatsIT against TEST_INDEX_STATE_COUNTRY_WITH_NULL (both bucketNullable=true and =false) incoming.

Adjacent unit suites (CalcitePPLAggregationTest, CalcitePPLStreamstatsTest, CalcitePPLRareTest, CalcitePPLTopTest) should all be unchanged — different visitor paths or StreamWindow/RareTopN AST nodes.

Related Issues

Resolves #5483

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using `--signoff` or `-s`.
  • Public documentation issue/PR created.

opensearch-project#5483)

PPL eventstats lowers to LogicalProject(RexOver(...)) above the scan. No
rule in OpenSearchIndexRules.OPEN_SEARCH_PUSHDOWN_RULES matches that
shape: every AggregateIndexScanRule config requires LogicalAggregate at
the operand root, and RareTopPushdownRule requires a ROW_NUMBER window
with a LESS_THAN_OR_EQUAL filter above it. The plan therefore reaches
Volcano with RexOver intact, gets converted to EnumerableWindow, and the
scan beneath it stays in _source-includes + requestedTotalSize=MAX_INT
mode, streaming every matching document to the coordinator just to
count it. On 47B-doc indices this times out.

This change rewrites Window AST nodes in CalciteRelNodeVisitor.visitWindow
into a Join + Aggregate plan: the right side is an Aggregate over a
re-pushed copy of the input, which matches AggregateIndexScanRule and
pushes down to OpenSearch as size:0 + track_total_hits (no-BY) or a
terms aggregation (BY). The left side returns rows as before. The join
broadcasts the aggregate value(s) onto each row, preserving the row type
[original cols, agg cols] that the legacy lowering produced so
downstream consumers see no shape change.

NULL-bucket semantics:
- bucketNullable=true: INNER join with IS NOT DISTINCT FROM on each
  partition key, so the NULL bucket on each side matches and NULL-keyed
  left rows still receive the NULL-bucket aggregate value.
- bucketNullable=false: LEFT join with simple equality, IS NOT NULL
  filter pushed below the right aggregate to match the BUCKET_NON_NULL_AGG
  pushdown shape stats already uses. NULL-keyed left rows survive with a
  NULL aggregate value, matching the previous CASE-wrapped behavior.

The rewriteability predicate (canRewriteWindowAsAggregateJoin) rejects
non-aggregate window functions (ROW_NUMBER / LAG / etc.), non-empty sort
lists, non-default frames, and non-bare-field partition keys. Anything
outside the eventstats shape falls through to visitWindowAsRexOver,
preserving existing behavior for any future Window producer.

Follows the precedent in buildStreamWindowSelfJoinPlan: uses Join (not
LogicalCorrelate, which causes NPE in RelDecorrelator per the comment at
CalciteRelNodeVisitor.java:2348-2352) and mirrors the canonical NULL
bucket handling at lines 2442-2449. Reuses aggregateWithTrimming for
the right-side aggregate construction so agg-resolution semantics are
identical to stats and streamstats.

CalcitePPLEventstatsTest verifyLogical expectations are updated to the
new lowered shape. verifyPPLToSparkSQL assertions are temporarily
removed pending observation of the SparkSqlDialect output for the
join+aggregate form; the previous window-form expectations no longer
apply.

Draft: existing CalciteExplainIT eventstats expected-output files and
new NULL-bucket BY integration tests in CalcitePPLEventstatsIT will be
added in follow-up commits once CI confirms the lowered shape is exact.

Resolves opensearch-project#5483

Signed-off-by: Jialiang Liang <ryanleeang@gmail.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

PR Reviewer Guide 🔍

(Review updated until commit 6f333fc)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

In rewriteWindowAsAggregateJoin, when hasGroup is false, the code projects a literal-0 key onto both sides and then references these keys by absolute position (leftFieldCount on left, aggCount on right). However, leftFieldCount is the count of fields in leftInput before the literal-0 key is appended. After projectPlus adds the key, the left side has leftFieldCount + 1 fields, so the key is at position leftFieldCount, not leftFieldCount itself. The subsequent field(2, 0, leftFieldCount) call at line 2204 will reference the wrong column (the last original column instead of the appended key), causing the join condition to be incorrect and potentially leading to a cross-join or wrong results.

RelNode leftForJoin = leftInput;
RelNode rightForJoin = rightAggregate;
int leftJoinKeyOffset = 0;
if (!hasGroup) {
  context.relBuilder.push(leftInput);
  context.relBuilder.projectPlus(
      context.relBuilder.alias(context.relBuilder.literal(0), EVENTSTATS_NOGROUP_JOIN_KEY));
  leftForJoin = context.relBuilder.build();
  context.relBuilder.push(rightAggregate);
  context.relBuilder.projectPlus(
      context.relBuilder.alias(context.relBuilder.literal(0), EVENTSTATS_NOGROUP_JOIN_KEY));
  rightForJoin = context.relBuilder.build();
  leftJoinKeyOffset = 1;
}

context.relBuilder.push(leftForJoin);
context.relBuilder.push(rightForJoin);

RexNode joinCondition;
if (!hasGroup) {
  // Equi-join on the literal-0 keys we just projected (positions: leftFieldCount on the left
  // input post-projection, aggCount on the right after the agg outputs).
  RexNode leftKey = context.relBuilder.field(2, 0, leftFieldCount);
  RexNode rightKey = context.relBuilder.field(2, 1, aggCount);
  joinCondition = context.relBuilder.equals(leftKey, rightKey);
Possible Issue

The extractFieldName method at line 2399 calls toString() on QualifiedName and Field.getField() to produce the field name used in relBuilder.field(2, side, name) for the join condition. However, QualifiedName.toString() may include table qualifiers or other formatting that does not match the actual field name in the row type's field list. If the row type uses simple unqualified names but toString() returns a qualified name, the field(...) lookup will fail or reference the wrong column, breaking the join condition for BY-case eventstats.

private static String extractFieldName(UnresolvedExpression expr) {
  if (expr instanceof Field f) {
    return f.getField().toString();
  }
  if (expr instanceof QualifiedName qn) {
    return qn.toString();
  }
  if (expr instanceof Alias a) {
    return extractFieldName(a.getDelegated());
  }
  throw new IllegalArgumentException(
      "Cannot extract field name from non-field expression: " + expr);
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

PR Code Suggestions ✨

Latest suggestions up to 6f333fc

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Validate partition key extraction for filtering

When bucketNullable=false, the code filters IS NOT NULL on partition keys before
aggregation. However, if PlanUtils.getSelectColumns(groupRex) returns an empty list
or fails to extract columns correctly, the filter is silently skipped. Verify that
getSelectColumns returns the expected columns and log a warning if the extraction
fails to ensure the NULL-filtering logic is applied as intended.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2154-2163]

 List<RexNode> groupRex =
     groupList.stream().map(expr -> rexVisitor.analyze(expr, context)).toList();
+List<RexNode> selectColumns = PlanUtils.getSelectColumns(groupRex);
+if (selectColumns.isEmpty() && !groupRex.isEmpty()) {
+  throw new IllegalStateException("Failed to extract partition key columns for NULL filtering");
+}
 List<RexNode> isNotNullList =
-    PlanUtils.getSelectColumns(groupRex).stream()
+    selectColumns.stream()
         .map(context.relBuilder::field)
         .map(context.relBuilder::isNotNull)
         .toList();
 if (!isNotNullList.isEmpty()) {
   context.relBuilder.filter(isNotNullList);
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion validates that PlanUtils.getSelectColumns() successfully extracts columns before applying NULL filtering. This is a reasonable defensive check to ensure the NULL-filtering logic is applied correctly when bucketNullable=false. However, the suggestion throws an exception rather than logging a warning as described, and it's unclear if getSelectColumns can legitimately return empty for non-empty groupRex. The improvement adds validation but may be overly strict.

Low
Prevent unbounded recursion in extractFieldName

The recursive call in extractFieldName when handling Alias can lead to unbounded
recursion if there are circular or deeply nested alias chains. Consider adding a
depth limit or cycle detection to prevent stack overflow in pathological cases.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2399-2411]

 private static String extractFieldName(UnresolvedExpression expr) {
+  return extractFieldNameWithDepth(expr, 0, 100);
+}
+
+private static String extractFieldNameWithDepth(UnresolvedExpression expr, int depth, int maxDepth) {
+  if (depth > maxDepth) {
+    throw new IllegalArgumentException("Maximum alias nesting depth exceeded for expression: " + expr);
+  }
   if (expr instanceof Field f) {
     return f.getField().toString();
   }
   if (expr instanceof QualifiedName qn) {
     return qn.toString();
   }
   if (expr instanceof Alias a) {
-    return extractFieldName(a.getDelegated());
+    return extractFieldNameWithDepth(a.getDelegated(), depth + 1, maxDepth);
   }
   throw new IllegalArgumentException(
       "Cannot extract field name from non-field expression: " + expr);
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion addresses a potential stack overflow from deeply nested or circular Alias chains by adding depth limiting. While this is a valid concern for robustness, the likelihood of encountering such pathological cases in practice is low given the controlled nature of the AST construction. The improvement adds safety but is not critical for typical usage.

Low
Possible issue
Add null-safety check for getSortList

The method checks wf.getSortList().isEmpty() but does not verify that getSortList()
itself is non-null before calling isEmpty(). If getSortList() returns null, this
will throw a NullPointerException. Add a null-safety check before invoking
isEmpty().

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2297-2299]

 private static boolean canRewriteWindowAsAggregateJoin(Window node) {
   if (node.getWindowFunctionList().isEmpty()) {
     return false;
   }
   for (UnresolvedExpression expr : node.getWindowFunctionList()) {
     UnresolvedExpression inner = (expr instanceof Alias a) ? a.getDelegated() : expr;
     if (!(inner instanceof WindowFunction wf)) {
       return false;
     }
     String funcName = extractAggregateFunctionName(wf.getFunction());
     if (funcName == null) {
       return false;
     }
     Optional<BuiltinFunctionName> windowName = BuiltinFunctionName.ofWindowFunction(funcName);
     if (windowName.isEmpty()) {
       return false;
     }
     String canonical = canonicalAggregationName(windowName.get());
     if (BuiltinFunctionName.ofAggregation(canonical).isEmpty()) {
       return false;
     }
-    if (!wf.getSortList().isEmpty()) {
+    if (wf.getSortList() != null && !wf.getSortList().isEmpty()) {
       return false;
     }
     if (wf.getWindowFrame() != null
         && !Objects.equals(wf.getWindowFrame(), WindowFrame.rowsUnbounded())) {
       return false;
     }
   }
   if (node.getGroupList() != null) {
     for (UnresolvedExpression expr : node.getGroupList()) {
       if (!isBareFieldReference(expr)) {
         return false;
       }
     }
   }
   return true;
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion adds a null check for getSortList() before calling isEmpty(). While this is a defensive programming practice, the existing code pattern !wf.getSortList().isEmpty() is commonly used in Java and would only fail if the API contract is violated. Without evidence that getSortList() can return null, this is a minor defensive improvement rather than a critical fix.

Low

Previous suggestions

Suggestions up to commit bb8d9b2
CategorySuggestion                                                                                                                                    Impact
General
Validate partition keys are non-dotted

The join condition construction references partition keys by name via
relBuilder.field(2, side, keyName), which does not perform nested-field resolution.
If a dotted path (e.g., doc.user.city) bypasses the isBareFieldReference check due
to a future code change, the join will silently fail to match rows. Add a runtime
assertion to verify that keyName contains no dots, catching any validation gaps
early.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2181-2194]

 for (UnresolvedExpression groupExpr : groupList) {
   String keyName = extractFieldName(groupExpr);
+  if (keyName.contains(".")) {
+    throw new IllegalStateException("Dotted partition key bypassed validation: " + keyName);
+  }
   RexNode leftKey = context.relBuilder.field(2, 0, keyName);
   RexNode rightKey = context.relBuilder.field(2, 1, keyName);
   RexNode eq = context.relBuilder.equals(leftKey, rightKey);
   if (bucketNullable) {
     RexNode bothNull =
         context.relBuilder.and(
             context.relBuilder.isNull(leftKey), context.relBuilder.isNull(rightKey));
     perKeyConditions.add(context.relBuilder.or(eq, bothNull));
   } else {
     perKeyConditions.add(eq);
   }
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion adds a defensive runtime check to catch validation gaps if dotted paths bypass isBareFieldReference. This is a reasonable safeguard that could prevent silent failures, though the existing validation in canRewriteWindowAsAggregateJoin should already prevent this scenario. The check adds minimal overhead and improves robustness.

Low
Deduplicate partition keys before filtering

When bucketNullable=false, the code filters IS NOT NULL on each partition key before
grouping. However, if groupRex contains duplicate columns (e.g., BY field1, field1),
PlanUtils.getSelectColumns may return duplicates, causing redundant IS NOT NULL
filters on the same field. Consider deduplicating the column list before building
the filter to avoid unnecessary predicates.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2156-2163]

 List<RexNode> isNotNullList =
     PlanUtils.getSelectColumns(groupRex).stream()
+        .distinct()
         .map(context.relBuilder::field)
         .map(context.relBuilder::isNotNull)
         .toList();
 if (!isNotNullList.isEmpty()) {
   context.relBuilder.filter(isNotNullList);
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies a potential issue where duplicate partition keys could lead to redundant IS NOT NULL filters. However, this is a minor optimization that would only affect edge cases with duplicate BY keys, which are unlikely in practice. The .distinct() call is a reasonable safeguard.

Low
Use IllegalStateException for internal errors

The method throws IllegalArgumentException for unsupported expression types, but
canRewriteWindowAsAggregateJoin already validates that all partition keys are bare
field references. If an invalid expression reaches this point, it indicates a logic
error in the validation predicate. Consider using IllegalStateException to signal an
internal inconsistency rather than invalid user input.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2367-2379]

 private static String extractFieldName(UnresolvedExpression expr) {
   if (expr instanceof Field f) {
     return f.getField().toString();
   }
   if (expr instanceof QualifiedName qn) {
     return qn.toString();
   }
   if (expr instanceof Alias a) {
     return extractFieldName(a.getDelegated());
   }
-  throw new IllegalArgumentException(
+  throw new IllegalStateException(
       "Cannot extract field name from non-field expression: " + expr);
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion is technically correct that IllegalStateException better signals an internal inconsistency since canRewriteWindowAsAggregateJoin validates inputs. However, this is a minor semantic improvement that doesn't affect functionality. The choice between exception types is somewhat subjective.

Low
Suggestions up to commit 515f35e
CategorySuggestion                                                                                                                                    Impact
General
Add validation before field extraction

The method throws IllegalArgumentException for unexpected expression types, but
canRewriteWindowAsAggregateJoin only validates that partition keys are bare field
references. If validation logic changes or is bypassed, this could cause runtime
failures. Add a defensive check in rewriteWindowAsAggregateJoin before calling
extractFieldName to ensure all partition keys pass isBareFieldReference.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2336-2348]

 private static String extractFieldName(UnresolvedExpression expr) {
+  if (!isBareFieldReference(expr)) {
+    throw new IllegalArgumentException(
+        "Expected bare field reference, got: " + expr);
+  }
   if (expr instanceof Field f) {
     return f.getField().toString();
   }
   if (expr instanceof QualifiedName qn) {
     return qn.toString();
   }
   if (expr instanceof Alias a) {
     return extractFieldName(a.getDelegated());
   }
   throw new IllegalArgumentException(
       "Cannot extract field name from non-field expression: " + expr);
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion adds a redundant check since extractFieldName is only called after canRewriteWindowAsAggregateJoin validates all partition keys via isBareFieldReference. The existing validation logic already ensures this precondition, making the defensive check unnecessary.

Low
Suggestions up to commit f19e1d9
CategorySuggestion                                                                                                                                    Impact
General
Validate join output field count

The final projection assumes the join output has exactly leftFieldCount +
rightGroupKeyCount + aggCount fields, but if the right aggregate produces a
different number of columns (e.g., due to a mismatch between groupList.size() and
the actual group keys in the aggregate), context.relBuilder.field(rightAggStart + i)
will reference an out-of-bounds field index. Add a bounds check or assertion to
verify the join output row type matches the expected shape before projecting.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2212-2216]

 int rightAggStart = leftFieldCount + rightGroupKeyCount;
+int joinFieldCount = context.relBuilder.peek().getRowType().getFieldCount();
+int expectedFieldCount = leftFieldCount + rightGroupKeyCount + aggCount;
+if (joinFieldCount != expectedFieldCount) {
+  throw new IllegalStateException("Join output field count mismatch: expected " + expectedFieldCount + ", got " + joinFieldCount);
+}
 for (int i = 0; i < aggCount; i++) {
   finalProjects.add(context.relBuilder.field(rightAggStart + i));
   finalNames.add(extractAliasName(node.getWindowFunctionList().get(i)));
 }
Suggestion importance[1-10]: 7

__

Why: This suggestion addresses a potential runtime error if the join output doesn't match the expected field count. Adding a bounds check before field projection would prevent out-of-bounds access and make the code more defensive against unexpected aggregate output shapes.

Medium
Validate non-empty NULL filter generation

The bucketNullable=false path filters IS NOT NULL on partition keys before
aggregation, but if PlanUtils.getSelectColumns(groupRex) returns an empty list
(e.g., if groupRex contains only literals or complex expressions), the filter is
skipped silently. This could allow NULL-keyed rows into the aggregate when they
should be excluded. Verify that getSelectColumns always returns at least one column
reference for each partition key, or add a check to ensure the filter is non-empty.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2152-2163]

 if (hasGroup && !bucketNullable) {
   List<RexNode> groupRex =
       groupList.stream().map(expr -> rexVisitor.analyze(expr, context)).toList();
   List<RexNode> isNotNullList =
       PlanUtils.getSelectColumns(groupRex).stream()
           .map(context.relBuilder::field)
           .map(context.relBuilder::isNotNull)
           .toList();
-  if (!isNotNullList.isEmpty()) {
-    context.relBuilder.filter(isNotNullList);
+  if (isNotNullList.isEmpty()) {
+    throw new IllegalStateException("Expected non-empty IS NOT NULL filter for bucketNullable=false with groupList: " + groupList);
   }
+  context.relBuilder.filter(isNotNullList);
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion identifies a potential edge case where PlanUtils.getSelectColumns might return an empty list, which could lead to incorrect NULL handling. However, the existing code already has a check if (!isNotNullList.isEmpty()), so the concern is partially addressed. Adding an explicit validation would improve robustness.

Low
Handle unexpected expression types defensively

The extractFieldName method can throw an IllegalArgumentException for unexpected
expression types, but rewriteWindowAsAggregateJoin calls it on every groupList entry
without handling this exception. If canRewriteWindowAsAggregateJoin incorrectly
allows a non-field expression through (e.g., due to a future code change), the
rewrite will fail at runtime. Add defensive validation or document the precondition
that canRewriteWindowAsAggregateJoin guarantees all group expressions pass
isBareFieldReference.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2336-2348]

 private static String extractFieldName(UnresolvedExpression expr) {
   if (expr instanceof Field f) {
     return f.getField().toString();
   }
   if (expr instanceof QualifiedName qn) {
     return qn.toString();
   }
   if (expr instanceof Alias a) {
     return extractFieldName(a.getDelegated());
   }
-  throw new IllegalArgumentException(
-      "Cannot extract field name from non-field expression: " + expr);
+  throw new IllegalStateException(
+      "extractFieldName called on non-field expression (should be prevented by canRewriteWindowAsAggregateJoin): " + expr);
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion to change IllegalArgumentException to IllegalStateException is a minor improvement in exception semantics, but the method is already protected by canRewriteWindowAsAggregateJoin which validates all group expressions. The impact is low since the precondition is already enforced.

Low
Suggestions up to commit 15c52f7
CategorySuggestion                                                                                                                                    Impact
Possible issue
Validate filter count matches group key count

When bucketNullable=false, the code filters out NULL partition keys before
aggregation. However, PlanUtils.getSelectColumns(groupRex) may return an empty list
if groupRex contains only literals or complex expressions that don't resolve to
column references. In such cases, no filter is applied, and NULL rows may
incorrectly remain in the aggregate. Verify that getSelectColumns always returns the
expected columns for the partition keys, or add validation to ensure the filter list
matches the group key count.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2153-2163]

 List<RexNode> isNotNullList =
     PlanUtils.getSelectColumns(groupRex).stream()
         .map(context.relBuilder::field)
         .map(context.relBuilder::isNotNull)
         .toList();
+if (isNotNullList.size() != groupList.size()) {
+  throw new IllegalStateException(
+      "Mismatch between group keys and IS NOT NULL filters: expected " + groupList.size() + " but got " + isNotNullList.size());
+}
 if (!isNotNullList.isEmpty()) {
   context.relBuilder.filter(isNotNullList);
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion identifies a potential edge case where PlanUtils.getSelectColumns(groupRex) might not return the expected number of columns. Adding validation would catch mismatches early, but the concern is somewhat speculative since groupRex is derived from validated groupList entries that pass isBareFieldReference.

Low
General
Validate aggregate function argument requirements

When converting a Function to AggregateFunction, the code sets field to null if args
is empty. However, AggregateFunction may not handle a null field correctly for all
aggregate types (e.g., COUNT() expects no field, but other aggregates like SUM
require one). If an aggregate function is called without arguments when it requires
them, this could lead to incorrect results or runtime errors downstream. Validate
that the function name is compatible with a null field before constructing the
AggregateFunction.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2323-2330]

 if (fn instanceof Function f) {
   List<UnresolvedExpression> args = f.getFuncArgs();
+  String funcName = f.getFuncName();
   UnresolvedExpression field = args.isEmpty() ? null : args.get(0);
+  // COUNT() is the only common aggregate that accepts no field argument
+  if (field == null && !funcName.equalsIgnoreCase("count")) {
+    throw new IllegalArgumentException(
+        "Aggregate function " + funcName + " requires at least one argument");
+  }
   List<UnresolvedExpression> argList =
       args.size() <= 1 ? List.of() : args.subList(1, args.size());
-  AggregateFunction agg = new AggregateFunction(f.getFuncName(), field, argList);
+  AggregateFunction agg = new AggregateFunction(funcName, field, argList);
   return agg;
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion adds validation for aggregate functions that require arguments, which could prevent runtime errors. However, canRewriteWindowAsAggregateJoin already filters for functions present in both window and aggregation maps, reducing the likelihood of invalid configurations. The check is defensive but not critical.

Low
Add defensive error handling for unexpected types

The extractFieldName method can throw an IllegalArgumentException if the expression
type is unexpected, but this exception is not caught in
rewriteWindowAsAggregateJoin. Since canRewriteWindowAsAggregateJoin already
validates that all partition keys are bare field references via
isBareFieldReference, this exception should never occur in practice. However, if the
validation logic has a gap or the expression structure changes unexpectedly, the
uncaught exception will propagate and potentially crash the query. Add defensive
error handling or document the precondition clearly.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2336-2348]

 private static String extractFieldName(UnresolvedExpression expr) {
   if (expr instanceof Field f) {
     return f.getField().toString();
   }
   if (expr instanceof QualifiedName qn) {
     return qn.toString();
   }
   if (expr instanceof Alias a) {
     return extractFieldName(a.getDelegated());
   }
-  throw new IllegalArgumentException(
-      "Cannot extract field name from non-field expression: " + expr);
+  // This should never happen if canRewriteWindowAsAggregateJoin validated correctly
+  throw new IllegalStateException(
+      "Unexpected expression type in extractFieldName (validation gap): " + expr.getClass().getName());
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion to change IllegalArgumentException to IllegalStateException is a minor improvement in semantic clarity, but the existing code is already correct. Since canRewriteWindowAsAggregateJoin validates inputs, this exception should never occur in practice, making the change low-impact.

Low
Suggestions up to commit 1a043bc
CategorySuggestion                                                                                                                                    Impact
General
Deduplicate column references before filtering

The getSelectColumns call may return duplicate column references when the same field
appears multiple times in groupRex, causing redundant IS NOT NULL filters. This
could lead to incorrect query plans or performance degradation. Consider
deduplicating the column references before building the filter list.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2155-2162]

 List<RexNode> isNotNullList =
     PlanUtils.getSelectColumns(groupRex).stream()
+        .distinct()
         .map(context.relBuilder::field)
         .map(context.relBuilder::isNotNull)
         .toList();
 if (!isNotNullList.isEmpty()) {
   context.relBuilder.filter(isNotNullList);
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion addresses a potential edge case where duplicate column references could create redundant filters. However, PlanUtils.getSelectColumns likely already handles deduplication or the input groupRex is unlikely to contain duplicates in typical usage. The impact is minor and the issue may not occur in practice.

Low
Deduplicate partition keys in join condition

If groupList contains duplicate field names, the join condition will include
redundant equality checks for the same column pair. This could lead to suboptimal
query plans. Consider deduplicating groupList by field name before constructing the
join condition.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2180-2193]

+Set<String> seenKeys = new HashSet<>();
 for (UnresolvedExpression groupExpr : groupList) {
   String keyName = extractFieldName(groupExpr);
+  if (!seenKeys.add(keyName)) {
+    continue;
+  }
   RexNode leftKey = context.relBuilder.field(2, 0, keyName);
   RexNode rightKey = context.relBuilder.field(2, 1, keyName);
   RexNode eq = context.relBuilder.equals(leftKey, rightKey);
   if (bucketNullable) {
     RexNode bothNull =
         context.relBuilder.and(
             context.relBuilder.isNull(leftKey), context.relBuilder.isNull(rightKey));
     perKeyConditions.add(context.relBuilder.or(eq, bothNull));
   } else {
     perKeyConditions.add(eq);
   }
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion prevents redundant join conditions if groupList contains duplicate field names. While this could theoretically improve query plans, the PPL parser likely ensures groupList contains unique fields, making this edge case unlikely. The improvement is defensive but has limited practical impact.

Low
Clarify precondition violation with IllegalStateException

The method throws IllegalArgumentException for unexpected expression types, but
canRewriteWindowAsAggregateJoin only validates that expressions are bare field
references. If validation logic changes or has gaps, this could cause runtime
failures. Add a defensive check in the caller or document the precondition more
explicitly.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2336-2348]

 private static String extractFieldName(UnresolvedExpression expr) {
   if (expr instanceof Field f) {
     return f.getField().toString();
   }
   if (expr instanceof QualifiedName qn) {
     return qn.toString();
   }
   if (expr instanceof Alias a) {
     return extractFieldName(a.getDelegated());
   }
-  throw new IllegalArgumentException(
-      "Cannot extract field name from non-field expression: " + expr);
+  throw new IllegalStateException(
+      "extractFieldName called with invalid expression type (should be validated by canRewriteWindowAsAggregateJoin): " + expr);
 }
Suggestion importance[1-10]: 3

__

Why: Changing IllegalArgumentException to IllegalStateException with a more descriptive message improves code clarity by indicating a precondition violation. However, this is a minor stylistic improvement with limited practical impact since canRewriteWindowAsAggregateJoin already validates inputs.

Low

…tions

PPL AstExpressionBuilder.visitWindowFunction wraps the parsed function
in a WindowFunction whose inner is a Function, not an AggregateFunction
(SQL emits AggregateFunction). The original predicate required
AggregateFunction, so it returned false for every eventstats case and
the rewrite never fired.

Use BuiltinFunctionName.ofAggregation(funcName) so the predicate
accepts both inner types, and convert Function to AggregateFunction in
stripWindowFunctionForAggregate so aggVisitor resolves it the same way
stats does.

Test expectation adjustments observed from actual planner output:
- IS NOT DISTINCT FROM: Calcite canonicalizes OR(=, AND(IS NULL, IS NULL))
  to IS NOT DISTINCT FROM on nullable partition keys (DEPTNO in EMP).
- Plain =: on non-nullable partition keys (server in POST.LOGS),
  RexSimplify drops the IS NULL conjuncts and leaves equality.
- Outer Project folded for no-BY cases: the final passthrough projection
  is a no-op identity in the no-BY case and Calcite folds it; the BY case
  keeps the project because it drops the right-side group-key column.

verifyPPLToSparkSQL calls in CalcitePPLEventstatsEarliestLatestTest are
removed pending stabilization of SparkSqlDialect emission for the
join+aggregate form.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

Persistent review updated to latest commit a444b56

…Y keys

CI integration failures revealed two cases the rewrite shouldn't fire on:

1. testUnsupportedWindowFunctions — percentile / percentile_approx are in
   AGGREGATION_FUNC_MAPPING but not WINDOW_FUNC_MAPPING. The legacy rex
   visitor throws "Unexpected window function: ..." for them, and the test
   pins that error. My predicate used only ofAggregation so it accepted
   percentile and the rewrite ran instead of throwing. Now require presence
   in both maps — percentile (and take/first/last/median, all aggregation-
   only) fall through to the legacy throw; dc/distinct_count/row_number
   (all window-only) also fall through unchanged.

2. testEventstatsOnMapPath — `eventstats count() by doc.user.city`. The
   join condition uses relBuilder.field(2, side, name), which doesn't
   resolve nested paths; my predicate accepted any QualifiedName so the
   rewrite produced a plan that failed at field lookup. Now isSimpleQualifiedName
   requires parts.size() == 1; dotted paths fall through to the legacy
   RexOver lowering, which handles nested fields via the existing rexVisitor.

Plus a spotless reformat to the earliest/latest test that wasn't picked
up before push.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

Persistent review updated to latest commit 1a043bc

The eventstats rewrite changes the lowered plan from `Project(RexOver)`
to `Project + Join + Aggregate`. The four affected EXPLAIN tests run in
both modes:
  - CalciteExplainIT (pushdown enabled)
  - CalciteNoPushdownIT (pushdown disabled)

Both modes share the same logical plan (the rewrite is AST-level, not
pushdown-gated) but the physical plan diverges:
  - Pushdown ON: right-side LogicalAggregate gets pushed into the inner
    scan as `PushDownContext=[[AGGREGATION->...]]` with `size:0` +
    `composite`+`terms` (BY) or `track_total_hits` (no-BY) source builder.
  - Pushdown OFF: right side stays as a coordinator-side EnumerableAggregate
    over a raw CalciteEnumerableIndexScan, no AGGREGATION in PushDownContext.

The existing convention (loadExpectedPlan in PPLIntegTestCase) loads from
`expectedOutput/calcite/` when pushdown is on and `expectedOutput/calcite_no_pushdown/`
when it's off, so each test keeps its NoPushdownIT coverage — both variants
of the same expected file get regenerated against a local node running
the rewrite.

Files regenerated (4 per mode, 8 total):
  - explain_eventstats_earliest_latest.{json}
  - explain_eventstats_earliest_latest_custom_time.{json}
  - explain_eventstats_earliest_latest_no_group.{json}
  - explain_eventstats_null_bucket.{yaml}

Captured against:
  - logs_index_mapping.json + logs.json fixtures (5 docs)
  - account_index_mapping.json + accounts.json fixtures (1000 docs)
  - same fixtures CI uses; rel#N / RelSubset#N IDs are normalized at
    compare time by assertJsonEqualsIgnoreId / assertYamlEqualsIgnoreId.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

Persistent review updated to latest commit 15c52f7

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

Persistent review updated to latest commit f19e1d9

My local OS captured composite size as 10000; CI's integ-test cluster
sets it to 1000 (matches every other expected file under
expectedOutput/calcite/agg_composite_*.{json,yaml}). The number doesn't
affect correctness — it's just the per-bucket page size for the composite
aggregation request. Align with the test-framework convention.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

Persistent review updated to latest commit 515f35e

PPL eventstats accepts three aliases for the cardinality aggregation —
dc, distinct_count, distinct_count_approx — all resolving to
BuiltinFunctionName.DISTINCT_COUNT_APPROX. The stats command only
accepts distinct_count_approx, so AGGREGATION_FUNC_MAPPING registers
only that name; the other two are window-only aliases in
WINDOW_FUNC_MAPPING.

The previous predicate required intersection of both maps, which
rejected dc and distinct_count. They fell through to the legacy
RexOver lowering — which is the exact buggy "EnumerableWindow over a
row-fetching scan" shape opensearch-project#5483 was filed against. Fix was incomplete.

Replace the intersection check with: name is in ofWindowFunction AND
its canonical aggregation name (BuiltinFunctionName.name().toLowerCase,
e.g. "distinct_count_approx") is in ofAggregation. Translate the same
way in stripWindowFunctionForAggregate so aggVisitor sees the
registered name. For names already in both maps (count/sum/avg/etc.)
the canonical name equals the user-typed name, so the lookup is a
no-op — no behavior change for the cases that already worked.

ROW_NUMBER still falls through because its canonical name "row_number"
isn't in the aggregation map. Same for percentile / take / first /
last / median / list / values — all rejected by the canonical-name
lookup.

Verified locally:
- eventstats dc(state)               → cardinality agg, size:0
- eventstats distinct_count(state) by gender
  → composite over gender + nested cardinality on state, size:0

Regenerated explain_eventstats_dc.json and
explain_eventstats_distinct_count.json with the new shape (composite
size 1000 to match CI). Both tests are pushdown-only
(enabledOnlyWhenPushdownIsEnabled() + loadFromFile hardcoded to
calcite/), so no calcite_no_pushdown/ variants needed.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

Persistent review updated to latest commit bb8d9b2

@RyanL1997 RyanL1997 added the enhancement New feature or request label Jun 2, 2026
@RyanL1997 RyanL1997 changed the title [BugFix] Push eventstats down by rewriting RexOver to Join + Aggregate (#5483) [Enhancement] Push eventstats down by rewriting RexOver to Join + Aggregate (#5483) Jun 2, 2026
@RyanL1997 RyanL1997 added the PPL Piped processing language label Jun 2, 2026
…owup

A perf A/B on a local 20k-doc index uncovered a real problem in the
no-BY rewrite that was hidden by tests with bounded result sets.

Before this fix, the no-BY case emitted:
  EnumerableNestedLoopJoin(condition=[true], joinType=[inner])
    leftScan  (returns N rows)
    rightScan (returns 1 row — the COUNT() scalar)

Calcite's NestedLoopJoin contract calls Enumerable.enumerator() on the
right side once per left tuple. Each enumerator open on a
CalciteEnumerableIndexScan triggers a fresh OpenSearch _search request.
For a 10k-row left side that means 10k OpenSearch calls. On a remote
cluster (1-10ms RTT per call), the head-less query would take tens of
seconds.

Measured on the local node, 20k docs, single shard, no head:
  before: 10,004 OS calls per PPL query, ~1055ms wall
  after:    4 OS calls per PPL query,  ~174ms wall
That's ~6x faster wall and ~2500x fewer OS round-trips, with no
correctness change (results identical).

Fix: in the no-BY branch of rewriteWindowAsAggregateJoin, project a
literal-0 key column onto both sides (left: append after orig cols;
right: append after agg outputs) and join on equality. The equi-join
condition makes the planner pick EnumerableHashJoin, which drains the
single-row right side once into a hash table and probes per left row
in O(1).

Pushdown still fires on the right side — verified via EXPLAIN that the
right scan still carries `AGGREGATION->...COUNT()` and `size:0` in
PushDownContext; the literal-0 projection is a top-level wrapper that
doesn't disrupt the Aggregate→Scan operand chain
AggregateIndexScanRule.AGGREGATE_SCAN matches.

The BY case is unchanged — it already has an equi-join condition (or
IS NOT DISTINCT FROM for bucketNullable=true) which Calcite handles
correctly via EnumerableMergeJoin.

Test expectation updates:
  - CalcitePPLEventstatsTest.testEventstatsCount / testEventstatsAvg
  - CalcitePPLEventstatsEarliestLatestTest no-BY variants (4 tests)
  - explain_eventstats_dc.json (no-BY, pushdown)
  - explain_eventstats_earliest_latest_no_group.json (no-BY, both modes)
Outer LogicalProject now appears in the no-BY case because we must
strip the literal-0 key columns from the join output — it's no longer
a no-op passthrough that Calcite folds.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@RyanL1997
Copy link
Copy Markdown
Collaborator Author

Perf finding & fix on the no-BY rewrite

An internal review flagged a concern about the cost of the two-scan + coordinator-join shape vs the original window form. Ran a local A/B against a 20k-doc accounts index to put numbers on it.

Initial measurement uncovered a real problem

Scenario Before rewrite (legacy window) With rewrite (commit before this fix)
`... head 10` (10-row result) ~234ms, ~13 OS calls
no head (10k-row result) ~169ms, ~1 OS call ~1055ms, ~10,004 OS calls

For the unbounded result-set case, the no-BY rewrite was triggering one OpenSearch search request per left-side row.

Root cause

The no-BY case emitted:

EnumerableNestedLoopJoin(condition=[true], joinType=[inner])
  leftScan   ← returns N rows
  rightScan  ← returns 1 row (the COUNT() scalar)

EnumerableNestedLoopJoin calls Enumerable.enumerator() on the right side per left tuple. Each enumerator open on a CalciteEnumerableIndexScan triggers a fresh _search request. For a 10k-row left side that means 10k OS calls. On a remote cluster (1–10ms RTT per call) that would blow up to tens of seconds.

The BY case never hit this — it already has an equi-join condition (or IS NOT DISTINCT FROM for bucketNullable=true) which lets Calcite pick EnumerableMergeJoin and drain one side once.

Fix (commit 6f333fc22)

For the no-BY case, project a literal-0 key column onto both sides and join on equality:

  • Left: append the key after the original columns.
  • Right: append the key after the aggregate outputs.
  • Join condition: left.__eventstats_join_key__ = right.__eventstats_join_key__.

The equi-join condition makes the planner pick EnumerableHashJoin, which drains the single-row right side once into a hash table and probes per left row in O(1). Final projection strips the two key columns so the output row type is unchanged.

After-fix measurement (same data, same node, same queries)

Scenario After fix
`... head 10`
no head (10k-row result) ~174ms warm, 4 OS calls across 5 runs

That's ~6× faster wall and ~2,500× fewer OS round-trips for the unbounded case. Latency for the head-bounded case ticks up slightly (~30ms) because of the extra projection + hash-join overhead at small scale, but the absolute numbers are well under one round-trip.

Pushdown verification

Confirmed via EXPLAIN that the right-side aggregate still pushes down with the new shape:

EnumerableHashJoin(condition=[=($1, $4)], joinType=[inner])
  EnumerableCalc(... LITERAL(0) as join_key)
    CalciteEnumerableIndexScan(
      PushDownContext=[[AGGREGATION->...COUNT()]],
      sourceBuilder={"size":0,"track_total_hits":2147483647})
  EnumerableCalc(... LITERAL(0) as join_key)
    CalciteEnumerableIndexScan(... full _source path ...)

The literal-0 projection wraps the aggregate but doesn't disrupt the Aggregate → Scan operand chain that AggregateIndexScanRule.AGGREGATE_SCAN matches — pushdown fires normally and the right side still issues exactly one size:0 + track_total_hits (or cardinality for dc(), or composite + top_hits for earliest/latest) request.

Test impact

Unit-test expectations updated for 2 tests in CalcitePPLEventstatsTest and 4 in CalcitePPLEventstatsEarliestLatestTest (the no-BY variants now show an outer LogicalProject because the literal-0 key columns must be stripped). 3 EXPLAIN expected files regenerated (explain_eventstats_dc.json pushdown, explain_eventstats_earliest_latest_no_group.json in both modes).

BY-case tests are unchanged — they already use equi-join conditions.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

Persistent review updated to latest commit 6f333fc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugFix enhancement New feature or request PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] eventstats does not push down to OpenSearch (RexOver excluded from aggregation pushdown)

1 participant