[Enhancement] Support bare-field join criteria join on <field> shorthand#5517
[Enhancement] Support bare-field join criteria join on <field> shorthand#5517Hanyu-W wants to merge 1 commit into
join on <field> shorthand#5517Conversation
PR Reviewer Guide 🔍(Review updated until commit 19e6693)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 19e6693
Previous suggestionsSuggestions up to commit 374d412
|
join on <field> shorthandjoin on <field> shorthand
| new Field(QualifiedName.of(rightQual, f)))) | ||
| .reduce((l, r) -> new And(l, r)) | ||
| .orElseThrow( | ||
| () -> new SemanticCheckException("join 'on' shorthand requires at least one field")); |
There was a problem hiding this comment.
is this the only error case that can throw here?
There was a problem hiding this comment.
There were two (this one, and an empty-list guard above it). Both are removed in the redesign, so neither throws anymore.
| private static String qualifierOf(UnresolvedPlan plan, String side) { | ||
| Relation relation = findRelation(plan); | ||
| if (relation == null) { | ||
| throw new SemanticCheckException( |
There was a problem hiding this comment.
nice detail on the error message!
consider using the ReportBuilder to split out suggestions from the problem
There was a problem hiding this comment.
That error path is going away — once matching is done by position in the planner, there's no "must be a named table" restriction, so the exception it was attached to is deleted.
| if (rightAlias.isEmpty() && this.right instanceof SubqueryAlias alias) { | ||
| rightAlias = Optional.of(alias.getAlias()); | ||
| } | ||
| rewriteBareFieldCriteria(); |
There was a problem hiding this comment.
Is putting it here the right boundary over resolving these fields in the constructor (or even frontloading it to the AST builder)?
Putting the rewrite here introduces a weird boundary, where now Join plays the role of dynamically parsing itself during attachment based directly on AST grammar choices. That multiple places comment "we actually resolve these fields in attach()" also hints that this is unintuitive.
It might be clearer to put this in the constructor. Alternatively, it's a little more work, but we could consider moving all of leftAlias, rightAlias, joinFields and similar into a dedicated JoinClause data type that supports operations like clause.expandBareFields() and clause.resolveAliases()
There was a problem hiding this comment.
Yes — that's the fix I'm making. I'm going to move the expansion into visitJoin instead of attach(), so the AST keeps query how wrote it and the analysis happens in the planner where it belongs. Explain output will now match the original query too.
Yeah, I'll make the change Ryan suggested to fix this.
| | `source=table1 \| join right=tt on table1.id=t2.id [ source=table2 as t2 \| eval b = id ] \| eval a = 1` | `table1.id, tt.id, tt.b, a` | | ||
|
|
||
| * **Field deduplication in extended syntax** – When using the extended syntax with a field list, duplicate field names in the output are deduplicated according to the `overwrite` option. | ||
| * **Field deduplication in extended syntax** – When using the extended syntax with a field list (`<join-field-list>`), duplicate field names are deduplicated according to the `overwrite` option. The bare-field `on`/`where` shorthand keeps both key columns, unlike the field-list form. |
There was a problem hiding this comment.
the query grammar at the top of the doc wasn't updated to show what join-field-list is
There was a problem hiding this comment.
Good catch! I'll cut it because it isn't adding anything valuable to the docs.
| * are known; qualifier is the side alias else the table name. Non-bare criteria (qualified, OR, | ||
| * comparisons) are left untouched. | ||
| */ | ||
| private void rewriteBareFieldCriteria() { |
There was a problem hiding this comment.
This method resolves the left and right qualifiers independently, with no check that they differ. When both sides are the same unaliased table (e.g. source=EMP | inner join on DEPTNO EMP), qualifierOf returns "EMP" for both, and the conjunction it builds becomes EMP.DEPTNO = EMP.DEPTNO — a tautology that Calcite simplifies to IS NOT NULL(...), not an equi-join. The user gets a near-cross-product with no error or warning.
There was a problem hiding this comment.
I think it should be fixed now I'm moving it into the planner, as the planner already knows which input is left and which is right by position, so doing the match there produces a correct equi-join even for join on DEPTNO EMP.
| if (rightAlias.isEmpty() && this.right instanceof SubqueryAlias alias) { | ||
| rightAlias = Optional.of(alias.getAlias()); | ||
| } | ||
| rewriteBareFieldCriteria(); |
There was a problem hiding this comment.
Performing semantic rewriting inside the AST node's attach() mixes tree construction with analysis, which usually lives in the analyzer layer here. The unresolved tree no longer reflects what the user wrote, which can make explain output and debugging confusing. Was an analyzer pass considered?
There was a problem hiding this comment.
Yes — that's the fix I'm making. I'm going to move the expansion into visitJoin instead of attach(), so the AST
keeps query how wrote it and the analysis happens in the planner where it belongs. Explain output will now match the original query too.
| new Field(QualifiedName.of(leftQual, f)), | ||
| new Field(QualifiedName.of(rightQual, f)))) | ||
| .reduce((l, r) -> new And(l, r)) | ||
| .orElseThrow( |
There was a problem hiding this comment.
This orElseThrow looks unreachable: rewriteBareFieldCriteria early-returns when no bare fields are collected, and collectBareFields only returns true after adding at least one. Can you double check?
There was a problem hiding this comment.
You're right, it's unreachable. The method will be deleted in the redesign
| } | ||
|
|
||
| /** Walk the single-child chain down to the underlying Relation. */ | ||
| private static Relation findRelation(UnresolvedPlan plan) { |
There was a problem hiding this comment.
This bails as soon as it hits a node with !=1 child. Please add a test for an aliased subquery with a bare field (source=t1 | join on a [ source=t2 ] as s) to confirm it uses the alias and never reaches qualifierOf.
There was a problem hiding this comment.
This method will get deleted too, but I'll still add the aliased-subquery test to confirm.
| private final JoinType joinType; | ||
| private final Optional<UnresolvedExpression> joinCondition; | ||
| // Not final: bare-field `on <field>` shorthand is rewritten to qualified equality in attach(). | ||
| private Optional<UnresolvedExpression> joinCondition; |
There was a problem hiding this comment.
Dropping final to allow mutation in attach() weakens this AST node's immutability. If attach() ever runs twice or the node is shared, the rewrite re-runs against already-rewritten state. It's idempotent in practice (the qualified result fails collectBareFields), but the invariant is fragile — consider a guard flag or doing this rewrite in an analyzer pass instead of mutating here.
There was a problem hiding this comment.
Good point. I'll restore final as the redesign should no longer mutate the AST here
| + " COMM=[$6], DEPTNO=[$7], EMP.EMPNO=[$8], EMP.ENAME=[$9], EMP.JOB=[$10]," | ||
| + " EMP.MGR=[$11], EMP.HIREDATE=[$12], EMP.SAL=[$13], EMP.COMM=[$14]," | ||
| + " EMP.DEPTNO=[$15])\n" | ||
| + " LogicalJoin(condition=[AND(IS NOT NULL($15), IS NOT NULL($13))]," |
There was a problem hiding this comment.
This asserts AND(IS NOT NULL($15), IS NOT NULL($13)) as expected output, which locks in the self-join tautology as if intended. Should flip to an expected-exception test once the guard is added.
There was a problem hiding this comment.
I'd actually go the other way: instead of making the self-join an error, the redesign makes it work correctly. So I'll update these to assert the real join condition rather than the tautology — the query is valid, it was just being built wrong
| + " COMM=[$6], DEPTNO=[$7], EMP.EMPNO=[$8], EMP.ENAME=[$9], EMP.JOB=[$10]," | ||
| + " EMP.MGR=[$11], EMP.HIREDATE=[$12], EMP.SAL=[$13], EMP.COMM=[$14]," | ||
| + " EMP.DEPTNO=[$15])\n" | ||
| + " LogicalJoin(condition=[IS NOT NULL($15)], joinType=[inner])\n" |
There was a problem hiding this comment.
Same concern as the above— this enshrines IS NOT NULL($15) as correct. The test name even admits it's a tautology. Convert to expected-exception, or revisit once self-join handling is decided.
There was a problem hiding this comment.
I'd actually go the other way: instead of making the self-join an error, the redesign makes it work correctly. So I'll update these to assert the real join condition rather than the tautology — the query is valid, it was just being built wrong
same response as above
Let `on`/`where` join criteria accept a bare field name (or `AND` of bare fields) as shorthand for the qualified equality on that common field: source=t1 | inner join on a t2 == on t1.a = t2.a source=t1 | inner join on a AND b t2 == on t1.a = t2.a AND t1.b = t2.b source=t1 | join on a t2 == on t1.a = t2.a (no prefix/alias) The shorthand behaves exactly like the explicit `on l.f = r.f` criteria: it KEEPS BOTH key columns (the right key renamed `<rightAlias>.f`, or `<rightTable>.f` when no alias is given). This differs from the field-list join (`join f t2`), which merges the duplicate key to a single column. Implementation: - Grammar: reorder the joinCommand alternatives so the criteria alternative is listed first. This lets the no-prefix form `join on a <right>` parse `on` as the criteria keyword instead of the field-list alternative greedily consuming `on`/`where` as a field name. ANTLR ALL(*) picks the first-listed alternative on genuine ambiguity. - AstBuilder: a bare single-part field (or AND-chain of them) is left as the join condition verbatim -- the unresolved AST reflects what the user wrote. - Planner (CalciteRelNodeVisitor.visitJoin): the criteria branch detects a bare-field condition and builds the equi-join from it, resolving each field by stack position (LEFT.f = RIGHT.f) via the existing buildJoinConditionByFieldName helper. No new planner code, and no AST mutation. Because resolution is positional rather than by qualifier, a self-join `join on f t` is a genuine cross-scan equi-join rather than the `f = f` tautology a name-based rewrite would collapse to. Adds AstBuilder, Calcite plan, anonymizer, and integration tests; updates the join command user manual. Signed-off-by: Hanyu Wei <weihanyu@amazon.com>
374d412 to
19e6693
Compare
|
Persistent review updated to latest commit 19e6693 |
Description
Lets
joinon/wherecriteria accept a bare field name (or anANDchain of bare fields) as shorthand for the qualified equality on that common field. Before this change, the only way to join on a shared column was to spell out both qualifiers:The shorthand behaves exactly like the explicit
on l.f = r.fit sugars: it keeps both key columns (the right key surfaced as<rightAlias>.f, or<rightTable>.fwhen no alias is given). This is intentionally different from the field-list joinjoin f t2, which merges the duplicate key into a single column.Two things made this more than a one-line grammar tweak, and shaped the design:
onis a keyword that can also be an identifier, so the no-prefix formjoin on a t2was greedily consumed by the field-list alternative —onparsed as a field name, failing withfield [on] not found. Reordering thejoinCommandalternatives so the criteria alternative is listed first makes ANTLR ALL(*) prefer readingonas the criteria keyword on genuine ambiguity, with zero change to the field-list form's behavior.on acarries no table qualifier, but the keep-both planner branch needsleft.a = right.a. The left table name isn't known at parse time — it's only known once both inputs are attached. So the rewrite is deferred toJoin.attach, where the qualifier is taken from the side alias (left=/right=/AS) when present, and otherwise from the underlying table name (walking the single-child plan chain down to theRelation). An un-aliased subquery side with no reachable table name throws aSemanticCheckExceptiontelling the user to add an alias.This keeps the planner 100% unchanged — the rewritten criteria is an ordinary
Compare("=", l.f, r.f)that flows through the existing keep-both join branch; all shorthand-specific logic lives in the AST node and grammar.Resolves #5484.
Changes
ppl/src/main/antlr/OpenSearchPPLParser.g4joinCommandso the criteria alternative precedes the field-list alternative, sojoin on areadsonas the criteria keyword instead of a fieldppl/.../ppl/parser/AstBuilder.javaANDchain) as the join condition rather than treating it as a field listcore/.../ast/tree/Join.javaattach()rewrites the bare-field condition into the qualified equalityl.f = r.fusing side aliases or table names;joinConditionmade non-final; newrewriteBareFieldCriteria/collectBareFields/buildQualifiedEqualityConjunction/qualifierOf/findRelationhelpers;SemanticCheckExceptionfor an unresolvable un-aliased sidedocs/user/ppl/cmd/join.md<joinCriteria>parameter descriptions, and the field-dedup limitation noteppl/.../calcite/CalcitePPLJoinTest.java,ppl/.../parser/AstBuilderTest.java,ppl/.../utils/PPLQueryDataAnonymizerTest.javainteg-test/.../calcite/remote/CalcitePPLJoinIT.javaTesting
CalcitePPLJoinITon a live cluster, exercising the new shorthand against pre-change behavior:testJoinWithImplicitFieldinner join on name— keep-both; asserts shorthand output is identical to expliciton l.name = r.name; co-named columns resolve to the lefttestJoinNoPrefixImplicitFieldjoin on name— no prefix;onparsed as the criteria keyword, identical to the explicit form (not the field-list merge)testLeftJoinWithImplicitFieldleft join on name— unmatched-left rows keep the left key (no NULL key; keep-both does not coalesce)CalcitePPLJoinIT(new) totalUnit suites (all green, run locally):
CalcitePPLJoinTest: 50/50 (10 new) — plan shape for single/multi-field, alias,where, no-prefix, left/semi, self-join tautology, and not-on-both-sides cases.AstBuilderTest: 158/158 (11 new) — rewrite-vs-passthrough decisions (bare field rewritten; qualified/comparison criteria untouched; field-list still parses as a field list; prefix-without-criteria is a syntax error).PPLQueryDataAnonymizerTest: 119/119 (1 new) — shorthand anonymizes correctly.Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.