Push down topk through join #21621

Open
SubhamSinghal wants to merge 4 commits into apache:main from SubhamSinghal:push-down-topk-through-join

@SubhamSinghal
Contributor

Which issue does this PR close?

#11900

Rationale for this change

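When a query has ORDER BY <cols> LIMIT N on top of an outer join and all sort columns come from the preserved side, DataFusion currently runs the full join first, then sorts and limits. Because an outer join emits every preserved-side row at least once, the first N rows of the sorted join output can always be produced from the first N rows of the sorted preserved input. We can therefore push a copy of the Sort(fetch=N) to the preserved input, reducing the number of rows entering the join.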

Before:

Sort: t1.b ASC, fetch=3
  Left Join: t1.a = t2.a
    Scan: t1     ← scans ALL rows
    Scan: t2

After:

Sort: t1.b ASC, fetch=3
  Left Join: t1.a = t2.a
    Sort: t1.b ASC, fetch=3  ← pushed down
      Scan: t1               ← only top-3 rows enter join
    Scan: t2
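
To illustrate why the two plans above return the same rows, here is a small self-contained sketch in plain Rust. It does not use DataFusion; the row types and the helpers `left_join`, `top_k`, `baseline`, and `with_pushdown` are hypothetical names introduced only for this example, and the nested-loop join is an assumption made for brevity.

```rust
// Toy rows: t1 has columns (a, b); t2 has columns (a, payload).
type LeftRow = (i32, i32);
type RightRow = (i32, &'static str);
type Joined = (i32, i32, Option<&'static str>);

/// Nested-loop LEFT JOIN on `a`: unmatched left rows are kept with `None`.
fn left_join(left: &[LeftRow], right: &[RightRow]) -> Vec<Joined> {
    let mut out = Vec::new();
    for &(a, b) in left {
        let mut matched = false;
        for &(ra, payload) in right {
            if ra == a {
                out.push((a, b, Some(payload)));
                matched = true;
            }
        }
        if !matched {
            out.push((a, b, None));
        }
    }
    out
}

/// Stable sort by `key` ascending, then keep the first `n` rows (a TopK).
fn top_k<T: Clone>(rows: &[T], n: usize, key: impl Fn(&T) -> i32) -> Vec<T> {
    let mut sorted = rows.to_vec();
    sorted.sort_by_key(|r| key(r));
    sorted.truncate(n);
    sorted
}

/// "Before" plan: join everything, then sort by t1.b and limit.
fn baseline(left: &[LeftRow], right: &[RightRow], n: usize) -> Vec<Joined> {
    top_k(&left_join(left, right), n, |r| r.1)
}

/// "After" plan: TopK on the preserved (left) input first, then join,
/// then the top-level TopK that is kept for correctness.
fn with_pushdown(left: &[LeftRow], right: &[RightRow], n: usize) -> Vec<Joined> {
    let pruned = top_k(left, n, |r| r.1);
    top_k(&left_join(&pruned, right), n, |r| r.1)
}
```

With `t1 = [(1, 50), (2, 10), (3, 30), (4, 20), (5, 40)]`, `t2 = [(2, "x"), (2, "y"), (4, "z"), (9, "w")]`, and `n = 3`, both functions return the same three rows, while `with_pushdown` feeds only three `t1` rows into the join. Note that the top-level TopK is still needed: the left row with `a = 2` matches twice, so the join over the pruned input yields four rows.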

What changes are included in this PR?

A new logical optimizer rule PushDownTopKThroughJoin that:

  1. Matches Sort with fetch = Some(N) (TopK)
  2. Looks through an optional Projection to find a Join
  3. Checks join type is LEFT or RIGHT with no non-equijoin filter
  4. Verifies all sort expression columns come from the preserved side
  5. Inserts a copy of the Sort(fetch=N) on the preserved child
  6. Keeps the top-level sort for correctness
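
The six steps above can be sketched on a toy plan tree. This is not the PR's code and does not use DataFusion's `LogicalPlan`; the `Plan` and `JoinType` enums, the `has_non_equi_filter` flag, and the functions `tables`, `try_push_down`, and `push_down_topk` are all hypothetical stand-ins. It also assumes the projection does not rename columns.

```rust
#[derive(Debug, Clone, PartialEq)]
enum JoinType { Inner, Left, Right }

// Hypothetical, heavily simplified plan tree. Sort columns are (table, column).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { table: String },
    Projection { input: Box<Plan> },
    Join { join_type: JoinType, has_non_equi_filter: bool, left: Box<Plan>, right: Box<Plan> },
    Sort { cols: Vec<(String, String)>, fetch: Option<usize>, input: Box<Plan> },
}

/// Names of all tables that feed into `plan`.
fn tables(plan: &Plan) -> Vec<String> {
    match plan {
        Plan::Scan { table } => vec![table.clone()],
        Plan::Projection { input } | Plan::Sort { input, .. } => tables(input),
        Plan::Join { left, right, .. } => {
            let mut t = tables(left);
            t.extend(tables(right));
            t
        }
    }
}

fn try_push_down(plan: &Plan) -> Option<Plan> {
    // Step 1: match Sort with fetch = Some(n).
    let (cols, n, input) = match plan {
        Plan::Sort { cols, fetch: Some(n), input } => (cols, *n, input),
        _ => return None,
    };
    // Step 2: look through an optional Projection.
    let (join, through_projection) = match input.as_ref() {
        Plan::Projection { input: inner } => (inner.as_ref(), true),
        other => (other, false),
    };
    // Step 3: LEFT or RIGHT join with no non-equijoin filter.
    let (join_type, left, right) = match join {
        Plan::Join { join_type, has_non_equi_filter: false, left, right } => (join_type, left, right),
        _ => return None,
    };
    let preserved_is_left = match join_type {
        JoinType::Left => true,
        JoinType::Right => false,
        JoinType::Inner => return None,
    };
    let preserved = if preserved_is_left { left } else { right };
    // Step 4: every sort column must come from the preserved side.
    let preserved_tables = tables(preserved);
    if !cols.iter().all(|(t, _)| preserved_tables.contains(t)) {
        return None;
    }
    // Step 5: insert a copy of Sort(fetch = n) on the preserved child.
    let pushed = Box::new(Plan::Sort { cols: cols.clone(), fetch: Some(n), input: preserved.clone() });
    let (new_left, new_right) =
        if preserved_is_left { (pushed, right.clone()) } else { (left.clone(), pushed) };
    let new_join = Plan::Join {
        join_type: join_type.clone(),
        has_non_equi_filter: false,
        left: new_left,
        right: new_right,
    };
    let new_input =
        if through_projection { Plan::Projection { input: Box::new(new_join) } } else { new_join };
    // Step 6: keep the top-level sort for correctness.
    Some(Plan::Sort { cols: cols.clone(), fetch: Some(n), input: Box::new(new_input) })
}

/// Returns the rewritten plan, or an unchanged copy if the rule does not apply.
fn push_down_topk(plan: &Plan) -> Plan {
    try_push_down(plan).unwrap_or_else(|| plan.clone())
}
```

Applied to the "Before" plan from the rationale, this sketch produces the "After" shape; an inner join, a join with a non-equijoin filter, or a sort on a `t2` column is returned unchanged. The sketch makes no attempt to avoid pushing the sort again if it is run a second time on its own output.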

Are these changes tested?

Yes, through unit tests.

Are there any user-facing changes?

No API changes.

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Apr 14, 2026