feat(mem_wal): support prefilters in LSM vector and full-text search #7138

Open
touch-of-grey wants to merge 3 commits into lance-format:main from touch-of-grey:LsmPrefilter

@touch-of-grey
Contributor

Motivation

The MemWAL LSM read path (added in lancedb/lancedb#3489, use_lsm_read) supports plain scan, full-text search, and vector (ANN) search across the base table, flushed generations, and the in-memory memtables. However, the vector and full-text search planners ignored a user WHERE predicate: a filtered LSM search returned the same rows as the unfiltered query, i.e. the filter was silently dropped. Only the plain LSM scan honored filters.

This PR closes that gap and aligns the LSM scanner surface with the dataset Scanner.

Changes

1. Prefilter support in LSM vector and full-text search

Both LSM search planners now accept an optional predicate via with_filter(Option<Expr>) and apply it as a true prefilter (matching a normal filtered scan), not a lossy post-filter on the per-source top-k:

  • Base / flushed arms reuse the dataset scanner's native prefilter (filter_expr + prefilter(true)), so the ANN / BM25 search runs over rows matching the predicate.
  • Active / frozen memtable arms apply the predicate before the top-k cut: the brute-force vector exec masks rows in compute_topk (a filtered vector search routes to brute force rather than HNSW, whose graph traversal cannot honor an arbitrary predicate), and the FTS exec masks the materialized full-schema hits before projection.

A NULL predicate result excludes the row, matching SQL semantics.
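
To illustrate why the order matters, here is a minimal, self-contained sketch that contrasts masking before the top-k cut with masking after it. It is not the PR's code: `Candidate`, `topk_prefilter`, `topk_postfilter`, and `sample_rows` are hypothetical names, the predicate is assumed to be already evaluated per row, and `None` stands in for a SQL NULL result.

```rust
/// Hypothetical candidate row: id, distance to the query vector, and the
/// already-evaluated predicate result (`None` models a SQL NULL).
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub row_id: u64,
    pub distance: f32,
    pub predicate: Option<bool>,
}

/// Prefilter: mask rows first, then keep the k nearest survivors.
/// Only `Some(true)` passes, so a NULL result excludes the row.
pub fn topk_prefilter(rows: &[Candidate], k: usize) -> Vec<u64> {
    let mut kept: Vec<&Candidate> = rows
        .iter()
        .filter(|r| r.predicate == Some(true))
        .collect();
    kept.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    kept.into_iter().take(k).map(|r| r.row_id).collect()
}

/// Post-filter: keep the k nearest rows first, then mask.
/// Matching rows outside the unfiltered top-k are lost.
pub fn topk_postfilter(rows: &[Candidate], k: usize) -> Vec<u64> {
    let mut all: Vec<&Candidate> = rows.iter().collect();
    all.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    all.into_iter()
        .take(k)
        .filter(|r| r.predicate == Some(true))
        .map(|r| r.row_id)
        .collect()
}

/// Illustrative data: the two nearest rows fail the predicate
/// (one is false, one is NULL); the two farther rows match.
pub fn sample_rows() -> Vec<Candidate> {
    vec![
        Candidate { row_id: 1, distance: 0.1, predicate: Some(false) },
        Candidate { row_id: 2, distance: 0.2, predicate: None },
        Candidate { row_id: 3, distance: 0.3, predicate: Some(true) },
        Candidate { row_id: 4, distance: 0.4, predicate: Some(true) },
    ]
}
```

With `k = 2` on this sample data, the prefilter variant returns rows 3 and 4, while the post-filter variant returns nothing because both of the unfiltered nearest rows are masked out after the cut.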

2. Align LsmScanner with the dataset Scanner interface

LsmScanner gains a Scanner-aligned builder so an LSM read reads like a normal scan:

  • nearest() (+ nprobes / refine / distance_metric) and full_text_search(FullTextSearchQuery) are now state setters; create_plan() dispatches to the vector, FTS, point-lookup, or plain planner (mirroring Scanner::create_plan).
  • project (<T: AsRef<str>> + Result) and limit (Option<i64>, Option<i64> + Result) match Scanner, validating non-negative bounds.
  • Query-vector dimension is validated against the column dimension before search.
  • Knobs the LSM planner cannot yet honor (ef, distance_range, maximum_nprobes, with_row_id) are intentionally not exposed rather than silently ignored.
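
The builder shape described above can be sketched roughly as follows. This is an illustrative toy, not the LsmScanner API: `SketchScanner` and `PlanKind` are hypothetical, the query and column types are simplified, point-lookup dispatch is omitted, and both the dispatch priority (vector before FTS) and the choice to validate the dimension inside `nearest` are assumptions made for the example.

```rust
/// Hypothetical plan kinds; the real dispatcher also has a point-lookup arm.
#[derive(Debug, PartialEq)]
pub enum PlanKind {
    Vector,
    FullText,
    Plain,
}

/// Hypothetical owned builder: setters record state, `create_plan` dispatches.
#[derive(Debug, Default)]
pub struct SketchScanner {
    column_dim: usize,
    query_vector: Option<Vec<f32>>,
    fts_query: Option<String>,
    limit: Option<i64>,
    offset: Option<i64>,
}

impl SketchScanner {
    pub fn new(column_dim: usize) -> Self {
        Self { column_dim, ..Default::default() }
    }

    /// State setter: rejects a query whose dimension differs from the column's.
    pub fn nearest(mut self, query: &[f32]) -> Result<Self, String> {
        if query.len() != self.column_dim {
            return Err(format!(
                "query dimension {} does not match column dimension {}",
                query.len(),
                self.column_dim
            ));
        }
        self.query_vector = Some(query.to_vec());
        Ok(self)
    }

    /// State setter: records the full-text query instead of returning a plan.
    pub fn full_text_search(mut self, query: &str) -> Self {
        self.fts_query = Some(query.to_string());
        self
    }

    /// Validates that both bounds are non-negative before storing them.
    pub fn limit(mut self, limit: Option<i64>, offset: Option<i64>) -> Result<Self, String> {
        if matches!(limit, Some(l) if l < 0) || matches!(offset, Some(o) if o < 0) {
            return Err("limit and offset must be non-negative".to_string());
        }
        self.limit = limit;
        self.offset = offset;
        Ok(self)
    }

    /// Dispatches on the recorded state (priority order is an assumption).
    pub fn create_plan(&self) -> PlanKind {
        if self.query_vector.is_some() {
            PlanKind::Vector
        } else if self.fts_query.is_some() {
            PlanKind::FullText
        } else {
            PlanKind::Plain
        }
    }
}
```

Keeping the setters free of planning logic means invalid input is rejected when it is set, and the choice of planner is made in one place.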

Tests

Adds deterministic regressions that fail under a post-filter and pass under a true prefilter, for both the memtable arms and the indexed base arm (vector and FTS), plus facade-dispatch coverage for nearest() / full_text_search().

The MemWAL LSM vector and full-text search planners ignored a user WHERE
predicate, so a filtered search returned rows that the predicate should have
excluded. Only the plain LSM scan honored filters.

Add prefilter support to both LSM search planners via with_filter(Option<Expr>):

- Base and flushed arms reuse the dataset scanner's native prefilter.
- The active/frozen memtable arms apply the predicate before the top-k cut: the
  brute-force vector exec masks rows in compute_topk (a filtered vector search
  routes to brute force rather than HNSW), and the FTS exec masks the
  materialized full-schema hits before projection.
- LsmScanner::full_text_search forwards its filter to the FTS planner.

This is a true prefilter (matching a normal filtered scan), not a lossy
post-filter on the per-source top-k.

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the enhancement New feature or request label Jun 6, 2026
@github-actions github-actions Bot added A-python Python bindings A-java Java bindings + JNI labels Jun 6, 2026
LsmScanner gains a Scanner-aligned (owned) builder so an LSM read reads like a
normal scan. `nearest()` and `full_text_search()` become state setters and
`create_plan()` dispatches to the vector, FTS, point-lookup, or plain planner,
mirroring `Scanner::create_plan` / `MemTableScanner::create_plan`.

- Add `nearest`, `nprobes`, `refine`, `distance_metric` (vector search folds the
  LsmVectorSearchPlanner behind the builder; honors the builder filter).
- `full_text_search(FullTextSearchQuery)` is now a setter (column from the query,
  k from `limit`) instead of returning a plan directly.
- Align `project` (`<T: AsRef<str>>` + `Result`) and `limit`
  (`Option<i64>, Option<i64>` + `Result`) with `Scanner`.

Knobs the LSM planner cannot yet honor (ef, distance_range, maximum_nprobes,
with_row_id) are intentionally not exposed to avoid silently ignoring them.
@jackye1995
Contributor

Thanks for the fix! Since we are already doing the refactoring here, I think let's also make sure the MemTable scanner is consistent with the other scanners.

…h PK

The active-memtable vector and full-text search arms applied the prefilter
predicate inside the per-source exec, before the within-source dedup that
collapses an in-memtable update's duplicate-PK appends to the newest version.
When the newest version of a PK failed the predicate but an older version
passed, the filter dropped the newest and the dedup kept the stale older
match — returning a row whose current version should have been excluded. This
broke the "true prefilter == normal filtered scan" contract on the active arm
(flushed/base were already correct: the deletion vector / block-list remove
superseded rows before the filter).

Evaluate the predicate against the newest version of each PK:

- MemTableBruteForceVectorExec and FtsIndexExec take the primary-key columns and,
  when filtering, drop superseded versions (via compute_pk_hash) before applying
  the predicate, so a newer non-matching version excludes the PK.
- LsmScanner plumbs pk_columns into the active MemTableScanner arms.

Also align MemTableScanner's builder with the dataset Scanner (the API
consistency this work started from): project (AsRef + Result), limit
(Option<i64> + Result), nearest (&dyn Array + Result), and full_text_search
(FullTextSearchQuery, converted to the local query model) now match Scanner.
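
The stale-match problem this commit describes comes down to the order of dedup and filter. The sketch below is a simplified illustration under stated assumptions, not the exec code: `Version`, `newest_per_pk`, `filter_then_dedup`, `dedup_then_filter`, and `sample_versions` are hypothetical, the primary key is a plain `u64` rather than a hash from compute_pk_hash, and a per-row sequence number stands in for append order.

```rust
use std::collections::HashMap;

/// Hypothetical memtable row version.
#[derive(Debug, Clone)]
pub struct Version {
    pub pk: u64,       // primary key (the real code hashes the PK columns)
    pub seq: u64,      // append order; a larger value means a newer version
    pub matches: bool, // predicate result for this version
}

/// Keeps only the newest version of each PK, ordered by PK for stable output.
fn newest_per_pk<'a>(rows: Vec<&'a Version>) -> Vec<&'a Version> {
    let mut newest: HashMap<u64, &'a Version> = HashMap::new();
    for v in rows {
        let entry = newest.entry(v.pk).or_insert(v);
        if v.seq > entry.seq {
            *entry = v;
        }
    }
    let mut out: Vec<&Version> = newest.into_values().collect();
    out.sort_by_key(|v| v.pk);
    out
}

/// Old order: filter first, then dedup. A stale matching version survives
/// when the newest version of the same PK fails the predicate.
pub fn filter_then_dedup(rows: &[Version]) -> Vec<(u64, u64)> {
    let filtered: Vec<&Version> = rows.iter().filter(|v| v.matches).collect();
    newest_per_pk(filtered).into_iter().map(|v| (v.pk, v.seq)).collect()
}

/// Fixed order: drop superseded versions first, then apply the predicate,
/// so a newer non-matching version excludes the PK entirely.
pub fn dedup_then_filter(rows: &[Version]) -> Vec<(u64, u64)> {
    newest_per_pk(rows.iter().collect())
        .into_iter()
        .filter(|v| v.matches)
        .map(|v| (v.pk, v.seq))
        .collect()
}

/// Illustrative data: PK 1 was updated so that its newest version no longer
/// matches; PK 2 has a single matching version.
pub fn sample_versions() -> Vec<Version> {
    vec![
        Version { pk: 1, seq: 1, matches: true },
        Version { pk: 1, seq: 2, matches: false },
        Version { pk: 2, seq: 3, matches: true },
    ]
}
```

On the sample data, `filter_then_dedup` returns the stale `(1, 1)` alongside `(2, 3)`, whereas `dedup_then_filter` returns only `(2, 3)`, which is what a normal filtered scan over the current rows would produce.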

Labels

A-java Java bindings + JNI A-python Python bindings enhancement New feature or request


2 participants