feat(mem_wal): support prefilters in LSM vector and full-text search #7138
Open
touch-of-grey wants to merge 3 commits into
The MemWAL LSM vector and full-text search planners ignored a user WHERE predicate, so a filtered search returned the same rows as the unfiltered query, including rows the predicate should have excluded. Only the plain LSM scan honored filters.

Add prefilter support to both LSM search planners via with_filter(Option<Expr>):

- Base and flushed arms reuse the dataset scanner's native prefilter.
- The active/frozen memtable arms apply the predicate before the top-k cut: the brute-force vector exec masks rows in compute_topk (a filtered vector search routes to brute force rather than HNSW), and the FTS exec masks the materialized full-schema hits before projection.
- LsmScanner::full_text_search forwards its filter to the FTS planner.

This is a true prefilter (matching a normal filtered scan), not a lossy post-filter on the per-source top-k.
9e10664 to c95ecaf
LsmScanner gains a Scanner-aligned (owned) builder so an LSM read reads like a normal scan. `nearest()` and `full_text_search()` become state setters and `create_plan()` dispatches to the vector, FTS, point-lookup, or plain planner, mirroring `Scanner::create_plan` / `MemTableScanner::create_plan`.

- Add `nearest`, `nprobes`, `refine`, `distance_metric` (vector search folds the LsmVectorSearchPlanner behind the builder; honors the builder filter).
- `full_text_search(FullTextSearchQuery)` is now a setter (column from the query, k from `limit`) instead of returning a plan directly.
- Align `project` (`<T: AsRef<str>>` + `Result`) and `limit` (`Option<i64>, Option<i64>` + `Result`) with `Scanner`.

Knobs the LSM planner cannot yet honor (ef, distance_range, maximum_nprobes, with_row_id) are intentionally not exposed to avoid silently ignoring them.
c95ecaf to 20cdb67
Contributor
Thanks for the fix! Since we are already doing the refactoring here, I think let's also make sure the MemTable scanner is consistent with the other scanners.
0239904 to edfa4f1
…h PK

The active-memtable vector and full-text search arms applied the prefilter predicate inside the per-source exec, before the within-source dedup that collapses an in-memtable update's duplicate-PK appends to the newest version. When the newest version of a PK failed the predicate but an older version passed, the filter dropped the newest and the dedup kept the stale older match, returning a row whose current version should have been excluded. This broke the "true prefilter == normal filtered scan" contract on the active arm (flushed/base were already correct: the deletion vector / block-list remove superseded rows before the filter).

Evaluate the predicate against the newest version of each PK:

- MemTableBruteForceVectorExec and FtsIndexExec take the primary-key columns and, when filtering, drop superseded versions (via compute_pk_hash) before applying the predicate, so a newer non-matching version excludes the PK.
- LsmScanner plumbs pk_columns into the active MemTableScanner arms.

Also align MemTableScanner's builder with the dataset Scanner (the API consistency this work started from): project (AsRef + Result), limit (Option<i64> + Result), nearest (&dyn Array + Result), and full_text_search (FullTextSearchQuery, converted to the local query model) now match Scanner.
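The ordering problem described above can be sketched in a few lines. This is an illustration only, not the code in this PR: the `Versioned` struct, its `seq` field, and the helper names are hypothetical, and a plain `HashMap` keyed on the primary key stands in for whatever `compute_pk_hash` does internally.

```rust
use std::collections::HashMap;

/// Hypothetical memtable row: several appends may share one primary key.
#[derive(Debug, Clone, PartialEq)]
struct Versioned {
    pk: u64,
    seq: u64, // append order; a larger value means a newer version
    category: &'static str,
}

/// Keep only the newest version of each primary key, sorted by pk.
fn newest_per_pk<'a>(rows: impl Iterator<Item = &'a Versioned>) -> Vec<Versioned> {
    let mut newest: HashMap<u64, &Versioned> = HashMap::new();
    for row in rows {
        let entry = newest.entry(row.pk).or_insert(row);
        if row.seq > entry.seq {
            *entry = row;
        }
    }
    let mut out: Vec<Versioned> = newest.into_values().cloned().collect();
    out.sort_by_key(|r| r.pk);
    out
}

/// Buggy order: filter first, then dedup the survivors.
/// A stale version that passes the predicate can outlive a newer one that fails it.
fn filter_then_dedup(rows: &[Versioned], pred: impl Fn(&Versioned) -> bool) -> Vec<Versioned> {
    newest_per_pk(rows.iter().filter(|r| pred(*r)))
}

/// Fixed order: drop superseded versions first, then apply the predicate.
fn dedup_then_filter(rows: &[Versioned], pred: impl Fn(&Versioned) -> bool) -> Vec<Versioned> {
    newest_per_pk(rows.iter())
        .into_iter()
        .filter(|r| pred(r))
        .collect()
}

/// pk 1 is updated from category "a" to "b"; pk 2 is only ever "a".
fn sample_versions() -> Vec<Versioned> {
    vec![
        Versioned { pk: 1, seq: 1, category: "a" },
        Versioned { pk: 1, seq: 2, category: "b" },
        Versioned { pk: 2, seq: 3, category: "a" },
    ]
}
```

With the predicate `category == "a"`, `filter_then_dedup` returns pk 1 (its stale version) and pk 2, while `dedup_then_filter` returns only pk 2, which is what a normal filtered scan over the current rows would produce.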
edfa4f1 to e2c2698
Motivation
The MemWAL LSM read path (added in lancedb/lancedb#3489, `use_lsm_read`) supports plain scan, full-text search, and vector (ANN) search across the base table, flushed generations, and the in-memory memtables. However, the vector and full-text search planners ignored a user `WHERE` predicate: a filtered LSM search returned the same rows as the query without the filter, i.e. the filter was silently dropped. Only the plain LSM scan honored filters.

This PR closes that gap and aligns the LSM scanner surface with the dataset `Scanner`.

Changes
1. Prefilter support in LSM vector and full-text search

Both LSM search planners now accept an optional predicate via `with_filter(Option<Expr>)` and apply it as a true prefilter (matching a normal filtered scan), not a lossy post-filter on the per-source top-k:

- Base and flushed arms reuse the dataset scanner's native prefilter (`filter_expr` + `prefilter(true)`), so the ANN / BM25 search runs over rows matching the predicate.
- The active/frozen memtable arms apply the predicate before the top-k cut: the brute-force vector exec masks rows in `compute_topk` (a filtered vector search routes to brute force rather than HNSW, whose graph traversal cannot honor an arbitrary predicate), and the FTS exec masks the materialized full-schema hits before projection.

A `NULL` predicate result excludes the row, matching SQL semantics.

2. Align `LsmScanner` with the dataset `Scanner` interface

`LsmScanner` gains a `Scanner`-aligned builder so an LSM read reads like a normal scan:

- `nearest()` (+ `nprobes` / `refine` / `distance_metric`) and `full_text_search(FullTextSearchQuery)` are now state setters; `create_plan()` dispatches to the vector, FTS, point-lookup, or plain planner (mirroring `Scanner::create_plan`).
- `project` (`<T: AsRef<str>>` + `Result`) and `limit` (`Option<i64>, Option<i64>` + `Result`) match `Scanner`, validating non-negative bounds.

Tests
Adds deterministic regressions that fail under a post-filter and pass under a true prefilter, for both the memtable arms and the indexed base arm (vector and FTS), plus facade-dispatch coverage for `nearest()` / `full_text_search()`.
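To show why a regression of this kind can separate the two behaviors, here is a minimal standalone sketch of prefilter versus post-filter top-k. It is not taken from the PR's tests: the `Row` struct, the precomputed `matches_filter` flag (standing in for evaluating the `WHERE` predicate), and the function names are all assumptions made for illustration.

```rust
/// Hypothetical search candidate: an id, a distance to the query vector,
/// and a flag standing in for the result of the WHERE predicate.
#[derive(Debug, Clone)]
struct Row {
    id: u32,
    distance: f32,
    matches_filter: bool,
}

/// Keep the k rows with the smallest distance.
fn top_k(mut rows: Vec<Row>, k: usize) -> Vec<Row> {
    rows.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    rows.truncate(k);
    rows
}

/// Prefilter: drop non-matching rows first, then take the top k.
fn prefilter_top_k(rows: &[Row], k: usize) -> Vec<u32> {
    let kept: Vec<Row> = rows.iter().filter(|r| r.matches_filter).cloned().collect();
    top_k(kept, k).into_iter().map(|r| r.id).collect()
}

/// Post-filter: take the top k first, then drop non-matching rows.
/// Matching rows ranked below the cut are lost.
fn postfilter_top_k(rows: &[Row], k: usize) -> Vec<u32> {
    top_k(rows.to_vec(), k)
        .into_iter()
        .filter(|r| r.matches_filter)
        .map(|r| r.id)
        .collect()
}

/// The two nearest rows fail the predicate; the next two pass it.
fn sample_rows() -> Vec<Row> {
    vec![
        Row { id: 1, distance: 0.1, matches_filter: false },
        Row { id: 2, distance: 0.2, matches_filter: false },
        Row { id: 3, distance: 0.3, matches_filter: true },
        Row { id: 4, distance: 0.4, matches_filter: true },
    ]
}
```

With `k = 2` on this data, `prefilter_top_k` returns ids 3 and 4, while `postfilter_top_k` returns nothing because both rows that survive the top-k cut fail the predicate. Data shaped this way makes the outcome deterministic rather than dependent on how many matches happen to land in the top k.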