feat: rename vector_usearch to vector_search_vector, return full rows #22
anoop-narang merged 7 commits into main
Keep _distance in an inner projection when ORDER BY uses a vector distance expression that is not part of the final select list.

This fixes split-provider execution for queries like SELECT id ORDER BY l2_distance(vector, ARRAY[...]) LIMIT k while preserving the final output schema. Add an execution test for the direct ORDER BY shape to cover the production case.
… pub(crate) These helpers are needed by the new vector_search_vector UDTF to reuse the same HNSW search → fetch → attach pattern as the ORDER BY path.
Replace the old vector_usearch UDTF that returned only (key, _distance)
with vector_search_vector that returns all table columns plus _distance.
New signature:
vector_search_vector('conn.schema.table', 'column', ARRAY[...], k)
The UDTF reuses usearch_search, attach_distances, and provider_key_col_idx
from the planner module to follow the same HNSW search → fetch_by_keys →
attach_distances pattern as the ORDER BY execution path.
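The search → fetch → attach pattern described above can be sketched roughly as follows. This is a minimal illustration, not the crate's code: the `Row` type, the in-memory `HashMap` table, and the brute-force `search` standing in for the HNSW index are all assumptions, and the function names only mirror the helper names mentioned in the commit message.

```rust
use std::collections::HashMap;

/// Hypothetical row shape; a real provider would return Arrow record batches.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: u64,
    pub name: String,
}

/// Squared L2 distance between two equal-length vectors.
fn l2_sq(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Step 1: return the k nearest (key, distance) pairs, closest first.
/// Brute force stands in for the HNSW index in this sketch.
pub fn search(index: &[(u64, Vec<f64>)], query: &[f64], k: usize) -> Vec<(u64, f64)> {
    let mut hits: Vec<(u64, f64)> = index
        .iter()
        .map(|(key, v)| (*key, l2_sq(v, query)))
        .collect();
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
    hits.truncate(k);
    hits
}

/// Step 2: fetch the full rows for the matched keys.
pub fn fetch_by_keys(table: &HashMap<u64, Row>, keys: &[u64]) -> Vec<Row> {
    keys.iter().filter_map(|k| table.get(k).cloned()).collect()
}

/// Step 3: attach each row's distance by key and order rows by distance.
/// Re-sorting here means the result does not depend on fetch order.
pub fn attach_distances(rows: Vec<Row>, hits: &[(u64, f64)]) -> Vec<(Row, f64)> {
    let dist: HashMap<u64, f64> = hits.iter().copied().collect();
    let mut out: Vec<(Row, f64)> = rows
        .into_iter()
        .filter_map(|r| dist.get(&r.id).map(|d| (r, *d)))
        .collect();
    out.sort_by(|a, b| a.1.total_cmp(&b.1));
    out
}

/// The three steps chained: full rows plus a distance for each of the k matches.
pub fn vector_search(
    index: &[(u64, Vec<f64>)],
    table: &HashMap<u64, Row>,
    query: &[f64],
    k: usize,
) -> Vec<(Row, f64)> {
    let hits = search(index, query, k);
    let keys: Vec<u64> = hits.iter().map(|(key, _)| *key).collect();
    let rows = fetch_by_keys(table, &keys);
    attach_distances(rows, &hits)
}
```

Calling `vector_search` with `k` larger than the number of indexed vectors simply returns every row, which corresponds to the "k > dataset size" case.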
Update UDTF section to reflect the new vector_search_vector signature and full-row return schema. Update module structure reference.
@@ -145,33 +128,71 @@ impl TableProvider for USearchProvider {
        projection: Option<&Vec<usize>>,
        _filters: &[Expr],
        _limit: Option<usize>,
Blocking — The scan() method does the HNSW search and fetch_by_keys() eagerly at planning time, but _limit is always ignored. This is fine by design (k is the user limit), but there's a precision issue worth flagging alongside it: the query vector parsed by extract_f32_vec is Vec<f32>, then promoted to Vec<f64> here for usearch_search. The optimizer path (rule.rs) extracts the query vector as Vec<f64> directly from the SQL AST, preserving full precision. If the index was built with ScalarKind::F64, the UDTF silently loses precision in the query vector vs what the optimizer path would use.
This is probably a minor accuracy difference for the common F32 case, but it is a behavioral inconsistency. The fix is to parse as f64 from the start — either change extract_f32_vec to extract_f64_vec and store Vec<f64> in VectorSearchVectorProvider, or convert lazily only when calling usearch_search.
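To make the precision concern concrete, here is a small sketch comparing the two parsing orders. The function names are hypothetical, and modelling the SQL array literals as strings is an assumption made only for illustration; the actual extraction works on the SQL AST.

```rust
/// Parse each literal as f32, then widen to f64 (the path the review flags).
pub fn parse_then_widen(literals: &[&str]) -> Vec<f64> {
    literals
        .iter()
        .map(|s| s.parse::<f32>().expect("valid float literal") as f64)
        .collect()
}

/// Parse each literal directly as f64 (the suggested fix).
pub fn parse_as_f64(literals: &[&str]) -> Vec<f64> {
    literals
        .iter()
        .map(|s| s.parse::<f64>().expect("valid float literal"))
        .collect()
}
```

For a literal such as `0.1`, which is not exactly representable in binary floating point, the widened f32 value differs from the f64 value by roughly 1.5e-9. Values like `0.5` that are exact in both formats come out identical, which is why the difference is easy to miss in tests that use round numbers.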
/// SELECT * with distance UDF — should fall back to UDF brute-force
/// (since vector column is not in lookup provider schema).
#[tokio::test]
Blocking — vector_search_vector is the primary new feature in this PR, but there are zero tests exercising it. All 69 passing tests cover the optimizer/planner path and the three new tests added here cover changes to rule.rs. The UDTF itself is completely untested.
At minimum the following cases need test coverage:
- Basic happy path: `SELECT * FROM vector_search_vector('conn.schema.table', 'vector', ARRAY[...], k)` returns correct rows + `_distance`
- Projection pushdown: `SELECT id, _distance FROM vector_search_vector(...)` only returns the requested columns
- Ordering: results are sorted correctly by `_distance` when used with `ORDER BY`
- `parse_dot_table_ref` failure: passing `'table'` or `'schema.table'` (fewer than 3 parts) returns a `Plan` error
- Index miss: resolving a key not in the registry returns an `Execution` error with the expected message
- Empty result: search returns 0 matches → empty batch with the correct schema
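As a rough illustration of the table-reference failure case listed above, the sketch below splits a dotted reference into exactly three parts and builds a `conn::schema::table::column` style registry key. `TableRef`, `parse_table_ref_sketch`, and `registry_key` are hypothetical names, the `String` error stands in for the planner error type, and rejecting references with more than three parts is an assumption rather than confirmed behavior of the real `parse_dot_table_ref`.

```rust
/// Hypothetical parsed form of a 'conn.schema.table' reference.
#[derive(Debug, PartialEq)]
pub struct TableRef {
    pub conn: String,
    pub schema: String,
    pub table: String,
}

/// Split a dotted reference into exactly three non-empty parts.
/// Anything else is rejected with a descriptive error.
pub fn parse_table_ref_sketch(s: &str) -> Result<TableRef, String> {
    let parts: Vec<&str> = s.split('.').collect();
    match parts.as_slice() {
        [conn, schema, table] if parts.iter().all(|p| !p.is_empty()) => Ok(TableRef {
            conn: (*conn).to_string(),
            schema: (*schema).to_string(),
            table: (*table).to_string(),
        }),
        _ => Err(format!("expected 'conn.schema.table', got '{}'", s)),
    }
}

/// Build the lookup key for a given table reference and vector column.
pub fn registry_key(r: &TableRef, column: &str) -> String {
    format!("{}::{}::{}::{}", r.conn, r.schema, r.table, column)
}
```

A test for the failure case would then assert that `'table'` and `'schema.table'` both produce an error, while `'conn.schema.table'` parses and maps to the expected key.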
Review
Blocking Issues
- No tests for `vector_search_vector` (`tests/execution.rs`): The renamed and rewritten UDTF is the core of this PR, but it has zero test coverage. All 69 passing tests exercise the optimizer/planner path; the new UDTF code path (argument parsing, `parse_dot_table_ref`, registry lookup, `fetch_by_keys`, distance attachment, and projection pushdown) is completely untested. See inline comment at `tests/execution.rs` line 627.
- Query vector precision loss (`src/udtf.rs`, `scan()`): `extract_f32_vec` stores the query vector as `Vec<f32>`, then `scan()` widens it to `Vec<f64>` before calling `usearch_search`. The optimizer path (`rule.rs`) keeps `Vec<f64>` from the SQL AST throughout, preserving full literal precision. For `F64` indices this is a silent accuracy divergence. See inline comment at `src/udtf.rs` line 130.
Action Required
- Add integration tests that call `vector_search_vector` via SQL (register the table under a `conn::schema::table::column` key, query through `ctx.sql()`, assert correct rows and `_distance` values). Cover the happy path, projection pushdown, invalid table-ref format, and registry miss.
- Change the query vector storage to `Vec<f64>` (matching the optimizer path) to avoid the precision roundtrip.
Parse query vectors as f64 to match the optimizer path's precision, avoiding silent accuracy loss for F64-quantized indexes. Add 5 tests for vector_search_vector: basic happy path, projection pushdown, bad table ref error, registry miss error, k > dataset size.
Summary
- Rename the `vector_usearch` UDTF to `vector_search_vector` with new signature: `vector_search_vector('conn.schema.table', 'column', ARRAY[...], k)`
- Return all table columns plus `_distance` instead of just `(key, _distance)`
- Reuse `usearch_search()`, `attach_distances()`, `provider_key_col_idx()` from the planner module (made `pub(crate)`)

The UDTF uses `resolve()` (sync, cache-only): the caller is responsible for ensuring the index is pre-loaded before planning.

Test plan
- `cargo fmt --check` clean
- `cargo clippy --all-targets --all-features -- -D warnings` clean
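
The "sync, cache-only" contract noted in the summary can be pictured with the sketch below. `IndexRegistry`, `Index`, and `preload` are hypothetical names, and the `String` error stands in for an execution error; the only property illustrated is that `resolve()` never loads anything itself and fails when the key was not loaded ahead of time.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Placeholder for a loaded vector index (assumption for this sketch).
#[derive(Debug)]
pub struct Index {
    pub dims: usize,
}

/// Hypothetical registry keyed by a 'conn::schema::table::column' style string.
#[derive(Default)]
pub struct IndexRegistry {
    cache: RwLock<HashMap<String, Arc<Index>>>,
}

impl IndexRegistry {
    /// Caller-side step: load the index into the cache before planning.
    pub fn preload(&self, key: &str, index: Index) {
        self.cache
            .write()
            .expect("registry lock poisoned")
            .insert(key.to_string(), Arc::new(index));
    }

    /// Synchronous, cache-only lookup: returns an error on a miss instead of loading.
    pub fn resolve(&self, key: &str) -> Result<Arc<Index>, String> {
        let cache = self.cache.read().expect("registry lock poisoned");
        let found = cache.get(key).cloned();
        found.ok_or_else(|| format!("index '{}' is not loaded", key))
    }
}
```

Under this design a query against an index that was never pre-loaded fails at lookup rather than blocking on I/O, which matches the "registry miss" error case covered by the added tests.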