feat: rename vector_usearch to vector_search_vector, return full rows #22

Merged
anoop-narang merged 7 commits into main from feat/vector-search-udtf on Apr 16, 2026

Conversation

@anoop-narang
Collaborator

Summary

  • Rename vector_usearch UDTF to vector_search_vector with new signature:
    vector_search_vector('conn.schema.table', 'column', ARRAY[...], k)
  • Return all table columns + _distance instead of just (key, _distance)
  • Reuse usearch_search(), attach_distances(), provider_key_col_idx() from the planner module (made pub(crate))
  • Update README to reflect new UDTF signature and behavior

The UDTF uses resolve() (sync, cache-only) — the caller is responsible for ensuring the index is pre-loaded before planning.

Test plan

  • All 69 existing tests pass (optimizer rule, execution, providers)
  • cargo fmt --check clean
  • cargo clippy --all-targets --all-features -- -D warnings clean
  • Manually tested via runtimedb server: basic queries, aliases, ORDER BY, WHERE, GROUP BY, subqueries, CTEs, aggregates, EXPLAIN, error cases

Keep _distance in an inner projection when ORDER BY uses a vector
distance expression that is not part of the final select list.

This fixes split-provider execution for queries like SELECT id ORDER
BY l2_distance(vector, ARRAY[...]) LIMIT k while preserving the final
output schema. Add an execution test for the direct ORDER BY shape to
cover the production case.
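
To make the two-stage shape concrete, here is a minimal sketch of an inner stage that keeps the distance for sorting and an outer stage that projects it away. The `InnerRow` type and both function names are hypothetical and only illustrate the idea; the real change lives in the planner and operates on query plans, not on in-memory structs.

```rust
// Illustrative only: plain structs stand in for plan nodes and record batches.

#[derive(Debug, Clone, PartialEq)]
struct InnerRow {
    id: u64,
    distance: f64, // plays the role of the `_distance` column
}

/// Inner stage: `_distance` is still present, so the sort and LIMIT k can use it.
fn inner_sorted(mut rows: Vec<InnerRow>, k: usize) -> Vec<InnerRow> {
    rows.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    rows.truncate(k);
    rows
}

/// Outer stage: project down to the final select list (`id` only),
/// so the output schema does not expose `_distance`.
fn final_projection(rows: &[InnerRow]) -> Vec<u64> {
    rows.iter().map(|r| r.id).collect()
}

fn main() {
    let rows = vec![
        InnerRow { id: 10, distance: 0.9 },
        InnerRow { id: 11, distance: 0.1 },
        InnerRow { id: 12, distance: 0.5 },
    ];
    let ids = final_projection(&inner_sorted(rows, 2));
    println!("{:?}", ids);
}
```

If the distance were dropped before the sort, the outer stage would have nothing to order by, which is the failure mode the commit message describes.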
… pub(crate)

These helpers are needed by the new vector_search_vector UDTF to reuse
the same HNSW search → fetch → attach pattern as the ORDER BY path.

Replace the old vector_usearch UDTF that returned only (key, _distance)
with vector_search_vector that returns all table columns plus _distance.

New signature:
  vector_search_vector('conn.schema.table', 'column', ARRAY[...], k)

The UDTF reuses usearch_search, attach_distances, and provider_key_col_idx
from the planner module to follow the same HNSW search → fetch_by_keys →
attach_distances pattern as the ORDER BY execution path.
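
The search → fetch_by_keys → attach_distances flow described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the `*_sketch` functions are stand-ins rather than the crate's real helpers, a brute-force scan replaces the HNSW index, and a `HashMap` replaces the lookup provider.

```rust
use std::collections::HashMap;

// All names are illustrative stand-ins, not the planner module's real API.

#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: u64,
    title: String,
}

#[derive(Debug, Clone, PartialEq)]
struct RowWithDistance {
    id: u64,
    title: String,
    distance: f64,
}

/// Step 1 (stand-in for the HNSW search): k nearest keys by squared L2 distance.
fn search_sketch(index: &[(u64, Vec<f64>)], query: &[f64], k: usize) -> Vec<(u64, f64)> {
    let mut hits: Vec<(u64, f64)> = index
        .iter()
        .map(|(key, v)| {
            let d: f64 = v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
            (*key, d)
        })
        .collect();
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
    hits.truncate(k); // k larger than the dataset simply returns every row
    hits
}

/// Step 2: fetch the full rows for the matched keys, preserving hit order.
fn fetch_by_keys_sketch(table: &HashMap<u64, Row>, hits: &[(u64, f64)]) -> Vec<Row> {
    hits.iter().filter_map(|(key, _)| table.get(key).cloned()).collect()
}

/// Step 3: append the distance to each fetched row.
fn attach_distances_sketch(rows: Vec<Row>, hits: &[(u64, f64)]) -> Vec<RowWithDistance> {
    let by_key: HashMap<u64, f64> = hits.iter().cloned().collect();
    rows.into_iter()
        .filter_map(|row| {
            let distance = *by_key.get(&row.id)?;
            Some(RowWithDistance { id: row.id, title: row.title, distance })
        })
        .collect()
}

/// Runs the three steps over a tiny in-memory dataset.
fn demo(k: usize) -> Vec<RowWithDistance> {
    let index: Vec<(u64, Vec<f64>)> = vec![
        (1, vec![0.0, 0.0]),
        (2, vec![1.0, 0.0]),
        (3, vec![4.0, 5.0]),
    ];
    let table: HashMap<u64, Row> = vec![(1u64, "a"), (2, "b"), (3, "c")]
        .into_iter()
        .map(|(id, t)| (id, Row { id, title: t.to_string() }))
        .collect();
    let hits = search_sketch(&index, &[1.0, 0.0], k);
    let rows = fetch_by_keys_sketch(&table, &hits);
    attach_distances_sketch(rows, &hits)
}

fn main() {
    for row in demo(2) {
        println!("{:?}", row);
    }
}
```

The point of the shape is that the result carries every table column plus the distance, rather than only (key, _distance) as the old UDTF did.
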
Update UDTF section to reflect the new vector_search_vector signature
and full-row return schema. Update module structure reference.
Comment thread src/udtf.rs
@@ -145,33 +128,71 @@ impl TableProvider for USearchProvider {
projection: Option<&Vec<usize>>,
_filters: &[Expr],
_limit: Option<usize>,

Blocking — The scan() method does the HNSW search and fetch_by_keys() eagerly at planning time, but _limit is always ignored. This is fine by design (k is the user limit), but there's a precision issue worth flagging alongside it: the query vector parsed by extract_f32_vec is Vec<f32>, then promoted to Vec<f64> here for usearch_search. The optimizer path (rule.rs) extracts the query vector as Vec<f64> directly from the SQL AST, preserving full precision. If the index was built with ScalarKind::F64, the UDTF silently loses precision in the query vector vs what the optimizer path would use.

This is probably a minor accuracy difference for the common F32 case, but it is a behavioral inconsistency. The fix is to parse as f64 from the start — either change extract_f32_vec to extract_f64_vec and store Vec<f64> in VectorSearchVectorProvider, or convert lazily only when calling usearch_search.
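
A small standalone sketch of the divergence, assuming the literals arrive as text. The two helper functions are hypothetical and are not the PR's `extract_f32_vec` or its proposed replacement; they only show that widening an `f32` does not recover the value obtained by parsing the same literal as `f64`.

```rust
// Hypothetical helpers for illustration; not the functions in src/udtf.rs.

/// Parse literals as f32, then widen to f64 (the shape flagged in the review).
fn parse_via_f32(literals: &[&str]) -> Vec<f64> {
    literals
        .iter()
        .map(|s| s.parse::<f32>().expect("valid float literal") as f64)
        .collect()
}

/// Parse literals directly as f64 (the suggested fix).
fn parse_as_f64(literals: &[&str]) -> Vec<f64> {
    literals
        .iter()
        .map(|s| s.parse::<f64>().expect("valid float literal"))
        .collect()
}

fn main() {
    let literals = ["0.1", "0.25", "0.3"];
    let widened = parse_via_f32(&literals);
    let direct = parse_as_f64(&literals);
    for ((lit, w), d) in literals.iter().zip(&widened).zip(&direct) {
        println!("{}: via f32 = {:.17}, direct f64 = {:.17}, equal = {}", lit, w, d, w == d);
    }
}
```

Values that are exactly representable in `f32`, such as 0.25, survive the round trip; most decimal literals, such as 0.1, do not, so the two paths can hand different query vectors to the same index.
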

Comment thread tests/execution.rs

/// SELECT * with distance UDF — should fall back to UDF brute-force
/// (since vector column is not in lookup provider schema).
#[tokio::test]

Blocking: vector_search_vector is the primary new feature in this PR, but there are zero tests exercising it. All 69 passing tests cover the optimizer/planner path and the three new tests added here cover changes to rule.rs. The UDTF itself is completely untested.

At minimum the following cases need test coverage:

  • Basic happy path: SELECT * FROM vector_search_vector('conn.schema.table', 'vector', ARRAY[...], k) returns correct rows + _distance
  • Projection pushdown: SELECT id, _distance FROM vector_search_vector(...) only returns the requested columns
  • Ordering: results are sorted correctly by _distance when used with ORDER BY
  • parse_dot_table_ref failure: passing 'table' or 'schema.table' (fewer than 3 parts) returns a Plan error
  • Index miss: resolving a key not in the registry returns an Execution error with the expected message
  • Empty result: search returns 0 matches → empty batch with the correct schema
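
For the table-ref and registry cases, a rough sketch of the behavior such tests would pin down is shown below. Both functions are hypothetical stand-ins: the real `parse_dot_table_ref` returns a Plan error rather than a `String`, and its handling of empty segments or more than three parts is not stated in this thread, so rejecting them here is an assumption. The `conn::schema::table::column` key shape is taken from the review's description of how tables are registered.

```rust
// Hypothetical stand-ins; error types and edge-case handling are assumptions.

/// Split 'conn.schema.table' into its three parts, rejecting anything else.
fn parse_dot_table_ref_sketch(s: &str) -> Result<(String, String, String), String> {
    let parts: Vec<&str> = s.split('.').collect();
    match parts.as_slice() {
        [conn, schema, table]
            if !conn.is_empty() && !schema.is_empty() && !table.is_empty() =>
        {
            Ok((conn.to_string(), schema.to_string(), table.to_string()))
        }
        _ => Err(format!("expected 'conn.schema.table', got '{}'", s)),
    }
}

/// Build a registry key of the form conn::schema::table::column.
fn registry_key_sketch(table_ref: &str, column: &str) -> Result<String, String> {
    let (conn, schema, table) = parse_dot_table_ref_sketch(table_ref)?;
    Ok(format!("{}::{}::{}::{}", conn, schema, table, column))
}

fn main() {
    for input in ["conn.schema.table", "schema.table", "table"].iter() {
        println!("{:?} -> {:?}", input, registry_key_sketch(input, "vector"));
    }
}
```
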

@claude claude Bot left a comment

Review

Blocking Issues

  • No tests for vector_search_vector (tests/execution.rs) — The renamed and rewritten UDTF is the core of this PR, but it has zero test coverage. All 69 passing tests exercise the optimizer/planner path; the new UDTF code path (argument parsing, parse_dot_table_ref, registry lookup, fetch_by_keys, distance attachment, and projection pushdown) is completely untested. See inline comment at tests/execution.rs line 627.

  • Query vector precision loss (src/udtf.rs, scan()) — extract_f32_vec stores the query vector as Vec<f32>, then scan() widens it to Vec<f64> before calling usearch_search. The optimizer path (rule.rs) keeps Vec<f64> from the SQL AST throughout, preserving full literal precision. For F64 indices this is a silent accuracy divergence. See inline comment at src/udtf.rs line 130.

Action Required

  1. Add integration tests that call vector_search_vector via SQL (register the table under a conn::schema::table::column key, query through ctx.sql(), assert correct rows and _distance values). Cover the happy path, projection pushdown, invalid table-ref format, and registry miss.
  2. Change the query vector storage to Vec<f64> (matching the optimizer path) to avoid the precision roundtrip.

Parse query vectors as f64 to match the optimizer path's precision,
avoiding silent accuracy loss for F64-quantized indexes.

Add 5 tests for vector_search_vector: basic happy path, projection
pushdown, bad table ref error, registry miss error, k > dataset size.
Comment thread tests/execution.rs
anoop-narang merged commit 67ddf6d into main on Apr 16, 2026
5 checks passed
anoop-narang deleted the feat/vector-search-udtf branch on April 16, 2026 13:09