feat(index): support raw-query ivf rq search#7078
Conversation
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
Important This PR touches the Lance format specification. Substantive changes to the format specification — the If this is a meaningful format change:
|
ce548a4 to
763ecc0
Compare
|
@claude review |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 42b122a38b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
This is a substantial new feature touching the Lance format spec, the IVF_RQ storage layout, and the core distance-calculator path — I'm leaving it for a human reviewer.
Extended reasoning...
Overview
This PR introduces a new raw_query estimator for IVF_RQ search alongside the existing residual_query estimator, including multi-bit (num_bits > 1) split-code search. Changes span 13 files: a format-spec doc update, new metadata fields (query_estimator), substantial additions to bq/storage.rs (~1100 lines including new ex-code distance tables, FastScan paths, and lower-bound gating), a refactor of the flat index search loops to use new accumulate_topk_with_scratch trait methods, a fix to PERM0_INVERSE in the SIMD dist table (the prior value was incorrect — note the new test_perm0_inverse_matches_perm0 test), and IVF v2 plumbing for per-query rotated-query contexts and per-index rotated-centroid caches.
Security risks
No direct security risks — this is internal indexing code with no auth, crypto, or untrusted-input boundaries.
Level of scrutiny
High. This PR touches the format specification (per the format-change-vote reminder, this needs a PMC vote with 3 binding +1s and a 1-week voting period), modifies the storage layout for newly-built IVF_RQ indexes, and rewrites the search hot path with new lower-bound gating that affects recall/accuracy. The compatibility shim (default_query_estimator_compat → ResidualQuery) is the right approach for old indexes, but the correctness of the new raw-query estimator factors, error-factor gating, and FastScan ex-code path all warrant careful human review.
Other factors
- The PR has good test coverage: new unit tests for the estimator factors, FastScan ex-code paths, lower-bound gating, cache slice borrowing, and a parameterized integration test for multi-bit L2/Cosine search.
- A previously-incorrect
PERM0_INVERSEconstant is silently fixed here — worth a human verifying no callers depended on the buggy value. - The Cosine→L2 distance-type rewrite in
try_from_batchfor raw-query indexes is subtle and could surprise readers; worth confirming the rewrite is correct and intentional. - The format-spec change requires a PMC vote that has not yet been linked on this PR.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
| def skip_write_after_current_write(self, version: str) -> bool: | ||
| # Newly written IVF_RQ indexes carry raw-query estimator metadata and | ||
| # split-code schema that older runtimes can query but cannot optimize. | ||
| # The upgrade_downgrade variant still covers old 1-bit residual-query | ||
| # indexes being read and rewritten by the current runtime. | ||
| return True |
There was a problem hiding this comment.
🔴 The new skip_write_after_current_write override (test_vector_indices.py:277-282) leaves skip_read_after_current_write at the default False, so the downgrade flow still runs the old runtime's check_read against the newly-written num_bits=1 IVF_RQ index — but builder.rs::new_with_rotation now unconditionally writes query_estimator: RawQuery (lines 240, 251), a field old serde silently drops, so the old runtime applies the legacy ResidualQuery distance formula to __add_factors/__scale_factors that were computed for the RawQuery formula. check_read only asserts result.num_rows == 4 and never validates rankings, so the test passes vacuously while real downgraded users see silent recall regression and the PR-author comment claiming "older runtimes can query the new indexes (just cannot optimize)" is wrong for the num_bits=1 path. Fix: either also override skip_read_after_current_write to True (and remove the misleading comment), or keep ResidualQuery as the default query_estimator for num_bits=1 so the on-disk meaning of __add_factors/__scale_factors stays compatible with released runtimes (the num_bits>1 path was already rejected by old runtimes via the removed validate_supported_rq_num_bits gate).
Extended reasoning...
What the bug is
The PR adds skip_write_after_current_write = True to IvfRqVectorIndex (test_vector_indices.py:277-282) with a comment claiming "older runtimes can query the new indexes (just cannot optimize)." That premise is false for num_bits=1, and the downgrade direction is therefore exercising a silently-broken read path that the compat test cannot catch.
In compat_decorator.py:340-343, the downgrade flow is now gated as:
if not obj.skip_read_after_current_write(version):
venv.execute_method(obj, "check_read", obj.compat_env(version, "check_read"))
if not obj.skip_write_after_current_write(version):
venv.execute_method(obj, "check_write", obj.compat_env(version, "check_write"))IvfRqVectorIndex overrides only skip_write_after_current_write (returns True). skip_read_after_current_write inherits the default False from UpgradeDowngradeTest (compat_decorator.py:152-154), so the old venv still runs check_read against the index written by the current runtime.
The on-disk meaning of factor columns changed
builder.rs::RabitQuantizer::new_with_rotation unconditionally sets query_estimator: RabitQueryEstimator::RawQuery in both the Matrix branch (line 240) and the Fast branch (line 251) — there is no num_bits=1 carve-out. So every newly-built IVF_RQ index, including num_bits=1, ships with the raw-query estimator and the new RawQuery factor formula:
// transform.rs (raw-query, num_bits=1, L2)
scale = -2 * |res|^2 / binary_res_dot // binary_res_dot = 0.5 * sum(|rotated|)
add = |res|^2 + 2 * binary_correction // binary_correction = |res|^2 * binary_cent_dot / binary_res_dotThe released runtime's transform writes:
// transform.rs (residual-query, L2)
scale = -2 * |res|^2 / ip_rq_res // ip_rq_res = sum(|rotated|) / sqrt(d)
add = |res|^2binary_res_dot and ip_rq_res are related by binary_res_dot = (sqrt(d) / 2) * ip_rq_res, so the new scale magnitude is (2 / sqrt(d)) times the legacy magnitude. add gains an extra 2 * binary_correction term that depends on per-row residual signs vs the rotated centroid — a row-dependent perturbation that breaks rank ordering, not just a constant offset.
The old runtime parses the new index as if it were a residual-query index
The pre-PR RabitQuantizationMetadata struct has no query_estimator field, and there is no #[serde(deny_unknown_fields)] (verified in storage.rs:220 and earlier release commits). Serde silently ignores the new "query_estimator":"raw_query" key. The old runtime then plugs the on-disk factor values — which now mean RawQuery — into the legacy ResidualQuery formula in storage.rs::distance():
let dist_vq_qr = (2.0 * dist - sum_q) / sqrt_d;
dist_vq_qr * scale + add + query_factorwhile the new code applies (dist - 0.5 * sum_q) * scale + add + query_factor to the same column values. The old code also residualizes the query before building dist_table (via QueryResidual::Centroid), while the raw-query factors are computed assuming a raw (un-residualized) query. The new extra column __error_factors and the new metadata field are both silently ignored by the old reader (column_by_name tolerates extras and old metadata struct has no field for them), so the index loads cleanly and just returns wrong distances.
Why check_read does not catch it
IvfRqVectorIndex.check_read (test_vector_indices.py:311-340) asserts:
assert result.num_rows == 4
# plus num_indexed_rows > 0 and (under current runtime) num_bits == 1No ground-truth k-NN comparison, no recall threshold, no row-id verification. The IVF search still returns 4 rows when distances are completely wrong — they are just the wrong 4 rows. CLAUDE.md:111 mandates a >=0.5 recall threshold for vector index tests precisely to catch this class of silent regression, and the compat test was the place where it should have triggered.
Step-by-step proof
- Build a num_bits=1 IVF_RQ index on the current branch. The metadata JSON includes
"query_estimator":"raw_query"and the aux file gains an__error_factorscolumn. transform.rs::transformtakes the new branch (query_estimator != ResidualQuery) and writes__add_factors/__scale_factorsviacompute_raw_query_factors— values that the newstorage.rs::distance()RawQuery branch consumes as(binary_dot - 0.5 * sum_q) * scale + add + query_factor.- The downgrade venv (pre-PR) opens the same aux file. Its
RabitQuantizationMetadatadeserialize ignoresquery_estimator(nodeny_unknown_fields) and ignores__error_factors(only named columns are pulled). - Its
distance()applies the legacy formula(2 * dist - sum_q) / sqrt_d * scale + add + query_factorto columns whose magnitudes are off by2/sqrt(d)inscaleand disturbed by the row-dependent2 * binary_correctioninadd. For dim=32 (the compat dataset),2/sqrt(32) ≈ 0.354, so the scale magnitudes are roughly a third of what the old formula expects. check_readissues a 4-NN search and assertsresult.num_rows == 4. The IVF/FLAT pipeline returns four rows from the probed partition regardless of distance values, so the assertion holds and the test passes — but the rows returned are not the four nearest under either metric.
How to fix
Either option from the original report works and is small:
- (a) Override
skip_read_after_current_writetoTrueinIvfRqVectorIndex, and replace the misleading comment with an explicit note that downgrade-direction reads are not supported once a num_bits=1 index is rewritten by current. This honestly marks the compat regression rather than papering over it. - (b) Keep
ResidualQueryas the defaultquery_estimatorwhennum_bits == 1inbuilder.rs::new_with_rotation(only switch toRawQueryfornum_bits > 1, which old runtimes already rejected viavalidate_supported_rq_num_bitsbefore this PR removed that gate). The binary-only on-disk layout then stays bit-for-bit compatible with released runtimes, and the PR's stated compat claim becomes true.
Optionally combine with strengthening check_read to assert recall against an exhaustive top-k so this class of silent regression cannot regress in the future.
Feature
query_estimatormetadata so released indexes without the field continue to read asresidual_query, while newly written indexes useraw_query.num_bits == 1indexes and multi-bit split-code indexes, including ex-code factors and runtime-only rotated centroid caches derived from the originalIvfModelcentroids.num_bits > 1gate for supported metrics, including cosine via Lance's normalized-L2 handling.Compatibility
query_estimatormetadata still default to the legacy residual-query estimator.Performance Improvement
The benchmark below was run with
search-benchmarkon GCP VMyang-agent-00bd-ivfrq-rerun-20260605, datasetgist,k=10,max_threads=1,target_partition_size=4096, no refine. Latencies are converted from CSV seconds to milliseconds.Provenance:
61ef8f7b97589032a83eeae1e52664be9f035551437849118f380d92c1ea849f99996e9072be58dfce548a49766670b80275daae6f1bf97c70e885e4Additional DBpedia comparison on the same VM, current branch only, dataset
dbpedia,k=10,max_threads=1,target_partition_size=4096. For IVF_PQ,sub_vector_dim=8; one extra row includesrefine_factor=2atnprobes=24.Tests
cargo fmt --allcargo test -p lance-index raw_querycargo test -p lance-index try_from_batch_cargo test -p lance-index rabit_quantizercargo test -p lance test_rabitq_distance_typescargo test -p lance test_build_ivf_rqcargo clippy -p lance-index -p lance --tests --benches -- -D warningsuv run make build