fix: dedup active-memtable predicate-crossing stale reads (vector + FTS)#7067
fix: dedup active-memtable predicate-crossing stale reads (vector + FTS)#7067hamersaw wants to merge 13 commits into
Conversation
… reads The active memtable's vector (HNSW) and FTS (inverted) indexes are append-only, and both index arms dedup results with WithinSourceDedupExec, which only suppresses a stale row when the fresh version is also present in the result set. When a PK is updated out of the query's match set (vector: moved out of the neighborhood; FTS: new text no longer matches the query), the fresh row isn't returned, so the stale version leaks. Two #[ignore]d repros capture the gap: - vector_search: a PK moved far from the query still surfaces at its stale near distance because the fresh node is evicted from the over-fetched top-k. - fts_search: a PK updated "alpha lance" -> "beta lance" still matches an "alpha" query because the old posting stays live in the append-only index. Both are #[ignore]d so the default suite stays green; un-ignore once the active arms gain a predicate-independent recency filter over the whole memtable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… vector/FTS search The active memtable's HNSW (vector) and inverted (FTS) indexes are append-only, so an updated row's old index entries stay live. Both arms deduped results with WithinSourceDedupExec, which only suppresses a stale row when the fresh version is also in the result set. When an update moves a row out of the query's match set (vector: far from the query; FTS: new text no longer matches), the fresh version isn't returned, so the stale version leaked. Fix: maintain a per-memtable MVCC PK-position index — a lock-free arena skiplist keyed on (compute_pk_hash(pk_columns), row_position), enabled on the active memtable and carried through freeze. This reuses the exact recency primitive point-lookup already trusts (get_newest_visible), so the fix is "make the index arms do the seek point-lookup already does": - NewestPkFilterExec keeps an index hit iff pk_position_index.get_newest_visible(pk_hash, max_visible) == row_position, predicate-independent and snapshot-exact (it keys on the same max_visible the scanner latched, read back from the scanner, not a fresh load, to avoid a concurrent-append race). - Wired into the active vector arm (replacing WithinSourceDedupExec) and the active FTS arm (adding with_row_id + the filter). - The cross-source block-list now sources in-memory membership from the same index (in_memory_pk_hashes -> PkPositionIndex::pk_hashes), falling back to the BatchStore scan only when no PK index is present. Keyset == whole-store membership, so behavior is preserved; on-disk tiers unchanged. The hash keying covers single and composite primary keys uniformly and matches the hashing WithinSourceDedupExec/block_list already use. Both previously-ignored repro tests now pass; adds pk_position unit tests and an index-sourced block-list test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ition index; drop WithinSourceDedupExec Unifies the in-memory newest-per-PK / membership decisions onto the maintained MVCC PK-position index and removes the now-redundant WithinSourceDedupExec. - point_lookup: `probe_position` prefers the value-keyed PK BTree (collision- free) and falls back to the PK-position index — hashing the key and verifying the candidate's actual value (via `visible_positions`, newest-first) to guard hash collisions — when no scalar BTree exists, so composite / unindexed PKs get the plan-free fast probe too. The plan-path active arm's WithinSourceDedupExec is replaced by `SortExec(_rowid DESC).fetch(1)` (value-exact via the filter, newest by row id). - block-list: `compute_source_block_lists` yields `Vec<GenMembership>` instead of `Vec<Arc<HashSet>>`. In-memory generations are probed per candidate (`GenMembership::Index` -> `PkPositionIndex::contains_visible`, snapshot-bounded) with no per-query set materialized; flushed/base keep their cached sets, and an in-memory memtable without a PK-position index falls back to the BatchStore scan. Snapshot-bounding also closes a latent over-block where a not-yet-visible newer write could shadow an older visible copy. `PkHashFilterExec` now consumes `Vec<GenMembership>`. - cleanup: WithinSourceDedupExec and DedupDirection (incl. the dead KeepMinRowAddr) have no remaining users and are deleted. Added shared helpers `BatchStore::max_visible_row` and point_lookup's `resolve_position`. Full mem_wal suite green; adds point-lookup-without-btree and index-sourced block-list tests. fresh_tier_block_list (external contains_pks API) still materializes sets; migrating MemTableDedupScanExec's reverse-walk HashSet onto the same probe is a benchmark-gated follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e pk_hashes() `fresh_tier_block_list` now returns `Vec<GenMembership>` instead of `Vec<Arc<HashSet<u64>>>`, and `LsmScanner::contains_pks` tests each PK by probing the membership (`GenMembership::contains`) rather than against a materialized set. In-memory tiers are probed against their PK-position index (snapshot-bounded); flushed/base keep their cached hash sets. This retires the last per-query PK-hash set built from an in-memory memtable: `PkPositionIndex::pk_hashes()` and `block_list::in_memory_pk_hashes` are deleted. `pk_hashes_from_batch_store` remains only as the no-PK-index fallback. `contains_pks`'s public signature (&RecordBatch -> Vec<bool>) is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the two correctness cases the index probing introduces: - NewestPkFilterExec (within-source): keeps only the newest visible position per PK, and — crucially — does NOT vanish a visible row when a newer version was committed after the query's snapshot (max_visible bounds the seek). Also covers the no-PK-index pass-through. - block_list (cross-source): index-sourced membership is snapshot-bounded, so a newer generation's not-yet-visible write does not shadow an older generation's visible copy (while a visible newer write still does). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
I originally removed the skiplist on primary key columns because it impacts perf, trading it off with stale data. But now that we've made it even faster than RocksDB, I agree with the general direction of this PR to add it back for more accuracy. However, if we're going to maintain a per-PK skiplist anyway, I think we should just always maintain a BTree index (which is a skiplist in the memtable) on the primary key columns rather than introducing a separate If we always maintain a BTree on the primary key columns, |
…ndex Maintain a BTree index on each primary-key column (reusing a user's scalar index when present, else auto-creating an in-memory `__pk__*` one) and compose them for tuple-level MVCC lookups, replacing the separate hash-keyed PkPositionIndex. Because row position is the physical row identity, a composite key's newest visible row is the largest position present in every column's BTree at or below the watermark — collision-free and uniform for single and composite PKs. - index/pk_lookup.rs: new PkLookup over the per-column PK BTrees (get_newest_visible / is_newest / contains_visible). - BTreeMemIndex: add visible_positions + contains_position primitives. - IndexStore: btree_indexes now Arc-valued; enable_pk_index reuses or auto-creates PK BTrees (maintained by the existing insert loop, in-memory only — flush still rebuilds from user index configs); pk_index() -> PkLookup. - NewestPkFilterExec / point_lookup / block_list now probe by value; point_lookup's hash fallback + re-check deleted. Block-list flushed/base arm stays hash-keyed; needs_values() skips per-row value extraction otherwise. - Delete index/pk_position.rs. Addresses jackye1995's review on lance-format#7067. Sharp edge: composite-PK lookups intersect per-column position lists, so put the most selective column first. cargo test -p lance --lib mem_wal: 385 passed; fmt + clippy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PkLookup was a borrow-only view over the pk_btrees slice with three trivial methods; move them onto IndexStore (which owns the BTrees) as pk_newest_visible / pk_is_newest / pk_contains_visible + has_pk_index / pk_is_empty, and delete index/pk_lookup.rs. No wrapper type, no lifetime view. Also trims the verbose doc/comments added in the previous commit. cargo test -p lance --lib mem_wal: 385 passed; fmt + clippy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The enable_pk_index doc referenced the removed `Self::pk_index`, failing `cargo doc -D warnings`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jackye1995
left a comment
There was a problem hiding this comment.
This mostly looks good to me! A couple of design suggestions:
1. Composite-key BTree instead of per-column intersection
The PR creates one single-column BTree per PK column (pk_btrees: Vec<Arc<BTreeMemIndex>>) and intersects them by position in pk_newest_visible(). This works, but supporting multi-column BTree indexes has been a long-hanging task we need to complete anyway — it would be better to do that as part of this rather than building a per-column intersection layer on top of single-column BTrees.
With a composite-key BTreeMemIndex (lexicographically ordered tuple key), one seek answers the "newest visible" question for any PK arity, and we avoid the pk_btrees vector, enable_pk_index / __pk__* auto-creation, and the walk-leading-column algorithm.
That said, I think for the initial impl it's fine to limit to single-column PKs and add composite-key BTree support as a follow-up — just don't build the multi-column intersection machinery that will get thrown away.
2. Flush the PK BTree to disk and drop GenMembership
The GenMembership enum splits in-memory (BTree probe) vs on-disk (hash set) generations. But we already flush user-configured BTrees to disk via flush_with_indexes(). If we always flush the PK BTree too, then every non-base generation has a BTree index — in-memory for active/frozen, on-disk for flushed. The base table is always the shadowed side (never provides a membership), so we can uniformly probe the BTree for every block-list check and remove GenMembership entirely.
Benefits:
- No upfront O(n) scan —
scan_pk_hashescurrently reads every row of every flushed generation to build aHashSet<u64>at plan construction. With a flushed BTree, there's zero construction cost. Each probe is O(log n) instead of O(1), but given the size of individual memtables that's more than good enough. - Unified caching — the flushed on-disk indexes go through the existing session cache / 2-tier cache mechanism instead of a separate
FlushedMemTableCachePK-hash cache path. - Simpler code — no
GenMembershipenum, noneeds_valuesbranching, no conditionalScalarValueextraction inPkHashFilterExec.
…index Implements jackye1995's PR lance-format#7067 review: maintain the primary key as a true LSM BTree end to end instead of per-column intersection + materialized PK-hash sets. In-memory: replace `pk_btrees: Vec` and the position-intersection walk with a single `PkIndex` — `Single` (arity 1, reuses the column BTree, no second skiplist) or `Composite` (arity >=2, one order-preserving encoded-tuple skiplist; new `index/pk_key.rs`). `pk_newest_visible` is now one seek for any arity. Drop `BTreeMemIndex::visible_positions`/`contains_position`. On disk: flush writes a standalone PK BTree sidecar at `{gen}/_pk_index` via `train_btree_index` (not a manifest index, no synthetic data column). The block-list opens it by path and probes with `Equals`; the opened index and its pages ride the injected `FlushedMemTableCache`/`LanceCache`, so steady-state probes are memory-resident with no upfront PK-column scan. Drop `GenMembership::Set`, `scan_pk_hashes`, `pk_hashes_from_batch*`, and `needs_values`; membership is now `InMemory` (sync value probe) or `OnDisk` (async `Equals`). Rename `PkHashFilterExec` -> `PkBlockFilterExec` (async in-flight-future stream). Repurpose `FlushedMemTableCache.pk_hashes` to cache the opened `ScalarIndex`. Net deletion. mem_wal suite green; clippy --tests --benches -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mtable # Conflicts: # rust/lance/src/dataset/mem_wal/scanner/fts_search.rs
…ache Aligns with jackye1995's review point lance-format#2 ("unified caching"): the flushed generation's standalone PK dedup index now loads through the opened dataset's session index cache (`dataset.index_cache.for_index(path)`, mirroring `open_scalar_index`) instead of a bespoke 256MB page cache + opened-index map inside `FlushedMemTableCache`. The opened index and its pages are cached and keyed by the immutable flushed path, with no separate cache path. Drops `FlushedMemTableCache::{pk_indices, index_page_cache, get_or_open_pk_index, index_page_cache}` and `PK_INDEX_PAGE_CACHE_BYTES`; the cache now holds only the opened datasets (which carry the session cache). mem_wal suite green; clippy --tests -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…KeyIndex Per jackye1995's review lance-format#1 ("composite-key BTreeMemIndex"): instead of a separate parallel skiplist (`PkKeyIndex`), the composite PK index is a plain `BTreeMemIndex` keyed on a synthetic `Binary` column (`__pk_key__`) holding the order-preserving encoded tuple. `pk_key.rs` is now just the encoder (`encode_pk_tuple` + `encode_pk_batch`); the insert path materializes the encoded `Binary` column and feeds the existing index, and the probe seeks with `ScalarValue::Binary(encode_pk_tuple(values))`. Benefits: the composite case reuses `BTreeMemIndex`'s byte backend (incl. the inline-small-key node optimization) and its `to_training_batches`, so the in-memory probe, flush sidecar, and single-column path share one index type and one code path. Arity-split and its single-column memory/typed-fast-path wins are unchanged. Net deletion of the hand-rolled `PkKeyIndex` skiplist. mem_wal suite green; clippy --tests -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After the composite-as-BTreeMemIndex change, the in-memory and flushed PK indexes share one key space, so the block-list probe no longer needs both the raw `values` and the pre-built `on_disk_key`. `GenMembership::contains` now takes one `key`, and `IndexStore::pk_contains_visible(values)` is replaced by `pk_contains_key(key)`, which forwards the already-built key straight to the keyed BTree (no re-encoding for composite). Callers build one key via `on_disk_pk_key`. (The two-variant `GenMembership` enum stays: in-memory is a sync, MVCC-bounded skiplist seek; flushed is an async Lance `ScalarIndex` search — genuinely different index types. The hash-set probe path jackye's review flagged is already gone.) mem_wal suite green; clippy --tests -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Fixes a stale-read phantom shared by the vector and FTS index search arms over the MemWAL active memtable, and routes the in-memory newest-per-PK / membership decisions through a single maintained MVCC index.
The bug
The active memtable is an append log; a PK update is a later append with the same key. The in-memory secondary indexes — HNSW (vector) and the inverted index (FTS) — are append-only, so an updated row's old entries stay live. Both arms deduped with
WithinSourceDedupExec, which only suppresses a stale row when the fresh version is also in the result set. When an update moves a row out of the query's match set (vector: far from the query; FTS: new text no longer matches), the fresh version isn't returned, so the stale version leaks. (point_lookupwas immune — it already did the MVCC recency seek.)The fix
Maintain a per-memtable MVCC PK-position index: a lock-free arena skiplist keyed on
(compute_pk_hash(pk_columns), row_position), enabled on the active memtable and carried through freeze. The row position is the version stamp, so this reuses the exact primitive point-lookup trusts (get_newest_visible).NewestPkFilterExeckeeps an index hit iffget_newest_visible(pk_hash, max_visible) == row_position— predicate-independent, snapshot-exact (keys on the scanner's latchedmax_visible). Wired into the active vector arm (replacingWithinSourceDedupExec) and the FTS arm (addingwith_row_id).SortExec(_rowid DESC).fetch(1)instead ofWithinSourceDedupExec.GenMembership::Index, snapshot-bounded) with no per-query set; flushed/base keep cached sets.contains_pksprobes too.WithinSourceDedupExec/DedupDirectionand the per-query PK-hash set builders (pk_hashes(),in_memory_pk_hashes) are deleted. Net negative LOC.Hash keying covers single and composite PKs uniformly. The snapshot-bounded probe also closes a latent over-block where a not-yet-visible newer write could shadow an older visible copy.
Tests
Both
#[ignore]d repros un-ignored and passing; newPkPositionIndexunit tests, point-lookup-without-btree, index-sourced block-list, and snapshot-bounded vanished-row guard tests (within- and cross-source). Fullmem_walsuite green;cargo fmt+clippy -D warningsclean.Deferred follow-ups
MemTableDedupScanExec's reverse-walkHashSet(filtered-read scan path) onto the same probe — the last within-source mechanism off the index; benchmark-gated.🤖 Generated with Claude Code