fix: dedup active-memtable predicate-crossing stale reads (vector + FTS) by hamersaw · Pull Request #7067 · lance-format/lance

hamersaw · 2026-06-02T23:16:04Z

What

Fixes a stale-read phantom shared by the vector and FTS index search arms over the MemWAL active memtable, and routes the in-memory newest-per-PK / membership decisions through a single maintained MVCC index.

The bug

The active memtable is an append log; a PK update is a later append with the same key. The in-memory secondary indexes — HNSW (vector) and the inverted index (FTS) — are append-only, so an updated row's old entries stay live. Both arms deduped with WithinSourceDedupExec, which only suppresses a stale row when the fresh version is also in the result set. When an update moves a row out of the query's match set (vector: far from the query; FTS: new text no longer matches), the fresh version isn't returned, so the stale version leaks. (point_lookup was immune — it already did the MVCC recency seek.)

The fix

Maintain a per-memtable MVCC PK-position index: a lock-free arena skiplist keyed on (compute_pk_hash(pk_columns), row_position), enabled on the active memtable and carried through freeze. The row position is the version stamp, so this reuses the exact primitive point-lookup trusts (get_newest_visible).

NewestPkFilterExec keeps an index hit iff get_newest_visible(pk_hash, max_visible) == row_position — predicate-independent, snapshot-exact (keys on the scanner's latched max_visible). Wired into the active vector arm (replacing WithinSourceDedupExec) and the FTS arm (adding with_row_id).
point_lookup falls back to the index (hash + value-equality collision guard) when no scalar BTree exists; its plan-path active arm uses SortExec(_rowid DESC).fetch(1) instead of WithinSourceDedupExec.
Cross-source block-list probes the index per candidate (GenMembership::Index, snapshot-bounded) with no per-query set; flushed/base keep cached sets. contains_pks probes too.
Cleanup: WithinSourceDedupExec / DedupDirection and the per-query PK-hash set builders (pk_hashes(), in_memory_pk_hashes) are deleted. Net negative LOC.

Hash keying covers single and composite PKs uniformly. The snapshot-bounded probe also closes a latent over-block where a not-yet-visible newer write could shadow an older visible copy.

Tests

Both #[ignore]d repros un-ignored and passing; new PkPositionIndex unit tests, point-lookup-without-btree, index-sourced block-list, and snapshot-bounded vanished-row guard tests (within- and cross-source). Full mem_wal suite green; cargo fmt + clippy -D warnings clean.

Deferred follow-ups

Migrate MemTableDedupScanExec's reverse-walk HashSet (filtered-read scan path) onto the same probe — the last within-source mechanism off the index; benchmark-gated.
In-graph HNSW within-gen eviction (perf end-game; correctness is now exact).

🤖 Generated with Claude Code

… reads The active memtable's vector (HNSW) and FTS (inverted) indexes are append-only, and both index arms dedup results with WithinSourceDedupExec, which only suppresses a stale row when the fresh version is also present in the result set. When a PK is updated out of the query's match set (vector: moved out of the neighborhood; FTS: new text no longer matches the query), the fresh row isn't returned, so the stale version leaks. Two #[ignore]d repros capture the gap: - vector_search: a PK moved far from the query still surfaces at its stale near distance because the fresh node is evicted from the over-fetched top-k. - fts_search: a PK updated "alpha lance" -> "beta lance" still matches an "alpha" query because the old posting stays live in the append-only index. Both are #[ignore]d so the default suite stays green; un-ignore once the active arms gain a predicate-independent recency filter over the whole memtable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-02T23:51:15Z

Codecov Report

❌ Patch coverage is 87.02461% with 174 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...c/dataset/mem_wal/scanner/exec/newest_pk_filter.rs	75.78%	48 Missing and 6 partials ⚠️
...rc/dataset/mem_wal/scanner/exec/pk_block_filter.rs	81.34%	29 Missing and 7 partials ⚠️
rust/lance/src/dataset/mem_wal/index.rs	84.93%	14 Missing and 8 partials ⚠️
rust/lance/src/dataset/mem_wal/index/pk_key.rs	81.03%	16 Missing and 6 partials ⚠️
...st/lance/src/dataset/mem_wal/scanner/block_list.rs	92.23%	2 Missing and 14 partials ⚠️
.../lance/src/dataset/mem_wal/scanner/point_lookup.rs	92.04%	3 Missing and 4 partials ⚠️
rust/lance/src/dataset/mem_wal/memtable/flush.rs	95.55%	1 Missing and 3 partials ⚠️
...lance/src/dataset/mem_wal/scanner/vector_search.rs	95.95%	0 Missing and 4 partials ⚠️
rust/lance/src/dataset/mem_wal/scanner/builder.rs	86.95%	0 Missing and 3 partials ⚠️
.../lance/src/dataset/mem_wal/memtable/batch_store.rs	80.00%	1 Missing and 1 partial ⚠️
... and 3 more

📢 Thoughts on this report? Let us know!

… vector/FTS search The active memtable's HNSW (vector) and inverted (FTS) indexes are append-only, so an updated row's old index entries stay live. Both arms deduped results with WithinSourceDedupExec, which only suppresses a stale row when the fresh version is also in the result set. When an update moves a row out of the query's match set (vector: far from the query; FTS: new text no longer matches), the fresh version isn't returned, so the stale version leaked. Fix: maintain a per-memtable MVCC PK-position index — a lock-free arena skiplist keyed on (compute_pk_hash(pk_columns), row_position), enabled on the active memtable and carried through freeze. This reuses the exact recency primitive point-lookup already trusts (get_newest_visible), so the fix is "make the index arms do the seek point-lookup already does": - NewestPkFilterExec keeps an index hit iff pk_position_index.get_newest_visible(pk_hash, max_visible) == row_position, predicate-independent and snapshot-exact (it keys on the same max_visible the scanner latched, read back from the scanner, not a fresh load, to avoid a concurrent-append race). - Wired into the active vector arm (replacing WithinSourceDedupExec) and the active FTS arm (adding with_row_id + the filter). - The cross-source block-list now sources in-memory membership from the same index (in_memory_pk_hashes -> PkPositionIndex::pk_hashes), falling back to the BatchStore scan only when no PK index is present. Keyset == whole-store membership, so behavior is preserved; on-disk tiers unchanged. The hash keying covers single and composite primary keys uniformly and matches the hashing WithinSourceDedupExec/block_list already use. Both previously-ignored repro tests now pass; adds pk_position unit tests and an index-sourced block-list test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ition index; drop WithinSourceDedupExec Unifies the in-memory newest-per-PK / membership decisions onto the maintained MVCC PK-position index and removes the now-redundant WithinSourceDedupExec. - point_lookup: `probe_position` prefers the value-keyed PK BTree (collision- free) and falls back to the PK-position index — hashing the key and verifying the candidate's actual value (via `visible_positions`, newest-first) to guard hash collisions — when no scalar BTree exists, so composite / unindexed PKs get the plan-free fast probe too. The plan-path active arm's WithinSourceDedupExec is replaced by `SortExec(_rowid DESC).fetch(1)` (value-exact via the filter, newest by row id). - block-list: `compute_source_block_lists` yields `Vec<GenMembership>` instead of `Vec<Arc<HashSet>>`. In-memory generations are probed per candidate (`GenMembership::Index` -> `PkPositionIndex::contains_visible`, snapshot-bounded) with no per-query set materialized; flushed/base keep their cached sets, and an in-memory memtable without a PK-position index falls back to the BatchStore scan. Snapshot-bounding also closes a latent over-block where a not-yet-visible newer write could shadow an older visible copy. `PkHashFilterExec` now consumes `Vec<GenMembership>`. - cleanup: WithinSourceDedupExec and DedupDirection (incl. the dead KeepMinRowAddr) have no remaining users and are deleted. Added shared helpers `BatchStore::max_visible_row` and point_lookup's `resolve_position`. Full mem_wal suite green; adds point-lookup-without-btree and index-sourced block-list tests. fresh_tier_block_list (external contains_pks API) still materializes sets; migrating MemTableDedupScanExec's reverse-walk HashSet onto the same probe is a benchmark-gated follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e pk_hashes() `fresh_tier_block_list` now returns `Vec<GenMembership>` instead of `Vec<Arc<HashSet<u64>>>`, and `LsmScanner::contains_pks` tests each PK by probing the membership (`GenMembership::contains`) rather than against a materialized set. In-memory tiers are probed against their PK-position index (snapshot-bounded); flushed/base keep their cached hash sets. This retires the last per-query PK-hash set built from an in-memory memtable: `PkPositionIndex::pk_hashes()` and `block_list::in_memory_pk_hashes` are deleted. `pk_hashes_from_batch_store` remains only as the no-PK-index fallback. `contains_pks`'s public signature (&RecordBatch -> Vec<bool>) is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds the two correctness cases the index probing introduces: - NewestPkFilterExec (within-source): keeps only the newest visible position per PK, and — crucially — does NOT vanish a visible row when a newer version was committed after the query's snapshot (max_visible bounds the seek). Also covers the no-PK-index pass-through. - block_list (cross-source): index-sourced membership is snapshot-bounded, so a newer generation's not-yet-visible write does not shadow an older generation's visible copy (while a visible newer write still does). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

jackye1995 · 2026-06-03T05:09:21Z

I originally removed the skiplist on primary key columns because it impacts perf, trading it off with stale data. But now that we've made it even faster than RocksDB, I agree with the general direction of this PR to add it back for more accuracy.

However, if we're going to maintain a per-PK skiplist anyway, I think we should just always maintain a BTree index (which is a skiplist in the memtable) on the primary key columns rather than introducing a separate PkPositionIndex. BTreeMemIndex already provides get_newest_visible — the exact same MVCC seek-and-stop primitive this PR reimplements with a hash key. Since in most cases the PK already has a BTree definition anyway, adding a second skiplist doubles the memory overhead for PK tracking.

If we always maintain a BTree on the primary key columns, NewestPkFilterExec and the block-list probes can use the existing BTree directly — no duplication. And if we flush it as a BTree index, this also makes the whole structure a true LSM tree — we essentially always maintain the sort order of the memtable, and even when it is flushed we still know the sort order.

…ndex Maintain a BTree index on each primary-key column (reusing a user's scalar index when present, else auto-creating an in-memory `__pk__*` one) and compose them for tuple-level MVCC lookups, replacing the separate hash-keyed PkPositionIndex. Because row position is the physical row identity, a composite key's newest visible row is the largest position present in every column's BTree at or below the watermark — collision-free and uniform for single and composite PKs. - index/pk_lookup.rs: new PkLookup over the per-column PK BTrees (get_newest_visible / is_newest / contains_visible). - BTreeMemIndex: add visible_positions + contains_position primitives. - IndexStore: btree_indexes now Arc-valued; enable_pk_index reuses or auto-creates PK BTrees (maintained by the existing insert loop, in-memory only — flush still rebuilds from user index configs); pk_index() -> PkLookup. - NewestPkFilterExec / point_lookup / block_list now probe by value; point_lookup's hash fallback + re-check deleted. Block-list flushed/base arm stays hash-keyed; needs_values() skips per-row value extraction otherwise. - Delete index/pk_position.rs. Addresses jackye1995's review on lance-format#7067. Sharp edge: composite-PK lookups intersect per-column position lists, so put the most selective column first. cargo test -p lance --lib mem_wal: 385 passed; fmt + clippy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

PkLookup was a borrow-only view over the pk_btrees slice with three trivial methods; move them onto IndexStore (which owns the BTrees) as pk_newest_visible / pk_is_newest / pk_contains_visible + has_pk_index / pk_is_empty, and delete index/pk_lookup.rs. No wrapper type, no lifetime view. Also trims the verbose doc/comments added in the previous commit. cargo test -p lance --lib mem_wal: 385 passed; fmt + clippy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The enable_pk_index doc referenced the removed `Self::pk_index`, failing `cargo doc -D warnings`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jackye1995

This mostly looks good to me! A couple of design suggestions:

1. Composite-key BTree instead of per-column intersection

The PR creates one single-column BTree per PK column (pk_btrees: Vec<Arc<BTreeMemIndex>>) and intersects them by position in pk_newest_visible(). This works, but supporting multi-column BTree indexes has been a long-hanging task we need to complete anyway — it would be better to do that as part of this rather than building a per-column intersection layer on top of single-column BTrees.

With a composite-key BTreeMemIndex (lexicographically ordered tuple key), one seek answers the "newest visible" question for any PK arity, and we avoid the pk_btrees vector, enable_pk_index / __pk__* auto-creation, and the walk-leading-column algorithm.

That said, I think for the initial impl it's fine to limit to single-column PKs and add composite-key BTree support as a follow-up — just don't build the multi-column intersection machinery that will get thrown away.

2. Flush the PK BTree to disk and drop `GenMembership`

The GenMembership enum splits in-memory (BTree probe) vs on-disk (hash set) generations. But we already flush user-configured BTrees to disk via flush_with_indexes(). If we always flush the PK BTree too, then every non-base generation has a BTree index — in-memory for active/frozen, on-disk for flushed. The base table is always the shadowed side (never provides a membership), so we can uniformly probe the BTree for every block-list check and remove GenMembership entirely.

Benefits:

No upfront O(n) scan — scan_pk_hashes currently reads every row of every flushed generation to build a HashSet<u64> at plan construction. With a flushed BTree, there's zero construction cost. Each probe is O(log n) instead of O(1), but given the size of individual memtables that's more than good enough.
Unified caching — the flushed on-disk indexes go through the existing session cache / 2-tier cache mechanism instead of a separate FlushedMemTableCache PK-hash cache path.
Simpler code — no GenMembership enum, no needs_values branching, no conditional ScalarValue extraction in PkHashFilterExec.

…index Implements jackye1995's PR lance-format#7067 review: maintain the primary key as a true LSM BTree end to end instead of per-column intersection + materialized PK-hash sets. In-memory: replace `pk_btrees: Vec` and the position-intersection walk with a single `PkIndex` — `Single` (arity 1, reuses the column BTree, no second skiplist) or `Composite` (arity >=2, one order-preserving encoded-tuple skiplist; new `index/pk_key.rs`). `pk_newest_visible` is now one seek for any arity. Drop `BTreeMemIndex::visible_positions`/`contains_position`. On disk: flush writes a standalone PK BTree sidecar at `{gen}/_pk_index` via `train_btree_index` (not a manifest index, no synthetic data column). The block-list opens it by path and probes with `Equals`; the opened index and its pages ride the injected `FlushedMemTableCache`/`LanceCache`, so steady-state probes are memory-resident with no upfront PK-column scan. Drop `GenMembership::Set`, `scan_pk_hashes`, `pk_hashes_from_batch*`, and `needs_values`; membership is now `InMemory` (sync value probe) or `OnDisk` (async `Equals`). Rename `PkHashFilterExec` -> `PkBlockFilterExec` (async in-flight-future stream). Repurpose `FlushedMemTableCache.pk_hashes` to cache the opened `ScalarIndex`. Net deletion. mem_wal suite green; clippy --tests --benches -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…mtable # Conflicts: # rust/lance/src/dataset/mem_wal/scanner/fts_search.rs

…ache Aligns with jackye1995's review point lance-format#2 ("unified caching"): the flushed generation's standalone PK dedup index now loads through the opened dataset's session index cache (`dataset.index_cache.for_index(path)`, mirroring `open_scalar_index`) instead of a bespoke 256MB page cache + opened-index map inside `FlushedMemTableCache`. The opened index and its pages are cached and keyed by the immutable flushed path, with no separate cache path. Drops `FlushedMemTableCache::{pk_indices, index_page_cache, get_or_open_pk_index, index_page_cache}` and `PK_INDEX_PAGE_CACHE_BYTES`; the cache now holds only the opened datasets (which carry the session cache). mem_wal suite green; clippy --tests -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…KeyIndex Per jackye1995's review lance-format#1 ("composite-key BTreeMemIndex"): instead of a separate parallel skiplist (`PkKeyIndex`), the composite PK index is a plain `BTreeMemIndex` keyed on a synthetic `Binary` column (`__pk_key__`) holding the order-preserving encoded tuple. `pk_key.rs` is now just the encoder (`encode_pk_tuple` + `encode_pk_batch`); the insert path materializes the encoded `Binary` column and feeds the existing index, and the probe seeks with `ScalarValue::Binary(encode_pk_tuple(values))`. Benefits: the composite case reuses `BTreeMemIndex`'s byte backend (incl. the inline-small-key node optimization) and its `to_training_batches`, so the in-memory probe, flush sidecar, and single-column path share one index type and one code path. Arity-split and its single-column memory/typed-fast-path wins are unchanged. Net deletion of the hand-rolled `PkKeyIndex` skiplist. mem_wal suite green; clippy --tests -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

After the composite-as-BTreeMemIndex change, the in-memory and flushed PK indexes share one key space, so the block-list probe no longer needs both the raw `values` and the pre-built `on_disk_key`. `GenMembership::contains` now takes one `key`, and `IndexStore::pk_contains_visible(values)` is replaced by `pk_contains_key(key)`, which forwards the already-built key straight to the keyed BTree (no re-encoding for composite). Callers build one key via `on_disk_pk_key`. (The two-variant `GenMembership` enum stays: in-memory is a sync, MVCC-bounded skiplist seek; flushed is an async Lance `ScalarIndex` search — genuinely different index types. The hash-set probe path jackye's review flagged is already gone.) mem_wal suite green; clippy --tests -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot added the chore label Jun 2, 2026

hamersaw and others added 4 commits June 2, 2026 20:17

hamersaw changed the title ~~test: repros for active-memtable predicate-crossing stale reads~~ fix: dedup active-memtable predicate-crossing stale reads (vector + FTS) Jun 3, 2026

github-actions Bot added the bug Something isn't working label Jun 3, 2026

hamersaw marked this pull request as ready for review June 3, 2026 03:23

claude Bot reviewed Jun 3, 2026

View reviewed changes

hamersaw and others added 3 commits June 3, 2026 13:31

docs(mem_wal): fix broken intra-doc link to pk_newest_visible

bc11566

The enable_pk_index doc referenced the removed `Self::pk_index`, failing `cargo doc -D warnings`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jackye1995 reviewed Jun 4, 2026

View reviewed changes

hamersaw and others added 5 commits June 4, 2026 05:36

Merge remote-tracking branch 'upstream/main' into bug/dedup-active-me…

19257fb

…mtable # Conflicts: # rust/lance/src/dataset/mem_wal/scanner/fts_search.rs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: dedup active-memtable predicate-crossing stale reads (vector + FTS)#7067

fix: dedup active-memtable predicate-crossing stale reads (vector + FTS)#7067
hamersaw wants to merge 13 commits into
lance-format:mainfrom
hamersaw:bug/dedup-active-memtable

hamersaw commented Jun 2, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

claude Bot left a comment

Uh oh!

jackye1995 commented Jun 3, 2026

Uh oh!

jackye1995 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hamersaw commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

The bug

The fix

Tests

Deferred follow-ups

Uh oh!

codecov Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

jackye1995 commented Jun 3, 2026

Uh oh!

jackye1995 left a comment

Choose a reason for hiding this comment

1. Composite-key BTree instead of per-column intersection

2. Flush the PK BTree to disk and drop GenMembership

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hamersaw commented Jun 2, 2026 •

edited

Loading

codecov Bot commented Jun 2, 2026 •

edited

Loading

2. Flush the PK BTree to disk and drop `GenMembership`