Skip to content

fix: dedup active-memtable predicate-crossing stale reads (vector + FTS)#7067

Open
hamersaw wants to merge 13 commits into
lance-format:mainfrom
hamersaw:bug/dedup-active-memtable
Open

fix: dedup active-memtable predicate-crossing stale reads (vector + FTS)#7067
hamersaw wants to merge 13 commits into
lance-format:mainfrom
hamersaw:bug/dedup-active-memtable

Conversation

@hamersaw
Copy link
Copy Markdown
Contributor

@hamersaw hamersaw commented Jun 2, 2026

What

Fixes a stale-read phantom shared by the vector and FTS index search arms over the MemWAL active memtable, and routes the in-memory newest-per-PK / membership decisions through a single maintained MVCC index.

The bug

The active memtable is an append log; a PK update is a later append with the same key. The in-memory secondary indexes — HNSW (vector) and the inverted index (FTS) — are append-only, so an updated row's old entries stay live. Both arms deduped with WithinSourceDedupExec, which only suppresses a stale row when the fresh version is also in the result set. When an update moves a row out of the query's match set (vector: far from the query; FTS: new text no longer matches), the fresh version isn't returned, so the stale version leaks. (point_lookup was immune — it already did the MVCC recency seek.)

The fix

Maintain a per-memtable MVCC PK-position index: a lock-free arena skiplist keyed on (compute_pk_hash(pk_columns), row_position), enabled on the active memtable and carried through freeze. The row position is the version stamp, so this reuses the exact primitive point-lookup trusts (get_newest_visible).

  • NewestPkFilterExec keeps an index hit iff get_newest_visible(pk_hash, max_visible) == row_position — predicate-independent, snapshot-exact (keys on the scanner's latched max_visible). Wired into the active vector arm (replacing WithinSourceDedupExec) and the FTS arm (adding with_row_id).
  • point_lookup falls back to the index (hash + value-equality collision guard) when no scalar BTree exists; its plan-path active arm uses SortExec(_rowid DESC).fetch(1) instead of WithinSourceDedupExec.
  • Cross-source block-list probes the index per candidate (GenMembership::Index, snapshot-bounded) with no per-query set; flushed/base keep cached sets. contains_pks probes too.
  • Cleanup: WithinSourceDedupExec / DedupDirection and the per-query PK-hash set builders (pk_hashes(), in_memory_pk_hashes) are deleted. Net negative LOC.

Hash keying covers single and composite PKs uniformly. The snapshot-bounded probe also closes a latent over-block where a not-yet-visible newer write could shadow an older visible copy.

Tests

Both #[ignore]d repros un-ignored and passing; new PkPositionIndex unit tests, point-lookup-without-btree, index-sourced block-list, and snapshot-bounded vanished-row guard tests (within- and cross-source). Full mem_wal suite green; cargo fmt + clippy -D warnings clean.

Deferred follow-ups

  • Migrate MemTableDedupScanExec's reverse-walk HashSet (filtered-read scan path) onto the same probe — the last within-source mechanism off the index; benchmark-gated.
  • In-graph HNSW within-gen eviction (perf end-game; correctness is now exact).

🤖 Generated with Claude Code

… reads

The active memtable's vector (HNSW) and FTS (inverted) indexes are
append-only, and both index arms dedup results with WithinSourceDedupExec,
which only suppresses a stale row when the fresh version is also present in
the result set. When a PK is updated out of the query's match set (vector:
moved out of the neighborhood; FTS: new text no longer matches the query),
the fresh row isn't returned, so the stale version leaks.

Two #[ignore]d repros capture the gap:
- vector_search: a PK moved far from the query still surfaces at its stale
  near distance because the fresh node is evicted from the over-fetched top-k.
- fts_search: a PK updated "alpha lance" -> "beta lance" still matches an
  "alpha" query because the old posting stays live in the append-only index.

Both are #[ignore]d so the default suite stays green; un-ignore once the
active arms gain a predicate-independent recency filter over the whole
memtable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the chore label Jun 2, 2026
hamersaw and others added 4 commits June 2, 2026 20:17
… vector/FTS search

The active memtable's HNSW (vector) and inverted (FTS) indexes are
append-only, so an updated row's old index entries stay live. Both arms
deduped results with WithinSourceDedupExec, which only suppresses a stale
row when the fresh version is also in the result set. When an update moves
a row out of the query's match set (vector: far from the query; FTS: new
text no longer matches), the fresh version isn't returned, so the stale
version leaked.

Fix: maintain a per-memtable MVCC PK-position index — a lock-free arena
skiplist keyed on (compute_pk_hash(pk_columns), row_position), enabled on
the active memtable and carried through freeze. This reuses the exact
recency primitive point-lookup already trusts (get_newest_visible), so the
fix is "make the index arms do the seek point-lookup already does":

- NewestPkFilterExec keeps an index hit iff
  pk_position_index.get_newest_visible(pk_hash, max_visible) == row_position,
  predicate-independent and snapshot-exact (it keys on the same
  max_visible the scanner latched, read back from the scanner, not a
  fresh load, to avoid a concurrent-append race).
- Wired into the active vector arm (replacing WithinSourceDedupExec) and
  the active FTS arm (adding with_row_id + the filter).
- The cross-source block-list now sources in-memory membership from the
  same index (in_memory_pk_hashes -> PkPositionIndex::pk_hashes), falling
  back to the BatchStore scan only when no PK index is present. Keyset ==
  whole-store membership, so behavior is preserved; on-disk tiers
  unchanged.

The hash keying covers single and composite primary keys uniformly and
matches the hashing WithinSourceDedupExec/block_list already use. Both
previously-ignored repro tests now pass; adds pk_position unit tests and
an index-sourced block-list test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ition index; drop WithinSourceDedupExec

Unifies the in-memory newest-per-PK / membership decisions onto the
maintained MVCC PK-position index and removes the now-redundant
WithinSourceDedupExec.

- point_lookup: `probe_position` prefers the value-keyed PK BTree (collision-
  free) and falls back to the PK-position index — hashing the key and verifying
  the candidate's actual value (via `visible_positions`, newest-first) to guard
  hash collisions — when no scalar BTree exists, so composite / unindexed PKs
  get the plan-free fast probe too. The plan-path active arm's
  WithinSourceDedupExec is replaced by `SortExec(_rowid DESC).fetch(1)`
  (value-exact via the filter, newest by row id).

- block-list: `compute_source_block_lists` yields `Vec<GenMembership>` instead
  of `Vec<Arc<HashSet>>`. In-memory generations are probed per candidate
  (`GenMembership::Index` -> `PkPositionIndex::contains_visible`,
  snapshot-bounded) with no per-query set materialized; flushed/base keep their
  cached sets, and an in-memory memtable without a PK-position index falls back
  to the BatchStore scan. Snapshot-bounding also closes a latent over-block
  where a not-yet-visible newer write could shadow an older visible copy.
  `PkHashFilterExec` now consumes `Vec<GenMembership>`.

- cleanup: WithinSourceDedupExec and DedupDirection (incl. the dead
  KeepMinRowAddr) have no remaining users and are deleted. Added shared helpers
  `BatchStore::max_visible_row` and point_lookup's `resolve_position`.

Full mem_wal suite green; adds point-lookup-without-btree and index-sourced
block-list tests. fresh_tier_block_list (external contains_pks API) still
materializes sets; migrating MemTableDedupScanExec's reverse-walk HashSet onto
the same probe is a benchmark-gated follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e pk_hashes()

`fresh_tier_block_list` now returns `Vec<GenMembership>` instead of
`Vec<Arc<HashSet<u64>>>`, and `LsmScanner::contains_pks` tests each PK by
probing the membership (`GenMembership::contains`) rather than against a
materialized set. In-memory tiers are probed against their PK-position index
(snapshot-bounded); flushed/base keep their cached hash sets.

This retires the last per-query PK-hash set built from an in-memory memtable:
`PkPositionIndex::pk_hashes()` and `block_list::in_memory_pk_hashes` are
deleted. `pk_hashes_from_batch_store` remains only as the no-PK-index fallback.
`contains_pks`'s public signature (&RecordBatch -> Vec<bool>) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the two correctness cases the index probing introduces:

- NewestPkFilterExec (within-source): keeps only the newest visible position
  per PK, and — crucially — does NOT vanish a visible row when a newer version
  was committed after the query's snapshot (max_visible bounds the seek). Also
  covers the no-PK-index pass-through.
- block_list (cross-source): index-sourced membership is snapshot-bounded, so a
  newer generation's not-yet-visible write does not shadow an older
  generation's visible copy (while a visible newer write still does).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hamersaw hamersaw changed the title test: repros for active-memtable predicate-crossing stale reads fix: dedup active-memtable predicate-crossing stale reads (vector + FTS) Jun 3, 2026
@github-actions github-actions Bot added the bug Something isn't working label Jun 3, 2026
@hamersaw hamersaw marked this pull request as ready for review June 3, 2026 03:23
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@jackye1995
Copy link
Copy Markdown
Contributor

I originally removed the skiplist on primary key columns because it impacts perf, trading it off with stale data. But now that we've made it even faster than RocksDB, I agree with the general direction of this PR to add it back for more accuracy.

However, if we're going to maintain a per-PK skiplist anyway, I think we should just always maintain a BTree index (which is a skiplist in the memtable) on the primary key columns rather than introducing a separate PkPositionIndex. BTreeMemIndex already provides get_newest_visible — the exact same MVCC seek-and-stop primitive this PR reimplements with a hash key. Since in most cases the PK already has a BTree definition anyway, adding a second skiplist doubles the memory overhead for PK tracking.

If we always maintain a BTree on the primary key columns, NewestPkFilterExec and the block-list probes can use the existing BTree directly — no duplication. And if we flush it as a BTree index, this also makes the whole structure a true LSM tree — we essentially always maintain the sort order of the memtable, and even when it is flushed we still know the sort order.

hamersaw and others added 3 commits June 3, 2026 13:31
…ndex

Maintain a BTree index on each primary-key column (reusing a user's scalar
index when present, else auto-creating an in-memory `__pk__*` one) and compose
them for tuple-level MVCC lookups, replacing the separate hash-keyed
PkPositionIndex. Because row position is the physical row identity, a composite
key's newest visible row is the largest position present in every column's
BTree at or below the watermark — collision-free and uniform for single and
composite PKs.

- index/pk_lookup.rs: new PkLookup over the per-column PK BTrees
  (get_newest_visible / is_newest / contains_visible).
- BTreeMemIndex: add visible_positions + contains_position primitives.
- IndexStore: btree_indexes now Arc-valued; enable_pk_index reuses or
  auto-creates PK BTrees (maintained by the existing insert loop, in-memory
  only — flush still rebuilds from user index configs); pk_index() -> PkLookup.
- NewestPkFilterExec / point_lookup / block_list now probe by value;
  point_lookup's hash fallback + re-check deleted. Block-list flushed/base arm
  stays hash-keyed; needs_values() skips per-row value extraction otherwise.
- Delete index/pk_position.rs.

Addresses jackye1995's review on lance-format#7067. Sharp edge: composite-PK lookups
intersect per-column position lists, so put the most selective column first.

cargo test -p lance --lib mem_wal: 385 passed; fmt + clippy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PkLookup was a borrow-only view over the pk_btrees slice with three trivial
methods; move them onto IndexStore (which owns the BTrees) as
pk_newest_visible / pk_is_newest / pk_contains_visible + has_pk_index /
pk_is_empty, and delete index/pk_lookup.rs. No wrapper type, no lifetime view.
Also trims the verbose doc/comments added in the previous commit.

cargo test -p lance --lib mem_wal: 385 passed; fmt + clippy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The enable_pk_index doc referenced the removed `Self::pk_index`,
failing `cargo doc -D warnings`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly looks good to me! A couple of design suggestions:

1. Composite-key BTree instead of per-column intersection

The PR creates one single-column BTree per PK column (pk_btrees: Vec<Arc<BTreeMemIndex>>) and intersects them by position in pk_newest_visible(). This works, but supporting multi-column BTree indexes has been a long-hanging task we need to complete anyway — it would be better to do that as part of this rather than building a per-column intersection layer on top of single-column BTrees.

With a composite-key BTreeMemIndex (lexicographically ordered tuple key), one seek answers the "newest visible" question for any PK arity, and we avoid the pk_btrees vector, enable_pk_index / __pk__* auto-creation, and the walk-leading-column algorithm.

That said, I think for the initial impl it's fine to limit to single-column PKs and add composite-key BTree support as a follow-up — just don't build the multi-column intersection machinery that will get thrown away.

2. Flush the PK BTree to disk and drop GenMembership

The GenMembership enum splits in-memory (BTree probe) vs on-disk (hash set) generations. But we already flush user-configured BTrees to disk via flush_with_indexes(). If we always flush the PK BTree too, then every non-base generation has a BTree index — in-memory for active/frozen, on-disk for flushed. The base table is always the shadowed side (never provides a membership), so we can uniformly probe the BTree for every block-list check and remove GenMembership entirely.

Benefits:

  • No upfront O(n) scanscan_pk_hashes currently reads every row of every flushed generation to build a HashSet<u64> at plan construction. With a flushed BTree, there's zero construction cost. Each probe is O(log n) instead of O(1), but given the size of individual memtables that's more than good enough.
  • Unified caching — the flushed on-disk indexes go through the existing session cache / 2-tier cache mechanism instead of a separate FlushedMemTableCache PK-hash cache path.
  • Simpler code — no GenMembership enum, no needs_values branching, no conditional ScalarValue extraction in PkHashFilterExec.

hamersaw and others added 5 commits June 4, 2026 05:36
…index

Implements jackye1995's PR lance-format#7067 review: maintain the primary key as a true
LSM BTree end to end instead of per-column intersection + materialized PK-hash
sets.

In-memory: replace `pk_btrees: Vec` and the position-intersection walk with a
single `PkIndex` — `Single` (arity 1, reuses the column BTree, no second
skiplist) or `Composite` (arity >=2, one order-preserving encoded-tuple
skiplist; new `index/pk_key.rs`). `pk_newest_visible` is now one seek for any
arity. Drop `BTreeMemIndex::visible_positions`/`contains_position`.

On disk: flush writes a standalone PK BTree sidecar at `{gen}/_pk_index` via
`train_btree_index` (not a manifest index, no synthetic data column). The
block-list opens it by path and probes with `Equals`; the opened index and its
pages ride the injected `FlushedMemTableCache`/`LanceCache`, so steady-state
probes are memory-resident with no upfront PK-column scan.

Drop `GenMembership::Set`, `scan_pk_hashes`, `pk_hashes_from_batch*`, and
`needs_values`; membership is now `InMemory` (sync value probe) or `OnDisk`
(async `Equals`). Rename `PkHashFilterExec` -> `PkBlockFilterExec` (async
in-flight-future stream). Repurpose `FlushedMemTableCache.pk_hashes` to cache
the opened `ScalarIndex`.

Net deletion. mem_wal suite green; clippy --tests --benches -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mtable

# Conflicts:
#	rust/lance/src/dataset/mem_wal/scanner/fts_search.rs
…ache

Aligns with jackye1995's review point lance-format#2 ("unified caching"): the flushed
generation's standalone PK dedup index now loads through the opened dataset's
session index cache (`dataset.index_cache.for_index(path)`, mirroring
`open_scalar_index`) instead of a bespoke 256MB page cache + opened-index map
inside `FlushedMemTableCache`. The opened index and its pages are cached and
keyed by the immutable flushed path, with no separate cache path.

Drops `FlushedMemTableCache::{pk_indices, index_page_cache, get_or_open_pk_index,
index_page_cache}` and `PK_INDEX_PAGE_CACHE_BYTES`; the cache now holds only the
opened datasets (which carry the session cache).

mem_wal suite green; clippy --tests -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…KeyIndex

Per jackye1995's review lance-format#1 ("composite-key BTreeMemIndex"): instead of a
separate parallel skiplist (`PkKeyIndex`), the composite PK index is a plain
`BTreeMemIndex` keyed on a synthetic `Binary` column (`__pk_key__`) holding the
order-preserving encoded tuple. `pk_key.rs` is now just the encoder
(`encode_pk_tuple` + `encode_pk_batch`); the insert path materializes the encoded
`Binary` column and feeds the existing index, and the probe seeks with
`ScalarValue::Binary(encode_pk_tuple(values))`.

Benefits: the composite case reuses `BTreeMemIndex`'s byte backend (incl. the
inline-small-key node optimization) and its `to_training_batches`, so the
in-memory probe, flush sidecar, and single-column path share one index type and
one code path. Arity-split and its single-column memory/typed-fast-path wins are
unchanged. Net deletion of the hand-rolled `PkKeyIndex` skiplist.

mem_wal suite green; clippy --tests -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After the composite-as-BTreeMemIndex change, the in-memory and flushed PK
indexes share one key space, so the block-list probe no longer needs both the
raw `values` and the pre-built `on_disk_key`. `GenMembership::contains` now takes
one `key`, and `IndexStore::pk_contains_visible(values)` is replaced by
`pk_contains_key(key)`, which forwards the already-built key straight to the
keyed BTree (no re-encoding for composite). Callers build one key via
`on_disk_pk_key`.

(The two-variant `GenMembership` enum stays: in-memory is a sync, MVCC-bounded
skiplist seek; flushed is an async Lance `ScalarIndex` search — genuinely
different index types. The hash-set probe path jackye's review flagged is
already gone.)

mem_wal suite green; clippy --tests -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working chore

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants