perf(scanner): answer unfiltered count_rows from fragment metadata#7076

Open
LuciferYang wants to merge 1 commit into
lance-format:mainfrom
LuciferYang:perf/6970-count-rows-metadata-only

@LuciferYang
Contributor

Summary

Scanner::count_rows always built and executed a count plan, even when the count could be satisfied from fragment metadata alone. For a plain count with no row-level filter or search, this scanned row data unnecessarily, which is especially wasteful when the scanner is restricted to a subset of fragments via with_fragments (#6970).

This adds a metadata-only fast path to Scanner::count_rows: when nothing in the scan needs to inspect row data, it sums each fragment's live row count (physical rows − deletions, both tracked in fragment metadata) instead of building and executing a plan. Dataset::count_rows(None) already had such a fast path via count_all_rows; this brings the same benefit to the Scanner path (and, crucially, to fragment-restricted counts).
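
As a rough illustration of the arithmetic only (not the code in this PR), the fast path amounts to summing live rows per fragment. `FragmentMeta` and `count_rows_from_metadata` below are hypothetical names invented for this sketch; the real Lance fragment type and its accessors differ.

```rust
/// Hypothetical stand-in for per-fragment metadata; not Lance's real type.
#[derive(Debug, Clone, Copy)]
pub struct FragmentMeta {
    /// Rows physically written to the fragment's data files.
    pub physical_rows: u64,
    /// Rows marked deleted for this fragment.
    pub deleted_rows: u64,
}

impl FragmentMeta {
    /// Live rows = physical rows minus deletions (saturating, to stay total).
    pub fn live_rows(&self) -> u64 {
        self.physical_rows.saturating_sub(self.deleted_rows)
    }
}

/// Sums live rows over whichever fragments the scan is restricted to.
/// An empty fragment list yields 0 without touching any data.
pub fn count_rows_from_metadata(fragments: &[FragmentMeta]) -> u64 {
    fragments.iter().map(FragmentMeta::live_rows).sum()
}
```

Passing only the fragments selected via with_fragments gives the fragment-restricted count with the same function.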

The fast path falls back to the existing count plan — preserving its results and its errors — whenever any of these is set:

  • a row-level filter, vector / full-text search, index_segments, fast_search, or include_deleted_rows;
  • order_by or limit / offset (the plan rejects these when combined with the count aggregate);
  • a dynamic-only projection such as SELECT 1 (also rejected by the plan).
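
The fallback conditions above can be read as a single eligibility predicate. The sketch below is an assumption about the shape of that check, not the PR's implementation: `ScanOptions` and every field on it are hypothetical stand-ins for the corresponding Scanner state.

```rust
/// Hypothetical summary of scanner state; field names are illustrative only.
#[derive(Debug, Default, Clone)]
pub struct ScanOptions {
    pub has_filter: bool,
    pub has_vector_or_fts_search: bool,
    pub has_index_segments: bool,
    pub fast_search: bool,
    pub include_deleted_rows: bool,
    pub has_order_by: bool,
    pub limit: Option<u64>,
    pub offset: Option<u64>,
    /// True for projections with no physical columns, such as SELECT 1.
    pub dynamic_only_projection: bool,
}

/// Returns true only when nothing in the scan needs row data and nothing
/// would make the count plan return an error.
pub fn can_count_from_metadata(o: &ScanOptions) -> bool {
    // Options that require inspecting rows or indices.
    let inspects_rows = o.has_filter
        || o.has_vector_or_fts_search
        || o.has_index_segments
        || o.fast_search
        || o.include_deleted_rows;
    // Options the count plan rejects; falling back keeps those errors intact.
    let plan_rejects = o.has_order_by
        || o.limit.is_some()
        || o.offset.is_some()
        || o.dynamic_only_projection;
    !(inspects_rows || plan_rejects)
}
```

Routing the rejected combinations through the existing plan, rather than answering them from metadata, is what keeps the error behavior unchanged.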

The shared per-fragment summing is factored into Dataset::count_fragment_rows, which count_all_rows now also uses. Its fan-out uses the module-standard io_parallelism() bound, matching the sibling count_deleted_rows. In the common (new-format) case the count is answered entirely from cached metadata with zero I/O; only legacy/uncached fragments fall back to per-fragment metadata reads.
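
A minimal sketch of that split between cached and uncached fragments, with a bounded fan-out, might look like the following. It makes several assumptions: std scoped threads stand in for whatever concurrency mechanism the crate actually uses, the `parallelism` argument stands in for io_parallelism(), and `Fragment`, `count_fragment_rows_sketch`, and the `read_metadata` callback are hypothetical names.

```rust
use std::thread;

/// Hypothetical fragment: `cached_live_rows` is Some when the count is already
/// available from metadata, None for legacy/uncached fragments.
pub struct Fragment {
    pub id: u64,
    pub cached_live_rows: Option<u64>,
}

/// Sums live rows, calling `read_metadata` only for uncached fragments and
/// never running more than `parallelism` of those reads at once.
pub fn count_fragment_rows_sketch<F>(
    fragments: &[Fragment],
    parallelism: usize,
    read_metadata: F,
) -> u64
where
    F: Fn(&Fragment) -> u64 + Sync,
{
    let parallelism = parallelism.max(1);
    let read_metadata = &read_metadata;

    // Common case: counts come straight from cached metadata, no I/O.
    let mut total: u64 = fragments.iter().filter_map(|f| f.cached_live_rows).sum();

    // Remaining fragments need a per-fragment metadata read.
    let uncached: Vec<&Fragment> = fragments
        .iter()
        .filter(|f| f.cached_live_rows.is_none())
        .collect();

    // Process uncached fragments in batches so at most `parallelism`
    // reads are in flight at any time.
    for batch in uncached.chunks(parallelism) {
        total += thread::scope(|s| {
            let handles: Vec<_> = batch
                .iter()
                .map(|&frag| s.spawn(move || read_metadata(frag)))
                .collect();
            handles
                .into_iter()
                .map(|h| h.join().expect("metadata read panicked"))
                .sum::<u64>()
        });
    }
    total
}
```

Batching is the simplest way to express the bound with the standard library alone; a stream-based implementation would keep the pipeline full instead of waiting for each batch to finish.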

Closes #6970.

Test plan

  • New test_count_rows_metadata_only covering: whole dataset, fragment subset, empty fragment list, deletions (whole-dataset and subset), include_deleted_rows fallback, limit / offset / order_by error preservation, filtered-count fallback, and all-rows-deleted. It asserts zero read/write IOPS on the metadata fast path and > 0 reads on the plan fallback.
  • cargo test -p lance --lib — count, scanner (142), and dataset_io (52) suites pass.
  • cargo fmt --all --check clean.
  • cargo clippy -p lance --all-targets -- -D warnings clean.

`Scanner::count_rows` always built and executed a count plan, even when
the count could be satisfied from fragment metadata alone. For a plain
count with no row-level filter or search this scanned row data
unnecessarily, which is especially wasteful when the scanner is
restricted to a subset of fragments via `with_fragments`.

Add a fast path that sums each fragment's live row count (physical rows
minus deletions, both tracked in metadata) when nothing in the scan
needs to inspect row data. The path falls back to the existing plan —
preserving its results and errors — whenever a filter, vector/full-text
search, index_segments, fast_search, include_deleted_rows, ordering,
limit/offset, or a dynamic-only projection (e.g. `SELECT 1`) is set.

The shared per-fragment summing is factored into
`Dataset::count_fragment_rows`, which `count_all_rows` now also uses;
its fan-out matches the module's standard `io_parallelism()` bound.

Closes lance-format#6970.

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@codecov

codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 97.80220% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/dataset/scanner.rs | 97.61% | 0 Missing and 2 partials ⚠️ |

Development

Successfully merging this pull request may close these issues.

Optimize count query plans with just fragment filter to be metadata-only
