perf(scanner): answer unfiltered count_rows from fragment metadata #7076
Open
LuciferYang wants to merge 1 commit into
`Scanner::count_rows` always built and executed a count plan, even when the count could be satisfied from fragment metadata alone. For a plain count with no row-level filter or search this scanned row data unnecessarily, which is especially wasteful when the scanner is restricted to a subset of fragments via `with_fragments`. Add a fast path that sums each fragment's live row count (physical rows minus deletions, both tracked in metadata) when nothing in the scan needs to inspect row data. The path falls back to the existing plan — preserving its results and errors — whenever a filter, vector/full-text search, index_segments, fast_search, include_deleted_rows, ordering, limit/offset, or a dynamic-only projection (e.g. `SELECT 1`) is set. The shared per-fragment summing is factored into `Dataset::count_fragment_rows`, which `count_all_rows` now also uses; its fan-out matches the module's standard `io_parallelism()` bound. Closes lance-format#6970.
Summary
`Scanner::count_rows` always built and executed a count plan, even when the count could be satisfied from fragment metadata alone. For a plain count with no row-level filter or search, this scanned row data unnecessarily, which is especially wasteful when the scanner is restricted to a subset of fragments via `with_fragments` (#6970).

This adds a metadata-only fast path to `Scanner::count_rows`: when nothing in the scan needs to inspect row data, it sums each fragment's live row count (physical rows − deletions, both tracked in fragment metadata) instead of building and executing a plan. `Dataset::count_rows(None)` already had such a fast path via `count_all_rows`; this brings the same benefit to the `Scanner` path (and, crucially, to fragment-restricted counts).

The fast path falls back to the existing count plan, preserving its results and its errors, whenever any of these is set:

- a filter, a vector or full-text search, `index_segments`, `fast_search`, or `include_deleted_rows`;
- `order_by` or `limit`/`offset` (the plan rejects these when combined with the count aggregate);
- a dynamic-only projection such as `SELECT 1` (also rejected by the plan).

The shared per-fragment summing is factored into `Dataset::count_fragment_rows`, which `count_all_rows` now also uses. Its fan-out uses the module-standard `io_parallelism()` bound, matching the sibling `count_deleted_rows`. In the common (new-format) case the count is answered entirely from cached metadata with zero I/O; only legacy/uncached fragments fall back to per-fragment metadata reads.

Closes #6970.
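For reviewers who want the decision logic at a glance, the following is a minimal, standalone sketch of the idea, not the code in this PR. `FragmentMeta`, `ScanSettings`, `can_count_from_metadata`, `count_live_rows`, and `count_rows` are hypothetical names invented for illustration; the real change works on Lance's own `Scanner` and fragment types and bounds its fan-out with `io_parallelism()`, which this synchronous sketch omits.

```rust
/// Hypothetical stand-in for per-fragment metadata (not Lance's real type).
#[derive(Debug, Clone, Copy)]
pub struct FragmentMeta {
    /// Rows physically stored in the fragment.
    pub physical_rows: u64,
    /// Rows recorded as deleted in the fragment's metadata.
    pub deleted_rows: u64,
}

/// Hypothetical summary of the scan settings that force the plan fallback.
#[derive(Debug, Clone, Default)]
pub struct ScanSettings {
    pub has_filter: bool,
    pub has_search: bool, // vector or full-text search
    pub has_index_segments: bool,
    pub fast_search: bool,
    pub include_deleted_rows: bool,
    pub has_ordering: bool,
    pub has_limit_or_offset: bool,
    pub dynamic_only_projection: bool, // e.g. `SELECT 1`
}

impl ScanSettings {
    /// True only when nothing in the scan needs to inspect row data.
    pub fn can_count_from_metadata(&self) -> bool {
        !(self.has_filter
            || self.has_search
            || self.has_index_segments
            || self.fast_search
            || self.include_deleted_rows
            || self.has_ordering
            || self.has_limit_or_offset
            || self.dynamic_only_projection)
    }
}

/// Sum of live rows: physical rows minus deletions, per fragment.
/// `saturating_sub` is an assumption of this sketch, guarding against
/// inconsistent metadata; the PR may handle that case differently.
pub fn count_live_rows(fragments: &[FragmentMeta]) -> u64 {
    fragments
        .iter()
        .map(|f| f.physical_rows.saturating_sub(f.deleted_rows))
        .sum()
}

/// Take the metadata fast path when eligible; otherwise defer to the
/// existing plan so its results and its errors are preserved unchanged.
pub fn count_rows(
    settings: &ScanSettings,
    fragments: &[FragmentMeta],
    run_count_plan: impl FnOnce() -> Result<u64, String>,
) -> Result<u64, String> {
    if settings.can_count_from_metadata() {
        Ok(count_live_rows(fragments))
    } else {
        run_count_plan()
    }
}
```

Passing the plan as a closure keeps the fallback lazy: on the fast path the plan is never built, and on the fallback path any error it raises (for example, for `limit` combined with a count) is returned as-is.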
Test plan
- `test_count_rows_metadata_only` covering: whole dataset, fragment subset, empty fragment list, deletions (whole-dataset and subset), `include_deleted_rows` fallback, `limit`/`offset`/`order_by` error preservation, filtered-count fallback, and all-rows-deleted. It asserts zero read/write IOPS on the metadata fast path and `> 0` reads on the plan fallback.
- `cargo test -p lance --lib`: count, scanner (142), and dataset_io (52) suites pass.
- `cargo fmt --all --check` clean.
- `cargo clippy -p lance --all-targets -- -D warnings` clean.