feat: add table-size based cache pass-through #1
Conversation
```rust
let eager = self.eager_shredding;
let rewritten = plan
    .transform_up(|node| try_optimize_parquet_source(node, cache, eager, max_scan_bytes))
    .unwrap();
```
nit: `optimize` returns `Result`, so this should propagate the error instead of panicking. (not blocking)
Suggested change:
```diff
-    .unwrap();
+    ?;
```
```rust
    .flat_map(|g| g.files())
    .map(|f| f.object_meta.size)
    .sum();
if total > max_bytes {
```
nit: the size-based bypass has no test coverage. A small test that registers a parquet file, sets `max_scan_bytes` to 0 (or 1), and verifies the resulting plan still uses `ParquetSource` (not `LiquidParquetSource`) would make this easy to verify and protect against regression; a sketch follows. (not blocking)
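A minimal sketch of such a test, under assumptions: `LiquidCacheLocalBuilder::build`'s exact signature, the fixture path, and the plan-display strings are guesses, not verified API; only `with_max_scan_bytes` comes from this PR.

```rust
use datafusion::physical_plan::displayable;
use datafusion::prelude::*;

#[tokio::test]
async fn large_scan_bypasses_liquid_cache() -> datafusion::error::Result<()> {
    // A 1-byte threshold guarantees any real parquet file exceeds it.
    // Builder construction is assumed from the PR description.
    let (ctx, _cache) = LiquidCacheLocalBuilder::new()
        .with_max_scan_bytes(1)
        .build(SessionConfig::new())?;
    ctx.register_parquet("t", "tests/data/small.parquet", ParquetReadOptions::default())
        .await?;

    let plan = ctx.sql("SELECT * FROM t").await?.create_physical_plan().await?;
    let display = displayable(plan.as_ref()).indent(true).to_string();

    // Above the threshold, the scan must stay on the vanilla parquet reader.
    assert!(!display.contains("LiquidParquetSource"));
    Ok(())
}
```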
```yaml
  RUSTDOCFLAGS: -D warnings

      - name: Run clippy
        run: cargo clippy --all-targets --all-features -- -D warnings
```
Blocking: the CI no longer runs `cargo test`. After this diff the pipeline only checks formatting and clippy; no tests execute. This means regressions in existing behaviour, including the new size-bypass logic, are invisible to CI.
Add a test step (coverage reporting can be dropped if `CODECOV_TOKEN` is unavailable in this fork):
```diff
       - name: Run clippy
         run: cargo clippy --all-targets --all-features -- -D warnings
+      - name: Run tests
+        run: cargo test --all-features
```
Review
Blocking Issues
`.github/workflows/ci.yml` line 30: The CI pipeline no longer runs `cargo test`. After this PR the only checks are formatting and clippy; the test suite never executes. This means correctness regressions (including in the new size-bypass path) go undetected. A `cargo test --all-features` step must be added.
Action Required
Add `cargo test --all-features` (or restore a dedicated test job) to the check job before merge. Coverage upload can be dropped if `CODECOV_TOKEN` is unavailable in this fork, but the test run itself must remain.
```yaml
  uses: model-checking/kani-github-action@v1.1
  with:
    working-directory: src/core

  run: cargo clippy --workspace --exclude dev-tools --all-targets --all-features -- -D warnings
```
Blocking (unresolved from previous review): the pipeline still runs no tests. Format and clippy check syntax and style, not behaviour; any regression in the new size-bypass logic (or anything else) is invisible to CI.
Add a test step:
```diff
-        run: cargo clippy --workspace --exclude dev-tools --all-targets --all-features -- -D warnings
+        run: cargo clippy --all-targets --all-features -- -D warnings
+      - name: Run tests
+        run: cargo test --all-features
```
```rust
        .transform_up(|node| try_optimize_parquet_source(node, cache, eager, max_scan_bytes))
        .unwrap();
    Ok(rewritten.data)
}
```
nit: `optimize` returns `Result`, so a panic here is a regression over the old code (which delegated to `rewrite_data_source_plan` and surfaced errors via `Ok(...)`). `transform_up` returns `Result`; propagate it instead of unwrapping. (not blocking)
Suggested change:
```diff
-    .unwrap();
+    ?;
```
Review
Blocking Issues
`.github/workflows/ci.yml` line 30: No tests run in CI. The pipeline only checks formatting and clippy; `cargo test` was not added back after the previous review flagged this. Any regression in the new size-bypass logic is invisible.
Action Required
Add `cargo test --all-features` as a step in the check job (or a dedicated job) before this can merge.
Summary
Scans whose total parquet file size exceeds a configurable threshold bypass LiquidCache and use DataFusion's vanilla parquet reader instead. This prevents large cold scans from polluting the RAM cache and growing the on-disk spill file without bound.
Problem
In mixed workloads with small hot tables and large cold tables, caching everything causes the disk tier to expand significantly (decoded Arrow batches are ~5x larger than compressed parquet). This creates memory pressure and degrades cache performance for the smaller datasets that benefit most from caching.
Solution
Add `with_max_scan_bytes(u64)` to `LocalModeOptimizer` and `LiquidCacheLocalBuilder`. When set, the optimizer computes the total file size of a parquet scan and skips LiquidCache wrapping if it exceeds the threshold. The scan falls through to DataFusion's vanilla parquet reader.
Default behavior is unchanged: no threshold means cache everything.
Usage
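A minimal usage sketch, assuming the builder API described above; the crate path and `build` signature are assumptions, and only `with_max_scan_bytes` comes from this PR.

```rust
use datafusion::prelude::SessionConfig;
use liquid_cache_local::LiquidCacheLocalBuilder; // assumed module path

// Scans whose total parquet size exceeds 1 GiB skip LiquidCache and fall
// through to DataFusion's vanilla parquet reader; smaller scans are cached.
let (ctx, _cache) = LiquidCacheLocalBuilder::new()
    .with_max_scan_bytes(1024 * 1024 * 1024)
    .build(SessionConfig::new())?;
```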
Ref: XiangpengHao#485