Skip to content

feat: add table-size based cache pass-through#1

Open
shefeek-jinnah wants to merge 6 commits intomainfrom
dev
Open

feat: add table-size based cache pass-through#1
shefeek-jinnah wants to merge 6 commits intomainfrom
dev

Conversation

@shefeek-jinnah
Copy link
Copy Markdown
Collaborator

@shefeek-jinnah shefeek-jinnah commented Apr 21, 2026

Summary

Scans whose total parquet file size exceeds a configurable threshold bypass LiquidCache and use DataFusion's vanilla parquet reader instead. This prevents large cold scans from polluting the RAM cache and
growing the on-disk spill file unboundedly.

Problem

In mixed workloads with small hot tables and large cold tables, caching everything causes the disk tier to expand significantly (decoded Arrow batches are ~5x larger than compressed parquet). This creates
memory pressure and degrades cache performance for the smaller datasets that benefit most from caching.

Solution

Add with_max_scan_bytes(u64) to LocalModeOptimizer and LiquidCacheLocalBuilder. When set, the optimizer computes the total file size of a parquet scan and skips LiquidCache wrapping if it exceeds the
threshold. The scan falls through to DataFusion's vanilla parquet reader.

Default behavior is unchanged — no threshold means cache everything.

Usage

let (ctx, _) = LiquidCacheLocalBuilder::new()
    .with_max_cache_bytes(8 * 1024 * 1024 * 1024)
    .with_max_scan_bytes(1024 * 1024 * 1024) // skip scans > 1 GB
    .build(session_config)                  
    .await?;                                                                                                                                                                                                   

Ref: XiangpengHao#485

claude[bot]
claude Bot previously approved these changes Apr 21, 2026
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit-level feedback only — no blocking issues.

let eager = self.eager_shredding;
let rewritten = plan
.transform_up(|node| try_optimize_parquet_source(node, cache, eager, max_scan_bytes))
.unwrap();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: optimize returns Result, so this should propagate the error instead of panicking. (not blocking)

Suggested change
.unwrap();
.unwrap();

should be:

Suggested change
.unwrap();
?;

.flat_map(|g| g.files())
.map(|f| f.object_meta.size)
.sum();
if total > max_bytes {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the size-based bypass has no test coverage. A small test that registers a parquet file, sets max_scan_bytes to 0 (or 1), and verifies the resulting plan still uses ParquetSource (not LiquidParquetSource) would make this easy to verify and protect against regression. (not blocking)

@sentry
Copy link
Copy Markdown

sentry Bot commented Apr 21, 2026

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

claude[bot]
claude Bot previously approved these changes Apr 23, 2026
Comment thread .github/workflows/ci.yml Outdated
RUSTDOCFLAGS: -D warnings

- name: Run clippy
run: cargo clippy --all-targets --all-features -- -D warnings
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: The CI no longer runs cargo test. After this diff the pipeline only checks formatting and clippy — no tests execute. This means regressions in existing behaviour, including the new size-bypass logic, are invisible to CI.

Add a test step (coverage reporting can be dropped if CODECOV_TOKEN is unavailable in this fork):

Suggested change
run: cargo clippy --all-targets --all-features -- -D warnings
run: cargo clippy --all-targets --all-features -- -D warnings
- name: Run tests
run: cargo test --all-features

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Blocking Issues

  • .github/workflows/ci.yml line 30 — The CI pipeline no longer runs cargo test. After this PR the only checks are formatting and clippy; the test suite never executes. This means correctness regressions (including in the new size-bypass path) go undetected. A cargo test --all-features step must be added.

Action Required

Add cargo test --all-features (or restore a dedicated test job) to the check job before merge. Coverage upload can be dropped if CODECOV_TOKEN is unavailable in this fork, but the test run itself must remain.

Comment thread .github/workflows/ci.yml
uses: model-checking/kani-github-action@v1.1
with:
working-directory: src/core
run: cargo clippy --workspace --exclude dev-tools --all-targets --all-features -- -D warnings
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking (unresolved from previous review): The pipeline still runs no tests. Format and clippy check correctness of syntax and style, not behaviour — any regression in the new size-bypass logic (or anything else) is invisible to CI.

Add a test step:

Suggested change
run: cargo clippy --workspace --exclude dev-tools --all-targets --all-features -- -D warnings
run: cargo clippy --all-targets --all-features -- -D warnings
- name: Run tests
run: cargo test --all-features

.transform_up(|node| try_optimize_parquet_source(node, cache, eager, max_scan_bytes))
.unwrap();
Ok(rewritten.data)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: optimize returns Result, so a panic here is a regression over the old code (which delegated to rewrite_data_source_plan and surfaced errors via Ok(...)). transform_up returns Result; propagate it instead of unwrapping. (not blocking)

Suggested change
}
?;

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Blocking Issues

  • .github/workflows/ci.yml line 30: No tests run in CI. The pipeline only checks formatting and clippy — cargo test was not added back after the previous review flagged this. Any regression in the new size-bypass logic is invisible.

Action Required

Add cargo test --all-features as a step in the check job (or a dedicated job) before this can merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant