Initial impl for sort pushdown in DataFusion FileSource implementation #8235
AdamGS wants to merge 1 commit into
b730f27 to ed243f0
Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.972x ➖
datafusion / vortex-file-compressed (0.972x ➖, 0↑ 0↓)
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.995x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.003x ➖, 0↑ 0↓)
datafusion / parquet (1.007x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.024x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.022x ➖, 0↑ 0↓)
duckdb / parquet (1.019x ➖, 0↑ 0↓)
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.853x ✅, 17↑ 0↓)
datafusion / vortex-compact (0.824x ✅, 21↑ 0↓)
datafusion / parquet (0.917x ➖, 11↑ 0↓)
datafusion / arrow (0.807x ✅, 16↑ 0↓)
duckdb / vortex-file-compressed (0.921x ➖, 7↑ 0↓)
duckdb / vortex-compact (0.939x ➖, 5↑ 0↓)
duckdb / parquet (0.962x ➖, 5↑ 0↓)
duckdb / duckdb (0.987x ➖, 2↑ 0↓)
File Size Changes (10 files changed, -0.2% overall, 4↑ 6↓)
Merging this PR will degrade performance by 20.23%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | chunked_bool_canonical_into[(1000, 10)] | 30.1 µs | 45 µs | -33.15% |
| ❌ | Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 161.3 µs | 197.6 µs | -18.33% |
| ❌ | Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 175.9 µs | 212 µs | -17.03% |
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] | 246.1 ns | 275.3 ns | -10.6% |
Comparing adamg/pushdown-sort-df (babfac8) with develop (bd6fc3e)
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.003x ➖, 2↑ 4↓)
datafusion / vortex-compact (1.019x ➖, 0↑ 7↓)
datafusion / parquet (1.027x ➖, 0↑ 4↓)
duckdb / vortex-file-compressed (1.014x ➖, 0↑ 2↓)
duckdb / vortex-compact (1.018x ➖, 0↑ 2↓)
duckdb / parquet (1.017x ➖, 0↑ 1↓)
duckdb / duckdb (1.017x ➖, 1↑ 2↓)
File Size Changes (7 files changed, +0.0% overall, 3↑ 4↓)
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.076x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.895x ➖, 2↑ 0↓)
datafusion / parquet (1.097x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.001x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.991x ➖, 0↑ 0↓)
duckdb / parquet (0.981x ➖, 0↑ 0↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.971x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.995x ➖, 0↑ 0↓)
duckdb / parquet (1.000x ➖, 0↑ 0↓)
File Size Changes (1 files changed, +0.0% overall, 1↑ 0↓)
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.920x ➖, 3↑ 0↓)
datafusion / vortex-compact (0.916x ➖, 6↑ 0↓)
datafusion / parquet (0.936x ➖, 0↑ 0↓)
datafusion / arrow (0.922x ➖, 2↑ 0↓)
duckdb / vortex-file-compressed (0.937x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.945x ➖, 0↑ 0↓)
duckdb / parquet (0.958x ➖, 1↑ 0↓)
duckdb / duckdb (0.967x ➖, 0↑ 0↓)
File Size Changes (26 files changed, +0.0% overall, 13↑ 13↓)
Benchmarks: Appian on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.957x ➖, 0↑ 0↓)
datafusion / parquet (1.001x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.915x ➖, 3↑ 0↓)
duckdb / parquet (0.906x ➖, 3↑ 0↓)
duckdb / duckdb (0.995x ➖, 0↑ 0↓)
File Size Changes (4 files changed, -0.0% overall, 0↑ 4↓)
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.065x ➖, 0↑ 4↓)
datafusion / vortex-compact (1.125x ➖, 0↑ 6↓)
datafusion / parquet (1.070x ➖, 2↑ 3↓)
duckdb / vortex-file-compressed (0.992x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.021x ➖, 0↑ 0↓)
duckdb / parquet (1.064x ➖, 0↑ 1↓)
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.982x ➖, 4↑ 0↓)
datafusion / parquet (1.006x ➖, 2↑ 0↓)
duckdb / vortex-file-compressed (0.985x ➖, 4↑ 2↓)
duckdb / parquet (1.005x ➖, 0↑ 0↓)
duckdb / duckdb (1.009x ➖, 0↑ 0↓)
File Size Changes (103 files changed, -0.0% overall, 46↑ 57↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.201x ➖, 0↑ 2↓)
datafusion / vortex-compact (1.143x ➖, 0↑ 4↓)
datafusion / parquet (1.104x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.988x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.974x ➖, 0↑ 0↓)
duckdb / parquet (1.026x ➖, 0↑ 1↓)
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
ed243f0 to babfac8
Going to add tests later today.
Do we have any benchmarks that care about sortedness?
    if !is_descending {
        let mut this = self.clone();
        this.ordered = true;
        return Ok(SortOrderPushdownResult::Inexact {
Is this a bit optimistic? I don't know how DataFusion's sort operators treat this, but would we always want to fall back to a sort strategy optimized for nearly sorted input whenever we have an ascending sort on a column that exists in the file?
`ordered` here just means "the order of the file": batches are returned in file order instead of in whichever order we get them.
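To make the distinction concrete, here is a minimal sketch of what "file order" versus "whichever order we get them" could look like. It is not the Vortex scan code: the `Batch` alias and the `emit` function are hypothetical names invented for this illustration, and the real scan works on asynchronous streams rather than slices.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a decoded batch: (position in file, payload).
type Batch = (usize, &'static str);

/// With `ordered == false`, batches are passed through in arrival order.
/// With `ordered == true`, early arrivals are buffered and released
/// strictly by their position in the file.
fn emit(arrivals: &[Batch], ordered: bool) -> Vec<&'static str> {
    if !ordered {
        return arrivals.iter().map(|&(_, payload)| payload).collect();
    }
    let mut pending = BTreeMap::new();
    let mut next = 0usize; // file position we are allowed to emit next
    let mut out = Vec::with_capacity(arrivals.len());
    for &(idx, payload) in arrivals {
        pending.insert(idx, payload);
        // Release every buffered batch that is now contiguous with the output.
        while let Some(p) = pending.remove(&next) {
            out.push(p);
            next += 1;
        }
    }
    out
}

fn main() {
    // Batch 1 finishes decoding before batch 0.
    let arrivals = [(1, "b"), (0, "a"), (2, "c")];
    println!("unordered: {:?}", emit(&arrivals, false));
    println!("ordered:   {:?}", emit(&arrivals, true));
}
```

Note that the ordered mode only preserves the order already present in the file; it does not sort anything itself.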
Summary
Naive implementation of try_pushdown_sort. It's mostly just parts of the Parquet impl, plus making sure we propagate the information into the Vortex scan.
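The decision described here can be sketched in isolation. The types below are local stand-ins, not DataFusion's: `PushdownResult` only mimics the `Inexact` case seen in the reviewed diff, the `source` field name and the `Source` struct are hypothetical, and the handling of descending sorts is an assumption, since the diff excerpt does not show that branch.

```rust
/// Local stand-in for the result type used in the PR (assumed shape).
/// `Inexact` is taken here to mean "the source helps with ordering, but
/// the engine should still keep its own sort operator".
#[derive(Debug, PartialEq)]
enum PushdownResult<T> {
    Inexact { source: T }, // `source` is a hypothetical field name
    Unsupported,
}

/// Hypothetical, heavily simplified file source.
#[derive(Debug, Clone, PartialEq)]
struct Source {
    /// When true, the scan returns batches in file order.
    ordered: bool,
}

impl Source {
    fn try_pushdown_sort(&self, is_descending: bool) -> PushdownResult<Source> {
        if !is_descending {
            // Mirror the reviewed diff: clone the source and ask the scan
            // to preserve file order.
            let mut this = self.clone();
            this.ordered = true;
            return PushdownResult::Inexact { source: this };
        }
        // Assumption: descending requests are declined in this sketch.
        PushdownResult::Unsupported
    }
}

fn main() {
    let src = Source { ordered: false };
    println!("{:?}", src.try_pushdown_sort(false));
    println!("{:?}", src.try_pushdown_sort(true));
}
```

Returning a modified clone rather than mutating in place keeps the original source usable if the planner decides not to apply the pushdown.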