split registration to respect nested projection and filters#8213
Conversation
Merging this PR will not alter performance
|
Polar Signals Profiling ResultsLatest Run
Powered by Polar Signals Cloud |
Benchmarks: PolarSignals ProfilingVortex (geomean): 1.019x ➖ How to read Verdict and Engines
datafusion / vortex-file-compressed (1.019x ➖, 0↑ 0↓)
No file size changes detected. |
Benchmarks: FineWeb NVMeVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.983x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.970x ➖, 0↑ 0↓)
datafusion / parquet (0.977x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.002x ➖, 1↑ 1↓)
duckdb / vortex-compact (0.985x ➖, 0↑ 0↓)
duckdb / parquet (0.997x ➖, 0↑ 0↓)
File Size Changes (1 files changed, +0.0% overall, 1↑ 0↓)
Totals:
Full attributed analysis
|
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.996x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.011x ➖, 0↑ 0↓)
datafusion / parquet (1.002x ➖, 1↑ 0↓)
datafusion / arrow (0.998x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (1.007x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.002x ➖, 0↑ 0↓)
duckdb / parquet (1.000x ➖, 0↑ 0↓)
duckdb / duckdb (1.001x ➖, 0↑ 0↓)
File Size Changes (10 files changed, +0.1% overall, 7↑ 3↓)
Totals:
Full attributed analysis
|
Benchmarks: TPC-DS SF=1 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.081x ➖, 0↑ 28↓)
datafusion / vortex-compact (1.060x ➖, 1↑ 16↓)
datafusion / parquet (1.072x ➖, 0↑ 25↓)
duckdb / vortex-file-compressed (1.060x ➖, 1↑ 21↓)
duckdb / vortex-compact (1.052x ➖, 0↑ 13↓)
duckdb / parquet (1.041x ➖, 0↑ 6↓)
duckdb / duckdb (1.049x ➖, 0↑ 7↓)
File Size Changes (7 files changed, -0.0% overall, 3↑ 4↓)
Totals:
Full attributed analysis
|
Benchmarks: FineWeb S3Verdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.081x ➖, 0↑ 1↓)
datafusion / vortex-compact (0.951x ➖, 0↑ 0↓)
datafusion / parquet (1.093x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.162x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.220x ➖, 0↑ 2↓)
duckdb / parquet (1.093x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Statistical and Population GeneticsVerdict: No clear signal (low confidence) How to read Verdict and Engines
duckdb / vortex-file-compressed (0.994x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.009x ➖, 0↑ 1↓)
duckdb / parquet (1.002x ➖, 0↑ 0↓)
File Size Changes (1 files changed, +0.0% overall, 1↑ 0↓)
Totals:
Full attributed analysis
|
Benchmarks: Random AccessVortex (geomean): 0.938x ➖ How to read Verdict and Engines
unknown / unknown (1.013x ➖, 3↑ 1↓)
|
Benchmarks: TPC-H SF=10 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.954x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.946x ➖, 0↑ 0↓)
datafusion / parquet (0.953x ➖, 0↑ 0↓)
datafusion / arrow (0.941x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.955x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.960x ➖, 0↑ 0↓)
duckdb / parquet (0.976x ➖, 0↑ 0↓)
duckdb / duckdb (0.972x ➖, 0↑ 0↓)
File Size Changes (26 files changed, -0.0% overall, 12↑ 14↓)
Totals:
Full attributed analysis
|
Benchmarks: TPC-H SF=1 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.972x ➖, 1↑ 1↓)
datafusion / vortex-compact (1.108x ➖, 0↑ 4↓)
datafusion / parquet (0.987x ➖, 3↑ 1↓)
duckdb / vortex-file-compressed (1.002x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.061x ➖, 0↑ 2↓)
duckdb / parquet (1.049x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Appian on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.033x ➖, 0↑ 1↓)
datafusion / parquet (1.070x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.958x ➖, 1↑ 0↓)
duckdb / parquet (0.942x ➖, 1↑ 0↓)
duckdb / duckdb (1.030x ➖, 0↑ 0↓)
File Size Changes (4 files changed, -0.0% overall, 3↑ 1↓)
Totals:
Full attributed analysis
|
Benchmarks: Clickbench on NVMEVerdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.227x ❌, 0↑ 38↓)
datafusion / parquet (1.229x ❌, 0↑ 38↓)
duckdb / vortex-file-compressed (1.183x ❌, 0↑ 32↓)
duckdb / parquet (1.122x ❌, 0↑ 25↓)
duckdb / duckdb (1.188x ❌, 0↑ 42↓)
File Size Changes (102 files changed, -0.0% overall, 48↑ 54↓)
Totals:
Full attributed analysis
|
Benchmarks: CompressionVortex (geomean): 1.002x ➖ How to read Verdict and Engines
unknown / unknown (1.000x ➖, 0↑ 0↓)
|
Benchmarks: TPC-H SF=10 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.187x ➖, 0↑ 4↓)
datafusion / vortex-compact (1.073x ➖, 1↑ 1↓)
datafusion / parquet (1.104x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.024x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.947x ➖, 1↑ 0↓)
duckdb / parquet (1.108x ➖, 0↑ 3↓)
Full attributed analysis
|
Signed-off-by: Onur Satici <onur@spiraldb.com>
| let referenced_heads: HashSet<FieldName> = if field_paths.iter().any(FieldPath::is_root) { | ||
| scope_fields.names().iter().cloned().collect() | ||
| } else { | ||
| field_paths | ||
| .iter() | ||
| .filter_map(|path| match path.parts().first() { | ||
| Some(Field::Name(name)) => Some(name.clone()), | ||
| _ => None, | ||
| }) | ||
| .collect() | ||
| }; | ||
| debug_assert_eq!( | ||
| referenced_heads, | ||
| immediate_scope_access(expr, scope_fields), | ||
| "referenced field path heads must match the immediately accessed scope fields" | ||
| ); |
There was a problem hiding this comment.
does this need to be in the macro? for dead code checks?
There was a problem hiding this comment.
I wanted to leave the entire block out of release builds as the debug_assert won't do anything, and it marks it clear that this is only for the debug assertion
Summary
split registration only cared about the immediate accessed fields of the projection and filter expression. We do handle these nested expressions correctly at layout readers, but we did still register splits as if we are referencing all nested fields.
so if we had a
select a.x, a.y, before, we were registering splits forPrefix(a), meaning all fields under a would have their splits registered, ending up with many more tasks than needed. Now we do passPrefix(a.y) | Prefix(a.x)correctly to splits